Thursday, 14 August 2025

Oracle23ai and python - Measuring (Network-)Time.

TL;DR: I wanted a utility-function in python to report the "time spent", notably looking for "Network-Time". It evolved during coding into a time-reporting-function. It Worked. But I learned something new again.

Image: waiting for a bus, you never know when...


Background: (I'm not obsessed with - ) Roundtrips and Latency.

From earlier work, I have a number of exploratory python code-snippets to study and measure "network time" for (python-)programs that talk to the database. It so happened I came across some "Chatty" programs that were running in network-configurations with high Latency. That sparked a lot of tuning or "optimisation" (simple: removing roundtrips). But there is still a lot to be gained, notably by deploying code closer to the RDBMS: By Reducing Latency.

At first, I set out to write two functions:
 - f_rt_time_lost() : to report "time lost" on RoundTrips in the network. 
 - f_run_pings() : to imitate the ping-behaviour: report pingtimes to stdout.

And if I had those two, I might as well create a test-program: db_ping.py..?

In the process of building and testing those, I also found the need for something akin to linux time, to report how much time a program has spent on CPU (on the client side). 
And as a good DBA,  of course I wanted to report on the DB-time from my RDBMS.
So this "project" got a little out of hand, code-wise. I typed up a lot more than I had originally expected.

But to stay focussed: I wanted to "Report" on the time lost (in network...) by various programs.. so most of the effort went into collecting and reporting the basics:

The formula for time "lost in the Network":

    Network-time = 
        RoundTripTime (RTT)  *  number of RoundTrips  (nrRTs)

In between I started thinking of "total time", and whether it would be possible to put time in perspective: How much is spent "in the App", and how much is spent "in the RDBMS".

The next equation I can think of is this one: The Equation of Time...

    Total time = 
        App-time (spent running code from client-program) 
    + DB-time (spent running SQL or PL/SQL in RDBMS-session) 
    + Network-time (nrRTs * RTT).
[    + Idle- or user-input-time (sleep, keyboard-wait, other?... ). ] 
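
That last item is the one I cannot measure directly, but once the other three are known it simply falls out as the remainder:

    Idle-time = 
        Total time - ( App-time + DB-time + Network-time )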

If I can report on those items I can make fairly clear-cut cases for troubleshooting: Try reducing the biggest time-consumer. And if that happens to be "the network" then we know where to try and improve. 
If it is Not the Network, then we probably still know where to look. And we can better predict what kind of impact any improvement can have.

Note1: At this point, real techies start to suggest "profilers", or even a fancy "flame graph". But in my environment I don't have the luxury nor the knowledge-base to go and do extensive profiling. I will have to help my colleagues with the tools we have, and with tricks they can understand and implement.

Note2: Anything that runs multi-threaded or uses PX (parallel processing of SQL), will pose additional challenges. I'm deliberately keeping those out of scope.

Note3: Equation of Time.. EoT.. I'm deliberately not going to use that abbreviation.

For now..... 
Most programs already pop out a "summary" of their processing to stdout, and I wanted to add my own little item to that.

What if Every Program could produce that Equation of Time at the end of its run... ?


Some musings on measuring time.

In IT it is always very much "in vogue" to blame network or latency, or "our WAN connection" for any slow response-times or long-running-programs. 
Which is Why I want to Know.. I want to be able to say with more confidence: "this is a network issue". 
And since I am not a "network engineer", not even a proper sysadmin, I need Data, and I need data that I can "Understand" myself and that I can Explain Clearly to Managers and other non-tellectuals.

With the Oracle Database, we are used to AWR (or STATSPACK), and ASH, to report on activity and we know how to use that information for "tuning". But those tools have difficulty covering network activity. 
At session-level I can get the nr of RoundTrips (RTs) from a session, and I can have a session report how much DB-time it spent. 

The two items missing from AWR are, IMHO, App-time and Network-time. 
App-time would be the time spent "in the App", e.g. on CPU or other activities at the client-side (some of it maybe even waiting for a keyboard-enter).
Network-time would be the time spent on "transport" to and from the RDBMS. And that Network-time is possibly the least productive of them all. It is the "waste" that we are trying to eliminate...

I realise full well that AWR, based on the RDBMS, will not be able to measure App-time or Network-time, and I am not blaming Oracle for not providing this data. But I can look for that data myself...


Let's put it in code..

From earlier tweaking, I had some code more or less ready to copy-in. 
For python, I decided on the time module, using time.perf_counter_ns() as probably the most precise way to measure time (see also an older blog post on duration.py).

Last year, I had already thrown together some "stopwatch" ideas, so there was some re-use of a module called duration.py (it now also has a function inspired by the linux time command - maybe a good topic for later).

In short: I can cover the App-time, both active-time and elapsed clocktime from the python time module.

The functions I used come from the time module.

time.perf_counter_ns () : a clock counting nanoseconds. The reference point is undefined, hence you need to save the value when your program starts and then check the difference to know how many nanoseconds have elapsed.

time.process_time_ns () : the nr of nanoseconds spent on user- or system-time. See this as "busy-time". When compared to linux time, this should be close to the sum of user+sys time (check this...?)
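
As a minimal illustration of how those two counters are used (just a sketch, not the actual code in duration.py):

    import time

    t0_wall = time.perf_counter_ns ()   # wall-clock reference point
    t0_cpu  = time.process_time_ns ()   # cpu (user+sys) reference point

    _ = sum ( range ( 2_000_000 ) )     # some cpu-work...
    time.sleep ( 1 )                    # ...and some idle time (not counted as cpu)

    elapsed_sec = ( time.perf_counter_ns () - t0_wall ) / 1e9
    cpu_sec     = ( time.process_time_ns () - t0_cpu )  / 1e9

    print ( f"elapsed (wall-clock) : {elapsed_sec:7.3f} sec" )   # roughly 1.x sec
    print ( f"busy (cpu, app-time) : {cpu_sec:7.3f} sec" )       # only the busy part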

Then I needed DB-time, and number of RoundTrips: those could be gotten from v$mystat and v$sess_time_model. I used this query:
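
In simplified form it boils down to something like this (a sketch; the function name is just illustrative, and the real code in ora_login.py may differ in detail):

    # sketch: fetch nr of roundtrips and DB-time for the current session
    sql_sess_stats = """
    select max ( case sn.name when 'SQL*Net roundtrips to/from client'
                              then ms.value end )        as nr_roundtrips
         , max ( case sn.name when 'DB time'
                              then ms.value / 100 end )  as db_time_sec_mystat
         , max ( tm.value / 1000000 )                    as db_time_sec_timemodel
      from v$mystat ms
      join v$statname sn        on sn.statistic# = ms.statistic#
      join v$sess_time_model tm on tm.sid        = sys_context ( 'userenv', 'sid' )
                               and tm.stat_name  = 'DB time'
     where sn.name in ( 'SQL*Net roundtrips to/from client', 'DB time' )"""

    def f_get_sess_stats ( the_conn ):
        # return ( nr_roundtrips, db_time_sec ) for the current session,
        # taking the db-time from the (microsecond) time-model figure
        with the_conn.cursor () as cur:
            cur.execute ( sql_sess_stats )
            nr_rts, db_time_mystat, db_time_tm = cur.fetchone ()
        return nr_rts, db_time_tm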


This gives me the nr of RoundTrips for the session, and the DB-time from both v$mystat and from v$sess_time_model. I used the time-model value as it returns microseconds and should be more precise, but the old-fashioned DB-time from v$mystat, in centiseconds, achieves the same result once the session has consumed a few seconds of time.

To find the RoundTripTime (RTT), I need a ping, or a number of pings. And then multiply ping-time by the total nr of RTs that the session has done.

I sampled RTT using 5 pings, with 1 second in between. More on that later...  
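
The sampling itself is simple enough. Roughly (a sketch; the real f_run_pings() also reports each ping to stdout):

    import time

    def f_sample_rtt_sec ( the_conn, nr_pings=5, sleep_sec=1 ):
        # sample the RoundTripTime (RTT) by timing a few database-pings
        # (illustrative name; not the exact code from ora_login.py)
        ping_secs = []
        for _ in range ( nr_pings ):
            t0 = time.perf_counter_ns ()
            the_conn.ping ()                 # python-oracledb: one round trip to the RDBMS
            ping_secs.append ( ( time.perf_counter_ns () - t0 ) / 1e9 )
            time.sleep ( sleep_sec )         # pause between pings, like regular ping
        return sum ( ping_secs ) / len ( ping_secs )    # average RTT in seconds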

And as the code approached completeness, I had:
App-time: from time.process_time_ns()
DB-time: from the v$sess_time_model
Network-time: from multiplying RoundTrips x Ping-time
And Total-time: from the wall-clock of time.perf_counter ().
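
Glued together, the reporting then boils down to something like this (a simplified sketch that re-uses the helper-sketches from above; the real ora_time_spent( the_conn ) in ora_login.py takes only the connection and differs in detail):

    import time

    def ora_time_spent ( the_conn, t0_wall_ns, t0_cpu_ns ):
        # sketch only: report the "equation of time" for this program + session
        app_time   = ( time.process_time_ns () - t0_cpu_ns )  / 1e9
        total_time = ( time.perf_counter_ns () - t0_wall_ns ) / 1e9

        rtt_sec         = f_sample_rtt_sec ( the_conn )     # sketch from earlier
        nr_rts, db_time = f_get_sess_stats ( the_conn )     # sketch from earlier
        network_time    = nr_rts * rtt_sec
        idle_time       = total_time - ( app_time + db_time + network_time )

        print ( f"app-time     : {app_time:10.3f} sec" )
        print ( f"db-time      : {db_time:10.3f} sec" )
        print ( f"network-time : {network_time:10.3f} sec  ( {nr_rts} RTs x {rtt_sec*1000:.1f} ms )" )
        print ( f"idle-time    : {idle_time:10.3f} sec  (calculated)" )
        print ( f"total-time   : {total_time:10.3f} sec" )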

That looked quite complete and realistic. 


The Test-setup and first run...

While coding, I had the function included in ora_login.py, my module with oracle-utilities for use in our python programs (link below).

And for testing, I had the function called at the end of do_sql_times.py, as described in an earlier blog.

This program allows me to run an ad-hoc query against the RDBMS. It uses the dotenv utility to get credentials from the file .env. And it reports on how many microseconds each record takes to "retrieve". I've used that program to measure timings before, notably to show the effects of arraysize and prefetchrows.
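
For reference, forcing that one-row-per-roundtrip behaviour boils down to two cursor attributes (a minimal sketch in python-oracledb; the connection details here are just placeholders, do_sql_times.py itself reads them from .env):

    import oracledb

    # sketch: force (roughly) one roundtrip per fetched row
    conn = oracledb.connect ( user="demo", password="demo", dsn="dbhost/freepdb1" )
    cur  = conn.cursor ()
    cur.arraysize    = 1     # non-default: fetch 1 row per roundtrip (default 100)
    cur.prefetchrows = 1     # non-default: minimise prefetch on execute (default 2)

    cur.execute ( "select level from dual connect by level <= 10" )
    for row in cur:
        pass                 # every fetch now costs (more or less) one roundtrip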

A test-run with a query on dual looked like this:


After the Connect, this program reports that "array" and "prefetch" are set to non-defaults of 1 (to maximise RoundTrips). 

Every record will be a RoundTrip, to maximise the number of round trips and thus maximise Network-time.

The very simple query is just 1 record, and that seems to take 49 ms (I happen to know that could be the approximate RoundTrip time for this system). 

The 2nd part, the "reporting of the time" is interesting: We collect the ping-time, we retrieve DB-time and RTs from the RDBMS, and we can calculate "all the values":


Note that the ping-times to this system are fairly high, and variable. This is the kind of system we have trouble with... 

And then we see the data we want from the programs:
App-time: measured by time.process_time()
DB-time: from the DB
Network-time: pingtime x nr RoundTrips
Idle time: calculated.
Total time: measured from time.perf_counter()

The first time I tested this, I was surprised by the large, 5sec, idle time in a program that I thought did very little. Then I realised: The loop for the 5 pings has a 1-sec sleep in it, and that generates the (approx) 5 sec of Idle-time.

As a separate check, I've used the linux time command to time the run of the program, and the values concur, even for a very trivial, short-running program:

Real: the elapsed time, close to the 8sec reported by python.
User and sys : Together they are +/- close to the App-time reported by python.

And the data from time correspond, with a margin of error, to the data reported by my python contraption.

So I now have a function I can call at the end of a program that will report how time was spent. And a verification via linux time looks acceptable. 

So far it looks promising, usable (but you can feel a problem coming...)


Test some Longer running queries...

Then I did a very simple test that turned out funny. Query 100 records (with one RoundTrip per record), and I found this:


Let me point out: 
Each record seems to take between 30 and 60 ms to retrieve (forced to do a RoundTrip every record)
The ping-samples are much longer than that, with an average of 175ms ?
And the Idle time is reported Negative ??

On first impression: the individual records come back (from a RoundTrip) in some 40ms, which is expected on this system. And if it takes 40ms to fetch a record, the RTT should be lower than that.
But the ping-samples are higher and show more irregularity than expected. 

The same test, against a database running nearby (ping-time around 1ms), shows this:


On this easy-to-reach RDBMS, the ping-time shows less variance, and the Idle-time is as expected: 5 seconds from the 5 ping-plus-1-sec-sleep iterations used to sample the RTT.

It seems that on the "irregular system" the sampling of the ping-times (5 pings in 5 seconds) comes back with values that are too high, and possibly more irregular than they are in reality.
And from over-estimating the RTT, the Idle-time is then calculated as ... Negative?

Something is Not Right.


Partial solution: Do the ping "faster".

One apparent fix was to modify the ping-sampling, and remove the sleep-time between pings. This gave more regular and lower RTT times:


But this mainly shows that pings (and RoundTrips, RTT) were faster when done at high frequency. And we don't know at what frequency the regular python-programs do their round-trips, especially while we are trying to Eliminate the RoundTrips in the first place.
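
In terms of the sampling-sketch from earlier, the fix is simply to drop the sleep between the pings:

    import time

    def f_sample_rtt_sec_fast ( the_conn, nr_pings=5 ):
        # variation on the earlier sampling-sketch: back-to-back pings, no sleep
        ping_secs = []
        for _ in range ( nr_pings ):
            t0 = time.perf_counter_ns ()
            the_conn.ping ()
            ping_secs.append ( ( time.perf_counter_ns () - t0 ) / 1e9 )
        return sum ( ping_secs ) / len ( ping_secs )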


Wrap Up: This helps, but there is More... 

This measurement is only Partly Helpful.  First the list of positives...

DB-time: We get a view of how much time is spent in the database. If this time is significant, we can go in and optimize some of the queries. Classic DBA-tuning stuff... (but not our biggest challenge on this system).

App-time: We get an idea of how much time is spent by the python-code and the measure from time.process_time () seems to concur, but not quite, with what the linux time command provides as user + sys time.

RoundTrips: We can record the exact number of RTs per program, and thus have a better measure for its chattiness (but we already knew that from AWR-data...). In our test runs, the program did 110 RTs in about 9 seconds, say +/- 12 per second. No idea if that is good or bad, and we also don't quite know in what "pattern" they were done. As it turns out this program did most of the roundtrips in a quick sequence while fetching 100 rows. Other programs will display different patterns...

Network time: Not Quite. This "measure" is still unreliable, mainly due to the uncertainty in measuring ping-time or RoundTripTime (RTT). 
Ongoing experiments will probably be with some "ping" utility.

Current measure: For the moment, I'm reducing (removing) the sleep-time between pings when sampling. But I do not know the behaviour of the python-programs with regards to their frequency of RTs to the RDBMS, or the pauses between SQL*Net calls.

Root Causes: We suspect this  behaviour is due to the nature of our systems. Most of the Databases run on VMs. But a number of (test)databases are running on containers. The latency (and ping times) vary depending on the locations in the networks and the time of day, and possibly the phase of the moon...  In general, the VMs seem to have more problems than the containers with test-databases...

Don't tell the manager, but... we don't really "Know Everything" yet.

Main goal for the moment remains: Find and Remove RoundTrips....

-- -- -- -- -- End of this blog post, for now -- -- -- -- -- 

Appendix : links to code and scripts

do_sql_times.py : the program used in testing. 

You can use this program to run "any query" and get timing data out of it. The actual function to report the timing is ora_time_spent( the_conn ) and is defined in the file ora_login.py.

The utilities used:

ora_login.py : Functions to logon, some utilities, and now also time-reporting.

duration.py : Time-measurement (and some reporting).

prefix.py : Enable a prefixed print to stdout, add timestamp in microsec.

.env : The config file for dotenv.

Beware: the .env now contains non-defaults for arraysize and prefetchrows. If you run with those set to 1 (one), some of your code may cause excessive RoundTrips! 

-- -- -- -- -- End of this blogpost, for real -- -- -- -- --



