TL;DR: from my obsession with finding and measuring RoundTrips, I got to tinkering with pings and RTs. Some findings, some code. And maybe more to investigate if and when that is deemed relevant...
(image: queues, roundtrips, time-spent...)
Background: RoundTrips, Latency.
When moving our python-code around to find the best (safest, and cheapest) place to run, I got obsessed with network-delays. Notably RoundTrips and ping-times. You can find Proof of my obsession in some of the previous blogs...
From using ping (on linux and mac-OS) and tnsping, and my earlier doodling with the ping-function from the python-oracledb driver, I wanted to incorporate a "latency measurement" into the python (problem-)code to collect data, and maybe help advise on where to deploy the programs.
And another logical follow-up would be to build a db_ping(.py) utility for when tnsping was not available (thin driver...). Later.
Mind you: a lot of detailed "network stuff" is way out of my "knowledge zone", but there are Two Things I know:
1) Fast Connections, low latency, makes our programs "run better" and
2) Keep the nr of RoundTrips Down, no matter what.
So I should be measuring and counting those items.
Here are some notes, some warnings, and some code to re-use if you want.
The DBA view: AWR (STATSPACK) does not know...
As a DBA I have a tendency to look for "data" in AWR (Automatic Workload Repository), ASH (Active Session History) and similar information provided by the RDBMS. It will give me things like user-calls and SQL*Net Round Trips, and information on the SQL-statements done by the application. Those tools are Excellent for troubleshooting Database-behaviour, and for advising app-coders on where+how to improve their SQL. But the data from AWR has two flaws for my purpose:
1) For the RoundTrips, AWR does not produce any "ping-times", no Latency is measured.
2) It can only report on things from "inside" the RDBMS. Any "time used" or "time lost" on the client-side is not registered.
To measure Latency, you need something similar to ping. And to measure time consumed by the application, you need to "profile" its behaviour or measure consumed-time and consumed-CPU somehow. Not something that the RDBMS can "see" without additional help. (multi-component-view needed, topic for future blog..?)
Find the ping-time.
Firstly, I felt the need to report the RoundTrip-times before and after various (python) programs.
(Note that good old tnsping was also not available, because of the thin driver...)
I had already, innocently, incorporated something like:
os.system('ping -c10 www.oracle.com')
But I quickly found out that ping (and my other old-time favorite: traceroute) were not always present in the (production-, container-) locations where the python code was running. And getting tools "included in the build" was complicated. (tip: nobody seems to worry about the pip3-stuff because "it does not run under root". No further comment, and I am not hacking our own systems... Never).
So I built a few (amateurish) functions for checking connectivity and had them included in some of the python-rollouts.
Link to complete source below, but here is my quick code to measure a single ping and a series of 100 pings:
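In essence it boils down to something like this (a minimal sketch, not the verbatim source; f_rt_1ping() is the name from my code, the plumbing around connection.ping() is simplified here):

import time
import oracledb   # pip3 install oracledb (thin driver, no oracle-client needed)

def f_rt_1ping(conn):
    """Time one database RoundTrip, in ms, using connection.ping()."""
    t0 = time.perf_counter()
    conn.ping()                                   # one empty RoundTrip to the server
    return (time.perf_counter() - t0) * 1000.0    # elapsed time in ms

def f_rt_avg(conn, n_pings=100):
    """Average RoundTrip time over a series of pings, in ms."""
    return sum(f_rt_1ping(conn) for _ in range(n_pings)) / n_pings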
In this code you can see there is a function, f_rt_1ping(), that is called once at the very start of the "main" program. And then it is called again, in a loop of 100. The result looked like this:
That first ping seemed to have taken about 28ms, and that is what I more or less expected from the linux-pings to this system.
The loop of 100 pings yielded an average of 25ms. Also in expected range.
Note: Please disregard the burps about starting MAIN, and the connection-info: I Need to Be Sure of the address, instance, user etc.. to avoid reporting data from the wrong database (oh yes, that still happens a lot...not kidding - link to older blog..).
Calibrate: make sure there is no overhead.
Because I am slightly paranoid and wanted to make sure there was no hidden "overhead" in my ping process, I also built a calibrate-function. I would measure the same activity with and without the actual ping. The result was reassuring:
The overhead of code looping around a number of pings is negligible, see program run below. And I +/- knew this from an earlier experiment as well..
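For the curious, the calibrate-idea in a nutshell (again a sketch: run the same loop with and without the actual conn.ping(), and subtract the overhead):

def f_calibrate(conn, n_pings=100):
    """Sketch of calibration: time n_pings real pings, then the same
    loop with the ping removed, and report the net ping-time."""
    t0 = time.perf_counter()
    for _ in range(n_pings):
        conn.ping()                               # loop WITH the RoundTrip
    ms_with = (time.perf_counter() - t0) * 1000.0

    t0 = time.perf_counter()
    for _ in range(n_pings):
        pass                                      # same loop, ping removed
    ms_without = (time.perf_counter() - t0) * 1000.0

    print(f"avg net pingtime: {(ms_with - ms_without) / n_pings:.4f} ms"
          f" (loop-overhead: {ms_without / n_pings:.6f} ms per iteration)")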
Something to Include into python programs...
To collect some data, I experimented with code to do "a sample of pings" inside the regular python programs. It would ping for a pre-determined interval (e.g. say 2 seconds?), or a pre-determined number of pings (say 100?). The output could go to stdout and get collected into the log. In future, I could also envisage storing the data in a table (it is an RDBMS after all).
For now: I was happy to have an approximate RoundTrip time (in ms) in stdout.
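A sketch of such a sampling-function (the cut-offs for interval and count are assumptions, and it re-uses f_rt_1ping() from above):

def f_sample_pings(conn, max_seconds=2.0, max_pings=100):
    """Ping until either the time-interval or the ping-count runs out,
    then report the average RoundTrip time to stdout (hence: the log)."""
    n, total_ms = 0, 0.0
    t_end = time.perf_counter() + max_seconds
    while n < max_pings and time.perf_counter() < t_end:
        total_ms += f_rt_1ping(conn)
        n += 1
    print(f"sampled {n} pings, avg RoundTrip: {total_ms / n:.3f} ms")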
Even Better: I also know the RoundTrips and the DB-time..
From earlier work, I already had a way to interrogate V$MYSTAT. This would tell me what the RDBMS thought that my session had done. I could re-use that sampling of the V$-views to report on the session activity, as seen from the RDBMS. It would tell me how many RoundTrips a program had done.
So near the end of each program (more accurately: before closing the DB-Connection), I could report the number of RoundTrips to stdout.
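The query behind that is roughly the following (a sketch; it needs SELECT-privileges on the V$-views, and of course the query itself costs a RoundTrip or two):

def f_report_roundtrips(conn):
    """Report the nr of SQL*Net RoundTrips done by this session."""
    sql = """select sn.name, st.value
               from v$mystat st
               join v$statname sn on sn.statistic# = st.statistic#
              where sn.name = 'SQL*Net roundtrips to/from client'"""
    with conn.cursor() as cur:
        cur.execute(sql)
        for name, value in cur:
            print(f"{name}: {value}")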
Let's run the test-program to completion:
With all this data to be "collected", the end-result, to be put at the bottom of a python-program, will look like this:
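In code, that bottom-of-program block is just a few calls (a sketch, using the helpers from above):

# at the bottom of the program, before closing the connection:
f_sample_pings(conn)          # approximate RoundTrip time (ms)
f_report_roundtrips(conn)     # nr of RoundTrips, as counted by the RDBMS
conn.close()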
With just a little extra code, any (relevant) python-program could, once it has done its job, report this data into stdout for examination by the sysadmins. And to myself, the DBA...
Notice that this program (just a test-thing), has done...
- a test of 1 ping, which yielded 27.266 ms RoundTrip time.
- a test of 100 pings, which yielded avg 27.373 ms.
- a test of 2sec calibrate-pinging, with an avg (net) pingtime of 27.0266.. ms.
And from the session-statistics, we detect:
- 177 RoundTrips.
Also notice: I can use the linux time command to report the real+user time consumed by the program. I would like to use that somehow. Maybe Later.
Note: I'll ignore the difference between db-time (microseconds from the latest time_model and centiseconds from the old mystats) for the moment. I might ask some people who know more about Oracle when we meet. And I'll keep an eye on that when running more relevant programs once the code is included...
(And Yes, I'll think about popping this data into some log-table - that is definitely an option! Later)
Making Management happy, with numbers...
So by collecting RT-times and number of RTs, I have "measures" for each program or program-run.
High RT times, high Latency, will cause time lost in "the network".
High nr of RTs, Chattiness, will cause time lost in "the network".
The product of the two should be approximately the total time lost on network-traffic; call that Network-time.
So by collecting these numbers, I can "guesstimate" the time lost in the network. And I can indicate the possible gains from either bringing down the RTs or placing the program closer to the RDBMS.
From the example above: with 177 Roundtrips, at an average of 27ms, those round-trips may have consumed about 177 x 27 = 4779 ms, or 4.8 sec of "network time".
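In python, that guesstimate is a one-liner (using the values collected above):

avg_ping_ms  = 27.0                     # avg RoundTrip from the ping-sample
n_roundtrips = 177                      # from v$mystat
net_time_ms  = n_roundtrips * avg_ping_ms
print(f"approx network-time: {net_time_ms:.0f} ms"
      f" ({net_time_ms / 1000:.1f} sec)")   # prints: 4779 ms (4.8 sec)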
This we can report to our Manager...
I have a slumbering idea to take it even further:
I can try to measure time spent in python, on the app-side (the container): App-time.
And I can try to measure time spent in the RDBMS: DB-time, from V$MYSTAT or V$SESS_TIME_MODEL.
And for non-threaded, non-parallel processes, the total App-time + DB-time + Net-time should add up to the running time of the program, as sketched below.
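Just to capture the idea, a hypothetical helper could look like this (all inputs assumed to be collected per program-run):

def f_time_breakdown(wall_ms, db_ms, n_rts, avg_ping_ms):
    """Hypothetical sketch: split the runtime of a (serial) program
    into App-time, DB-time and Net-time."""
    net_ms = n_rts * avg_ping_ms        # time lost in the network
    app_ms = wall_ms - db_ms - net_ms   # what remains: time spent in python
    print(f"App {app_ms:.0f} + DB {db_ms:.0f} + Net {net_ms:.0f}"
          f" ~= wall {wall_ms:.0f} ms")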
Maybe later. I have some more ideas to tweak code and experiment.
Focus for now is on reducing Chattiness and Hopefully reducing Latency.
Wrap up: It helps to collect Data.
At this point in time, I am just happy to collect "data" and use it as a broad indicator: finding the programs with the most RTs, and relating the RTs to "sensible values".
For example, if a certain program does Millions of RTs, but only processes a few thousand records (and if we somehow know that with certainty...): something is suspicious and it may need examination of code...
And in our case, any component(-location) with a high latency will also get some attention.
At least by collecting the data, I have arguments to discuss with various colleagues.
More to come. Always more Ideas. Maybe if time.
-- -- -- -- -- End of this blogpost, for now -- -- -- -- --
Appendix : links to code..
tst_netw.py : the test-program used for the demos and screenshots above.
Plus these additional files, needed to import helper functions into test-program:
ora_login.py : login and some other oracle-helper functions.
duration.py : time-keeping and reporting functions - instrumentation
prefix.py : print the filename + timestamp to stdout - instrumentation.
.env : to replace credentials with your own scott/tiger@host:port:service
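For completeness, the .env-idea in short (the variable-names here are just an assumption; adapt them to the ora_login.py you use):

# .env file, next to the program (do not commit real credentials):
#   DB_USER=scott
#   DB_PASSWORD=tiger
#   DB_DSN=host:port/service
import os
import oracledb
from dotenv import load_dotenv    # pip3 install python-dotenv

load_dotenv()                     # read the .env file into the environment
conn = oracledb.connect(user=os.getenv("DB_USER"),
                        password=os.getenv("DB_PASSWORD"),
                        dsn=os.getenv("DB_DSN"))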
that code: Take it!
Feel free to copy or plagiarize (do you really need a large disclaimer..?). Just Copy What You think is Useful.
If you have suggestions for improvements: Let me know. I'm always interested in discussing these things if time+place allow. Preferably in the margin of a Good Conference, or on one of the social channels (bsky, slack, email...).
Oh, and you may need to get (pip3) the following:
pip3 install oracledb
pip3 install python-dotenv
(datetime and array are part of the python standard-library, no pip needed.)
-- -- -- -- -- End of this blogpost, for Real -- -- -- -- --