TL;DR: Running Python against an Oracle 23ai database is Nice, but Don't Overload. I devised a few ways to throttle long-running Python processes. The first trick was to (dynamically) inject sleep-calls. And then I refined that to react to the actual load on the RDBMS.
Summary:
At first, I had an editable parameter-file to vary the sleep-time between loops. But then I refined it: the python program regularly checks the (average-)active-sessions, compares that to cpu_count, and takes a sleep when the system is considered "busy". The parameters to control this are imported via dotenv (link) and can be modified on the fly if needed. This allows long- or always-running jobs to retreat to the background when the system is busy, and to benefit from quiet intervals when they occur.
And by reacting to Average-Active-Sessions and cpu_count values, the jobs automatically and dynamically adjust to the load on the RDBMS, regardless of the hardware underneath.
Background: Batches running python code.
Sometimes I have to build + run utilities for python-jobs that utilise the RDBMS. Some of those programs can run for Hours, if not Days, while doing loads/extracts/analysis/summaries... One request was to prevent Overload, e.g. to allow the python-jobs to "stop running" at times when other systems and users were busy with "More Important Work". So in the end, I had to define and measure "busy" and act accordingly.
Note: I used to check with vmstat or sar (sysstat) to detect how busy my system is, or how much (percentage) idle I had left. Nowadays, I prefer to get my data from the RDBMS, for two reasons: 1) there are systems out there with SQL-only access (e.g. no linux prompt anymore...). And 2) systems nowadays will run CDBs with multiple PDBs, and databases limited by the CPU_COUNT parameter. Your "server" may have plenty of capacity (or not), but your database can only get 2 CPUs from it... (merits a separate story; try running the 23ai-free-version with the cpu_count parameter hard-set to 2 on an 8-cpu machine - anecdote later...).
Anyway, this all started very simply, with some calls to time.sleep(n_sec).
First Attempt: a simple sleep(n_sec) in every main loop.
The long-running programs are re-startable, which is good practice anyway because sh*t^h^h^h^h hiccups happen. So when the background-batch-processing needed to yield to more urgent work, we would kill some jobs (stop the containers), and re-start them later. Crude but simple.
But that meant the processing often stopped completely, and sometimes we forgot to re-start...
Next up was: stop the program (or the container running it), edit a sleep-call in some main loop, and re-start it. The program would then run at a slower pace. That worked. Sort-of.
But it still meant the process had to be killed and edited. And if left alone, it would either run slooow for the rest of the job, thus taking much longer than usual, or it would need another kill+edit+restart when the system had capacity to spare again. Plus, for some processes, the startup-overhead was significant. It worked, but it was still messy.
Next Attempt: dynamic sleep time, from an editable config-file.
Introducing: Throttle.py. This little module would read the dotenv-file at regular intervals and adjust the sleep-time to a value given in .env.
The code to implement this can be found in throttle.py (link below). The module is meant to be imported, but running it stand alone will perform a test- and demo-sequence.
To "throttle" a running program, now we just had to edit the .env and put in a larger value for THR_SLEEP_TIME:
# throttle, sleeptime
THR_SLEEP_TIME=3
Note: The re-read interval for the dot-env file with the latest (editable) threshold was hardcoded at 60 sec, to ensure a relatively responsive system. The overhead of reading the file was considered minimal compared to 60 sec. If you have to fetch this file from some remote location with significant latency, you can consider a larger interval.
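For illustration, a minimal sketch of that mechanic (the real throttle.py, linked below, is leading; the function name thr_check() is just my shorthand here):

import os, time
from dotenv import load_dotenv

g_last_read  = 0.0    # epoch of the last re-read of the .env
g_sleep_time = 1.0    # current sleep-time in seconds

def thr_check():
    # re-read the .env at most once per 60 sec, then sleep THR_SLEEP_TIME seconds
    global g_last_read, g_sleep_time
    if time.time() - g_last_read > 60:
        load_dotenv(override=True)                 # pick up on-the-fly edits
        g_sleep_time = float(os.getenv("THR_SLEEP_TIME", "1"))
        g_last_read  = time.time()
    time.sleep(g_sleep_time)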
We now had several advantages: no more kill+restart, but rather an easily editable value in the config file (the .env). This worked Fine for a while. It was a bit like opening and closing a valve on a reservoir. So far so good.
But this trick still required manual intervention: ogling the system-load and editing .env. Surely, there was a way to automate and improve on this.
The Database is the real, precious, Bottleneck.
The "load" on the system was, in this case, best monitored from the RDBMS, the Database. Hence, we needed some probe into that system to determine how much spare capacity it had to process our python jobs.
And that is where AAS comes in: Average Active Sessions (link - Google turned up this rather old webpage, but the idea is actually older still). By using the statistic for DB-time, the database will tell you how much processing was done in 1 unit of time.
AAS = Delta-DB_Time (seconds) / Elapsed_time (seconds).
Knowing AAS is nice, but what is my available capacity?
Your AAS tells you how many processes (on average) are running. But you need to relate it to your CPU_COUNT to know how "loaded" your system is. Hence we calculate percentage-busy as:
PCT_BUSY = 100 * AAS / CPU_COUNT.
On a system where AAS=3 (i.e. 3 processes busy, on average) and a cpu_count of 4, the pct_busy will be about 75 percent, hence there is only about 25 percent capacity, or one cpu, still idle. But on a system with 24 cpus, an AAS of 3 means it is only 100*3/24 = 12.5 percent busy and some 21 CPUs are still available to service more processes.
Using this simple pct_busy as a threshold would more or less adjust to both large and small systems. So for the moment, we decided to stick with this easy "percentage" (see the notes below on setting the threshold..).
So, let's build this...
Putting it to work: the python code.
First, to get the relevant info from the database, I needed a query.
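I won't claim this is the exact query from the linked source, but something along these lines fetches both values in one round-trip (note: v$sysstat reports "DB time" in centiseconds):

sql_aas = """
    select (select value from v$parameter where name = 'cpu_count') as cpu_count
         , (select value from v$sysstat  where name = 'DB time')   as db_time_cs
      from dual"""

with ora_conn.cursor() as cur:
    cur.execute(sql_aas)
    cpu_count, db_time_cs = cur.fetchone()

db_time_sec = float(db_time_cs) / 100.0   # centiseconds -> seconds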
And to determine delta_DB_time and delta_epoch_time, I needed two "global" variables to persist over time, to keep db_time and epoch.
Those definitions were simple:
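Something like this (the names g_prev_db_time and g_prev_epoch are my reconstruction; the linked source is leading):

# globals to persist between calls
g_prev_db_time = 0.0   # cumulative DB time (seconds) at the previous call
g_prev_epoch   = 0.0   # wall-clock epoch (seconds) at the previous call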
The function to execute the code starts by reading the dot-env and fetching the relevant values from the database.
Note that the parameter-file (dot-env) is read at Every Call to this function. When called too frequently (every few seconds...), the overhead can be significant. Hence ora_aas_chk() should ideally be called at intervals of 30 sec or more.
Now that we have all the information, we can start the logic:
If it is the first call, and there are no "previous" values, we do nothing (pass)
Else, if there are previous values, we calculate the AAS and PCT_BUSY.
Once we have the values, we can determine if a pause is justified. I've left the print (debug) statements in for the moment:
The Threshold determines if we need a pause (sleep). And at the end of the function we "set" the global variables that will be used to calculate the "deltas" on the next call. For good measure, we return the total time spent in ms.
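Putting those pieces together, and re-using the imports and the sql_aas query from the sketches above, a condensed sketch of ora_aas_chk() might look like this (parameter- and variable-names are partly my guesses; the linked ora_login.py is leading):

def ora_aas_chk(ora_conn):
    # check AAS against cpu_count; sleep when pct_busy exceeds the threshold.
    # returns the total time spent in this call, in milliseconds.
    global g_prev_db_time, g_prev_epoch
    t_start = time.time()

    load_dotenv(override=True)                                     # re-read .env at every call
    threshold_pct = float(os.getenv("ORA_AAS_THRESHOLD_PCT", "50"))
    pause_sec     = float(os.getenv("ORA_AAS_SLEEP_TIME", "10"))   # name is my guess

    with ora_conn.cursor() as cur:
        cur.execute(sql_aas)                                       # query shown earlier
        cpu_count, db_time_cs = cur.fetchone()
    db_time = float(db_time_cs) / 100.0                            # centiseconds -> seconds
    epoch   = time.time()

    if g_prev_epoch == 0.0:
        pass                                                       # first call: nothing to compare yet
    else:
        aas      = (db_time - g_prev_db_time) / (epoch - g_prev_epoch)
        pct_busy = 100.0 * aas / float(cpu_count)
        print(f"debug: aas={aas:.2f} pct_busy={pct_busy:.1f}")
        if pct_busy > threshold_pct:
            time.sleep(pause_sec)                                  # the database is busy: back off

    g_prev_db_time = db_time                                       # remember values for the next delta
    g_prev_epoch   = epoch
    return (time.time() - t_start) * 1000.0                        # total time spent, in ms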
If you insist on strictly "sleeping" your program until the pct_busy is below the specified threshold, you can use a while-loop:
while ora_aas_chk(ora_conn) > 1000:
    # sleep, and only continue when pct_busy is Below Threshold.
    pass
Repeat the call until the return-value is below, say, 1000 ms (assuming your pause-setting is more than 1 sec, as it should be). Any return of more than 1 sec signifies that we did pause-sleep, hence your database was considered "busy", and you may want to hold back your python-job from processing.
The code: added to ora_login.py
The logical place, at the time, to put this code was in ora_login.py (link to earlier blog..). In hindsight, I should maybe have isolated it as an individual module. Later.
When running the ora_login.py as standalone program, it performs a self-check, where the output from ora_aas_chk() looks like this:
Notice that a call to ora_aas_chk() on this system takes about 15 ms, which includes both the reading of the dot-env and the round-trip to the database to pick up the values.
This "debug output" is especially relevant when using the check for the first time in a a new environment. It gives you an idea how fast your loops run, and how much overhead the ora_aas takes.
The sleep-time is specified in Seconds (a nudge to linux-sleep). It is recommended to insert calls to ora_aas_chk() in code where you can expect at least 1 sec between calls. More prudent would be at least 30 or 60 seconds, or even much longer intervals. Longer intervals between calls avoid both the overhead of the calls and the risk of flip-flopping between fast and slow states of the python program.
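As a usage-example, such a call in a hypothetical main loop (work_batches and process_batch are just placeholder names):

for batch in work_batches:              # placeholder for the real work-list
    process_batch(ora_conn, batch)      # the actual work, placeholder name
    ora_aas_chk(ora_conn)               # sleeps when the database is considered busy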
Final Remarks (and more in the appendix)
In short: we now have something that we can configure to adjust to actual activity on the database, and thus benefit (automatically) from quiet periods.
Future...? I have several more ideas to beautify this little snippet of code: I could add an argument to make it run "silent" (simple). And I would consider peeking into vmstat or sar on the Database-server (not that simple..) to take into account more of the load from other components. Maybe in the Future.
But I'm sure most DBAs and sysadmins have similar (but better!) ideas and tools. If this helps some of you even a little, the good work is done.
Feel free to copy the code and modify it. Remember: Your most re-used code is the code written by Yourself.
So then maybe if we meet at some event or conference, you can tell me what I did wrong, and why your version is Much Better. I Want to Learn.
-- -- -- -- -- -- End of this blogpost, for now -- -- -- -- -- --
Appendices: Links to code and some more notes..
Note: the python files should perform a self-test when run stand-alone.
throttle.py : the first attempt, no connection, hence no database required.
ora_login.py : the (runnable) source, requires a connection to test.
.env : an example of dotenv-file (you must edit scott/tiger@localhost, unless....)
Question - why not resource-manager...
Mostly because doing it in python was a hobby-item. Partly because resmgr only kicks in when the system is already fully loaded, and partly from laziness, as none of the DBAs wanted to tweak the resource-plans and consumer-groups.
And of course, we do not run all our programs as SCOTT with full DBA privs...;-)
A Note on setting the ora_aas_threshold_pct:
I've chosen a percentage because many DBAs and Sysadmins think in percentages when they check for load. They will check vmstat, iostat or sar (from the linux sysstat package). These tools give you percentages for user-cpu, system-cpu and idle-cpu. And many old-hand admins check to see if the percentage idle-cpu is still at least in double-digits to decide if their system is (over)loaded.
In future, rather than using a percentage threshold, we can consider checking for "cpus-available", calculated as:
CPUS_AVAILABLE = CPU_COUNT - AAS.
But this merits a separate discussion. Maybe later.
A Note on AAS, and the measuring interval.
You can try to measure Active Sessions by querying v$session (status = ACTIVE, or otherwise), but that gives you a "point-measure" of the very moment your query runs.
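Such a point-measure would be something like this (rough sketch: count the active user-sessions and subtract our own):

sql_point = """
    select count(*) - 1
      from v$session
     where status = 'ACTIVE'
       and type   = 'USER'"""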
AWR and Statspack do a good job by measuring AAS over an interval, which is by default One Hour. But a 1-hour average may be a bit long for loads that change more dynamically. And some systems do not even have AWR (or Statspack) activated. I needed a value that is reliably there, and an interval relevant to the system and to the job at hand, hence I started doing my own sampling of DB_Time from v$sysstat. My AAS, which is a ratio of seconds-per-second, can then be calculated as:
AAS = Delta-DB_Time (seconds) / Elapsed_time (seconds).