Friday, 30 October 2009

Oracle Performance Tuning with hit ratios - Risky.

The Ratio, as drawn by Leonardo da Vinci.While he was sitting in front of a fireplace somewhere in New England, Martin Widlake prompted this quicky-post.


First my Confession: I still use "Ratios" in assessing and fixing performance.

I know current wisdom is that ratios and aggregated numbers are not good enough and should not be used in tuning. I agree that ratios can be abused, and can be hugely decepive.

But when guestimating the scalability or performance of an OLTP database-application, I use these three numbers:

Buffer-gets per row, 10 or less.
Buffer-gets per execute, 100 or less.
Buffer-gets per transaction, 1000 or less.

Martin will immediatly see why, IMHO, his result of 380 for the 1st or 2nd ratio was too high.

I picked those numbers for several, Simple,, reasons.

Firstly, these numbers are simple to remember.

Secondly, these numbers can simply be determined. You can get them from v$-views, from traces, from AWR (Licensed), or from statspack (free!).

Finally, these numbers (10, 100, 1000) have generally served me well in places where I looked into heavily loaded OLTP systems. I dont recall a single problem with any statement or transaction that managed to stay within those limits.


Now be careful. Your situation will be different. And you should not strive for these numbers per-se, but rather for happy customers. And whether these numbers really are appropriate for you depends on the number of calls you do, the type of work your users/systems do, on the capacity of your hardware, and the demands placed on your system. Some systems operate comfortable with one zero added to each number.

But if you need millisecond response times, or need to serve thousands of requests per minute, then my advice is: Strive for those numbers for every statement and transaction on critical the path.

Any query or transaction that consistently breaks one of those ratios, should be looked into.

And if your system doesnt meet the numbers, there are three fixes:
You either Eliminate the statements (just dont do it),
Optimize the offending statements (improve where-clause or indexes), or
Contain those statements (do it less frequent, and dont do it at peak-load periods).




A further word of warning on the general danger of "Ratios" is appropriate.

Phi, or the Golden ratio of 1.6180... Maybe we should look for some relevant way to use that, just for fun.We have had a clear demonstration, by Connor McDonald I think it was, that Hit-Ratios are not a good indicator for "performance". Connor was able to "set" the hit-ratio to whatever his manager wanted, with very little impact on performance [link?].

Subsequent other performance specialists, notably Cary Millsap, have indeed proven beyond doubt that ratios and other "aggregate" metrics are not good indicators of performance.

Back in the nineties, my colleagues and myself have used scripts based on numerous book (from the nineties...) to determine all sort of (hit)ratios for cache_size, shared-pool, redo en other settings. The ratios were derived by high-tech scripts we crafted, based on the books we bought. And from the nicely, SQL*Plus formatted, output we determined that we needed more memory, larger shared-pool, more redo-buffers or larger redo-files.

In very few cases did those ratios point us to the real problem and in even fewer cases did the advice make any difference.

In almost every case, the real solution came from B/Estat, tkprof, Statspack, AWR or ASH. And most of the time, we were brought straight to the heaviest query in v$sql.

That query was then either made more efficient (less buffer gets per execute) or was eliminated (just dont do it). In 90% of cases, that solved the problem (and sometimes improved a few ratios as a by-product).


My Conclusions:
Most Ratios are indeed of little use - they do not address the real problem.
But Some ratios can help you, if you know what you are doing (and if you know what your customer is trying to do).

With or without Ratios, you still need to find out what your system does, and use common sense to fix it.


I'll save a few other examples of Ratios for later, notably one I use to find locking problems. I think Martin can already guess which one.

10 comments:

mwidlake said...

Nice one Piet - I can't disagree with your ratios there. I'm looking forward to your next thoughts on the topic.

I have a worse confession. I still use the Cache Hit Ratio and I am of the opinion that it is very useful. I think I'll use my own blog to expand on my dirty confession, I would not want to taint your blog with such things...

PdV said...

Thanks Martin.

Indeed. The good old cache-hit ratio can easily be "shown" to new(bie) and old(-fashioned) dba-groupies. Just point out logical and physical reads from statspack. Then ask them: What would you say the hit-ratios is, and is that good or bad?. Makes them think.

Most modern system (OLTP) have very good cache-hit ratios, 99.9+ percent is not unusual. But they can still experience bad performance (or I wouldnt be looking into them).

I'm looking forward to your blog on the topic.

jonathanlewis said...

Piet,

There is an important difference between your "buffer gets per row" and the historical use of hit ratios which I can best demonstrate with a simple question:

Looking at the instance statistics from a one-hour snapshot, I see:

table fetch by rowid: 2.6 billion
session logical reads: 1.99 billion

According to your figures, does this indicate a problem or not ?

And the answer should be that you can't tell. You can even go to a finer granularity - assume a sprepsql report shows 3 versions and 1,000 executions for a query: the number of buffer gets is 90,000 and the number of rows is 1,000 - do you have a problem ?

Maybe, maybe not - perhaps 900 of your executions are doing 10 buffer gets and 100 are doing 810. (The awrsqrpt shows the work done by each version separately - which improves visibility, of course).

At a fine enough granularity, an observation of the type "for each thing of type X we do N things of type Y" is, indeed, an example of using a ratio - but you need a degree of uniformity in how you measure before the ratio can have a useful meaning.

jonathanlewis said...

... and then I got my row counts out by a factor of 10 - but I'm sure you get the idea.

PdV said...

Welcome, and grinning in agreement at you Jonathan. I got _an_ idea.

And Thanks for pointing out how granularity (and understanding!) is important.

I hope I have conveyed the "limitations" of ratios as well.
"Do we have a problem?" or "did we pinpoint a problem?" depends on more then just a good ratio.

Uniformity of measurement: good idea.
I had not given that any thought yet, as I only use ratios rather casual and tend to improvise. The V$SQL (diff over time) and WRH$SQL_STAT (deltas) probably are good (granular?) places to look.

Also, data-collection may be interesting for comparison. Sys.aux_stat$ comes to mind. But I dont want to return to a situation where we re-publish all sort of high-tech ratio-cripts with blind-to-follow advice attached. I'd rather get ppl to think and understand things better (myself included).


Let me distinguish two situations for my "ratio usage": Acceptance-Testing versus troubleshooting.

When in Acceptance-testing and capacity-planning, I am rather insistant on "good" ratios for gets/exe and gets/row (and plan-stability and LIO-predictability). If gets/exe are high (100s, 1000s, higher?) and the statements are expected to be fired at 1000s/hr, I know I have a potential CPU-problem, possibly a storage-bottleneck, and possibly longer-lasting transaction-locks, possibly a looming concurrency problem as well.
All bad news. Hence I might insist on good ratios, but I am never religious on it (hey, everyone is allowed their mistakes, I know I make many myself).

When in troubleshooting-mode, the ratios I mention are no more then possible indicators of trouble. Just showing the "excellent hit ratio" from a system in trouble to a colleague will make him think and will demonstrate the danger of "aggregates".

In troubleshooting mode, a lot depends on the situation. I would try to look in detail, for example for skews you mention, and for SQL_ids with multiple plans.
But I would also want to have an overview of the system both inside and outside of the database using OEM, Statspack and sar. And I would try to ascertain that the suspect-SQL was indeed related to (business) problem.


Nowadays, the CPU-capacity of some systems is large enough to ignore or hide a problem for quite some time. For example when data doesnt get purged.
But when a system then does reach (CPU)saturation it tends to hit "the knee" fairly hard.
Hence, my attempts to force ppl to write (and test) code for efficiency of individual actions.
Those detailed ratios are helpful there.
Concurrency, notably locking, is next on the list to check, but different topic.

PdV said...

hm... me rambling on. This thing started on impulse a week ago. Typed almost a complete post again. Long sentences too.

But then, the best (presentation-)topics are those with a grain of passion in them.

I hope I simplified or clarified something. Zzzz now.

Anonymous said...

Nice post

Bala Murugan said...

I got a job by saying this answer in my last interview. thanks for awesome help.
I got more idea about Oracle from Besant Technologies. If anyone wants to get Oracle Training in Chennai visit Besant Technologies

www.oracletrainingchennai.in/‎
www.oracletraininginchennai.in/
www.besanttechnologies.com

Bala Murugan said...
This comment has been removed by the author.
Bala Murugan said...
This comment has been removed by a blog administrator.