Friday 8 August 2008

Throttling and Triage: where do I make my difficult decisions?

Every now and then the discussion about the "processes" parameter flares up. Setting this parameter too low results in ORA-00020 "maximum number of processes (1024) exceeded", but too high a setting will strangle the database server with processes.

My favorite application manager always wants me to increase the number of processes. For an app-server and its end users it is bad publicity if ORA-00020 appears in a Java logfile or even in front of a (web-based) customer (now who is to blame for that type of error handling?). Hence the knee-jerk reaction to increase the "processes" parameter in the spfile. This demand is often accompanied by the assurance that the application will never (actively) use the high number of connections, but needs the high number to make very sure the error never (re)appears.

App-jockeys and managers generally refuse to take responsibility for setting or decreasing the upper limit on the number of connections in their JDBC connection pool. Some of the more exotic app-servers don’t even respect their own settings, and happily explode the number of connections to something in the 4-digit range per app-server instance.

Luckily, these app-servers will generally melt down by themselves, and that saves us from a database brownout with more disastrous consequences. DCD (Dead Connection Detection) or active killing then has to take care of the remaining connections, preferably before the clusterware (automatically, unstoppably) or the operators (eager to stay within SLA) fire up the next application server, which will also need to initialize its JDBC pool and hence needs the connections.

However, if we are unlucky, the app-server doesn’t melt down and the database hits max-processes, whereby other app-servers with a genuine need to increase their connections will also suffer. Not Good. And one reason why pools should be conservative in changing their number of connections.

For the DBA, it makes sense to set the parameter to a value whereby the database can still operate "normally". Allowing too many processes, even inactive or near-dead ones, makes no sense and needlessly consumes server memory, sockets and CPU cycles. Database and Unix zealots should now pop in and say that the processes parameter controls many more derived values (transactions, sessions, enqueue_resources) and therefore requires more careful consideration than just the shouting of the deployment team. I will stop there by saying: too high a setting is simply not beneficial. Picture the number of CPUs in your system, and imagine the overhead of keeping a high number of processes alive, whether you use PGA and Workload parameters or not. (I can see a whole set of nerdy comments coming: Fine! As long as we agree on this: processes should be set lower rather than higher; small is beautiful.)

In a legacy client-server environment, it often makes sense to use shared server or its predecessor MTS (Multi-Threaded Server). The shared-server construct is ideal for handling a large number of relatively quiet connections. As the "clients" in C/S are often unaware of each other's workload and existence, it is the database that needs to take on the job of sharing (pooling, queuing) the connections. Note that MTS or shared server amounts to pooling connections on the database server. Do we want the database server to be busy juggling connections? (IMHO: only if we have no choice, but in C/S the shared server can make sense.)

In a J2EE environment, it makes more sense to use dedicated connections. Each call to the database should be handled fast, and the connection should then be made available for the next thread that needs it. The database should ideally focus on doing its ACID task, and not be bothered with load- or connection-sharing. The connection pool can handle that. The JDBC pool mechanism is the component that should limit the number of connections and take care of the throttling. A JDBC pool should ideally open its maximum (= optimal) number of connections and keep those open, as the creation of a new connection takes time (when in a hurry, you can’t afford to wait for a new connection to be opened). Provided the number of processes (connections) is not allowed to go over a workable limit, there is no reason why each connection pool should not be allowed to pre-create a fair number of connections. But the upper limit should be firmly set to a value where both the app-server and the database server can still operate efficiently.
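
To make the idea concrete, here is a minimal sketch of such a pool. It is written in Python rather than against any real JDBC pool, and every name in it (BoundedPool, PoolExhausted, the connect factory) is made up for illustration. It opens its maximum number of connections up front, never opens more, and makes a caller wait briefly and then fail when the pool is busy. That last bit is the throttle.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Fixed-size pool: opens max_size connections up front, never more."""

    def __init__(self, connect, max_size):
        # 'connect' is a hypothetical zero-argument factory returning a connection.
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=0.5):
        # Wait briefly for a free connection; do NOT open an extra one.
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("pool busy, request throttled") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

The point of the design: under overload the app-server gets a PoolExhausted it can handle gracefully (retry, queue, "please try again"), and the database never sees more than max_size connections from this pool.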

Allowing too many connections under increasing load will result in degraded performance, and worse, in a meltdown of the instance. If that happens, everybody suffers, and not just the last-clicked-user.

Two common causes of high numbers of connections are:

Transactions take too long (performance or locking problems, improper use of XA), or
Connections are not released back to the pool (sloppy coding or just plain bugs).

Therefore, it makes good sense to include connection counts and connection monitoring in the test plans, and to monitor (count, plot) connections during live operations. My strategy is to always set processes to a conservative value. I use it to protect my database against brownout and meltdown. It is up to the Java code (or other app-level components) to use the number of available connections to the best extent, and to provide an error-handling or throttle mechanism to handle overload.
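
As an illustration of what such connection monitoring could look like on the app side, below is a small sketch that keeps track of who has a connection checked out and for how long. It is not taken from any real pool implementation; CheckoutTracker and its method names are hypothetical, and the threshold is an assumption you would tune yourself.

```python
import time


class CheckoutTracker:
    """Tracks checked-out connections to spot leaks and long transactions."""

    def __init__(self, max_hold_seconds, clock=time.monotonic):
        self._max_hold = max_hold_seconds
        self._clock = clock  # injectable so the tracker can be tested
        self._checked_out = {}  # connection id -> (owner label, checkout time)

    def checked_out(self, conn_id, owner):
        self._checked_out[conn_id] = (owner, self._clock())

    def released(self, conn_id):
        self._checked_out.pop(conn_id, None)

    def count(self):
        """Spot-count of connections currently in use."""
        return len(self._checked_out)

    def suspects(self):
        """Owners holding a connection longer than max_hold_seconds."""
        now = self._clock()
        return sorted(
            owner
            for owner, since in self._checked_out.values()
            if now - since > self._max_hold
        )
```

Both causes show up the same way here: an owner that keeps appearing in suspects() either runs transactions that take too long or never gives its connection back.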

The approach in short:

Determine how many connections the DB can serve while maintaining reasonable response times.
Set that as max-processes.
Tell your app-servers to stay within the limit (and use max-pool-size).
Monitor the session high-water mark (does it ever approach the max?).
Do spot-counts of connections (and plot them over time).
Audit sessions to know where they come from (find the origin of high numbers).
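
The arithmetic behind these steps can be sketched as follows. The function names, the "reserved" allowance and the 90% threshold are assumptions for illustration, not Oracle settings; the high-water number would have to come from your own monitoring (a view such as v$resource_limit is one possible source).

```python
def pool_size_per_server(processes_limit, reserved, app_servers):
    """Split the connection budget evenly over the app-servers.

    'reserved' is an assumed allowance for background processes,
    DBA sessions and batch jobs that must always be able to connect.
    """
    if app_servers <= 0:
        raise ValueError("need at least one app-server")
    usable = processes_limit - reserved
    if usable <= 0:
        raise ValueError("reserved allowance eats the whole limit")
    return usable // app_servers


def approaching_limit(high_water, processes_limit, threshold=0.9):
    """True when the observed high-water mark is close to the limit."""
    return high_water >= threshold * processes_limit
```

For example, with processes at 300, 50 kept in reserve and 4 app-servers, pool_size_per_server(300, 50, 4) gives each pool a maximum of 62 connections, so even with every pool full the database stays under its limit.
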

Bottom line: a surge in traffic should not be allowed to cause a meltdown of the database. High volumes should be throttled higher up in the stack (load balancers, web-servers, app-containers).

If the throttling mechanisms are absent or not working, then I think the database has a legitimate need to keep processes at a reasonable value. Potentially a difficult decision, but someone has to take it. Triage can be painful, but there is a good reason to do it. The survival of the system can depend on it.

Admittedly, this is a very db-centered approach. Waiting for some debate from the app-guys now.