Monday, 6 October 2008

Backup and Recovery, at what level

Yes, this is Yet Another Replication Discussion (YARD).

Indeed, this beast seems to pop up all the time. Today’s trigger was a discussion with some architects on Disaster Recovery in a SOA environment. I’ll try to give it a nice twist, and I will also keep my lunch-budget in mind.


To the serious reader, this article is actually a DataGuard plug.
But with all the buzzwords that spin around these days, this article can be seen as Art: It makes you laugh, it makes you cry, it makes you THINK.
And maybe those SOA/BPEL types do have some good ideas sometimes.



Marketing Tagline:
With SOA and ESB, the DR-replication can move upwards in the technology stack (and whatever the flavor of the month is: Grid, anyone? Cloud computing? The sky is the limit!).

On an even lighter note:
You should know the difference between an architect and a terrorist. It is at the bottom of this page.


(Health Warning: many buzzword-abbreviations ahead)


In most cases, I will advise anyone with HA (high availability) or DR (Disaster Recovery) plans to first consider a DG (DataGuard) construction. Replication of your data to a remote site offers many (documented and repeated) advantages (a quick standby health-check is sketched below this list):

DG offers you an off-site backup (but you should still consider tape or disk backups!)

DG gives you reporting on a read-only-opened database (even better in 11g with real-time query...)

DG allows you to use the 2nd or 3rd site/system to do maintenance or upgrades (but be careful, you may want 3 or even 4 systems in that case, but you can tick some very fancy uptimes in your SLA box).
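
For the curious, a minimal sketch of that health-check, assuming a 10g-or-later physical standby and access to v$dataguard_stats (run it on the standby; names and units vary slightly per release):

-- How far does the standby trail the primary?
SELECT name, value, time_computed
  FROM v$dataguard_stats
 WHERE name IN ('transport lag', 'apply lag');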

As an alternative to DG, most organizations will consider using SAN technology for replication. A surprising number of organizations seem to mix the two to complicate life.

Sideline: Some are even dreaming over sales-brochures (and some are entertained at seminars) on “stretched” clusters. When it comes to Oracle databases and RAC, these stretched-beasts are worth a separate story and a very carefully worded one indeed: I can’t kick powerful vendors/partners/colleagues, but I can’t put customers at risk either (?). Maybe Later. I have learned the hard way to Never criticize powerful vendors; it limits your lunch-invites. See bottom of page for lunch-items.



First on DataGuard

Despite my own confidence in Physical Standby (and the current limitations of Logical Standby), Oracle seems to be moving towards Logical (SQL-apply) rather than Physical (redo-apply). Because of the looser ties between primary and replicas, "Logical" will offer more possibilities for maintenance and management. The logical replicas do not need to be physically compatible and can span different versions. It is a trend we can see with open-source databases as well: logical replication is conceptually more powerful than the redo-apply mechanism originally used in Oracle Standby. DataGuard would merit a whole blog post all by itself, but I have a different point to make today. I'll repeat what I said before: DG is a very good start to improve the availability of your system.
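
For what it is worth, a one-liner (assuming a reasonably recent release where these columns exist) tells you which flavour you are actually running:

-- Role, protection mode and open mode of this database
SELECT database_role, protection_mode, open_mode FROM v$database;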



Now on SAN replication

The other route chosen by many organizations is to "outsource" the replication to the SAN. The SAN system will take care of block-level replication of all storage. Attractive, often simple to set up (just shout at the SAN engineers, they will click and drag until it works), and heavily advertised (and paid for). SAN replication is especially attractive if you have to replicate more than just your Oracle data, as is often the case. The main downsides of SAN replication are the complete dependency on the SAN technology, the license cost, and the higher volume of data to be transferred (compared to Logical or Physical DataGuard). SAN replication works if you have sufficient bandwidth, low latency and capable storage engineers (not to mention the friendly smiling vendor).
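
To put a rough number on that transfer-volume argument, a sketch like this (assuming your archived logs are still listed in v$archived_log, and a single archive destination) gives the daily redo volume DataGuard would need to ship; compare that to what the SAN would push across the wire:

-- Rough daily redo volume (MB); with multiple destinations, filter on dest_id to avoid double counting
SELECT trunc(completion_time) AS log_day,
       round(sum(blocks * block_size) / 1024 / 1024) AS redo_mb
  FROM v$archived_log
 GROUP BY trunc(completion_time)
 ORDER BY log_day;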

SAN replication and DG replication, if applied properly, can both be very powerful mechanisms. I recommend choosing the one you are familiar with (or get the best license-deal on) and stick with it. I would not recommend using both SAN and DG in an inter-dependent mix (a surprising number of databases are indeed replicated by both concepts, and that tends to add to the confusion in case of recovery).

For Oracle/RAC on stretched clusters: suffice it to say that you shouldn't pay for this yet, but if the (hardware) vendor wants to demonstrate this at fancy locations/parties: do encourage them to invite you. The prices of this technology warrant a Business-class ticket (with spouse, and partner-program organized).



Coming to the point: The next alternative.

We now have a 3rd alternative: SOA-layer replication. To most of you this may be old news, or indeed an old concept re-worded: you are correct. Read on. And tell me how I got it all wrong when I became entangled in the latest hype without really understanding the concepts (or even knowing the jargon).

Stack-wise, I would situate SAN replication at the bottom of the technology stack (deep down in the infrastructure layer). And I would place DataGuard somewhat higher up in the stack (closer to the app-layer). Hold this layer-idea for a moment.

Enter stage: the Service Oriented Architecture (SOA) with the concept of an Enterprise Service Bus (ESB). Apart from the number of new abbreviations, what really struck me about SOA were the concepts of "above" and "below" the Layer (anyone read Tom Clancy in the nineties?). For some architects, the database is just a "service" that takes care of persistent storage. They are right (they are, after all, architects). Google: "Database + ACID", possibly add "Codd + rules + 13" for some very basic stuff on databases. They are correct; ACID-ity is the essence of a database.

Now think of replication for DR purposes as a means to store things in multiple locations, and possibly as a means to become technology-independent. And imagine a SOA request to store data being sent not to a single service, but to two or more services. Borrowing from the DG concept: as long as two or more services have acknowledged storage, we can consider the data safe.

The next level of "Replication", and one that the SOA crowd will understand!

Following this line of thought, the storage layer becomes Yet Another form of the Grid Concept (YAGC?) or Yet Another Computing Cloud (YACC!). And the calling process doesn’t care anymore where the data is stored. Just as your GmailFS file system doesn’t know or care where the data is stored. It knows data is stored, and that is sufficient. (nerds and architects alike: take note of GmailFS, I’m really curious: is Gmail the ultimate ESB?).

Like many architects that refuse to look “below the layer”, we can now state that we have a concept (eeh, a resource?) for replication and that this DR pattern is “Pure SOA” and “fully ESB compliant”. It even relates to “GRID” and will thus gain us brownie points in the Grid-Indexes of various partners (more links below).



Myself, I will of course stick with Physical Standby for the moment. But it doesn’t hurt to explore the possibilities. A manager with some spare budget might even turn it into a seminar (with lunch, I would presume).



The Relevant Links are


(SQL> select url from relevant_stuff order by credibility desc;):



Dataguard and Oracle high availability (valuable stuff here):

http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html


Gmail fs (think about this one):

http://richard.jones.name/google-hacks/gmail-filesystem/gmail-filesystem.html


SOA and ESB (…):

http://www-128.ibm.com/developerworks/webservices/library/ws-soa-design1

http://www.managementsoftware.hp.com/products/soa/ds/soa_ds.pdf


More on Grid hype (…):

www.hp.com/techservers/grid/index.html

http://www.oracle.com/corporate/press/2005_apr/emeagridindex2.html



Note for all Alliance Managers: take note of my cross-links to our “powerful partners”, and please keep me on the invite-lists, I do appreciate quality food, I’ll even tolerate the mail and spam for it.



In the pipe-cleaning department or “mopping up loose ends”, I owe you these:

Lunch: There is no such thing as a free lunch (unless you are the lunch).

Architects: The difference is that you can (sometimes) negotiate with the terrorists.

Now shoot me (over lunch?).

Friday, 8 August 2008

Throttling and Triage, where do I make my difficult decisions

Every now and then the discussion about the "processes" parameter flares up. Setting this parameter too low results in ORA-00020 "maximum number of processes (1024) exceeded", but too high a setting will strangle the database server with processes.

My favorite application manager always wants me to increase the number of processes. For an app-server and for its end users it is bad publicity if ORA-00020 appears in a java logfile or even in front of a (web-based) customer (now who is to blame for that type of error handling??). Hence the knee-jerk reaction to increase the "processes" parameter in the spfile. This demand is often accompanied by the assurance that the application will never (actively) use the high number of connections, but needs the high number to make very sure the error never will (re)appear.
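
Before giving in, I tend to check what the application actually uses. A quick look (assuming you can query v$resource_limit) shows current usage and the high-water mark against the configured limit:

-- Current usage and high-water mark versus the configured limit
SELECT resource_name, current_utilization, max_utilization, limit_value
  FROM v$resource_limit
 WHERE resource_name IN ('processes', 'sessions');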

App-jockeys and managers generally refuse to take responsibility for setting or decreasing the upper limit on the number of connections in their JDBC connection pool. Some of the more exotic app-servers don't even respect their own settings, and happily explode the number of connections to something in the 4-digit range per app-server instance.
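
A quick spot-count against v$session (a sketch; adjust columns to taste) usually shows exactly which app-server instance is hogging the connections:

-- Who owns all these connections? Count sessions per originating machine and program
SELECT machine, program, count(*) AS cnt
  FROM v$session
 GROUP BY machine, program
 ORDER BY cnt DESC;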

Luckily, these app-servers will generally melt down by themselves, and that saves us from a database brownout with more disastrous consequences. DCD (Dead Connection Detection) or active killing then has to take care of the remaining connections, preferably before the clusterware (automatic, unstoppable) or the operators (eager to stay within SLA) fire up the next application server, which will also need to initialize its JDBC pool, and hence needs the connections.

However, if we are unlucky, the app-server doesn't melt down, and the database hits max-processes, whereby other app-servers with a genuine need for more connections will also suffer. Not Good. And that is one reason why pools should be conservative in changing their number of connections.

For the DBA, it makes sense to set the parameter to a value at which the database can still operate "normally". Allowing too many processes, even inactive or near-dead ones, makes no sense and consumes unnecessary server memory, sockets and CPU cycles.

Database and Unix zealots should now pop in and say that the processes parameter controls many more derived values (transactions, sessions, enqueue_resources) and therefore requires more careful consideration than just the shouting of the deployment team. I will stop there by saying: too high a setting is simply not beneficial. Picture the number of CPUs in your system, and imagine the overhead of keeping a high number of processes alive, whether you use PGA and workload parameters or not. (I can see a whole set of nerdy comments coming: Fine! As long as we agree on this: processes should be set lower rather than higher; small is beautiful.)
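
For the record, a quick peek at those derived values (a sketch; the exact derivation formulas differ per release) looks like this:

-- Processes and the values derived from it
SELECT name, value
  FROM v$parameter
 WHERE name IN ('processes', 'sessions', 'transactions');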

In a legacy client-server environment, it often makes sense to use shared server or its predecessor MTS (Multi Threaded Server). The shared-server construct is ideal to handle a large load of relatively quiet connections. As the "clients" in C/S are often unaware of each other's workload and existence, it is the database that needs to take on the job of sharing (pooling, queuing) the connections. Note that MTS or shared server amounts to pooling connections on the database server. Do we want the database server to be busy juggling connections? (IMHO: only if we have no choice, but in C/S, shared server can make sense.)
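
A minimal shared-server setup could look like the sketch below (hypothetical values; size the number of shared servers and dispatchers to your own client-count and workload):

-- Hypothetical shared-server settings; tune to your own workload
ALTER SYSTEM SET shared_servers = 10;
ALTER SYSTEM SET dispatchers = '(PROTOCOL=TCP)(DISPATCHERS=2)';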

In a J2EE environment, it makes more sense to use dedicated connections. Each call to the database should be handled fast, and the connection should be made available for the next thread that needs it. The database should ideally focus on doing its ACID task, and not be bothered with load- or connection-sharing; the connection pool can handle that. The JDBC pool mechanism is the component that should limit the number of connections and take care of the throttling. A JDBC pool should ideally open its maximum (=optimal) number of connections and keep those open, as the creation of a new connection takes time (when in a hurry, you can't afford to wait for a new connection to be opened up). Provided the number of processes (connections) is not allowed to go over a workable limit, there is no reason why each connection pool should not be allowed to pre-create a fair number of connections. But the upper limit should be firmly set to a value where both the app- and the database server can still operate efficiently.
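
You can verify which flavour your JDBC connections actually get with a small sketch on v$session ('DEDICATED' is what you would expect for a J2EE pool):

-- Dedicated or shared? v$session.server tells you
SELECT server, count(*) AS cnt
  FROM v$session
 GROUP BY server;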

Allowing too many connections under increasing load will result in degraded performance, and worse, in a meltdown of the instance. If that happens, everybody suffers, and not just the last-clicked-user.

Two common causes for high numbers of connections are generally:

Transactions take too long (performance or locking problems, improper use of XA), or
Connections not released back to the pool (sloppy coding or just plain bugs)
Therefore, it makes good sense to include connection-count and connection monitoring in the test plans, and to monitor (count, plot) connections during live operations. My strategy is to always set processes to a conservative value; I use it to protect my database against brownout and meltdown. It is up to the Java (or other app-level) components to use the number of available connections to the best extent, and to provide an error-handling or throttle mechanism to handle overload.
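
For the first of those two causes, a sketch like this (blocking_session is available from 10g onwards) points out sessions stuck behind a lock, which is where the connections tend to pile up:

-- Sessions waiting behind another session (lock waits)
SELECT sid, blocking_session, seconds_in_wait, event
  FROM v$session
 WHERE blocking_session IS NOT NULL;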

The approach in short:

Determine how many connections the DB can serve while maintaining reasonable response-times.
Set that as max-processes.
Tell your app-servers to stay within the limit (and use max-pool-size).
Monitor the session high-water mark (does it ever approach the max?)
Do spot-counts of connections (and plot them over time)
Audit sessions to know where they come from (find the origin of high numbers); a minimal audit sketch follows this list.
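
As promised, a minimal sketch for that last point, assuming standard auditing is enabled (audit_trail set to DB) and you have access to dba_audit_session:

-- Record logons, then count them per origin to see who opens all those sessions
AUDIT SESSION;

SELECT os_username, username, terminal, count(*) AS cnt
  FROM dba_audit_session
 GROUP BY os_username, username, terminal
 ORDER BY cnt DESC;
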
Bottom line: a surge in traffic should not be allowed to cause melt-down of the database. High volumes should be throttled higher up in the stack (load balancers, web-servers, app-containers).

If the throttling mechanisms are absent, or are not working, then I think the database has a legitimate need to keep processes at a reasonable value. Potentially a difficult decision, but someone has to take it. Triage can be painful, but there is a good reason to do it. The survival of the system can depend on it.

Admittedly, this is a very db-centered approach. Waiting for some debate from the app-guys now.