Understanding Service Level Agreements

Ensuring Service Level Agreements Can Be Met

It’s often the case that despite your best efforts, the Service Level Agreements cannot be met, and you won’t discover this until disaster strikes. In order to feel comfortable with the agreements in place, it’s crucial that you anticipate and plan for disaster.
DBAs often make the mistake of defining disaster too narrowly. Small events can have just as big an impact as the larger, less likely ones. An appropriate Disaster Recovery plan is one that anticipates a variety of disasters and implements processes to test, through simulation, that the recovery plans are valid. The next section, Planning for Disaster, addresses these topics.

Planning for Disaster
There are two ways of dealing with the potential for disaster: expect and plan for it, or ignore it, head to the pub and hope for the best. As attractive as that last option sounds, it’s not something I recommend!
Planning and preparing for disaster can be an expensive and complicated process. A good DBA expects disasters to occur and obtains management support and funding to prepare, plan and test recovery plans for a variety of potential disasters. There is nothing like an actual disaster to make or destroy a DBA career!
Let’s consider the types of disasters that may occur, what’s involved in planning for a complete site failure, and the process of simulating disasters in order to test recovery plans.

Likely Disasters
Ask a DBA, or any IT person for that matter, for their definition of disaster, and most of the responses will involve cataclysmic events involving tsunamis, bushfires, landslides and various other events that destroy entire environments and all the hardware and data contained within. Whilst these are valid possibilities, they are rare and unlikely to happen to all but the most unlucky of us!
Far more likely are disasters that occur on a much smaller scale, like accidentally dropping a table (see table 3). Without a plan for dealing with small events such as these, their impact on an organization can be just as large as that of the earlier examples.

Table 3 “Disasters” include smaller, more frequently occurring events such as disk failure. Whilst not as dramatic as a fire that destroys a whole building, they can have just as much impact.

Preventing and dealing with these issues is beyond the scope of this article. The important point here is for a DBA to anticipate and prepare recovery plans for a range of potential disasters, not just the big ones.

Complete Site Failure
The previous section provided examples of smaller, contained disasters. It’s also important to plan and develop a process for a complete environment failure. Not only is such a plan required for large-scale disasters, it also helps minimize downtime when migrating systems from one physical location to another, or during a side-by-side upgrade. Whilst such migrations may never actually occur, having such a plan ready to go is a valuable asset, and the process of producing it often leads to a deeper understanding of the systems you manage.
Planning for a complete environment failover is obviously orders of magnitude more complicated than planning for individual disasters. It’s common for database systems to be connected to a number of other systems, possibly those from other organizations via custom interfaces. A database recovery plan for a complete environment failover should therefore take into account all connected systems, and should be part of a wider recovery plan.
Whilst each organization will have its own unique recovery process for a complete site failure, there are a number of common items that should be included in all such plans:

Declaration of Disaster. Who declares the disaster? Should failover to the backup environment be automatic or manual? Whilst automatic failover plans sound good, they are typically very expensive and must make a range of assumptions on the definition of “failure”. Most sites, even if they have the technology for automatic failover, prefer a manual declaration and failover method.

Failover Systems. These can be high or low tech. Examples include automatic failover using synchronous database mirroring, or an agreement with another organization to share computing resources when required.

Documentation. Detailed documents should be created and maintained to assist in the recovery process.

Offsite Backups. This can be as low tech as the DBA taking home backup tapes, or a more sophisticated technique involving block level disk replication to a remote site using SAN technology.

The sophistication of a disaster recovery plan is usually determined by both Service Level Agreements and the Declaration of Disaster method.
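
To make the Declaration of Disaster item more concrete, here is a minimal Python sketch of the manual approach: monitoring only flags a suspected failure, and failover proceeds only once a named person declares the disaster. The class name `DisasterMonitor`, the threshold of three consecutive failed health checks, and the "on-call DBA" role are all assumptions made for illustration; they do not come from any particular product, and a real definition of “failure” would need far more thought.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisasterMonitor:
    """Hypothetical helper: tracks health checks and a manual declaration."""

    failure_threshold: int = 3      # assumed: 3 consecutive failures = suspected
    consecutive_failures: int = 0
    declared_by: Optional[str] = None

    def record_check(self, healthy: bool) -> None:
        # A healthy check resets the count; a failed check increments it.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1

    @property
    def failure_suspected(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def declare(self, person: str) -> bool:
        # A person may only declare a disaster once a failure is suspected.
        if not self.failure_suspected:
            return False
        self.declared_by = person
        return True

    @property
    def should_fail_over(self) -> bool:
        # Manual method: never fail over without a human declaration.
        return self.declared_by is not None


monitor = DisasterMonitor()
for healthy in [True, False, False, False]:
    monitor.record_check(healthy)

print("Failure suspected:", monitor.failure_suspected)
print("Fail over before declaration:", monitor.should_fail_over)
monitor.declare("on-call DBA")
print("Fail over after declaration:", monitor.should_fail_over)
```

An automatic failover method would, in effect, remove the `declare` step and act on `failure_suspected` alone, which is why it depends so heavily on the assumptions built into the definition of failure.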

Simulating and Testing Recovery
As well as anticipating disasters and preparing recovery plans, a DBA must obtain funding to simulate disasters and test the corresponding recovery plan. Depending on the disaster and the sophistication of the recovery plan, this can be an expensive exercise.
A DBA should rank potential disasters from the most likely to the least likely, and prioritize funding and planning for those most likely to occur. Alternatively, events which will have the largest organizational impact may be given the highest priority. In any case, simulating and testing recovery is crucial in ensuring a proven plan is ready to go when required.
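
The two prioritization approaches can be sketched in a few lines of Python. This is an illustration only: the `Scenario` class, the 1 to 5 scales and every score below are hypothetical placeholders, not measured figures, and each organization would substitute its own assessments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)
    impact: int      # hypothetical scale: 1 (minor) to 5 (severe)


def rank_by_likelihood(scenarios):
    """Most likely disasters first."""
    return sorted(scenarios, key=lambda s: s.likelihood, reverse=True)


def rank_by_impact(scenarios):
    """Largest organizational impact first."""
    return sorted(scenarios, key=lambda s: s.impact, reverse=True)


# Placeholder scores for illustration only.
scenarios = [
    Scenario("Complete site failure", likelihood=1, impact=5),
    Scenario("Accidentally dropped table", likelihood=4, impact=3),
    Scenario("Disk failure", likelihood=3, impact=4),
]

print("By likelihood:")
for s in rank_by_likelihood(scenarios):
    print(f"  {s.name} (likelihood {s.likelihood})")

print("By impact:")
for s in rank_by_impact(scenarios):
    print(f"  {s.name} (impact {s.impact})")
```

The two orderings can differ considerably, so it is worth deciding up front which one drives the funding request.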
Finally, any plan should be tested on a regular basis, should involve all the relevant people, and should ideally be tested at a random, unannounced time, much like a fire drill. Testing on a regular basis will ensure the continued accuracy of the recovery details, and, if nothing else, a random recovery test will liven up the day!
In summary, the idea behind disaster recovery planning is to reduce the surprise factor when disaster occurs.

This article is an excerpt from Rod Colledge’s forthcoming book SQL Server 2008 Administration, published by Manning Publications. All rights reserved.