Wednesday, July 9, 2008

Data center disaster recovery considerations checklist

Looking back on my data center disaster recovery experiences as both an IT director and consultant, I regularly encountered organizations in various stages of defining, developing, implementing and improving their disaster recovery capabilities. Disaster recovery policies and architectures, especially in larger organizations, are complex. There are lots of moving parts: standards and procedures to be defined, people to be organized, technology functions to be coordinated and application dependencies to be identified and prioritized. Add to this mix the challenge of grappling with the inherent uncertainties (chaos) associated with a disaster – whatever the event might be – and the complex becomes even more convoluted.

It is critical to agree on some fundamental assumptions, both internally and externally (think stakeholders and stockholders). Doing so forces everyone to recognize the need to address these many facets of disaster recovery development. Failure to do so will only lead to significant problems down the road.

I've given many presentations that address the "DR Expectations Gap," in which business assumptions concerning recoverability are often misaligned with actual IT capabilities. Unless those assumptions are explicitly identified and clearly communicated, the heroes of yesterday's recovery will become tomorrow's disaster recovery scapegoats.

Key among these assumptions, of course, is establishing classes of recovery in terms of recovery time objective (RTO) and recovery point objective (RPO), but there are also a number of fundamental considerations that need to be measured, weighed and incorporated into the disaster recovery planning process. Here are a few practical planning items whose assumptions must be stated explicitly in order to drive an effective disaster recovery design and plan:

  1. Staff: Will the IT staff be available and able to execute the disaster recovery plan? How will they get to the alternate disaster recovery site? Are there accommodations that need to be made to ensure this? When a disaster hits, you had better understand that some of your staff will stay with their families rather than immediately participate in data center recovery.
  2. Infrastructure: What communications and transportation infrastructure is required to support the plan? What if the planes aren't flying or the cell phones aren't working or the roads are closed?
  3. Location: Based on the distance of the disaster recovery site, what categories of disaster will or will not be addressed? As a best practice, that site should be far enough away that it is not affected by the same disaster that hits the primary site – is yours?
  4. Disaster declaration: How does a disaster get declared, and who can declare it? When does the RTO "clock" actually start?
  5. Operation of the disaster recovery site: How long must it be operational? What will be needed to support it? This is even more important if you're using a third party (e.g., what's in my contract?).
  6. Performance expectations: Will applications be expected to run at full performance in a disaster recovery scenario? What level of performance degradation is tolerable and for how long?
  7. Security: Are security requirements in a disaster scenario expected to be on par with pre-disaster operation? In some specific cases, you may require even more security than you originally had in production.
  8. Data protection: What accommodations will be made for backup or other data protection mechanisms at the disaster recovery site? Remember, after day one at your recovery site, you'll need to do backups.
  9. Site protection: Will there be a disaster recovery plan for the disaster recovery site? And if not immediately, then who's responsible and when?
  10. Plan location: Where will the disaster recovery plan be located? (It had better not be only in your primary data center.) Who maintains it? How will it be communicated?
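
To see why item 4 matters, here is a minimal Python sketch of how the RTO deadline shifts depending on when the "clock" starts. Everything in it is an assumption made for illustration: the `RecoveryClass` structure, the "tier-1" name, the 4-hour RTO, the 15-minute RPO and the timestamps are hypothetical, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryClass:
    """One class of recovery; the name and targets used below are hypothetical."""
    name: str
    rto: timedelta  # recovery time objective: maximum tolerable downtime
    rpo: timedelta  # recovery point objective: maximum tolerable data loss


def rto_deadline(clock_start: datetime, recovery_class: RecoveryClass) -> datetime:
    """Time by which service must be restored, measured from clock_start."""
    return clock_start + recovery_class.rto


def oldest_acceptable_copy(event_time: datetime, recovery_class: RecoveryClass) -> datetime:
    """Recovery data must be no older than this to meet the RPO."""
    return event_time - recovery_class.rpo


# Hypothetical targets and timestamps, chosen only to show the arithmetic.
tier1 = RecoveryClass("tier-1", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
event_time = datetime(2008, 7, 9, 2, 0)    # the outage begins
declared_at = datetime(2008, 7, 9, 5, 30)  # an authorized person declares the disaster

# Two readings of "when does the RTO clock start?"
deadline_from_event = rto_deadline(event_time, tier1)
deadline_from_declaration = rto_deadline(declared_at, tier1)
gap = deadline_from_declaration - deadline_from_event

print("Deadline if the clock starts at the event:      ", deadline_from_event)
print("Deadline if the clock starts at the declaration:", deadline_from_declaration)
print("Difference between the two assumptions:         ", gap)
```

With these made-up numbers, the business and IT could be working toward deadlines that are hours apart without either side knowing it, which is exactly the kind of unstated assumption the checklist is meant to surface.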

Obviously, there are many more considerations that are critical to identify and address for successful disaster recovery, but hopefully this tip helped to point you in the right direction.
