In the previous two posts we presented the case for business continuity and how to use business impact analysis to get started. In this third and final installment we’ll introduce some concepts around resilient design and disaster recovery. In practice, resilient design and disaster recovery are highly customized to each business or organization and more technical than conceptual so I will try to keep this high level.
Disaster recovery (DR) includes the technologies, concepts and practices in place to ensure an organization’s survivability should it face a loss of data or services. DR can be as simple as a nightly data backup or a more complex multi-site datacenter design and failover database clusters. At the business level, the application of DR depends largely on the cost of an interruption versus the cost of recovery. For example, a “mom & pop” trucking firm with one office and a desktop-installed management application may only need to take a nightly backup of their data, knowing if there’s an outage they can restore to another computer in 2-3 hours and their drivers know their routes and assignments 12 hours in advance. A national trucking firm with hundreds of drivers, live shipment and vehicle tracking that transports time-sensitive or perishable goods may not be able to sustain an outage of 2-3 hours and would approach DR quite differently.
Resilience is the ability to withstand threats to availability and recover readily. Effective DR starts with resilient design. Resilience is commonly measured in uptime. For example, uptime of 99.99 percent means the system or application is unavailable for no more than about 53 minutes per year. An uptime of 95 percent permits about 18 days of downtime each year and an uptime of 90 percent allows for about 36 days of downtime annually. It’s costly and impractical to design every system for 99.99 percent uptime, so we look back at our business impact analysis (BIA) for cues. The BIA should provide the actual cost of downtime in dollars, and this figure can then be compared with the cost of building a resilient information system. (Note: if your organization doesn’t operate 24x7x365, consider adjusting your uptime calculations to include only its business hours. It’s possible that a business operating for 8 out of 24 hours could use the other 16 hours to recover. Conversely, there may be other organizations that can’t afford any downtime in their 8-hour business day.)
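The uptime arithmetic above is easy to reproduce. The following is a minimal Python sketch, not a sizing tool: the function names are made up for illustration, and the 8-hour, 260-day business calendar in the last lines is an assumed example rather than a figure from the BIA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           hours_in_scope: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime per year that an uptime target permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return hours_in_scope * (1 - uptime_percent / 100)


def business_hours_per_year(hours_per_day: float, days_per_year: int) -> float:
    """Hours per year during which the system actually has to be available."""
    return hours_per_day * days_per_year


# Round-the-clock operation: the three uptime levels discussed above.
for target in (99.99, 95.0, 90.0):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime: {hours * 60:.0f} minutes, or {hours / 24:.2f} days, per year")

# Assumed example: an office open 8 hours a day, 260 days a year.
office_hours = business_hours_per_year(8, 260)
office_downtime = allowed_downtime_hours(99.99, hours_in_scope=office_hours)
print(f"99.99% of {office_hours:.0f} business hours: {office_downtime * 60:.1f} minutes per year")
```

Passing a smaller `hours_in_scope` shows how much tighter the same percentage becomes when only business hours count toward the target.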
The next step is to design your environment according to criticality and risk tolerance. I’ve worked with a few organizations whose stated strategy is to take daily backups and buy new equipment to restore to when a disaster occurs, which represents a high-risk-tolerance approach. On the other end of the spectrum are organizations whose executives demand 99.99 percent uptime for all applications but experience sticker shock at the actual cost of achieving that objective. The best approach is one that is risk-informed and employs two to three tiers of criticality, such as critical, essential and necessary, where each tier has its own service level. Typically, the tier designations will also inform which systems are brought online first in a DR event.
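One way to make the tiers actionable is to record them next to each system and derive the recovery order from them. The sketch below is only an illustration in Python: the system names are hypothetical, and the uptime targets are the ones this article associates with each tier.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means higher criticality, so it is recovered earlier."""
    CRITICAL = 1
    ESSENTIAL = 2
    NECESSARY = 3


# Service level per tier, expressed as an uptime percentage.
UPTIME_TARGET = {Tier.CRITICAL: 99.99, Tier.ESSENTIAL: 95.0, Tier.NECESSARY: 90.0}


@dataclass(frozen=True)
class System:
    name: str
    tier: Tier


def recovery_order(systems: list[System]) -> list[System]:
    """Sort systems so the most critical tier comes first.

    sorted() is stable, so systems in the same tier keep their input order.
    """
    return sorted(systems, key=lambda s: s.tier)


# Hypothetical inventory, for illustration only.
inventory = [
    System("intranet wiki", Tier.NECESSARY),
    System("shipment tracking", Tier.CRITICAL),
    System("payroll", Tier.ESSENTIAL),
    System("dispatch database", Tier.CRITICAL),
]

for system in recovery_order(inventory):
    print(f"{system.tier.name:<10} {UPTIME_TARGET[system.tier]:>6}%  {system.name}")
```

A real recovery sequence would also account for dependencies between systems; this sketch orders by tier alone.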
The design of the environment will determine selection of resilient practices and technologies and where they get applied. Some common examples of resilient design approaches are included in the table below.
Area | Critical (99.99%) | Essential (95%) | Necessary (90%)
Network | Redundant Switching, Redundant Firewalls, Redundant Telecom Providers | Spare Unconfigured Hardware On-hand or Ready-to-order | |
System | Virtualized Clusters, RAID 6, 10 with Hot spares | Spare Configured Hardware (Hot), RAID 5 | RAID 1 With Spare Disk Onsite or Ready-to-order |
Environment | Raised Floor, Redundant Enterprise HVAC, and Emergency Maintenance Contract | Enterprise HVAC, Emergency Maintenance Contract | Box Fan and Stick to Prop the Door Open |
Power | Dual Power Supplies (Dual PDU Configuration), Enterprise UPS, Standby Generator, Emergency Fuel Contract | Dual Power Supplies (Single PDU), In-rack UPS, Standby Generator | Spare Power Supplies, In-rack UPS, Automated Emergency System Shutdown |
Database | Active-Active Cluster, Transaction Log Backups | Active-Passive Cluster, Daily Backups | Daily Database Backups |
Cloud | Redundant Cloud Providers | Multi-cloud Design With One Provider | Run Application Locally |
Datacenter | Hot Site | Warm Site | Cold Site |
Backup | Snapshots Every Four Hours, Near Realtime Replication Offsite* | Daily Snapshots and Nightly Backup Stored Offsite | Nightly Full Backups Stored Onsite |
Testing | Semi-annual Failover Testing | Annual Walkthroughs | Ad Hoc Testing |
*Replication is not the same as backup. A backup, whether a nightly full backup or a multi-event synthetic full backup, represents a point in time. Replication is near real-time, so if the data is corrupted or encrypted by ransomware, the damaged data is simply replicated as well.
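The difference can be shown with a toy model. The Python sketch below is purely illustrative: `ToyStore` is a made-up class that stands in for a primary system, its replica and its backups, and it does not model any real replication or backup product.

```python
import copy


class ToyStore:
    """A primary data set with a live replica and point-in-time backups."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.replica: dict[str, str] = {}
        self.backups: dict[str, dict[str, str]] = {}

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value
        # Replication copies every change as it happens, good or bad.
        self.replica[key] = value

    def take_backup(self, label: str) -> None:
        # A backup freezes the data as it was at this moment.
        self.backups[label] = copy.deepcopy(self.primary)

    def restore(self, label: str) -> None:
        # Restoring rolls both copies back to the chosen point in time.
        snapshot = self.backups[label]
        self.primary = copy.deepcopy(snapshot)
        self.replica = copy.deepcopy(snapshot)


store = ToyStore()
store.write("invoice-001", "amount due: 1,200")
store.take_backup("monday-night")

# Simulated ransomware overwrites the record; the replica follows along.
store.write("invoice-001", "ENCRYPTED")
print("replica after attack:", store.replica["invoice-001"])

# Only the point-in-time backup still holds the clean data.
store.restore("monday-night")
print("primary after restore:", store.primary["invoice-001"])
```

Anything written after the backup was taken is lost on restore, which is why backup frequency matters alongside replication.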
When selecting solutions, ensure alignment with the underlying technologies. A highly virtualized enterprise will approach this differently from a non-virtualized one by focusing on capturing virtual machine states rather than hardware-level configurations. An enterprise that’s cloud based (IaaS, SaaS, PaaS) will approach this very differently again. Keep in mind, the cloud is just someone else’s datacenter. There is some resilience built into the default design from the providers, usually at the single datacenter level, but little more. Looking for a bi-coastal cloud datacenter with redundant network paths and multi-master database replication? That will cost you extra. So too will daily backups of your files in common SaaS applications like Microsoft OneDrive and Outlook. Refer to your cloud provider’s shared responsibility matrix if you aren’t sure what you may be responsible for.
Once you have deployed your resilient design, it’s time to create and thoroughly document your disaster recovery plan (DRP). There is no one-size-fits-all approach to a DRP, though there are some freely available templates to help ensure you cover the necessary areas. These often include determining:
- What conditions must be met to declare a disaster?
- Who within the organization can declare a disaster?
- The scope of the DRP (Corporate HQ, remote office, etc.)
- Statement of application tier levels and their corresponding recovery time and recovery point objectives
- Contact information for key stakeholders
- Organizational roles and responsibilities
- Legal or regulatory requirements for notification
- Internal and external communication plan and templated messages
- Plan for out-of-band messaging if the primary communication platform is unavailable
- The locations where copies of the DRP will be stored
- How often the DRP should be tested and what type of testing should be conducted
The main DRP document should be relatively high level and non-technical. It’s often necessary to break out detailed tasks into area-specific playbooks. An effective DRP will include playbooks for operations, legal/compliance, marketing, human resources, and of course, IT. Depending on the size of the department, these could be further broken down into sub-teams such as infrastructure, software development, networking and user experience, with the latter verifying the recovery steps from the end user’s perspective. The playbooks should be sufficiently detailed that another member of the team could complete the tasks if the author or primary subject matter expert is unavailable. The playbooks should also include the expected timing of each task, including when it should begin, its dependencies and when it is expected to be completed. Together, the main DRP and playbooks should fit within the recovery time objectives established in the BIA.
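Task timings and dependencies can be checked against a recovery time objective with a short script. The following Python sketch assumes each playbook task has an estimated duration in minutes and a list of prerequisite tasks; the task names, durations and the 180-minute objective are hypothetical values chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def earliest_finish(tasks: dict[str, tuple[float, list[str]]]) -> dict[str, float]:
    """Return the earliest finish time, in minutes, for every task.

    `tasks` maps a task name to (duration_minutes, prerequisite_names).
    A task starts as soon as all of its prerequisites are done, and every
    prerequisite must itself be a key in `tasks`.
    """
    graph = {name: deps for name, (_, deps) in tasks.items()}
    finish: dict[str, float] = {}
    # static_order() yields each task only after its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        duration, deps = tasks[name]
        start = max((finish[d] for d in deps), default=0.0)
        finish[name] = start + duration
    return finish


# Hypothetical IT playbook, for illustration only.
playbook = {
    "restore network": (30, []),
    "restore database": (90, ["restore network"]),
    "start application": (20, ["restore database"]),
    "verify user login": (15, ["start application"]),
    "notify customers": (10, ["restore network"]),
}
RTO_MINUTES = 180  # assumed recovery time objective

schedule = earliest_finish(playbook)
total = max(schedule.values())
for name, done in sorted(schedule.items(), key=lambda item: item[1]):
    print(f"{done:>5.0f} min  {name}")
print("within RTO" if total <= RTO_MINUTES else "exceeds RTO", f"(total {total:.0f} min)")
```

The sketch assumes enough staff to run independent tasks in parallel; if one person performs every step, the durations add up instead.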
The last item, DRP testing, is one of the most overlooked aspects of disaster recovery planning. Regular disaster recovery testing ensures the plan is realistic, actionable, current and can be completed within the expected timeframe. Testing can include failover to a backup site, a tabletop reading with stakeholders and everything in between. The decision about how often to test and which kind of test to conduct depends on the organization’s risk tolerance and culture. Each test should be documented carefully by a non-participant and the results included in a lessons learned report. Any changes to roles, responsibilities, locations, technologies or timing should be incorporated into the DRP and tested at the next opportunity.
Business continuity planning and its subset, disaster recovery planning, are broad but essential topics for every enterprise or organization. No effective DRP is created overnight so it’s best to establish a realistic time frame to get one in place and work on it steadily over time. The most effective plans include a cross section of participants from within the organization, not just within IT. Visible support from senior leadership can go a long way to reinforce the importance of disaster recovery planning.
Brian Kautz contributed to this article.