This is the second in a three-part series on the importance and application of business continuity and disaster recovery within an organization.
Business continuity is the lifeline that ensures your organization’s survival when, not if, an operational interruption occurs. To some this may sound alarmist, but according to the IDC Worldwide State of Data Protection and Disaster Recovery Survey, January 2022, the cost of business downtime averaged $250,000 per hour (across all industries and organizational sizes). In addition to steep financial costs, reputational impacts erode customer confidence and trust in the product or service.
Business interruptions come in many forms. The IDC research report showed that software failure (56%) and hardware failure (47%) were the top two reasons leading to a disaster recovery response. Most technology-related business interruptions have simple causes such as device misconfiguration or unintended software deployments. But as businesses increasingly move toward cloud-based services, supplier-related downtime is becoming more common.
In June, automotive dealerships across the United States were crippled by the loss of production systems run by software-as-a-service (SaaS) provider CDK. With one breach, the attackers idled 15,000 dealerships for several weeks, halting most sales, service and parts operations. According to the Detroit Free Press, that outage cost dealerships $1.02 billion in lost business.
In July, the world was rocked by a software error from the cyber-defense company CrowdStrike. That event crippled many Windows computers in use at businesses, including airlines. As of this writing, the economic impact of those business interruptions is unknown, but is expected to be substantial.
The first step in developing a business continuity plan (BCP) is to conduct a business impact assessment (BIA). A BIA identifies critical business operations and sheds light on how long they can withstand an interruption before the organization’s future existence is in question.
The BIA should be led by managers within the organization who are familiar with its overall function in addition to their own focus areas (manufacturing, customer service, legal, HR, IT). The leader does not have to be from within IT, however. The exercise is most effective when it involves contributors from across the organization.
Begin the BIA by considering the high-level business functions of your organization.
Examples of high-level business functions:
- Health & Safety Functions
- Legal & Compliance Functions
- Revenue-generating Functions
- Locations where the above functions are conducted
Next, determine the information-related functions within those areas that are most critical to the organization’s existence.
Examples of critical information functions:
- Service Delivery and/or Production
- Accounts Payable
- Accounts Receivable
- Employee Payroll
- Legal and/or Privacy Compliance
Next, for each of those processes, ask how long the underlying process or information can remain inaccessible before the business is at risk. Is there a practical workaround or manual process that can be employed if the information system is offline? In many organizations, these information processes are complex and distributed. It may take some time to identify all the people and systems involved, but it is essential to get this information.
Examples of critical business information process interruption questions:
- Service Delivery and/or Production Operations – Is there sufficient inventory to support a sustained outage and, if so, are logistics processes operational to deliver those products? Is service delivery critical to health and safety, or does it involve time-sensitive or perishable goods?
- Accounts Payable – How long will your vendors continue providing services without being paid?
- Accounts Receivable – Is there sufficient cash on hand if you are unable to apply payments electronically?
- Employee Payroll – How long will your employees continue to work without being paid?
- Legal and/or Privacy Compliance – Are there legal, regulatory or contractual requirements for notification?
Consider using a table like the one below (example data shown) and customize it to your own organization by including the location where the information process is conducted and which role within the organization manages the process. You can also consider assigning monetary values to the loss of services and, if available, adding information about the workaround and how long you can operate in that alternative mode.
| Information Process Name | Internal or Outsourced? | Workaround Exists? | Maximum Allowable Downtime |
| --- | --- | --- | --- |
| Service Delivery and/or Production Operations | Internal | No | 24 hours |
| Accounts Payable | Outsourced | Not sure | 72 hours |
| Accounts Receivable | Outsourced | Not sure | 72 hours |
| Employee Payroll | Outsourced | Yes, paper checks | Dependent on payroll schedule |
| Legal and/or Privacy Compliance | Internal | Yes, paper documentation, fax | |
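If the BIA table grows beyond a handful of rows, it can help to keep it as structured data so it can be sorted and checked for gaps. The following is a minimal sketch in Python, not a prescribed tool: the `BiaEntry` class, its field names and the two helper functions are hypothetical, and the rows simply mirror the example data in the table. A downtime of `None` stands for a value that has not yet been stated in hours.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BiaEntry:
    """One row of the BIA table (field names are illustrative assumptions)."""
    process: str
    outsourced: bool
    workaround: str                     # "No", "Not sure", or a description
    max_downtime_hours: Optional[int]   # None = not yet quantified in hours


# Example rows copied from the table above; they are sample data only.
REGISTER = [
    BiaEntry("Service Delivery and/or Production Operations", False, "No", 24),
    BiaEntry("Accounts Payable", True, "Not sure", 72),
    BiaEntry("Accounts Receivable", True, "Not sure", 72),
    BiaEntry("Employee Payroll", True, "Yes, paper checks", None),
    BiaEntry("Legal and/or Privacy Compliance", False,
             "Yes, paper documentation, fax", None),
]


def recovery_priority(entries: List[BiaEntry]) -> List[BiaEntry]:
    """Entries with a known limit, shortest allowable downtime first."""
    known = [e for e in entries if e.max_downtime_hours is not None]
    return sorted(known, key=lambda e: e.max_downtime_hours)


def needs_follow_up(entries: List[BiaEntry]) -> List[str]:
    """Processes with an unknown workaround or an unquantified downtime."""
    return [
        e.process
        for e in entries
        if e.workaround == "Not sure" or e.max_downtime_hours is None
    ]


if __name__ == "__main__":
    for entry in recovery_priority(REGISTER):
        print(f"{entry.max_downtime_hours:>3} h  {entry.process}")
    print("Follow up on:", ", ".join(needs_follow_up(REGISTER)))
```

Sorting by allowable downtime gives a first-pass recovery order, while the follow-up list highlights the "Not sure" and unquantified cells that the BIA team still has to resolve.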
As you see above, accounts payable and accounts receivable are both outsourced to a third party, such as a cloud service provider (CSP). CSPs and their subscribers (customers) operate in a shared responsibility model. This means that subscriber responsibilities vary depending on the type of service being delivered and consumed: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) or Software as a Service (SaaS). It’s essential to understand what your CSP will and will not do for you in the event of an interruption on your side or theirs. In the case of CDK, many dealerships had not conducted a BIA to consider the impacts of an outage. Often, subscribers don’t think critically about a CSP outage since CSPs are often large, amorphous enterprises. However, the cloud is just someone else’s datacenter.
Now that you’ve charted out each critical information process in your organization, and its associated details, consider which people, processes and technologies are needed to sustain each process during a business interruption. A small manufacturing organization would need to consider how to respond if the PC that ran their computerized cutting machine crashed. They might consider having a spare PC on hand (technology) and assigning a technician (people) the task of restoring a current backup of the configuration (process). They would also need to consider whether they must communicate the outage to customers, whether there are related legal or contractual notification requirements and how this would be achieved.
As a second example, assume that at some point one of your facilities will lose utility power due to weather or another cause. Depending on the criticality of the site, you can choose to install a generator (technology), close for the rest of the day (people) or request that employees work remotely (process and technology). The choice you make depends on the criticality of the process, the type of work being done (and where), the level of security required, and the cost of the lost production minus the cost of the continuity response.
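The cost comparison in the power-outage example can be worked through with simple arithmetic. The sketch below is one possible way to frame it, under stated assumptions: the `ResponseOption` class, the "fraction of production preserved" idea and every dollar figure are hypothetical placeholders, not data from any survey or real site. Security and criticality, which also drive the decision, are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseOption:
    """A candidate continuity response (all values are hypothetical)."""
    name: str
    cost: float                  # cost of the response for one outage
    production_preserved: float  # share of normal output kept, 0.0 to 1.0


def net_benefit(option: ResponseOption, loss_per_hour: float,
                outage_hours: float) -> float:
    """Lost production avoided by the response, minus the response cost."""
    avoided_loss = loss_per_hour * outage_hours * option.production_preserved
    return avoided_loss - option.cost


def best_option(options: List[ResponseOption], loss_per_hour: float,
                outage_hours: float) -> ResponseOption:
    """Pick the option with the highest net benefit for this outage."""
    return max(options,
               key=lambda o: net_benefit(o, loss_per_hour, outage_hours))


# Placeholder figures for the three choices named in the text.
OPTIONS = [
    ResponseOption("install a generator", cost=4000.0, production_preserved=1.0),
    ResponseOption("work remotely", cost=500.0, production_preserved=0.6),
    ResponseOption("close for the day", cost=0.0, production_preserved=0.0),
]

if __name__ == "__main__":
    for hourly_loss in (100.0, 1000.0, 5000.0):
        choice = best_option(OPTIONS, hourly_loss, outage_hours=8)
        print(f"${hourly_loss:,.0f}/hour lost -> {choice.name}")
```

With these made-up numbers, a low-value site is best served by closing, a mid-value site by remote work and a high-value site by a generator, which illustrates why the answer differs from one facility to the next.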
The previous examples are somewhat simple, but the approach scales to organizations of all sizes. What if there is an interruption to the CSP whose services you consume? The CDK incident is just one of many; big names like AWS, Google and Microsoft have all experienced outages of varying duration. Some organizations go to great lengths to ensure availability by using multiple providers, though this is costly and technically challenging. Other organizations may have a backup manual/paper process in place, and some may just decide they can’t operate without their CSP. In any event, it’s essential to have a communication plan in place that includes employees and customers. The plan should include an out-of-band method of communication that doesn’t rely on the downed system. Organizations often use old-school call trees or third-party communication platforms for this purpose. (Note: It’s not difficult to imagine a scenario where an organization attempts to use a third-party notification service to communicate a CSP outage but finds that the service is impacted by the same CSP. Choose your CSP wisely!)
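The shared-provider pitfall in the note above can be checked ahead of time if you record which providers each system and communication channel relies on. The following is a rough sketch, assuming you maintain such a dependency map yourself; the system names, provider labels and the `out_of_band_channels` function are all invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each system or channel and the providers
# it relies on. Real entries would come from your own BIA and vendor reviews.
DEPENDS_ON: Dict[str, Set[str]] = {
    "order system": {"cloud-a"},
    "notification service": {"cloud-a", "sms-gateway"},
    "call tree": {"phone network"},
}


def out_of_band_channels(failed_system: str, channels: List[str],
                         depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return the channels that share no provider with the failed system."""
    at_risk = depends_on[failed_system]
    return [c for c in channels if not (depends_on[c] & at_risk)]


if __name__ == "__main__":
    usable = out_of_band_channels(
        "order system", ["notification service", "call tree"], DEPENDS_ON
    )
    print("Channels independent of the outage:", usable)
```

In this made-up map the notification service runs on the same cloud provider as the order system, so only the call tree qualifies as out-of-band for that outage.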
Once the business continuity plan is documented, it must be tested periodically. This ensures that the plan is still effective and that everyone involved knows their responsibilities. Testing can include a group tabletop reading of the plan, an immersive live exercise and everything in between.
As you go through this exercise, avoid focusing on black swan events like, “a meteor strikes the datacenter.” A black swan event is one of high impact that was difficult to predict under normal conditions, but in hindsight seems more plausible. Starting business continuity planning with hyperbolic thinking wastes time and can erode confidence in the process. There is a wealth of real-world threat modeling data available from government and industry groups that can provide quantifiable data on the likelihood and impact of various events. Is your facility near a busy rail line transporting hazardous materials that could experience a derailment and chemical spill? Does your office sit near an active fault line subject to regular tremors endangering life and safety? Are you subject to a software monoculture, reliant on several critical third-party software programs to deliver your own services? Does your organization have weak or non-existent background check processes that could allow employee thefts?
Your business continuity planning experience will likely be more nuanced and tailored to the threats and losses unique to your environment. The output of your BIA exercise should list at least the services critical to your organization’s existence, potential threats to those services and workarounds, if any. This information will be essential for the next step, disaster recovery planning and resilient design principles, which we will cover next month.
Brian Kautz contributed to this article.