Plan to fail — how a DR solution is an essential business requirement for any organisation.

Published in

AI+ Enterprise Engineering

7 min read · Jun 28, 2023

In many organisations, including financial services companies and government entities, we see a desire for everything to fail over at the slightest outage, and for the entire environment to run an active/active/active architecture with little or no downtime. However, such architectures can be costly and complex, and may still not meet the service availability requirements. It is also easy for I.T. and the business to confuse their differing resilience requirements. The purpose of this blog is to explain what Disaster Recovery (DR) is, where it sits in an overall Business Continuity Plan, and some of the key considerations that should be taken into account in a DR plan.

There are many definitions of Disaster Recovery, but they all centre on the following:

Disaster Recovery is the part of a Business Continuity Plan that focuses on an organisation's ability to respond to and recover from an event that negatively affects critical business operations. A Disaster Recovery plan employs policies, tools and procedures that are enacted should a disaster occur. An event is a physical, technical or man-made disaster that affects the normal running of the business.

This definition has three key elements.

· Business Continuity Plan: A DR plan should be part of the overall Business Continuity Plan (BCP) of an organisation. A BCP provides the goals, processes and procedures that support a company and its strategy when the unexpected happens. A BCP can cover everything from key infrastructure, staffing plans and communications strategy to security and availability plans, amongst others. You cannot and should not have a DR plan without a BCP.

· Affects Critical Business Operations: Not all services should be treated the same. Some services are critical to the running of an organisation and some can tolerate downtime. Knowing which applications are critical to the business is key to defining which applications are part of the DR strategy and which are not. If you want a DR solution for every application, you are moving towards a Highly Available solution. That may be what the requirements actually call for, but it comes at a different cost and complexity from a DR solution.

· Disaster that affects the normal running of the business: Ideally, you will never have to enact your disaster recovery plan. But as the saying goes, failing to plan is planning to fail. With increasing internal and external cyber security threats, solutions distributed across multiple environments including on-premises and public Clouds, ever-changing environmental events and factors, and the reality that mistakes sometimes happen, a DR plan is essential to the continued running of your core business functions.

It is also important to show what a DR plan is not. When working through a client's business requirements for application availability in a disaster, it is easy to mix DR requirements with backup requirements, HA requirements or even Cyber Resilience requirements.

What a DR plan is not

· It is not a backup plan. Clients should have a clear backup strategy for their data based on the application and business requirements of that data. Backups should be taken of the services based on the service requirements and any regulatory or industry requirements for frequency of backup.

· It is not a High Availability solution. A DR site may only provide a solution for critical applications, and even these may run as a degraded service compared with the normal production environment. A DR environment may have an RTO (recovery time objective) and RPO (recovery point objective) of minutes, hours or even days, whereas an HA solution will have an effective RPO and RTO of zero.

· It is not a Fault Tolerant solution. A DR plan may involve some downtime based on the documented RTOs and RPOs, whereas the aim of a Fault Tolerant solution is to continue operating without interruption. That requires a very specific type of Active / Active / Active architecture, which should be reserved for only the most critical of applications.

· It is not a Cyber Resilience solution. The two are linked, and a security attack may trigger the DR plan, but Cyber Resilience is the broader ability to anticipate, withstand, recover from, and adapt to adverse conditions, stresses, attacks, or compromises on cyber resources.

It is safe to say that there is no single Disaster Recovery strategy for an entire environment. Business requirements differ depending on what each service needs. A customer-facing service that is used by tens of millions of people every day must be treated very differently from an internal application that is not necessary for the business to function. It is therefore key to group the applications by business criticality, with each group backed by documented RTO and RPO times. Infrastructure teams may have perceived RPO and RTO times, but it is critical that the business, application and infrastructure teams agree on the RTO and RPO for those business applications. This collaboration across teams is necessary to ensure a clear understanding of what is part of the DR plan, what is not, and who has responsibility for each element if a disaster happens.
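As a minimal sketch of the grouping described above, the following Python records each service with its agreed objectives alongside what the infrastructure team assumes it can deliver, then flags disagreements. The service names, tier labels and all timings are hypothetical values chosen for illustration, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Service:
    name: str
    tier: str              # hypothetical criticality label agreed with the business
    agreed_rto: timedelta  # RTO agreed by business, application and infrastructure teams
    agreed_rpo: timedelta  # RPO agreed by the same teams
    infra_rto: timedelta   # RTO the infrastructure team assumes it can deliver
    infra_rpo: timedelta   # RPO the infrastructure team assumes it can deliver


def group_by_tier(services):
    """Group service names by their business criticality tier."""
    groups = {}
    for s in services:
        groups.setdefault(s.tier, []).append(s.name)
    return groups


def find_mismatches(services):
    """Return services whose assumed capability is worse than the agreed objective."""
    return [
        s.name
        for s in services
        if s.infra_rto > s.agreed_rto or s.infra_rpo > s.agreed_rpo
    ]


# Illustrative data only: every name and number here is an assumption.
services = [
    Service("payments", "critical", timedelta(minutes=15), timedelta(minutes=5),
            timedelta(hours=1), timedelta(minutes=5)),
    Service("claims", "important", timedelta(hours=4), timedelta(hours=1),
            timedelta(hours=2), timedelta(hours=1)),
    Service("intranet-wiki", "deferrable", timedelta(days=2), timedelta(days=1),
            timedelta(days=1), timedelta(days=1)),
]

print(group_by_tier(services))
print(find_mismatches(services))
```

A service that appears in the mismatch list is one where the teams have not yet reached a shared, achievable objective, which is exactly the conversation the DR plan needs to force.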

Not all clients in the same industry have the same DR plan. Different organisations have different risk tolerance levels based on their business decisions and risk appetites. This level of risk may be shaped by specific industry or government regulations that demand certain risks are addressed. One example is in the United Kingdom, where the PRA (Prudential Regulation Authority) and the FCA (Financial Conduct Authority) have outlined minimum resilience standards for Critical Third-Party companies such as public Cloud providers, so that if one provider were to fail, there is a disaster recovery plan that is documented and tested. Another example of risk tolerance driven by regulatory standards is in India, where we have seen some industries request that their DR site is in a different seismic zone within India. Choosing a DR site in this case means balancing trade-offs between RPO and RTO and factors such as latency between the primary and DR sites. (You can read more about which cities are in which zones here: https://pib.gov.in/PressReleasePage.aspx?PRID=1740656). Finally, it is important that the Disaster Recovery plan meets the stated business requirements of the individual organisation and is not a copy and paste from previous experience or from generic industry standards.
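The site selection trade-off can be sketched as a simple filter: exclude candidates in the same seismic zone as the primary site, exclude those whose latency is too high for the replication needed to meet the RPO, then pick the closest remaining site. The site names, zone labels, latencies and the latency ceiling below are all hypothetical and do not describe real locations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    seismic_zone: str  # hypothetical zone label
    latency_ms: float  # hypothetical round-trip latency to the primary site


def choose_dr_site(primary_zone, candidates, max_latency_ms):
    """Pick the lowest-latency site outside the primary's seismic zone, or None."""
    eligible = [
        c for c in candidates
        if c.seismic_zone != primary_zone and c.latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda c: c.latency_ms, default=None)


# Illustrative data only: every name and number here is an assumption.
candidates = [
    Site("site-a", "III", 4.0),   # same zone as the primary, so excluded
    Site("site-b", "II", 18.0),
    Site("site-c", "IV", 35.0),   # different zone, but too far away
]

chosen = choose_dr_site("III", candidates, max_latency_ms=25.0)
print(chosen.name if chosen else "no eligible site")
```

When no site passes both filters, the function returns None, which signals that either the latency ceiling (and therefore the RPO) or the zone requirement has to be renegotiated with the business.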

Although no specific architecture can be prescribed for all clients or industries, there are some key elements and considerations that should be included in every Disaster Recovery plan. They are the following:

1. Conduct a risk assessment to identify potential disasters and their impact. Different disasters may call for different requirements and actions, so a plan needs to be developed for each. Be mindful that manual steps may be required as part of that disaster recovery plan. For example, what if all credit card readers are unavailable in a retail setting? Is your business prepared to take payments in a different way? Or, in an insurance setting, how can you process claims if a claims system is down for more than a day? All these elements need to be considered when creating the risk assessment.

2. Determine critical business functions and how long they can be offline before considerable damage is done.

3. Categorize those services that are critical to the continued running of the business and determine what level of tolerance there is to downtime. Do not designate all services as critical, as this will result in a very costly DR solution.

4. Set realistic and achievable RTOs for critical systems and data. Many clients have desired RTOs, but sometimes these may not be possible without significant investment. Some applications that support critical services, but are not themselves deemed mission critical, may require the same RTOs.

5. Establish realistic and achievable RPOs to prevent data loss and ensure up-to-date information. Ensure that these are clearly documented and that they are tested to ensure that they meet requirements.

6. Ensure that any investment or costs associated with the DR plan are fully explainable, based on best practice and shared and agreed with by all senior stakeholders.

7. Ensure that all disaster recovery processes and procedures are clearly documented and that these are regularly tested and updated so that the business is constantly up to date with the latest threats.

8. Establish a communication plan that outlines how stakeholders will be notified and how employees should communicate during the recovery process.

9. Ensure that there is no break in the disaster recovery plan by only working with third parties that supply software, support or services and that have detailed, documented DR plans aligned to your applications' requirements.

10. Employ a rigorous review and testing cycle for your disaster recovery plan so that it is updated as new physical, technical or man-made threats emerge.
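The testing called for in points 5, 7 and 10 can be made concrete by measuring each DR drill against the documented objectives. The sketch below assumes a drill records three timestamps per service; the field names, the service name and all the values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    service: str
    outage_start: datetime         # when the simulated disaster began
    service_restored: datetime     # when the service was usable again at the DR site
    last_recovered_data: datetime  # timestamp of the newest data present after recovery


def evaluate(result, rto, rpo):
    """Compare the recovery time and data loss seen in a drill with the documented RTO and RPO."""
    achieved_rto = result.service_restored - result.outage_start
    achieved_rpo = result.outage_start - result.last_recovered_data
    return {
        "service": result.service,
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


# Illustrative drill only: every name and number here is an assumption.
drill = DrillResult(
    service="payments",
    outage_start=datetime(2023, 6, 1, 10, 0),
    service_restored=datetime(2023, 6, 1, 10, 40),
    last_recovered_data=datetime(2023, 6, 1, 9, 57),
)

report = evaluate(drill, rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(report)
```

In this made-up drill the data loss window is within the RPO but the recovery took longer than the RTO, so either the procedure needs to improve or the documented RTO needs to be revisited with the business.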

Conclusion: no one size fits all. Hopefully, a DR plan will never have to be used. However, constant testing, updates and reviews are needed to ensure that when the unexpected happens, the systems designated as part of your DR plan follow the documented processes and procedures and deliver the level of functionality expected from your DR solution. Failure of your DR plan can lead to reputational damage, fines and loss of revenue. Do not treat a DR plan as just a checkmark in your overall Business Continuity Plan. It is one of the most crucial elements in the running of any organisation.

Written by Johnome