Large Multi-Site Disaster Recovery Protection Topology

Disaster recovery for virtual infrastructure is a mandatory consideration when planning and designing any modern data center. For the simple model, where one large data center is protected by a remote recovery site, everything is fairly clear. For large topologies with many local branches distributed across territorial units such as states, countries or even continents, multi-site protection can be quite an interesting challenge.

In case you haven't read my previous article, where I briefly touched on the disaster recovery topic, I would like to remind you that:

If an entire physical site is lost due to a natural disaster, a man-made catastrophe or other force majeure, data protection via replication to other availability zones should be considered and carefully designed. Data protection can be designed either with zero RTO and RPO, or with a strictly defined acceptable data loss and a clearly predictable recovery time. With today's storage platforms, the first option (zero RPO) can be achieved via synchronous replication and the second via asynchronous replication, where written data arrives at the remote site with a small delay and losing the data delta accumulated during that delay is acceptable.
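To make the asynchronous case a bit more tangible, here is a minimal sketch of how the data loss window could be estimated. It assumes snapshot-based replication on a fixed schedule, which is my simplification and not a description of any particular product; the function name and its parameters are hypothetical.

```python
def worst_case_rpo_seconds(snapshot_interval_s, delta_bytes, link_bytes_per_s):
    """Estimate the worst-case data loss window for asynchronous replication.

    Simplified model (an assumption): a snapshot is taken every
    `snapshot_interval_s` seconds and its delta of `delta_bytes` is then
    shipped over a link with `link_bytes_per_s` of usable throughput.
    Until a snapshot has fully arrived, the remote site still holds the
    previous one, so the window is one interval plus one transfer time.
    """
    if snapshot_interval_s < 0 or delta_bytes < 0:
        raise ValueError("interval and delta must not be negative")
    if link_bytes_per_s <= 0:
        raise ValueError("link throughput must be positive")
    transfer_s = delta_bytes / link_bytes_per_s
    return snapshot_interval_s + transfer_s


# Hourly snapshots, 9 GB of changed data, 12.5 MB/s of usable bandwidth.
rpo = worst_case_rpo_seconds(3600, 9_000_000_000, 12_500_000)
```

In this model a slower link or a larger delta directly widens the window, which is why replication over links with different latency and bandwidth needs its own calculation.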

For topologies with many branches, designing a business continuity solution can lead the architect to these key ideas:

  • instead of one global remote recovery site, which is sometimes impossible anyway due to latency or other constraints, many distributed, decentralized recovery sites may be considered as an alternative, where each one serves a bunch of several closely located protected sites;
  • since the protected resources in a bunch are split and distributed across territorial branches, the likelihood of several branches failing simultaneously tends to zero, and thus the amount of normally idle compute resources at the recovery site may be reduced to the equivalent of only one protected branch;
  • optimizations such as deduplication and compression may be considered for recovery sites to reduce the required storage capacity;
  • and my favorite point: we can declare an ultra critical level for services by protecting them twice across two different regions, once to the local recovery site inside the local bunch and once to the nearest neighboring bunch. These replications should be separate and, since they use connectivity with different latency, should support independent RPO calculations.
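The sizing ideas from the list above can be sketched roughly as follows. This is only an illustration under my own assumptions: the `Branch` fields, the `size_recovery_site` helper and the data reduction ratio are hypothetical, and a real design would add headroom and take the ratio from measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Branch:
    name: str
    vcpus: int
    ram_gb: int
    storage_tb: float


def size_recovery_site(branches, data_reduction_ratio=1.0):
    """Size one recovery site that serves a bunch of protected branches.

    Compute is sized for the largest single branch, assuming only one
    branch of the bunch fails at a time. Storage has to hold the replicas
    of every branch, reduced by an assumed dedup/compression ratio.
    """
    if not branches:
        raise ValueError("a bunch needs at least one branch")
    if data_reduction_ratio < 1.0:
        raise ValueError("data reduction ratio must be >= 1.0")
    return {
        "vcpus": max(b.vcpus for b in branches),
        "ram_gb": max(b.ram_gb for b in branches),
        "storage_tb": sum(b.storage_tb for b in branches) / data_reduction_ratio,
    }


bunch = [
    Branch("branch-a", vcpus=64, ram_gb=512, storage_tb=20.0),
    Branch("branch-b", vcpus=96, ram_gb=384, storage_tb=30.0),
    Branch("branch-c", vcpus=48, ram_gb=256, storage_tb=10.0),
]
sizing = size_recovery_site(bunch, data_reduction_ratio=2.0)
```

Note the asymmetry: idle compute shrinks to roughly one branch, while storage still grows with every protected branch, which is exactly where deduplication and compression pay off.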

To incorporate these core ideas when building an appropriate disaster recovery solution, one possible multi-site protection topology is circular protection:

Multi-site Disaster Recovery Circular Protection Topology
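As a rough sketch of how I read the circular idea: every bunch protects its branches on its own recovery site, and ultra critical services get a second copy on the next bunch in the ring. The ordering of the ring by proximity, the function name and the bunch names are my assumptions for illustration only.

```python
from typing import Dict, List


def circular_protection(bunches: List[str]) -> Dict[str, Dict[str, str]]:
    """Map each bunch to its local and secondary recovery targets.

    `bunches` is assumed to be ordered so that neighbors in the list are
    the nearest ones geographically. The last bunch wraps around to the
    first one, which closes the ring.
    """
    if len(bunches) < 2:
        raise ValueError("a ring needs at least two bunches")
    if len(set(bunches)) != len(bunches):
        raise ValueError("bunch names must be unique")
    plan = {}
    for i, bunch in enumerate(bunches):
        plan[bunch] = {
            "local": bunch,  # recovery site inside the same bunch
            "secondary": bunches[(i + 1) % len(bunches)],  # next bunch in the ring
        }
    return plan


plan = circular_protection(["north", "east", "south"])
```

With this mapping no single recovery site becomes a global target, and every bunch receives a secondary copy from exactly one neighbor.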

For virtual infrastructures today, the final design can leverage a wide variety of products available on the market to build all or just parts of the protection topology proposed above. Let me show you a few specifications as examples:

  • vSphere is used as the virtualization platform and Nutanix SDS as the storage, with asynchronous replication within a bunch. VMware Site Recovery Manager (SRM) can be used as the disaster recovery automation solution. One additional replication to the nearest bunch for ultra critical services is done with vSphere Replication and subsequent manual recovery, since SRM prevents protecting the same VM twice;
  • when vSphere and Hyper-V hypervisors are used in neighboring bunches, Zerto Virtual Replication can be used as the replication and disaster recovery automation solution.

I would like to say that within 2–3 years, the benefits of all these kinds of disaster recovery topologies will simply be built into the leading SDS products.