Business Continuity & Disaster Recovery in the Microservices world

Published in

Walmart Global Tech Blog

10 min readOct 31, 2019

Walmart offers grocery pickup and delivery in nearly every U.S. state, and many countries across the globe. The Integrated Fulfilment system at Walmart consists of apps and backend systems, and enable associates to fulfill for omnichannel e-commerce orders from stores globally.

In recent years, this system has seen tremendous growth in business :

reference: https://techcrunch.com/2019/08/13/walmart-tops-u-s-online-grocery-market-with-62-more-customers-than-next-nearest-rival/

To support this scale, we decided to modernize and re-architect the product.

One key requirement, while doing this, was business continuity. Any production issue in the system impacts our customers across the globe. This system *cannot* go down outside of *guaranteed timelines *.

Disaster Recovery (DR)

The fallacies of distributed computing are a set of observations made by L Peter Deutsch and others at describing false assumptions that people have about distributed systems. In the cloud world, the infrastructure stack is much more dense, and includes components which are out of our control. This means that some of these fallacies make themselves much more apparent.

When a cloud deployment goes down we need the system to continue serving customers from “another place”. Disaster Recovery (DR) is the design-construct which allows this set of services, and related infrastructure components ( like messaging brokers and databases) to be available from a different data center.

High-Level Architecture

Systems like these are typically designed as multiple microservices which collaborate with each others using messaging and API to achieve the desired behavior. Each service has a DB which it owns — thus enforcing isolation of concerns and clear contracts. For the purpose of this discussion, the high-level architecture is depicted below :

A few ‘frontend’ services take requests from apps, and then work with other ‘capability’ services to enable system use-cases. The term “Event Driven” is also applicable for these types of systems , here since each service is loosely coupled with others and only reacts to events (messages).

A specific deployment of this set of microsevices is called a ‘Ring’.

Replication

The first pattern for a DR solution is the availability of the data (database)in a remote datacenter. The naive /easiest way to do this is that the write to the DB in the main region writes to the remote region as well. To be specific, the DB write finishes when the local and remote DB have both persisted the write. The problems here are :

The writes are across WAN links, and these do not offer strict latency SLAs. Hence DB write times increase and non-guaranteed.
The system availability now has other components in the equation: the remote DB, the network link between the local and remote DBs. The composite system reliability is always less than each of the components, and this means that overall system reliability is lowered.

To avoid compromising performance and availability, the normal pattern employed is *asynchronous replication* i.e. the DB writes finish when the local DB commits, and the transactions are *shipped* to the remote DB. On the remote secondary site, the transactions are persisted asynchronously to what is happening in the primary DB.

For example, the following diagram and reference describe the Distributed Availability Groups (DAG) technology employed by SQL Server :

Reference: https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/distributed-availability-groups

One might notice that the messaging is not replicated — this is because it becomes very difficult for a transactional distributed system to stitch the replicated messaging state with the DB state. It’s much easier to use the DB to replay the messages — as described below.

Enter Microservices

In a microservices architecture, each service has it’s own set of DBs. All of them replicate independently of each other :

The problem here is that the overall system state itself is spread across multiple services, but the shipping of transactions by multiple microservices is not co-ordinated along ‘clean consistent lines’. Thus collective state at a specific ‘instant’ / snapshot in the remote set of replicated DBs may not be consistent — or usable when a disaster strikes!

To illustrate the problem, consider the following flow :

Here two DBs are being mutated, and as they are replicated in the remote datacenter, there are 4 possible states in which the two remote DBs can be in :

The first and the last scenarios are Ok ( yes even “losing data” is ok) since the whole system now is in a *last known consistent state *. We can then build mechanisms to reconcile. Much worse is actually the system being in an inconsistent state — scenarios 2,3.

There are 3 problems which can manifest in such cross-service transaction replications :

Broken References: The state in ServiceX refers to a state in CapabilityServiceP which is absent. Thus future interactions might break when the app(client) engineers a flow which forces ServiceX to make a call to CapabilityServiceP — and here the later has no reference to the entity the former is talking about
Split brain: Many times services like CapabilityServiceP offer a consolidated view of the whole system state. It does this by manifesting materialized views for different applications. During DR, it is possible that the materialized view has the data but the source-of-truth does not. This will cause issues like “I see the item on the search page, but when I go to the detail page I get a 404!”
Dangling References: This is the same phenomenon that is commonly described in memory management : the “parent” ServiceX does not have the reference to the object ( “OrderB”) , but the “child” CapabilityServiceP has records about “OrderB”. This is exacerbated by denormalization ( repeating information to avoid joins and increase performance). These dangling references can cause hard-to-debug issues if things are not coded properly. For example in the above example, if CapabilityServiceP is used to give an estimate of the number of orders, then that would be a wrong estimate.

This problem has been investigated previously , in the context of backup, as described in the IEEE paper titled “Consistent Disaster Recovery for Microservices: the BAC Theorem”. BAC stands for Backup-Availability-Consistency and is authored by Pardon and Cesare Pautasso and Olaf Zimmermann. Essentially it is a derivate of the famous CAP theorem and says :

“When backing up an entire microservices architecture, it is not possible to have both avail-ability and consistency.”

Reference : https://ieeexplore.ieee.org/document/8327550

Now that we see the problem, let’s see how we can solve it.

Reconciliation

The main design pattern to get the set of microsevices to last known consistent state is Reconciliation. Essentially, each service keeps an Entity Mutation Log (EML) on what changed along with the timestamp. The EML for the picking service would look like :

sample EML for picking

This type of construct is e also called as a Write-Ahead-LoG (WAL) and used for similar durability guarantees in databases like Cassandra.

Once we have this, on a failover to the remote site, a designated ‘bully’ microservice, replays the mutations in the last n minutes. This n is a tune-able and covers the *bounded staleness* guarantees that are offered by the DB replication technology. Replay in the case of a system described here usually means reconstructing and resending messages where the other services are listening. Each “non-bully”/ downstream service, consumes the relevant message and constructs its state.

Idempotency

A key requirement for reconciliation to work is idempotency, i.e. the services need to be able to handle duplicate messages without changing the final resultant state. This anyway becomes a de-facto requirement for message-driven microservices, since messaging systems (brokers) offer “At-least-once” semantics. Messages can get duplicated on the way to consumers due to various fringe conditions.

Note : some systems like Kafka advertise “exactly-once” semantics, but the fine print is that it’s only possible in very narrow architectures — but that is another medium blog :P ) .

High-Level Solution

So to summarize the discussion so far :

The set of microservices must be deployed in more than 1 place (‘cloud’). Let’s call each deployment a Ring.
Replication and Reconciliation is used along with other constructs like GSLB based load balancing and a Monitor to construct the final solution

The Monitor is a health monitoring and command solution. It has a 2-level architecture — consists of a Worker and Master architecture. The Worker sits in a Ring , and collects stats like CPU/memory utilization, API latencies, DB latencies, disk throughputs etc from the services in Ring. It does that via health check APIs advertised by each component and other monitoring beacons.

There are many frameworks like Spring Boot Actuators which can be used to templatize health checks for each service. This can be augmented by a small wrapper library which does health checks for common components like messaging, and DB. Such a library enables allows consistent metric reporting for the services — something like below :

{
  "app": {
    "state": "UP",
    "name": "servicex",
    "id": "8c13a2@servicex"
  },
  "db": {
    "status": "UP"
  },
  "broker": {
    "status": "UP"
  }
}

The Monitor worker uses these signals to make judgments on how healthy a service is. It then sends a summary along with a detailed report to the Monitor Master. The Monitor Master will then stitch a single-pane-view of Rings across the world and command DR failover if needed.

infra metrics example — Elasticsearch in this case

business metrics example — API requests and response times

These allow rules to figure out if a Ring is unhealthy or not.

All APIs go via a GSLB based load balancer. This enables DNS based failover — the App does not need to change the FQDN it is configured with , rather the VIP that the FQDN refers to changes to the failover site. Essentially every time the app does a DNS lookup, the GSLB system gives the Virtual IP Address (VIP) of a service which is the current active/healthy cluster for that App’s context ( context here refers to things like country etc).

Once a DR failover is initiated, the Monitor master starts a workflow, which can look like below :

Stop the API calls to the faulty primary Ring by redirecting all traffic to a temporary 503 server

2. Do the DB failover

3. Run reconciliation

4. Tweak the GSLB to point the FQDN to the failover ring

It is important to mention that not all failures need DR. There is a cost of failover and we need to selective in flipping the switch as described in the following section.

App Resiliency

When a failover is triggered, the API calls to the backend services will experience discontinuity for some time. Of course we have engineered to make failover time to be within limits, but still the customer experience can be broken during this time.

To help tide over this , and other problems ( like patchy networks), the apps were made resilient so that associates can continue using the apps in the face of discontinuity with the backend.

The pattern employed is described at a high level below :

Essentially the following constructs are employed :

Pre-fetch — fetch all needed resources (e.g images) at startup
Local Storage — store details on the app persistent storage so that it is available across app reboots/restarts
Background processing — a protocol the app and the backend which takes the data in local storage, and syncs it with the backend. This includes id-ing each UI transaction and stitching together an ordered list of transactions at the backend.

Of course there are lot of details which have been glossed over — in particular things like reliability, retransmissions, idempotency, business processes in case the app itself is destroyed etc. Again collateral for a future post .

The Result — RPO & RTO

Two key metrics of a DR solution are Recovery Point Objective (RPO) and Recovery Time Objective ( RTO). In mission-critical apps, both are extremely critical and need to be tuned differently for different use-cases.

With the design above, we were able to demonstrate *instant* RTO and RPO . The actual workflow took a few mins to finish, but since the apps were resilient during the failover the *perceived* RPO and RTO was instant !

Of course in real-world use-cases, some flows will be blocked till the backend is up again . But the number of those scenarios are very few, and the solution allows customers to use the app, even in the face of total backend meltdown.

Cost Optimizations

The design above describes an Active-Passive architecture. The remote deployment is not used for normal operations. The design above was optimized for cost by enabling ‘pairing’ of rings and allowing active-active behavior.

Each ring has a ‘pair’ and when a failover occurs the stores of the pair connect to the one survivor ring . This is depicted below :

Acknowledgments

To see how much our associates love the system checkout the awesome rap song made by our associates :

♫♫ “tap that app” ♫♫

This work was the result of the collaboration between many engineers at Walmart Labs . Major contributors include (alphabetical order) Abiy Hailemichael, Igor Yarkov, Kislaya Tripathi , Nitesh Jain, and Noah Paci.

Business Continuity & Disaster Recovery in the Microservices world

Written by Jyotiswarup Raiturkar