Surviving a cloud provider outage

Niko Levitsky
Ordergroove Engineering
3 min read · Jun 3, 2019
https://status.cloud.google.com/incident/cloud-networking/19009

Cloud providers fail.

We put our trust in their infrastructure only to find out that they, too, are prone to periodic interruptions and downtime.

Failures in a single zone or region happen fairly regularly, but a global shutdown of an entire cloud provider is highly unlikely. Luckily, large cloud providers have regions and zones scattered across the globe, some clustered closely enough to provide low-latency communication between backend services.

So what can we do to mitigate risks while running on cloud infrastructure?
We can leverage all the available tools to protect ourselves from these pesky zone and regional issues.

At OrderGroove, we rely on our services being available 99.95% of the time, so an outage lasting a few hours would put us in breach of our SLA and cause unneeded pain for our customers. To ensure we're not affected by such failures, we deploy our applications across multiple zones as well as regions.
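
As a back-of-the-envelope check (illustrative arithmetic only, not the wording of our actual SLA), a 99.95% target leaves only about 21.6 minutes of downtime per 30-day month, so a multi-hour outage blows the budget many times over:

```python
# Illustrative: how much downtime a given availability target permits.
def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of downtime allowed over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60

monthly = downtime_budget_minutes(0.9995, days=30)   # ~21.6 minutes
yearly = downtime_budget_minutes(0.9995, days=365)   # ~262.8 minutes
```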

Running a static site across geographically isolated locations is a fairly easy feat. In our case, for each region we spin up a set of VMs behind a load balancer and configure Akamai GTM (Global Traffic Manager) failover to switch between them.

The fun begins when we need to run a dynamic site with MongoDB as a backend. Here, we have to get creative.

MongoDB is capable of running multiple replicas; in fact, for a production setup, MongoDB recommends a minimum of three members. To maintain a functional read/write cluster, a majority of voting nodes must be online. This requirement fits nicely with a multi-region/zone setup.
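
The majority rule can be made concrete with a quick sketch (Python here purely for illustration): a set of n voting members needs ⌊n/2⌋ + 1 votes to elect a primary, so three members tolerate one failure and five tolerate two:

```python
def quorum(voting_members: int) -> int:
    """Votes needed to elect a primary in a replica set of this size."""
    return voting_members // 2 + 1

def tolerable_failures(voting_members: int) -> int:
    """How many members can go down while a primary can still be elected."""
    return voting_members - quorum(voting_members)

# A three-member set survives one failure; a five-member set survives two.
```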

Each zone can have its own MongoDB instance, which provides protection against a zonal failure: one zone down means only one replica is inaccessible. In the more drastic case of an entire region becoming unavailable, keeping the number of instances per region small ensures that a majority of nodes survives and the cluster continues to operate as expected. Both scenarios leave us with a functional read/write MongoDB cluster.

Let’s examine an example highly available cluster:

An autoscaling group for application nodes configured across multiple zones ensures that a minimum number of application nodes is always present. If an instance becomes unhealthy, a new instance starts up, is provisioned from a pre-baked OS image, receives a code deploy via Chef, and, thanks to GCP automation, gets attached to the cloud load balancer.

Every app node is aware of every MongoDB replica; this ensures that, if a DB node fails, a new primary is elected and applications continue to operate as though nothing had happened. Even with the loss of any single region, a majority of DB nodes remains healthy and MongoDB continues to operate in read/write mode.
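
One common way to express that awareness is in the client connection string, which lists every replica so the driver can discover the topology and fail over on its own. A minimal sketch (the hostnames, set name, and helper function are illustrative, not our actual configuration):

```python
def replica_set_uri(hosts: list[str], replica_set: str,
                    read_preference: str = "primaryPreferred") -> str:
    """Build a MongoDB connection URI that names every replica explicitly."""
    return (
        "mongodb://" + ",".join(hosts)
        + f"/?replicaSet={replica_set}&readPreference={read_preference}"
    )

uri = replica_set_uri(
    ["mongo-central-01-a:27017", "mongo-central-02-b:27017",
     "mongo-east4-01-b:27017"],
    replica_set="replicaset0",
)
```

With every host listed up front, the driver can reach the set even when the first host in the list happens to be the one that is down.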

To keep latency lowest in our main region, we configure the MongoDB members with priorities; this ensures that the primary always runs there unless that region has no healthy instances.

This can be accomplished by configuring your replicas with:

cfg = rs.conf()

// Higher priority wins elections, so the primary stays in the central region.
cfg.members[0].priority = 0.75  // mongo-central-01-a.c.project.internal:27017
cfg.members[1].priority = 0.75  // mongo-central-02-b.c.project.internal:27017
cfg.members[2].priority = 0.5   // mongo-east4-01-b.c.project.internal:27017
cfg.members[3].priority = 0.5   // mongo-east4-02-c.c.project.internal:27017
// members[4] (mongo-east1-01-b.c.project.internal:27017) is an arbiter:
// it votes in elections but holds no data, so its priority is left untouched.

and then executing:

rs.reconfig(cfg)

In conclusion

While cloud providers fail, and on occasion fail hard, they also provide tools and services that let us build more resilient platforms than was possible with physical infrastructure. Ultimately, an application could even be hosted across multiple cloud providers, giving you the ability to withstand a global outage of any one of them.
