Guaranteeing High Availability — how we do it at OrderGroove

Ali Ahsan
Ordergroove Engineering
6 min read · Jun 22, 2018

Even the most robust systems experience outages from time to time. They can occur due to routine maintenance operations or sudden system crashes. Fortunately, however, there are steps you can take to mitigate the downtime risk in both planned and unplanned situations.

In this post, I will introduce the concept of availability, explain why and how we implement high availability at OrderGroove, and share some learnings from our experience that might be helpful for anyone considering undertaking a similar journey.

Availability is generally measured and expressed as an uptime percentage of a system over a period of time (a month or a year).

Translation of common availability percentages to downtime in minutes
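The translation itself is simple arithmetic. Here is a minimal Python sketch that reproduces those numbers (the percentages shown are common reference points, not our SLA tiers):

```python
# Back-of-the-envelope: translate an availability percentage into the
# maximum downtime it allows per month and per year.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

def max_downtime_minutes(availability_pct):
    """Return (minutes per month, minutes per year) of allowed downtime."""
    downtime_fraction = 1 - availability_pct / 100
    return (MINUTES_PER_MONTH * downtime_fraction,
            MINUTES_PER_YEAR * downtime_fraction)

for pct in (99.0, 99.9, 99.95, 99.99):
    per_month, per_year = max_downtime_minutes(pct)
    print(f"{pct}%  ->  {per_month:.1f} min/month, {per_year:.1f} min/year")
```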

Availability requirements for a system are determined by analyzing the cost of downtime of that service to the business. For example, for an online retailer, 30 minutes of downtime in its checkout flow would likely result in a significant loss of revenue, while a 60-minute outage of an email delivery service would be less critical.

High Availability Setup at OrderGroove

We have done rigorous downtime cost analysis for different services in our platform. For business critical services, our SLA guarantees 99.95% availability. We achieve this by implementing an Active-Passive setup that spreads our infrastructure across three geographically distinct regions. Our active data center is in Chicago, our failover or passive data center is in Virginia, and a third, lightweight, internal-use-only cluster is set up in Dallas. A simplified version of the architecture is shown in the diagram below:

OrderGroove high availability architecture

Part 1: Routing traffic to the appropriate target data center

At the topmost layer, our DNS record resolves to an Akamai Edge server. The edge server internally uses another Akamai component, Global Traffic Management (GTM), to determine the appropriate traffic target from which to serve a request. It directs all end user traffic to the primary data center unless that fails, in which case all traffic is directed to the failover data center. GTM continuously pings health check endpoints on both data centers to keep track of target availability.

There are a few configuration variables to watch out for when configuring GTM:

  • Time to live (TTL): This determines how long a DNS record is cached. In the event of a failover, clients that have already cached the record will keep pointing to the old target until the TTL expires.
  • Failover delay: This determines how long GTM will wait before switching traffic to the backup data center once it detects a failure in the active data center.
  • Fail back delay: When in failover mode, this determines how long GTM will wait before routing traffic back to the active data center after sensing it has returned to a healthy state.

Too small a value for any of these variables risks false positives, where GTM flips traffic between the data centers on momentary blips; too large a value widens the window during which traffic is not routed to the preferred data center.
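To make the interplay of these settings concrete, here is a toy Python sketch of a health-check loop with a failover delay and a fail back delay. This is illustrative only: it is not Akamai GTM code, and the endpoints and delay values are placeholders.

```python
import time
import requests  # any HTTP client works

# Hypothetical health check endpoints for the two data centers.
ACTIVE = "https://active-dc.example.com/healthz"
PASSIVE = "https://passive-dc.example.com/healthz"

FAILOVER_DELAY = 60   # seconds the active DC must be unhealthy before switching
FAILBACK_DELAY = 300  # seconds the active DC must be healthy again before switching back

def healthy(url):
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

target = ACTIVE
unhealthy_since = healthy_since = None

while True:
    if healthy(ACTIVE):
        unhealthy_since = None
        if target != ACTIVE:
            healthy_since = healthy_since or time.time()
            if time.time() - healthy_since >= FAILBACK_DELAY:
                target = ACTIVE   # fail back only after a sustained recovery
    else:
        healthy_since = None
        unhealthy_since = unhealthy_since or time.time()
        if target == ACTIVE and time.time() - unhealthy_since >= FAILOVER_DELAY:
            target = PASSIVE      # fail over only after a sustained outage
    time.sleep(10)                # poll interval; GTM uses its own schedule
```

Too short a delay and the loop flips targets on a single failed probe; too long and requests keep hitting an unhealthy (or no longer preferred) data center.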

Part 2: Database replication

A high availability setup requires redundancy in the data layer. Multiple copies of databases on separate servers within a data center provide fault tolerance against the failure of a single machine. Similarly, replicating data across data centers mitigates the risk of a data center outage.

The service we will use as an example here has MongoDB as its data store. Due to the large volume of data, this service uses a sharded MongoDB cluster comprising five shards. Each shard is a replica set consisting of five nodes.

There is one primary node that receives all write operations and maintains a log of those operations in a collection called the oplog. Multiple secondary nodes replicate data from the primary by applying all operations from the primary's oplog. If the primary node goes down, any of the secondary nodes can become the primary. There is also a non-data-bearing node in the replica set called the arbiter. An arbiter does not hold a copy of the data and cannot become a primary; its only purpose is to participate in the election process to help elect a new primary. All members periodically exchange heartbeats, in the form of simple pings, to stay up to date on the accessibility of the other replica set members.
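You can inspect the member states and the oplog yourself from a driver. Here is a small pymongo sketch; the connection string is a placeholder:

```python
from pymongo import MongoClient

# Connect to the replica set (host and replica set name are placeholders).
client = MongoClient("mongodb://primary.example.com:27017/?replicaSet=rs0")

# replSetGetStatus reports the state of every member as seen by this node.
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    # stateStr is PRIMARY, SECONDARY, or ARBITER for healthy members
    print(member["name"], member["stateStr"], "health:", member["health"])

# The oplog is a capped collection in the "local" database; every write on
# the primary is recorded here and replayed by the secondaries.
last_op = client.local.oplog.rs.find().sort("$natural", -1).limit(1)
for op in last_op:
    print("latest oplog entry:", op["op"], "on", op.get("ns"))
```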

Communication between replica set members

In order to elect a new primary, the election process in a MongoDB replica set requires a majority of the nodes to be available. Therefore, to protect against data center failure, we have to distribute replica set members strategically so as to maximize the likelihood that the remaining members can still form a majority.

At OrderGroove, we keep the same number of data-bearing nodes in our active and failover data centers, and we place an arbiter in a third data center. This makes the total number of nodes in the replica set odd, ensuring there is never a tie in the vote to elect a primary. This distribution also mitigates the risk of data center failure: if any one data center fails, a majority of the nodes (more than half) is still available and can elect a new primary.

While setting up a high availability MongoDB replica set, it is important to set an appropriate priority for each of the replica set members. The priority setting influences both the timing and the outcome of the election for a primary: higher priority nodes are more likely to call an election and more likely to win. We set the highest priority for the nodes in our primary data center. This ensures that a primary is always elected from this data center first, as long as one of these nodes is still available. The nodes in the failover data center have lower priority, making them the backup only in the event that all the nodes in the primary data center are unavailable.
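Putting the last two sections together, a replica set configuration along these lines might look like the following. The host names and priority values are illustrative placeholders, not our actual topology:

```python
# Two data-bearing nodes in the active DC (highest priority), two in the
# failover DC (lower priority), and a vote-only arbiter in a third DC.
rs_config = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "chi-db-1.example.com:27017", "priority": 10},
        {"_id": 1, "host": "chi-db-2.example.com:27017", "priority": 10},
        {"_id": 2, "host": "va-db-1.example.com:27017", "priority": 1},
        {"_id": 3, "host": "va-db-2.example.com:27017", "priority": 1},
        {"_id": 4, "host": "dal-arb-1.example.com:27017",
         "arbiterOnly": True, "priority": 0},
    ],
}

# Applied with rs.initiate()/rs.reconfig() in the mongo shell, or via the
# equivalent admin command from a driver, e.g.:
# client.admin.command("replSetInitiate", rs_config)
```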

Key learnings

Here are a few takeaways from our high availability implementation work that might be helpful for someone with similar service availability concerns. Each of these warrants a blog post of its own, but I will highlight them here:

  • Carefully select the actions that must execute synchronously, and look for opportunities to defer tasks to asynchronous execution. This keeps the infrastructure footprint that requires high availability to a bare minimum. Consider, for example, a service that captures checkouts: it creates orders and sends a confirmation email to the customer. Instead of implementing high availability for the entire workflow, it might be acceptable to only guarantee capture of the incoming requests. As long as requests are successfully queued, the rest of the workflow can happen asynchronously, since it is less time critical (see the sketch after this list).
  • If you don't keep the backup data center warmed up, you may experience higher than usual latency when failover is triggered. One way to address this cold start problem is to have the failover data center replay read queries from the active data center so that its caches and databases stay primed for efficient query response.
  • Finally, develop a cadence for testing your high availability setup via a forced failover. Chaos Monkey is a great tool to facilitate this exercise, but a periodic manual forced failover is a good starting point as well.
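To illustrate the first takeaway, here is a minimal, in-process sketch of the capture-then-defer pattern. In production the in-memory queue would be a durable message broker, and the helper functions are hypothetical:

```python
import queue
import threading

# The synchronous path only validates and enqueues the checkout; order
# creation and the confirmation email happen asynchronously.
checkout_queue = queue.Queue()

def capture_checkout(payload):
    """Synchronous path: must stay highly available, so it does the minimum."""
    checkout_queue.put(payload)
    return "accepted"

def create_order(payload):
    print("order created for", payload.get("customer"))

def send_confirmation_email(payload):
    print("confirmation email queued for", payload.get("customer"))

def worker():
    """Asynchronous path: creates the order and sends the email later."""
    while True:
        payload = checkout_queue.get()
        create_order(payload)
        send_confirmation_email(payload)
        checkout_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
capture_checkout({"customer": "jane@example.com", "items": ["sku-123"]})
checkout_queue.join()   # wait for the deferred work in this toy example
```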
