Fail-over of systems at Jet: A matter of just a few clicks

Prathamesh Bhope
Dec 17, 2018 · 13 min read
Figure 1: Failover + Systems at Jet + A matter of just a few clicks

Systems at Jet

General Tech Stack

Figure 2: Technologies at Jet

Tiering Model

All of Jet’s microservices are categorized into three tiers, based on how quickly a failure in the service impacts the customer.

  • Tier 1 — Customer-facing services whose failure has an immediately visible impact, measured in minutes.
  • Tier 2 — Services which handle important backend systems like Order Processing and Inventory Updates. Failure in these services would not have a directly visible impact within a few minutes, but would affect customer promise, which is measured in hours.
  • Tier 3 — Services handling backend systems whose failure would impact system functionality within 2 to 5 days.

Multi-region

One of the main benefits of running systems on a public cloud provider like Azure is the option to run in multiple regions. At Jet, we run our systems across two different regions — DC 1 (Primary) and DC 2 (Secondary) — and geo-replicate them across these two data centers.

Figure 3: Geo-replication of systems at Jet
  • Hot/Warm (Active/Passive) — Most Tier 2 services actively serve traffic in the primary region, while the secondary region keeps its data in sync with the primary. The secondary region is already configured and ready to serve traffic if the primary region fails.
  • Hot/Cold — Most Tier 3 services are configured and serve traffic only in the primary region. If the primary region fails, the secondary region must first be configured and then activated to serve requests.
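The routing decision implied by these two modes can be sketched as follows. This is a hypothetical Python sketch, not Jet's actual code; the region names and the `healthy`/`configured` flags are illustrative:

```python
# Minimal sketch of active/passive region routing across two regions.
# A Hot/Warm secondary is already configured; a Hot/Cold secondary is not,
# and must be configured before it can be activated.
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool
    configured: bool  # True if ready to serve traffic (Hot/Warm secondary)


def pick_serving_region(primary: Region, secondary: Region) -> Region:
    """Route all traffic to the primary unless it is unhealthy."""
    if primary.healthy:
        return primary
    if not secondary.configured:
        # Hot/Cold: the secondary must be configured before activation.
        raise RuntimeError(f"{secondary.name} is not configured to serve traffic")
    return secondary


dc1 = Region("dc1", healthy=False, configured=True)
dc2 = Region("dc2", healthy=True, configured=True)
print(pick_serving_region(dc1, dc2).name)  # -> dc2
```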

Failover

  • Fast Mean-Time-To-Remediate (MTTR) — Because we can isolate a region, every time we face a failure scenario we can ask: is the issue seen in both regions, or is it isolated to a single region? If it is a single region, we can immediately fail over, isolating the region where the issue is occurring and sending all traffic to the other region. This remediates (but does not resolve) the issue in just a few minutes.
  • Planned Events — Some of our systems run actively in both regions and have to be updated at times. While an update is in progress, we can create a maintenance window by moving traffic away from the region being updated.
In short, multi-region failover gives us the ability to scale, provides flexibility in how we fail over, and does not depend on any one operator's knowledge.

A Matter of a Few Clicks

At Jet, we speak the language of cloud — scalability, easy operability, resiliency and robustness. We needed our failover process to have all of these qualities. That led us to write an awesome tool — FCC.

FCC Tech Stack

FCC comprises the components shown in Figure 4.

Figure 4: FCC Tech Stack
  • React — Used to build the front end
  • Consul — A HashiCorp product used for service discovery, server health monitoring, and storing execution details
  • Azure — Hosts the infrastructure FCC runs on

Design

Figure 5: FCC Design

Functional Setup

Figure 6: FCC Functional Set-up

Nodes

A node is the basic functional unit of FCC. It is a logical representation of a system at Jet like Event Store, Nomad, SQL DB, Cosmos DB, etc. FCC has separate controllers to interact with each of these systems. A node can also be a group of other nodes. Each node can have multiple states which are defined in its configuration. The FCC controller refreshes the state of every node every few seconds.

Figure 7: Nodes view on FCC UI
Figure 8: Sample Node configuration
source_definition {
  source_type   = "event_store"
  source_name   = "event_store"
  consul_server = "<primary_dc>"
  vault_server  = "<secondary_dc>"
  consul_key    = "Jet/Failover/EventStore_A"
}
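The node-and-controller relationship described above can be sketched as a simple polling loop. This is a hypothetical Python sketch; class and method names are illustrative, and the real FCC controllers talk to Event Store, Nomad, and the other systems directly:

```python
# Sketch of FCC's node model: each node is a logical representation of a
# system, with a set of states defined in its configuration. A controller
# refreshes every node's state on a fixed interval (FCC uses a few seconds).
import time
from typing import Callable


class Node:
    def __init__(self, name: str, states: list[str], probe: Callable[[], str]):
        self.name = name
        self.states = states   # states defined in the node's configuration
        self.probe = probe     # controller call that reads the live state
        self.state = "unknown"

    def refresh(self) -> str:
        observed = self.probe()
        if observed not in self.states:
            raise ValueError(f"{self.name}: undefined state {observed!r}")
        self.state = observed
        return self.state


def run_controller(nodes: list[Node], interval_s: float, cycles: int) -> None:
    """Poll every node's state on a fixed interval."""
    for _ in range(cycles):
        for node in nodes:
            node.refresh()
        time.sleep(interval_s)


es = Node("event_store", ["primary", "secondary", "offline"], probe=lambda: "primary")
run_controller([es], interval_s=0.01, cycles=2)
print(es.state)  # -> primary
```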

Plans

A plan is a set of steps that instructs the FCC server how to transition a node from one state to another. Every step in a plan leads to a defined checkpoint. A plan can change the state of a single node or a group of nodes, so all of the steps of a failover, with their interactions across multiple nodes, can be grouped into a single plan. This way, a single plan can fail over an entire Tier 1 system along with its related data stores.

Figure 9: Sample Plan view on FCC UI
Figure 10: Traffic shift sample plan
Figure 11: Highlighted states denoting validation
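The step-and-checkpoint structure of a plan could be sketched like this. This is an illustrative Python sketch under assumed names, not FCC's actual implementation:

```python
# Sketch of a plan: an ordered list of steps, each ending at a checkpoint
# that validates the expected state before the next step is allowed to run.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], None]      # e.g., shift traffic, promote a replica
    checkpoint: Callable[[], bool]  # validates the expected state afterwards


def execute_plan(steps: list[Step]) -> list[str]:
    completed = []
    for step in steps:
        step.action()
        if not step.checkpoint():
            raise RuntimeError(f"checkpoint failed after: {step.description}")
        completed.append(step.description)
    return completed


state = {"traffic": "dc1"}
plan = [
    Step("drain dc1", lambda: state.update(traffic="none"),
         lambda: state["traffic"] == "none"),
    Step("activate dc2", lambda: state.update(traffic="dc2"),
         lambda: state["traffic"] == "dc2"),
]
print(execute_plan(plan))  # -> ['drain dc1', 'activate dc2']
```

A failed checkpoint stops the plan immediately, which is what makes granular steps safe to chain together.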

Complex Plans

Some complex plans can have as many as 50 checkpoints, with each checkpoint interacting with multiple complex nodes. Let’s take a look at a schematic view of one example of a complex failover plan.

Figure 12: Tier 1 fail-over with entire storage plan

Execution

We have mentioned plan execution several times so far. An Execution is one of FCC's logical components: it holds the details of the plan that is currently being executed. FCC holds at most one Execution at any given time, which means only one plan can be executed at a time; FCC does not allow parallel executions.
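The single-Execution constraint can be sketched as a non-blocking lock, so a second plan submission is rejected rather than queued. This is an illustrative Python sketch, not FCC's actual code:

```python
# Sketch of "at most one Execution at a time": acquiring the slot fails
# immediately if another plan is already executing.
import threading


class ExecutionSlot:
    def __init__(self):
        self._lock = threading.Lock()
        self.current = None

    def start(self, plan_name: str) -> bool:
        if not self._lock.acquire(blocking=False):
            return False  # another plan is already executing
        self.current = plan_name
        return True

    def finish(self):
        self.current = None
        self._lock.release()


slot = ExecutionSlot()
print(slot.start("failover-tier1"))  # -> True
print(slot.start("shift-traffic"))   # -> False (parallel execution refused)
slot.finish()
print(slot.start("shift-traffic"))   # -> True
```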

Achievements via FCC

All the hard work put into the design and implementation of FCC has paid off very well. We have achieved some significant milestones because of it, including:

  • Shifting an entire datacenter's traffic from one region to another takes less than 5 minutes!
  • Full datacenter failover exercises can be executed multiple times a month, with no downtime.
  • We are fully agnostic of operator expertise: all you need to know is which plan to execute, and FCC takes care of the rest.
  • Failover plans are centrally managed and stored in Git. This adds version control over the failover steps and prevents dangerous, inconsistent configurations.
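Because plans live in Git, inconsistent configurations can be caught before they merge. Here is a hypothetical sketch of such a pre-merge check (not FCC's actual tooling; the plan shape and node names are assumptions):

```python
# Hypothetical CI check for plan files stored in Git: every step must
# reference a node that is defined in the node configuration.
def validate_plan(plan: dict, known_nodes: set[str]) -> list[str]:
    errors = []
    for i, step in enumerate(plan.get("steps", [])):
        node = step.get("node")
        if node not in known_nodes:
            errors.append(f"step {i}: unknown node {node!r}")
    return errors


nodes = {"event_store", "nomad", "sql_db"}
plan = {"name": "tier1-failover",
        "steps": [{"node": "event_store"}, {"node": "cosmos"}]}
print(validate_plan(plan, nodes))  # -> ["step 1: unknown node 'cosmos'"]
```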

Goals for next generation of FCC

  • Operator Usability — We are trying to improve the operator experience of interacting with nodes and plans. We also plan to make plans accessible via APIs, add the ability to view recently run plans in the UI, and more.
  • Node interaction model — Right now, the node and plan configurations, the controllers, and the server all run as a single executable. We plan to decouple the server from the controllers and from the node/plan configuration, so that we would not have to restart the server every time a new plan or node is introduced.
  • Multi-tenancy — We want to be able to execute multiple plans at the same time. We would need to implement appropriate ACLs and locks to avoid race conditions.
  • Open Source — The current version of FCC is somewhat Jet-specific. Once we have achieved all of the above, we plan to contribute it to the open source community.

Key Design Takeaways

A lot of thought and feedback went into designing FCC, and we have learned some important lessons along the way. These learnings and design takeaways are not specific to FCC; they apply anywhere you want managed failovers. FCC is just a tool, and its success rests on these foundational points. They will help whether you are trying to automate failovers or simply perform them in a more controlled fashion.

  • Granularity of steps — Make the steps in the master playbook as granular as possible. This may sound counter-intuitive, but the more granular your steps are, the easier it becomes to bundle them into a script or set of instructions that a tool can carry out.
  • Defined states — Always define the expected states of the systems you are interacting with at each step. This ensures you move forward only when the current step is fully complete.
  • State Validation — Always have some form of validation of the defined states at every step.
  • Automation — Finally, avoid manual interaction by automating all the steps in the playbook. This can be done with a tool like FCC.

Our culture

All this being said, is our success story of failovers at Jet just about running FCC plans? Absolutely not; it is our culture here that drives these processes:

  • Layout — We (the Cloud Platform team) sit with each application team to design the layout of the failover of their systems, then convert these steps into FCC plans.
  • Practice — We have a schedule of planned failover exercises. Teams have to perform failover of their systems once every 5–6 weeks. This involves active engagement of both the Cloud Platform and the failover candidate team.
  • Follow-up — After every failover exercise, we publish a summary, which includes corresponding follow up items to improve the systems as well as the FCC experience. The resulting action items are scheduled to be completed in the following sprint.

Acknowledgements

The successful design and implementation of FCC, and of failovers in general, was made possible by the efforts of many individuals on the Cloud Platform team at Jet.

Jet Tech

Sharing our engineering org’s learnings & stories as we build the world’s best experience to shop curated brands and city essentials in one place.

Written by Prathamesh Bhope, Software Engineer @ Jet (Walmart Labs).