Fail-over of systems at Jet: A matter of just a few clicks
Ever felt frustrated when your plans for a fun weekend are sabotaged by heavy rain? We've all been there. But at least we have weather forecasts to warn us of these possibilities. It's a similar story with cloud technology, except that we don't get any warnings.
Jet was born in the cloud, which means there is a constant threat of unpredictable failures. But that's OK! We've got it covered. One of the top priorities of the Cloud Platform team at Jet has always been to ensure that our systems are resilient both to failures of systems running on the cloud and to failures of the cloud infrastructure itself. But wait, what systems are we talking about? Let's take a step back and break up that title a bit.
I am going to go over each of these sections separately. But before we discuss the failover of systems, it makes sense to first talk about what the systems at Jet look like.
Systems at Jet
General Tech Stack
Microsoft Azure is our primary cloud provider, and our entire tech stack runs on it. We use a number of Azure's managed services, such as Redis, SQL, Virtual Machine Scale Sets, CosmosDB, etc. Event Store is our main data store; it stores our data as immutable events.
If you are interested in knowing more about our Event Stores and event sourcing platform in general, there is an awesome post written by a colleague at Jet, which you can read here.
Jet uses a microservice-oriented architecture, which means all the application teams write their functionality as groups of microservices. HashiCorp's Nomad is our primary scheduler; it schedules our microservices onto clients hosted on Azure. We also use other HashiCorp products, namely Consul for service discovery and Vault for storing secrets. We use NGINX and Envoy for our proxy services. Kafka is our main streaming platform for asynchronous communication between microservices.
All of Jet's microservices are categorized into three tiers, based on how quickly a failure of the service would impact the customer.
- Tier 1 — Services which, if they failed, would have a visible impact on customers in less than 5 minutes.
- Tier 2 — Services which handle important backend systems like Order Processing, Inventory Updates, etc. A failure in these services would not have a directly visible impact within a few minutes, but would affect customer promise, which is measured in hours.
- Tier 3 — Services handling backend systems which, if they failed, would impact system functionality within 2 to 5 days.
This tiering model helps us not only categorize the services, but also determine their SLAs and prioritize work, especially in a failure scenario.
One of the main benefits of running systems on a public cloud provider like Azure is the option to run in multiple regions. At Jet, we run our systems across two regions — DC 1 (Primary) and DC 2 (Secondary). Our systems are geo-replicated across these two data centers.
Figure 3 does a good job of summarizing how our systems and services are geo-replicated across the two data centers. However, it's important to note that not all of our systems and services are geo-replicated. Based on their availability and presence in each region, we categorize our services as:
- Hot/Hot (Active/Active) — Tier 1 services serve traffic and process requests to both regions (Primary and Secondary)
- Hot/Warm (Active/Passive) — Most of the Tier 2 services actively serve traffic in the primary region, while the secondary region keeps its data in sync with the primary. The secondary region is already configured and ready to serve traffic in case the primary region fails.
- Hot/Cold — Most of the Tier 3 services are configured to serve traffic only in the primary region. In case of a failure in the primary region, the secondary region must first be configured and then activated to serve requests.
“With great power comes great responsibility”
Azure bestows upon us the power to run in multiple regions. It's our responsibility to ensure that we make full use of those regions to achieve maximum uptime. This is possible by allowing our systems to fail over to the secondary region in case of failure. But wait, how exactly does failover help here?
- Tolerate Regional Failure Modes — There could be problems with a system in a region: Azure infrastructure failures, failures in our own infrastructure, or even failures in our own code. Failover helps us isolate that region from live traffic until the issue with the problematic system is resolved.
- Fast Mean-Time-To-Remediate (MTTR) — Since we can isolate a region, every time we face a failure scenario we can ask: is the issue seen in both regions, or is it isolated to a single region? If it is a single region, we can immediately fail over, isolating the region where the issue is occurring and sending all traffic to the other region. This remediates (but does not resolve) the issue in just a few minutes.
- Planned Events — We have systems which run actively in both regions and have to be updated at times. While an update is in progress, we can create a maintenance window by moving traffic away from the region being updated.
Historically, our failovers involved a predefined master plan with tons of manual steps. The multiple teams whose systems were involved in the failover scenario had to gather together to follow the master plan and run their myriad scripts. And, since we would touch so many systems during a failover, it was imperative that we did manual checks after every step to ensure our systems were in the expected state after every action. Sounds tedious, right? It sure was!
With such a complex process, would it be possible for a new person with little knowledge of the systems to perform or participate in a failover, whether a practice run or an emergency? No, not a chance. This was especially true when a real, big-time failure incident had taken down an entire region. Even the most experienced engineers dread that scenario.
Moreover, on the cloud, we keep evolving our systems and adding vertical and horizontal scaling. This means that failovers have to scale as well. Considering the process was already so involved, making these failovers scale was an arduous task.
Given all of this, at Jet we needed something that:
- Speeds up the failover process
- Gives us the ability to scale
- Provides failover flexibility
- Is agnostic of the operator's knowledge
Something that could be run in a matter of a few clicks.
A Matter of a Few Clicks
At Jet, we speak the language of cloud — scalability, easy operability, resiliency and robustness. We needed our failover process to have all of these qualities. That led us to write an awesome tool — FCC.
Why the name, you ask? FCC stands for Failover Control Center. It's Jet's global state monitoring and failover automation tool. In reality, it's just a state machine which monitors and modifies the state of systems globally. It also prevents dangerous configurations that violate our systems' invariants, and can visually diagnose failure points in Jet's architecture.
FCC Tech Stack
FCC comprises the components shown in Figure 4.
- Golang — To write the backend of FCC, consisting of server, engine, controllers and the API
- React — To write the front end
- Consul — HashiCorp product used for service discovery, server health monitoring, and storing execution details
- Azure — To host the infrastructure that FCC runs on
We run a high-availability setup with three servers. Consul's key-value functionality is used to store two keys — Leader and Execution. All three servers try to claim leadership. The one which succeeds grabs the Consul lock and logs itself as the leader, updating the value of the Leader key in Consul with its IP:Port.
An incoming request for an execution is received by the load balancer, which looks up the current Leader in Consul and forwards the request to it. Only the Leader is allowed to perform an execution of actions. When the Leader kicks off the execution of the requested operation, it logs the details in the Execution key in Consul. The logs of the outcome of each step of the execution can be viewed on the FCC UI.
Functionally, FCC is made up of three components — Nodes, Plans, and Executions.
A node is the basic functional unit of FCC. It is a logical representation of a system at Jet, like Event Store, Nomad, SQL DB, Cosmos DB, etc. FCC has a separate controller to interact with each of these systems. A node can also be a group of other nodes. Each node can have multiple states, which are defined in its configuration. The FCC controller refreshes the state of every node every few seconds.
The configuration file of a node (a .hcl file) has three sections. The point to note is that a node is just a logical representation of a system; when you interact with a node, you are actually changing the state of the system it represents. To do that, FCC needs the address of that system, which it derives by looking at the consul_key value in the source_definition section of the node configuration:
source_type = "event_store"
source_name = "event_store"
consul_server = "<primary_dc>"
vault_server = "<secondary_dc>"
consul_key = "Jet/Failover/EventStore_A"
consul_key is the key in Consul where FCC can find the address of the system represented by the node, in this case EventStore_A. The node configuration also includes the source_type of the node, which allows FCC to invoke the appropriate controller for interacting with this node.
A plan is a set of steps which instructs the FCC server how to transition a node from one state to another. Every step in a plan leads to a defined checkpoint. A plan can change the state of a single node or of a group of nodes. All the steps of a failover, with their interactions with multiple nodes, can be grouped into a single plan. This way, a single plan can fail over an entire Tier 1 system along with its related data stores.
As you can see in Figure 9, the FCC UI makes it possible to expand a plan and hit Execute to kick it off. C1 and C2 represent the checkpoints in the plan. C1 is the starting state, and C2 is the final state; once C2 is achieved, the plan's execution finishes.
Let's see an example of a plan which shifts traffic from the primary region to the secondary.
The plan above starts at checkpoint C1, defined as the named nodes under jet.com being in any state, and then interacts with three nodes to change their states until the final checkpoint C2 is reached, where all traffic has been shifted to the Secondary region.
Let's take a look at another plan, which shifts traffic back to Hot/Hot, where the Primary and Secondary data centers both serve live traffic.
Note the orange highlighting in Figure 11. One of the most important features of FCC is Checkpoint Validation. Every time a plan is executed, as it progresses through the checkpoints, it validates the states of all the nodes within that checkpoint to ensure they have reached the expected state. Once it ensures the desired states have been reached, FCC highlights them, so the user knows that the plan is progressing well.
Only when FCC has ensured that all the nodes at a checkpoint have reached their expected states does it mark that checkpoint as complete, and only after that does it proceed to the next checkpoint. This way, we can be sure that once FCC marks a plan as finished, it has actually transitioned the states of all the nodes in it. Isn't that awesome?
Some complex plans can have as many as 50 checkpoints, with each checkpoint interacting with multiple complex nodes. Let’s take a look at a schematic view of one example of a complex failover plan.
Figure 12 shows an example of one such complex plan. This plan interacts with almost all the Tier 1 systems across its checkpoints, and it fails over the entire Tier 1 flow along with the associated data stores. This means that, after this plan finishes execution, we have failed over entirely onto the secondary data center. And all the user needs to do is hit that Execute button! There is a caveat, though: if there are any failures while changing the states of one or more systems midway through the plan, at any checkpoint, we need to manually investigate and fix all problems. FCC doesn't provide an auto-rollback option at this point in time.
We have mentioned plan execution multiple times so far. An Execution is the third logical component of FCC; it holds the details of a plan that is currently being executed. FCC can hold at most one Execution at any given time, which means that only one plan can be executed at a time. FCC doesn't allow parallel executions.
Achievements via FCC
All the hard work put into designing and implementing FCC has paid off very well. We have achieved some significant milestones because of FCC, including:
- Shifting Customer Experience Traffic from one region to another takes less than one minute!
- Shifting of Entire Datacenter from one region to another takes less than 5 minutes!
- Full datacenter failover exercises can be executed multiple times a month, with no downtime.
- We are fully agnostic of operator expertise. All you need to know is which plan to execute; the rest is taken care of by FCC.
- Failover plans are centrally managed and stored in Git. This adds version control over the failover steps and prevents dangerous configurations that violate our systems' invariants.
Goals for next generation of FCC
- Operator Usability — We are working to improve the operator experience of interacting with nodes and plans. We also plan to make plans accessible via APIs, add the ability to view recently run plans in the UI, etc.
- Node interaction model — Right now, the node and plan configurations, the controllers, and the server all run as a single executable. We are planning to decouple the server from the controllers and from the node/plan configurations. With this change, we would no longer have to restart the server every time a new plan or node is introduced.
- Multi-tenancy — We want to be able to execute multiple plans at the same time. We would need to implement appropriate ACLs and locks to avoid race conditions.
- Open Source — The current version of FCC is somewhat Jet-specific. Once we have achieved all of the above, we plan to contribute it to the open source community.
Key Design Takeaways
A lot of thought and feedback went into designing FCC, and we have learned some important lessons along the way. These learnings and design takeaways are not specific to FCC; they hold for any case where you want managed failovers. FCC is merely a tool, whose success rests on these strong foundational points. These guidelines will help whether you are trying to automate failovers or just perform them in a more controlled fashion.
- Master Playbook — Always have one! The playbook consists of all the steps that need to be performed when you want to fail over a specific system.
- Granularity of steps — Always make the steps in the master playbook as granular as possible, to keep each step simple. This may sound counter-intuitive, but trust me: the more granular your steps are, the easier it becomes to bundle them into a script or set of instructions which a tool can carry out.
- Defined states — Always define the expected states of the systems you are interacting with at each step. This ensures you move forward only when the current step is fully complete.
- State Validation — Always have some form of validation of the defined states at every step.
- Automation — Finally, try to avoid manual interaction by automating all the steps in playbook. This can be achieved by a tool like FCC.
All this being said, does it mean our success story of failovers at Jet is all about running FCC plans? Absolutely not, for it is our culture here that drives these processes:
- Ownership — The Cloud Platform (CP) team at Jet owns the management of failovers. We encourage teams to onboard onto FCC.
- Layout — We (CP team) sit with each application team to design the layout of the failover of their systems. We then convert these steps from the layout into FCC Plans.
- Practice — We have a schedule of planned failover exercises. Teams have to perform a failover of their systems once every 5–6 weeks. This involves active engagement from both the Cloud Platform team and the failover candidate team.
- Follow-up — After every failover exercise, we publish a summary, including follow-up items to improve both the systems and the FCC experience. The resulting action items are scheduled for completion in the following sprint.
If you like the challenges of building complex & reliable systems and are interested in solving complex problems, check out our job openings.
The successful design and implementation of FCC, and of failovers in general, was made possible by the efforts of many individuals on the Cloud Platform team at Jet.