Expedia Group Tech — Software
Auto-Healing MongoDB cluster in AWS
How to set up a MongoDB cluster that is resilient to hardware failures and simplifies maintenance
Setting up a distributed database cluster requires careful consideration of many key aspects, including resiliency. In this post, I’ll explain how we (Praveer Gupta and I) set up a MongoDB cluster that is resilient to failure of the underlying hosts.
Why is Auto-Healing required?
If you want to set up a database cluster in AWS, using EC2 instances is a popular way of doing so. But these instances are ephemeral and are occasionally lost or need to be replaced due to:
- Planned maintenance like OS/AMI upgrades, capacity changes, etc.
- Unplanned activities like instance termination by AWS to handle degraded performance/failed hardware
Restoring the database to its original state in either scenario involves considerable manual effort and sensitive operations. A setup that recovers from such failures automatically saves time and prevents errors. The same setup also reduces the effort required for planned maintenance.
What makes it possible?
Most distributed databases and cloud providers offer broadly similar features, so this approach can be adapted to implement auto-healing for other distributed databases and other cloud providers.
This section lists the features of MongoDB and AWS that we exploited. How they are used is detailed in the following sections.
- MongoDB servers in a replica set (RS) replicate data among themselves to provide redundancy. In the event of a server getting reconnected after being disconnected for some time, the replica set is able to detect how far behind the server is and automatically sync the remaining data.
- Each individual shard/partition can be set up as a replica set. Network partitioning in one shard’s RS does not impact any other shard.
- Elastic Block Store (EBS) provides a way to separate data from compute. Termination protection on EBS volumes (in EC2 terms, setting the volume’s DeleteOnTermination attribute to false) prevents them from being deleted when the instance they are attached to is terminated.
- If an Auto Scaling group (ASG) spans multiple Availability Zones (AZs), AWS keeps the servers evenly distributed across them. With one server per AZ, this means that a replacement for a server in us-west-2a is also launched in us-west-2a.
- Route 53 is the AWS DNS service. For each server in the cluster, we create an A record in Route 53 that maps its hostname to its IP address.
- Tagging provides a simple way to associate properties with resources, which can also be used to identify those resources.
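To make the first point above concrete, here is a minimal sketch of how the lag of each secondary could be read from the reply of MongoDB's replSetGetStatus command. The function name and the sample document are illustrative assumptions. With pymongo, the reply would typically come from `client.admin.command("replSetGetStatus")`, but the sketch works on a plain dictionary so it runs without a cluster.

```python
from datetime import datetime, timedelta
from typing import Dict


def replication_lag(status: dict) -> Dict[str, timedelta]:
    """Return how far each secondary's last applied write trails the primary's."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample shaped like a replSetGetStatus reply.
# Hostnames, ports and timestamps are made up for illustration.
sample_status = {
    "set": "srs0",
    "members": [
        {"name": "mongo_srs0_uswest2a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2020, 1, 1, 12, 0, 0)},
        {"name": "mongo_srs0_uswest2b:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2020, 1, 1, 12, 0, 0)},
        {"name": "mongo_srs0_uswest2c:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2020, 1, 1, 11, 15, 0)},
    ],
}

lag = replication_lag(sample_status)
```

Only members in the SECONDARY state are reported. A member that has just rejoined may spend some time in another state (for example RECOVERING) before it appears here.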
How is auto-healing achieved?
The solution is presented here in three sections: The first two sections show how we set up the cluster topology across EC2 instances and then configure each instance. The last section details how auto recovery is achieved and how we reduced the recovery time.
Cluster topology setup
We used AWS CloudFormation to provision EC2 instances. The same may be achieved through other solutions like Terraform.
- We set up a sharded cluster with 2 shards (each as a replica set with 3 members), a config server replica set of 3 members and 3 router servers.
- Each of these has its own Auto Scaling group (ASG) and Launch Configuration. The min and max sizes of each ASG were set to 3, so when an instance gets terminated, AWS Auto Scaling brings up its replacement.
- The 3 servers in each ASG are placed in different Availability Zones (us-west-2a, us-west-2b and us-west-2c).
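The topology above can be written down as data. The following is only a sketch: the `GroupSpec` type and the group names `cfg` and `router` are hypothetical, while `srs0` follows the shard naming used later in this post. It does not call AWS. It just models one ASG per replica set or router tier, pinned to exactly one server per AZ.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# AZ names written in the usual AWS form.
AZS = ("us-west-2a", "us-west-2b", "us-west-2c")


@dataclass(frozen=True)
class GroupSpec:
    name: str                     # e.g. "srs0" for the first shard
    role: str                     # "shard", "config" or "router"
    min_size: int
    max_size: int
    availability_zones: Tuple[str, ...]


def build_topology(shards: int = 2, azs: Sequence[str] = AZS) -> List[GroupSpec]:
    """One group per shard, plus one for config servers and one for routers."""
    size = len(azs)  # min == max == number of AZs, so one server per AZ
    zones = tuple(azs)
    groups = [GroupSpec(f"srs{i}", "shard", size, size, zones) for i in range(shards)]
    groups.append(GroupSpec("cfg", "config", size, size, zones))      # hypothetical name
    groups.append(GroupSpec("router", "router", size, size, zones))   # hypothetical name
    return groups


topology = build_topology()
total_instances = sum(g.max_size for g in topology)
```

Because min and max are equal, the group never scales. Its only job is to replace a lost instance, which is the behaviour the rest of the setup relies on.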
MongoDB process setup
Once an EC2 instance is launched (either when the cluster is first set up or during a replacement instance launch), the following workflow configures the instance.
- AWS creates the EC2 instance.
- AWS calls the initialization lifecycle hook (the UserData script in the Launch Configuration).
- The hook downloads a bootstrap script stored on S3.
- The bootstrap script installs Ansible and its dependencies.
- The bootstrap script configures the instance by running the Ansible playbook in local mode.
Further points to note:
- Instance configuration and MongoDB process setup are done by the Ansible playbook that we wrote.
- Each Launch Configuration passes its details (shard name, etc.) as parameters to the bootstrap script, and these are then used in the Ansible playbook.
Another automation framework, or even shell scripts, could also be used to achieve the same results.
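As an illustration of this workflow, the sketch below renders a UserData script that downloads a bootstrap script from S3 and runs it with the launch configuration's parameters, and builds the kind of `ansible-playbook` command the bootstrap script might run in local mode. The bucket name, file paths, parameter names and playbook name are all assumptions made for the example, not the ones we used.

```python
import json
import shlex
from typing import Dict, List


def render_user_data(bucket: str, script_key: str, params: Dict[str, str]) -> str:
    """Render a shell UserData script that fetches and runs the bootstrap script."""
    args = " ".join(
        f"--{key} {shlex.quote(str(value))}" for key, value in sorted(params.items())
    )
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        f"aws s3 cp s3://{bucket}/{script_key} /tmp/bootstrap.sh",
        "chmod +x /tmp/bootstrap.sh",
        f"/tmp/bootstrap.sh {args}",
    ]
    return "\n".join(lines) + "\n"


def ansible_local_command(playbook: str, extra_vars: Dict[str, str]) -> List[str]:
    """Command line for running a playbook against the local machine only."""
    return [
        "ansible-playbook",
        "--connection=local",
        "--inventory", "localhost,",   # trailing comma: inline host list, not a file
        "--extra-vars", json.dumps(extra_vars, sort_keys=True),
        playbook,
    ]


# Hypothetical values for illustration only.
user_data = render_user_data("example-bucket", "mongo/bootstrap.sh", {"shard": "srs0"})
command = ansible_local_command("mongo.yml", {"shard": "srs0"})
```

Passing the parameters as explicit arguments keeps the bootstrap script and the playbook identical for every group. Only the launch configuration differs.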
Putting it together
Due to the arrangement of our topology and the fact that AWS Auto Scaling launches a replacement in the same AZ as the original, any newly launched server knows its coordinates in the cluster. By coordinates, I mean the shard and Availability Zone it belongs to. This information is used as follows.
- The Ansible playbook updates the hostname of each server using its coordinates. For example, a member of the first shard (srs0) in us-west-2a is named mongo_srs0_uswest2a. The EBS volumes attached to the instance are also tagged using these coordinates.
- The Ansible playbook also updates the server’s IP address in its A record in Route 53. The config for each RS contains the hostnames of its members rather than their IPs, so any replacement server is automatically added to the relevant RS without manual intervention.
- Because we use MongoDB replica sets, MongoDB does the job of replicating missing data to the newly added instance. Once the sync completes, the instance starts to function as expected.
- In the extremely unlikely event that all 3 members of a replica set get terminated simultaneously, EBS termination protection ensures that the data is not lost, and the cluster can be restored to its last known good state. Not using EBS volumes (with termination protection) would have resulted in loss of data.
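The naming and DNS steps above can be sketched as follows. The hostname format matches the example in this post. The domain, TTL, port and function names are assumptions. The change batch has the shape accepted by the Route 53 ChangeResourceRecordSets API (for instance through boto3's `change_resource_record_sets`), and the replica set document has the shape of a MongoDB replica set configuration. Nothing here talks to AWS or MongoDB.

```python
from typing import Dict, List


def hostname_for(group: str, az: str) -> str:
    """Derive a stable hostname from a server's coordinates (group and AZ)."""
    return f"mongo_{group}_{az.replace('-', '')}"


def a_record_upsert(hostname: str, domain: str, ip: str, ttl: int = 60) -> Dict:
    """Change batch that creates or overwrites the server's A record."""
    return {
        "Comment": f"auto-healing update for {hostname}",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": f"{hostname}.{domain}.",
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }],
    }


def replica_set_config(group: str, azs: List[str], domain: str, port: int) -> Dict:
    """Replica set config that refers to members by hostname, never by IP."""
    return {
        "_id": group,
        "members": [
            {"_id": i, "host": f"{hostname_for(group, az)}.{domain}:{port}"}
            for i, az in enumerate(azs)
        ],
    }


# Hypothetical domain, IP and port for illustration only.
name = hostname_for("srs0", "us-west-2a")
batch = a_record_upsert(name, "db.example.internal", "10.0.1.17")
config = replica_set_config(
    "srs0", ["us-west-2a", "us-west-2b", "us-west-2c"], "db.example.internal", 27017
)
```

Since the replica set config only ever sees the hostname, a replacement server that upserts its own A record takes over the old member's slot without any reconfiguration of the replica set.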
Speeding up the auto-healing process
Since the replacement server launched by Autoscaling has a new EBS volume attached, MongoDB needs to perform a full sync to replicate data onto it. This process is slow and used to take ~4 hours in our case.
An optimisation we used is to preserve the data of the server that died, so that only the delta needs to be re-synced on the replacement server and the process can complete in a few minutes.
We achieved this by finding the EBS volumes that were attached to the terminated instance (using its tags) and attaching them to the new instance before starting the MongoDB process there. The volumes created with the new instance are deleted. Thus, when the new server joins the RS, only the delta gets synced. This logic is also part of the Ansible playbook.
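Our implementation of this lives in the Ansible playbook. Purely as an illustration, the same lookup-and-attach logic could look like the Python sketch below. It expects an object with the `describe_volumes` and `attach_volume` methods of boto3's EC2 client. The tag key `mongo-group`, the device name and the stub class are assumptions made so that the example runs without AWS access.

```python
from typing import List, Sequence


def reattach_preserved_volumes(ec2, instance_id: str, group: str, az: str,
                               devices: Sequence[str] = ("/dev/sdf",)) -> List[str]:
    """Attach the detached volumes tagged with these coordinates to instance_id."""
    reply = ec2.describe_volumes(Filters=[
        {"Name": "tag:mongo-group", "Values": [group]},   # hypothetical tag key
        {"Name": "availability-zone", "Values": [az]},    # volumes are AZ-bound
        {"Name": "status", "Values": ["available"]},      # detached volumes only
    ])
    volume_ids = sorted(v["VolumeId"] for v in reply["Volumes"])
    if len(volume_ids) > len(devices):
        raise ValueError("more preserved volumes than device names")
    for volume_id, device in zip(volume_ids, devices):
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_ids


class StubEC2:
    """Stand-in for boto3.client("ec2") so the sketch runs without AWS access."""

    def __init__(self, volume_ids):
        self.volume_ids = volume_ids
        self.attached = []

    def describe_volumes(self, Filters):
        return {"Volumes": [{"VolumeId": v} for v in self.volume_ids]}

    def attach_volume(self, VolumeId, InstanceId, Device):
        self.attached.append((VolumeId, InstanceId, Device))
        return {}


stub = StubEC2(["vol-0abc"])
reattached = reattach_preserved_volumes(stub, "i-0new", "srs0", "us-west-2a")
```

The sketch leaves out two steps a real run needs: waiting until the old volume has finished detaching from the terminated instance, and detaching and deleting the empty volume that was created with the new instance.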
The proof is in the pudding
We deployed this solution about 18 months ago. In that time, we’ve had approximately 6 instance failures in our cluster. Our deployment was able to handle all of these with ZERO downtime, ZERO data loss and ZERO manual intervention.
We were also able to downscale the original EBS volumes since they were over-provisioned. We simply updated CloudFormation with the new EBS size, and terminated the instances one by one, until the entire fleet was replaced. This too incurred no downtime.
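A rolling replacement like this can be expressed as a small loop. We did it by hand, so the sketch below is only an assumption about how it could be scripted. The `terminate` and `is_replaced` callables are hypothetical hooks: the first would terminate an instance, the second would report whether its replacement has rejoined the replica set and caught up.

```python
import time
from typing import Callable, Iterable, List


def rolling_replace(instance_ids: Iterable[str],
                    terminate: Callable[[str], None],
                    is_replaced: Callable[[str], bool],
                    poll_seconds: int = 30,
                    timeout_seconds: int = 3600,
                    sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Terminate instances one at a time, waiting for each replacement to be healthy."""
    done = []
    for instance_id in instance_ids:
        terminate(instance_id)
        waited = 0
        while not is_replaced(instance_id):
            if waited >= timeout_seconds:
                raise TimeoutError(f"replacement for {instance_id} not healthy in time")
            sleep(poll_seconds)
            waited += poll_seconds
        done.append(instance_id)  # only now is it safe to move to the next one
    return done


# Stand-ins so the sketch runs without AWS: every second health check succeeds.
terminated = []
checks = {"count": 0}


def fake_is_replaced(instance_id: str) -> bool:
    checks["count"] += 1
    return checks["count"] % 2 == 0


replaced = rolling_replace(["i-a", "i-b", "i-c"], terminated.append,
                           fake_is_replaced, sleep=lambda seconds: None)
```

Replacing one member at a time is what keeps the replica set available throughout: with 3 members, losing a single one still leaves a majority.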