

Sleep better at night! Here is how to respawn a MongoDB Server automatically!

Who should read it?

Why should you read it?

  • Build knowledge base on the above situation
  • Steps required to restore the fault tolerance value > 0
  • Automate the process in AWS

What will you learn?

  • Worst case scenarios when fault tolerance of replica sets is 0
  • Typical process followed by operations team to resolve issue
  • Ways to speed up the replacement process via automation
  • Pros and Cons of the re-spawn approach
  • Scope for enhancements to fit your needs
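To make the "fault tolerance is 0" case above concrete, here is a minimal sketch of the arithmetic, assuming every member of the replica set is a voting member. The function names are illustrative and not part of any MongoDB API.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect (or keep) a primary."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int, healthy_members: int) -> int:
    """How many more members can fail before the set loses its majority."""
    return max(healthy_members - majority(voting_members), 0)
```

For a 3-member set, `fault_tolerance(3, 3)` is 1. Once one member is down, `fault_tolerance(3, 2)` is 0: the set still has a primary, but one more failure leaves it without a majority.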

What’s the solution?

  • CloudWatch, Lambda to detect a dead/deteriorating EC2 instance
  • Auto Scaling to re-spawn and replace a dead server
  • Ansible & Resource Tags to associate disk volumes to EC2 instances

  • Route53, Lambda to resolve/update the DNS/server names
  • Let the new replica set member sync up automatically


Cautionary note

Source code


  • Diagnosis of the server issue
  • Analyze if replacement of the server is warranted
  • If replacement of server is required
  • Re-configure the MongoDB replica set
  • Data synchronization on the new server
  • Update application connection string to use new server

High availability

The solution — Manual version

  • Replace the unresponsive server with a new replacement server
  • Reconfigure the replica set with new replacement server
  • Wait for data synchronization to complete
  • Update the application connection string with new server
MongoDB Server Replacement
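The reconfiguration step in the manual version can be sketched as a small helper that swaps one member's host in a replica set configuration document. This is only an illustration, assuming the document has the usual `version` and `members` fields; `replace_member` is a hypothetical helper, not a MongoDB or PyMongo function.

```python
import copy


def replace_member(config: dict, old_host: str, new_host: str) -> dict:
    """Return a copy of the replica set config with old_host swapped for new_host."""
    new_config = copy.deepcopy(config)
    matched = False
    for member in new_config["members"]:
        if member["host"] == old_host:
            member["host"] = new_host
            matched = True
    if not matched:
        raise ValueError(f"{old_host} is not a member of this replica set")
    # A reconfig submitted as a raw command needs a higher version number.
    new_config["version"] += 1
    return new_config
```

With PyMongo, the current document would typically come from `client.admin.command("replSetGetConfig")["config"]`, and the result would be submitted with `client.admin.command("replSetReconfig", new_config)` against the primary.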

Faster data synchronization

  • Un-mount the MongoDB data volume — Disk 3 from Server 03
  • Mount the data volume — Disk 3 on the replacement server Server 04
  • Use of a separate data volume: The MongoDB data files are written to a disk volume separate from the underlying operating system disks, so that the volume can be un-mounted and re-mounted elsewhere.
  • Re-mountable disks: EBS volumes can be attached to a replacement EC2 instance. However, EC2 Instance Store / Ephemeral disks are physically attached to the host and cannot be mounted on another instance.
  • Disk or data is not corrupted: If the disk volume or MongoDB data files are corrupted, re-mounting the same disk volume on the replacement server carries the corruption over. To fix a corrupted disk or corrupted data you may have to perform a full initial sync on a new disk volume.
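The re-mount described above can be sketched in Python. This is not the article's Ansible script; it is a stand-in that assumes a boto3 EC2 client is passed in as `ec2`, that the data volume carries an identifying tag, and that the volume has already detached from the dead instance (status `available`). The tag key and value, the device name, and both function names are hypothetical.

```python
def find_data_volume(ec2, tag_key: str, tag_value: str) -> str:
    """Return the ID of the single available EBS volume carrying the given tag."""
    response = ec2.describe_volumes(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "status", "Values": ["available"]},
        ]
    )
    volumes = response["Volumes"]
    if len(volumes) != 1:
        raise LookupError(f"expected exactly one volume, found {len(volumes)}")
    return volumes[0]["VolumeId"]


def attach_data_volume(ec2, instance_id: str, tag_key: str, tag_value: str,
                       device: str = "/dev/sdf") -> str:
    """Attach the tagged data volume to the replacement instance."""
    volume_id = find_data_volume(ec2, tag_key, tag_value)
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

In practice `ec2` would be `boto3.client("ec2")`. After the attach call the volume still has to be mounted inside the operating system before `mongod` starts.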

The solution — Automated version

  • Creating a replacement server
  • Reconfiguring the replica set with the new member
  • Mounting of re-used disk volume & data synchronization

Automate: Mounting of data volumes

AWS Resource Tags

Avoid replica set reconfiguration

  • VPC Setup & Components
  • Dynamic DNS

VPC Setup & Components

AWS VPC Components

Dynamic DNS

Sequence of events while spawning a new server
  1. Server 03 (ip- in the replica set is dead
  2. The other members of the replica set see Server 03 as 'Unreachable'
  3. Fault tolerance of the replica set has now dropped to 0, with only 2 out of 3 members up and running.
  4. A new replacement server Server 04 (ip- is provisioned
  5. The Ansible script on new server copies various tags from Disk 03 onto itself (Server 04)
  6. The script updates hostname of server, finds & mounts the Disk 03 using resource tags
  7. The CloudWatch event for ‘EC2 instance state: running’ on Server 04 is triggered
  8. The AWS Lambda function subscribed to the above CloudWatch event runs the Python code
  9. The Python code fetches the CNAME / ZONE resource tags on Server 04 and updates the A, PTR, and CNAME records in Route53 (replacing Server 03 with Server 04)
  10. The DNS name, skamon_demoapp_rs03, is now resolved to Server 04
  11. The replication process starts and keeps all the members synchronized and the fault tolerance of the replica set will be back to 1
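Steps 7 to 9 above can be sketched as two small Python helpers: one that reads the instance ID from an EC2 state-change event, and one that builds the Route53 change batch. This is an illustrative sketch, not the article's Lambda code. The helper names and the TTL are assumptions, and the PTR record is left out because it lives in a separate reverse-lookup zone.

```python
from typing import Optional


def running_instance_id(event: dict) -> Optional[str]:
    """Return the instance ID if the event reports state 'running', else None."""
    detail = event.get("detail", {})
    if detail.get("state") != "running":
        return None
    return detail.get("instance-id")


def upsert(name: str, record_type: str, value: str, ttl: int = 60) -> dict:
    """Build one UPSERT change for a Route53 record set."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": record_type,
            "TTL": ttl,
            "ResourceRecords": [{"Value": value}],
        },
    }


def build_change_batch(alias_name: str, host_name: str, private_ip: str) -> dict:
    """Point the host name at the new IP and the stable alias at the host name."""
    return {
        "Comment": "Repoint replica set member to replacement server",
        "Changes": [
            upsert(host_name, "A", private_ip),
            upsert(alias_name, "CNAME", host_name),
        ],
    }
```

The resulting batch would be passed to `change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` on a boto3 Route53 client. Because the replica set members address each other by the stable alias, the replica set configuration itself does not change.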

Automate: Provisioning the replacement server

  • Create and configure a new auto scaling group
  • With the EC2 instances of your MongoDB replica set in the group
  • Set the group to always maintain a min/max count = replica set member count
  • Use an AMI with all required tools and software in the launch configuration
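The group settings listed above can be sketched as the parameters for a boto3 `create_auto_scaling_group` call. The helper name, group name, launch configuration name, and subnet IDs are placeholders, and the choice of EC2 health checks is an assumption rather than something the article specifies.

```python
from typing import Dict, List, Union


def replica_set_asg_params(group_name: str, launch_config_name: str,
                           member_count: int,
                           subnet_ids: List[str]) -> Dict[str, Union[str, int]]:
    """Build Auto Scaling group parameters pinned to the replica set size."""
    if member_count < 1:
        raise ValueError("member_count must be at least 1")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchConfigurationName": launch_config_name,
        # min = max = desired, so a dead member is replaced but the set never grows
        "MinSize": member_count,
        "MaxSize": member_count,
        "DesiredCapacity": member_count,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
    }
```

The dictionary would be unpacked into `boto3.client("autoscaling").create_auto_scaling_group(**params)`. Existing replica set instances can then be added to the group with `attach_instances`.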

Use cases & Scope for improvement

Other use case scenarios

Scope for improvement

Cross data center support

Network Partition

Multiple Data Centers with Network Partition

Multiple failures at same time

AWS credentials

Eliminate cron jobs

Closing thoughts


