MongoDB cluster migration with zero downtime

Sašo Matejina
Nov 2, 2016

Two months ago we decided to migrate our clusters to Atlas, a new MongoDB cloud service provided by MongoDB, Inc. The reasons for our move to Atlas include a better security model with at-rest encryption and simple but powerful cluster management. We run several MongoDB clusters varying in size from 50GB to 500GB that are growing fast and requiring more and more speed.

Migrating databases of this magnitude is hard, and it takes a lot of preparation to get it right. No matter how well prepared you are, in the moment before the switch you will be nervous and wondering whether any data could end up corrupted.

Checkr provides the background check API for a variety of on-demand companies like Uber, Instacart, Postmates and many others. It’s crucial for us to provide them with excellent and responsive services.

The standard way to migrate data with MongoDB is the traditional dump and restore method, which would have required hours of downtime for us. We decided to find another solution and experiment with live syncing and hot swapping to a secondary cluster.

It seemed like a crazy idea at first, but after writing some demo migration tools we saw that it was possible: the sync could handle the load and keep the new cluster in step with the primary. We built a new tool called Go-Sync-Mongo and open-sourced it so it could help other people trying to solve the same problem.

You can do your migration in four straightforward steps: dump, restore, sync, and switch. Start by dumping the database.

$ mongodump --out ./checkr-logs --username $SRC_USER --password $SRC_PASSWORD --host "rs-0/src.host1:27017,…/admin?replicaSet=rs-0" --oplog

Restore the dump to the new cluster.

$ mongorestore --host "rs-1/dst.host1:27017,.../admin?replicaSet=rs-1" --ssl --username $DST_USER --password $DST_PASSWORD --authenticationDatabase admin --dir checkr-logs/ --oplogReplay

After the dump and restore are done, the new cluster is still missing the records written in the meantime. These can be synced with the Go-Sync-Mongo tool.

$ go run main.go sync --src "mongodb://src.host1:27017,.../admin?replicaSet=rs-0" --src-username $SRC_USER --src-password $SRC_PASSWORD --src-ssl=true --dst "mongodb://dst.host1:27017,.../admin?replicaSet=rs-1" --dst-username $DST_USER --dst-password $DST_PASSWORD --dst-ssl=true --since 1477670219

As you can see, the tool lets you set the source and destination hosts, plus a since flag (a Unix timestamp in seconds) that tells the app how far back to go in the oplog. It first replays all selected oplog commands, and once it reaches the end of the oplog it starts tailing it and replicating all new commands to the destination cluster. You can go back as far as the oplog still holds entries; replaying more than strictly necessary is safe because each operation in the oplog is idempotent.
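To make the idempotency argument concrete, here is a minimal in-memory sketch. It is not Go-Sync-Mongo's code: `Op`, `Replay`, and `SinceToTime` are hypothetical names, the destination is a plain map instead of a cluster, and the simplified "set"/"delete" operations only stand in for real oplog entries. It assumes `--since` is compared against each entry's timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// Op is a simplified stand-in for an oplog entry (hypothetical shape, not
// the real MongoDB oplog document format).
type Op struct {
	TS   int64  // Unix seconds when the operation was logged
	Kind string // "set" (write the full document) or "delete"
	ID   string // document _id
	Doc  string // document body for "set" operations
}

// Replay applies every op with TS >= since to dst, in order, and returns how
// many ops it applied. "set" overwrites and "delete" removes, so applying an
// op twice leaves dst in the same state as applying it once.
func Replay(dst map[string]string, oplog []Op, since int64) int {
	applied := 0
	for _, op := range oplog {
		if op.TS < since {
			continue
		}
		switch op.Kind {
		case "set":
			dst[op.ID] = op.Doc
		case "delete":
			delete(dst, op.ID)
		}
		applied++
	}
	return applied
}

// SinceToTime renders a --since value as a UTC timestamp for sanity checking.
func SinceToTime(since int64) string {
	return time.Unix(since, 0).UTC().Format(time.RFC3339)
}

func main() {
	oplog := []Op{
		{TS: 1477670100, Kind: "set", ID: "a", Doc: "v1"},
		{TS: 1477670219, Kind: "set", ID: "b", Doc: "v1"},
		{TS: 1477670300, Kind: "set", ID: "a", Doc: "v2"},
		{TS: 1477670400, Kind: "delete", ID: "b"},
	}
	// Destination as restored from the dump: it already holds "a" and "b".
	dst := map[string]string{"a": "v1", "b": "v1"}

	fmt.Println(SinceToTime(1477670219)) // 2016-10-28T15:56:59Z
	fmt.Println(Replay(dst, oplog, 1477670219), dst)

	// Replaying an overlapping (or longer) range leaves dst unchanged.
	Replay(dst, oplog, 0)
	fmt.Println(dst)
}
```

The practical consequence is that `--since` does not have to be exact: picking a timestamp somewhat earlier than needed only costs replay time, as long as the oplog still holds entries that far back.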

While the sync is running you can use the status command that takes the last inserted _id from each collection and counts all the records that were created before it. That way you can compare the record count between the SRC and DST clusters and see if you are missing any.

$ go run main.go status --src "mongodb://src.host1:27017,.../admin?replicaSet=rs-0" --src-username $SRC_USER --src-password $SRC_PASSWORD --src-ssl=true --dst "mongodb://dst.host1:27017,.../admin?replicaSet=rs-1" --dst-username $DST_USER --dst-password $DST_PASSWORD --dst-ssl=true --since 1477670219

+----------------------+----------+-------------+------+
| DB                   | SOURCE   | DESTINATION | DIFF |
+----------------------+----------+-------------+------+
| checkr-logs          | 65766327 | 65766327    | 0    |
+----------------------+----------+-------------+------+
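The counting idea behind the status command can be sketched in a few lines. This is an illustration under assumptions rather than the tool's implementation: `_id` values are modelled as integers that grow with insertion time, each side is counted up to its own last `_id`, and `LastID`, `CountUpTo`, `StatusRow`, and `Status` are hypothetical names. The DIFF is taken as source minus destination, which matches the tables in this post.

```go
package main

import "fmt"

// LastID returns the largest _id in a collection. IDs are modelled as plain
// integers that grow with insertion time; ok is false for an empty collection.
func LastID(ids []int) (last int, ok bool) {
	if len(ids) == 0 {
		return 0, false
	}
	last = ids[0]
	for _, id := range ids[1:] {
		if id > last {
			last = id
		}
	}
	return last, true
}

// CountUpTo counts the records whose _id is at or before cutoff.
func CountUpTo(ids []int, cutoff int) int {
	n := 0
	for _, id := range ids {
		if id <= cutoff {
			n++
		}
	}
	return n
}

// StatusRow mirrors one line of the status table.
type StatusRow struct {
	DB          string
	Source      int
	Destination int
	Diff        int
}

// Status counts each side up to its own last _id and reports the difference
// as source minus destination.
func Status(db string, src, dst []int) StatusRow {
	count := func(ids []int) int {
		cutoff, ok := LastID(ids)
		if !ok {
			return 0
		}
		return CountUpTo(ids, cutoff)
	}
	s, d := count(src), count(dst)
	return StatusRow{DB: db, Source: s, Destination: d, Diff: s - d}
}

func main() {
	src := []int{1, 2, 3, 4, 5}
	dst := []int{1, 2, 3, 4} // one record has not been synced yet
	row := Status("checkr-logs", src, dst)
	fmt.Printf("%s source=%d destination=%d diff=%d\n",
		row.DB, row.Source, row.Destination, row.Diff)

	// Fixing the cutoff first keeps the count stable even if new records
	// arrive while the count is running.
	cutoff, _ := LastID(src)
	src = append(src, 6, 7)
	fmt.Println(CountUpTo(src, cutoff)) // still 5
}
```

Counting up to a fixed `_id` instead of counting everything gives a number that does not shift while writes keep arriving, which is what makes the two sides comparable on a live cluster.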

After your clusters are in sync it’s time to flip the switch by redirecting all writes and reads to the destination cluster.

Depending on your setup, this means deploying new configs or changing an environment variable. Once that is done, run the status command again and you should see a negative diff, because new writes now land only on the destination cluster.

+----------------------+----------+-------------+--------+
| DB                   | SOURCE   | DESTINATION | DIFF   |
+----------------------+----------+-------------+--------+
| checkr-logs          | 65766327 | 65767327    | -1000  |
+----------------------+----------+-------------+--------+
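As a small reading aid for the DIFF column, the hypothetical helper below spells out what each sign means before and after the switch. It assumes DIFF is the source count minus the destination count, as the two tables in this post suggest; `Interpret` is not part of Go-Sync-Mongo.

```go
package main

import "fmt"

// Interpret explains a DIFF value computed as source count minus
// destination count.
func Interpret(diff int) string {
	switch {
	case diff > 0:
		return "destination is behind: keep syncing"
	case diff == 0:
		return "in sync: ready to switch traffic"
	default:
		return "destination is ahead: new writes are landing on it"
	}
}

func main() {
	fmt.Println(Interpret(65766327 - 65766327)) // before the switch
	fmt.Println(Interpret(65766327 - 65767327)) // after the switch
}
```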

Keep the oplog sync running for a couple of minutes after the switch so that any writes still arriving at the old cluster are synced to the new one.

Go-Sync-Mongo made it possible for Checkr to keep traffic to our apps uninterrupted even during cluster migrations. The same procedure works for a DB that is 50GB or 500GB+ in size. A larger database takes longer to dump and restore, but once the clusters are in sync it only takes a second to switch the traffic to the new cluster.

Checkr Engineering

Build a fairer future by improving understanding of the past
