Restoring MongoDB Replica Set Case Study

Tomer Froumin
5 min read · Aug 15, 2019

There’s nothing more terrifying and disastrous than discovering your production database is not working. There’s nothing more nerve-wracking than trying to recover it while keeping everything else running somewhat smoothly. This is exactly what happened to us recently.

Some background

It was a quiet, peaceful weekend. Or at least it seemed so. The day before, we had deployed a change to the system that caused an increase in the alerts we received. The engineer who worked on it noticed a connection problem to our database and restarted it. However, the MongoDB instance running inside a Kubernetes pod hit a timeout mounting its storage volume and stayed stuck in ContainerCreating status. Because the replica set still had 2 members up and running, the system kept functioning as normal. To add insult to injury, he turned off the monitoring while fixing the issue and forgot to re-enable it. Everything seemed to work properly, and he finally called it a day.
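
For context, this is roughly what diagnosing and restarting a stuck member looks like from the Kubernetes side; the namespace, label and pod name here are made up for illustration:

    # Namespace, label and pod name are assumptions for illustration
    kubectl get pods -n prod -l app=mongodb      # one member stuck in ContainerCreating
    kubectl describe pod mongodb-1 -n prod       # the Events section shows the volume mount timeout
    kubectl delete pod mongodb-1 -n prod         # the restart that was attempted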

First and second attempts

I noticed it the next workday morning and started the pod, praying for a quick recovery. Unfortunately, MongoDB doesn’t keep the oplog for that long, and the node reported it couldn’t find a source to sync from. With the traditional recovery method ruled out, the next option was to clear the data directory and let the member initialize everything from scratch: a few hours of downtime, but we’d soon be operating back at 100% capacity.
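
For the curious, the oplog window can be checked from any healthy member; this is roughly the check that tells you a stale member cannot catch up (the hostname is a placeholder):

    # Run against a healthy member; the hostname is a placeholder
    mongo --host mongodb-0.mongodb.prod.svc.cluster.local --eval 'rs.printReplicationInfo()'
    # "log length start to end" is the oplog window; a member that was down for
    # longer than that is too stale to catch up and cannot find a sync source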

Let’s hold on for a minute and review our case. We have a MongoDB database configured as a replica set with 3 members, each on a dedicated machine managed by Kubernetes and backed by SSD storage holding nearly 2TB of data per member. All of this is connected to a system that handles thousands of requests per second.
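
In Kubernetes terms, that setup looks roughly like this (the namespace and labels are assumptions):

    # Namespace and labels are assumptions for illustration
    kubectl get pods -n prod -l app=mongodb -o wide   # each member runs on its own node
    kubectl get pvc -n prod -l app=mongodb            # the ~2TB SSD volume backing each member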

Copying the entire data set would take several hours. And it did, right up to the point when the connection failed for some reason. The good part is that there’s a retry. The bad part is that the same error occurred again, and again…

Third attempt

Time to think of a more creative solution to the problem. We knew the problem happens on the biggest collection in the database, which contains millions of documents. Maybe it was a good time to go on an information diet. We already had a TTL index to expire old documents that are no longer relevant, so we lowered the expiration time and cleared more than 50% of the data.
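
A TTL is lowered by modifying the existing TTL index in place with collMod rather than rebuilding it; a minimal sketch, with the database, collection, field and values made up:

    # Database, collection, field and values are illustrative
    mongo --host mongodb-0.mongodb.prod.svc.cluster.local --eval '
      db.getSiblingDB("mydb").runCommand({
        collMod: "events",
        index: { keyPattern: { createdAt: 1 }, expireAfterSeconds: 60 * 60 * 24 * 7 }
      })'

The TTL monitor then deletes the newly expired documents in the background, which is how more than half of the data went away.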

As you’ve probably guessed by now, that too didn’t work. There’s just too much data to replicate.

Fourth and fifth attempt

We came to the realization that syncing the database from scratch in a reasonable time was just not possible. So how do we create a new member that already has the data and can simply connect to the other members and continue as if nothing ever happened? According to the MongoDB documentation, there are two ways to restore a replica set:

  1. Using the initial sync — the tried and failed way we just did.
  2. Cloning the storage of an existing member.

Alright, let’s review how the second method is done:

Short version: we need to clone the storage, copy it to a new machine, start up a MongoDB server as a standalone, remove the old configuration, restart the server as a replica set member and initialize a new configuration. Then do the same for the other members and add them to the replica set. This covers a completely new cluster, though, which is not our case: we need to add a new member to an already functioning replica set.
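
Condensed into commands, the documented procedure for a brand-new replica set looks something like this (paths, port and replica set name are placeholders):

    # 1. Start a standalone mongod on top of the copied data files
    mongod --dbpath /data/db --port 27017

    # 2. Drop the old replica set metadata, which lives in the "local" database,
    #    then shut the standalone down
    mongo --port 27017 --eval 'db.getSiblingDB("local").dropDatabase()'

    # 3. Restart with a replica set name and initiate a brand-new configuration
    mongod --dbpath /data/db --port 27017 --replSet rs0
    mongo --port 27017 --eval 'rs.initiate()'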

Having the servers managed by Kubernetes offers the benefit that the storage is independent of the machine and can be easily duplicated. In newer versions of k8s, it is possible to create a snapshot directly, but we used the cloud provider’s API to perform the operation.
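
As an illustration, on Google Cloud the snapshot-and-clone step would look something like this (disk, snapshot and zone names are placeholders; other providers have equivalents):

    # Snapshot the healthy member's disk, then create a fresh disk from the snapshot
    gcloud compute disks snapshot mongodb-data-1 --zone=us-east1-b --snapshot-names=mongodb-data-snap
    gcloud compute disks create mongodb-data-2 --zone=us-east1-b --source-snapshot=mongodb-data-snap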

We followed the instructions up to the point of configuring a new replica set. Instead of that step, we added the member to our existing replica set. As expected, it connected to the other members, downloaded the configuration and figured out it needed to sync the data. So it did the obvious thing: delete all existing collections and start an initial sync. Back to square one.
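
Adding the member to the existing replica set is a single command run from the current primary (hostnames and port are placeholders):

    # Run from a mongo shell connected to the current primary
    mongo --host mongodb-0.mongodb.prod.svc.cluster.local --eval '
      rs.add("mongodb-2.mongodb.prod.svc.cluster.local:27017")'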

In the end, we decided to repeat the procedure without clearing out the local database (the one holding the oplog and the replica set metadata). It turns out the member needed that history to detect that it already had the same data and could use it, instead of dropping everything and starting from scratch. As for the replica set configuration, it detected that the hostname had changed and took over the role of the missing member as a secondary.
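
To confirm it worked, the new member’s state can be watched from any other member; with its local database intact it should settle as SECONDARY and catch up, rather than wiping its data and starting an initial sync (hostname again a placeholder):

    # Print each member's name, state and replication progress
    mongo --host mongodb-0.mongodb.prod.svc.cluster.local --eval '
      rs.status().members.forEach(function(m) { print(m.name, m.stateStr, m.optimeDate); })'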

Going over the guide again, I now see that it has two cloning steps. The first is for the primary node of a new cluster; in that case you’re creating a new replica set, so of course it has to start anew, with no connection to previous servers. The second comes once the new primary is configured: you clone the storage and configuration of that server, and the new member, a secondary, uses the configuration of the freshly set up primary to understand that it is part of a replica set.

Conclusion and lessons

Recovery is never easy. Don’t get into a situation where you have to recover a database. Really, do everything you can to avoid it:

  • Never, ever, no matter what, turn off the alerts.
  • Always check that everything works; don’t just confirm that the error is fixed.

If you do happen to be in that situation, make sure to minimize any further potential problems, such as other people or tools that may access the resources. Next, try the standard recovery method, and make sure you understand the process and why each step is done. Don’t just copy commands and follow the guide; it may be outdated or may not match the case you are facing. Make sure it is relevant and that you comprehend the reasoning behind each step.

Tomer Froumin

Hired as the “entire DevOps team” at a startup shortly after taking a Docker course. Fast forward a year, contributed to Kubernetes and love to mentor others.