Migrating to MongoDB Atlas with no downtime

Our experience in migrating the MongoDB production clusters to Atlas

Flavio Alberti
THRON tech blog
8 min read · May 29, 2019

--

Introduction

THRON is a multi-tenant platform that delivers services to its customers using MongoDB as one of its main databases. We started using MongoDB about 9 years ago, when version 1.8 was current, because of its potential as a document (NoSQL) database in terms of scalability and fault tolerance, and because of how easy it was to evolve the DB schema. After 9 years this decision is still proving to be a good one.

In this article we describe how we migrated our MongoDB clusters (which were managed by us, on our AWS infrastructure) to the managed Atlas service, while keeping our services highly available and preventing users from detecting any change.

What is Atlas

Atlas is a DB-as-a-Service solution from MongoDB. It provides an administration console where customers can easily deploy clusters and change configuration, sizing and security parameters on different cloud providers (AWS, Google Cloud, MS Azure). One of the advantages of this approach is the “PayAsYouGo” business model, which allows us to optimise costs for time-limited workloads.

Why migrate to a DBaaS

We believe in the values of SaaS, and we like to eat our own dog food: switching to a DB as a Service gives us higher flexibility in adding new clusters and resizing existing ones, and reduces management costs. It’s also worth mentioning that MongoDB has evolved its strategy, and MongoDB Professional now has a lower priority compared to the “as a service” offering.

Atlas allows MongoDB to:

  • prevent MongoDB deployments on unsupported architectures;
  • reduce configuration errors by customers;
  • apply optimisations and fixes that might be complex to maintain (when they involve the kernel, filesystem, Linux process management, etc.);
  • reduce compatibility problems by providing proven solutions deeply integrated with the three main public cloud vendors.

For our specific needs, we also identified the following benefits, compared to the previous MongoDB Professional:

  • automated cluster management: it’s possible to reconfigure a cluster as needed in a very easy and quick way. Clusters can be scaled vertically (up and down) and horizontally (adding and removing nodes), and security patches and minor releases can be easily deployed on cluster nodes.
  • easy restore of a new cluster from an existing backup: this feature is not just important for disaster recovery purposes; it’s also often used in development or quality assurance stages to test scripts on real data. The ability to create a new cluster with real data within a few minutes, paying just for the time you use it, is a precious help.

Atlas vs MongoDB

Being a managed service, Atlas doesn’t provide the same flexibility or feature set as a self-managed cluster. During our MongoDB years we began leveraging some features that were available in MongoDB but not in Atlas, such as:

  1. replicaset advanced settings: some replicaset configurations are not available, such as master node priority and which nodes can elect the master. It’s also not possible to access clusters in “administrator” mode and reconfigure them through the command line with “rs.conf({…})”;
  2. ReadPreference with tag preferences: some of our applications were using the ReadPreference directive and tags to partition read queries across specific slave nodes. We defined different cluster groups and different tags to balance queries to specific nodes, to control performance and resource allocation;
  3. arbiter nodes: the smallest replicaset cluster in Atlas has 3 nodes, all of them master-eligible, so they all contain data. It’s not possible to define a replicaset cluster that includes arbiter nodes. In some (limited) scenarios we were using this setup to manage clusters with fewer data-bearing nodes.
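As an illustration of the second point, tag-based read routing is typically expressed through connection-string options. The sketch below (plain Python, with hypothetical host names and tag values, not our production code) builds such a URI:

```python
# Sketch of the tag-based read routing we had to drop. A readPreference of
# "secondary" plus a tag set routes reads only to secondaries carrying a
# matching tag (e.g. workload:analytics). Host and tag names are made up.
from urllib.parse import urlencode

def build_uri(hosts, replica_set, tags):
    """Build a MongoDB connection string that pins reads to tagged secondaries."""
    params = {
        "replicaSet": replica_set,
        "readPreference": "secondary",
        # one tag set, expressed as comma-separated key:value pairs
        "readPreferenceTags": ",".join(f"{k}:{v}" for k, v in tags.items()),
    }
    return f"mongodb://{','.join(hosts)}/?{urlencode(params)}"

uri = build_uri(["db1.example.com:27017", "db2.example.com:27017"],
                "rs0", {"workload": "analytics"})
print(uri)
```

Removing this dependency meant every application reading through such a URI had to be redesigned to work against an untagged Atlas replicaset.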

Migration strategy

Migrating to a different infrastructure while keeping direct management of the DB would have been simpler, because we would have retained control over configurations, cluster topology and networking. This would have let us create a new node on the new infrastructure, synchronise it and then gradually remove nodes from the old cluster while adding nodes to the new one.

Unfortunately this migration strategy is not possible when moving to an “as a Service” architecture, where you are forced to create a new cluster from scratch, migrate data from one cluster to the other, ensure the synchronisation is complete and ultimately migrate the query workload. This also means you have to update all the applications that use the DB.

This means there’s a critical “toggle” moment when you move the production load from the old infrastructure to the new one, which calls for a carefully defined migration process. We made sure to take the following steps:

  • plan a technical assessment with MongoDB architects: you are not alone, so don’t do it alone. A technical review with MongoDB architects may lead to new ideas or highlight details you might be overlooking or underestimating;
  • establish peak performance on the “old” architecture and test the “new” one against the same load to size it: we performed load tests on both MongoDB and Atlas to verify performance patterns and ensure we were correctly sizing the Atlas infrastructure;
  • accurately define the migration plan: toggling from “old” to “new” means there will be a downtime; you have to ensure it’s as short as possible and happens at the least critical moment;
  • ensure your monitoring and alerting are working perfectly during the migration: changing architecture means you might need to add new probes and alarms, because you want to be sure everything is perfect during the critical migration phase, even the things you stopped monitoring over time because “they just work all the time”;
  • accurately define the rollback plan: always prepare for the worst-case scenario. This is a critical factor in managing the migration; you have to prepare a plan that is ready after each single step of the migration to manage a potential failure of the main strategy.

The worst-case scenario we identified was experiencing infrastructure or performance issues on Atlas after completing the migration. In this scenario we couldn’t just go back to the previous MongoDB cluster, because of the time required to sync data between the two environments. We chose to manage this scenario with an additional mongomirror sync process that fed data from Atlas back to a new backup cluster in the old infrastructure.

Update production MongoDB-related apps to match Atlas features

Before migrating data to Atlas we had to update the production environment to match the feature set of Atlas, since the two solutions are not equivalent. We were using some “advanced” configurations in our replicasets and we had also implemented a custom query routing policy using node tags. Removing those feature dependencies was not just a matter of configuring the DB: it also meant re-designing parts of the applications or the architecture to ensure performance and availability despite removing the query routing.

Ensuring same performance on new cluster

Measuring Atlas performance might seem a waste of time, especially because Atlas is hosted on AWS too and provides a 1-to-1 match regarding instance sizes, but doing it ensured there were no surprises or overheads that could affect cluster performance given the same sizing.

Atlas instance naming differs from the EC2 one, but it’s easy to spot the similarities: an Atlas M50 node has 32GB of RAM like an EC2 m4.2xlarge node on AWS, and so on.

How did we test production load on Atlas cluster? By using mongoreplay and mongomirror.

With mongoreplay we recorded one full day of queries against each production cluster node; with mongomirror we copied data from the “old” cluster to the “new” one. mongomirror transferred the whole write traffic (inserts and updates) to the Atlas cluster, while mongoreplay was used to replay the read traffic against the same destination cluster.

After a few days of mongomirror + mongoreplay we just had to compare the monitoring graphs to ensure there were no anomalies or behavioural differences between the cluster in Atlas and the production cluster in AWS.
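The graph comparison can be sketched as a simple tolerance check on the same metric sampled on both clusters. The values and the 20% threshold below are made up for illustration; they are not figures from our migration:

```python
# Hedged sketch: flag the samples where the Atlas cluster deviates from the
# production cluster beyond a relative tolerance. Metric values are invented.

def flag_anomalies(prod_samples, atlas_samples, tolerance=0.20):
    """Return indices where Atlas deviates from production by more than `tolerance`."""
    anomalies = []
    for i, (p, a) in enumerate(zip(prod_samples, atlas_samples)):
        if p > 0 and abs(a - p) / p > tolerance:
            anomalies.append(i)
    return anomalies

prod_p95_ms  = [12.0, 14.5, 13.2, 15.0]   # per-hour p95 latency, old cluster
atlas_p95_ms = [12.4, 14.1, 19.8, 15.2]   # same metric on the Atlas cluster
print(flag_anomalies(prod_p95_ms, atlas_p95_ms))  # [2]: hour 2 is ~50% off
```

In practice we eyeballed the dashboards rather than scripting this, but the principle is the same: any sustained divergence at equal load is a sizing or configuration smell.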

Data migration

In order to keep data in sync between the two clusters, you can use either LiveMigration or mongomirror.

LiveMigration is an agent managed by Atlas that doesn’t require any installation by the customer and keeps the data sync active for up to 72 hours.
We chose a more “low level” approach with mongomirror (which requires installation), which reads oplogs from the source and replays them on the destination cluster. mongomirror has no such time limitation and can be kept active as long as necessary for the migration.

Switch-off the old one / Switch-on the new one

The actual “go live” phase was quite easy thanks to all the preparation and testing that had been done before. In order to reduce the downtime to the shortest duration possible, we automated all the procedures involved in the switch.

Our applications are all deployed in containers managed through AWS ECS; we scripted all the application updates so that the database connection string could be changed as fast as possible, and we could switch DBs just by deploying the new containers.
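A minimal sketch of that idea (variable and field names are hypothetical, and the real scripts also called the ECS APIs to register and deploy the new revision): rewrite the connection-string environment variable in an ECS task definition, so that redeploying the container points the application at Atlas.

```python
# Sketch: swap the DB connection string in an ECS task definition dict.
# "MONGODB_URI" and the host names are illustrative, not our actual values.
import copy

def retarget_task_definition(task_def, new_uri, var_name="MONGODB_URI"):
    """Return a copy of the task definition with the DB connection string swapped."""
    updated = copy.deepcopy(task_def)
    for container in updated["containerDefinitions"]:
        for env in container.get("environment", []):
            if env["name"] == var_name:
                env["value"] = new_uri
    return updated

old = {"containerDefinitions": [{"name": "api",
        "environment": [{"name": "MONGODB_URI",
                         "value": "mongodb://old-cluster.internal:27017/app"}]}]}
new = retarget_task_definition(old, "mongodb+srv://cluster0.example.mongodb.net/app")
print(new["containerDefinitions"][0]["environment"][0]["value"])
```

Keeping the old task definition untouched is what makes the rollback cheap: redeploying the previous revision restores the old connection string.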

We then compared the source’s (MongoDB) and the destination’s (Atlas) oplogs to ensure we hadn’t missed any write request or performed duplicate writes.
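The consistency check boils down to a set comparison over oplog entries. The sketch below uses made-up entry identifiers; a real check would read the `oplog.rs` collection and key on fields that identify each operation:

```python
# Illustrative sketch of the oplog comparison: find writes missing on the
# destination and writes applied more than once. Entry IDs are hypothetical.
from collections import Counter

def diff_oplogs(source_entries, dest_entries):
    """Return (entries missing on destination, entries applied more than once)."""
    dest_counts = Counter(dest_entries)
    missing = [e for e in source_entries if dest_counts[e] == 0]
    duplicated = [e for e, n in dest_counts.items() if n > 1]
    return missing, duplicated

source = [("ts1", "op1"), ("ts2", "op2"), ("ts3", "op3")]
dest = [("ts1", "op1"), ("ts2", "op2"), ("ts2", "op2")]
missing, dup = diff_oplogs(source, dest)
print(missing)  # a write that never reached Atlas
print(dup)      # a write applied twice
```

An empty result on both sides is the signal that the switch lost nothing and duplicated nothing.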

Always remember to stop mongomirror immediately once the connection string update has been performed. If you don’t, stale writes replayed from the old cluster might overwrite data on the destination cluster.

Conclusion

Carefully identifying all the failure scenarios and planning for each one of them was good for team morale and helped remove fears and uncertainty.

The most important part of the migration process is to plan for each single failure scenario and to automate as much as possible the actions that manage those scenarios.

All our clusters are now using Atlas and we are reaping all the benefits we were looking for:

  • development teams are more independent in managing their clusters; before Atlas, devops skills were often required to perform DB maintenance/evolution;
  • creating new clusters from existing backups is much easier, and it greatly helps the development and testing phases, especially when you need to ensure there are no regressions.

What was not good

  • some costs are difficult to plan when designing the migration; a clear example is bandwidth consumption;
  • during the migration we had to cope with lots of warnings caused by excessive disk usage, since memory cache hit rates were low in the first minutes after migration;
  • it’s hard to estimate the available disk burst performance, because we are currently using Atlas instances with no provisioned IOPS.

Overall it was a success: no support tickets or internal alerts were generated by this DB switch, and we are satisfied with the result. It is not possible to cover every detail in this article, so feel free to contact us to discuss it further, or leave a comment.
