Building a Repeatable Solution for Cosmos DB to MongoDB Migration In-House

Alexander Komyagin
4 min read · Aug 6, 2024


When I managed Professional Services at MongoDB, and later built state-of-the-art database migration tools on the Product team, I saw many users moving production workloads from Azure’s Cosmos DB to MongoDB. Cosmos DB with MongoDB API is marketed as MongoDB compatible: “Azure Cosmos DB for MongoDB makes it easy to use Azure Cosmos DB as if it were a MongoDB database.” So a migration should be a walk in the park, right?

The Reality of Compatibility

Cosmos DB for MongoDB is compatible with a limited subset of MongoDB’s wire protocol, suitable only for the simplest MongoDB applications. This limited compatibility can lead to higher costs and unexpected challenges.

My original assumption was that MongoDB tools (mongodump, mongorestore, mongomirror, mongosync, etc.) would work with Cosmos out-of-the-box or with minimal changes. I couldn’t have been more wrong.

Building a repeatable solution for migrating from Cosmos to MongoDB is hard. It takes time to build, more time to test, still more to support, and then to keep fixing. Easily 6 to 8 months of work for a team. The more data and load you have, the longer it will take. It’s never too late to switch, but if you can, it pays off to start early.

Let’s explore how to build a solution for such a migration.

Key Concerns for Migration

Migrating production workloads involves addressing several critical concerns:

  1. Avoiding Excessive Downtime
  2. Speeding Up the Migration
  3. Preserving Data Integrity
  4. Implementing a Rollback Plan

Recommended Approach for Small Datasets

If you have less than 100 GB of data, a full dump-and-restore approach is advisable. It might take a few hours or more, but you’ll avoid many complexities. For Cosmos DB serverless offerings, this is the only option, as they don’t support Change Streams (MongoDB’s Change Data Capture mechanism).
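As a rough illustration, a dump-and-restore run can be scripted around the standard mongodump and mongorestore tools. The connection strings below are placeholders; the flags shown (--uri, --archive, --gzip, --drop) are standard, but verify them against the tool versions you actually have installed.

```python
import subprocess

COSMOS_URI = "<cosmos-connection-string>"    # placeholder
MONGODB_URI = "<mongodb-connection-string>"  # placeholder
ARCHIVE = "cosmos-dump.archive.gz"

# Dump all databases from Cosmos DB into a single compressed archive file.
subprocess.run(
    ["mongodump", f"--uri={COSMOS_URI}", f"--archive={ARCHIVE}", "--gzip"],
    check=True,
)

# Restore the archive into the destination MongoDB cluster.
# --drop replaces any existing collections with the same names.
subprocess.run(
    ["mongorestore", f"--uri={MONGODB_URI}", f"--archive={ARCHIVE}", "--gzip", "--drop"],
    check=True,
)
```

Keep in mind that anything written to Cosmos DB after the dump starts will not make it into the archive, so plan for a write freeze (or a short maintenance window) around the run.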

Avoiding Downtime

For larger datasets, avoiding downtime during migration is crucial. A live migration approach involves:

  • Initial Data Copy: Bulk copy the data from the source to the destination.
  • Capturing Changes: Use Change Data Capture (CDC) to capture changes from the beginning of the data copy process and apply them after the copy is done.
  • Replicating Changes: Continue replicating changes until the lag between the destination and the source is small enough to cut over.
(Figure: Architecture for Cosmos DB to MongoDB Atlas migration)
(Figure: Execution flow)
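To make this flow concrete, here is a minimal single-collection sketch in Python with pymongo. The connection strings, database, and collection names are placeholders, and the watch() call uses the plain MongoDB form; on Cosmos DB it needs the exact pipeline options covered in the next section.

```python
import threading
import queue

from pymongo import MongoClient

source = MongoClient("<cosmos-connection-string>")["appdb"]["orders"]        # placeholders
destination = MongoClient("<mongodb-connection-string>")["appdb"]["orders"]  # placeholders

events = queue.Queue()

def tail_changes():
    # Open the Change Stream BEFORE the bulk copy starts, so that writes made
    # during the copy are captured rather than lost in a gap.
    with source.watch(full_document="updateLookup") as stream:
        for change in stream:
            if change.get("fullDocument"):
                events.put(change["fullDocument"])

threading.Thread(target=tail_changes, daemon=True).start()

# Phase 1: initial bulk copy (assumes an empty destination collection).
batch = []
for doc in source.find({}):
    batch.append(doc)
    if len(batch) == 1000:
        destination.insert_many(batch, ordered=False)
        batch = []
if batch:
    destination.insert_many(batch, ordered=False)

# Phases 2 and 3: apply buffered and ongoing changes as full-document replaces.
# In a real tool this loop runs until the backlog is small enough to cut over.
while True:
    doc = events.get()
    destination.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```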

However, Cosmos DB’s limited support for Change Streams complicates this:

  • No Delete Events: Cosmos DB Change Streams don’t emit delete events. Implement safe-deletes in your application: mark deleted documents with a special field and use a TTL field for automatic cleanup. You will need to clean up those documents on the MongoDB side as well, and in any case you will likely need to change your application to make it all work (a sketch of this pattern follows the list).
  • Per-Collection Change Streams: There are no cluster-wide or per-database Change Streams, only per-collection ones. Write code or scripts to coordinate parallel Change Streams for multiple collections; I don’t recommend running more than 10–15 parallel Change Streams against Cosmos DB.
  • Specific Pipeline Options: Cosmos DB only accepts a specific Change Stream pipeline shape, as documented by Azure; anything else simply won’t work. Its Change Streams also always return the full document, so apply each change on the destination as a full-document replace. Gradual cutovers (allowing writes on the destination early, assuming no conflicts) are practically out of the question unless you have a reliable way to verify data integrity.
  • No Timestamps: Change events don’t carry timestamps, so implement tombstones or other sequencing methods to calculate replication lag.
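To illustrate the safe-delete pattern, here is a minimal sketch. The field names (deleted, deletedAt) and namespaces are made up for illustration, and TTL behavior on Cosmos DB has its own restrictions, so verify what your account actually supports before relying on it.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

source = MongoClient("<cosmos-connection-string>")["appdb"]["orders"]        # placeholders
destination = MongoClient("<mongodb-connection-string>")["appdb"]["orders"]  # placeholders

def safe_delete(doc_id):
    # Instead of deleting, mark the document. The update IS visible to Cosmos DB
    # Change Streams, so the marker replicates to the destination.
    # Application reads must now filter out {"deleted": True} documents.
    source.update_one(
        {"_id": doc_id},
        {"$set": {"deleted": True, "deletedAt": datetime.now(timezone.utc)}},
    )

# On the MongoDB destination, a standard TTL index removes marked documents
# automatically once deletedAt is replicated. Cleanup on the Cosmos DB side
# depends on its own TTL capabilities; check what your account allows.
destination.create_index("deletedAt", expireAfterSeconds=0)
```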

Be sure to coordinate the Change Streams with the initial data copy to avoid data loss due to race conditions: open the streams before the copy begins, so that nothing slips through in between.
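Putting the Change Stream constraints together, a per-collection tailing loop might look roughly like the sketch below. The pipeline follows the shape Azure documents for Cosmos DB Change Streams (a $match on the supported operation types plus a $project of _id, fullDocument, ns, and documentKey), but treat it as an illustration and check the current documentation for the exact required options; the namespaces are placeholders.

```python
import threading

from pymongo import MongoClient

SOURCE_URI = "<cosmos-connection-string>"    # placeholder
MONGODB_URI = "<mongodb-connection-string>"  # placeholder
NAMESPACES = [("appdb", "orders"), ("appdb", "customers")]  # keep this list small

# Pipeline shape Cosmos DB expects; verify against the current Azure docs.
COSMOS_PIPELINE = [
    {"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}},
    {"$project": {"_id": 1, "fullDocument": 1, "ns": 1, "documentKey": 1}},
]

def tail(db_name, coll_name):
    src = MongoClient(SOURCE_URI)[db_name][coll_name]
    dst = MongoClient(MONGODB_URI)[db_name][coll_name]
    with src.watch(COSMOS_PIPELINE, full_document="updateLookup") as stream:
        for change in stream:
            doc = change["fullDocument"]
            # Events always carry the full document, so a full replace is the
            # natural way to apply them on the destination.
            dst.replace_one({"_id": doc["_id"]}, doc, upsert=True)

# One thread per collection; keep the total well under 10-15 parallel streams.
threads = [threading.Thread(target=tail, args=ns, daemon=True) for ns in NAMESPACES]
for t in threads:
    t.start()
for t in threads:
    t.join()
```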

Speeding Up the Migration

Migration speed depends on:

  • Latency
  • Level of Parallelization
  • Provisioned RU/s on the source
  • Capacity on the destination (CPU, RAM, disk IO)

Latency to Cosmos DB can significantly impact throughput. Parallelize reads across namespaces. For large namespaces, if you can manually identify sensible split points in the range of your _id values, use parallel range queries to increase speed. Sadly, Cosmos DB’s implementation of the $sample aggregation stage does not work correctly, so you can’t use statistical sampling to determine split points programmatically.
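With hand-picked split points, a parallelized copy of one large namespace could look roughly like this. Integer _id values and the specific boundaries are assumed purely for illustration; with ObjectId keys you would pick ObjectId boundaries instead.

```python
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient, ReplaceOne

SOURCE_URI = "<cosmos-connection-string>"    # placeholder
MONGODB_URI = "<mongodb-connection-string>"  # placeholder

# Hand-picked _id split points; the ends must cover your actual min and max.
SPLITS = [0, 250_000, 500_000, 750_000, 1_000_001]

def copy_range(lo, hi):
    src = MongoClient(SOURCE_URI)["appdb"]["orders"]
    dst = MongoClient(MONGODB_URI)["appdb"]["orders"]
    batch = []
    for doc in src.find({"_id": {"$gte": lo, "$lt": hi}}):
        batch.append(ReplaceOne({"_id": doc["_id"]}, doc, upsert=True))
        if len(batch) == 1000:
            dst.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        dst.bulk_write(batch, ordered=False)

# One worker per range; each worker reads a disjoint slice of the collection.
with ThreadPoolExecutor(max_workers=len(SPLITS) - 1) as pool:
    for lo, hi in zip(SPLITS, SPLITS[1:]):
        pool.submit(copy_range, lo, hi)
```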

Resumability

Track synced namespaces and store resume tokens from Change Streams to avoid restarting the migration process from scratch. This is particularly important during the initial testing phase.
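A simple way to persist progress is a small metadata collection on the destination. The collection and field names here are hypothetical; the resume token itself comes from the Change Stream (stream.resume_token in pymongo) and can be passed back to watch() via resume_after, though it is worth testing how Cosmos DB behaves with resumed streams.

```python
from pymongo import MongoClient

meta = MongoClient("<mongodb-connection-string>")["migration_meta"]["progress"]  # hypothetical

def mark_initial_copy_done(db_name, coll_name):
    meta.update_one(
        {"_id": f"{db_name}.{coll_name}"},
        {"$set": {"initial_copy_done": True}},
        upsert=True,
    )

def save_resume_token(db_name, coll_name, token):
    # Persist the token after every applied batch so a restart can pick up
    # where it left off instead of re-copying everything.
    meta.update_one(
        {"_id": f"{db_name}.{coll_name}"},
        {"$set": {"resume_token": token}},
        upsert=True,
    )

def load_resume_token(db_name, coll_name):
    doc = meta.find_one({"_id": f"{db_name}.{coll_name}"}) or {}
    return doc.get("resume_token")  # pass as resume_after=... to watch()
```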

Data Integrity

Cosmos DB does not support dbHash, so rely on heuristics such as document counts and data sizes, and perform spot checks on individual documents.
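A basic version of such a check is sketched below: compare counts, then sample a handful of documents and compare them field by field. $sample is run on the MongoDB destination (where it works reliably) and each sampled document is re-read from the source by _id; the namespaces are placeholders.

```python
from pymongo import MongoClient

src = MongoClient("<cosmos-connection-string>")["appdb"]["orders"]    # placeholders
dst = MongoClient("<mongodb-connection-string>")["appdb"]["orders"]   # placeholders

# Heuristic 1: document counts should match (or stay close while replicating).
src_count = src.count_documents({})
dst_count = dst.count_documents({})
print(f"source={src_count} destination={dst_count} diff={src_count - dst_count}")

# Heuristic 2: spot-check random documents. Sampling happens on the destination
# because Cosmos DB's $sample is unreliable.
mismatches = 0
for doc in dst.aggregate([{"$sample": {"size": 100}}]):
    if src.find_one({"_id": doc["_id"]}) != doc:
        mismatches += 1
print(f"spot-check mismatches: {mismatches}/100")
```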

Rollback Plan

How do you make the process reversible? Your solution may not work in the opposite direction (MongoDB as the source, Cosmos DB as the destination) without additional work. Common issues you will run into are Cosmos DB’s batch write limits and rate limiting.
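When writing back into Cosmos DB, rate limiting shows up once you exceed the provisioned RU/s; the usual mitigation is smaller batches plus retries with backoff, as in the sketch below. The batch size and backoff values are arbitrary, and the error handling is deliberately coarse; check the exact errors your account returns (Cosmos DB commonly reports throttling as error code 16500).

```python
import time

from pymongo import MongoClient, ReplaceOne
from pymongo.errors import BulkWriteError, OperationFailure

cosmos = MongoClient("<cosmos-connection-string>")["appdb"]["orders"]  # placeholder

def write_with_backoff(ops, max_retries=10):
    # Retrying with ReplaceOne(..., upsert=True) is idempotent, so it is safe
    # to resend the whole batch even if part of it already went through.
    delay = 0.5
    for _ in range(max_retries):
        try:
            cosmos.bulk_write(ops, ordered=False)
            return
        except (BulkWriteError, OperationFailure):
            time.sleep(delay)
            delay = min(delay * 2, 10)
    raise RuntimeError("write failed after repeated rate limiting")

# Example: small batches keep each request under the RU budget.
docs = [{"_id": i, "value": i} for i in range(100)]
for start in range(0, len(docs), 25):
    chunk = docs[start:start + 25]
    write_with_backoff([ReplaceOne({"_id": d["_id"]}, d, upsert=True) for d in chunk])
```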

Observability

Expose and monitor metrics to identify bottlenecks and ensure direct visibility into the migration process. Bottlenecks can differ between clusters, environments, and sometimes between runs.
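Even basic counters go a long way. The sketch below just keeps per-namespace counts and prints them on an interval; in practice you would feed these into whatever metrics stack you already run.

```python
import threading
import time
from collections import defaultdict

# Per-namespace counters, updated by the copy and replication workers.
docs_copied = defaultdict(int)
changes_applied = defaultdict(int)
lock = threading.Lock()

def record_copied(namespace, n):
    with lock:
        docs_copied[namespace] += n

def record_change(namespace):
    with lock:
        changes_applied[namespace] += 1

def report_loop(interval_seconds=10):
    # Print a snapshot periodically so per-namespace throughput dips are visible.
    while True:
        time.sleep(interval_seconds)
        with lock:
            for ns in sorted(set(docs_copied) | set(changes_applied)):
                print(f"{ns}: copied={docs_copied[ns]} changes_applied={changes_applied[ns]}")

threading.Thread(target=report_loop, daemon=True).start()
```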

Cost Management

A migration will cost you money, but don’t rush to overprovision the RU/s in Cosmos DB. Start with the destination instead and make sure it isn’t the bottleneck. In general, being stingy here costs more in the end: it’s worthwhile to get the migration over with as quickly as possible.

Conclusion

Migrating from Cosmos DB to MongoDB is challenging but manageable with the right approach. Addressing key concerns around downtime, speed, data integrity, and rollback plans ensures a smooth transition.

By following these guidelines, you can build a robust, repeatable migration solution that minimizes downtime and preserves data integrity during the transition from Cosmos DB to MongoDB.

Good luck and happy migrating! Feel free to reach out with questions or comments.

UPDATE: Since writing this post, my company, Adiom, has developed an Open Source tool dsync to help address many of the above and other critical challenges of database migration. You can read the updated version of the post on our website.
