How We Do MongoDB Migrations at Coinbase
At Coinbase, we use a wide variety of datastores, including MongoDB, Postgres, and Redis. One of the things that has consistently been troublesome for us is MongoDB migrations. Up until about last week, our general migration process looked like this:
- Write a migration, either via a lazy migration library or a regular task, and add it to the codebase. Migrations generally add new fields, update data to reflect changed assumptions in the codebase, or remove unused fields.
- Get the code audited and approved by another engineer.
- Find one of the very limited number of engineers who has access to production, and ask that person to run the migration.
- Put together a script to iterate over all of our records — sometimes in the tens of millions — and make any updates. Have that same engineer with production access run this script on our production application. (A sketch of such a script follows this list.)
- Clean up the migration with yet another custom script.
- If anything goes wrong, find the same engineer from step 3 and have them stop the migration.
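To make the pain concrete, here's a minimal sketch of what one of those ad-hoc backfill scripts might have looked like. This is purely illustrative pymongo; the connection string, collection, field name, and default value are all invented, not our actual code.

```python
import time
from pymongo import MongoClient, UpdateOne

# Stand-in connection string; in practice this ran against production by hand.
client = MongoClient("mongodb://localhost:27017")
users = client["app"]["users"]

BATCH_SIZE = 1000
ops = []

# Walk every record that is missing the new field and backfill it.
for doc in users.find({"preferred_currency": {"$exists": False}}, projection=["_id"]):
    ops.append(UpdateOne({"_id": doc["_id"]},
                         {"$set": {"preferred_currency": "USD"}}))
    if len(ops) >= BATCH_SIZE:
        users.bulk_write(ops, ordered=False)
        ops = []
        time.sleep(0.1)  # a guess at "don't hammer prod", tuned by feel

if ops:
    users.bulk_write(ops, ordered=False)
```

Multiply this by tens of millions of documents, a handful of engineers with production access, and a hand-tuned sleep, and the failure modes become obvious.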
This is a very ad-hoc process, and ad-hoc processes tend to fail when scaled. For example, when a 50+ person engineering team has only a handful of people with production access, migrations can easily take up a big chunk of their time. It’s also far too easy to make a typo in a script and cause a production outage.
What did we do?
First, we planned out what we considered to be an ideal process for migrations. We came up with a few key ideas:
- Migrations should not require manually running a single line of code on production.
- Migrations should be easily accessible to all engineers, not just the few cleared to access our production data. We always strive to keep the number of engineers with production access low; letting migrations run without that access lets us cut it down further.
- Error handling should be present at every step in the flow, not bolted on as an afterthought in a quickly written script. We need to assume that our boxes can die at any time, even in the middle of a migration (one way to handle that is sketched after this list).
- All migration code should be unit-tested, such that bugs are caught early on in the flow.
- Aside from writing the code, there should be little-to-no manual work in the migration process.
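To make that error-handling idea concrete: one common way to survive a box dying mid-migration is to checkpoint progress in a small metadata collection and resume from the last processed `_id`. The sketch below is a hypothetical pymongo illustration of that pattern, not our actual implementation; the migration name, collections, and fields are invented.

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
db = client["app"]
users = db["users"]
state = db["migration_state"]        # hypothetical metadata collection

MIGRATION = "backfill_preferred_currency"
BATCH_SIZE = 500

def run():
    # Resume from the last checkpoint if a previous worker died mid-run.
    checkpoint = state.find_one({"_id": MIGRATION}) or {}
    query = {"preferred_currency": {"$exists": False}}
    if checkpoint.get("last_id") is not None:
        query["_id"] = {"$gt": checkpoint["last_id"]}

    batch = []
    last_id = None
    for doc in users.find(query, projection=["_id"]).sort("_id", 1):
        batch.append(UpdateOne({"_id": doc["_id"]},
                               {"$set": {"preferred_currency": "USD"}}))
        last_id = doc["_id"]
        if len(batch) >= BATCH_SIZE:
            users.bulk_write(batch, ordered=False)
            # Persist progress so a restarted box picks up where this one stopped.
            state.update_one({"_id": MIGRATION},
                             {"$set": {"last_id": last_id}}, upsert=True)
            batch = []

    if batch:
        users.bulk_write(batch, ordered=False)
    state.update_one({"_id": MIGRATION}, {"$set": {"done": True}}, upsert=True)

if __name__ == "__main__":
    run()
```

Because every write is an idempotent `$set` on documents that still lack the field, re-running after a crash is safe even if the last checkpoint was slightly stale.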
After a few weeks of work, we implemented a process that achieves all of these goals. It’s all condensed into a panel with a few simple selections.
This panel is pretty simple, but also very powerful. All of our engineers can now run a migration with a simple, four-step process. We have a few pages of docs with all the nitty-gritty details, but I’ll spare you the fluff and give an easy summary — you probably didn’t come to Medium to read engineering docs.
- Use our backend admin panel to trigger a cleanup, making sure migration metadata is wiped.
- Add the migration to the codebase, with unit tests, after going through our internal code-review process (a sketch of what such a migration might look like appears below).
- Kick off a migration using the admin panel.
- Once the migration is done, the engineer will receive a notification via our error-tracking software. They can then remove the migration and kick off a final cleanup.
That’s it. All errors are handled in the background, and engineers are notified should anything go wrong with migrating the collection. There’s also a one-click killswitch to stop the migration, in case our production database needs maintenance.
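For a sense of what step 2 might look like, here's a hypothetical sketch of a migration defined as a named, whitelisted transform with plain unit tests alongside it. The names and fields are invented for illustration; our real library differs, but the point is keeping the per-document logic pure so it can be tested without touching a database.

```python
import unittest

# Hypothetical shape of a migration as it might live in the codebase:
# a name used for whitelisting plus a pure per-document transform.
MIGRATION_NAME = "backfill_preferred_currency"

def migrate_document(doc):
    """Return the fields to set on one document, or None if no change is needed."""
    if "preferred_currency" in doc:
        return None
    return {"preferred_currency": "USD"}

class MigrateDocumentTest(unittest.TestCase):
    def test_backfills_missing_field(self):
        self.assertEqual(migrate_document({"_id": 1}),
                         {"preferred_currency": "USD"})

    def test_skips_documents_that_already_have_the_field(self):
        self.assertIsNone(migrate_document({"_id": 2, "preferred_currency": "EUR"}))

if __name__ == "__main__":
    unittest.main()
```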
Challenges
We ran into quite a few challenges when implementing automatic migrations. One example that comes to mind is preventing user error. If we opened migrations on every class to every engineer, it would only be a matter of time before someone accidentally kicked off a migration job on a huge collection and obliterated our database.
To prevent this from happening, we took a three-pronged approach:
- Require explicit whitelisting of a given class before it can be migrated by our library. This ensures that all classes must go through our code-review process before being migrated.
- Pace all migration jobs, such that only a certain number of documents are processed per second. This makes sure we don’t hammer the database with millions of updates made in fast succession (see the sketch after this list).
- Add a killswitch for engineers to use in case the unexpected happens.
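Here's a rough sketch of how pacing and a killswitch can fit together in a single worker loop. Again, this is illustrative pymongo with invented names (the flags collection, the per-second budget, the migration name), not our production code.

```python
import time
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
db = client["app"]
users = db["users"]
flags = db["migration_flags"]          # hypothetical killswitch collection

DOCS_PER_SECOND = 200                  # illustrative pacing budget
BATCH_SIZE = 100

def killswitch_tripped(name):
    flag = flags.find_one({"_id": name})
    return bool(flag and flag.get("stopped"))

def run_paced(name):
    batch, processed = [], 0
    started = time.monotonic()
    for doc in users.find({"preferred_currency": {"$exists": False}},
                          projection=["_id"]).sort("_id", 1):
        batch.append(UpdateOne({"_id": doc["_id"]},
                               {"$set": {"preferred_currency": "USD"}}))
        if len(batch) >= BATCH_SIZE:
            if killswitch_tripped(name):
                return                 # engineer hit the one-click stop
            users.bulk_write(batch, ordered=False)
            processed += len(batch)
            batch = []
            # Sleep whenever we're ahead of the per-second budget.
            ahead = processed / DOCS_PER_SECOND - (time.monotonic() - started)
            if ahead > 0:
                time.sleep(ahead)
    if batch and not killswitch_tripped(name):
        users.bulk_write(batch, ordered=False)

if __name__ == "__main__":
    run_paced("backfill_preferred_currency")
```

The worker only checks the flag between batches, so a one-click stop takes effect within roughly one batch rather than instantly, which is a reasonable trade-off for this kind of job.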
Final Notes
Hopefully this gives some insight into how we think about engineering at Coinbase. Generally, our goal is to automate away as much manual work as we can, reducing errors and freeing engineers to work on more productive projects.
If these kinds of projects excite you, we’re hiring! You can check out our careers page for more info.