Designing a Data Migrator in Go

Our journey of migrating millions of records from MongoDB to Postgres

First Attempt — The quick and dirty way

Our first attempt was to split the load evenly across a fixed number of goroutines and let each handle its share independently. The process involved the following steps:

  • Divide the records into equal chunks and distribute the load among multiple workers. For instance, 1 million records can be split into 10 chunks of 100K, each handled by its own goroutine.
  • Let each worker query the source database (Mongo in our case), then transform and save its share of the data to the destination database (Postgres in our case). A rough sketch of this setup follows the figure below.
[Figure] Distributing workload to an equal number of Go-routines
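As an illustration, a minimal sketch of that first attempt might look like the following. The fetchChunk and storeChunk helpers are hypothetical stand-ins for the Mongo skip/limit query and the Postgres bulk upsert, and the record counts are made up for the example.

```go
// A rough sketch of the first attempt: chunk the data and give each
// goroutine its own slice to fetch, transform, and store independently.
package main

import (
	"fmt"
	"sync"
)

const (
	totalRecords = 1_000_000
	numWorkers   = 10
)

// fetchChunk stands in for querying MongoDB for records [offset, offset+limit).
func fetchChunk(offset, limit int) []map[string]interface{} {
	// ... skip/limit query against the source collection ...
	return nil
}

// storeChunk stands in for transforming the documents and bulk-upserting
// them into Postgres.
func storeChunk(docs []map[string]interface{}) error {
	// ... multi-row INSERT ... ON CONFLICT, in batches ...
	return nil
}

func main() {
	chunkSize := totalRecords / numWorkers
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(offset int) {
			defer wg.Done()
			// Each worker owns one slice of the data and talks to both
			// databases on its own.
			docs := fetchChunk(offset, chunkSize)
			if err := storeChunk(docs); err != nil {
				fmt.Println("worker failed:", err)
			}
		}(w * chunkSize)
	}
	wg.Wait()
}
```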

The average time to migrate 1 million records on a steady database was ~30 minutes. That meant, theoretically, more than 5 hours to migrate 10 million records, even if we did it during low-traffic hours. And we had a lot more than that to migrate.

Why not just increase the number of workers? Just because you can doesn’t mean you should. Our workers were already fast enough; adding more did not help much, and it actually made things worse.

Debugging the tool

Since we were confident that our workers were fast enough, we started by checking database metrics to figure out the issue. We were using Postgres’ multi-row insert, with an update on conflict, in batches of 1,000 records at a time. So, for a million records we would run 1,000 transactions in Postgres, which seemed perfectly fine, and the metrics didn’t show anything suspicious.
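For reference, a minimal sketch of that batched upsert is below. It assumes a hypothetical users table with (id, name, email) columns, so the real column list and conflict target will differ.

```go
// A minimal sketch of a multi-row INSERT ... ON CONFLICT DO UPDATE (upsert)
// against a hypothetical "users" table, sent one batch at a time.
package migrator

import (
	"database/sql"
	"fmt"
	"strings"
)

// User is an illustrative stand-in for a transformed Mongo document.
type User struct {
	ID    string
	Name  string
	Email string
}

// bulkUpsert writes one batch (e.g. 1,000 records) with a single
// multi-row statement instead of 1,000 individual inserts.
func bulkUpsert(db *sql.DB, batch []User) error {
	if len(batch) == 0 {
		return nil
	}
	placeholders := make([]string, 0, len(batch))
	args := make([]interface{}, 0, len(batch)*3)
	for i, u := range batch {
		n := i * 3
		placeholders = append(placeholders,
			fmt.Sprintf("($%d, $%d, $%d)", n+1, n+2, n+3))
		args = append(args, u.ID, u.Name, u.Email)
	}
	query := fmt.Sprintf(`
		INSERT INTO users (id, name, email)
		VALUES %s
		ON CONFLICT (id) DO UPDATE
		SET name = EXCLUDED.name, email = EXCLUDED.email`,
		strings.Join(placeholders, ", "))
	_, err := db.Exec(query, args...)
	return err
}
```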

[Figure] Postgres: 99th percentile metric for our insert/update query
[Figure] Postgres: Average total time vs number of calls per minute
[Figure] Mongo: Average execution time in seconds per read operation
[Figure] Mongo: Average documents returned per second
[Figure] Mongo: System CPU usage
[Figure] Mongo: The ratio of the number of indexed items scanned to the number of documents returned by queries

Identifying the issue

At this point, we had a pretty good idea of what was happening. Postgres was fast enough to insert millions of records a minute, but it was sitting idle most of the time because Mongo could not supply the data fast enough. Moving forward, we needed to keep the following things in mind:

  • We may not always have the privilege to make changes in the source database. It may also not be practical to add indexes for every possible query permutation.
  • We need to make sure that we consume as few of the source database’s resources as possible.
  • Our transform layer is fast enough, and the average bulk-insert time in Postgres is around 16ms for 1K records.

Rethinking the design

We revised our previous design by adding a new layer whose only job was to provide data for the workers to work on:

  • A reader layer whose only job is to fetch data from MongoDB and make it available to the workers, limiting how much we hit the source database.
  • N worker threads, each connected to Postgres and responsible for transforming, batching, and storing the data in parallel.
  • Whenever data is available, any free worker can pick it up and process it.
[Figure] Adding a layer to limit interaction with MongoDB
An example implementation of the revised approach: https://play.golang.org/p/rrLajaJDOIu
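For completeness, here is a minimal sketch of the same producer/consumer shape. readBatches and storeBatch are hypothetical placeholders for the Mongo cursor loop and the Postgres bulk upsert; the channel wiring between the reader and the workers is the point.

```go
// A minimal sketch of the revised design: one reader feeds batches onto a
// channel, and any free worker picks up the next batch and stores it.
package main

import (
	"fmt"
	"sync"
)

const (
	numWorkers = 10
	batchSize  = 1000
)

// readBatches stands in for streaming documents from MongoDB (e.g. a single
// cursor) and emitting them in batches on the returned channel.
func readBatches() <-chan []map[string]interface{} {
	out := make(chan []map[string]interface{}, numWorkers)
	go func() {
		defer close(out)
		// ... iterate the Mongo cursor, appending to a batch and sending it
		// on out whenever it reaches batchSize ...
	}()
	return out
}

// storeBatch stands in for transforming a batch and bulk-upserting it into Postgres.
func storeBatch(batch []map[string]interface{}) error {
	return nil
}

func main() {
	batches := readBatches()
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Whichever worker is free receives the next available batch.
			for batch := range batches {
				if err := storeBatch(batch); err != nil {
					fmt.Println("store failed:", err)
				}
			}
		}()
	}
	wg.Wait()
}
```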

Testing the new design

We ran the tool multiple times, on anywhere from a few thousand to millions of records, and voilà, it was faster than we expected.

The average time to migrate 1 million records was now ~25 seconds; 5 million records took ~125 seconds on average.

This was a huge improvement over the previous design. Remember that it previously took ~30 minutes on average to migrate 1 million records; we were now roughly 75 times faster!

Checking DB metrics

The average insertion time in Postgres increased from 16ms to 100ms, but we were now making more than 2K calls/min as opposed to 40 before.

[Figure] Postgres: Average total time vs number of calls per minute
[Figure] Mongo: Average execution time in seconds per read operation (300ms)
[Figure] Mongo: Documents returned over a given sample period (53k/s)
[Figure] Mongo: System CPU usage (18%), Query targeting (1)

Can we do better than this?

We achieved some extra performance by tweaking a few Postgres configs. For instance, turning off autovacuum and using a unique key instead of the primary key during the migration reduced the overall migration time from ~25 seconds to ~20 seconds for a million records. Increasing the batch size can also give a performance boost, at the cost of a bit of extra memory.
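As a rough illustration of the autovacuum tweak, it can be disabled per table for the duration of the migration and re-enabled afterwards. The users table name and connection string below are hypothetical.

```go
// A sketch of toggling autovacuum on the target table around the migration,
// using database/sql with the lib/pq driver.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/target?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable autovacuum on the target table for the duration of the migration.
	if _, err := db.Exec(`ALTER TABLE users SET (autovacuum_enabled = false)`); err != nil {
		log.Fatal(err)
	}

	// ... run the migration here ...

	// Re-enable autovacuum once the migration is done.
	if _, err := db.Exec(`ALTER TABLE users SET (autovacuum_enabled = true)`); err != nil {
		log.Fatal(err)
	}
}
```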

[Figure] ...and we were not an exception!
[Figure] Alternate design with a separate layer to communicate with Postgres

Takeaway

Distributing the load concurrently across multiple threads may not always make your program faster. Program efficiency depends on many factors, one of them being the limitations of the systems it depends on. It doesn’t matter how amazingly fast your program is if those systems cannot produce or consume data at a similar rate.
