Designing a Data Migrator in Go

Our journey of migrating millions of records from MongoDB to Postgres

As part of improving the user experience within our company’s ecosystem, we decided to extract an important piece of functionality from a huge monolithic mess into a separate, independent service. Once the service was ready, the next step was to migrate millions of records from the old database, hosted in MongoDB, to a new one in Postgres. On top of that, the source database contained a lot of redundant and duplicate data that we wanted to clean up during this ETL process.

Since we were already using Go for most of our services, it was the obvious choice for building our migration tool as well. This article mostly discusses the design decisions we made along the way and focuses less on the coding itself.

First Attempt — The quick and dirty way

Our first attempt was to distribute the load evenly across an equal number of goroutines and let each handle its share independently. The process involved the following steps:

  • Get the total number of records to import.
  • Divide it into equal chunks and distribute the load between multiple workers. For instance, 1 million records can be split into chunks of 100K and handled by 10 goroutines.
  • Let each worker query the source database (Mongo in our case), then transform and save its part of the data to the destination database (Postgres in our case).
Distributing workload to an equal number of Go-routines

We tested with a few thousand records and it seemed to work perfectly fine. After all, how hard can it be, right? But our excitement was short-lived. As we ran the tool with 100 thousand to a million records, the migration got slower and slower.

The average time to migrate 1 million records on a steady database was ~30 minutes. That means, theoretically, migrating 10 million records during low-traffic hours would take more than 5 hours. And we had a lot more than that to migrate.

Why not just increase the number of workers? Just because you can doesn’t mean you should. Our workers were already fast enough, and adding more of them didn’t help; it actually made things worse.

Debugging the tool

Since we were confident that our workers were fast enough, we started by checking database metrics to figure out the issue. We were using Postgres’ multi-row insert, updating on conflict, in batches of 1000 records at a time. So for a million records we would run 1000 transactions in Postgres, which seemed perfectly fine, and the metrics didn’t show anything suspicious.

Postgres: 99th Percentile metric for our insert/update query

The average insertion time was close to 16ms, which means the whole insertion of 1 million records (1000 batches) should complete in roughly 16 seconds.

Postgres: Average total time vs Number of calls per minute

Another metric showed that we were executing 35 to 40 calls/min on average, i.e., we were inserting at most 35 to 40K records per minute. But based on the previous metric, we should have been able to insert more than that in a single second.

So where was all this time being spent? The only thing left was to check the Mongo metrics, and you guessed it: the results were not what we expected.

Mongo: Average execution time in seconds per read operation

The chart above shows the average execution time per reading operation over a selected sample period. And the chart below shows the average number of documents returned per second over the same sample period.

Mongo: Average documents returned per second

We were waiting up to 12s for a read operation to complete. The average execution time was increasing linearly while the number of documents returned per second was decreasing at the same rate. This was the complete opposite of what anyone would expect. And to make matters even worse, we were consuming 100% of the system resources.

Mongo: System CPU usage

When CPU usage is above 75%, which is considered high, it is likely that queries are unoptimized or indexes are missing. But this was not the case for us, since we were using a simple fetch query with minimal aggregation.

Mongo: The ratio of the number of indexed items scanned to the number of documents returned by queries

The next step was to check index performance. The chart above shows that the query had to scan up to 1700 documents to find a single one to return. A high ratio indicates inefficient operations, since many documents have to be scanned for each document returned. Ideally, the ratio of scanned documents to returned documents should be close to 1.

We made sure we had indexes wherever applicable. Index performance improved a bit, but even with proper indexes the migration time stayed the same: ~30 minutes for a million records on a steady database.

Identifying the issue

At this point, we had a pretty good idea of what was happening. Postgres was fast enough to insert millions of records a minute but was sitting idle most of the time because Mongo could not produce the data fast enough. Moving forward, we needed to keep the following things in mind:

  • The main bottleneck is data access time in MongoDB, our source database, which serves millions of customers in production.
  • We may not always have the privilege of making changes to the source database. It may also not be practical to add indexes for every possible query permutation.
  • We need to consume as few resources of the source database as possible.
  • Our transform layer is fast enough, and the average bulk insert time in Postgres is around 16ms for 1K records.

We now needed to tweak our approach to reduce the load on the source database while feeding as much data as possible to the destination database.

Rethinking the design

We revised our previous design by adding a new layer whose only job was to feed data to the workers.

  • A single thread connects to MongoDB and is responsible for fetching data and handing it to the workers, thus reducing the load on MongoDB.
  • N worker threads connect to Postgres and are responsible for transforming, batching, and storing the data in parallel.
  • Whenever data is available, any free worker can pick it up and process it.
Adding a layer to limit interaction with MongoDB

Unlike the previous design, only one thread now interacts with the source database. This thread is solely responsible for fetching data from Mongo in batches and feeding it to the workers. As soon as data is available, any free worker picks it up, transforms it as required, and inserts it into Postgres in batches of 1K records at a time. If the data generator is fast enough, no worker has to sit idle and the destination database can be used to its full capacity. This approach also removed the step of counting the total records upfront, as in the previous design.

So how do we fetch data in batches? Fortunately, the MongoDB server already returns query results in batches, so we didn’t have to do anything special here. For example, if the total number of returned docs is 10K and you specify a batch size of 1K, the client sends 1 find and 10 getMore commands to the Mongo server to complete the request. And Mongo is smart enough to prefetch this data efficiently.

By default, a batch will not exceed the maximum BSON document size of 16 MB. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batches can be from 4 to 16 megabytes. These values can be configured in Mongo, but in our case we went with the defaults.

An example implementation of the revised approach:

Testing the new design

We ran the tool on anywhere from a few thousand to millions of records, multiple times, and voilà: it was faster than we expected.

The average time to migrate 1 million records was ~25 seconds now. 5 million records took ~125 seconds on average.

This was a huge improvement over the previous design. Remember that it used to take ~30 minutes on average to migrate 1 million records. We were now ~72 times faster!

Checking DB metrics

The average insertion time in Postgres increased from 16ms to 100ms, but we were now executing more than 2K calls/min as opposed to 40 before.

Postgres: Average total time vs Number of calls per minute

We managed to reduce the load on MongoDB as expected. Note that we hadn’t added any extra indexes or tried to optimize the database in any way, as we did in our initial approach.

Mongo: Average execution time in seconds per read operation (300ms)

The average execution time dropped to 300ms, and the DB was returning 53K records per second, as opposed to 420 records in 12s previously.

Mongo: Documents returned over a given sample period (53k/s)
Mongo: System CPU usage (18%), Query targeting (1)

The average system CPU usage is ~20%, and query targeting is 1, which means the DB has to scan only 1 document for each one it returns.

Can we do better than this?

Some extra performance was gained by tweaking a few Postgres settings. For instance, turning off autovacuum and using a unique key instead of the primary key for the duration of the migration reduced the overall migration time from ~25 seconds to ~20 seconds for a million records. Increasing the batch size can also give some performance boost, at the cost of a bit of extra memory.


An alternate design is to add a new layer to communicate with Postgres instead of inserting the data directly from the workers. This layer collects processed data from the workers, groups it, and queues it up for insertion into the destination database. This way, we can control the number of workers that process data and the number that interact with the database independently. In particular, if the workers need to do some heavy lifting, we may want more of them while still limiting the number that talk to the DB.

Alternate design with separate layer to communicate with Postgres

For us, this pattern didn’t bring any performance improvement, as our workers were already fast enough.


Distributing the load concurrently across multiple threads will not always make your program faster. A program’s efficiency depends on many different factors, one of them being the limitations of the systems it depends on. It doesn’t matter how amazingly fast your program is if those systems cannot keep up with it.

Therefore, it is always worth spending some time to properly analyze and evaluate the limitations of the dependent systems, and to design your own system with those limitations in mind.
