
How to update 63 million records in MongoDB 50% faster?

Dmytro Harazdovskiy · Shelf Engineering · Nov 2, 2022

Table of contents

· Intro
· Setup
· Testing — 1 million records update
· More realistic example — 10m
· How can we improve this result?
· Additional hacks
· Updating whole collection — 63m
· Remedy for the cursor
· Wrap up

Intro

Following my previous article, we already have a 63-million-record collection ready to use.

Therefore, an obvious real-life scenario for this data is to add some kind of KPI or flag to the documents so we can reuse it later. Imagine we can only do it after the insertion, not during it 🙂.

How can we handle updates in such an extensive collection? Let’s find out which approaches work best!

Setup

As usual, I’m running Mongo Atlas Service with the M30 tier.

Configs of Atlas cluster for experiments

Also, we are going to work with the already inserted data — a 63m-record collection. Example of a record:

Origin

How to insert data?

Testing — 1 million records update

Let’s start with the easiest possible situation to test — adding one static field to 1 million records using updateMany. But how can we select one million out of 63 million?

Reviewing the example of the data, there is a language field in it. I made a simple aggregation:

db.getCollection("63mil-collection").aggregate([
  { $group: { _id: '$language', count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

Just to be aware of how many reviews per language we have:

Spread of reviews in the dataset (done with Exploratory)

To run an update test, let’s use the reviews written in the Polish language — 1,341,264 records.

Origin
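
A minimal sketch of what that script roughly does (the database name and the exact language filter value are assumptions; the full version is in the Origin link above):

const { MongoClient, ServerApiVersion } = require('mongodb');

const client = new MongoClient(process.env.MONGO_CLUSTER_M30, { serverApi: ServerApiVersion.v1 });

async function main() {
  await client.connect();
  console.log('Connected successfully to server');
  const col = client.db('test').collection('63mil-collection');

  // One single command, no pagination: Mongo matches and updates everything itself
  console.time('Updating 1,3mil data');
  const result = await col.updateMany({ language: 'polish' }, { $set: { isPolish: true } });
  console.timeEnd('Updating 1,3mil data');
  console.log(result);

  await client.close();
}

main().catch(console.error);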

Nothing special here, just adding an isPolish flag and testing Mongo’s performance:

❯ node update/1mil/update.js
Connected successfully to server
Updating 1,3mil data: 5:41.827 (m:ss.mmm)
{
  acknowledged: true,
  modifiedCount: 1341264,
  upsertedId: null,
  upsertedCount: 0,
  matchedCount: 1341264
}

Mongo did not drop the connection, even though it took 5 minutes to complete 💪 What about metrics?

Metrics updating 1m with a single field

Wrapping up these results: even adding one field without any pagination hit Mongo pretty hard with 1m documents. The System CPU was almost at 100%, meaning a single updateMany occupied one CPU core of the M30 cluster! This approach is terrible, and I suggest using it only to update no more than 100k docs; otherwise, you can screw up your prod instance.

More realistic example — 10m

Now we need to calculate a simple KPI for a subset of data — all the reviews written in Chinese should have a popularity score. Let’s define popularity to be votes_up + votes_funny + comment_count.
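
In bulk-write terms, the per-document operation could look roughly like this (the exact shape of getBulkOperations is my assumption based on the walkthrough below, and treating missing fields as 0 is also an assumption):

// Builds one updateOne bulk operation per document
const getBulkOperations = (doc) => ({
  updateOne: {
    filter: { _id: doc._id },
    update: {
      $set: {
        popularity: (doc.votes_up || 0) + (doc.votes_funny || 0) + (doc.comment_count || 0),
      },
    },
  },
});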

For this scenario, 10m records must be queried then looped and saved with the popularity score for each record. Right?

Let’s use a MongoDB cursor for reading and bulk write operations for saving data. Also, I need to mention that I’ve added an index to the language field in MongoDB to let us query faster. Let’s take a look at the script:

Origin

Line 16 — created a cursor over all records in the Chinese language. I don’t know why, but in the dataset it was written as schinese, not chinese.

Line 31 — creating a stream out of the cursor to consume data from it.

Line 32 — subscribing to the data event to consume one record at a time in the doc variable.

Line 34 — pushing every record into the operations accumulator array. Each record is processed with getBulkOperations, which creates an updateOne bulk write operation.

Line 35 — every time operations reaches 20,000 elements, the stream is paused, the bulk operations are executed, and the stream is resumed.
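
A rough, simplified sketch of that flow (the line numbers above refer to the original gist at Origin; this version assumes col is an already connected collection handle and reuses the getBulkOperations helper sketched earlier):

const CHUNK = 20_000;
const operations = [];

const cursor = col.find({ language: 'schinese' }, {
  projection: { votes_up: 1, votes_funny: 1, comment_count: 1 },
}).batchSize(1000);

const stream = cursor.stream();

stream.on('data', (doc) => {
  operations.push(getBulkOperations(doc));

  // Every 20k accumulated operations: pause the stream, flush with bulkWrite, resume
  if (operations.length >= CHUNK) {
    stream.pause();
    col.bulkWrite(operations.splice(0, operations.length))
      .then(() => stream.resume())
      .catch((err) => stream.destroy(err));
  }
});

stream.on('end', async () => {
  if (operations.length) await col.bulkWrite(operations); // flush the tail
  console.log('Done');
});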

What about the results?

node sequential/10m/sequential.js
Connected successfully to server
Cursor: 0.736ms
Calculating cursor size
docs count: 5:47.590 (m:ss.mmm)
{ totalDocs: 10440462 }
objectCounter: 20000
objectCounter: 40000
objectCounter: 60000
.......
Script took: 1:30:56.312 (m:ss.mmm)

Damn, that’s a lot! We still need to wait more than an hour to just add one calculated field. Let’s check Mongo:

M30 During 10m record field updates

However, the Mongo metrics for both User and System CPU were at about 50%. Connections were also pretty stable at ~45. During the countDocuments query, Disk IOPS and Queue Depth increased sharply but later recovered. Also, within the first half hour there were about 4 open cursors, and later they dropped gradually.

How can we improve this result?

By reading my previous article you definitely know that:

Power is in parallelization

Therefore, I changed the bulkWrite section to run in parallel:

await radash.parallel(5, bulkChunks, (operations) => col.bulkWrite(operations))

and updated the CHUNK variable to 100k. There is not much to show in the code changes; the full script can be seen here.
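
For context, the chunking around that call could look roughly like this (the per-batch size of 20k is an assumption; radash.cluster splits an array into equally sized chunks):

const radash = require('radash');

const CHUNK = 100_000;      // accumulate 100k operations before flushing
const WRITE_CHUNK = 20_000; // size of every single bulkWrite batch (assumption)

// Flush the accumulated operations: split them into batches and run
// at most 5 bulkWrite calls concurrently
const write = async (operations) => {
  const bulkChunks = radash.cluster(operations, WRITE_CHUNK);
  await radash.parallel(5, bulkChunks, (chunk) => col.bulkWrite(chunk));
};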

Results:

❯ node update/10m/update-100k-parralel.js
.....
Done with 10500000 records
Script took: 57:56.312 (m:ss.mmm)

Much better now: a 36% time improvement, but we still need to wait about an hour. Also, I’ve captured the logs from the console and figured out that it took Mongo only 15 minutes (26%) to make the updates; as a result, querying took most of the time — 47 minutes (74%).

Metrics:

Not many changes on the Mongo side: in the current scenario, basically all metrics were similar, apart from Process CPU, which increased from ~45 to ~60 percent.

I guess it’s still too long, so we need to squeeze all the juice out of our script!

Additional hacks

I’ve found one interesting property in the Mongo driver for the cursor.

It can be used to offload reads from the primary to a secondary replica and make some room for write operations. What changes were done in the code to use it?

const client = new MongoClient(process.env.MONGO_CLUSTER_M30, {
  useNewUrlParser: true,
  useUnifiedTopology: true,
  serverApi: ServerApiVersion.v1,
  readPreference: 'secondary',
});

When you create a connection, specify the readPreference option with the value secondary. With this parameter, Mongo will read data from the secondary replicas. I ran a test using this parameter, but it only improved the metrics; the time was the same.

The Process CPU was reduced from 60 to 20 percent, and the System CPU from 60 to 30%; however, the other half of the load was taken by a replica. This improvement will help us run multiple update processes.

Updating whole collection — 63m

I’m not sure we can reuse the previous script in this case 🙂 since it could take about 6 hours to complete.

As you can see, reading the data needs to be improved first of all. What about parallelization? It seems like a perfect idea: we can easily implement reading the data in chunks. The size of the collection is stable, and it can be divided into these chunks:

--from=0 --to=10500000
--from=10500000 --to=21000000
--from=21000000 --to=31500000
--from=31500000 --to=42000000
--from=42000000 --to=52500000
--from=52500000 --to=63000000

Now a slight tuning of the script:

const { from = 0, to = 500_000 } = require('minimist')(process.argv.slice(2));
.....
const cursor = col.find({}, {
  projection: { votes_up: 1, votes_funny: 1, comment_count: 1 },
}).batchSize(1000).skip(from).limit(to - from); // limit by the chunk size, not the absolute offset
......

Okay seems really easy, testing time!

Running chunk update with writes!

And now you might become suspicious, since reading the data takes longer and longer as the skip value grows. Gotcha!

The last process stopped because of a MongoNetworkError. And this is where the tire meets the road.

There is a caveat in using the skip/limit params with a cursor. While going through the docs, you might notice the following:

The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.

That’s why the last cursor dropped. Therefore we can not use the current approach for parallel fetching and updating.

Remedy for the cursor

Since we are using Mongo, you definitely know about _id. But I bet you might not know much about the ObjectId behind it.

Even I didn’t until I read the documentation. Apart from its structure, there is one very interesting property of ObjectId: ObjectId has a natural ordering.

What does it mean? It simply means that we can apply all the less-than and greater-than comparisons you want to it. If you don’t believe me, open the Mongo shell and execute the following set of commands:

> ObjectId("5936d49863623919cd56f52d") > ObjectId("5936d49863623919cd56f52e")
false
> ObjectId("5936d49863623919cd56f52d") > ObjectId("5936d49863623919cd56f52a")
true

Using this property of ObjectId and also taking into consideration the fact that _id is always indexed, we can devise the following approach for pagination:

  1. Fetch a page of documents from the database
  2. Get the document id of the last document on the page
  3. Retrieve documents greater than that id, repeat step 2
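
In driver terms, one step of this loop looks roughly like this (pageSize and the lastId variable are illustrative):

// Fetch the next page: everything after the last seen _id, in natural _id order
const page = await col.find(lastId ? { _id: { $gt: lastId } } : {})
  .sort({ _id: 1 })
  .limit(pageSize)
  .toArray();

const newLastId = page[page.length - 1]?._id;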

Now let’s take a look at the rewritten script using _id pagination:

Origin

I’ve removed some imports from this gist to reduce lines of code, so view the fully working script in the origin. Okay, what is happening here?

Line 19 — fetching the first chunk of records and saving the last document id on line 29.

Lines 26–27 — performing a bulk write operation on the previously processed docs in operations. The write() function splits the 10k docs into 5 chunks of 2k each and writes them in parallel.

Lines 30–32 — calling the recursive function processBatch with the id of the last element.

Lines 32–49 — the processBatch function does basically the same actions described before and calls itself recursively after each write of fetched docs. Every new call offsets the requested docs to start after the last processed element, and so on. It’s basically a custom cursor implementation.

Line 42 — the processBatch function will stop calling itself recursively after reaching the count passed via the CLI --to flag.
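
A simplified sketch of that script (col is assumed to be an open collection handle, getBulkOperations is the helper sketched earlier, and the chunk sizes mirror the walkthrough above; the fully working version is in the origin):

const radash = require('radash');
const { from = 0, to = 500_000 } = require('minimist')(process.argv.slice(2));

const PAGE = 10_000;
const projection = { votes_up: 1, votes_funny: 1, comment_count: 1 };

// Split the fetched page into 5 chunks of 2k operations and write them in parallel
const write = (docs) => radash.parallel(
  5,
  radash.cluster(docs.map(getBulkOperations), 2_000),
  (operations) => col.bulkWrite(operations)
);

const processBatch = async (lastId, processed) => {
  // Stop once this process reaches its --to boundary
  if (processed >= to) return;

  // Custom "cursor": everything after the last processed _id, in natural order
  const docs = await col.find({ _id: { $gt: lastId } }, { projection })
    .sort({ _id: 1 }).limit(PAGE).toArray();
  if (!docs.length) return;

  await write(docs);
  console.log(`Done with ${processed + docs.length} records`);

  return processBatch(docs[docs.length - 1]._id, processed + docs.length);
};

const main = async () => {
  // Only the very first page needs skip() to reach this process's --from offset;
  // after that, pagination is done purely by _id
  const docs = await col.find({}, { projection })
    .sort({ _id: 1 }).skip(from).limit(PAGE).toArray();

  await write(docs);
  await processBatch(docs[docs.length - 1]._id, from + docs.length);
};

main().catch(console.error);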

I decided to go nuts running this script — let’s do 12 parallel processes! It will spread the load evenly and let the docs be processed in parallel.

node update/50m+/objectid/update.js --from=0 --to=5250000
node update/50m+/objectid/update.js --from=5250000 --to=10500000
node update/50m+/objectid/update.js --from=10500000 --to=15750000

node update/50m+/objectid/update.js --from=15750000 --to=21000000
node update/50m+/objectid/update.js --from=21000000 --to=26250000
node update/50m+/objectid/update.js --from=26250000 --to=31500000

node update/50m+/objectid/update.js --from=31500000 --to=36750000
node update/50m+/objectid/update.js --from=36750000 --to=42000000
node update/50m+/objectid/update.js --from=42000000 --to=47250000

node update/50m+/objectid/update.js --from=47250000 --to=52500000
node update/50m+/objectid/update.js --from=52500000 --to=57750000
node update/50m+/objectid/update.js --from=57750000 --to=63000000

Results:

Running 12 parallel update scripts on 63m record collection

It took just 3 hours and 10 minutes! That’s almost 50% faster than running the previous cursor version of the script. The last process (red pointer on the screenshot) took about 10 minutes just to load the first 10k docs with a skip of 57750000 records! It did not fail, but this time the skip impacted the overall script run time. The fastest script, without an initial skip offset, took 2h 30m (the top-left terminal window).

Now let’s take a look at Mongo:

M30 cluster updating 63m records

Both System and User CPU on the primary node were in the range of 85–100%. The secondaries were also under load, since we configured readPreference: 'secondary' in the previous section and there are a lot more read operations in this migration than writes.

Second plot: there were about 90 connections, and they started dropping gradually as the scripts were finishing. Also, IOPS on the secondaries were higher because of the tons of reads.

Wrap up

Summarizing all the experiments:

  • Mongo Atlas can easily cope with updating records under 1 million. Even updateMany will succeed in minutes, but be aware of a short spike in CPU usage to about 100%. Past 500k docs to update, please consider a cursor to spread the load.
  • Updating 10m is a bit trickier: use a cursor and parallelized writes. Also, configure the chunks and concurrent updates wisely so you don’t overwhelm the server. Consider running them overnight with a reduced load.
  • 63 million — be very careful with that many records! I only suggest running updates during a maintenance window. The cursor will work, but really slowly; think about _id pagination and parallel processes if you need to complete the migration fast and load is not a problem.

Big thanks for reading up to this moment; I hope it will help you become a better dev.


Follow me on Twitter, get connected on LinkedIn, and see code examples on GitHub!
