Mission Possible: Resize MongoDB Capped Collections Without Downtime
Why our MongoDB servers had almost no disk space left, and how we averted the imminent disaster without downtime.
TL;DR: MongoDB supports capped collections with a specified maximum size, but once created, they cannot be resized without downtime. Here’s how we got around it.
Recap: Capped Collections
As written in MongoDB’s documentation:
Capped collections are fixed-size collections that support high-throughput operations that insert and retrieve documents based on insertion order. Capped collections work in a way similar to circular buffers: once a collection fills its allocated space, it makes room for new documents by overwriting the oldest documents in the collection.
Capped collections are (nearly) perfect to store documents for a certain period or a well-defined quantity. Neither you nor MongoDB needs to actively delete the oldest documents. They will just be overwritten by the newest ones, once the capped collection reaches its max size.
In this regard, they are superior to TimeToLive (TTL) indexes because capped collections don’t need an index to find and eventually delete the oldest documents. Capped collections also don’t require scanning the whole collection for “outdated” documents, as TTL indexes do once per minute. The database therefore needs less RAM and fewer CPU cycles, which improves overall database performance.
Speaking of performance: if you need to visually spot and analyze slow MongoDB operations, I suggest you check out idealo’s MongoDB slow-operations-profiler, which has been open-sourced on github.com. I’ve also written about it in another story.
Our Capped Collection
We need to store log data for at least 90 days; the longer the retention time, the better. We are using three dedicated bare-metal servers in a replica set for this purpose. Each server has 376 GB RAM, 56 CPU cores, and 5.8 TB of SSDs in RAID 10.
The Max Size Problem
Since we wanted to store the maximum quantity of log data, we wanted to cap the collection at 80% of our disk capacity. Naively, one would use the following command to create the capped collection, given that 5101733952992 bytes are 80% of 5.8 TB:
db.createCollection("offerMapping", {capped:true, size:5101733952992})
The problem is that size defines the maximum size of the uncompressed data, while MongoDB has been storing data compressed on disk for many years.
Historically Grown
MongoDB’s original storage engine, MMAPv1, stored data uncompressed on disk, so the size parameter made sense at that time.
However, the current storage engine, WiredTiger, stores data compressed on disk. WiredTiger was introduced in MongoDB 3.0 and became the default storage engine in version 3.2, released in 2015.
The MMAPv1 storage engine has been deprecated since version 4.0, released in 2018.
Today’s Problem
In order to create the capped collection with the correct size, you need to estimate how well WiredTiger can compress your data.
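Such an estimate can only be made empirically. WiredTiger’s default block compressor is snappy, which isn’t in Python’s standard library, so the sketch below uses zlib as a rough stand-in, on made-up sample documents; in practice you would measure a dump of a few thousand of your real documents instead:

```python
import json
import zlib

# Hypothetical sample of log documents (illustration only; measure a dump
# of your real documents in practice).
sample = [
    {"offerId": i, "shopId": i % 500, "price": round(1.99 + i * 0.01, 2),
     "message": "price updated for offer %d" % i}
    for i in range(10_000)
]

raw = json.dumps(sample).encode("utf-8")
compressed = zlib.compress(raw, 6)  # zlib as a rough proxy for snappy
ratio = len(raw) / len(compressed)

print(f"uncompressed: {len(raw)} B, compressed: {len(compressed)} B, "
      f"ratio: {ratio:.1f}x")

# With this ratio, filling at most 80% of a 5.8 TB disk allows a capped
# collection of roughly this (uncompressed) size:
disk_bytes = 5.8 * 1024**4
print(f"size parameter ≈ {int(disk_bytes * 0.8 * ratio):,} bytes")
```

The real compression factor depends heavily on the documents’ content, which is exactly what made our estimate drift over time, as described below.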
About a year ago, we ran such tests and calculated that MongoDB could compress up to 26 TB of uncompressed data into 80% of our 5.8 TB hard disk. Naively, one would use the following command to create the capped collection, given that 28587302322800 bytes are 26 TB:
db.createCollection("offerMapping",{capped:true, size:28587302322800})
What’s wrong with this?
Well, even if this capped collection is the only collection on the server, you still have to account for the sizes of the oplog and the indexes. Both are also stored compressed on disk by WiredTiger.
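As a back-of-the-envelope sketch, the budget left for the capped collection itself is what remains after subtracting the oplog and the indexes; all concrete figures below are assumptions for illustration, not our actual oplog or index sizes:

```python
# Rough on-disk budget for a capped collection (assumed figures).
TB = 1024**4
GB = 1024**3

disk = 5.8 * TB        # total SSD capacity
budget = 0.8 * disk    # target: fill at most 80% of the disk

oplog = 250 * GB       # assumed on-disk oplog size
indexes = 300 * GB     # assumed on-disk index size

collection_budget = budget - oplog - indexes
print(f"compressed budget for the capped collection: "
      f"{collection_budget / TB:.2f} TB")
```

Only this remaining budget, multiplied by the estimated compression factor, should go into the size parameter.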
Even though we calculated all these parameters (almost) correctly, our server almost ran out of available disk space.
How could this happen?
This happened because the content of the documents stored in the collection became less compressible over time. For the same reason, the index sizes increased.
Both meant that the disk would soon have had almost no free space left.
That’s why it makes no sense to specify the size of a capped collection in terms of uncompressed data while the default MongoDB storage engine, WiredTiger, stores data compressed on disk.
I’ve created a feature request at mongodb.org with high priority. A short time later, it was downgraded to Minor priority by MongoDB staff, which means it will probably never be implemented, judging by all the “Minor” feature requests that are already more than 10 years old.
I’d be glad if you could vote for it!
It Could Be So Easy
A workaround would be to simply resize the capped collection. That has been possible since MongoDB version 3.6, but only for the oplog. The oplog is itself just a capped collection, so I wonder why the MongoDB engineers did not implement this feature for all capped collections. Do they think it’s good enough to guess the maximum size of the compressed data, i.e. to know how well WiredTiger may compress the documents now and in the future?
As of today, there is no documented way to resize capped collections without downtime, even though such unresolved feature requests have existed for 10 years already (e.g. SERVER-1864, which also has a priority of only Minor).
The standard way to resize a capped collection is to create a new capped collection with the right maximum size, stop all writing database clients, copy all documents from the old to the new collection, restart the database clients, and finally delete the old collection.
In our case, we would have to copy more than 25 TB of uncompressed data! Do the engineers at MongoDB really want us to be offline for that long?
Our Workaround
Initially, we naively tried to create a smaller capped collection with the same name on one of the replica set members, which we restarted in maintenance mode as a standalone server. Then we wanted to mongodump the last 90 days of log data from the replica set and pipe it to mongorestore on the standalone server. The last step would have been to add the standalone server back to the replica set.
This did not work because, since version 3.6, MongoDB uses immutable UUIDs to identify its collections internally, and the collection UUID must be the same across all members of a replica set.
Unfortunately, the createCollection command does not support passing a UUID, so the collection is created with a random UUID which differs from the collection UUID on the other replica set members. Once the standalone member rejoins the replica set, MongoDB refuses to replicate due to the mismatched collection UUID and shuts down the newly added server.
Thus, mongodump and mongorestore to the rescue:
When mongodump writes data to disk, it creates for each collection a collectionName.metadata.json file belonging to the collection named collectionName. Such a file looks like this:
{"options":{"capped":true,"size":{"$numberLong":"28587302322800"}},"indexes":[{"v":2,"key":{"_id":1},"name":"_id_","ns":"changelog.offerMapping"},{"v":2,"key":{"offerId":1.0},"name":"offerId_1","ns":"changelog.offerMapping"}],"uuid":"fbb83cda241d45779dc88983351d5447"}
As you can see, besides the indexes, which will be recreated after the collection has been restored, the file also contains the size and the UUID of the collection. We can modify the size to our new value.
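You can edit the file by hand; here is a minimal Python sketch of the same edit (the JSON is trimmed to the relevant fields) that halves the cap while keeping mongodump’s extended-JSON wrapper {"$numberLong": "<bytes>"} intact:

```python
import json

# A metadata.json as produced by mongodump, trimmed to the relevant part:
meta = json.loads(
    '{"options":{"capped":true,"size":{"$numberLong":"28587302322800"}},'
    '"uuid":"fbb83cda241d45779dc88983351d5447"}'
)

# Halve the cap. The value is MongoDB extended JSON, so the size must stay
# a string inside the {"$numberLong": ...} wrapper.
old_size = int(meta["options"]["size"]["$numberLong"])
meta["options"]["size"]["$numberLong"] = str(old_size // 2)

print(json.dumps(meta, separators=(",", ":")))
```

The uuid field must be left untouched, because preserving it is the whole point of the workaround.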
Since I was only interested in getting this metadata.json file, I added the -q parameter to mongodump to match no documents:
mongodump -h ${HOST}:${PORT} -d ${DB} -c ${COLLECTION} -u ${USER} -p ${PASS} -q '{ "_id" : "foo" }' --out /data/backup/
Once the size within the metadata.json file was modified, I could restore the (empty) collection with mongorestore and its parameter --preserveUUID:
mongorestore -h ${HOST2}:${PORT2} -d ${DB} -c ${COLLECTION} -u ${USER} -p ${PASS} --drop --preserveUUID /data/backup/${DB}/${COLLECTION}.bson
Now the new capped collection, with the same UUID but a different size, exists on the standalone server. We can dump our last 90 days of log data and restore it on the standalone server on the fly:
mongodump -h ${HOST}:${PORT} -d ${DB} -c ${COLLECTION} -u ${USER} -p ${PASS} --archive -q '{_id: {$gt: new ObjectId("5eaab2e00e55cebb702911e1")}}' | mongorestore -h ${HOST2}:${PORT2} -u ${USER} -p ${PASS} --maintainInsertionOrder --archive
In order to pipe mongodump into mongorestore, we have to use the --archive flag, which in turn makes the -d and -c parameters of the mongorestore command obsolete.
The parameter --maintainInsertionOrder is very important: if you omit it, mongorestore inserts documents in an arbitrary order, which is dangerous for a capped collection, where the first inserted documents are the first to be overwritten once the collection is full (FIFO). If the insertion order is not preserved during the restore, the documents being overwritten will not be the oldest ones.
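The danger can be pictured with a bounded deque, which evicts in the same FIFO fashion as a full capped collection:

```python
from collections import deque

# A capped collection behaves like a fixed-size circular buffer:
# once full, the oldest *inserts* are overwritten first (FIFO).
capped = deque(maxlen=3)
for doc in [1, 2, 3, 4, 5]:       # documents in timestamp order
    capped.append(doc)
print(list(capped))                # the three newest survive: [3, 4, 5]

# If the restore scrambles the insertion order, "oldest insert" no longer
# means "oldest by timestamp":
scrambled = deque(maxlen=3)
for doc in [4, 1, 5, 3, 2]:        # same documents, arbitrary order
    scrambled.append(doc)
print(list(scrambled))             # [5, 3, 2] — newer doc 5 now dies before 3
```

With the scrambled order, the next eviction would hit document 5 even though documents 3 and 2 are older, which is exactly the problem --maintainInsertionOrder prevents.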
Once mongodump and mongorestore have finished, you end the maintenance period and put the standalone server back into the replica set. It will then automatically replay the oplog and eventually come back in sync.
As for us, we reduced the capped collection to 50% of its previous size, which on the one hand is large enough to store at least the last 90 days of log data, and on the other hand leaves enough headroom for a growing amount of log data per day or an even lower compression factor.
As you can see in the screenshot, only slightly more than 30% of the disk space was occupied once the log data of the last 90 days had been restored. From then on, the capped collection continues to grow slowly until it reaches its maximum size which, given the current compression factor, should occupy about 50% of the total disk capacity.
Final Steps
For the rest of the replica set members you can, of course, proceed as you did with the first one, but there is an even easier way: since one replica set member (here ${HOST2}) already has the correctly sized capped collection, you can simply use it as a donor and restore the data onto the next standalone member (here ${HOST3}):
mongodump -h ${HOST2}:${PORT2} -d ${DB} -c ${COLLECTION} -u ${USER} -p ${PASS} --archive | mongorestore -h ${HOST3}:${PORT3} -u ${USER} -p ${PASS} --maintainInsertionOrder --archive --drop --preserveUUID
If you do it this way, it’s important that mongorestore uses the --drop flag to drop the still existing old capped collection. This time you also need to add the --preserveUUID flag so that mongorestore preserves the collection UUID when it creates the new capped collection on the standalone server.
Did You Say ObjectId?
You may have wondered why I used an ObjectId to retrieve the last 90 days of documents. Our documents have a timestamp field, but it’s not indexed, so querying it would take very long.
However, the primary key _id of the capped collection is an ObjectId, which contains the timestamp in its first 4 bytes. Knowing this, we can calculate in the mongo shell an ObjectId that corresponds to a date 90 days ago:
> var now = new Date()
> var d = new Date(now.getTime()-(90*24*60*60*1000));
> var dhex = (Math.round(d.getTime()/1000)).toString(16)
> var oid = new ObjectId(dhex + "0000000000000000")
Then we need a second ObjectId that differs a bit, say by one hour, because it’s unlikely that the first calculated ObjectId actually exists, due to the added padding of zeros.
> var d2 = new Date(d.getTime()+(60*60*1000));
> var d2hex = (Math.round(d2.getTime()/1000)).toString(16)
> var oid2 = new ObjectId(d2hex + "0000000000000000")
With these two ObjectIds we can get a document that was created within the time span between them:
> db.offerMapping.find({_id:{$gt:oid, $lt:oid2}}, {_id:1}).limit(1)
{"_id" : ObjectId("5eaab2e00e55cebb702911e1")}
And that’s the _id I used in my query to get the last 90 days of documents.
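The same trick works in plain Python with nothing but the standard library; objectid_hex_for is a hypothetical helper name, and it relies only on the documented fact that an ObjectId’s first 4 bytes are the creation time in seconds since the epoch, big-endian:

```python
from datetime import datetime, timezone

def objectid_hex_for(dt: datetime) -> str:
    """Smallest ObjectId hex string whose first 4 bytes (the embedded
    creation timestamp) encode the given UTC datetime."""
    seconds = int(dt.timestamp())
    return format(seconds, "08x") + "0" * 16  # pad the remaining 8 bytes

# The _id found above, ObjectId("5eaab2e00e55cebb702911e1"),
# carries 0x5eaab2e0 in its first 4 bytes:
created = datetime.fromtimestamp(0x5eaab2e0, tz=timezone.utc)
print(created)                    # 2020-04-30 11:13:36+00:00
print(objectid_hex_for(created))  # 5eaab2e00000000000000000
```

This is handy for building range queries on _id without a mongo shell at hand.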
And that’s it! I hope you enjoyed reading and gained some new helpful insights!
If you found this article useful, give me some high fives 👏🏻 and share it with your friends so others can find it too. Follow me here on Medium (Kay Agahd) to stay up-to-date with my work. Thanks for reading!
Btw. idealo is hiring: Check out our vacancies here.