Preface
In this article I’m going to cover the procedure needed to salvage and repair your MongoDB database when the “WiredTiger.wt” file is corrupted. I’ll walk you through the recovery journey that took us a few days and how we got through it.
All the internet references on how to remedy such a disaster suggest that whenever the aforementioned file is missing/corrupted and you have no backups — you can pretty much kiss your data goodbye and start mourning.
Also, all possible solutions to MongoDB-related data corruption refer to a broken/corrupted collection or table, not one of the ‘WiredTiger’ files.
In those cases, the ‘wt’ command-line utility might be able to restore your corrupted collections.
But here we are specifically addressing a situation in which your collections exist, but the metadata file (“WiredTiger.wt”) is missing/corrupted.
When that happens, even the folks at Atlas Support would tell you that you’re screwed, since this file is crucial to MongoDB collections. A StackOverflow answer on the subject states the same.
One thing everyone agrees on: the file cannot be re-created or restored with the conventional tools at your disposal. No matter how hard you try.
And we did. Believe you me.
However, restoring this file is not the only method to recover your data.
If your data is critical to you and you still wish to be able to access it again, then go ahead and make some coffee, stick around and read through. It will be long and tedious, but I guarantee it’ll be worth your while. It happened to our team and now I’m here writing about it for all the poor bastards that might get into the same predicament.
Our Problem
We’re running MongoDB version 3.2.7 on a standalone instance hosted at AWS and perform disk snapshot backups twice a day. On the very same day that we planned to migrate the server into a ReplicaSet in Kubernetes, the server had an unknown issue and apparently didn’t shut down gracefully. As a result, the ‘WiredTiger.wt’ file size was a suspiciously round 4096 bytes and MongoDB failed to start with the following error:
[1617558227:741955][14734:0x7f921ad91740], file:WiredTiger.wt, connection: WiredTiger.wt read error: failed to read 4096 bytes at offset 73728: WT_ERROR: non-specific WiredTiger error
Unfortunately, all the backups that we had contained the same file at the same size and were rendered useless at that point. The root cause is still unknown to me (perhaps the data existed only in memory and was never flushed to the file?).
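Since the telltale sign in our case was a metadata file of exactly 4096 bytes, a quick check along these lines could flag a data directory or a restored snapshot before you rely on it. This is only a minimal sketch: the function name ‘wt_metadata_suspicious’ and the 4096-byte threshold are assumptions drawn from this one incident, not an official WiredTiger health check.

```shell
# Hypothetical helper: flags a dbpath whose WiredTiger.wt is missing or
# no larger than 4096 bytes (the size we observed on our corrupted file).
# The threshold is an assumption from this incident, not a WiredTiger rule.
# Returns 0 when the file looks suspicious, 1 when it looks plausible.
wt_metadata_suspicious() {
  dbpath=$1
  file="$dbpath/WiredTiger.wt"
  if [ ! -f "$file" ]; then
    echo "missing: $file"
    return 0
  fi
  size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
  if [ "$size" -le 4096 ]; then
    echo "suspicious: $file is $size bytes"
    return 0
  fi
  echo "ok: $file is $size bytes"
  return 1
}
```

For example, running wt_metadata_suspicious /data/old-db against each mounted snapshot would have told us right away that none of our backups was usable.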
We scoured the Internet looking for potential solutions and even reached out to Atlas Support with a lucrative offer to help us out, but to no avail. Everything seemed lost.
I called an old friend of mine who, I knew, had a DBA consulting company back in the day called Brillix. He answered right away and referred me to their co-founder and CEO to try and solve the problem. That CEO is a super nice, highly skilled, hardcore MongoDB expert named Alon Spiegel.
Alon called me as soon as he had heard of our problem. He sounded very skeptical about our chances of recovering the data, since the 3.2.x version of MongoDB is notorious for potential data corruption (they improved significantly in later versions). He did suggest, though, that if he were next to a computer (he was celebrating his son’s birthday), he’d probably create a new empty database and replace its new collection files with the old ones that we had, in an effort to “manipulate” MongoDB into reading the data inside those collections. “Not sure it’ll work, but I’d give it a try”, were his exact words.
Intrigued as we were by the idea, we couldn’t keep him on the line much longer due to obvious family obligations, and we didn’t quite understand what he was suggesting, as the idea lacked specifics.
Little did he know that his ‘far-fetched’ idea would be the precursor to the final solution. We have to give credit where credit is due (Thanks Alon!).
After more than 24 hours of trying to crack this, trying different things, shooting in the dark and almost losing all hope, our super talented and tenacious Head of R&D, Eliran Yasharel, successfully restored one collection (out of a total of almost 50).
The Solution
We deployed a MongoDB server version 4.0.5, since we knew the repair capabilities of later versions were far superior to those of our 3.2.7 version. We also attached the disk that contained our precious data to the instance running the new MongoDB server.
From this point on, I’ll write in an instructional fashion:
Once deployed, create a new empty directory that will be used as the ‘dbpath’ of the new server. Directory paths are as follows:
- New Database collection files are stored in: /data/new-db
- Old Database collection files exist in: /data/old-db
Start the database application in the foreground by issuing:
mongod --dbpath /data/new-db
Once the server is up and running, connect to it via the ‘mongo’ command-line utility and create a new database. I called it ‘Recovery’ (but it’s best to name it after the original database). Then, insert some arbitrary document into a new collection called ‘newDummyCollection1’, which represents the first collection that you’re going to recover:
use Recovery
db.newDummyCollection1.insert({arbitrary: "value"})
The shell should confirm that a single document was inserted: WriteResult({ "nInserted" : 1 })
It’s important to note that you MUST put some document/data inside a collection in order to be able to manipulate MongoDB later on.
The next step is to determine the location on the disk in which the new collection you’ve just created has been saved to. This is done by running:
db.newDummyCollection1.stats()
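This yields a long JSON output. Look for the ‘uri’ field inside the ‘wiredTiger’ object; its value ends with the name of the collection file, here ‘statistics:table:collection-7-709196944276175444’.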
In this case, the full filename and location on the disk would be:
/data/new-db/collection-7-709196944276175444.wt
Confirm the file exists, for example by listing that path with ‘ls -l’.
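If you’d rather not pick the file name out of the stats output by eye, the lookup can be scripted. The following is a minimal sketch: the helper name ‘wt_uri_to_path’ is made up for illustration, and it assumes the ‘uri’ value has the form ‘statistics:table:<name>’, as it did for us.

```shell
# Hypothetical helper: turns the 'uri' reported by stats() into a file path.
# Assumes the uri looks like "statistics:table:<name>", as in our output.
wt_uri_to_path() {
  dbpath=$1
  uri=$2
  table=${uri##*:}              # drop everything up to the last colon
  printf '%s/%s.wt\n' "$dbpath" "$table"
}

# Example usage (needs a running mongod, so it is left commented out):
#   uri=$(mongo Recovery --quiet --eval 'print(db.newDummyCollection1.stats().wiredTiger.uri)')
#   path=$(wt_uri_to_path /data/new-db "$uri")
#   ls -l "$path"
```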
Good. At this point, we stop the database and exit the mongo shell!
Now comes the manipulation part: select a collection you wish to restore. In this case I’m restoring the collection file named “collection-13--8474722577232649897.wt”, so let’s copy it over the new collection file we located in the previous step, overwriting it:
cp /data/old-db/collection-13--8474722577232649897.wt \
/data/new-db/collection-7-709196944276175444.wt
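Because a typo in either path is easy to make at this stage, you may want to wrap the copy in a few sanity checks. This is a sketch under our own assumptions (the function name ‘overwrite_collection_file’ is hypothetical); it must only be run while mongod is stopped.

```shell
# Hypothetical helper: overwrite a placeholder collection file with an old one.
# mongod must be stopped before calling this.
overwrite_collection_file() {
  old_file=$1
  new_file=$2
  [ -f "$old_file" ] || { echo "old file not found: $old_file" >&2; return 1; }
  # The target must already exist: it is the file mongod created for the
  # dummy collection, so a typo should fail instead of creating a stray file.
  [ -f "$new_file" ] || { echo "placeholder not found: $new_file" >&2; return 1; }
  cp "$old_file" "$new_file" || return 1
  # Byte-for-byte comparison to confirm the copy is complete.
  cmp -s "$old_file" "$new_file"
}
```

Refusing to write to a path that doesn’t exist yet is deliberate: the whole trick relies on reusing a file name that the new server already knows about.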
Once the file has been copied successfully, run the ‘repair’ operation:
mongod --dbpath /data/new-db --repair
Now you’ll see that it repairs the collection and salvages the data:
2021-04-06T13:57:35.813+0000 E STORAGE [initandlisten] WiredTiger error (0) [1617717455:813361][787:0x7f6a10b78a40], file:collection-7-709196944276175444.wt, WT_SESSION.verify: __verify_filefrag_chk, 474: file ranges never verified: 1 Raw: [1617717455:813361][787:0x7f6a10b78a40], file:collection-7-709196944276175444.wt, WT_SESSION.verify: __verify_filefrag_chk, 474: file ranges never verified: 1
2021-04-06T13:57:35.813+0000 I STORAGE [initandlisten] Verify failed on uri table:collection-7-709196944276175444. Running a salvage operation.
2021-04-06T13:57:35.819+0000 I INDEX [initandlisten] build index on: Recovery.newDummyCollection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "Recovery.newDummyCollection1" }
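When you repeat this for dozens of collections, it helps to save the repair output to a file and pull out the tables that were actually salvaged. A minimal sketch follows; the function name ‘salvaged_tables’ is hypothetical, and the pattern is copied from the log lines above, so other MongoDB versions may word the message differently.

```shell
# Hypothetical helper: lists tables for which the repair log reports a salvage.
# The pattern mirrors the "Verify failed on uri table:... Running a salvage
# operation." line from our 4.0.5 log; other versions may differ.
salvaged_tables() {
  logfile=$1
  sed -n 's/.*Verify failed on uri table:\([^ ]*\)\. Running a salvage operation\..*/\1/p' "$logfile"
}

# Example usage (left commented out because it runs a real repair):
#   mongod --dbpath /data/new-db --repair 2>&1 | tee /tmp/repair.log
#   salvaged_tables /tmp/repair.log
```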
Start the database again:
mongod --dbpath /data/new-db
Verify that the data has been restored, for example by counting or sampling the documents in ‘newDummyCollection1’ from the mongo shell.
Voilà! A few steps ago we only had the one arbitrary document that we inserted into that collection; now MongoDB has erased that document and replaced it with the data from our old collection by repairing it.
Repeat this operation for each and every collection that you have by incrementing the ‘newDummyCollection’ number for each collection.
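With almost 50 collections, it’s worth writing down up front which old file goes into which dummy collection. The sketch below only prints that plan and touches nothing; the function name ‘plan_recovery’ is made up, and it assumes the old collection files follow the ‘collection-*.wt’ naming seen above (index files are skipped, since indexes get rebuilt later anyway).

```shell
# Hypothetical helper: prints one "dummy-name old-file" pair per collection
# file found in the old dbpath, numbering the dummy collections from 1.
# It is read-only: nothing is copied or modified.
plan_recovery() {
  old_dir=$1
  n=0
  for f in "$old_dir"/collection-*.wt; do
    [ -e "$f" ] || continue      # the glob matched nothing
    n=$((n + 1))
    printf 'newDummyCollection%d %s\n' "$n" "${f##*/}"
  done
}
```

Running plan_recovery /data/old-db > plan.txt gives you a checklist to tick off as you create, overwrite and repair each collection in turn.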
However, don’t pop the champagne just yet: you still have one laborious task left to complete the restoration process. Since the original table/collection names were stored inside the ‘WiredTiger.wt’ file and are now lost, you’ll have to manually browse through each collection and determine what kind of data it holds. Then, you’ll have to rename the collection to match the original table name (otherwise your application would err, since it does not know what ‘newDummyCollectionX’ represents).
After reviewing the data inside the collection we recovered, we concluded that it was originally named ‘transactions’, so we rename it:
db.newDummyCollection1.renameCollection("transactions")
Once you have recovered all the collections and renamed them properly, you probably want to import this data back into mongo 3.2.7 (remember, the restoration process was done on version 4.0.5).
You’d be very much surprised to hear that the ‘mongodump’ command of version 4.0.5 is working beautifully with the ‘mongorestore’ command of version 3.2.7!
This means that we were able to repair the data on mongo version 4.0.5, dump it using ‘mongodump’ version 4.0.5, copy the dump to a brand new and empty MongoDB version 3.2.7 and restore it using version 3.2.7 of ‘mongorestore’!
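As a rough sketch of that last hop, the two commands can be wrapped like this. The function name, the MONGODUMP / MONGORESTORE variables (which let you point at the binaries of each version) and the idea of reaching both servers on localhost ports are assumptions for illustration; in our case the dump was copied to a separate machine first.

```shell
# Hypothetical helper: dump from the repaired 4.0.5 server, then restore the
# dump into a fresh, empty 3.2.7 server.
# MONGODUMP / MONGORESTORE are assumed variables pointing at the binaries of
# the respective versions; they default to whatever is on the PATH.
dump_and_restore() {
  src_port=$1    # port of the 4.0.5 server holding the repaired data
  dst_port=$2    # port of the empty 3.2.7 server
  dump_dir=$3    # directory that will hold the BSON dump
  "${MONGODUMP:-mongodump}" --host localhost --port "$src_port" --out "$dump_dir" || return 1
  "${MONGORESTORE:-mongorestore}" --host localhost --port "$dst_port" "$dump_dir"
}
```

If the two servers live on different machines, run the two commands separately and copy the dump directory across in between.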
One word on indexing
You’ll still need to rebuild your database indexes. In our case we keep them in version control and ran a script that rebuilt them in minutes. If you don’t have them saved somewhere, I’m leaving that for you to figure out.
A Final Word on Backups
After such an ordeal, the experts at Brillix suggested that, in their experience, a disk snapshot backup is not ideal as the only means of MongoDB backup. They recommend using the native MongoDB backup tools (i.e. ‘mongodump’) as the primary method and keeping the disk snapshot as a secondary one. Although running ‘mongodump’ on a large database with lots of collections might put some strain on the underlying hardware, it’s worth the cost in resources, as data reliability increases tenfold.
I hope someone out there finds this article useful. If so, please feel free to comment and share! Thanks to everyone involved in solving this incident.
-Ido Ozeri