Adventures in mongoDB Oplog Troubleshooting

Jason Ford
Snagajob Engineering
5 min read · Jul 25, 2022

One of the most important things to monitor when running mongoDB clusters in production is the length of your oplog window (the span of time the oplog's entries cover). The oplog is essential to mongoDB's high availability guarantees, and a short oplog means your cluster is far less resilient to outages, whether planned or unplanned.

On one of our replica sets, whose primary job is storing active job postings, we were having trouble keeping the oplog at a reasonable length. The storage size (compressed, using WiredTiger) of our data was about 700GB, with our oplog storage at 150GB. That oplog represented just a day and a half of activity, and since our data size was essentially flat, it suggested we were somehow rewriting over 20% of our data every 36 hours. This was an annoyance until a particularly large data import pushed our oplog window under 8 hours. At that point we knew we needed to get to the bottom of what was going on.
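
If you want to check the same numbers on your own cluster, the shell has built-in helpers for this. A quick sketch, run against a replica set member (the exact output fields can vary a little between shell versions):

// Print the configured oplog size and the time window its entries cover
rs.printReplicationInfo()

// Or pull the same figures programmatically, e.g. to feed a monitoring check
const info = db.getReplicationInfo()
print(`oplog window: ${info.timeDiffHours} hours (${info.usedMB} MB used of ${info.logSizeMB} MB)`)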

To better understand the writes to our cluster, we decided to take a look at what was in our oplog (fun fact: you can query the oplog just like any other collection in mongoDB). A simple aggregation query told us how many writes were in the oplog for each collection, and of what kind:

db.oplog.rs.aggregate([
  // Group oplog entries by namespace (collection) and operation type
  {
    '$group': {
      '_id': { 'ns': '$ns', 'op': '$op' },
      'total': { '$sum': 1 }
    }
  },
  // Sort the busiest namespace/operation pairs to the top
  {
    '$sort': { 'total': -1 }
  }
])

The results largely showed us what we expected: our most active collections, with a mix of inserts, updates, and deletes (represented by an i, u, or d, respectively, in the `op` field). But one thing jumped out at us that we didn't expect. Here's a sample of the results of the above query for our posting collection:

{ _id: { ns: 'posting.posting', op: 'u' }, total: 3513007 },
{ _id: { ns: 'posting.posting', op: 'n' }, total: 2567712 },
{ _id: { ns: 'posting.posting', op: 'i' }, total: 441471 },
{ _id: { ns: 'posting.posting', op: 'd' }, total: 101561 },

…what’s an n operation? And why did we have two and a half million of them?
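
If you want to eyeball these entries yourself, you can pull a few of them straight out of the oplog. An illustrative query (swap in your own namespace):

// Grab the three most recent 'n' entries for the collection in question
db.oplog.rs.find({ ns: 'posting.posting', op: 'n' })
  .sort({ $natural: -1 })
  .limit(3)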

It turns out that 'n' in the oplog stands for no-op, a non-operation, whose traditional use is in idle clusters to confirm replication is still in sync in the absence of other writes. That's obviously not what was going on here, but some additional searching led us to a few other users who had encountered the same issue. What we found was quite a bit different from what we initially expected: the explanation lay in the combination of the findAndModify command, retryableWrites, and a change made in the mongoDB 4.2 release. A bit of explanation is in order:

findAndModify is a handy command, somewhat unique to mongoDB, that lets you atomically update a document and get back the document as it was before the update happened. This is useful if you need to increment a counter and read off its value in a single step, among other things.
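
As a concrete sketch (the counters collection and field names here are made up for illustration), the classic fetch-and-increment pattern looks like this:

// Atomically increment a sequence counter; by default findAndModify returns the
// document as it looked before the update, so we get the prior value back
db.counters.findAndModify({
  query: { _id: 'posting_seq' },
  update: { $inc: { value: 1 } }
})

Passing new: true in that call flips the behavior and returns the post-update document instead.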

retryableWrites is a handy feature, first introduced in mongoDB version 3.6, that automatically retries a write a single time if the initial attempt failed due to a network error or the lack of a primary. These errors most commonly occur during the brief window in a replica set election/failover when there is no writable node, and waiting a moment and retrying is usually enough to make them go away. When the feature was first introduced it was disabled by default, and you needed to pass retryWrites=true in your connection string to take advantage of it. Starting with the drivers released alongside version 4.2, the default flipped: the feature is enabled automatically unless explicitly disabled in the connection string.
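
In connection-string terms the difference is a single query parameter (host and database names here are placeholders):

// Pre-4.2-era drivers: retryable writes are opt-in
mongodb://db1.example.com,db2.example.com/posting?replicaSet=rs0&retryWrites=true

// 4.2-era drivers and later: on by default, so you only spell it out to turn it off
mongodb://db1.example.com,db2.example.com/posting?replicaSet=rs0&retryWrites=false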

Being able to retry writes is a great thing, as it allows your mongoDB replica sets to switch primaries during the day with essentially no interruption to your application. At Snagajob we had built this capability into our codebase a decade ago, when we first adopted mongoDB, for this very reason. As such, we didn't pay much attention when the feature was added to the drivers and database themselves.

How is all this relevant to our problem? It all came together when we came across a ticket on mongoDB’s core server Jira board: SERVER-30407, “Make findAndModify retryable”, which contains this description:

This will make findAndModify commands with retryability generate an additional type “n” oplog entry that will contain the pre or post image of the update/delete operation. In addition, the actual update/delete oplog entry will contain a preImageTs/postImageTs field that will point to this “n” oplog entry.

To summarize: with retryable writes enabled, every findAndModify command writes two records to the oplog, a “u” entry containing the update and an “n” entry containing an entire copy of the document being updated. Could this be it? The smoking gun to our oplog problem? To test, we simply updated our connection string to disable retryableWrites (after all, we already had retry logic built into our own codebase). Here’s what happened:

That change took us from ~36 hours to around 10 days of oplog. Our problem, as it were, was solved.
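
A quick sanity check after a change like this is to re-run the earlier aggregation, narrowed down to no-ops for the collection in question, and confirm the count stops growing:

// Count the no-op entries currently in the oplog for the posting collection
db.oplog.rs.aggregate([
  { '$match': { 'ns': 'posting.posting', 'op': 'n' } },
  { '$count': 'noop_entries' }
])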

It’s a problem I imagine few users of mongoDB will come across: enabling retryable writes in your connections is generally a good idea, and you would need to make extensive use of findAndModify to see the downside we did. The good news is that the recently released version 6.0 ships a much better default for making findAndModify retryable: instead of storing the document images in the oplog, they’re stored in a separate collection.
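
If you're on an earlier 5.x release and want to opt into that behavior ahead of time, my understanding (going from memory, so verify the parameter name against the release notes for your version) is that it's controlled by a server parameter along these lines:

// Assumed parameter name; confirm against the docs for your server version
db.adminCommand({ setParameter: 1, storeFindAndModifyImagesInSideCollection: true })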

A couple of takeaways from this experience:

  • Query your oplog — You can learn a lot about your data usage patterns here in ways that aren’t easily surfaced through typical monitoring
  • Expect the unexpected — driver and server upgrades with mongoDB are generally a pretty painless process, but it’s worth walking through the release notes and critically considering the impact of new features and behaviors, even if it’s not immediately apparent how your application will be impacted.

As for me, I’m overjoyed to have a nice, comfortable oplog cushion again, one that will help guard us against ever-increasing write volume.
