Digging into a MongoDB 3.6.20 concurrency bug
What to do when Mongo won’t return your calls
Some time ago, one of our internal users reported a frustrating problem in our ETL job system. One of their jobs got stuck over the weekend, and all attempts to cancel it had no effect. It was the third weekend in a row that this had happened.
We use MongoDB to store the metadata for these jobs, as its document structure makes it a good fit for describing heterogeneous jobs with varying numbers (and types) of steps. The Java server running the job was still online and responding to requests, but a jstack dump showed that the job thread was hung waiting on a response from MongoDB. All it was trying to do was update a single field in the Mongo document, but it was waiting forever for a response. The MongoDB primary was also aware of the same update: the connection showed up in db.currentOp() and was marked "active". Terminating the query from the Mongo shell instantly caused the Java thread to throw an exception, so it wasn't a communication problem. Both sides of the connection were alive, yet the update wasn't making any progress!
Not good.
Albert investigated and found that our weekly secondary rotation jobs were scheduled around the same time that the group of UPDATE commands fired. Comparing timestamps in the rotation job output strongly suggested that UPDATE commands issued during a replSetReconfig would hang and never return.
I started trying to reproduce the issue locally by mimicking the problematic UPDATEs and was surprised to find that my very simple Python script experienced the same problem:
I’ve pushed up scripts that I used to reproduce this issue here.
The fact that I could reproduce the issue with a simple Python script meant it wasn't something in the code we'd written on top of the Java Mongo client. The reproduction test case was so straightforward that I turned my sights on MongoDB itself.
background
This post assumes that you know a little about MongoDB: what primaries and secondaries are, what replication is and how it works, what an operation is, and so on.
MongoDB is written in C++ so it’s useful to have a little knowledge there.
MongoDB takes a long time to build
My Vena-issued T480s has a 4-core i5-8350U (with hyper-threading off), 24GB of DDR4 RAM, and a modern NVMe drive, and it still takes gollum 33 minutes to build the mongod and mongo CLI binaries.
scons: done building targets.
real 33m42.499s
user 119m10.756s
sys 7m56.016s
Also, did you know that an optimized mongod binary (when statically linked) is around 1.3G? I didn’t either, and now you know.
jwm@gollum ~/opt/mongo [39c200878284912f19553901a6fea4b31531a899] [] $ ls -lh mongod
-rwxr-xr-x 1 jwm jwm 1.3G Oct 21 17:23 mongod
Note: that SHA is for 3.6.20, our production cluster version.
hello gdb, my old friend
After building the debug binaries, I created a docker container using the resulting mongod and mongo binaries, and confirmed that the issue was still reproducible.
I reproduced the issue and used db.currentOp() to find the connection id ("conn15" in the example below):
...
{
    "host" : "ecb589a5d43b:27017",
    "desc" : "conn15",
    "connectionId" : 15,
    "client" : "172.17.0.1:36012",
    "clientMetadata" : {
        "driver" : {
            "name" : "PyMongo",
            "version" : "3.11.0"
        },
        ...
    },
    "active" : true,
    ...
Then I attached gdb to the primary mongod process and found the corresponding thread:
The conn15 thread was stuck waiting on a condition variable at src/mongo/db/operation_context.cpp, line 313, which is frame 4 in the above backtrace:
Note this op doesn't have a write concern timeout, so the deadline is Date_t::max(). I interpreted this as "if our operation has no deadline, wait forever for the condition variable".
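Python's threading.Condition behaves the same way, which makes the "no deadline" case easy to illustrate (a toy example of the waiting semantics, not Mongo code):

```python
import threading

cond = threading.Condition()
with cond:
    # With a finite deadline, wait() returns False when it times out
    # without being notified:
    signalled = cond.wait(timeout=0.1)
print(signalled)  # False

# With no deadline -- the analogue of Date_t::max() -- wait() blocks
# until someone calls notify(), potentially forever:
#     with cond:
#         cond.wait()   # no timeout: waits indefinitely
```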
Frame 5 is the lambda execution, and frame 6 is the calling context: src/mongo/db/repl/replication_coordinator_impl.cpp:1689 is inside a loop in a function called _awaitReplication_inlock:
The condition variable created here (the one we're waiting on in frame 4) is passed both to ThreadWaiter waiter and to opCtx->waitForConditionOrInterruptNoAssertUntil(...). Let's look at ThreadWaiter:
The ThreadWaiter is passed to a WaiterGuard on line 1682, so let's also look at WaiterGuard:
_doneWaitingForReplication_inlock does what you'd imagine: for an op with write concern majority, it waits for a majority of mongo instances to reach the op's opTime:
Based on those snippets, here was the mental model I had so far:
- there's a list of waiters in _replicationWaiterList
- these waiters refer to condition variables
- if _doneWaitingForReplication_inlock returns false, the thread waits on the condition variable
- once the condition variable is notified, the thread wakes up and checks again whether replication is done
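That loop is the classic condition-variable pattern: wait, wake, re-check the predicate. Here's a toy stdlib-Python sketch of the same shape (all names are mine; done_waiting_for_replication stands in for _doneWaitingForReplication_inlock):

```python
import threading

class Replication:
    """Toy stand-in for the replication coordinator's wait loop."""

    def __init__(self):
        self.cond = threading.Condition()
        self.acked = 0    # how many nodes have acknowledged the op
        self.needed = 2   # write concern: majority of a 3-node set

    def done_waiting_for_replication(self):
        # stand-in for _doneWaitingForReplication_inlock
        return self.acked >= self.needed

    def await_replication(self):
        with self.cond:
            # re-check the predicate after every wakeup, like the
            # loop in _awaitReplication_inlock
            while not self.done_waiting_for_replication():
                self.cond.wait()

    def node_acked(self):
        with self.cond:
            self.acked += 1
            self.cond.notify_all()  # wake waiters so they re-check

repl = Replication()
t = threading.Thread(target=repl.await_replication, daemon=True)
t.start()
repl.node_acked()
repl.node_acked()
t.join(timeout=5)
print(t.is_alive())  # False: the waiter saw the predicate become true
```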
I assumed those condition variables would be notified as opTimes on other mongo instances changed, but it looks like they are notified when the instance they're running on has committed a new opTime or something called a committed snapshot. Both _updateLastCommittedOpTime_inlock and _updateCommittedSnapshot_inlock call _wakeReadyWaiters_inlock:
Note that in _wakeReadyWaiters_inlock, the ThreadWaiter is removed from _replicationWaiterList if replication is done. This makes sense: that thread won't ever need to wait again.
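The signal-and-remove behaviour can be sketched with a toy Python model (the class and method names are mine, loosely mirroring signalAndRemoveIf_inlock):

```python
import threading

class Waiter:
    def __init__(self, op_time, cond):
        self.op_time = op_time
        self.cond = cond

class WaiterList:
    """Toy model of the waiter list's signal-and-remove behaviour."""

    def __init__(self):
        self._list = []

    def add_inlock(self, waiter):
        self._list.append(waiter)

    def signal_and_remove_if(self, predicate):
        remaining = []
        for w in self._list:
            if predicate(w):
                with w.cond:
                    w.cond.notify_all()  # wake the waiting thread
            else:
                remaining.append(w)      # keep waiters that aren't done
        self._list = remaining

# A committed opTime of 7 releases the waiters at or below it:
waiters = WaiterList()
for t in (5, 7, 9):
    waiters.add_inlock(Waiter(t, threading.Condition()))

last_committed = 7
waiters.signal_and_remove_if(lambda w: w.op_time <= last_committed)
print([w.op_time for w in waiters._list])  # [9]
```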
I wanted to print out the ThreadWaiter object information, but:
(gdb) frame 6
(gdb) print _replicationWaiterList._list.size()
$13 = 0
No more waiters! We've reached a state where there are no waiters in _replicationWaiterList, but we're still waiting on the condition variable to be notified. That means:
- after the thread woke up, _doneWaitingForReplication_inlock reported false (because we entered another iteration of the loop)
- nothing else will notify this thread's condition variable, because the reference to it is gone
This explains what we’re seeing: operations that never return.
So what removes waiters from this list?
- ~WaiterGuard()
- ReplicationCoordinatorImpl::_wakeReadyWaiters_inlock
- ReplicationCoordinatorImpl::shutdown
- ReplicationCoordinatorImpl::_updateMemberStateFromTopologyCoordinator_inlock
The last two seemed related to primary stepdowns, and that’s not something we see in this case. This bug is triggered when secondaries are rotated out — nothing affects the primary.
One hypothesis I had was that when a secondary goes away, corresponding waiters are removed from this list, but I scrapped it because replication shouldn’t rely on a designated secondary — any secondary should be able to report that its opTime is up to date and satisfy replication criteria.
Our procedure for rotating secondaries is straight from Mongo's documentation (https://docs.mongodb.com/manual/tutorial/remove-replica-set-member/), where we:
- Provision a new secondary.
- Run rs.add(new secondary) on the primary.
- Shut down mongod on the instance to be removed.
- Run rs.remove(old secondary) on the primary.
rs.remove() is an alias that calls the replSetReconfig admin command with a new configuration that no longer contains the instance to be removed.
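To make that concrete, here's a hypothetical Python helper that performs the config surgery rs.remove() does; the function name and structure are my own, and the resulting document is what you'd pass to the replSetReconfig admin command:

```python
def config_without(config, host):
    """Return a copy of a replica set config with `host` removed and
    the config version bumped, roughly what rs.remove() does."""
    members = [m for m in config["members"] if m["host"] != host]
    if len(members) == len(config["members"]):
        raise ValueError("host %r is not in the config" % host)
    new_config = dict(config, members=members)
    new_config["version"] = config["version"] + 1
    return new_config

config = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "mongo-a:27017"},
        {"_id": 1, "host": "mongo-b:27017"},
        {"_id": 2, "host": "mongo-c:27017"},
    ],
}
new = config_without(config, "mongo-c:27017")
print(new["version"], [m["host"] for m in new["members"]])
# 4 ['mongo-a:27017', 'mongo-b:27017']
```

With pymongo you'd then send the new document to the primary with something along the lines of client.admin.command("replSetReconfig", new) (again, a sketch; in practice rs.remove() does all of this for you).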
Issuing that admin command creates another thread that runs ReplicationCoordinatorImpl::processReplSetReconfig. There's a lot of machinery in there for validating the configuration update (among other things), and if all those checks pass, ReplicationCoordinatorImpl::_finishReplSetReconfig is called.
it’s a concurrency bug!
Looking at _finishReplSetReconfig, there's a lot of code in there to deal with a potential leader election (any replSetReconfig can trigger a primary stepdown and subsequent leader election). With that cut out, the code looks like the following:
Following _setCurrentRSConfig_inlock, there's again a lot of validation logic, plus some extra logic for dealing with protocol version changes. If we cut that out as well, the code looks like the following:
_updateLastCommittedOpTime_inlock is one of the functions that calls _wakeReadyWaiters_inlock, which removes waiters from _replicationWaiterList if they are done waiting for replication. Here's that code again:
The update operation's waiter is only signaled and removed if _doneWaitingForReplication_inlock returned true for that particular waiter. In the update operation's thread, execution would then proceed past cv.wait() and eventually reach another _doneWaitingForReplication_inlock call in _awaitReplication_inlock:
If the first _doneWaitingForReplication_inlock returned true and the second returned false, the update thread will hang on that condition variable's wait, and (because _wakeReadyWaiters_inlock removed the waiter from the list) nothing will ever notify it.
Both _doneWaitingForReplication_inlock calls above take the same arguments except for the second one: Timestamp() and minSnapshot, respectively. What could make it return true and then false?
Having another look at _doneWaitingForReplication_inlock, I spotted:
But back in _finishReplSetReconfig, snapshots are dropped!
This looked like a concurrency bug!
I confirmed it with a lot of extra Mongo log statements:
Note that conn17 is the update operation from the Python client, which in this case is operation 1551.
Let’s go through this from the beginning:
These log messages were added in _awaitReplication_inlock:
- the ThreadWaiter is added to _replicationWaiterList (add_inlock)
- at the end of that function (where mongo loops waiting on _doneWaitingForReplication_inlock), _doneWaitingForReplication_inlock returns false
- we print out which operation we're looping for ("looping for 1551, …"), along with the waiters that are in the list
Next:
- Replica set reconfiguration occurs.
- _wakeReadyWaiters_inlock is called, which calls signalAndRemoveIf_inlock.
- _doneWaitingForReplication_inlock returns true for our operation's waiter (notice it reached the end and called haveTaggedNodesReachedOpTime).
- That waiter is removed from _replicationWaiterList.
Next:
This comes again from _awaitReplication_inlock:
- The condition variable was signalled by signalAndRemoveIf_inlock.
- Execution continues to the top of the while loop, and _doneWaitingForReplication_inlock returns false.
- There are no more waiters in _replicationWaiterList.
- This thread will wait on the condition variable and never return.
The bug
Visualized (where the left side is the operation thread, and the right side is the thread processing the replSetReconfig command):
Below is the sequence that causes this bug:
- _doneWaitingForReplication_inlock returns false (operation thread)
- wait on the condition variable (operation thread)
- _doneWaitingForReplication_inlock returns true, so the waiter is signalled and removed from the list (replSetReconfig thread)
- set _currentCommittedSnapshot = boost::none (replSetReconfig thread)
- in _doneWaitingForReplication_inlock, check _currentCommittedSnapshot and return false because it's boost::none (operation thread)
- wait forever on the condition variable (operation thread)
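The whole interleaving can be reproduced in miniature with a toy stdlib-Python model (my own construction, not Mongo's code): the reconfig thread signals the waiter while the predicate is momentarily true, removes it from the list, then drops the snapshot, so the waiter's re-check fails with nobody left to wake it:

```python
import threading
import time

class Coordinator:
    """Toy model of the lost wakeup. snapshot stands in for
    _currentCommittedSnapshot, waiters for _replicationWaiterList."""

    def __init__(self):
        self.cond = threading.Condition()
        self.snapshot = None
        self.majority_acked = False
        self.waiters = []

    def done_waiting(self):
        # stand-in for _doneWaitingForReplication_inlock: with no
        # committed snapshot, there is nothing to compare against
        return self.snapshot is not None and self.majority_acked

    def await_replication(self, timeout):
        with self.cond:
            self.waiters.append(self.cond)
            while not self.done_waiting():
                if not self.cond.wait(timeout=timeout):
                    return "hung"  # timed out: nobody will ever wake us
            return "replicated"

    def reconfig(self):
        with self.cond:
            # replication has caught up, so the predicate is true...
            self.majority_acked = True
            self.snapshot = "opTime-42"
            # _wakeReadyWaiters_inlock: signal the waiter and drop it
            # from the list, since it appears to be done
            if self.done_waiting():
                self.cond.notify_all()
                self.waiters.clear()
            # ...but _finishReplSetReconfig then drops all snapshots,
            # so the waiter's re-check of done_waiting() returns False
            self.snapshot = None

results = []
coord = Coordinator()
t = threading.Thread(
    target=lambda: results.append(coord.await_replication(timeout=2)),
    daemon=True)
t.start()
time.sleep(0.2)  # let the update thread reach cond.wait()
coord.reconfig()
t.join()
print(results)  # ['hung']
```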
the fix
The problem exists because _currentCommittedSnapshot is zeroed out temporarily. My proposed fix is to add a wait until _currentCommittedSnapshot is non-null again:
I’ve tested this, and it successfully resumes after a replSetReconfig:
Note that there is still a pause during the reconfigure, but it eventually resumes.
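As an illustration of why this works, here's a self-contained toy Python model of the fixed behaviour (my own construction, not the actual patch): when the committed snapshot is missing, the waiter keeps waiting for it to be re-established, as _updateCommittedSnapshot_inlock eventually does, instead of concluding it isn't done:

```python
import threading
import time

class FixedCoordinator:
    """Toy model of the proposed fix: if the committed snapshot is
    empty, wait for it to come back rather than treating the
    operation as not-yet-replicated."""

    def __init__(self):
        self.cond = threading.Condition()
        self.snapshot = None          # _currentCommittedSnapshot
        self.majority_acked = False

    def await_replication(self, timeout):
        with self.cond:
            deadline = time.monotonic() + timeout
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return "hung"
                if self.snapshot is None:
                    # the fix: the snapshot was dropped by a reconfig;
                    # wait for a new one instead of giving up
                    self.cond.wait(timeout=remaining)
                    continue
                if self.majority_acked:
                    return "replicated"
                self.cond.wait(timeout=remaining)

    def reconfig(self):
        with self.cond:
            self.majority_acked = True
            self.cond.notify_all()
            self.snapshot = None      # snapshots dropped, as before

    def update_committed_snapshot(self, snap):
        # what _updateCommittedSnapshot_inlock does once the reconfig
        # settles: install a new snapshot and wake waiters
        with self.cond:
            self.snapshot = snap
            self.cond.notify_all()

results = []
coord = FixedCoordinator()
t = threading.Thread(
    target=lambda: results.append(coord.await_replication(timeout=5)),
    daemon=True)
t.start()
time.sleep(0.1)
coord.reconfig()                          # drops the snapshot mid-wait
time.sleep(0.1)
coord.update_committed_snapshot("opTime-43")
t.join()
print(results)  # ['replicated']
```

This mirrors the observed behaviour of the patched binary: a pause during the reconfigure, then the operation resumes once a committed snapshot exists again.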
Filing a ticket
I was also able to reproduce this bug against versions 4.0.0 and 4.0.1, but not 4.0.2. I looked at all the commits between 4.0.1 and 4.0.2, and this one looks like it would solve the problem by removing one of the necessary conditions: that the waiter was removed from the list by something other than the waiter guard. That's a much cleaner fix than my proposal.
In any case, I opened an issue on Mongo’s Jira, and the story continues there!
Originally published at https://engineering.vena.io on February 19, 2021.