Digging into a MongoDB 3.6.20 concurrency bug
What to do when Mongo won’t return your calls
Some time ago, one of our internal users reported a frustrating problem in our ETL job system. One of their jobs got stuck over the weekend, and all attempts to cancel it had no effect. It was the third weekend in a row that this had happened.
We use MongoDB to store the metadata for these jobs, as its document structure makes it a good fit for describing heterogeneous jobs with varying numbers (and types) of steps. The Java server running the job was still online and responding to requests, but a jstack dump showed that the job thread was hung waiting on a response from MongoDB. All it was trying to do was update a single field in the Mongo document, but it was waiting forever for a response. The MongoDB primary was also aware of the same update: the connection showed up in db.currentOp() and was marked "active". Terminating the query from the Mongo shell instantly caused the Java thread to throw an exception, so it wasn't a communication problem. Both sides of the connection were alive, yet the update wasn't making any progress!
Not good.
Albert investigated and found that our weekly secondary rotation jobs were scheduled around the same time that the group of UPDATE commands fired. Comparing timestamps in the rotation job output strongly suggested that UPDATE commands issued during a replSetReconfig would hang and never return.
I started trying to reproduce the issue locally by mimicking the problematic UPDATEs and was surprised to find that my very simple Python script experienced the same problem:
I’ve pushed up scripts that I used to reproduce this issue here.
The fact that I could reproduce the issue with a simple Python script meant it wasn't something in the code we'd written on top of the Java Mongo client. The reproduction test case was so straightforward that I turned my sights on MongoDB itself.
background
This post assumes that you know a little about MongoDB: what primaries and secondaries are, what replication is and how it works, what an operation is, and so on.
MongoDB is written in C++ so it’s useful to have a little knowledge there.
MongoDB takes a long time to build
My Vena-issued T480s has a 4-core i5-8350U (with hyper-threading off), 24GB of DDR4 RAM, and a modern NVMe drive, and it still takes gollum 33 minutes to build the mongod and mongo CLI binaries.
scons: done building targets.
real 33m42.499s
user 119m10.756s
sys 7m56.016s
Also, did you know that an optimized mongod binary (when statically linked) is around 1.3G? I didn’t either, and now you know.
jwm@gollum ~/opt/mongo [39c200878284912f19553901a6fea4b31531a899] [] $ ls -lh mongod
-rwxr-xr-x 1 jwm jwm 1.3G Oct 21 17:23 mongod
Note: that SHA is for 3.6.20, our production cluster version.
hello gdb, my old friend
After building the debug binaries, I created a docker container using the resulting mongod and mongo binaries, and confirmed that the issue was still reproducible.
I reproduced the issue and used db.currentOp() to find the connection id ("conn15" in the example below):
...
{
    "host" : "ecb589a5d43b:27017",
    "desc" : "conn15",
    "connectionId" : 15,
    "client" : "172.17.0.1:36012",
    "clientMetadata" : {
        "driver" : {
            "name" : "PyMongo",
            "version" : "3.11.0"
        },
        ...
    },
    "active" : true,
    ...
Then I attached gdb to the primary mongod process and found the corresponding thread:
The conn15 thread was stuck waiting on a condition variable at src/mongo/db/operation_context.cpp, line 313, which is frame 4 in the above backtrace:
Note this op doesn't have a write concern timeout, so the deadline is Date_t::max(). I interpreted this as "if our operation has no deadline, wait forever for the condition variable".
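Python's threading.Condition behaves the same way, which makes the "no deadline" case easy to illustrate (a toy example of the waiting semantics, not Mongo code):

```python
import threading

cond = threading.Condition()
with cond:
    # With a finite deadline, wait() returns False when it times out
    # without being notified:
    signalled = cond.wait(timeout=0.1)
print(signalled)  # False

# With no deadline -- the analogue of Date_t::max() -- wait() blocks
# until someone calls notify(), potentially forever:
#     with cond:
#         cond.wait()   # no timeout: waits indefinitely
```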
Frame 5 is the lambda execution, and frame 6 is the calling context: src/mongo/db/repl/replication_coordinator_impl.cpp:1689 is inside a loop in a function called _awaitReplication_inlock:
The condition variable created here (the one we're waiting on in frame 4) is passed both to ThreadWaiter waiter and to opCtx->waitForConditionOrInterruptNoAssertUntil(...). Let's look at ThreadWaiter:
The ThreadWaiter is passed to a WaiterGuard on line 1682, so let's also look at WaiterGuard:
_doneWaitingForReplication_inlock does what you'd imagine: for an op with write concern majority, it waits for a majority of mongo instances to reach the op's opTime:
Based on those snippets, here was the mental model I had so far:
- there's a list of waiters in _replicationWaiterList
- these waiters refer to condition variables
- if _doneWaitingForReplication_inlock returns false, the thread waits on the condition variable
- once the condition variable is notified, the thread wakes up and checks again whether replication is done
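That loop is the classic condition-variable pattern: wait, wake, re-check the predicate. Here's a toy stdlib-Python sketch of the same shape (all names are mine; done_waiting_for_replication stands in for _doneWaitingForReplication_inlock):

```python
import threading

class Replication:
    """Toy stand-in for the replication coordinator's wait loop."""

    def __init__(self):
        self.cond = threading.Condition()
        self.acked = 0    # how many nodes have acknowledged the op
        self.needed = 2   # write concern: majority of a 3-node set

    def done_waiting_for_replication(self):
        # stand-in for _doneWaitingForReplication_inlock
        return self.acked >= self.needed

    def await_replication(self):
        with self.cond:
            # re-check the predicate after every wakeup, like the
            # loop in _awaitReplication_inlock
            while not self.done_waiting_for_replication():
                self.cond.wait()

    def node_acked(self):
        with self.cond:
            self.acked += 1
            self.cond.notify_all()  # wake waiters so they re-check

repl = Replication()
t = threading.Thread(target=repl.await_replication, daemon=True)
t.start()
repl.node_acked()
repl.node_acked()
t.join(timeout=5)
print(t.is_alive())  # False: the waiter saw the predicate become true
```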
I assumed those condition variables would be notified as opTimes on other mongo instances changed, but it looks like they are notified when the instance they're running on has committed a new opTime or something called a committed snapshot. Both _updateLastCommittedOpTime_inlock and _updateCommittedSnapshot_inlock call _wakeReadyWaiters_inlock:
Note that in _wakeReadyWaiters_inlock, the ThreadWaiter is removed from _replicationWaiterList if replication is done. This makes sense: that thread won't ever need to wait again.
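The signal-and-remove behaviour can be sketched with a toy Python model (the class and method names are mine, loosely mirroring signalAndRemoveIf_inlock):

```python
import threading

class Waiter:
    def __init__(self, op_time, cond):
        self.op_time = op_time
        self.cond = cond

class WaiterList:
    """Toy model of the waiter list's signal-and-remove behaviour."""

    def __init__(self):
        self._list = []

    def add_inlock(self, waiter):
        self._list.append(waiter)

    def signal_and_remove_if(self, predicate):
        remaining = []
        for w in self._list:
            if predicate(w):
                with w.cond:
                    w.cond.notify_all()  # wake the waiting thread
            else:
                remaining.append(w)      # keep waiters that aren't done
        self._list = remaining

# A committed opTime of 7 releases the waiters at or below it:
waiters = WaiterList()
for t in (5, 7, 9):
    waiters.add_inlock(Waiter(t, threading.Condition()))

last_committed = 7
waiters.signal_and_remove_if(lambda w: w.op_time <= last_committed)
print([w.op_time for w in waiters._list])  # [9]
```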
I wanted to print out the ThreadWaiter object information, but:
(gdb) frame 6
(gdb) print _replicationWaiterList._list.size()
$13 = 0
No more waiters! We've reached a state where there are no waiters in _replicationWaiterList, but we're still waiting on the condition variable to be notified. That means:
- after the thread woke up, _doneWaitingForReplication_inlock reported false (because we entered another iteration of the loop)
- nothing else will notify this thread's condition variable, because the reference to it is gone
This explains what we’re seeing: operations that never return.
So what removes waiters from this list?
- ~WaiterGuard()
- ReplicationCoordinatorImpl::_wakeReadyWaiters_inlock
- ReplicationCoordinatorImpl::shutdown
- ReplicationCoordinatorImpl::_updateMemberStateFromTopologyCoordinator_inlock
The last two seemed related to primary stepdowns, and that’s not something we see in this case. This bug is triggered when secondaries are rotated out — nothing affects the primary.
One hypothesis I had was that when a secondary goes away, corresponding waiters are removed from this list, but I scrapped it because replication shouldn’t rely on a designated secondary — any secondary should be able to report that its opTime is up to date and satisfy replication criteria.
Our procedure for rotating secondaries is straight from Mongo's documentation (https://docs.mongodb.com/manual/tutorial/remove-replica-set-member/), where we:
- Provision a new secondary.
- Run rs.add(new secondary) on the primary.
- Shut down mongod on the instance to be removed.
- Run rs.remove(old secondary) on the primary.
rs.remove() is an alias that calls the replSetReconfig admin command with a new configuration that no longer contains the instance to be removed.
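To make that concrete, here's a hypothetical Python helper that performs the config surgery rs.remove() does; the function name and structure are my own, and the resulting document is what you'd pass to the replSetReconfig admin command:

```python
def config_without(config, host):
    """Return a copy of a replica set config with `host` removed and
    the config version bumped, roughly what rs.remove() does."""
    members = [m for m in config["members"] if m["host"] != host]
    if len(members) == len(config["members"]):
        raise ValueError("host %r is not in the config" % host)
    new_config = dict(config, members=members)
    new_config["version"] = config["version"] + 1
    return new_config

config = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "mongo-a:27017"},
        {"_id": 1, "host": "mongo-b:27017"},
        {"_id": 2, "host": "mongo-c:27017"},
    ],
}
new = config_without(config, "mongo-c:27017")
print(new["version"], [m["host"] for m in new["members"]])
# 4 ['mongo-a:27017', 'mongo-b:27017']
```

With pymongo you'd then send the new document to the primary with something along the lines of client.admin.command("replSetReconfig", new) (again, a sketch; in practice rs.remove() does all of this for you).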
Issuing that admin command creates another thread that runs ReplicationCoordinatorImpl::processReplSetReconfig. There's a lot of machinery in there for validating the configuration update (among other things), and if all those checks pass, ReplicationCoordinatorImpl::_finishReplSetReconfig is called.
it’s a concurrency bug!
Looking at _finishReplSetReconfig, there's a lot of code in there to deal with a potential leader election (any replSetReconfig can trigger a primary stepdown and subsequent leader election). With that cut out, the code looks like the following:
Following _setCurrentRSConfig_inlock, there's again a lot of validation logic, plus some extra logic for dealing with protocol version changes. If we cut that out as well, the code looks like the following:
_updateLastCommittedOpTime_inlock is one of the functions that calls _wakeReadyWaiters_inlock, which removes waiters from _replicationWaiterList if they are done waiting for replication. Here's that code again:
The update operation's waiter is only signaled and removed if _doneWaitingForReplication_inlock returned true for that particular waiter. In the update operation's thread, execution would then proceed past cv.wait() and eventually reach another _doneWaitingForReplication_inlock call in _awaitReplication_inlock:
If the first _doneWaitingForReplication_inlock returned true and the second returned false, the update thread will hang on that condition variable's wait, and (because _wakeReadyWaiters_inlock removed the waiter from the list) nothing will ever notify it.
Both _doneWaitingForReplication_inlock calls above take the same arguments except for the second one: Timestamp() and minSnapshot, respectively. What could make it return true and then false?
Having another look at _doneWaitingForReplication_inlock, I spotted:
But back in _finishReplSetReconfig, snapshots are dropped!
This looked like a concurrency bug!
I confirmed it with a lot of extra Mongo log statements:
Note that conn17 is the update operation from the Python client, which in this case is operation 1551.
Let’s go through this from the beginning:
These log messages were added in _awaitReplication_inlock:
- the ThreadWaiter is added to _replicationWaiterList (add_inlock)
- at the end of that function (where mongo loops waiting on _doneWaitingForReplication_inlock), _doneWaitingForReplication_inlock returns false
- we print out which operation we're looping for ("looping for 1551, …"), along with the waiters that are in the list
Next:
- Replica set reconfiguration occurs.
- _wakeReadyWaiters_inlock is called, which calls signalAndRemoveIf_inlock.
- _doneWaitingForReplication_inlock returns true for our operation's waiter (notice it reached the end and called haveTaggedNodesReachedOpTime).
- That waiter is removed from _replicationWaiterList.
Next:
This comes again from _awaitReplication_inlock:
- The condition variable was signalled by signalAndRemoveIf_inlock.
- Execution continues to the top of the while loop, and _doneWaitingForReplication_inlock returns false.
- There are no more waiters in _replicationWaiterList.
- This thread will wait on the condition variable and never return.
The bug
Visualized (where the left side is the operation thread, and the right side is the thread processing the replSetReconfig command):
Below is the sequence that causes this bug:
- _doneWaitingForReplication_inlock returns false (operation thread)
- wait on the condition variable (operation thread)
- _doneWaitingForReplication_inlock returns true, so the waiter is signalled and removed from the list (replSetReconfig thread)
- set _currentCommittedSnapshot = boost::none (replSetReconfig thread)
- in _doneWaitingForReplication_inlock, check _currentCommittedSnapshot and return false because it's boost::none (operation thread)
- wait forever on the condition variable (operation thread)
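The whole interleaving can be reproduced in miniature with a toy stdlib-Python model (my own construction, not Mongo's code): the reconfig thread signals the waiter while the predicate is momentarily true, removes it from the list, then drops the snapshot, so the waiter's re-check fails with nobody left to wake it:

```python
import threading
import time

class Coordinator:
    """Toy model of the lost wakeup. snapshot stands in for
    _currentCommittedSnapshot, waiters for _replicationWaiterList."""

    def __init__(self):
        self.cond = threading.Condition()
        self.snapshot = None
        self.majority_acked = False
        self.waiters = []

    def done_waiting(self):
        # stand-in for _doneWaitingForReplication_inlock: with no
        # committed snapshot, there is nothing to compare against
        return self.snapshot is not None and self.majority_acked

    def await_replication(self, timeout):
        with self.cond:
            self.waiters.append(self.cond)
            while not self.done_waiting():
                if not self.cond.wait(timeout=timeout):
                    return "hung"  # timed out: nobody will ever wake us
            return "replicated"

    def reconfig(self):
        with self.cond:
            # replication has caught up, so the predicate is true...
            self.majority_acked = True
            self.snapshot = "opTime-42"
            # _wakeReadyWaiters_inlock: signal the waiter and drop it
            # from the list, since it appears to be done
            if self.done_waiting():
                self.cond.notify_all()
                self.waiters.clear()
            # ...but _finishReplSetReconfig then drops all snapshots,
            # so the waiter's re-check of done_waiting() returns False
            self.snapshot = None

results = []
coord = Coordinator()
t = threading.Thread(
    target=lambda: results.append(coord.await_replication(timeout=2)),
    daemon=True)
t.start()
time.sleep(0.2)  # let the update thread reach cond.wait()
coord.reconfig()
t.join()
print(results)  # ['hung']
```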
the fix
The problem exists because _currentCommittedSnapshot is zeroed out temporarily. My proposed fix is to add a wait until _currentCommittedSnapshot is non-null again:
I’ve tested this, and it successfully resumes after a replSetReconfig:
Note that there is still a pause during the reconfigure, but it eventually resumes.
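As an illustration of why this works, here's a self-contained toy Python model of the fixed behaviour (my own construction, not the actual patch): when the committed snapshot is missing, the waiter keeps waiting for it to be re-established, as _updateCommittedSnapshot_inlock eventually does, instead of concluding it isn't done:

```python
import threading
import time

class FixedCoordinator:
    """Toy model of the proposed fix: if the committed snapshot is
    empty, wait for it to come back rather than treating the
    operation as not-yet-replicated."""

    def __init__(self):
        self.cond = threading.Condition()
        self.snapshot = None          # _currentCommittedSnapshot
        self.majority_acked = False

    def await_replication(self, timeout):
        with self.cond:
            deadline = time.monotonic() + timeout
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return "hung"
                if self.snapshot is None:
                    # the fix: the snapshot was dropped by a reconfig;
                    # wait for a new one instead of giving up
                    self.cond.wait(timeout=remaining)
                    continue
                if self.majority_acked:
                    return "replicated"
                self.cond.wait(timeout=remaining)

    def reconfig(self):
        with self.cond:
            self.majority_acked = True
            self.cond.notify_all()
            self.snapshot = None      # snapshots dropped, as before

    def update_committed_snapshot(self, snap):
        # what _updateCommittedSnapshot_inlock does once the reconfig
        # settles: install a new snapshot and wake waiters
        with self.cond:
            self.snapshot = snap
            self.cond.notify_all()

results = []
coord = FixedCoordinator()
t = threading.Thread(
    target=lambda: results.append(coord.await_replication(timeout=5)),
    daemon=True)
t.start()
time.sleep(0.1)
coord.reconfig()                          # drops the snapshot mid-wait
time.sleep(0.1)
coord.update_committed_snapshot("opTime-43")
t.join()
print(results)  # ['replicated']
```

This mirrors the observed behaviour of the patched binary: a pause during the reconfigure, then the operation resumes once a committed snapshot exists again.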
Filing a ticket
I was also able to reproduce this bug against versions 4.0.0 and 4.0.1, but not 4.0.2. I looked at all the commits between 4.0.1 and 4.0.2, and this one looks like it would solve the problem by removing one of the necessary conditions: that the waiter was removed from the list by something other than the waiter guard. That's a much cleaner fix than my proposal.
In any case, I opened an issue on Mongo’s Jira, and the story continues there!
Originally published at https://engineering.vena.io on February 19, 2021.