That one time without the mutex

Published in

Adventures in Software Development

3 min readNov 30, 2019

Every developer who starts with threading soon learns about the necessity of locking structures. The reason is pretty simple — if your program has two threads of execution, and there’s any chance at all one of them will alter the structure, then you need to lock between them.

The exact way you do this varies a bit by language, but the old school way of doing this in C and C++ is with a boring mutex, which you claim before you access the structure and then release after.

Without such locking, there’s a chance that one of threads can alter the structure whilst the other is part way through reading — if that’s pointer, or there’s assumptions about the atomicity of the values, well, bad news if you can’t control the sequencing. Even if it’s “incredibly rare”, or “one in a million” or whatever you want to tell yourself, it’s probably going to happen, and the consequences and likely crash are going to be unpredictable.

You really need that lock

As an aside, it is sometimes (very occasionally) possible to get away without full locking and instead make use of simple flags. Such cases have to be engineered very carefully, and are most suited to situations where you’re OK with sometimes having junk in a buffer (e.g, video transitions). Even so, it can be hard to guarantee things.

The mutex works, because it ensures that only thread can claim the mutex at once — this is done (at least in old school computer architecture terms) by disabling interrupts and therefore thread context switches during the claim.

Novice software developers starting with threads learn at least some version of the above, or least they understand that data can be shared by threads and that’s “bad”.

So, what happened?

(Some details obfuscated to (a) avoid strict identification and (b) because I forget some of the precise details and (c) because I feel like it)

I was running some recent code from a co-worker through a memory checker (valgrind) and spotted some corruption. Upon code inspection, it was obvious it was one these cases; and upon further check, it was obvious that a similar section of code using a different structure was also at risk. And so, I did what and good developer would do, and politely pointed these out. As some background, this was a mid-level developer known for over-complex solutions.

What was his response? Reluctant agreement with the first section of code, but firm denial of the second section. His reasoning? “QA didn’t report a problem with it”.

*facepalm*

Now, I can already hear many of you already lining up your objections, but let’s lay this out fully:

QA indeed hadn’t seen it — their focus had been on an older version of the code for a release candidate and would be for a while longer.
QA was unlikely to be running memory checkers — such things are typically outside their remit and understanding. Nevertheless, I did later on instruct them in its use.
What exactly would QA be reporting anyway? Some mysterious crash later on? Some unexplained transient behavior.

When I went on to explain that lock requirements are necessary, he wanted everyone to gather around a large screen and do a code review. It’s unclear what he hoped to achieve with this.

When my patience eventually ran down, the baton was taken up by a co-worker. He pointed out rightly that there might be some performance concerns here — indeed, locks can be “expensive” in some cases — but the developer didn’t take up this line of argument, and in practice, this wasn’t empirically supported.

The argument was eventually “won”, but probably not for the right reasons.

Conclusions

This is of course a classic case of the cost of fixing bugs. In the classic book, “The Mythical Man Month”, it’s pointed out that the time cost of fixing bugs at each stage of software lifecycle goes up exponentially:

Fixing when I had pointed it out (or the original developer doing the right thing) — minutes.
Cost of QA spotting “a problem” then go around with a developer to narrow down the problem code change — hours.
Cost of a beta tester running into problems — days.

And so it goes. Always check your locks.

That one time without the mutex

You really need that lock

So, what happened?

Conclusions

Written by Peter Naulls