A short tale of a read overflow
When a long running process crashes, it is pretty uncool. More so if the process happens to take a lot of state in memory. This is why I love web programming frameworks that are able, without major performance overhead, to create a new interpreter and a new state for each page view, and deallocate every resource used at the end of the page generation. It is an inherently more reliable programming paradigm, where memory leaks, descriptor leaks, and even random crashes from time to time do not constitute a serious issue. However system software like Redis is at the other side of the spectrum, a side populated by things that should never crash.
Months ago I received a crash report from my colleague Dvir Volk. He was developing his RediSearch Redis module, so it was not clear if the crash was due to a programming error inside the module, perhaps corrupting the heap, or a bug inside Redis. However it looked a lot like a real problem into the radix tree implementation:
=== REDIS BUG REPORT START: Cut & paste starting from here ===
# Redis 999.999.999 crashed by signal: 11
# Crashed running the instuction at: 0x7fceb6eb5af5
# Accessing address: 0x7fce9c400000
| redis-server *:7016 [cluster](raxRemoveChild+0xd3)[0x49af53]
| redis-server *:7016 [cluster](raxRemove+0x34f)[0x49b34f
| redis-server *:7016 [cluster](slotToKeyUpdateKey+0x1ad)[0x4415dd]
The radix tree is full of
memmove() calls, and Redis crashed exactly trying to access a memory address that was oddly zero padded at the end:
0x7fce9c400000. My first thought was, I’m sure I’m doing some wrong memory movement here, and the address gets overwritten with zeroes, leading to the crash when the program attempts the deference the address.
I’m pretty proud about my radix tree implementation. Not because of the implementation itself, while it’s a complex data structure to implement, it’s not rocket science. But because of the fuzz tester that comes with it and is able to cover the whole source code (trivial) and a lot of non trivial state (that’s definitely more interesting). The fuzz tester does not do fuzzing just to reach a crash, it compares the implementation of the radix tree dictionary and iterator with a reference implementation that uses an hash table and qsort, to have exactly the same semantics, but in a short and easy to audit implementation. After receiving the crash report I improved the fuzz tester, ran it for days, with and without Valgrind, wrote additional data models, created tests using 100 millions of keys, but despite the efforts I could not reproduce the crash. A few days ago I would discover that there was no bug in the implementation I was testing, but at the time I was not aware of that. I could simply never find a bug that was not there. So after failing to reproduce I gave up.
One week ago I received two other additional bug reports that were almost the same. And again, the addresses were zero padded.
Dvir crash: Accessing address: 0x7fce9c400000
Issue 4605: Accessing address: 0x7f2959e00000
Issue 4642: Accessing address: 0x7f0e9b800000
It was time to read every
realloccall inside the source code, trying to figure out if there was something wrong that for some reason the fuzz tester was not able to catch. I found nothing, but then inspecting the Redis crash report I noticed something funny. During crashes Redis reports the memory mapped areas of the process, things like the following:
*** Preparing to test memory region 7f0e8c400000 (255852544 bytes)
Now, if you add
0x7f0e8c400000, the result is
0x7f0e9b800000 , which is exactly the accessed address in the crash reported in issue 4642. So the program was not crashing because the memory address was corrupted, but because it was accessing an address immediately after the end of the heap. I checked the other issues, and the same was true in all the instances. Basically the end of the heap, at the edge of the start of unmapped addresses, was acting as a memory guard in order to detect and crash when an access was performed outside the bounds. This is a common technique that certain C memory sanitation tools used to employ in the past. Such tools would provide a drop in replacement for
malloc() that would return addresses allocated at the edge of a non accessible memory page. Every overflow would be immediately detected in this way.
Because the program would crash only in that case, when deallocating a radix tree node at the end of the heap, it was simple to realize that the problem was a read overflow. You can never detect a read overflow otherwise: it will just access data outside your structure, but inside mapped memory, so the bug would be totally harmless and silent, with the exception of doing the same operations at the end of the mapped region. Finally I had a clear place where to look, this portion of C code:
/* 3. Remove the edge and the pointer by memmoving the remaining children pointer and edge bytes one position before. */
int taillen = parent->size - (e - parent->data) - 1;
debugf("raxRemoveChild tail len: %d\n", taillen);
/* Since we have one data byte less, also child pointers start one byte before now. */
/* Move the remaining "tail" pointer at the right position
as well. */
size_t valuelen = (parent->iskey && !parent->isnull) ? sizeof(void*) : 0;
/* 4. Update size. */
I asked the user to send me the
redis-server binary that produced the crash report, and reading the disassembled code it was clear that many CPU registers, also included in the Redis crash report, would be still populated with the variables above! Note that the CPU registers RDI, RSI, RDX are used in order to pass the first three arguments to memmove. In one of the crashes we had:
parent = RBP = 7f2959dffff
Checking RDI, RSI, RDX we extract the memmove() arguments:
The memmove will go out of bound accessing up to 7f2959e00004.
So I had my proof. But there was more, checking other registers I could also reconstruct the node header, in order to understand how the memmove count argument was obtained. Something was definitely wrong there. Basically I was seeing that the disassembled executable did not match the C function I was reading. How was that possible? The read over the buffer should not happen, because the state was sane. Only the count was incorrectly computed by the crashing instance. It was night, and I had worked at this damn thing for two days no stop, so I decided to create a gist and post it on Twitter, to see if somebody could explain how such C would be turned into such assembler by the compiler.
I was lucky enough that my friend Fedor Indutny of Node.js fame was willing to help. He quickly realized that it was very clear why the C and the assembler could not match: what I was analyzing was not the right C code… but a newer version of the same function. Fedor had a GCC 5.4.0 at hand, the same compiler used by the user reporting the bug, so he tried to compile an older version of the code with it, realizing that now the two versions of the emitted code matched perfectly. He pinged me about it, asking if I was sure that this was a recent Redis version. I was absolutely sure, it was Redis 4.0.6. But then I started to have a few doubts, and I took a diff of
rax.c between the Redis unstable branch and Redis 4.0. What happened was that I fixed this bug about ten months ago, in the course of the Streams implementation. During the cherry picking activity, where bugfixes from unstable are backported to Redis 4.0, the fix was inside a commit about streams, so I was constantly skipping it. Everything was finally clear, I had debugged for days a bug that was not there in the version that I was testing.
If it was just a lame mistake from me, why I did bother to write this blog post? Because, I believe, there are lessons to be learned from all this.
The first lesson is that crash reports like the ones Redis is able to emit are a key asset in system software. They allow to reconstruct the state of bugs that you are not able to replicate, but happen very rarely on the wild. While the bug was already fixed, I was able to exactly understand what was happening just looking at the bug reports, register dumps, offending address, and call stack.
The second lesson is that you should learn AMD64 assembler today, at least to read comfortably code emitted by the compiler and follow what is happening, if you want to get involved in system programming. This is often the only way to understand what is happening during an heisenbug. Debuggers won’t help much. GDB was claiming that the crash was happening in the instruction
parent->size-- which is, of course impossible. But GDB is not to blame, modern compilers, with optimizations turned on, generate code that can hardly be matched back to the source code.
Another lesson is how powerful well done fuzz testing is. The fuzz tester was immediately able to find the bug in the broken version. Similarly the fact that no radix tree crash was never observed other than this bug fixed a long time ago, speaks for itself. The radix tree implementation is very complex, yet thanks to fuzz testing there are apparently no bugs in the implementation which is so new and so complex. I want to stress the importance of doing fuzzing not just to find crashes: that’s good for security people that want to discover zero days. Fuzzing for system software should perform random operations according to a sane operation model, and compare the outcome with a reference implementation.
And finally, there is a clear lesson for me: next time I need to be more carefully when working at a feature branch and making fixes that are not specific of the feature branch. I was working under the feeling that I would merge everything back into 4.0 ASAP. Then it was not the case, and putting the radix tree update into the same commit implementing Stream things was a fatal error.
Well, bonus point, having smart friends help :-) I was a bit lost at that point, and the kind help of Fyodor helped realize the last part of the puzzle and quickly move forward. Don’t be afraid to ask for help. If you get involved in system software, remember that it is very different than other fields in programming. It’s not just that you do some work and you can bring things forward. You must be prepared to spend days trying to understand why a bug happened, a bug that often you are not able to reproduce, because users deserve more than software crashing like that.