What is a good bug?
Unless a bug was caused by radiation, it is usually a direct consequence of carelessness or an incomplete understanding of the problem.
Suddenly we started getting tons of MySQL-related errors like “Error Code: 2013. Lost connection to MySQL server during query” and the slightly more detailed yet still cryptic “Error 2013 / Lost connection to MySQL server at ‘reading initial communication packet’, system error: 4”. It looked like queries were being aborted for no particular reason…
Nevertheless, the present version of the mysql back end often works successfully even in a threaded environment, albeit with subtle problems.
After fixing this problem, the “reading initial communication packet” error disappeared and overall MySQL stability improved greatly. “Error Code: 2013. Lost connection to MySQL server during query”, however, persisted…
I was writing a RabbitMQ client in Common Lisp and some of my tests were failing randomly (again!). CL-bunny maintains one thread per connection, where it uses async/non-blocking IO to communicate with a RabbitMQ server. It uses IOLib as a relatively thin wrapper around the native Linux API, so programming is pretty straightforward. Suppose you want to receive a message: you subscribe to your queue and ask CL-bunny for messages in a loop. The API for receiving messages supports a :timeout argument and works nicely with AMQP. However, sometimes timeouts had no effect and the calling thread got blocked forever.
Here we spawn a separate thread to simulate the IO loop and simply wait until the mailbox receives something, or time out if it doesn’t. What actually happened is that “qwe” was never printed! So essentially sb-concurrency:receive-message ignored the :timeout argument.
Let’s look at the implementation:
It is supposed to return NIL if the timeout expires, yet somehow it never returned. The problem was that the internal semaphore implementation didn’t properly handle condition-wait spurious wake-ups: it basically forgot to decrement the timeout. I asked in the #lisp IRC channel and a fix was released soon. As it turned out, it wasn’t the only problem with timeout handling in SBCL; the function which calculates time differences also received some love. One issue remains untouched though: SBCL uses gettimeofday to calculate timeouts, not a monotonic time source!
Suppose you set timeout to 100 seconds:
100.007 seconds of real time
198.800000 seconds of total run time (190.708000 user, 8.092000 system)
Simple time adjustment via date -s can mess up your timeout:
206.013 seconds of real time
11.620000 seconds of total run time (11.164000 user, 0.456000 system)
SBCL on Linux uses futexes (http://cis-linux1.temple.edu/~giorgio/cis307/readings/futex.pdf) to implement condition variables:
sys_futex is called with four arguments, while this syscall in fact takes six.
Why is the futex constantly being woken up? Let’s run SBCL in a special debug mode:
This basically enables very verbose and low-level debugging output.
The code (where I’m using condition-wait directly):
Output will look like this:
[0x7fff75830000/0x7fff7562f700] /setting gc_blocked_deferrables
[0x7fff75830000/0x7fff7562f700] /entering interrupt_handle_pending
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:waiting on lock
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:got lock
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: thread=140735162808064, state=2
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: thread=140737353930496, state=2
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: suspending thread 140737353930496
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:signals sent
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: waiting for thread=140737353930496: state=2
[0x7fff75830000/0x7fff7562f700] /gc_start_the_world: resuming 140737353930496
[0x7fff75830000/0x7fff7562f700] /exiting interrupt_handle_pending
T is printed each time condition-wait returns; that’s because condition-wait was able to reacquire the mutex.
Why does GC kick in?
Let’s look closer at this line: (log:debug (eventfd.read control-fd)). control-fd is an instance of eventfd, eventfd.read decrements the eventfd counter, and log:debug is a macro exported by the log4cl logging library. A macro! It simply won’t evaluate its arguments if the logging level doesn’t match (I wonder why I wanted to print the eventfd value). So (eventfd.read control-fd) was never executed (as my log level was set to :error), the eventfd counter was never decremented, and consequently the eventfd was always readable.
Of course GC kicks in as this tight loop generates tons of garbage. And SBCL sends a signal to each thread asking to perform GC:
So potentially this eliminates all benefits of the TASK_INTERRUPTIBLE state: one memory-heavy thread can wake up all the others without any guarantee that they can actually free memory.
As it turned out, the libmysqlclient version we were using at the time didn’t bother handling EINTR at all (https://bugs.mysql.com/bug.php?id=65956). Apparently the struggle continues: https://bugs.mysql.com/bug.php?id=82019, http://bugs.mysql.com/bug.php?id=81258.
- macros don’t always evaluate their arguments;
- the CLSQL integration with MySQL I’m using isn’t thread-safe;
- my libmysqlclient version can’t handle EINTR;
- SBCL’s WAIT-ON-SEMAPHORE could block indefinitely despite :TIMEOUT being supplied;
- SBCL’s RELATIVE-DECODED-TIMES could return non-zero for times in the past;
- SBCL uses gettimeofday for calculating timeouts, which is wrong;
- SBCL calls the futex syscall with the wrong number of arguments;
- SBCL wakes up all threads for GC, so lots of “spurious” wake-ups can happen.