The Bug


What is a good bug?

If a bug wasn’t caused by radiation usually it’s a direct consequence of inadvertence or just not having a full understanding of the problem.

MySQL

Suddenly we started to get tons of MySQL-related errors like “Error Code: 2013. Lost connection to MySQL server during query” and the slightly more detailed yet still cryptic “Error 2013 / Lost connection to MySQL server at ‘reading initial communication packet’, system error: 4”. It looked like queries were aborted for no particular reason…

Thread-(un)safety was the first suspect. As it turned out CLSQL, the database library we were using didn’t have thread-safe MySQL integration (documented!, not a bug?):

Nevertheless, the present version of the mysql back end often works successfully even in a threaded environment, albeit with subtle problems.

After fixing this problem, the “reading initial communication packet” error disappeared and overall MySQL stability improved greatly. Error Code: 2013. Lost connection to MySQL server during query persisted…

More on thread-safe MySQL.

SBCL

I was writing a RabbitMQ client in Common Lisp and some of my tests were failing randomly (again!). CL-bunny maintains one thread per connection where it uses async/non-blocking IO to communicate with a RabbitMQ server. It uses IOLib as a relatively thin wrapper around the native Linux API so programming is pretty straightforward. Suppose you want to receive a message — you subscribe to your queue and ask CL-Bunny for messages in a loop. The API for receiving messages supports :timeout argument and works with AMQP nicely. However sometimes timeouts had no effect and the calling thread got blocked forever.

Simplified code:

Here we spawn a separate thread to simulate IO-loop and simply wait until the mailbox receives something or time out if it doesn’t. What actually happened is that “qwe” is never printed! So essentially sb-concurrency:receive-message ignored the :timeout argument!

Let’s look at the implementation:

wait-on-semaphore returns NIL if the timeout expires yet somehow it never returned. The problem was that the internal semaphore implementation wasn’t properly handling condition-wait spurious wake ups. It basically forgot to decrement timeout. I asked guys in the #lisp IRC channel and a fix was released soon. As it turned out it wasn’t the only problem with timeout handling in SBCL. The function which calculates time difference also received some love. One issue remains untouched though — SBCL uses gettimeofday to calculate timeouts, not a monotonic time source!

Suppose you set timeout to 100 seconds:

Evaluation took:
100.007 seconds of real time
198.800000 seconds of total run time (190.708000 user, 8.092000 system)

Simple time adjustment via date -s can mess up your timeout:

 Evaluation took:
206.013 seconds of real time
11.620000 seconds of total run time (11.164000 user, 0.456000 system)

Spurious Wakeup?

SBCL on linux uses futexes (http://cis-linux1.temple.edu/~giorgio/cis307/readings/futex.pdf) to implement conditional variables:

Strangely enough sys_futex is called with four arguments while this syscall in fact has 6 arguments.

Why is futex constantly being woken up? Let’s run sbcl in special debug mode:

SBCL_DYNDEBUG="fshow_signal" sbcl

This basically enables very verbose and low-level debugging output.

The code (where I’m using condition-wait directly):

Output will look like this:

[0x7fff75830000/0x7fff7562f700] /setting gc_blocked_deferrables
[0x7fff75830000/0x7fff7562f700] /entering interrupt_handle_pending
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:waiting on lock
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:got lock
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: thread=140735162808064, state=2
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: thread=140737353930496, state=2
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: suspending thread 140737353930496
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:signals sent
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world: waiting for thread=140737353930496: state=2
[0x7ffff5e90000/0x7ffff7fcd700] /sig_stop_for_gc_handler
[0x7ffff5e90000/0x7ffff7fcd700] suspended
[0x7fff75830000/0x7fff7562f700] /gc_stop_the_world:end
[0x7fff75830000/0x7fff7562f700] /gc_start_the_world:begin
[0x7fff75830000/0x7fff7562f700] /gc_start_the_world: resuming 140737353930496
[0x7fff75830000/0x7fff7562f700] /gc_start_the_world:end
[0x7ffff5e90000/0x7ffff7fcd700] resumed
[0x7fff75830000/0x7fff7562f700] /exiting interrupt_handle_pending
T

Note T printed each time condition-wait returns. That is because condition-wait was able to reacquire the mutex.

Why GC kicks in?

Let’s look closer at this line: (log:debug (eventfd.read control-fd)). control-fd is an instance of eventfd, evenfd.read decrements eventfd counter and log:debug is a macro exported by log4cl logging library. A macro! It simply won’t evaluate its arguments if logging level doesn’t match (I wonder why I wanted to print eventfd value). So (eventfd.read control-fd) never executed (as my log level was set to :error), eventfd counter never decremented and consequently eventfd was always readable.

Of course GC kicks in as this tight loop generates tons of garbage. And SBCL sends a signal to each thread asking to perform GC:

So potentially this eliminates all benefits of theTASK_INTERRUPTABLE state. One memory-heavy thread can wake up all others without any guarantee they can actually free memory.

MySQL again

As it turned out thelibmysqlclient version we were using at that time didn’t bother with handling EINTR at all (https://bugs.mysql.com/bug.php?id=65956). Apparently the straggle continues: https://bugs.mysql.com/bug.php?id=82019, http://bugs.mysql.com/bug.php?id=81258.

Lessons:

  • macros don’t always evaluate their arguments;
  • CLSQL integration with MySQL I’m using isn’t thread-safe;
  • my libmysqlclient version can’t handle EINTR;
  • SBCL’s WAIT-ON-SEMAPHORE could block indefinitely despite :TIMEOUT being supplied;
  • SBCL’s RELATIVE-DECODED-TIMES could return non-zero for times in the past;
  • SBCL uses gettimeofday for calculating timeouts which is wrong;
  • SBCL calls futex syscall with wrong arguments count;
  • SBCL wakes up all threads for GC, so lots of “spurious” wake ups can happen.

Acknowledgments

I’d like to thank Jan Moringen for fast response and the fixes and Eric Bailey for feedback and proof-reading (all gramma errors are his own :-).