Postmortems at Airbnb

By Ben Hughes

At Airbnb, postmortems are an important part of our culture. We have a custom tool for tracking incidents along with what went wrong and what went right. This gives us a huge base of knowledge to draw upon whenever we observe issues. It’s also a great way to keep the team on the same page about the sort of problems that we are currently facing.

Writing tons of postmortems could quickly become a dismal affair, especially for those of us on the reliability teams. We practice blameless postmortems where the primary goal is learning. We also have a lot of fun with our postmortems. This makes them more enjoyable to read and less onerous to write. We’ve had everything from Sam Spade detective fiction to Karl Marx as themes for write ups. It’s not uncommon to hear people unrelated to an incident asking each other whether they’ve read the latest postmortem.

Recently, we had a mysterious bug in our integration tests that prevented us from getting more than a few green builds per day. I spent about three days trying to narrow down the bug and figure out the fix. It was a pretty intense debugging session, and I ended up writing a postmortem themed after one of my favorite movies. We’d like to share that report with you today.

The Threads! The Threads!

This was our CI situation:

Failed Builds

We weren’t able to get tests to pass, which was annoying and slowed down our deploys.

This postmortem is a long one, but the short of it is that our version of db-charmer does not get along with threads.

There is no way to tell his story without telling my own. And if his story really is a confession, then so is mine.

This week our integration tests were frequently red, causing extreme delays in our deploys. Morale around here was getting low, and all of us were becoming increasingly frantic.

I was on R&R off on some humid beach miles from civilization when I started to hear murmurings about the issue. The reports from the front were frenzied, inconsistent, and most of all unexplained. For reasons unknown Col. Jenkins was sending back ramblings of the forms:

undefined method `db_charmer_connection_name' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x000000011c21f8>

Mysql2::Error: This connection is in use by: #<Thread:0x000000179e7d18 sleep>: SELECT `users`.* FROM `users` WHERE `users`.`id` = 143874 LIMIT 1 /*application:monorail,controller:route,action:glob*/

You have a nil object when you didn't expect it!
You might have expected an instance of Array.
The error occurred while evaluating nil.last (NoMethodError)
/mnt/jenkins/workspace/monorail_integration_tests/bundle/gems/activerecord-3.0.20.7/lib/active_record/connection_adapters/abstract/database_statements.rb:190:in `add_transaction_record'

Everyone has a breaking point. You and I have one. Col. Jenkins has reached his. And very obviously, he has gone insane.

The second I reported in for duty at HQ, General Davide pulled me aside and gave me my mission — to terminate these errors with extreme prejudice.

Your mission is to proceed up the Nung River in a Navy patrol boat. Pick up Colonel Jenkins’s path at Nu Mung Ba, follow it, and learn what you can along the way.

My descent into madness can be retraced via the branch: break-jenkins-to-remake-jenkins [Ed. Note: This branch contains a set of commits with increasingly nonsensical commit messages that were preserved to show my debugging technique and level of confusion].

Starting out, there was little to go on other than that it was thread related and only appeared in integration tests. It was assumed that all three errors were somehow related. Integration tests run threads for a number of purposes, the two main ones are an RSpec thread that executes the specs and a Webrick thread that exposes a webkit server. The Webrick thread also spins up a new thread for each request. Other possible threads would be the work queues for Airbrake/Squash (which may or may not exist in integration tests), sharded connection threads, or various utility threads for communication with WebKit. We were fighting a war and we didn’t even know who the beligerents were.

Progress up-river was further complicated by the inconsistency of reproduction of the behavior. About 2/3rds of the tests I queued up would fail, but only about 1/3rd of the failures were the Thread error that I figured would lead to the culprits.

Logging identified that the two competing threads were a Webrick request thread, and the RSpec thread. Interestingly, the specific clash usually happened during the setup phase of a test run, indicating that requests triggered from the previous test were still in flight while the test suite had moved on. From here, additional tests confirmed that the violation was related to a thread querying on a connection was not registered to it in the connection pool. This was complicated during observation because it was possible for the connection pool state to have changed while we were trying to read out the values, though in the end this lead proved to be correct.

One tricky aspect to this was that the site of the error was far removed from the place where the conditions could be set up for the error to occur.

In this war, things get confused out there, power, ideals, the old morality, practical military necessity, the owner threads of connections.

Investigation proceeded into whether there was some unthreadsafe corner case in ActiveRecord::ConnectionPool. This seemed initially likely as several more methods were synchronized in rails 3.2 than were synchronized in rails 3.0 and 3.1. Monkey patches were deployed and then the tests continued to fail.

DbCharmer was an early contender, as it does mess with connections, but the connection messing happens like so via alias_method_chain:

def connection_with_magic
db_charmer_remapped_connection || db_charmer_connection_proxy || connection_without_magic
end

db_charmer_remapped_connection is uninteresting (though it does play into one other issue)

db_charmer_connection_proxy is a wrapper around a map from class_name to a DbCharmer::ConnectionProxy (or nil if it is just the default ActiveRecord::Base connection, in our case airmaster)

connection_without_magic is just the regular (thread safe) ActiveRecord::Base#connection, this is called if db_charmer is nil

The ConnectionProxy should be thread safe in the specific case of two threads sharing the same connection as the implementation proxies to the underlying AR::Base#retrieve_connection

class ConnectionProxy < ActiveSupport::BasicObject
def method_missing(meth, *args, &block)
@abstract_connection_class.retrieve_connection.send(meth, *args, &block)
end
end

One interesting thing is the use of BasicObject, if one were to call .class on the ConnectionProxy, the return would be the class of the object being proxied to, in this case Mysql2Adapter. This shouldn’t be an issue as the only thing identifying as a Mysql2Adapter stored in db_charmer_connection_proxy should actually be a ConnectionProxy.

Running out of time and succumbing to the fever of the jungle of the code, I started poking at this assumption. The first step was monkeypatching in a method called holy_actual_class on both Object and DbCharmer::ConnectionProxy that would tell the actual class. I then hacked in logging to determine whether there was any place that was setting the db_charmer_connection_proxy to a fully realized connection object (which would be owned by a thread).

It was a lie. And the more I saw them, the more I hated lies.

This hack showed me that there was a place in DbCharmer that would set db_charmer_connection_proxy to a connection instead of a connection proxy. The circumstances for this occur through this call stack.

The first step is an alias method chain of the method in ActiveRecord::Base that constructs an ActiveRecord::Relation. This is the method used to create the relations with syntax like User.where(:foo => "bar"). Here db-charmer stores either a realized connection (not a connection proxy) or a connection proxy on the relation. A connection proxy is stored if the connection has already been switched out (for instance inside of a model’s on_slave block or if the model is not on the default database). This happens because self.connection would return db_charmer_connection_proxy. A regular connection is stored if the model is on the default master database, because self.connectionwould return connection_without_magic

def relation_with_db_charmer(*args, &block)
relation_without_db_charmer(*args, &block).tap do |rel|
rel.db_charmer_connection = self.connection
...
end
end

This should be fine, as long as relations are not shared between threads.

This method (and a couple others) hint at the problem. This method is on the ActiveRecord::Relation.

# Make sure we get the right connection here [ed. note: lol]
def connection
@klass.on_db(db_charmer_connection).connection
end

Here we call ActiveRecord::Base’s on_db method with db_charmer_connection, which is an actual thread specific connection. The on_db method does some book keeping (which causes one of the other issues) and then calls into switch_connection_to where the thread violation occurs.

def switch_connection_to(conn, should_exist = true)
new_conn = coerce_to_connection_proxy(conn, should_exist)
...
self.db_charmer_connection_proxy = new_conn self.hijack_connection!
end

switch_connection_to is called with the relation’s db_charmer_connection, which sets the (thread global) db_charmer_connection_proxy to the thread specific connection.

db_charmer_connection_proxy is specific to the ActiveRecord model class (to allow UserFraudInfo to be mapped to different dbs than User, for instance) but it is thread global. During the time that another thread is using a relation of any model to query, any queries on that model will get the other thread’s connection instead of their own, causing the clash.

This explains the “this connection is in use by: Thread” error. The other two have similar underlying causes:

The undefined_method db_charmer_connection_name comes because db_charmer_remapped_connection (the first option inside connection_with_magic) tries to call that method on a connection (stored in db_charmer_connection_proxy) instead of a connection proxy. The specific reason that this is triggered comes from the global nature of the db_charmer_connection_levelcounter, which allows threads to interfere with what constitutes a top level connection. This combined with the setting of db_charmer_connection_proxy to a connection instead of a proxy triggers this bug.

The third error occurs when you start a transaction on one connection and then that connection is switched out for another one (that knows nothing of the transaction) by another thread during the course of the transaction. This could happen because of the common cause, but could also occur if a different thread switched to make a connection on slave.

I watched a snail crawl along the edge of a straight razor. That’s my dream. That’s my nightmare. Crawling, slithering, along the edge of a straight razor . . . and surviving.

Armed with this knowledge, I was able to quickly patch DbCharmer in test mode. The solution was to wrap the execution of on_db and transaction in a Monitor. This guarantees mutual exclusion and also allows these methods to be called recursively without deadlock. The patch is only applied in test mode, as it’s kind of dirty and shouldn’t be a problem in production.

The longer term fix is to make our version of DbCharmer threadsafe as well as further audit/test its behavior.

While debugging this, I also discovered that the thread error occurs excruciatingly rarely in production. This is interesting in its own right as we should not be making db calls from multiple threads. This suggests that a relation is being realized from the squash or airbrake threads, which could be a clue for something that I have suspected for a long time, but have been unable to track down.


Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData


Originally published at nerds.airbnb.com on October 9, 2013.