MySQL, Rails and Metadata Locks: a Bug Hunt Bedtime Story

C’mon, c’mon! Brush your teeth, grab your blanky and crawl in. Tonight, we’re going to tell a story of hunting giant bugs. In production! With one hand tied behind our backs. And I promise you, it’s all true…

TL;DR

A long-running report job was taking down our whole site. But not in any of the naïve ways one might expect. We learned:

  • Long-running SELECT statements block ALTER TABLE statements due to MySQL metadata locks, and all the SELECTs queued up behind the ALTER TABLE get blocked too
  • Rails ActiveRecord doesn’t close connections from the connection pool when you think it does
  • acts-as-taggable-on improperly advises users to force binary collation on its tables from an initializer, which runs for every resque worker spawned
  • MySQL 5.6 doesn’t clearly report metadata locks

The Story

As with all great bug hunting stories, this one is distinctly non-linear. But for fear of you dozing off before we get to the punchline, let me spin the yarn in a more streamlined fashion.

It starts with a pretty typical Ruby on Rails setup:

  • Rails 4.2 on EC2 in an auto-scaling group
  • thin instances handling web traffic
  • MySQL 5.6 database hosted on RDS
  • Resque and cron-like sidekick resque-scheduler handling background jobs

Symptoms

Like many sites, we have an administrative backend, and one of its features is generating reports. Some of these can take as much as 15 minutes to generate depending on the parameters, so of course we run them in a background resque worker.

def self.get_records(report_type: nil, **options)
  query = build_query(**options)

  # Run the (potentially multi-minute) report query
  results = ActiveRecord::Base.connection.exec_query(query.to_sql)

  {
    columns: build_record({}).keys,
    rows: results.collect { |row| build_record(row).values }
  }
end

The query takes about 2 minutes, and iterating through the build_record calls takes about 10 minutes, depending on the size of the report (generally on the order of 100K rows).
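
For context, the resque job that wraps this looks roughly like the following. This is an illustrative sketch, not our actual code: the class name, queue name, and delivery step are all made up.

class ReportJob
  # Hypothetical resque job wrapping the report generation shown above
  @queue = :reports

  def self.perform(report_type, options = {})
    # Report.get_records is the method shown above; the class name is assumed
    data = Report.get_records(report_type: report_type, **options.symbolize_keys)

    # Hand the result off to whatever renders/emails the report (hypothetical)
    ReportDelivery.send_report(data)
  end
end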

We were seeing a problem, however, where these long running report jobs would sometimes have the effect of taking down the whole site. Ruh-roh…

The site wouldn’t just get slow, it would stop responding entirely. At which point, the EC2 ELB and auto-scaling group would turn into puppies in a pillow factory and start wreaking havoc in a futile effort to “help”.

Now we had upset customers who were getting 503s, and upset internal stakeholders who were being told not to push the report button.

Aside: Debugging in Production

As it turns out, our report jobs didn’t take down environments other than production. Of course. Because as long as the bug is confusing, let’s make it appear only in production, #amiright?

Now to be fair, we don’t currently have a production-like environment with the same full set of data and simulated live user traffic. To the two readers who do have that and thus think you are better than the rest of us: yeah, yeah…buzz off.

So unfortunately, this meant that we had to do all our debugging at off hours on the live site. It really upped the ante on designing each test knowing it was likely to take users offline for a few minutes. Such is life in a startup!

Forensics

Given that a background job was taking down the main web tier of our application, and that it seemed related to a long-running report query, it was clear pretty early on that the culprit was MySQL contention.

We first looked at CloudWatch for any clues. CPU, memory, connections, network traffic all looked within reasonable bounds.

We considered whether there was network I/O contention at the database caused by a large amount of data streaming back from our report query, but optimizations there yielded no difference.

We also took a closer look at the downtime itself, which lasted approximately 15 minutes. We observed that the downtime started just as the big query was finishing, and didn’t clear up until the whole report job was done, not just the short 2-minute DB query part of it.

So what’s left, you ask?? Locks, child. Locks.

Metadata Locks

As we all know, the InnoDB engine is famous for “row locking” instead of “table locking”. And further, we all know that this locking happens while waiting for writes and during transactions that include writes. Write^H^H^H^H^HRight? Wrong!

Well, sort of wrong. It turns out there are also metadata locks. Every statement that touches a table takes a shared metadata lock that is held until the end of its transaction, and a change to the table’s schema or settings (like an ALTER TABLE) needs an exclusive one, so it has to wait for everyone else to finish first.

So here’s the summary of the gotcha, kids:

SELECT * FROM widgets;
ALTER TABLE widgets ...; <--- This will wait for the SELECT (and its transaction) to finish

This seems obvious, of course. If MySQL were willing to change the table schema during SELECT statements, or during multiple-SELECT transactions, all hell would break loose.
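
If you’d like to watch the dominoes fall yourself, a minimal sketch along these lines (run from a Rails console against a throwaway widgets table, with a generous connection pool) reproduces the pile-up. The table and column names are placeholders, not anything from our app:

# Thread 1: a long-lived transaction that has read from widgets.
# Its shared metadata lock is held until the transaction ends.
reader = Thread.new do
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.transaction do
      conn.exec_query("SELECT * FROM widgets")
      sleep 60 # simulate a long job still inside the transaction
    end
  end
end

sleep 1

# Thread 2: DDL needs an exclusive metadata lock, so it waits for thread 1.
alterer = Thread.new do
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.execute("ALTER TABLE widgets ADD COLUMN scratch INT")
  end
end

sleep 1

# Thread 3: an innocent SELECT now queues up behind the ALTER TABLE,
# which is exactly what happened to our live web traffic.
blocked = Thread.new do
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.exec_query("SELECT * FROM widgets") # blocks until thread 1 commits
  end
end

[reader, alterer, blocked].each(&:join)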

Unfortunately, MySQL 5.6 does a terrible job of revealing that there are metadata locks outstanding, which sent our debugging heroes on a wild goose chase looking for locks.
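
In fairness, one thing that is visible even in 5.6 is the waiters: the processlist State column shows “Waiting for table metadata lock” for any session that’s stuck. It just won’t tell you who is holding the lock. A rough sketch of that kind of console check (not our exact tooling):

# List sessions stuck waiting on a metadata lock
rows = ActiveRecord::Base.connection.exec_query("SHOW FULL PROCESSLIST")
rows.each do |row|
  next unless row["State"].to_s.include?("metadata lock")
  puts "#{row['Id']} (#{row['Time']}s) #{row['State']}: #{row['Info']}"
end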

Eventually, we noticed a clue in SHOW ENGINE INNODB STATUS output, in the form of a query that was waiting on a metadata lock:

---TRANSACTION 0, not started
mysql tables in use 1, locked 1
MySQL thread id 2067430, OS thread handle 0x2ae7d4aa8700, query id 4708605990 170.31.60.131 newco Waiting for table metadata lock
ALTER TABLE tags MODIFY name varchar(255) CHARACTER SET utf8 COLLATE utf8_bin

Whoa, whoa, whoa, stop the clock. Why would an ALTER TABLE be happening right now? We are debugging intensive report jobs, not doing a deployment with a database migration!

Europeans and Tags

So we’re nearly finished, but before the big finale, our story needs the requisite flashback.

Once upon a time, we had a customer from Europe. Like many Europeans, this customer uses a character set that the infinite wisdom of the ASCII designers could never have predicted. So this customer, unwilling to take Ñø for an answer, inspired us to make sure accented characters worked in our tags.

We use the venerable acts-as-taggable-on gem, which, per popular issue 623, indicates that in order to properly handle special characters, one should set the collation on the tags table in MySQL to utf8_bin. Fair enough, we said.

We followed the instructions in the acts-as-taggable-on README, and modified our initializer as such:

config/initializers/acts_as_taggable_on.rb:
ActsAsTaggableOn.force_binary_collation = true

And once again, Euro-American relations were comfortably back at defcon 4.

Puzzle Picture Revealed

So my young detective, have you pieced the whole puzzle together yet? You now have all the clues….

I see you clutching your blanky with anticipation, so let’s walk the mystery together so you can get some sleep. Here’s what the situation looks like:

  1. User pushes the button to generate an admin report
  2. A resque worker starts up, running a 2-minute-long SELECT query
  3. That query takes a (shared) metadata lock on the “tags” table
  4. That query completes, but does not get committed until the database connection is returned to the pool, which doesn’t happen when exec_query() finishes, but rather when…well, it’s complicated (see the sketch after this list)
  5. Meanwhile, on the other side of Gotham city, we have a resque-scheduler job which is firing off once per minute on a seemingly unrelated task
  6. Each of those minutely jobs launches a Rails thread, invokes the initializers, thus calling ActsAsTaggableOn.force_binary_collation, which, you guessed it, calls an ALTER TABLE!!
  7. Now MySQL diligently tells that ALTER TABLE statement to wait until the SELECT transaction is complete.
  8. On a site where there is live traffic (of course, not on our test site), users are now clicking around the site, and need to retrieve tags. But MySQL forces all their SELECTs on the tags table to wait for the ALTER TABLE, which is in turn waiting on the original long-running SELECT
  9. Once we get enough of these waiting requests (depending on site load), we have no more thin processes left to handle incoming requests
  10. The site goes down hard. ELB/ASG ➡ puppies and pillows.
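
About that “well, it’s complicated” in step 4: ActiveRecord checks a connection out of the pool for the calling thread the first time you touch ActiveRecord::Base.connection, and that connection (along with whatever the server is holding on its behalf) stays checked out until something explicitly gives it back. A minimal sketch of the mechanics, not our report code:

# The first call checks a connection out of the pool for this thread...
conn = ActiveRecord::Base.connection
conn.exec_query("SELECT * FROM tags") # ...and the statement finishes here...

# ...but the connection stays checked out to this thread, together with any
# server-side state riding on it, until we explicitly return it:
ActiveRecord::Base.clear_active_connections!

# Or, better, scope the checkout so it is returned automatically:
ActiveRecord::Base.connection_pool.with_connection do |c|
  c.exec_query("SELECT * FROM tags")
end # connection goes back to the pool here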

Assigning a Root Cause

As to what “caused” this bug, it was close to a perfect storm of things coming together. Each of them individually makes sense, but together they produced emergent behavior in our system. Despite the deeply ingrained desire of engineers to place blame, this one isn’t so cut and dried.

There are two misdemeanor offenses here, both related:

  1. ActsAsTaggableOn should never have suggested putting a call that causes a database schema change into an initializer. That schema change should be done once, in a migration (see the sketch after this list), and never needs to be run again.
  2. We shouldn’t have paid attention to that piece of bad advice. And various team members had seen this ALTER TABLE statement in the logs, but just let it slide. It’s a good reminder that when something smells, it’s best to clean it up before the smell turns into a stink for your customers!
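
For the record, the one-time version is a plain migration. Here’s a sketch: the up statement is the same ALTER we caught in SHOW ENGINE INNODB STATUS, and the down statement assumes your table previously used MySQL’s default utf8_general_ci collation.

class ForceBinaryCollationOnTags < ActiveRecord::Migration
  def up
    # The same statement the initializer was issuing on every boot, run once
    execute "ALTER TABLE tags MODIFY name varchar(255) CHARACTER SET utf8 COLLATE utf8_bin"
  end

  def down
    # Assumption: the table previously used the utf8 default collation
    execute "ALTER TABLE tags MODIFY name varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci"
  end
end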

Solution: We removed the ALTER TABLE call. And for good measure, we also explicitly close the connection in our long-running report jobs after we’ve completed the DB queries, so we clear the metadata locks when we know we’re not going to be issuing any more statements on that connection.
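
For the connection part, the shape of the change is roughly the following (a sketch under the same assumptions as the earlier get_records, not a verbatim copy of our fix):

def self.get_records(report_type: nil, **options)
  query = build_query(**options)

  # Check a connection out only for the duration of the query, so it (and any
  # metadata locks held on its behalf) is released before the slow formatting loop
  results = ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.exec_query(query.to_sql)
  end

  # Belt and suspenders: make sure this thread isn't still holding a connection
  ActiveRecord::Base.clear_active_connections!

  {
    columns: build_record({}).keys,
    rows: results.collect { |row| build_record(row).values }
  }
end

Either way, the point is the same: give the connection back as soon as the query work is done, so nothing downstream is left waiting on locks we no longer need.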

Here’s response time when running reports before and after the deploy. Sleep well, little one. Sleep well.
