The notorious callback that drove me nuts — starring PostgreSQL, Unicorn, PubNub, EventMachine & ActiveRecord!

I’ve been working on a live streaming platform at Spritle for one of our clients, and a couple of weeks back they ran an event on it for a customer. During the course of the event, we faced some really interesting and nerve-racking problems. I’d like to write about my experience during this debacle and how the problems were solved.

There are always threads involved!

It was 10:00 PM India time and I was finishing off my reading list for the day when I got a 911 notification from our Trello board. I knew 10:00 PM (12:30 PM, Toronto) was the start time of the event and I got really nervous. My client had posted the following words in the Trello card —

911 — SERVER DOWN!!

As a quick sanity check, I tried to open the app and got a 504. We’re running Unicorn, so my first thought was that Unicorn must have crashed. Upon SSH-ing into the server I ran htop and saw that Unicorn was up, and memory and CPU were at pretty decent levels. In htop, though, I noticed something odd: there were a ton of threads running under each Unicorn worker process. In the heat of the moment, I did not dwell on this abnormality.

It struck me then that our server had been set up for development purposes, with just 2 Unicorn workers running. These two workers could not keep up with the sheer number of people trying to get into the event. This was a disaster.

As a knee-jerk reaction, I updated the configuration to use 4 workers and restarted Unicorn. There was some relief, but sporadic 504s were still happening. I then notched the worker count up to 8, since we had some memory to spare, and killed Sidekiq to save some more.
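For reference, that change amounts to a single directive in unicorn.rb (the values below are illustrative, not our exact configuration):

```ruby
# config/unicorn.rb -- illustrative fragment
worker_processes 8   # was 2; each worker is a separate forked OS process
timeout 30           # the master kills workers stuck longer than this
```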

Things started to stabilize. As I was updating our client on Trello that things were back to normal, a HipChat notification popped up on my screen. It was from Sentry, the exception tracking tool we use. The exception was —

PG::ConnectionBad - FATAL:  sorry, too many clients already

It looked like some of the workers were not able to get their PostgreSQL connections, so some 500 errors were occurring. But by this time most of the viewers had already entered the event and were watching the stream (which was served from Wowza Cloud). The pressure was off a bit, and I had some breathing space to attack this new problem.

I checked the configuration in the postgresql.conf file, and it had the default max_connections setting of 100. This was strange. There was no way our server could be using 100 PG connections. Since we had 8 Unicorn workers and a Sidekiq process with a concurrency of 25 running, the maximum number of connections needed at any point in time should have been just 33. The database pool size was set to 25 to accommodate the 25 connections needed for Sidekiq. In any case, each Unicorn worker should not have been using more than one connection, and even an occasionally hung connection should not have eaten up all the available ones. Something else was off. We were leaking connections!

During the event, I could not use Postgres’ pg_stat_activity view to see who was holding the connections, since even the psql client could not get a connection to the database.

The rest of the event went well, as all the viewers were able to get in. Once the event was done, I restarted Unicorn, logged into Postgres, and queried the pg_stat_activity view. This is what I got —

prod_db=# select datid, datname, pid, usesysid, usename, application_name, state from pg_stat_activity;                                                                             
datid | datname | pid | usesysid | usename | application_name | state
-------+-----------------------+-------+----------+----------+-----------------------------------------------------------------+--------
16384 | prod_db | 30104 | 10 | postgres | unicorn worker[1] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30094 | 10 | postgres | unicorn worker[0] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30110 | 10 | postgres | unicorn worker[2] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30116 | 10 | postgres | unicorn worker[3] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30123 | 10 | postgres | unicorn worker[4] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30129 | 10 | postgres | unicorn worker[5] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30135 | 10 | postgres | unicorn worker[6] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 30157 | 10 | postgres | unicorn worker[7] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 32161 | 10 | postgres | unicorn worker[5] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 32183 | 10 | postgres | unicorn worker[7] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 32273 | 10 | postgres | unicorn worker[5] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 32296 | 10 | postgres | unicorn worker[2] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 374 | 10 | postgres | unicorn worker[1] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 491 | 10 | postgres | sidekiq 3.4.2 app_production [0 of 25 busy] | idle
16384 | prod_db | 498 | 10 | postgres | unicorn worker[7] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 581 | 10 | postgres | unicorn worker[3] -c /u/apps/e...ig/unicorn.rb -E deployment -D | idle
16384 | prod_db | 1337 | 10 | postgres | psql | active
(17 rows)

As you can see, some Unicorn workers were responsible for more than one PG connection. Clearly there was a leak.

As I was researching how one could debug ActiveRecord connection leaks, I came across blog posts saying that threads are a major source of such problems. I learned that if an ActiveRecord connection is needed inside a thread spawned from a Rails process, a new connection is checked out, and it must be properly returned at the end of the thread’s work.

This made sense to me, but I had not explicitly used any threading in my application. Still, based on what I saw in htop, something weird was definitely going on with the threads. Here’s a snapshot of htop from that time —

Partial output of htop. White entries are processes and green entries are the threads running under them.

As you can see, there were around 24 threads under unicorn worker[0] alone, and some of those threads could very well have been holding on to ActiveRecord connections, leading to the overuse of available connections.

While I was trying to figure out how to identify the source of these threads, I came across this brilliant article that described a technique to pinpoint the exact line a thread is executing at any point in time. I recommend you read the full article, but I would like to present a few lines of code from it that helped me do this —
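The technique boils down to a few lines of plain Ruby. Here is a sketch reconstructed from the description (not the article’s verbatim code):

```ruby
# Dump every thread's backtrace to stderr when the process receives
# SIGUSR2. Dropped into unicorn.rb, this applies to every worker.
Signal.trap('USR2') do
  Thread.list.each do |thread|
    $stderr.puts '---'
    $stderr.puts(thread.backtrace.join("\n")) if thread.backtrace
  end
  $stderr.puts '-------------------'
end
```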

It simply traps the USR2 UNIX signal, loops through the process’s threads using Thread.list, and prints each thread’s backtrace at that point in time. Brilliant, right?

I updated unicorn.rb with this, restarted it, and sent the USR2 signal using kill -USR2 <pid>. As expected, the thread stack traces were spit out into the logs. Here is the relevant bit —

[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `pop'
[17176] /u/apps/eventstream_production/shared/bundle/ruby/2.2.0/gems/eventmachine-1.0.8/lib/eventmachine.rb:1057:in `block in spawn_threadpool'
[17176] ---
[17176] -------------------

EventMachine! Although we were not using EventMachine directly, we did use the PubNub gem (for real-time data needs), which in turn was using EventMachine to make async calls to the PubNub data network. EventMachine maintains a thread pool of up to 20 threads (by default), which it uses for blocking operations.
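To see why those threads all park at eventmachine.rb’s pop, here is a simplified, stdlib-only sketch of how a deferred-work thread pool like EventMachine’s behaves (an illustration of the pattern, not EventMachine’s actual code):

```ruby
require 'thread'

# Jobs are pushed onto a queue, and up to 20 worker threads block on
# `pop` waiting for work -- exactly where the backtraces above point.
THREADPOOL_SIZE = 20
jobs = Queue.new

workers = THREADPOOL_SIZE.times.map do
  Thread.new do
    loop do
      operation, callback = jobs.pop   # idle threads sit here, blocked
      callback.call(operation.call)    # run the job, then its callback
    end
  end
end

# Enqueue a blocking operation with a callback, EM.defer style.
jobs << [-> { 1 + 1 }, ->(result) { puts result }]
sleep 0.1 # give a worker thread a moment to pick the job up; prints 2
```

Every one of those pool threads shows up in htop, whether or not it currently has work to do.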

Phew! By now it was clear what was causing all those threads to be spawned. There was still the issue of the overuse of Postgres connections to be solved, and I figured that if I could prove these threads were indeed creating all those connections, I would know exactly what to fix.

When using the PubNub gem in async mode, you need to pass in a callback. In all but one of the places where we used PubNub, we had passed a dummy callback, since we didn’t need to do anything with the response.

But in one place we needed to find out how many people were subscribed to a particular PubNub channel and send that data to Keen.io for analytics. For this we used a callback that sent some data, including the subscriber count, to Keen.io. Although the code in the callback seemed innocuous, it was actually building a hash from an ActiveRecord model instance and, in the process, accessing data from an associated model, which required an ActiveRecord connection.
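The failure mode generalizes beyond ActiveRecord: any pool whose connections are checked out on ad-hoc threads and never checked back in will eventually drain. A stdlib-only toy pool (ToyPool and its methods are invented here purely for illustration) makes the leak, and the ensure-based fix, concrete:

```ruby
require 'thread'

# A toy connection pool: `checkout` removes a connection; unless the
# borrowing thread checks it back in, it is gone for good -- the same
# shape as EventMachine threads grabbing ActiveRecord connections and
# never returning them.
class ToyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << "conn-#{i}" }
  end

  def checkout
    @queue.pop(true) # non-blocking; raises ThreadError when exhausted
  end

  def checkin(conn)
    @queue << conn
  end

  def available
    @queue.size
  end
end

pool = ToyPool.new(5)

# The leak: threads check out a connection and never return it.
3.times.map { Thread.new { pool.checkout } }.each(&:join)
puts pool.available # => 2; three connections are gone

# The fix: always check the connection back in, even if the work raises.
2.times.map do
  Thread.new do
    conn = pool.checkout
    begin
      # ... build the hash, touch the associated model, etc. ...
    ensure
      pool.checkin(conn)
    end
  end
end.each(&:join)
puts pool.available # => still 2; no further connections leaked
```

In a Rails app, the equivalent fix is to wrap thread-side database access in ActiveRecord::Base.connection_pool.with_connection { ... }, which checks the connection back into the pool when the block exits.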

Just to confirm this, I ran that piece of code from the Rails console, and after execution I could see two open PG connections. Before execution there had been just one, opened by the Rails console itself.

Finally, everything fell into place. It all made sense. I quickly patched the code and ran a quick load test using ApacheBench (ab); the number of connections never exceeded 8. Problem solved.

Also, I found out that PubNub had released a newer version of the gem that dropped EventMachine in favor of Celluloid (another concurrency library for Ruby). We upgraded to that version as well.

It’s been a few weeks now, there have been a couple more similar events, and everything is running smoothly. I hope you found this story interesting; it was certainly a good learning experience for me. I finally got around to understanding how and when ActiveRecord connections are created, how Postgres client processes are created, what PgBouncer does, how EventMachine works, and that fantastic blog post which helped me find out where the threads were executing!

Have you had any recent firefighting experiences? Do share in the responses if so! :)
