Post mortem: When database queries take you down

Dag-Inge Aas · Published in The Whereby Blog
5 min read · Sep 26, 2016

I could tell you many stories about late nights spent debugging, but today I want to tell you a special story of mine, about the dragons lurking in code when database queries go wrong.

The appear.in team hopefully going up, not down. Credits: Synshorn Via Ferrata

In late September 2016 we started seeing some unusual issues. Seemingly at random, one of our production servers would completely freeze up and refuse to talk to the database. Due to the fixed load balancing in our socket-server set-up, this meant that a group of our users couldn’t connect or do anything at all. They were connected to our server just fine, but any call or state change requiring the database would fail. Restarting the service fixed the problem, but only for a couple of hours before the alarm sounded again, this time on a different server.

We use knex.js and kept seeing “Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?” in our logs. Our investigation led us down a path of looking for misbehaving transaction calls, but we couldn’t find any obvious candidates. Besides, most of those calls had been around for a really long time and had been called countless times, so why would they suddenly become a problem?

Eager to find a solution, we found that a new version of Knex was just rolling out, switching out the current pool implementation. Hopeful that we had just discovered some underlying issue in our database adapter, we tried upgrading to 0.12, and that seemed to work. For a while.

I was on call that week, and the servers chugged along nicely, right up until 00:30 on Thursday night, when the alarm sounded again. Same symptom, different server. Must restart. Must. Sleep.

I got pretty used to seeing this alarm

The next morning I woke up refreshed, eager to figure out what had happened. I looked at logs, metrics, everything I could, but nothing told me exactly what had happened. No sudden spikes in throughput, no obvious candidates for runaway queries. Until I found this:

New Relic Insights showing number of available connections in the pool. Guess when I restarted.

In an instant, all available connections were gone. No explanation, no logs, just gone. And it all came back just fine when I restarted the process. Distraught, I created an issue on the Knex GitHub project: https://github.com/tgriesser/knex/issues/1688. We have two different “types” of servers running against this particular database, API and socket servers, which share some code. However, only the socket servers experienced issues. Why was that? Logically, the issue had to be in our code.

The author of Knex responded quickly and suggested some things to help me debug further: logging debug messages from knex and node-pool, and checking for code paths that might lead to destroy(), which would tear down all connections. But he couldn’t explain why our database connections suddenly decided to commit hara-kiri. And neither could we.

Fast forward a few hours, and I was still not sure what was going on. I had exhausted my methods: New Relic, Kibana, and manually inspecting the metrics on each server told me nothing. But I was wrong. I had been looking at this from only a single viewpoint, the server itself. There was another angle: the database. What did it see?

This is not my first rodeo debugging performance issues in Postgres, so I had some hunches as to where I should start looking. PostgreSQL provides several statistics views you can query for different kinds of information: which tables are read the most, the number of buffer hits, block reads, index usage, and more. In addition, it has a very important view: pg_stat_activity. It shows you the currently running queries and some statistics about them. I ran the following command:

SELECT query, client_addr, query_start FROM pg_stat_activity ORDER BY query_start LIMIT 50;

This would give me the query text, the IP address it was called from, and when the database started working on the query, sorted by start time, oldest first. We didn’t have any downtime at that moment, but every once in a while I spotted a long and complex query. Its responsibility was to get the number of unread messages for a room, grouped by user. To do this, it joined against one of our massive tables, the one containing all our chat messages. Curious about the cost of this query, I ran EXPLAIN:

HashAggregate  (cost=570662.70..570715.78 rows=5308 width=32)
  Group Key: h."userId", r."userId"
  ->  Nested Loop  (cost=3509.07..512212.70 rows=7793333 width=32)
        Join Filter: (h."timestamp" > r."timestamp")
        ->  Bitmap Heap Scan on history h  (cost=3489.54..160214.65 rows=58450 width=34)
              Recheck Cond: ("roomName" = '********'::text)
              ->  Bitmap Index Scan on history_roomname_timestamp_index  (cost=0.00..3474.93 rows=58450 width=0)
                    Index Cond: ("roomName" = '*******'::text)
        ->  Materialize  (cost=19.52..1299.05 rows=400 width=37)
              ->  Bitmap Heap Scan on readstate r  (cost=19.52..1297.05 rows=400 width=37)
                    Recheck Cond: ("roomName" = '******'::text)
                    ->  Bitmap Index Scan on readstate_roomname_userid_unique  (cost=0.00..19.42 rows=400 width=0)
                          Index Cond: ("roomName" = '******'::text)

Some database architect is probably pulling their hair out as we speak. That’s one seriously expensive query! The planner expected the nested loop to push an estimated 7.8 million rows into the aggregate, all to count unread messages for a single room. Pretty sure I had found the culprit, I made a change to remove the offending query and refactor the code path so it wasn’t needed anymore. We were lucky that it didn’t impact something larger and more complex. However, I wanted to prove that this was indeed the issue, so I waited.
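For readers who want to follow along, the query behind a plan like this would look roughly like the sketch below. This is a reconstruction from the EXPLAIN output, not the actual code; the table and column names come from the plan, the select list is a guess, and the room name is masked just like in the plan above:

SELECT h."userId", r."userId", COUNT(*) AS unread
FROM history h
JOIN readstate r ON h."timestamp" > r."timestamp"
WHERE h."roomName" = '********'
  AND r."roomName" = '********'
GROUP BY h."userId", r."userId";

For a busy room, a join like this has to compare every chat message in the room against every user’s read state before it can throw anything away, which is exactly the nested loop the planner chose.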

After countless alarms, restarts, grey hairs, and some time spent in the fetal position, the next alarm was going to confirm my suspicion. A few hours later, it happened.

Rushing with anticipation, I fired up the client, connected to the database, and ran the query against pg_stat_activity again. Time was short: for every second I spent, people were having a bad experience, unable to connect, losing trust in us. This had been going on long enough. When the result came, I cheered. Ten queries, all running for >5 minutes (!), all identical, and all from the same server. I had my culprit. I restarted the service, traffic rebalanced, and the queries disappeared. The world was back in order.
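A small variation on the earlier pg_stat_activity query makes this kind of check faster when you’re under pressure: filter on how long a statement has been running instead of eyeballing the full list. A sketch, assuming PostgreSQL 9.2 or newer (where pg_stat_activity has the pid and state columns):

SELECT pid, client_addr, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;

From there, a single misbehaving backend can also be cancelled with pg_cancel_backend(pid) if restarting the whole service is too blunt an instrument.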

I merged my changes shortly thereafter and deployed by churning our entire fleet.

Let’s play a game of: Spot the deploy!

No more alarms came from that issue ever again. (Ok now I’m just jinxing it).

We still don’t know exactly why these queries locked up when they did, and so suddenly. We’re going to investigate more, but luckily, our production servers are now chugging along nicely.

Want to be part of the excellent team that helped create and solve this issue? See open positions over at https://appear.in/information/jobs/

Dag-Inge Aas

CTO and Co-founder of Confrere. Strangely fascinated by numbers and graphs.