PostgreSQL lessons we learned the hard way

Investigating, understanding, fixing, and preventing troublesome database locks

Paul Brown
Compass True North
20 min read · Jun 3, 2022


Databases are often the weakest part of an organization’s tech infrastructure. While traffic to API servers can easily be load balanced between several redundant instances, databases often end up as a single point of failure, because of the difficulty of managing data integrity, transactions, and the like across multiple db instances.

At Compass, we experienced a problem where one of our databases would periodically lock up during migrations, during which even basic read requests would be blocked, leading to big problems for our organization and users. Though this article will discuss that specific issue in depth, it also has a larger goal — to provide an introduction to how database locks work, and to the tools anyone can use to investigate, understand, fix, and prevent issues related to database locks.

To this end, this article is split into three parts:

  1. Overview — An introduction to database locks, and specifically how they work in Postgres
  2. Investigation and Resolution — An overview of the specific problem we faced, and an explanation of the tools and strategies we used to investigate and fix it.
  3. Preparation and Prevention — Things we did to prevent and prepare for problems like this in the future.

Section I — Overview

Locks in Postgres

Before we dig into our specific issue, let’s back up and talk about locks in PostgreSQL in general.

[photo: smoked salmon on toast - not that kind of lox]

To conform with the SQL standard, Postgres supports wrapping operations in transactions of various isolation levels [1]. In order to guarantee isolation between transactions, Postgres uses a system of tracking concurrent changes to a row as different versions of that row, known as Multiversion Concurrency Control (MVCC) [1]. But MVCC alone is insufficient to guarantee ACID-compliance — so Postgres uses table-level locks as well [1, slide 17]. Row-level locks, while also used, are harder for the user to monitor, and are out-of-scope for this article (see https://postgrespro.com/blog/pgsql/5968005 for a good discussion of how row-level locks work).

Most commands you give to Postgres (for example, a SQL query) will require Postgres to take out one or more locks [1]. Postgres will first attempt to acquire the relevant locks (and wait until it can acquire them if it initially cannot). Then, once the locks have been acquired, it will run the requested command. Locks can also be acquired explicitly by the user (e.g. with an SQL statement like “LOCK TABLE people IN SHARE ROW EXCLUSIVE MODE;”).

It is important to understand, however, that there are different kinds of locks, and that some locks can coexist on the same table at the same time, while others are incompatible. In general, reads do not block writes, and writes do not block reads. Reads do not block other reads. Allowing concurrent non-conflicting locks on the same resource, in this manner, is one of the advantages Postgres gains by using MVCC to guarantee transaction isolation instead of using locks alone [1]. Meanwhile, writes might block other writes — if operation A has already acquired a write lock on a table, and operation B also wants to write to that same table, operation B may have to wait until operation A has completed and released its lock before it can acquire its own lock and run.

For a more detailed look at types of table-level locks and which block which, see the Postgres docs regarding locks, especially this table:

[table: which table-level locks block which other locks - screenshot from the Postgres docs]

Let’s highlight a few lock modes specifically from the table above:

Access share lock: This lock is taken out by most SELECT statements. It can only block (and can only be blocked by) “ACCESS EXCLUSIVE” locks.

Row exclusive lock: This is the lock taken out by most UPDATE statements. Despite the name, it is (like all the other locks we’re discussing today) a table-level lock. Though this lock allows other row exclusive locks on the same table, note that most writes which require row exclusive locks will also take out row-level locks on the rows affected — and these row-level locks would block any updates that impact the same rows (again, row-level locks are out-of-scope here! Please read https://postgrespro.com/blog/pgsql/5968005 for more info).

Access exclusive lock: This lock is required by most ALTER TABLE statements (i.e. schema changes/db migrations). Note that this is the only lock that can block (and be blocked by) our all-important read queries (i.e. SELECT/“access share lock” commands). This will be an important plot point later in the article.
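To make this concrete, here’s a rough sketch of the table-level lock each of those statement types takes out, using a hypothetical “people” table (the column names are invented for illustration):

    SELECT * FROM people;                            -- ACCESS SHARE
    UPDATE people SET name = 'Pat' WHERE id = 1;     -- ROW EXCLUSIVE (plus row-level locks)
    ALTER TABLE people ADD COLUMN nickname text;     -- ACCESS EXCLUSIVE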

Section II — Investigation and Resolution

The Problem

So the problem we were facing was that we’d be running a database migration, and suddenly our database would lock up and be unable to respond to any requests at all — even read queries.

Off the bat, given what we just learned in the section above, this makes some sense. Our migrations would often attempt to change the schema of one of our tables, therefore would often contain an “ALTER TABLE” statement, and therefore would often need to obtain an “ACCESS EXCLUSIVE” lock on the relevant table. So as long as the migration was running, the lock would be held on the table, and reads would be blocked:

[diagram: reads failing to run on a db table due to a lock held on that table]

Now, so far, this is the expected behavior for a migration. It waits until it can get the lock it needs, it obtains the lock, it runs the schema change, then it commits and releases its lock. And yes, while it’s running it does block any other queries to that table, but if it finishes promptly it won’t cause that much latency, and everyone can get on with their lives.

But our problem was that the migration would not complete. It would just hang forever, blocking all queries to that table indefinitely. So we had to investigate to find out . . . why was our migration not finishing?

[photo: a man at a laptop, looking thoughtfully into the distance - this person is wondering why our migration won’t finish]

The Investigation

To summarize the situation - we know:

  • We tried to run a migration
  • The migration won’t finish
  • The migration is blocking every other read from completing

We’re going to take a look at how one would investigate and resolve such a situation, using screenshots I took from my machine after recreating the issue locally.

The first place to look in a situation like this is the pg_stat_activity table. This table contains information about each process currently running on the db — each row represents one process.

A query of “select * from pg_catalog.pg_stat_activity” yields:

[screenshot: the result set of “select * from pg_catalog.pg_stat_activity”; the relevant data is described in the paragraph below]

or, if we isolate only the relevant info:
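A narrower query along these lines pulls out just the columns we care about here:

    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_catalog.pg_stat_activity
    WHERE state <> 'idle';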

We can identify which process is our migration by looking in the “query” column: we see the “alter table” statement attempting to drop a constraint, and the comment we wrote to identify it as part of the migration. Meanwhile, we see in the “wait_event_type” column that our migration is waiting on a “lock”. So - our migration appears to be waiting for a “lock”.

Hypothesis - it seems likely that the reason our migration can’t complete is because it can’t obtain the “ACCESS EXCLUSIVE” lock it would need to run its “ALTER TABLE” statement. Probably there is some other query with a conflicting lock on the table that we’re trying to alter. Let’s find out!

Which lock is it waiting on?

Now let’s join the pg_stat_activity table to the pg_locks table in order to see what locks our migration has attempted to acquire in order to run. We’ll grab the pid of our migration from the previous query on the pg_stat_activity table, and use it to filter:

(we also join to the pg_stat_all_tables table for good measure, just to get the name of any table we’re waiting for a lock on)
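The query looks something like the sketch below. The pid 31814 is a stand-in for whatever pid your migration has in pg_stat_activity:

    SELECT locks.pid, locks.locktype, locks.mode, locks.granted, tables.relname
    FROM pg_locks AS locks
    LEFT JOIN pg_stat_all_tables AS tables ON tables.relid = locks.relation
    WHERE locks.pid = 31814;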

This yields a table for which each row is a lock our migration needs to complete:

[screenshot: the result set of the query above; the relevant data is described in the paragraph below]

We can see, looking at the “granted” column, that our migration needed 6 locks in order to run, of which 5 have already been granted. The 6th, the one we’re waiting on, is an “Access Exclusive” lock on the “people” table (based on looking at the “mode” and “relname” columns, respectively). So - our migration is waiting to obtain a lock on the “people” table.

Hypothesis — there must be some conflicting lock on the “people” table that is blocking our migration from obtaining the “Access Exclusive” lock it needs to complete.

Why can’t we acquire the lock?

Postgres 9.6 introduced a function called pg_blocking_pids(), which can show you the process ids of exactly which other processes are blocking the completion of your process. Let’s query it, using the pid of our migration we got from the pg_stat_activity table:
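A sketch of that query, again using 31814 as a stand-in for the migration’s pid:

    SELECT pg_blocking_pids(31814);
    -- in our recreated incident, this returned an array containing a single pid: 31825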

which gives us the pid of the process that is blocking us.

So — the process 31825 is blocking our migration from finishing.

Hypothesis - It looks like process 31825 is the process that has the lock on the people table that is preventing our migration from obtaining the lock it needs on that table.

What locks does *that* query have?

Let’s query the pg_locks table filtering by this new pid:
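Something like the sketch below, reusing the shape of our earlier pg_locks query, this time with the blocking pid we just found:

    SELECT locks.pid, locks.locktype, locks.mode, locks.granted, tables.relname
    FROM pg_locks AS locks
    LEFT JOIN pg_stat_all_tables AS tables ON tables.relid = locks.relation
    WHERE locks.pid = 31825;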

Which yields the locks held by that process.

So — the blocking query has an AccessShareLock on the people table. Presumably this AccessShare lock on the people table is what is blocking our migration from obtaining the AccessExclusiveLock it needs.

And, under normal circumstances, that would be fine! We should expect that when we run a migration, it might need to queue up behind some currently-running processes on the db before it can run itself. But the issue we’re facing here is that this blocking query won’t terminate.

I’ll bite - why can’t that query finish?

Let’s look back at its entry in the pg_stat_activity table:
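For example (31825 is the blocking pid we found above):

    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_catalog.pg_stat_activity
    WHERE pid = 31825;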

The query blocking us is “idle in transaction” waiting on a “Client” — we’ll explain what that means in the section below.

But to summarize what we know so far: our migration couldn’t finish, because it was blocked by a query that had a conflicting lock, and that query was “idle in transaction”. That migration, in turn, was blocking all other queries to our db.

[diagram: user reads ->x our migration ->x idle read query ->x db — the idle read query blocks our migration, which in turn blocks user reads]

So what is an idle-in-transaction query?

Again — our read/select query apparently won’t finish because it is “idle in transaction”. So to understand why that query won’t finish, we need to understand what “idle in transaction” means.

A query is “idle in transaction” if a client began a transaction, but never committed or rolled it back [1]. A query is NOT “idle in transaction” if, for example, the query is computationally-intensive and is taking the db a long time to complete — it is only “idle in transaction” if the client opened a transaction, and is letting the connection sit idle without committing that transaction.

This can commonly be caused by a human user connecting to the db from their local machine, opening a transaction, but never finishing it — indeed, that was how I recreated the above incident on my local machine:
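Roughly, in a psql session against the example “people” table used later in this article, that looks like:

    BEGIN;
    SELECT * FROM people;
    -- ...and then just walk away, without ever running COMMIT or ROLLBACK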

The thing with idle transactions is that, by default, Postgres will let you leave the idle transaction open FOREVER. This sort of makes sense: Postgres doesn’t know your business logic — maybe you, or the backend client, have some sort of valid reason to start a transaction and then let it sit for a long time. Maybe you need to open a transaction, read some data, email it to someone, wait until you get a response within 3 business days, and then do a write based on their response, and only then close the transaction. Or maybe you need to open a transaction, do a few reads, do some complex resource-intensive calculations on the client that take days to complete, before doing a write and closing the transaction. Postgres doesn’t mind — it’ll wait.

Which, depending on the nature of your idle transaction, usually won’t cause a problem … at least initially. In the example above, the idle transaction was a simple read query that only needed an “Access Share” lock. Even if that transaction never finishes and releases its lock, that sort of lock doesn’t block other reads, and doesn’t really block writes. You might never notice that it never got closed. But remember our lesson from earlier — the one type of lock that “Access Share” locks can block is the “Access Exclusive” lock — which is the lock required by “alter table” statements, and therefore by most migrations.

So a simple idle transaction read can sit innocently on your db for days, like a landmine, not causing any issues … until you run a migration. Then that idle read can block your migration, which can block all other reads to that table.

tldr: our database was locked because “idle in transaction” query + migration = locked db

The Resolution

Once you’ve used the tools above (the pg_stat_activity table, the pg_locks table, and the pg_blocking_pids() function) to identify the pid of the query blocking your migration, simply terminate that query (note: if you still don’t know the pid of your blocking query, you could try the query in the appendix).

Once you know the offending pid, you can try this simple sql command pg_cancel_backend:
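Using the blocking pid from our example, that looks something like:

    SELECT pg_cancel_backend(31825);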

(substituting 31825 for the pid blocking your migration). This will try to end the query gracefully.

If that doesn’t work, or isn’t fast enough, then try pg_terminate_backend:
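Again with our example pid:

    SELECT pg_terminate_backend(31825);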

(substituting 31825 for the pid blocking your migration).

Note of course this won’t work for just any SQL database. The “pg” here stands for “Postgres”, and prefixes tables and functions specific to Postgres.

You can also kill the migration of course, which is possibly a good idea. But don’t just kill the migration without also killing the idle transaction — this is a recipe for having your database fail again next time you try to run the migration (which might also be quite soon if your migrator client is set to auto-retry upon failure). That’s a lesson we learned the hard way.

Why is this even interesting or cool?

What I find interesting, or at least tricky, about this situation is that neither the migration, nor the idle query, alone was particularly problematic. The idle query was an innocent read operation that didn’t block any reads or any writes until the migration came along. And the migration itself wasn’t poorly written or a bad idea — under normal circumstances the migration would queue up behind any reads/writes currently going on in the relevant table, wait for them to finish, run swiftly itself, then commit and release its locks. If the idle transaction hadn’t been there, our migration would have been perfectly successful. It was only the two together that caused an issue.

The other thing that’s kind of weird here is that our migration’s “AccessExclusiveLock” blocked all reads to the table even though our migration hadn’t yet obtained that lock. That’s because not only granted locks, but also the lock requests of queued processes, can block processes queued behind them. As the Postgres docs explain:

One server process blocks another if it either holds a lock that conflicts with the blocked process’s lock request (hard block), or is waiting for a lock that would conflict with the blocked process’s lock request and is ahead of it in the wait queue (soft block).

So if we revisit our diagram, we see that even though our migration never got its lock, that lock can still “soft block” other incoming reads/writes:

[diagram: locks that are not granted can nonetheless still block other locks]

Section III — Prevention/Preparation

There are a number of things we did to prevent this issue from recurring in the future, and to prepare for similar future occurrences. You can do these things too!

Prevention

Set idle-in-transaction timeout

As we discussed, postgres won’t, by default, terminate long-running idle transactions. But you can configure it specifically to do so, using the idle_in_transaction_session_timeout configuration. This is the number 1 thing you can do to prevent this issue.

Even if you don’t want it to be as aggressive as one minute, setting it to time out after 10 minutes, an hour, or even 24 hours helps clean up troublesome idle transactions that could sabotage you the next time you want to run a migration.

You might initially think this is somewhat risky — “what if I have a query that scans a really large table and ends up taking longer than my timeout?”. But note that this is not a general statement timeout (that exists too) — this is only a timeout for transactions that are idle because they are waiting for a client to send a new command or commit the transaction. Transactions that are currently doing calculations, or scanning a table, are not marked as “idle in transaction”.
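A sketch of how you might set it (the 10-minute value and the app_user role are purely illustrative):

    -- server-wide default, picked up by new sessions after a reload:
    ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
    SELECT pg_reload_conf();

    -- or scoped to a single role:
    ALTER ROLE app_user SET idle_in_transaction_session_timeout = '10min';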

Set migration timeout

Remember how our formula for trouble was:

idle in transaction query + migration = locked db

Well, just as it made sense to time out a transaction that idles for too long, you can also time out migrations that take too long. You can do this in your migrator client, or you can do it in the SQL code of the migration itself with a local statement timeout. This latter approach has the advantage that you can have different timeouts for different migrations, depending on how long you expect them to take. Just do something like:
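A sketch of a migration that carries its own timeout, reusing the constraint from this article’s example (the 10-second value is illustrative):

    BEGIN;
    SET LOCAL statement_timeout = '10s';  -- applies only inside this transaction
    ALTER TABLE people DROP CONSTRAINT IF EXISTS fk_people_accounts_1;
    COMMIT;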

Note, of course, that for statement timeouts the timeout will be per statement, not per transaction. An extra caveat on top of our caveat here is that postgres versions before version 13 “usually treated the timeout as applying to the whole query string”, even if that string is technically split into multiple separate statements by semicolons [1].

Acquire explicit lock with NOWAIT

This is likely not a good solution for high-traffic tables, but you can also explicitly have your migration immediately fail if it is blocked by ANY other queries. This has the advantage that the migration will fail fast, instead of queuing up behind any other queries and blocking any incoming queries while it waits in the queue. You can accomplish this with some code like:
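A sketch, again using the example table and constraint from this article:

    BEGIN;
    -- fail immediately if anything else already holds a conflicting lock:
    LOCK TABLE people IN ACCESS EXCLUSIVE MODE NOWAIT;
    ALTER TABLE people DROP CONSTRAINT IF EXISTS fk_people_accounts_1;
    COMMIT;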

If there are any other queries live on the people table when this runs, it will immediately fail. Again, not a good idea for high traffic tables, since it means your migration will almost always fail due to existing traffic on the table.

Read more here.

Use read replicas

In the intro to this piece, we discussed that databases are often single points of failure. One solution to this, of course, is to use a database cluster. The simplest configuration is to use a primary writer database with several read-only replicas. One advantage of this is if the writer locks up, the read replicas may still be available. This is particularly convenient if you are using your writer for writes and low-priority reads, and using your read replicas as a VIP lane for especially high-priority reads.

Furthermore, your replicas may be less likely to lock up in the first place, compared to the primary. You can configure your read replicas to cancel any queries whose locks hold up replay of writes from the primary for too long. This is the default configuration for AWS Aurora clusters, and explains why our replicas never experienced this type of locking behavior — if an idle transaction ever blocked a replicated write for too long, the idle transaction was terminated. See the hot standby docs for more details.

Limit who can access prod/staging database

As we alluded to earlier, these types of idle queries are often caused by engineers connecting to a database from their local machine. As a result (and for a plethora of other reasons), we want to severely limit who can access the database. And even for those who can access the database directly, such access should happen rarely, if at all, outside of emergency scenarios.

Beware DB UI client software

If you are using db client software that provides a friendly UI to display tables, etc, it is using queries under the hood to generate the data it displays. At Compass, we have documented examples of idle transactions created via such software, and we have good reason to suspect they were created under the hood rather than by the user explicitly. So be aware.

Preparation

Increase the number of superuser_reserved_connections

So under normal circumstances, accessing the db from your local machine is a no-no. But in an emergency, it is absolutely vital that you are able to do so. Unfortunately, an emergency is likely to make it harder to access the database, because if the database cannot swiftly fulfill requests, other clients (your API servers, etc.) are likely to eat up all the available connections. Once connection starvation has occurred, it will be difficult for you to obtain a connection from your local machine to the database to debug the cause of the connection starvation.

Luckily, Postgres has your back here. There is a parameter called “superuser_reserved_connections”, which is how many connections are explicitly reserved for users with superuser status. The problem is that the default is only 3 connections! That’s fine if no one ever connects to your database from their local machine, but if people do, the last thing you want is for a few random people to be holding those connections while you are trying to debug an outage in production (again — a lesson we learned the hard way).

So just increase the value! This does mean there will be fewer connections available to your server to connect to the database, however: 1) there is a good chance that under non-emergency circumstances your database is not using 100% of its max_connections capacity anyway, 2) in the event of an emergency, it will be much more important that you have a VIP lane you can use to connect to the db than it is for the server to have a couple extra connections to use up, and 3) you can always increase max_connections to compensate if you want.

If you’re using an AWS database, on the other hand, they won’t give you access to the superuser role, so this option won’t be very useful. Luckily Amazon RDS postgres and Aurora postgres offer you an “rds_superuser” role as well as an rds_superuser_reserved_connections parameter — which reserves connections for users with the rds_superuser role. Though the “RDS superuser” doesn’t have full superuser access, it is powerful enough to perform all of the debugging steps discussed in this article.
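A sketch of what that might look like on self-managed Postgres (the values are illustrative, and both settings require a restart to take effect):

    ALTER SYSTEM SET superuser_reserved_connections = 10;
    ALTER SYSTEM SET max_connections = 500;  -- optionally compensate
    -- on RDS/Aurora, set rds_superuser_reserved_connections in the DB
    -- parameter group instead, since the true superuser role isn't available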

Enable log_lock_waits

It can be hard to investigate locking issues after the issue has passed. Luckily, if you turn on log_lock_waits, Postgres will write a log message any time a session waits longer than deadlock_timeout to acquire a lock. This makes it quite easy to investigate the cause of an outage or lock-related db issue even after the issue itself has passed.

No downside, just turn it on!
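A sketch of turning it on:

    ALTER SYSTEM SET log_lock_waits = on;
    SELECT pg_reload_conf();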

Section IV — Appendix/Misc

How can I recreate the described issue locally?

Easy! First create two tables (you can run queries on the db using, say, psql). We’ll call one “people” and the other “accounts”. Add a foreign key from “people” to “accounts” called “fk_people_accounts_1”. Then in one session:
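A sketch of the setup, followed by session 1 (the table and constraint names come from this article; the columns are invented for illustration):

    -- one-time setup:
    CREATE TABLE accounts (id bigint PRIMARY KEY);
    CREATE TABLE people (
        id bigint PRIMARY KEY,
        account_id bigint,
        CONSTRAINT fk_people_accounts_1 FOREIGN KEY (account_id) REFERENCES accounts (id)
    );

    -- session 1: the idle transaction
    BEGIN;
    SELECT * FROM people;
    -- ...no COMMIT, no ROLLBACK - just leave the session sitting here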

That’ll be our idle transaction - notice we started a transaction but never committed or rolled it back. No issues so far. Next let’s try an idempotent migration in a different session:
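Something like the following (the trailing comment is just there so we can spot the statement in pg_stat_activity later):

    -- session 2: the "migration" - it hangs waiting for its ACCESS EXCLUSIVE lock
    BEGIN;
    ALTER TABLE people DROP CONSTRAINT IF EXISTS fk_people_accounts_1; /* migration */
    COMMIT;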

Notice that the migration, though relatively simple, can’t finish because it’s blocked by the idle transaction?

Now let’s try a simple read query:
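In a third session, an ordinary read:

    -- session 3: blocked by the queued migration
    SELECT * FROM people;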

Notice that this read also won’t finish, because it’s blocked by the migration.

Now let’s cancel the idle transaction in session 1 by adding a “rollback”:
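Back in session 1:

    ROLLBACK;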

Notice that now the migration and the read query can finish? Run them as many times as you like — they work fine!

Is my database locking issue caused by deadlocks?

A deadlock is where two different transactions each have greedily acquired a lock the other needs in order to run. This leads to a situation where neither process can complete because each is waiting for the other to complete first.

If your issue is your database locking up indefinitely, it is unlikely to be caused by a deadlock. Postgres is able to automatically detect deadlocks and terminate one of the offending transactions so the other can finish. As a side note, the client may want to be ready to retry a transaction cancelled due to a deadlock.

So unlike our idle transaction issue, in which postgres will wait forever for the client to do something, postgres won’t tolerate deadlocks because it knows they will literally never resolve themselves, and there is no valid use case for the user/client/developer to want a deadlock.

Postgres checks for a deadlock if a lock is taking too long to acquire. Detecting deadlocks is a tradeoff — it’s relatively computationally expensive, so the more often the db checks for deadlocks, the more resources it uses. On the other hand, the longer it waits to check, the longer any latency, locks, and traffic jams caused by a real deadlock will persist. This tradeoff can be balanced by adjusting deadlock_timeout — the amount of time Postgres waits on a lock before checking for a deadlock condition.
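A sketch of inspecting or adjusting it (the default is 1 second; the 2-second value is illustrative):

    SHOW deadlock_timeout;
    ALTER SYSTEM SET deadlock_timeout = '2s';
    SELECT pg_reload_conf();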

Read more here.

I’m in a prod incident at the moment, and somehow I’ve found myself reading this article — any advice?

You could try running a query like this to see what queries are blocking which other queries in your db.
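This isn’t necessarily the exact query from the appendix of the original writeup, but a sketch built from pg_stat_activity and pg_blocking_pids() that lists each blocked backend alongside whatever is blocking it:

    SELECT
        blocked.pid    AS blocked_pid,
        blocked.query  AS blocked_query,
        blocking.pid   AS blocking_pid,
        blocking.state AS blocking_state,
        blocking.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker(pid) ON true
    JOIN pg_stat_activity AS blocking ON blocking.pid = blocker.pid;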

Once you’ve found the pid of the query at the root of the blockage, you could then use select pg_cancel_backend(<pid>) or select pg_terminate_backend(<pid>) to end the offending query, as discussed above.

I’m still in a prod incident and the tip above just confused me more because I haven’t read the rest of the article

**WARNING — proceed with caution**

I mean, have you considered killing all processes currently running on your db? I know it sounds crazy, and it is a bit risky, but it could also be a quick way to bring your database back up (at least temporarily), without needing to do much investigation first:
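A sketch of one way to do that - it terminates every other connection to the current database, so treat it as a last resort:

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
      AND datname = current_database();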

Any other tips?

If you drop a foreign key on a table, it requires an access exclusive lock NOT ONLY on the table with the foreign key constraint, but ALSO on the table that foreign key is referencing.

So even if you are doing a migration on a table that’s completely unused, if it involves a foreign key pointing at a used table, it can still cause that table to lock up, reads to that table to fail, and all of staging to crash. And yes, that’s another lesson we learned the hard way.

And that’s it!

If you’ve made it this far, thank you and congratulations! Please leave any questions/suggestions/proposed corrections in the comments! You can also email the author using the “pbrown” username at the “compass.com” domain.
