Puzzling Postgres: solving an unreproducible performance issue

William Duclot

Published in

Engineering at Birdie

11 min readJan 3, 2023

TL;DR

Note: this is a spoiler of the investigation below.

First surprise: some Postgres drivers (eg pgx in Go) use prepared statements under the hood, without the user explicitely opting in.

Second surprise: in Postgres, queries executed via a prepared statement can have a different query plan than if executed directly. The query planner can choose to use a “generic” query plan that ignores the value of the prepared statement’s parameters, potentially resulting in a very inefficient query plan for the actual values you send.

Third surprise: Postgres’ query planner has a bug in its cost estimation.

Consequence: we had a query that was particularly slow in production, but it was very hard to reproduce as we didn’t know our ORM was using prepared statements under the hood and nothing was looking problematic with a simple EXPLAIN ANALYZE.

The original bug report in an application using pgx
The bug report to the Postgres community. The conclusion is that there’s indeed an issue in the query planner.

Background

Birdie provides technology for homecare providers to deliver outstanding care. In the backend, our services need to do some access control to ensure that users can only access what they are allowed to: for this we were testing the authorization system SpiceDB, an open-source implementation of Google’s Zanzibar which we had identified as a good fit for our requirements.

SpiceDB’s job is to receive requests like “does user X have permission Y on the resource Z?” and answer with “yes” or “no”. We configured SpiceDB to use Postgres as datastore.

The problem

As I tried to send some shadow traffic to our deployment of SpiceDB, things went very wrong. Response times were expected to be in single-digit-milliseconds, but I was observing it to be in minutes. Timeouts and failures were all over the place. It didn’t look good.

I direct my eyeballs at the most obvious thing to spring to mind: the database. A quick run-through of the AWS RDS monitoring yields a clue: CPU was 100% maxed out. I’m already in discussion with the SpiceDB maintainers at this point (a lovely bunch of people), warning that there seems to be something very wrong with the Postgres adapter (which at the time was kind of beta, CockroachDB was the better-supported option).

I’ve done my share of drilling into Postgres shenanigans in the past, I decided to try and gather as much information as possible to zero in on the problem collaboratively. First stop: AWS Performance Insights.

Performance insights

AWS Performance Insight is an AWS service to monitor your database, in particular, it shows the top queries by “wait type” (which is basically different types of load): CPU, disk read, disk write, locks… It is an unintuitive tool, limited in functionality, terribly buggy, basically unmaintained, but yet one of the most useful tools we have to figure out what is wrong in a database.

(EDIT 2024: Perf Insights has been worked on since then, “buggy” and “unmaintained” isn’t a fair assessment anymore although it’s still limited).

At the time, this is what Performance Insight showed us:

Lots of green. Green means CPU. We already knew that the DB CPU was maxed out so not much learning. The “average active session” unit is a bit tricky to interpret, but it’s good enough to point out that having it at >200 is insanely high (our now-healthy database is below 2).

The useful thing about Performance Insight is that it tells you exactly which database queries are causing this load. I don’t have a screenshot of the queries breakdown at this time, but it looks like this:

In my case the top-most heavy query, that was single-handledly responsible for all this CPU usage, was:

SELECT MIN(id), MAX(id) FROM relation_tuple_transaction
WHERE timestamp >= $1;

A first shot at `EXPLAIN ANALYZE`

Alright, I easily identified a query that’s using up a crazy amount of CPU, that’s a good start. It smells a lot like a missing index, so let’s EXPLAIN ANALYZE this. Digging in the SpiceDB code, I find that this $1 is always the same value (pseudocode below, actual code here).

pg.query(
	"SELECT MIN(id), MAX(id) FROM relation_tuple_transaction WHERE timestamp >= $1",
	["now() - interval '5s'"]
)

The “5s” is actually configurable, but I’m certain it’s the value used for my deployment of SpiceDB. Let’s run the EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT MIN(id), MAX(id) FROM relation_tuple_transaction
WHERE timestamp >= now() - interval '5 seconds';

Aggregate  (cost=8.38..8.39 rows=1 width=16) (actual time=0.009..0.009 rows=1 loops=1)
  ->  Index Scan using ix_relation_tuple_transaction_by_timestamp on relation_tuple_transaction  (cost=0.42..8.37 rows=1 width=8) (actual time=0.005..0.006 rows=0 loops=1)
        Index Cond: ("timestamp" >= '2022-03-17 11:17:34'::timestamp without time zone)
Planning Time: 0.183 ms
Execution Time: 0.035 ms

Hm. That’s fast.

The “cost”, ie the estimation of how expensive this query will be, is low: 8.39 is not a lot in “cost units”
The “actual time” is fast: 9 microseconds
The query plan seems perfectly reasonable

I’m confident this is the problematic query, I’m also confident that we’re always querying with now() - interval '5s'. Then… why is my database being overloaded when this query seems so fast and cheap? Volume (queries per second) comes as an obvious hypothesis, but built-in metrics confirm that we’re not seeing a high volume of queries: it does look like we’re dealing with a reasonable volume of very CPU-intensive queries. Which is contradicted by my EXPLAIN ANALYZE. Which confuses me.

Shot in the dark

I have no idea what’s going on and I don’t have much to dig into. In other words, I’m in the dark: let’s shoot in the dark to hopefully hit something that we can dig into.

Among various things I tried, I ended up running a query that gives me an interesting result:

-- Same query, with a different value for `timestamp`
EXPLAIN ANALYZE SELECT MIN(id), MAX(id) FROM relation_tuple_transaction
WHERE timestamp >= '2022-03-15 00:00:00';

Result  (cost=1.07..1.08 rows=1 width=16) (actual time=28103.240..28103.242 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.43..0.54 rows=1 width=8) (actual time=28103.219..28103.220 rows=1 loops=1)
          ->  Index Scan using pk_rttx on relation_tuple_transaction  (cost=0.43..193195.54 rows=1823603 width=8) (actual time=28103.218..28103.218 rows=1 loops=1)
                Index Cond: (id IS NOT NULL)
                Filter: ("timestamp" >= '2022-03-15 00:00:00'::timestamp without time zone)
                Rows Removed by Filter: 548517
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.43..0.54 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=1)
          ->  Index Scan Backward using pk_rttx on relation_tuple_transaction relation_tuple_transaction_1  (cost=0.43..193195.54 rows=1823603 width=8) (actual time=0.015..0.015 rows=1 loops=1)
                Index Cond: (id IS NOT NULL)
                Filter: ("timestamp" >= '2022-03-15 00:00:00'::timestamp without time zone)
Planning time: 0.375 ms
Execution time: 28103.275 ms

Setting the timestamp to a different value, in particular a date that’s further in the past (a couple of days back), gives me a very different result.

The query took very long to execute, 28s here
The query plan is wildly different and much more complex
Interestingly, the cost estimation is still very low. It’s even lower than the estimation I had using “5s ago” as a filter.

That’s a super interesting result. I’m not sure yet how relevant it is to my problem, but it looks shifty enough that I’d like to see if I can draw some correlations with my problem. Here’s what goes through my head:

This execution time is consistent with the response times I’m observing in reality. 28s has to mean a lot of CPU is being spent (very likely from this Rows Removed by Filter: 548517). This query plan is consistent with what the database monitoring shows me.
The “estimated cost” is very low but the query is actually very CPU-intensive: that smells like a query planner issue. That’s not good news: famously, Postgres’ query planner is a black box that users have very little control over, if there’s a problem there I might not be able to find a workaround.
The actual value of timestamp used in the WHERE seems to matter a lot for the query planner.

That gives me a direction for investigation: is it possible somehow that the query plan I get when executing EXPLAIN ANALYZE with the “5s ago” filter isn’t the one that’s being used for SpiceDB? Even if I’m absolutely certain that SpiceDB also uses “5s ago”?

Dig into the planner statistics

When talking about variations in query plans, the usual suspect is Postgres’ use of statistics.

Postgres maintains some stats about the content of your tables (row count, values distribution…) to help create optimised query plans. These statistics change with writes (eg table growing in size) and the query planner might decide to start using a different query plan for a given query. There’s no way to anticipate that change accurately, and there’s no way to force another query plan: that’s why Postgres’ query planner is considered to be a black box.

So when confronted with the problem space of “changing query plans”, “statistics” tingles my spidey sense.

In this case, however, my hypothesis is that I am getting different query plans for the same query on the same database and on the same dataset. That doesn’t really sound like the common issue of statistics changing.

For peace of mind, I run a VACUUM ANALYZE FULL (⚠️ this will lock your table for a while) to definitely clean up statistics and dead-tuples-related noise. No change. Looks like it’s not a statistics problem.

I’m out of theories: looks like my Postgres-fu has a blind spot. it’s time to pay a visit to postgresql.org and stackoverflow.com to learn more about query planning.

Pulling the thread

Honestly, neither the documentation nor Stackoverflow really gives me an epiphany. Nothing seems to explain how a given query on a given dataset would yield different query plans.

But I do smell a faint trail. Although it doesn’t seem to apply to my case, there’s something that keeps appearing through my research: prepared statements (eg this life-saving StackExchange thread).

All over the internet, the only thing that seems to be related to inconsistent query plans is prepared statements. The StackExchange thread explains it well:

PostgreSQL has a built-in heuristics for [prepared] statements whose plans can be cached: it plans them with the actual parameter values for the first 5 times, and then it switches to a generic plan if that plan promises not to perform worse

That would explain it. If prepared statements are somehow involved, PG will take into account the actual value of the timestamp parameter for the first 5 executions: that'll be fast. From the 6th execution onwards, it doesn't consider what value timestamp actually has and always applies a generic query plan: this generic query plan is probably the same as I’ve observed with a different value for timestamp: it is much much slower and CPU-intensive.

Excellent theory! Except…

Nowhere in the code does SpiceDB use prepared statements. The SpiceDB maintainers confirm they don’t use query plans.
Postgres “switches to a generic plan if that plan promises not to perform worse”, but the “generic plan” seems to perform much worse. We saw 28s execution time, and the CPU is maxed out: there’s no way this query plan is better even for different values of the timestamp filter: why would Postgres decide to switch to it?

Hidden prepared statements

Let’s focus on the first open question: “nowhere in the code does SpiceDB use prepared statements”. I’m confident about that, but still… this behaviour of prepared statement matches my observations too closely, and nothing else does: my inner Holmes screams “when you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth”. Even if I don’t understand how prepared statements could be the cause, only prepared statements seem to explain what I’m observing*.*

Around then, I’m also noticing something I should have noticed much earlier. Let’s go back to the very start of my investigation, I identified this query as being the culprit for CPU load:

SELECT MIN(id), MAX(id) FROM relation_tuple_transaction
WHERE timestamp >= $1;

Notice this $1? I didn’t really think anything about it, but now... It’s not saying now() - interval '5s', even though I know that’s the value that’s always used, it says $1. And this $ thing is a prepared statement thing: that’s a parameter placeholder.

I don’t understand how it’s possible that prepared statements are involved, but by now I’m pretty sure that they are.

This StackOverflow question which explains a similar problem gives me another clue. It mentions pgx, which is the Go library that SpiceDB uses. New hypothesis: is it possible that pgx automatically creates prepared statements for queries?

Yes it is. The README mentions as a feature “automatic statement preparation and caching”. So while I was expecting (as did the SpiceDB maintainers) that calling pgx with a statement and parameters would result to something similar to libpq's PQExecParams, it actually creates a prepared statement, more akin to PQExecPrepared.

Something something leaky abstractions.

At this point, we had enough information for Jake from Authzed (the maintainers of SpiceDB) to manage a consistent reproduction of the problem: https://dbfiddle.uk/5N7GmTuz.

Bad generic query plan

Now we know where this bad query plan comes from. There’s still another open question: why would Postgres decide to switch to a generic query plan that’s much worse-performing?

We’ve actually already answered that: the cost estimation of this query plan is completely off, making it a candidate over the custom query plan despite being horrendously slow in reality. In other words, it promises not to be worse, but doesn’t uphold the promise. In other words, it’s a bug in the query planner.

Working around the issue

Fine, I know where this bad query plan comes from: it’s a prepared statement’s generic plan. We have multiple options to fix this in order from “easy but workaroundy” to “complex but addressing the root cause”:

I can fix it for my particular deployment of SpiceDB. Luckily, one of the few things that the query planner does give you some amount of control over are the use of generic query plans. By enabling the setting plan_cache_mode to force_custom_plan, I completely disable the use of generic query plans database-wide: prepared statements will always use a custom query plan. It worked: CPU immediately drops to nothing and my response time goes from minutes to milliseconds. Woop woop.
Other SpiceDB Postgres users will encounter the same issue of course. SpiceDB can work around this issue by making sure that a prepared statement isn’t used for this specific query.
None of this would have been a problem if Postgres’ query planner didn’t yield a bad query plan. It’s a bug of the query planner, it is now in the Postgres community’s hands.

This bug report went quickly over my head, but there was an agreement that there is indeed a problem in the query planner. A patch came out of it but as mentioned in the email chain, “it’s just a band-aid that happens to help in this exact scenario. It’s not doing much for the bad cost estimation that’s the root of the problem”.

Whilst it looks like this might not be fixed soon I hope this article has shown how Postgres performance issues can be debugged!

We’re hiring for a variety of roles within engineering. Come check them out here:

https://www.birdie.care/join-us#jobs-live