Indexing with Postgres: When Less is More
A few weeks ago I was digging into some performance issues on Plangrid Workspaces. Workspaces are a way for people on a project to carve out a smaller set of documents and sheets for focused work with others. We think they’re awesome.
Almost everything Plangrid stores is related to a specific project, and every sheet (page of a blueprint) is grouped that way. One of the interesting things about working at Plangrid has been discovering how the concepts we use for software engineering map to the world of physical construction. Construction projects have “version control” just like software projects: sheets (blueprints) have revisions made, sometimes many times, after construction begins.
As we built the initial implementation of Workspaces, we wanted to be able to treat them as “real” projects as much as possible. To support Workspaces we use a join table that marks specific resources as exposed in the Workspace. For example, project documents that are exposed in a Workspace are noted in the
In the case of sheets, we expose them based on their history set — a grouping for all the historical versions for a sheet. It was this query that I started investigating: “give me all the sheets whose history identifier has been marked as exposed”.
I went into the investigation expecting to find what I’d found before: a contorted join, a query that didn’t specify the conditions needed to utilize a compound index (ie, missing a
project_uid condition), or just a missing index in general. We use SQLAlchemy as our database layer; like any abstraction, it can make it easy to falsely assume a query will perform well in production.
However, skimming the Python query I didn’t see anything that jumped out at me.
The key here — or so I thought when I wrote the query — is including
project in both the
join conditions. This is a pretty common pattern in our code base, since most of our tables are indexed on
(project_uid, uid). Recall that Postgres can use compound indices for partial matches, but only left to right. So our
(project_uid, uid) compound index can be used to filter down to a manageable set of rows for the project, but can’t be used to find a specific row.
“No biggie,” I thought, “this query should be using the index on both
workspace_history_sets, since we’re indexing in both cases.” But when I actually did an
explain analyze, I saw something in the query plan I didn't immediately understand.
Indexes & Query Planning
We have a read-only follower for our production database running on similar hardware that allows us to test query performance without impacting our users. The SQL + plan for the query looked something like the following.
Now you might be thinking that 1824ms isn’t that bad. But it’s a lot more than I expected, especially since I expected this to be utilizing indices aggressively.
There are two halves to this plan, one for
sheets and one for
workspace_history_sets (which makes sense, since we’re hitting two tables); if you squint, you’ll see they’re shaped about the same: index scan, condition re-check, heap scan.
Let’s read the second half of the plan (for sheets) from the inside out, bottom up. At the deepest level of nesting we have two
Bitmap Index Scan operations on two different indices for the Sheets table. Each of those have an index condition (the
project_uid filters, respectively). We take the result of those scans and apply
Bitmap And, which feeds into our
Bitmap Heap Scan on the Sheets table. In order to understand what Postgres is doing, we need to know what each of those operations actually does.
When Postgres uses an Index, it can do so in one of a couple different ways. The best is what’s referred to as an Index Only Scan: this means Postgres is able to use the index alone and doesn’t have to load any additional data to answer the query. Indexes are often cached in memory making this the ideal situation. A Bitmap Heap Scan, on the other hand, means that Postgres uses the index to figure out what portions of the table it needs to look at, and then fetches those from disk to examine the rows. (For the record, when I
explained this query I expected a single Bitmap Heap Scan for each table to fetch the rows for the given project, then limit it further.) Bitmap Heap Scans aren’t inherently bad; in fact, they include a built-in optimization to only fetch from disk once we know what we need which can avoid unnecessary duplicate fetches.
But in this case we’re doing two of them, and then performing a Bitmap And, limiting the result in memory to the common rows from each individual scan. And then there’s that Recheck Condition, which sounds… suspect (but which turned out to be a red herring in this particular case).
When performing a Bitmap Index Scan, Postgres will attempt to store the exact matching tuples in memory; if there isn’t sufficient working memory available, though, it operates in lossy mode, storing page references. In that case it has to perform a second step to further refine the pages to tuples. We know that didn’t happen here because
explain analyze tells us how the recheck did:
Heap Blocks: exact=1800
If the query had been operating in lossy mode, it would have included the number of block removed (i.e.,
Once all of that is done we can do our Bitmap Heap Scan on
sheets. If that sounds a lot like the Bitmap Heap Scan on the index, it’s because it is: Postgres figures out the blocks it needs upfront, and then fetches them from disk all at once. The important (sad) thing here is that we have to go to disk again.
If you’re keeping track, that’s three rounds of disk access just to serve the Sheets portion of this query.
Identifying Poor Query Utilization
You might look at this and say, “but all these accesses are hitting indexes, so it can’t be that bad!” And you’d be correct, to an extent: if the indexes weren’t there, this would be a pathological query as it’d have to scan the entire sheets table.
But we can look at the query and the indexes Postgres chose and apply some critical thought to our data model. The two indexes we’re using on the
sheets table index
(project_uid, uid), respectively. “But”, I hope you’re sputtering, “every sheet in the same
history_set is by definition in the same
The problem is that we’ve told Postgres it needs to limit our Sheets by project and history_set. And we (I) did that thinking it’d help Postgres find the rows we care about. But because I didn’t realize there was an index on
history_set, as well, I actually created more work for Postgres.
After some thinking and experimentation, I wound up with the following query in Python.
Note that the filters on Project are gone completely: we query on
The plan looks better, too.
Look at that: a single Index Scan for
sheets, followed by an Index Only Scan for
workspace_history_sets. And for those of you a little more quantitatively oriented, two orders of magnitude improvement on execution time.
Postgres is very good at doing exactly what you ask it to. Experience (and the excellent explain documentation) can help hone your instincts, but nothing beats verifying your assumptions. For that, it’s important to run your explanations on a production-like database: Postgres uses table size, index cardinality, and resource availability when planning queries.
This is a case where less is really more in a Postgres query. It was a good reminder for me that it’s important to look at what indexes are on your tables and what the query plan for them is. “Less is more” isn’t an axiom you can live by, but “measure twice, cut once” is.