This post covers standard Postgres B-tree indexes. To make it easier to read, I’ve just referred to “indexes” throughout, but it’s worth bearing in mind that the other Postgres index types behave very differently.
At pgMustard, we’ve been looking into multi-column indexes recently: particularly what problems they can cause and what advice we can offer to fix those problems. So I thought that now would be a good time to talk about their disadvantages, how to tell if they’re letting you down, and the alternatives.
In my introduction to indexing, I talked about multi-column indexes and how useful they can be if you have an expensive sort that you want to prepare in advance. They can also be used to speed up retrieving rows that match a filter or constraint, but unfortunately it’s not as simple as “I need to query on columns a, b and c, so I’ll make a multi-column index on a, b and c”.
When multi-column indexes fall flat
Suppose I have a table full of information on fiction books and I want to query it by author. Then you can imagine my index is a filing cabinet full of index cards, organised in alphabetical order by author, with dividers that make it easy to jump to any point in the catalogue. So if I want to find all the John le Carré books, I look him up under “C” and find a list of excellent spy novels. Great.
Now suppose I want to add another column to my index — publication date. If I add it as a second column after the author, then this is similar to sorting all the cards in my filing cabinet, first by author, and then by publication date within each author.
This works fine if I want to know, for example, all the Ursula K. le Guin books published after 1970. Turn to “Guin, Ursula K. le”, and then within that section, take all the cards that are after the first of January 1970. Sweet.
But if I want to know which books in my library were published before 1700? I have to go through every single author and look up which of their books are pre-1700. The index won’t save us much time, in fact it’s probably faster to just look through all the books on the shelves — a sequential scan.
The situation when using Postgres is actually slightly worse than our library example. Since Postgres index scans are always one continuous block, an index scan would run all the way from the start of the index, up to the last pre-1700 book by the last author, without using the index to skip over all the more recent books in between.
This is what we mean when we say that queries which don’t use the leading column(s) in the index will not benefit from the index. If your constraint doesn’t use the first column(s) in the index —for example if it uses only the second indexed column (as in this case with
author), or if it uses the first and third columns, but not the second — then the index doesn’t help narrow down the search space that much.
For this example, we could solve the issue with another index on just
publication_date — although adding an index for every combination of columns used in a query can end up making your writes very expensive. Unfortunately the problems don’t stop there.
Any query that features clauses joined with an
OR statement, or alternatively has a range constraint on a column that isn’t the last used one in the index (like
publication_date = '1985-12-27' AND author > 'G') will not be able to use a single index scan to select precisely the rows it needs. Imagine the process of jumping to a certain point in our filing cabinet of index cards, and then flicking through cards until there can be no more of the books we are looking for left, and you will understand why.
How to spot inefficient index use
In all of these cases, you’ll see similar symptoms. If the problem is bad enough, the query planner will decide not to use the index at all, opting for a sequential scan instead. At other times, you’ll see a large number of
"Rows Removed by Filter" in the query plan.
This is the count of how many rows had to be scanned on the index, but were rejected on inspecting the column values — ie the rows that weren’t needed, but the condition used on the index (
"Index Cond") didn’t save Postgres from having to scan.
Here’s an example from a JSON query plan:
"Actual Rows": 12,
"Index Cond": "author > 'G'",
"Filter": "publication_date = '1985-12-27'",
"Rows Removed by Filter": 2743734,
You can see the number of rows actually returned, the part of the constraint that is used on the index to avoid scanning all the rows (
author > 'G'), the part of the constraint that had to be checked by inspecting the column values (
publication_date = '1985-12-27'), and the amount of extra rows that were scanned on the index unnecessarily (2,743,734)
Combining single-column indexes with bitmap scans
You don’t always need to use multi-column indexes for queries on more than one column. Postgres can use bitmap index scans and a bitmap heap scan to combine the results of more than one index scan.
A bitmap index scan uses the first part of a normal index scan to produce a bitmap of pages based on the constraint. A bitmap is just what it sounds like — a sequence of bits, each set to 1 or 0 depending on whether the page it refers to might contain relevant rows or not.
Since combining sequences of bits with boolean operators like
OR is really cheap in terms of CPU effort, it’s a simple matter to take the bitmap of pages that might contain rows for which
author = 'GUIN, URSULA K. LE' and an equivalent bitmap for
publication_date > '1970–01–01' and produce one for
author = 'GUIN, URSULA K. LE' AND publication_date > '1970–01–01'
Once Postgres has the bitmap of which pages might have relevant rows, a bitmap heap scan reads in the pages recorded in the bitmap, and then filters out any rows in those pages which don’t match the constraint.
This method of combining individual index scans is very flexible, and is often the most efficient way of reading rows with more complex constraints.
If I created an index on author name, and a separate one on publication date, it would solve all three of the problems we raised with multi-column indexes above. For example, finding books written by my two favourite authors can be done by an index scan to create a bitmap containing all the Ursula K. le Guin books, and bitwise
OR-ing that with a similar bitmap for all the books written by John le Carré.
One thing to notice about bitmap scans is that when Postgres eventually fetches the rows from the table in the heap, it reads the pages in order.
This can be an advantage: reading pages in order is quicker than having to jump around in the heap — in fact, the query planner sometimes chooses a bitmap heap/index scan even when it’s only doing one index scan, rather than combining multiple scans.
It can also be a disadvantage: The rows are returned in the order they are stored by Postgres in memory or on disk — ie, no order in particular. Since sorts can be very expensive, this can tilt the balance in favour of a multi-column index if you want to sort a lot of rows.
Single-column indexes vs multi-column indexes
When the use case is exactly right, a multi-column index will win out. Using a single index to skip straight to the one row you want, maybe even using an index-only scan, will be the fastest option — when it’s possible.
But a combination of single-column indexes is unlikely to be a terrible solution. When the query isn’t favourable, a multi-column index can crash and burn, but that’s a lot less likely to happen with single-column indexes, which are more flexible.
The Postgres docs go so far as to say:
“Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time”
Ultimately, benchmarking and testing your use-case are the only real way to work out which is the right solution for you, but it’s definitely wise to consider multiple indexes over a multi-column index.