Postgres indexes for absolute beginners
Indexes are really important for Postgres performance, but they’re often misunderstood and misapplied. This post aims to give you a good grounding in indexes to avoid a lot of beginner mistakes.
Step one: understand what you want to achieve
Because indexes are such a powerful tool, a new index is often viewed as “the answer” to whatever performance problems people are experiencing. Wading straight in and creating an index for every sequential scan in sight is the simplest thing to do, but indexes have costs as well as benefits.
Not only do indexes take up disk space (and memory, when cached), they raise the cost of writing to the table in question. Any speed-up an index may provide for reads isn’t free: it’s offset by the extra work of keeping the index up to date whenever the data in the table change. So an unused index isn’t just useless, it’s actively harmful to your database’s performance.
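If you suspect you have unused indexes, Postgres keeps usage statistics that can help you find them. A rough check using the built-in `pg_stat_user_indexes` view might look like this (the output is a starting point for investigation, not a list of indexes to drop — an index with no scans may still be enforcing a unique constraint, and the counters reset when statistics are reset):

```sql
-- List indexes that have never been used for an index scan,
-- along with the disk space they occupy, biggest first.
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```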
First, take the time to understand which bits of your query are running slowly (use the query plan), make a hypothesis as to why they’re slow, and then validate that hypothesis by attempting to speed them up.
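If you haven’t read a query plan before, `EXPLAIN` is the tool for the job. As a minimal sketch, assuming a hypothetical `books` table:

```sql
-- ANALYZE actually runs the query and reports real timings;
-- BUFFERS adds how many pages were read along the way.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM books WHERE author = 'Margaret Weis';
```

The plan shows, for each operation, the estimated and actual row counts, the time spent, and whether Postgres chose a sequential scan or an index scan — exactly the information you need to form and test a hypothesis about the slow part.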
In order to understand when the answer might be an index, it’s important to understand the difference between sequential scans and index scans in Postgres.
Sequential scans are the simplest, most obvious way of reading data from a table. Postgres goes to the first block (“page”) that holds rows for the table in question and reads all the data in, row by row, page by page, passing it on.
Sequential scans can get a bit of a bad rap. One of the things we often hear from people when we ask them about their current performance analysis is “the first thing I do is look for sequential scans”.
It’s true that using an index on a table can make a big difference to query performance, but it’s also true that if your query just needs to get all of the data from a table in an unordered mass of rows, then things aren’t going to get any more efficient than just reading those rows in directly from consecutive pages.
An index is just a data structure which holds references to rows in a table, stored in a way that makes it easy to find them based on the row’s values in the indexed column(s).
An index scan reads through the index, uses it to quickly look up which entries match the conditions it’s looking for, and returns them in the order they’re stored in the index.
Postgres then follows these references to fetch the matching rows from the table’s heap — the same place a sequential scan would have read them from.
When indexes are useful — and when they’re not
This two-step process is why, even if the planner could use an index, it doesn’t always choose to. Sure, an index saves you loads of time if it cuts out an expensive sort, or allows you to only read two rows from the table instead of two million. But if you want most of the rows from a table in no particular order, then using an index just introduces an unnecessary extra step and makes Postgres read the pages the table is stored in out of order.
Think of it like reading a book. If you want to read the whole thing, then the most efficient choice is to open it at the first page, start reading, and stop when you reach the end — a sequential scan. If you only want to read what the author has to say about lychees, then the best thing to do is to look up “lychee” in the book’s index, then turn to the pages it lists¹.
There are two common cases when indexes can make a big difference to query performance:
The first case is when a query is looking for a small subset of the rows in the indexed table. This can happen when you filter the rows using a `WHERE` clause, or when you join on a relatively small number of rows. In these situations, Postgres can use the index to quickly find the row(s) it’s looking for, then read them in.
The second case is when the query needs rows from the table in the order that the index has stored them in — or the reverse. Postgres can scan through the index, collecting references to the rows it needs in the order it needs them in, so that when it reads the rows they will already be sorted.
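To see the second case in action, here is a sketch assuming a hypothetical B-tree index on `books (author)`:

```sql
-- With an index on books (author), the planner can walk the index
-- in order and skip a separate Sort step entirely.
EXPLAIN SELECT * FROM books ORDER BY author;
```

Without the index, you would expect the plan to contain a `Sort` node sitting on top of a `Seq Scan`; with it, the planner may instead emit a single `Index Scan`, with no sort at all.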
In pgMustard, we recommend looking at adding an index when we see either a sequential scan which discards a high proportion of the rows it reads, or a sort. In both cases, though, we only suggest improvements to the longer-running operations in the query — it’s important to focus on the parts of the query that will make a real difference to the overall running time.
Different index types
The most common index type by far, and the Postgres default, is the B-tree. There’s a reason for this, and it’s the same as the reason that 90% of the guides, tips and advice out there focus on B-trees.
If your query revolves around filtering data in a way that can be considered “sorting” it, i.e. using the operators `<`, `<=`, `=`, `>=` or `>`, then there’s a reasonable probability that a B-tree index will be a good fit.
Since a large proportion of database queries are either equality checks or ordering comparisons on numeric or date types, B-trees suit the majority of cases.
If you want to do something “unusual” — a common example is spatial or geometric data, using queries like “the nearest point to” or “within this locus” — then you should look into Postgres’s other index types (such as GiST or GIN), but for everyday queries like `ORDER BY price DESC` or `WHERE id = 416`, a B-tree will usually be good enough.
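For instance, a nearest-neighbour search over geometric data is a poor fit for a B-tree, but a GiST index handles it well. A sketch, assuming a hypothetical `shops` table with a `location` column of the built-in `point` type:

```sql
-- GiST, not the default B-tree, is the index type that supports
-- "nearest to" ordering on geometric types.
CREATE INDEX shops_location_idx ON shops USING gist (location);

-- <-> is the distance operator between points; with the GiST index,
-- Postgres can find the nearest shops without scanning every row.
SELECT name
FROM shops
ORDER BY location <-> point '(51.5, -0.1)'
LIMIT 5;
```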
How to create an index
Once you’ve decided what you need, creating an index on the column you want to query on is simple. If I want to index my `books` table by the `author` column, it’s as simple as:

```sql
CREATE INDEX author_idx ON books (author);
```
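Once the index exists, it’s worth checking that your query actually uses it — the planner may still prefer a sequential scan if it decides the index won’t help:

```sql
EXPLAIN SELECT * FROM books WHERE author = 'Robin Hobb';
-- If all is well, the plan will contain something like:
--   Index Scan using author_idx on books
-- If you see a Seq Scan instead, the planner has judged the index
-- not worth using (e.g. the table is tiny, or most rows match).
```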
You can also use an expression, wrapped in parentheses, instead of a column. For example, I love reading a good thick fantasy book, but I always skip past the songs — there’s nothing I hate more than when Raistlin’s nefarious exploits are interrupted by some Gil-Galad type nonsense. I’m always searching my books table to find long books, with as few songs as possible. So I want an index on the ultimate fantasy book metric:
```sql
CREATE INDEX length_idx ON books ((total_pages - song_pages));
```
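One caveat worth knowing: the planner will generally only use an expression index when the query contains the same expression. So with `length_idx` in place, a query like this can take advantage of it:

```sql
-- Filters and sorts on the exact indexed expression,
-- so length_idx can be used for both.
SELECT title
FROM books
WHERE total_pages - song_pages > 800
ORDER BY total_pages - song_pages DESC;
```

A query filtering on `total_pages` alone, by contrast, gets no help from this index.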
Note the double parentheses — this isn’t a typo! One set of parentheses wraps the list (more on this next) of things you’re indexing on, and the other wraps the expression itself, telling Postgres to treat it as a single value.
Often, when you create an index to support a sort order, you want to sort rows by more than one column. No problem: you can create a multi-column index by listing more than one column in the creation statement. You can also specify which order you’d like each column to be sorted in.
For example, if my antipathy towards songs in fantasy novels decreases, I might only want to use the number of songs as a tie-breaker between equally long books. In that case, I could create an index sorting the books on `total_pages` in descending order (biggest first), then break any ties using `song_pages` in ascending order (smallest first):
```sql
CREATE INDEX sort_idx ON books (total_pages DESC, song_pages ASC);
```
An important fact to remember is that Postgres can scan an index backwards as well as forwards, so if I have the index above, there’s no need to also create an index on `(total_pages ASC, song_pages DESC)` for the occasions when I want to find short books that are full of songs.
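In plan terms, both of these queries can be served by `sort_idx`, the second via a backward scan:

```sql
-- Can use sort_idx scanned forwards:
EXPLAIN SELECT * FROM books ORDER BY total_pages DESC, song_pages ASC;

-- Can use the same index scanned in reverse; look for
-- "Index Scan Backward using sort_idx" in the plan:
EXPLAIN SELECT * FROM books ORDER BY total_pages ASC, song_pages DESC;
```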
Next time we’ll cover the other uses — and pitfalls — of multi-column indexes, but they’re perfect for when the main purpose of the index is to pre-calculate a complicated sort order.
Hopefully this has given you a better understanding of when indexes can help you out. As always, benchmarking and testing are key, but it’s good to know the basics.
¹Unless it turns out to be a book about lychees, of course. As with a lot of elements of query planning, row estimates are really important.