The Postgres Performance Tuning Manual: Indexes
How to modify and improve query times in Postgres with the help of indexes.
In a previous post, we discussed how we can use query plans to analyse the run time of queries. In case you missed it, you can read it here:
In this post, we look at how to overcome slow queries by analysing them with EXPLAIN and ANALYZE, and using indexes to improve query timings.
Postgres supports several kinds of indexes to make queries on a table faster.
Multiple column indexes
A multi-column B-Tree index can be used with query conditions that involve any subset of the index’s columns. This index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.
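As a sketch of this rule, consider a hypothetical three-column index (the table and column names here are assumed for illustration, not taken from the schema used later in this post):

```sql
-- Hypothetical composite index; column names are illustrative.
CREATE INDEX idx_schemes_composite ON schemes (region, category, start_time);

-- Uses the index efficiently: equality on the leading columns,
-- plus a range condition on the next column.
SELECT * FROM schemes
WHERE region = 'jakarta' AND category = 'food' AND start_time < '10:00:00';

-- Can still use the index, but scans a larger portion of it:
SELECT * FROM schemes WHERE region = 'jakarta';

-- No constraint on the leading column, so the whole index
-- would have to be scanned; the planner will rarely choose it:
SELECT * FROM schemes WHERE category = 'food';
```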
Covering Index
An index that contains all the columns a query selects, so the query can be answered from the index alone without visiting the table (an index-only scan).
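For example, assuming Postgres 11 or later, the INCLUDE clause stores extra columns in the index leaf pages without making them part of the search key:

```sql
-- Covering index: scheme_name is the key, start_time is carried along.
CREATE INDEX idx_scheme_name_covering
    ON schemes (scheme_name) INCLUDE (start_time);

-- This query can be satisfied by an index-only scan,
-- never touching the table's heap pages:
SELECT scheme_name, start_time
FROM schemes
WHERE scheme_name = 'weekend_scheme';
```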
Unique Index
A unique index is an index used to enforce uniqueness of a column’s value or the uniqueness of a combined value of more than one column.
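A minimal sketch of both forms (column names assumed for illustration):

```sql
-- Uniqueness on a single column:
CREATE UNIQUE INDEX idx_schemes_code ON schemes (scheme_code);

-- Uniqueness on a combination of columns: the pair must be
-- unique, though each column may repeat on its own.
CREATE UNIQUE INDEX idx_schemes_name_region ON schemes (scheme_name, region);
```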
One of the most misunderstood aspects of indexing is knowing when to use a primary key, a unique constraint, or a unique index. Let's work through this with a problem:
Problem statement
We require maximum performance with no duplicate data. Which is the better approach? Primary key, unique constraint or unique index?
The Solution
Note: NULL values are never equal to each other, so multiple NULLs are not treated as duplicates by a unique index.
- Postgres automatically creates a unique index when a primary key or unique constraint is defined on a table. Creating a unique index on those columns yourself would be redundant, and unnecessary indexes degrade write performance. The recommended approach is to create a unique constraint and let Postgres manage the underlying index.
- Postgres creates an index for the defined primary key itself.
- When we create a unique constraint, Postgres automatically creates a unique index behind the scenes.
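The points above can be verified directly in psql (table name assumed for illustration):

```sql
-- A primary key and a unique constraint each create a unique index
-- automatically; no explicit CREATE UNIQUE INDEX is needed.
CREATE TABLE schemes_demo (
    id          bigint PRIMARY KEY,  -- creates index schemes_demo_pkey
    scheme_name text   UNIQUE        -- creates index schemes_demo_scheme_name_key
);
```

Running `\d schemes_demo` in psql lists both auto-created indexes.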
However, there are cases where even indexing can't improve performance. One such case is case-insensitive search. Let's compare the query cost of a case-sensitive and a case-insensitive search on our schemes table, given that we have an index on the column scheme_name.
EXPLAIN ANALYSE SELECT * FROM schemes WHERE scheme_name = 'weekend_scheme'
QUERY PLAN | Index Scan using idx_scheme_name on schemes (cost=0.28..8.29 rows=1 width=384)
Planning Time: 0.155 ms
Execution Time: 0.063 ms
EXPLAIN ANALYSE SELECT * FROM schemes WHERE lower(scheme_name) = 'weekend_scheme'
QUERY PLAN | Seq Scan on schemes (cost=0.00..69.00 rows=5 width=384)
Filter: (lower((scheme_name)::text) = 'weekend_scheme'::text)
Rows Removed by Filter: 999
Planning Time: 0.119 ms
Execution Time: 0.721 ms
Even though we have an index on scheme_name, the lower() function prevents it from being used: the index stores the original values, not their lower-cased form, so Postgres falls back to a sequential scan and applies lower() to every row.
Cases when an index is not used (although it is defined):
- LIKE '%scheme' will never use an index, but LIKE 'scheme%' can possibly use it.
- An upper() or lower() function applied to an indexed column in the WHERE clause.
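One caveat worth noting on the LIKE 'scheme%' case: under a non-C collation, a plain B-tree index cannot serve prefix searches, and the index needs the text_pattern_ops operator class (index name here is assumed):

```sql
-- text_pattern_ops lets LIKE 'prefix%' use the index even when the
-- database collation is not C.
CREATE INDEX idx_scheme_name_pattern ON schemes (scheme_name text_pattern_ops);

SELECT * FROM schemes WHERE scheme_name LIKE 'weekend%';  -- can use the index
SELECT * FROM schemes WHERE scheme_name LIKE '%scheme';   -- cannot, leading wildcard
```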
So whenever we want to use a function in our WHERE clause, we can create an expression index to optimise the query (with a new name, since idx_scheme_name already exists): CREATE INDEX idx_scheme_name_lower ON schemes (lower(scheme_name))
EXPLAIN ANALYSE SELECT * FROM schemes where lower(scheme_name) = 'weekend_scheme'
QUERY PLAN | Bitmap Heap Scan on schemes (cost=4.32..19.83 rows=5 width=384)
Recheck Cond: (lower((scheme_name)::text) = 'weekend_scheme'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan (cost=0.00..4.32 rows=5 width=0)
Index Cond: (lower((scheme_name)::text) = 'weekend_scheme'::text)
Planning Time: 1.784 ms
Execution Time: 0.079 ms
Partial Index
Postgres supports an index over a subset of the rows of a table (known as a partial index). It's often the best way to index our data if we repeatedly query rows that match a given WHERE clause. Let us see how we can enhance the performance of Postgres using partial indexing.
Problem statement
We want to return all the schemes which are supposed to run before 10:00 am.
EXPLAIN ANALYSE SELECT * FROM schemes WHERE start_time < '10:00:00'
QUERY PLAN | Seq Scan on schemes (cost=0.00..66.50 rows=9 width=23)
Filter: (start_time < '10:00:00')
Rows Removed by Filter: 991
Planning Time: 0.082 ms
Execution Time: 0.226 ms
We could create a regular index on the start_time column, but on a huge table that makes every insert, update, and delete more expensive. So we create an index with a condition instead. This kind of index is useful when we know the shape of our select queries in advance. Say we do heavy reads on the schemes which start before 10:00:00 and rarely query the later ones: CREATE INDEX idx_scheme_start_time ON schemes (start_time) WHERE start_time < '11:00:00'
EXPLAIN ANALYSE SELECT * FROM schemes WHERE start_time < '10:00:00'
QUERY PLAN | Bitmap Heap Scan on schemes (cost=4.21..29.30 rows=9 width=23)
Recheck Cond: (start_time < '10:00:00')
Heap Blocks: exact=8
-> Bitmap Index Scan (cost=0.00..4.21 rows=9 width=0)
Index Cond: (start_time < '10:00:00')
Planning Time: 1.729 ms
Execution Time: 0.075 ms
This reduces the execution time from 0.226 ms to 0.075 ms.
Let's validate that we have not indexed the schemes where start_time is after 11:00 am.
EXPLAIN ANALYSE SELECT * FROM schemes WHERE start_time > '12:00:00'
QUERY PLAN | Seq Scan on schemes (cost=0.00..66.50 rows=6 width=23)
Filter: (start_time > '12:00:00')
Rows Removed by Filter: 993
Planning Time: 0.101 ms
Execution Time: 0.228 ms
This proves that only part of the schemes table is indexed and the rest is not. The index stays small and easy to maintain, which also makes reindexing cheaper.
Query plan on JOINS
The optimizer needs to pick the right join algorithm when multiple tables are joined in a select statement. Postgres chooses between three join algorithms based on the sizes of the tables, the available indexes, and table statistics.
Nested Loop: For each row of the first table, the planner scans the second table, using either a sequential scan or an index scan. A sequential scan of the inner table is typically chosen when that table is small; otherwise, the usual logic for choosing between a sequential scan and an index scan applies.
Hash Join: In this algorithm, the planner creates a hash table of the smaller table on the join key. The larger table is then scanned, searching the hash table for the rows which meet the join condition. This requires a lot of memory to store the hash table in the first place.
Merge Join: This is similar to the merge step of a merge sort. The planner sorts both tables on the join attribute, then scans them side by side to find the matching values.
EXPLAIN SELECT schemes.rules FROM scheme_rules JOIN schemes ON (scheme_rules.scheme_id = schemes.id ) where scheme_name = 'weekend_scheme';
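To see which algorithm the planner picked, and to experiment with the alternatives, the planner's join methods can be toggled per session with the enable_* settings (an investigation aid, not something to set in production):

```sql
-- Disable hash joins for this session to force the planner onto its
-- next-best join strategy, then compare the EXPLAIN output:
SET enable_hashjoin = off;

EXPLAIN SELECT schemes.rules
FROM scheme_rules
JOIN schemes ON (scheme_rules.scheme_id = schemes.id)
WHERE scheme_name = 'weekend_scheme';

-- Restore the default behaviour:
RESET enable_hashjoin;
```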
Downsides of indexes in production environments
Finding unused indexes
In a large production environment, it is advisable to find unused indexes, because every index consumes disk space and slows down writes. The Postgres wiki details how to find index usage summaries, duplicate indexes, and index sizes.
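One common way to spot candidates is to query the statistics collector for indexes that have never been scanned. This is a sketch only; always double-check before dropping anything, since statistics can be reset and replicas may use indexes the primary does not:

```sql
-- User indexes that have never been scanned since stats were last reset,
-- largest first:
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```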
CREATE/DROP index vs CREATE/DROP index concurrently
Creating or dropping an index in a large database can take hours or even days, and the CREATE INDEX command blocks all writes on the table (it doesn't block reads, but this is still not ideal).
However, an index created with CREATE INDEX CONCURRENTLY does not take locks that block writes. When creating an index concurrently, Postgres first scans the table to build the index, then makes a second pass to add entries for rows that changed during the first pass.
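A minimal sketch (index name assumed), with the one operational caveat worth remembering:

```sql
-- Builds the index without blocking concurrent writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_scheme_start ON schemes (start_time);
```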
Creating an index concurrently also has a downside though. If something goes wrong during the process, it does not roll back, and leaves an invalid index behind. Invalid indexes can be found using the following query.
SELECT * FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid;
Rebuilding indexes
REINDEX rebuilds an index from the data stored in its table, replacing the old copy of the index. If we suspect corruption of an index on a table, we can simply rebuild that index, or all indexes on the table, using REINDEX INDEX or REINDEX TABLE.
REINDEX is similar to a drop and recreate of the index in that the index contents are rebuilt from scratch. However, locking considerations are rather different. REINDEX locks out writes but not reads of the index’s parent table. It also takes an exclusive lock on the specific index being processed, which will block reads that attempt to use that index.
Another option is to drop the index concurrently and recreate it concurrently.
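That option looks like the following sketch (index name assumed). As a side note, Postgres 12 and later also offer REINDEX INDEX ... CONCURRENTLY, which does this in a single command:

```sql
-- Rebuild an index without blocking writes, in two steps:
DROP INDEX CONCURRENTLY idx_scheme_name;
CREATE INDEX CONCURRENTLY idx_scheme_name ON schemes (scheme_name);
```

Between the two steps the index does not exist, so queries that relied on it will temporarily be slower.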
Conclusion
This post aimed to provide an overview on how Postgres queries the database. By understanding query plans better and carefully taking measures (mostly through indexes), we can get the best performance out of the Postgres database.
There are other ways to enhance query performance, but we’ll save those for later posts.
Further Reading
Efficient, optimised, lean… These are words you'll often hear at our offices. At GOJEK, getting something to work is only half the battle. Getting it to work as efficiently as possible with minimal intervention, that's when we do the high fives. We are also looking for more people to join our teams. Head over to gojek.jobs, and grab the chance to help us build better, more efficient solutions.