PostgreSQL Multi-column Indexes in Depth

6 min readSep 10, 2023

Hello Folks,

Did any of the following questions hit your mind while building the multi-column indexes?

If I tell you about my case honestly, these questions came into my mind but I let them pass through or didn’t think much, I used to be in the mode — bhai query slow hai, query k saare columns per index bnaya and done 😀 (I created the index on all the required columns supplied in the query). Are you doing the same? Trust me I was making a big mistake that I realized when found the answer to all these questions. Stay tuned, you will also take a relieved breath after reading this. Read the article till the end as the most important stuff is in the end only.

As I believe in trying things on the machine, so will take you step-by-step from generating the data to understanding the query plan.

I created an employees2 table and populated it with the 2 million records. I referred to this article for the same —

CREATE TABLE EMPLOYEES2 (
  id SERIAL PRIMARY KEY, 
  first_name VARCHAR(50), 
  last_name VARCHAR(50), 
  email VARCHAR(50), 
  mobile_no BIGINT, 
  date_of_birth DATE
);

Create an index idx_mobileno_firstname_lastname and reindex the table using the below queries -

CREATE INDEX idx_mobileno_firstname_lastname 
ON employees2 (mobile_no, first_name, last_name);

REINDEX INDEX idx_mobileno_firstname_lastname;

Let’s try to find answers for all the questions 1-by-1:

1. How does the index scan work when queried on the columns of a multi-column Index?

From Postgres official doc -
A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that do not have an equality constraint, will be used to limit the portion of the index that is scanned.

Let's analyze the query plan using EXPLAIN ANALYZE for different queries

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no = 6888888888 
  and first_name = 'RAM' 
  and last_name = 'Zmhslylhuk'

Focus on line no. 1 & 2 in the query plan. The query works well as all the indexed columns(mobile_no, first_name, and last_name) were supplied with the equality constraint.

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no = 6888888888 
  and first_name = 'RAM' 
  and last_name <> 'Zmhslylhuk'

As you can see Index scan happened on the mobile_no & first_name and the Filter on the last_name, This happened because the last_name has an inequality constraint. Hope you got the point now from the official doc — equality constraints on leading columns, plus any inequality constraints on the first column that do not have an equality constraint, will be used to limit the portion of the index that is scanned.

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  first_name = 'RAM' 
  and last_name = 'Zmhslylhuk'

As the leading column of the index i.e. mobile_no not supplied in the query, no index scan happened. It’s just a sequential scan. So, it’s mandatory to supply the leading column of the index in the query.

2. Can the non-indexed columns be used with the subset of the indexed columns of an Index?

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no = 6888888888 
  and last_name = 'Zmhslylhuk' 
  and email = 'test@yopmail.com'

In this query, we skipped the first_name column of the index. So, it's okay if we skip an index column, but remember leading column of the index is always required. If you notice, here a non-indexed column i.e. email is supplied with the indexed columns and will be part of the Filter.

3. Does the order of the columns defined in the WHERE clause matter?

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no = 6888888888 
  and email = 'test@yopmail.com' 
  and last_name = 'Zmhslylhuk'

We just changed the order of the columns supplied in the previous query but the query plan remains unchanged. It gets handled by the query planner.

4. When a query stops using the defined index even if it has all the indexed columns in the where clause?

Query with low-range -

EXPLAIN ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no < 2888888888 
  and first_name = 'RAM' 
  and last_name = 'Zmhslylhuk'

Query with high-range -

ANALYZE 
select 
  * 
from 
  employees2 
where 
  mobile_no < 6888888888 
  and first_name = 'RAM' 
  and last_name = 'Zmhslylhuk'

Did you notice something? We are supplying all the index columns in the query, but as soon as the range increases on the mobile_no column, the query plan is converted to the parallel sequential scan. If there are large range queries to be applied on some column, define that column in the last of the index and put the columns needed for the exact match in the starting. The above query will use the index scan if our index is defined as below -

CREATE INDEX idx_firstname_lastname_mobile_no 
ON employees2 (first_name, last_name, mobile_no);

5. Does the order of the columns defined in the index matter?

This question is an extension of question no. 4, I found this blog really useful regarding this, so the writer of this blog says -

If you are supplying all the column values referenced in the index, I assumed it didn’t matter what order the columns were specified in the index, but Robert Haas recently mentioned this is not always true. For example, if a has many duplicate values, and b has mostly unique values, having a at the start of the index is suboptimal — better to use b first, which will more effectively narrow the search space when looking for matches on a.

If I explain the above quote — it simply says to use the columns with unique values at the start of the index because duplicate values increase the search space even if we supply the exact match in the WHERE clause.

Conclusion

If we create the multi-column indexes mindfully to give the optimal performance it will result in optimal resource utilization and optimal cost. Below are a few thumb rules -

Write all the possible queries that you are going to run optimally and list down all the columns used.
Divide all the listed columns in rule no.1 into three categories
- Need an exact match
- Needed in-range query
- Have duplicate values
Put the columns with the exact match first,
In the last, the range query and with the duplicate values. We need to brainstorm in deciding whether the range should be before or after the columns with duplicate values as it will vary with the data and use cases.

Don’t forget that multiple indexes may be required for the various groups of queries.

Thanks for reading. Happy Learning !!

References