How we dramatically optimised query time in a PostgreSQL DB

Amulya Kumar Sahoo
Published in hirewithparam
5 min read · Oct 10, 2018

We recently started using BigQuery, which is awesome, to store 30 million rows of data for faster querying. Today I spent some time trying to speed up some of the queries.

The query that returns the largest amount of data took 7.482s, which in turn pushed the API response time to about 8s. On top of that, BigQuery also charges based on the amount of data processed, which was a costly affair for us: the API runs thousands of times in a single day, which results in multiple TBs of processed data. (As a rough back-of-the-envelope check: 30 million rows at a few hundred bytes each is several GB per full scan, so a thousand runs a day already crosses into TBs.)

I researched the use cases of BigQuery and found that it's overkill for the small amount of data we were using it for. Instead, I thought: why don't we explore PostgreSQL for our use case and see if we can get comparable query times? Even if it only matched BigQuery, it would be profitable, as we wouldn't pay for the data processed per query.

Time for some experiments ;)

I spawned a PostgreSQL DB in Google Cloud SQL, created the required table and loaded our dataset into it (using this script).
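For context, here is a minimal sketch of that setup. Only company_name and position_name are confirmed by the queries below; the column lengths, the rest of the table and the CSV path are illustrative assumptions, not our actual script:

-- illustrative schema: only the two queried columns are real,
-- everything else about the table is assumed
create table hiringpattern (
    company_name  varchar(255),
    position_name varchar(255)
    -- ... remaining columns omitted
);

-- bulk-load the dataset from a CSV export (hypothetical path)
\copy hiringpattern from 'dataset.csv' with (format csv, header true)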

Basically, we need to query the table to fetch all the records whose company name matches a given company. I ran EXPLAIN ANALYSE on the query:

hiringpattern=> explain analyse select * from hiringpattern where lower(company_name)='param';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on hiringpattern (cost=0.00..1505550.16 rows=141459 width=394) (actual time=0.928..164166.470 rows=29434 loops=1)
Filter: (lower((company_name)::text) = 'param'::text)
Rows Removed by Filter: 28262309
Planning time: 0.670 ms
Execution time: 164173.710 ms
(5 rows)

As we can see, the query time in this case is more than 164s: a full sequential scan that filters out all 28 million non-matching rows. Let me tell you, this table is a plain one without any index.
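You can confirm the absence of indexes from the catalog; this quick check simply lists whatever indexes Postgres knows about for the table (at this point it returns zero rows):

-- list all indexes on the table
select indexname, indexdef
from pg_indexes
where tablename = 'hiringpattern';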

Let's try indexing the column we want to query on. While choosing the correct index, we need to ask a few questions (the snippet below shows how to check the first one):
1. What is the datatype of the column we want to index? If it's a string, is it TEXT or VARCHAR?
2. What kind of operator (=, LIKE, etc.) will the query use on it?
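To answer the first question without guessing, the standard information_schema has it (table and column names as in our setup):

-- check the candidate column's declared type
select column_name, data_type
from information_schema.columns
where table_name = 'hiringpattern'
  and column_name = 'company_name';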

Since the column is VARCHAR and we also want to run LIKE-based queries on it, varchar_pattern_ops turned out to be the operator class we needed, and we created our index based on it. (More info in the Postgres documentation on operator classes.)

hiringpattern=> create index patter_ops_company_name on hiringpattern (lower(company_name) varchar_pattern_ops);
CREATE INDEX
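A side benefit worth noting, with an illustrative pattern: an index built with a pattern_ops operator class can also serve left-anchored LIKE queries, though not ones with a leading wildcard:

-- left-anchored patterns can use the pattern_ops index...
explain analyse select * from hiringpattern where lower(company_name) like 'para%';

-- ...but a pattern like '%aram' still forces a full scan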

I ran the same query again after creating the index.

hiringpattern=> explain analyse select * from hiringpattern where lower(company_name)='param';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on hiringpattern (cost=3384.87..397016.38 rows=141459 width=394) (actual time=13.539..90.757 rows=29434 loops=1)
Recheck Cond: (lower((company_name)::text) = 'param'::text)
Heap Blocks: exact=24111
-> Bitmap Index Scan on patter_ops_company_name (cost=0.00..3349.51 rows=141459 width=0) (actual time=8.136..8.136 rows=29434 loops=1)
Index Cond: (lower((company_name)::text) = 'param'::text)
Planning time: 0.690 ms
Execution time: 92.299 ms
(7 rows)

Now comes the most interesting part of this whole process — Clustering.

CLUSTER is a feature provided by PostgreSQL which reorders the rows of a table according to a specific index. As the Postgres documentation puts it:

When a table is clustered, it is physically reordered based on the index information. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order. (If one wishes, one can periodically recluster by issuing the command again. Also, setting the table’s fillfactor storage parameter to less than 100% can aid in preserving cluster ordering during updates, since updated rows are kept on the same page if enough space is available there.)
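Following that advice, a sketch of the upkeep; the fillfactor value of 90 is an illustrative choice, not something we tuned:

-- leave ~10% free space per page so updated rows can stay in place,
-- helping preserve the clustered ordering between reclusters
alter table hiringpattern set (fillfactor = 90);

-- the docs also advise re-running ANALYZE after CLUSTER, since the
-- planner keeps physical-ordering (correlation) statistics
analyse hiringpattern;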

Clustering a table is a time-consuming process; it took about an hour to cluster ~28 million rows. Still, we can afford to redo it periodically for a small dataset (say, under 500 million records). Following is the query analysis after clustering the rows:

hiringpattern=> alter table hiringpattern CLUSTER ON patter_ops_company_name;
ALTER TABLE
hiringpattern=> cluster hiringpattern;
CLUSTER
hiringpattern=> explain analyse select * from hiringpattern where lower(company_name)='param';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on hiringpattern (cost=3384.87..397021.61 rows=141459 width=394) (actual time=9.311..39.961 rows=29434 loops=1)
Recheck Cond: (lower((company_name)::text) = 'param'::text)
Heap Blocks: exact=22573
-> Bitmap Index Scan on patter_ops_company_name (cost=0.00..3349.51 rows=141459 width=0) (actual time=4.775..4.775 rows=29434 loops=1)
Index Cond: (lower((company_name)::text) = 'param'::text)
Planning time: 0.091 ms
Execution time: 41.420 ms
(7 rows)

At this point, I was quite satisfied with the response time. But my hunger for a better response was not going away: this dataset is not that large, and Postgres is too powerful to stop here. We could squeeze out some more performance.

Our main query fetches rows based on the company name as well as a specific position. As you guessed, I created another index, this time on both the company name and the position name.

hiringpattern=> create index title_company_name_idx on hiringpattern (lower(company_name) varchar_pattern_ops, lower(position_name) varchar_pattern_ops);
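(A hedged aside: since a B-tree composite index can serve queries that filter only on its leading column, this new index also covers the plain company-name lookups from earlier.)

-- the leftmost column of the composite index still serves
-- company-only equality filters on its own
explain analyse select * from hiringpattern where lower(company_name) = 'param';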

DB query analysis after this step:

hiringpattern=>  explain analyze select * from hiringpattern where lower(company_name)='param' and lower(position_name) = 'software engineer';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on hiringpattern (cost=27.74..2788.31 rows=700 width=394) (actual time=0.131..0.353 rows=102 loops=1)
Recheck Cond: ((lower((company_name)::text) = 'param'::text) AND (lower((position_name)::text) = 'software engineer'::text))
Heap Blocks: exact=98
-> Bitmap Index Scan on title_company_name_idx (cost=0.00..27.56 rows=700 width=0) (actual time=0.114..0.114 rows=102 loops=1)
Index Cond: ((lower((company_name)::text) = 'param'::text) AND (lower((position_name)::text) = 'software engineer'::text))
Planning time: 0.147 ms
Execution time: 0.389 ms
(7 rows)

Voila! In the final result, we gained a huge performance improvement: from about 164s down to 0.4ms.

We used the following set of optimizations (recapped in one snippet below):
• The correct index operator class, i.e. varchar_pattern_ops rather than the default one.
• An index not just on the single column we filter on, but a composite one across all the columns the query filters on.
• Clustering the table based on that index.
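For easy copy-pasting, here is the whole recipe in one place, with the session prompts stripped. (The final ANALYZE is an extra step the Postgres docs recommend after clustering; it wasn't part of the sessions above.)

-- 1. expression index with the pattern operator class
create index patter_ops_company_name
    on hiringpattern (lower(company_name) varchar_pattern_ops);

-- 2. composite index across both filter columns
create index title_company_name_idx
    on hiringpattern (lower(company_name) varchar_pattern_ops,
                      lower(position_name) varchar_pattern_ops);

-- 3. physically reorder the table by the first index
alter table hiringpattern cluster on patter_ops_company_name;
cluster hiringpattern;

-- 4. refresh planner statistics after the rewrite
analyse hiringpattern;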

If you know of other optimization techniques for DBs (any DB), let's discuss them down in the comments section.

Thank You!
