Performance Optimisation for Wildcards Search in Postgres (Trigram Index)

Eresh Gorantla
The Startup
Published in
7 min readOct 24, 2020

I am working on one of my projects where Postgres happens to be the relational database and has a big table, currently had 10 Million records and can bloat up more. The challenge we have is wildcard search is becoming costly and takes more time to fetch results.

I would like to demonstrate with a sample database and data in it.

CREATE TABLE user_details (
id BIGSERIAL PRIMARY KEY,
user_uuid UUID NOT NULL,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email_id TEXT NOT NULL,
security_number TEXT NOT NULL,
phone_number TEXT NOT NULL,
created_at TIMESTAMP NULL DEFAULT now(),
updated_at TIMESTAMP NULL DEFAULT now()
);

Insert around 10 Million records in to the table so that we could analyse the performance of search query. Our requirement is to search for name which matches from either from first_name or last_name.

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
do
$$
begin
for
r in 0..10000000 loop
insert
into user_details (user_uuid, first_name, last_name, email_id, security_number, phone_number) values(uuid_generate_v4(), get_random_string(8), get_random_string(7), concat_ws('@', substring(md5(random()::text), 0, 9), 'gmailcom'),upper(concat(substring(md5(random()::text), 0, 12))), cast(floor(random()*(9999999999-1000000000+1))+1000000000 as text));
end loop;
end
$$

Function :- get_random_string (< integer>, <string data>)

CREATE FUNCTION get_random_string(IN string_length INTEGER,  
IN possible_chars TEXT DEFAULT 'abcdefghijklmnopqrstuvwxyz')
RETURNS text
LANGUAGE
plpgsql
AS $$
DECLARE
output
TEXT = '';
i INT4;
pos INT4;
BEGIN
FOR
i IN 1..string_length LOOP
pos := 1 + CAST( random() * ( LENGTH(possible_chars) - 1) AS INT4 );
output := output || substr(possible_chars, pos, 1);
END LOOP;
RETURN output;
END;
$$;

Now we have 10 Million rows in the table, lets try the query for wildcard that matches either first_name or last_name.

Query:
explain analyze select * from user_details where (upper(first_name) like '%UHX%' or upper(last_name) like '%UHX%');

QUERY PLAN                                                                                                
----------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..331553.53 rows=795761 width=98) (actual time=3.773..1023.569 rows=7156 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on user_details (cost=0.00..250977.42 rows=331567 width=98) (actual time=6.108..1
Filter: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text))
Rows Removed by Filter: 3380952
Planning Time: 0.068 ms
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.614 ms, Inlining 0.000 ms, Optimization 0.972 ms, Emission 12.327 ms, Total 14.913
Execution Time: 1024.575 ms

The sequential scan is performed on the query.

We do not have any indexes on columns, the query execution time > 1 sec and this can increase more if more rows matches for the wildcard.

Basic way to improve query performance is to create index. Here, in this case index has to be created for upper case for both first_name and last_name as case insensitivity involved. By default, Btree index is created.

CREATE INDEX in_user_details_first_name_last_name ON user_details USING btree (upper(first_name), upper(last_name))

Let us try to execute query again as index is created on both the columns. The order is important in compound indexes otherwise the index won’t be scanned by postgres query optimizer.

QUERY PLAN                                                                                               
---------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..331553.53 rows=795761 width=98) (actual time=3.069..1040.673 rows=7156 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on user_details (cost=0.00..250977.42 rows=331567 width=98) (actual time=7.160..
Filter: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text))
Rows Removed by Filter: 3380952
Planning Time: 4.436 ms
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.492 ms, Inlining 0.000 ms, Optimization 1.868 ms, Emission 15.460 ms, Total 18.820
Execution Time: 1041.511 ms

Even if there is an index postgres query optimizer didn’t scan index instead did sequential scan.

Now let us re arrange the query to something like this

Query:

explain analyze select * from user_details where (upper(first_name)) = ‘UHXMMMM’;

QUERY PLAN                                                                                               
---------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on user_details (cost=1401.87..110343.74 rows=50750 width=98) (actual time=0.027..0.027
Recheck Cond: (upper(first_name) = 'UHXMMMM'::text)
-> Bitmap Index Scan on in_user_details_first_name_last_name (cost=0.00..1389.18 rows=50750 width=0)
Index Cond: (upper(first_name) = 'UHXMMMM'::text)
Planning Time: 0.098 ms
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.756 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.756 m
Execution Time: 0.905 ms

Now Postgres query optimizer had used the index that we created earlier. The Btree index doesn’t support for wildcard searches.

Let’s check if index is scanned or not.

Query:

select indexrelname, idx_scan, idx_tup_read, idx_tup_fetch from pg_stat_user_indexes where relname = ‘user_details’;

indexrelname                        |idx_scan|idx_tup_read|idx_tup_fetch|
------------------------------------|--------|------------|-------------|
user_details_pkey | 0| 0| 0|
in_user_details_first_name_last_name| 4| 0| 0|

The index has executed 4 times already (happens to be I executed them for 4 times).

Other alternatives

There are different kind of indexes are present in postgres.

Full Text Search with its full text indexes is not for the LIKE operator at all, it has its own operators and doesn't work for arbitrary strings. It operates on words based on dictionaries and stemming. It does support prefix matching for words, but not with the LIKE operator

Some forums and posts suggested to use varchar_pattern_ops, text_pattern_ops but that didn’t help much.

But I read few articles that both the above patterns works well for prefix matching strings ie., ‘UHX%’.

Then comes the solution Trigram indexes for “LIKE” operators.

pg_trgm is a PostgreSQL extension providing simple fuzzy string matching. It's operational and conceptual overhead is much lower than that of PostgreSQL full-text search or a separate search engine. On top of that it can speed up LIKE, ILIKE, ~ and ~* with trigram indexes, a new index type added by the extension.

In general, pg_trgm can help when:

  1. You need fuzzy case-insensitive string matching.
  2. You want to speed up LIKE, ILIKE, ~ or ~*.
  3. You want to search for patterns that aren’t left-anchored (e.g. %hello%). Such patterns aren't supported by B-tree indexes.

Theory on pg_trgm index

A trigram is a sequence of three consecutive characters in a string. For example, the trigrams of Hello are Hel, ell, and llo. PostgreSQL splits a string into words and determines trigrams for each word separately. It also normalizes the word by downcasing it, prefixing two spaces and suffixing one. Non-alphanumeric characters are considered to be word boundaries. Note that downcasing makes trigrams case-insensitive. Can be seen the values simply like below.

CREATE EXTENSION pg_trgm;
select show_trgm('Hello');
show_trgm |
-------------------------|
{ h, he,ell,hel,llo,lo }|

Let us see trgm index on actual table.

select show_trgm(first_name) as first_name_trgm, show_trgm(last_name) as last_name_trgm from user_details limit 20;first_name_trgm                      |last_name_trgm                   |
-------------------------------------|---------------------------------|
{ t, tu,ct ,gct,ouq,qgc,tuo,uou,uqg}|{ q, qi,buz,iyb,qiy,uza,ybu,za }|
{ j, jd,dss,jds,pd ,ppd,ssu,sup,upp}|{ r, rx,dfh,fh ,hjd,jdf,rxh,xhj}|
{ s, sr,csx,rvc,srv,sxz,vcs,xzj,zj }|{ i, ih,ewq,hew,ihe,mg ,qmg,wqm}|
{ v, vd,bul,dlr,lrs,rsb,sbu,ul ,vdl}|{ n, nd,blc,cs ,dfb,fbl,lcs,ndf}|
{ i, im,cl ,hum,imv,mcl,mvh,umc,vhu}|{ k, ky,byp,kyr,pd ,rby,ypd,yrb}|
{ v, vj,dwv,jdw,nq ,vjd,vyn,wvy,ynq}|{ z, zd,dfd,dpz,fdp,pzp,zdf,zp }|
{ s, sr,koo,nvk,oov,ov ,rnv,srn,vko}|{ c, cd,cdx,dxv,kt ,ukt,vuk,xvu}|
{ g, gy,dw ,frq,gyz,qdw,rqd,yzf,zfr}|{ p, pm,byt,mby,pmb,tuu,uu ,ytu}|
{ m, mc,cvz,ind,mcv,nd ,qin,vzq,zqi}|{ u, us,izs,swi,sy ,usw,wiz,zsy}|
{ q, qj,jjn,jnn,nnx,nxu,qjj,uj ,xuj}|{ x, xr,rwy,vv ,wyy,xrw,yvv,yyv}|
{ a, ac,ach,ay ,chp,day,hda,hph,phd}|{ a, ac,acw,cwf,fgo,gop,op ,wfg}|
{ j, je,eww,gw ,iyg,jew,wiy,wwi,ygw}|{ g, ga,axi,fx ,gax,ikf,kfx,xik}|
{ r, rt,isu,qis,rtq,suy,tqi,uys,ys }|{ x, xe,bux,df ,ebu,uxd,xdf,xeb}|
{ g, gb,bdu,dus,gbd,oyp,soy,uso,yp }|{ f, fe,cw ,eno,fen,nou,ouc,ucw}|
{ n, ne,ctr,ect,mnp,nec,np ,rmn,trm}|{ o, ob,arx,bev,eva,obe,rx ,var}|
{ v, vy,cra,kno,noc,ocr,ra ,vyk,ykn}|{ u, uk,ave,eo ,gav,kga,ukg,veo}|
{ v, vd,dis,dq ,god,isg,odq,sgo,vdi}|{ v, vr,lrr,olr,ri ,rol,rri,vro}|
{ l, lj,jvn,kk ,kok,ljv,nko,okk,vnk}|{ x, xi,fjj,ifj,jjp,jpr,pr ,xif}|
{ b, bg,bgd,djf,dw ,edw,fed,gdj,jfe}|{ e, ef,beu,efb,eue,ex ,fbe,uex}|
{ b, br,bit,brc,ccw,cwb,it ,rcc,wbi}|{ c, cy,cyw,dn ,ndn,ond,won,ywo}|

This is how trigram index are scanned and fetch the results. But only issue with trigram index is the search pattern has to be at least 3 characters. If it is less than that the index can not be applied.

GIN Index

From the documentation,

“GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words.”

We will be using a Gin index across multiple columns in our table. Along with our index, we will be passing a special option called gin_trgm_ops. I will explain more about this option and how it benefits us shortly.

CREATE EXTENSION IF NOT EXISTS pg_trgm;CREATE INDEX in_user_details_first_name_last_name ON user_details USING gin (upper(first_name) gin_trgm_ops, upper(last_name) gin_trgm_ops);

Let us execute the previously executed query again:

explain analyze select * from user_details where upper(first_name) like '%UHX%' or upper(last_name) like '%UHX%';QUERY PLAN                                                                                                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Gather (cost=8711.89..323940.95 rows=795761 width=98) (actual time=6.550..22.685 rows=7156 loops=1) Workers Planned: 2 Workers Launched: 2
-> Parallel Bitmap Heap Scan on user_details (cost=7711.89..243364.85 rows=331567 width=98) (actual time=2.160..4.407 rows=2385 loops=3) |
Recheck Cond: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text)) | Heap Blocks: exact=7008 |
-> BitmapOr (cost=7711.89..7711.89 rows=812001 width=0) (actual time=2.344..2.344 rows=0 loops=1) |
-> Bitmap Index Scan on in_user_details_first_name_last_name (cost=0.00..3657.00 rows=406000 width=0) (actual time=1.375..1.375 rows=3972 loops=1)|
Index Cond: (upper(first_name) ~~ '%UHX%'::text) |
-> Bitmap Index Scan on in_user_details_first_name_last_name (cost=0.00..3657.00 rows=406000 width=0) (actual time=0.967..0.967 rows=3185 loops=1)|
Index Cond: (upper(last_name) ~~ '%UHX%'::text) | Planning Time: 0.109 ms |
JIT: Functions: 6 |
Options: Inlining false, Optimization false, Expressions true, Deforming true |
Timing: Generation 1.785 ms, Inlining 0.000 ms, Optimization 0.218 ms, Emission 2.515 ms, Total 4.518 ms |
Execution Time: 23.852 ms |

If you observe the query optimizer did index scan on the gin index.

Let us see the index results and find out whether they are scanned or not.

select indexrelname, idx_scan, idx_tup_read, idx_tup_fetch from pg_stat_user_indexes where relname = 'user_details';
indexrelname |idx_scan|idx_tup_read|idx_tup_fetch|
-----------------------------------------------------|--------|------------|-------------|
user_details_pkey | 0| 0| 0|
in_user_details_first_name_last_name_text_pattern_ops| 0| 0| 0|
in_user_details_first_name_last_name | 4| 14314| 0|

The gin index is scanned 4 times ironically i executed the query twice, as we have 2 gin indexes scanned twice.

Some limitations of gin index

  1. Search pattern should have at least 3 characters.
  2. GIN does not support null values.
  3. GIN Index will take more time for creation on large tables. Fortunately we have option in Postgres to create it concurrently unless we want to lock out writes that might happen during the index creation.

References:

--

--

Eresh Gorantla
The Startup

Experience in Open source stack, microservices, event-driven, analytics. Loves Cricket, cooking, movies and travelling.