Performance Optimisation for Wildcards Search in Postgres (Trigram Index)

Published in

The Startup

7 min readOct 24, 2020

I am working on one of my projects where Postgres happens to be the relational database and has a big table, currently had 10 Million records and can bloat up more. The challenge we have is wildcard search is becoming costly and takes more time to fetch results.

I would like to demonstrate with a sample database and data in it.

CREATE TABLE user_details (
    id                BIGSERIAL           PRIMARY KEY,
    user_uuid         UUID                NOT NULL,
    first_name        TEXT                NOT NULL,
    last_name         TEXT                NOT NULL,
    email_id          TEXT                NOT NULL,
    security_number   TEXT                NOT NULL,
    phone_number      TEXT                NOT NULL,
    created_at        TIMESTAMP           NULL DEFAULT now(),
    updated_at        TIMESTAMP           NULL DEFAULT now()
);

Insert around 10 Million records in to the table so that we could analyse the performance of search query. Our requirement is to search for name which matches from either from first_name or last_name.

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
do 
  $$
    begin
      for r in 0..10000000 loop
        insert into user_details (user_uuid, first_name, last_name,   email_id, security_number, phone_number) values(uuid_generate_v4(), get_random_string(8), get_random_string(7), concat_ws('@', substring(md5(random()::text), 0, 9), 'gmailcom'),upper(concat(substring(md5(random()::text), 0, 12))), cast(floor(random()*(9999999999-1000000000+1))+1000000000 as text));
end loop;
end
$$

Function :- get_random_string (< integer>, <string data>)

CREATE FUNCTION get_random_string(IN string_length INTEGER,  
       IN possible_chars TEXT DEFAULT 'abcdefghijklmnopqrstuvwxyz') 
  RETURNS text
    LANGUAGE plpgsql
      AS $$ 
        DECLARE
          output TEXT = '';
          i INT4;
          pos INT4;
          BEGIN
            FOR i IN 1..string_length LOOP
              pos := 1 + CAST( random() * ( LENGTH(possible_chars) - 1) AS INT4 );
              output := output || substr(possible_chars, pos, 1);
            END LOOP;
            RETURN output;
          END;
       $$;

Now we have 10 Million rows in the table, lets try the query for wildcard that matches either first_name or last_name.

Query:
explain analyze select * from user_details where (upper(first_name) like '%UHX%' or upper(last_name) like '%UHX%');

QUERY PLAN                                                                                                
----------------------------------------------------------------------------------------------------------
Gather  (cost=1000.00..331553.53 rows=795761 width=98) (actual time=3.773..1023.569 rows=7156 loops=1)    
  Workers Planned: 2                                                                                      
  Workers Launched: 2                                                                                     
  ->  Parallel Seq Scan on user_details  (cost=0.00..250977.42 rows=331567 width=98) (actual time=6.108..1
        Filter: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text))             
        Rows Removed by Filter: 3380952                                                                   
Planning Time: 0.068 ms                                                                                   
JIT:                                                                                                      
  Functions: 6                                                                                            
  Options: Inlining false, Optimization false, Expressions true, Deforming true                           
  Timing: Generation 1.614 ms, Inlining 0.000 ms, Optimization 0.972 ms, Emission 12.327 ms, Total 14.913 
Execution Time: 1024.575 ms

The sequential scan is performed on the query.

We do not have any indexes on columns, the query execution time > 1 sec and this can increase more if more rows matches for the wildcard.

Basic way to improve query performance is to create index. Here, in this case index has to be created for upper case for both first_name and last_name as case insensitivity involved. By default, Btree index is created.

CREATE INDEX in_user_details_first_name_last_name ON user_details USING btree (upper(first_name), upper(last_name))

Let us try to execute query again as index is created on both the columns. The order is important in compound indexes otherwise the index won’t be scanned by postgres query optimizer.

QUERY PLAN                                                                                               
---------------------------------------------------------------------------------------------------------
Gather  (cost=1000.00..331553.53 rows=795761 width=98) (actual time=3.069..1040.673 rows=7156 loops=1)   
  Workers Planned: 2                                                                                     
  Workers Launched: 2                                                                                    
  ->  Parallel Seq Scan on user_details  (cost=0.00..250977.42 rows=331567 width=98) (actual time=7.160..
        Filter: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text))            
        Rows Removed by Filter: 3380952                                                                  
Planning Time: 4.436 ms                                                                                  
JIT:                                                                                                     
  Functions: 6                                                                                           
  Options: Inlining false, Optimization false, Expressions true, Deforming true                          
  Timing: Generation 1.492 ms, Inlining 0.000 ms, Optimization 1.868 ms, Emission 15.460 ms, Total 18.820
Execution Time: 1041.511 ms

Even if there is an index postgres query optimizer didn’t scan index instead did sequential scan.

Now let us re arrange the query to something like this

Query:
explain analyze select * from user_details where (upper(first_name)) = ‘UHXMMMM’;

QUERY PLAN                                                                                               
---------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on user_details  (cost=1401.87..110343.74 rows=50750 width=98) (actual time=0.027..0.027
  Recheck Cond: (upper(first_name) = 'UHXMMMM'::text)                                                    
  ->  Bitmap Index Scan on in_user_details_first_name_last_name  (cost=0.00..1389.18 rows=50750 width=0) 
        Index Cond: (upper(first_name) = 'UHXMMMM'::text)                                                
Planning Time: 0.098 ms                                                                                  
JIT:                                                                                                     
  Functions: 2                                                                                           
  Options: Inlining false, Optimization false, Expressions true, Deforming true                          
  Timing: Generation 0.756 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.756 m
Execution Time: 0.905 ms

Now Postgres query optimizer had used the index that we created earlier. The Btree index doesn’t support for wildcard searches.

Let’s check if index is scanned or not.

Query:
select indexrelname, idx_scan, idx_tup_read, idx_tup_fetch from pg_stat_user_indexes where relname = ‘user_details’;

indexrelname                        |idx_scan|idx_tup_read|idx_tup_fetch|
------------------------------------|--------|------------|-------------|
user_details_pkey                   |       0|           0|            0|
in_user_details_first_name_last_name|       4|           0|            0|

The index has executed 4 times already (happens to be I executed them for 4 times).

Other alternatives

There are different kind of indexes are present in postgres.

Full Text Search with its full text indexes is not for the LIKE operator at all, it has its own operators and doesn't work for arbitrary strings. It operates on words based on dictionaries and stemming. It does support prefix matching for words, but not with the LIKE operator

Some forums and posts suggested to use varchar_pattern_ops, text_pattern_ops but that didn’t help much.

But I read few articles that both the above patterns works well for prefix matching strings ie., ‘UHX%’.

Then comes the solution Trigram indexes for “LIKE” operators.

pg_trgm is a PostgreSQL extension providing simple fuzzy string matching. It's operational and conceptual overhead is much lower than that of PostgreSQL full-text search or a separate search engine. On top of that it can speed up LIKE, ILIKE, ~ and ~* with trigram indexes, a new index type added by the extension.

In general, pg_trgm can help when:

You need fuzzy case-insensitive string matching.
You want to speed up LIKE, ILIKE, ~ or ~*.
You want to search for patterns that aren’t left-anchored (e.g. %hello%). Such patterns aren't supported by B-tree indexes.

Theory on pg_trgm index

A trigram is a sequence of three consecutive characters in a string. For example, the trigrams of Hello are Hel, ell, and llo. PostgreSQL splits a string into words and determines trigrams for each word separately. It also normalizes the word by downcasing it, prefixing two spaces and suffixing one. Non-alphanumeric characters are considered to be word boundaries. Note that downcasing makes trigrams case-insensitive. Can be seen the values simply like below.

CREATE EXTENSION pg_trgm;
select show_trgm('Hello');show_trgm                |
-------------------------|
{  h, he,ell,hel,llo,lo }|

Let us see trgm index on actual table.

select show_trgm(first_name) as first_name_trgm, show_trgm(last_name) as last_name_trgm from user_details limit 20;first_name_trgm                      |last_name_trgm                   |
-------------------------------------|---------------------------------|
{  t, tu,ct ,gct,ouq,qgc,tuo,uou,uqg}|{  q, qi,buz,iyb,qiy,uza,ybu,za }|
{  j, jd,dss,jds,pd ,ppd,ssu,sup,upp}|{  r, rx,dfh,fh ,hjd,jdf,rxh,xhj}|
{  s, sr,csx,rvc,srv,sxz,vcs,xzj,zj }|{  i, ih,ewq,hew,ihe,mg ,qmg,wqm}|
{  v, vd,bul,dlr,lrs,rsb,sbu,ul ,vdl}|{  n, nd,blc,cs ,dfb,fbl,lcs,ndf}|
{  i, im,cl ,hum,imv,mcl,mvh,umc,vhu}|{  k, ky,byp,kyr,pd ,rby,ypd,yrb}|
{  v, vj,dwv,jdw,nq ,vjd,vyn,wvy,ynq}|{  z, zd,dfd,dpz,fdp,pzp,zdf,zp }|
{  s, sr,koo,nvk,oov,ov ,rnv,srn,vko}|{  c, cd,cdx,dxv,kt ,ukt,vuk,xvu}|
{  g, gy,dw ,frq,gyz,qdw,rqd,yzf,zfr}|{  p, pm,byt,mby,pmb,tuu,uu ,ytu}|
{  m, mc,cvz,ind,mcv,nd ,qin,vzq,zqi}|{  u, us,izs,swi,sy ,usw,wiz,zsy}|
{  q, qj,jjn,jnn,nnx,nxu,qjj,uj ,xuj}|{  x, xr,rwy,vv ,wyy,xrw,yvv,yyv}|
{  a, ac,ach,ay ,chp,day,hda,hph,phd}|{  a, ac,acw,cwf,fgo,gop,op ,wfg}|
{  j, je,eww,gw ,iyg,jew,wiy,wwi,ygw}|{  g, ga,axi,fx ,gax,ikf,kfx,xik}|
{  r, rt,isu,qis,rtq,suy,tqi,uys,ys }|{  x, xe,bux,df ,ebu,uxd,xdf,xeb}|
{  g, gb,bdu,dus,gbd,oyp,soy,uso,yp }|{  f, fe,cw ,eno,fen,nou,ouc,ucw}|
{  n, ne,ctr,ect,mnp,nec,np ,rmn,trm}|{  o, ob,arx,bev,eva,obe,rx ,var}|
{  v, vy,cra,kno,noc,ocr,ra ,vyk,ykn}|{  u, uk,ave,eo ,gav,kga,ukg,veo}|
{  v, vd,dis,dq ,god,isg,odq,sgo,vdi}|{  v, vr,lrr,olr,ri ,rol,rri,vro}|
{  l, lj,jvn,kk ,kok,ljv,nko,okk,vnk}|{  x, xi,fjj,ifj,jjp,jpr,pr ,xif}|
{  b, bg,bgd,djf,dw ,edw,fed,gdj,jfe}|{  e, ef,beu,efb,eue,ex ,fbe,uex}|
{  b, br,bit,brc,ccw,cwb,it ,rcc,wbi}|{  c, cy,cyw,dn ,ndn,ond,won,ywo}|

This is how trigram index are scanned and fetch the results. But only issue with trigram index is the search pattern has to be at least 3 characters. If it is less than that the index can not be applied.

GIN Index

From the documentation,

“GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words.”

We will be using a Gin index across multiple columns in our table. Along with our index, we will be passing a special option called gin_trgm_ops. I will explain more about this option and how it benefits us shortly.

CREATE EXTENSION IF NOT EXISTS pg_trgm;CREATE INDEX in_user_details_first_name_last_name ON user_details USING gin (upper(first_name) gin_trgm_ops, upper(last_name) gin_trgm_ops);

Let us execute the previously executed query again:

explain analyze select * from user_details where upper(first_name) like '%UHX%' or upper(last_name) like '%UHX%';QUERY PLAN                                                                                                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------|
Gather  (cost=8711.89..323940.95 rows=795761 width=98) (actual time=6.550..22.685 rows=7156 loops=1)                                                              Workers Planned: 2                                                                                                                                            Workers Launched: 2                                                                                                                                             
  ->  Parallel Bitmap Heap Scan on user_details  (cost=7711.89..243364.85 rows=331567 width=98) (actual time=2.160..4.407 rows=2385 loops=3)                      |
        Recheck Cond: ((upper(first_name) ~~ '%UHX%'::text) OR (upper(last_name) ~~ '%UHX%'::text))                                                               | Heap Blocks: exact=7008 |                                                                                                                                  
        ->  BitmapOr  (cost=7711.89..7711.89 rows=812001 width=0) (actual time=2.344..2.344 rows=0 loops=1)                                                       |
              ->  Bitmap Index Scan on in_user_details_first_name_last_name  (cost=0.00..3657.00 rows=406000 width=0) (actual time=1.375..1.375 rows=3972 loops=1)|
                    Index Cond: (upper(first_name) ~~ '%UHX%'::text)                                                                                              |
              ->  Bitmap Index Scan on in_user_details_first_name_last_name  (cost=0.00..3657.00 rows=406000 width=0) (actual time=0.967..0.967 rows=3185 loops=1)|
                    Index Cond: (upper(last_name) ~~ '%UHX%'::text)                                                                                               | Planning Time: 0.109 ms |                                                                                                                                          
JIT: Functions: 6                                                                                                                                                                                                                                                                                                              |
  Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                                   |
  Timing: Generation 1.785 ms, Inlining 0.000 ms, Optimization 0.218 ms, Emission 2.515 ms, Total 4.518 ms                                                        |
Execution Time: 23.852 ms                                                                                                                                         |

If you observe the query optimizer did index scan on the gin index.

Let us see the index results and find out whether they are scanned or not.

select indexrelname, idx_scan, idx_tup_read, idx_tup_fetch from pg_stat_user_indexes where relname = 'user_details';
indexrelname                                         |idx_scan|idx_tup_read|idx_tup_fetch|
-----------------------------------------------------|--------|------------|-------------|
user_details_pkey                                    |       0|           0|            0|
in_user_details_first_name_last_name_text_pattern_ops|       0|           0|            0|
in_user_details_first_name_last_name                 |       4|       14314|            0|

The gin index is scanned 4 times ironically i executed the query twice, as we have 2 gin indexes scanned twice.

Some limitations of gin index

Search pattern should have at least 3 characters.
GIN does not support null values.
GIN Index will take more time for creation on large tables. Fortunately we have option in Postgres to create it concurrently unless we want to lock out writes that might happen during the index creation.

References:

Indexes in PostgreSQL - 7 (GIN)

We have already got acquainted with PostgreSQL indexing engine and the interface of access methods and discussed hash…

habr.com