Indexing in PostgreSQL and applying it to JSONB

Galangkangin Gotera
12 min readMay 24, 2021

--

From e-commerce, news websites to the campus website, Most web applications probably use a database of some kind. Searching is one of, if not the most frequent, operations using a database. When you do a search query without any set-up, your database will go over all the rows and return the matching rows. This will cause slow searches when you have tens of thousands of rows, while you only need to search for a few rows.

This article will discuss several things. Note that the discussion is focused on PostgreSQL:

  1. Index: what is an index? How can index help search?
  2. Basic index: Creating indexes for searching exact matches on PostgreSQL?
  3. Indexing prefix searches: indexing prefix searching such as abcd% , mediu% , or galangkan%
  4. Indexing pattern searches: indexing pattern searching such as %abcd% , %kami% , %sato%
  5. Indexing a function output: what if you want to search for columns that match in case-insensitive?
  6. JSONB: A powerful PostgreSQL data type that makes MongoDB obsolete
  7. Basic JSONB index: Creating indexes for searching exact JSONB field values on PostgreSQL
  8. Indexing Pattern in JSONB: applying prefix and pattern indexes for JSONB
  9. Why Does the Speed Increase Varies: Some reasons why indexes speed up vary. Sometimes by 68x, 350x, or barely even 2x
  10. Closing: Closing remarks on indexes

1. Index

A database index is a data structure that allows faster searching at the cost of slightly slower write (insert, update, delete) time. Some possible data structure includes B-Tree, Hashing, GIN, trigram. Here, we will use B-Tree, GIN, and trigram indexes. You can look up these data structures if you’re curious or unfamiliar, but knowledge of those data structures is not required to follow this article as I will be talking about cases where to use each index type.

The data structure is represented in a separated table, so adding an index also increases memory usage. However, the benefits far outweigh the cost. Indexing narrows down the search space so that the database finds the results faster. On large databases, indexing is a must because search results should be given promptly.

b-tree index example

Above is an example of a b-tree index. The numbers on each node represent the indexed column keys which can be integers, strings, etc., while the leaves represent the data stored. There are 18 rows stored here, where they have indexed values 4, 8, 16, 20, 25, 27, …

Here's the difference when searching on the indexed table vs. the not indexed table. Let's say you’re searching for all rows with value = 49:

  • Non-indexed: sequential scan on all possible rows. So you go over all 18 rows and return all rows with value = 49. In total, it compares 18 times.
  • Indexed: you go from the top of the b-tree. It compares the value and finds the largest node with a value less than your key, 32 (took 3 compares). Then traverses to the middle. Do the same to get key 43 (took 3 compares). Then traverses to the middle and do the same to get your key 49 (took 2 compares). In total, it compares 8 times.

This might not seem like a large improvement. Indeed it is for smaller tables. But indexes get better when more data is introduced. The speed of the b-tree grows at a logarithmic rate, while the speed of the sequential scan grows at a linear rate. Therefore, for 100k data, the sequential scan will need to scan for all 100k data, while the above b-tree will only scan about 2 * log3(100k) = 20 keys. That’s over 5000x speed improvement! However, in practice, there are other limiting factors, so it's that bombastic. I will show actual results later down the article.

2. Basic Index

Let’s use this Postgres table for reference:

TABLE surat_tuntutan {
id int
nomor_perkara int
status TEXT
}

The most common index is b-tree indexes. You can create them in Postgres like this:

CREATE INDEX surat_tuntutan_nomor_perkara ON surat_tuntutan(nomor_perkara)

For integers, this supports mathematical comparisons such as <, >, and = so queries like these will use the index:

SELECT * FROM surat_tuntutan WHERE nomor_perkara < 10;SELECT * FROM surat_tuntutan WHERE nomor_perkara > 50 AND status = 'free';SELECT * FROM surat_tuntutan WHERE nomor_perkara = 70

Note the second query. Although ‘status’ is not indexed, Postgres is smart and chooses the nomor_perkara index and then filters the rows with status = ‘free’ from the resulting index scan.

What about B-Tree indexes for strings? The above index only supports equality for strings. So if nomor_perkara is a string, only these types of queries will use the index:

SELECT * FROM surat_tuntutan WHERE nomor_perkara = 'FM123'

How about pattern searches like nomor_perkara LIKE 'FM%' ? on to the next section!

3. Indexing Prefix Searches

Btree indexes actually have a couple of options to choose from. Such as text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops. Here, we will focus on text_pattern_ops.

If you create a btree index using text_pattern_ops option, it creates a “different” b-tree implementation which in a nutshell stores your strings in lexicography order which allows searching for prefix pattern easily. You can watch this video for an in-depth explanation of text_pattern_ops. here’s how to create the index:

CREATE INDEX surat_tuntutan_nomor_perkara_prefix_idx ON surat_tuntutan(nomor_perkara text_pattern_ops);

Very similar to the basic index, add text_pattern_ops after the column name you want to index. With this, now searches with prefix patterns such as FM% will use the index.

Here’s an example using and not using the index:

here I seeded 100k data into the database

It took 50.691 ms without the index, while it took 0.739 ms with the index. So it's about a 68x speedup! The speed difference will continue to increase. On 10k data, it took roughly the same time for the indexed table, while it took about 6 ms on a sequential scan. That matches what I said about sequential scan increases time linearly on the input. So, on a million rows, it will take about 500 ms on a sequential scan, while it may take the same time for the indexed columns.

speed increases of indexes belike

Now that you understand how powerful indexes can be let's move on from btree indexes into the more powerful indexes type.

4. Indexing Pattern Searches

Now let’s skip to a hard string matching problem: pattern matching! Let's say you want to do queries like SELECT * FROM surat_tuntutans WHERE nomor_perkara LIKE %Q.1.3% . Btree index cannot support this, so we’ll look into a different type of index: GIN and trigram indexes.

GIN (General Inverted Index) indexes are indexes on composite or multi-value data types such as arrays, jsonb, etc. such that it maps each value to the rows. You can read more about GIN indexes and examples here.

Trigram is a method to split each string entry into substrings of size three. For example, the string ‘kamisato’ will be split into an array [‘kam’, ‘ami’, ‘mis’, ‘isa’, ‘sat’, ‘ato’]. Then, when you’re searching for pattern ‘misa’, it will also split into array [‘mis’, ‘isa’] and find all rows containing those two elements on their arrays. How to find the rows containing those two elements? GIN indexes.

Fortunately, PostgreSQL GIN indexes provide a trigram index option, gin_trgram_ops . So to get this powerful trigram index + GIN index combination, you can use GIN(column_name gin_trgm_ops) . GIN() tells postgre that we are using the GIN index, and gin_trgm_ops is the extra implementation detail on the GIN index. Let’s create a trigram index:

CREATE INDEX surat_tuntutan_nomor_perkara_pattern_idx ON surat_tuntutan USING GIN (nomor_perkara gin_trgm_ops);

Now some testing:

Seeded 100k data, where the query should return 26 matches

About 4x speed increase from using the index. Not great, but not bad also.

5. Indexing a Function Output

Here is where it gets exciting. It would help if you had a little bit of experience creating postgres stored procedures to follow along. Previously when we create indexes, we use the values as is. It turns out, postgres can not only index columns but also immutable functions of one or more columns. So, you can create an index like this:

CREATE INDEX ON surat_tuntutan_fun_index ON SURAT_TUNTUTAN(transform(nomor_perkara) text_pattern_ops)

transform is a custom function that removes all punctuations and makes all characters lowercase:

create or replace function transform(str text)
returns text
language sqlimmutableas $$select regexp_replace(lower(str), '[&\.,\s''-]', '','g');$$;

surat_tuntutan_fun_index basically creates an index where the keys are transform(nomor_perkara) with text_pattern_ops btree. To use this index on searching, you use the same function call:

SELECT * FROM SURAT_TUNTUTAN WHERE transform(nomor_perkara) LIKE 'Q.1.3%'
experimentation with 100k seeded data. Here i renamed transform() to be lowercase_remove_punctuation()

Without indexes, it took 307.5 ms. With indexes, it only took 0.878 ms. That's over 350x speed increase!

Using this method, you can index anything. Your function can also use two or more columns. Say you want to search for rows that depend on two columns, you can write a function f(column1, column2) that returns a value representing these two columns. Now the sky is the limit!

This method will come in handy when indexing our extraordinary data type discussed below.

6. JSONB Data type

Ever heard of MongoDB? MongoDB is a NoSQL database where each row is stored as a denormalized JSON. It is said MongoDB is said to better than postgres if:

  • the schema changes a lot. Since MongoDB stores as a JSON, it’s quite easy to change.
  • You have a one-off nested attribute that is not used anywhere else. You can define such attributes on the same table in mongo, while in Postgres, you need to create a new table and use relationships.

Sadly, MongoDB storing scheme comes with way more bad things. As shown in this amazing article, it turns out, having no relationship really reduces the use-cases of MongoDB to a small fraction.

In version 9.15, Postgres released the JSONB data type. It’s basically just a data type that stores JSON as binary form. So basically, it’s how MongoDB stores its JSON. With JSONB, you can use Postgres powerful tools such as indexes, data types, optimizations, relationships while having the power to have ‘nested’ attributes.

On my current group project app, we had such a use case. We have a field called ‘Daftar Terdakwa’ inside Surat Tuntutan, which stores a list of people and their information. JSONB is perfect for storing this because these people’s data is not used anywhere else. The simplified ‘schema’ looks like this:

TABLE surat_tuntutan {
id int
nomor_perkara int
status TEXT
daftar_terdakwa: [
{
name string
phone_number string
address string
}
]}

Now we define daftar_terdakwa to be type JSONB, and we can store its value to any valid JSON value we want. In this case, it’s a list of objects. Here’s the table definition:

TABLE surat_tuntutan {
id int
nomor_perkara int
status string
daftar_terdakwa JSONB
}

Here’s an example row:

id | nomor_perkara | status | daftar_terdakwa
121621 | 736/H.17.20/DD.2.5/2007 | Tutup | [{"NIK": "0075813708576826", "nama": "Gita Enzi", "agama": "Hindu", "pekerjaan": "Wiraswasta", "pendidikan": "SMA", "tempatLahir": "Makasar", "wargaNegara": "Indonesia", "jenisKelamin": "Perempuan", "tanggalLahir": "2002-08-21", "tempatTinggal": "Surabaya"}, {"NIK": "9844358813241006", "nama": "Hara Jayendra", "agama": "Kristen", "pekerjaan": "Pengusaha", "pendidikan": "S1", "tempatLahir": "Makasar", "wargaNegara": "Indonesia", "jenisKelamin": "Perempuan", "tanggalLahir": "2000-01-13", "tempatTinggal": "Pekanbaru"}]

You can see the daftar terdakwa is just a normal JSON. It’s a list of objects with size 2.

If you want to search for rows with “nama”: “Gita Enzi”, you can do this query:

SELECT * FROM surat_tuntutan WHERE daftar_terdakwa @> '[{"nama": "Gita Enzi"}]';

The @> is a contains operator. So the query means to find all rows where one of the list elements contains an object with {"nama": "Gita Enzi"} as one of the fields.

7. Basic JSONB index

Remember GIN indexes on section three? If you’re looking for exact match queries like from the last part of the previous section, pure GIN indexes are the perfect match since GIN indexes the values of the objects to the rows. Creating the index:

CREATE INDEX surat_tuntutan_daftar_terdakwa_nama ON surat_tuntutan USING GIN(daftar_terdakwa)

To query, you can use the @> query explained above.

Searching on 100k data. Only took 5ms

8. Indexing Pattern in JSONB

Bah! indexing exact matches are quite boring. What if we want to search for patterns? let's say we want to get rows that the JSONB contains {"name": "%ayaka%"}

We will use our knowledge from indexing a function output section. Let's create a function that returns a string of values for all “name” fields on the list:

The function basically iterates over every element on the list of objects and then appends the field's value with key ‘wanted_key’ to the array. For example:

select jsonb_values_of_key(jsonb $jsonb$[
{
"nama": "Galangkangin",
"npm": "1806696969",
"kelas": "A"
},
{
"nama": "Akbar",
"npm": "1806696970",
"kelas": "B"
},
{
"nama": "Krisna",
"npm": "1806696971",
"kelas": "C"
},
{
"nama": "HuTao",
"npm": "1806696972",
"kelas": "A"
},
{
"npm": "1806696973",
"kelas": "A"
}
]$jsonb$, 'nama')::text;

will return Galangkangin Akbar Krisna HuTao

Using this function, this problem now reduces to indexing a single string! We know this! we can use trigram indexes:

CREATE INDEX surat_tuntutan_nama_values_idx ON surat_tuntutans
USING GIN (jsonb_values_of_key(daftar_terdakwa, 'nama') gin_trgm_ops);

This is a normal index declaration if you have been following section 5. The jsonb_values_of_key(daftar_terdakwa, 'nama') returns a string, which is then indexed normally using a trigram index.

To query using this index, you can do queries such as:

SELECT * FROM surat_tuntutans WHERE jsonb_values_of_key(daftar_terdakwa, 'nama') LIKE '%ris%'

The left-hand side is the same as the index, and the right-hand side is your query string.

Experiments:

on 100k data

over 10x increase. However, this one is quite impressive since the time is relatively large compared to the previous queries.

Note that it doesn’t matter what your functions do, it just matters that it returns a string so it can be indexed. So you can do any weird string matching you want. Other function examples:

9. Why Does the Speed Increase Varies

You might be wondering why does the speed increase from the experiments vary? Sometimes it's 68x faster, sometimes it's 350x, and sometimes it’s barely even 2x faster. That’s because it depends on a lot of things, such as:

  • The size of the table
  • The number of returned rows
  • How your tables are ordered
  • etc

Some examples when indexes speed increase arent that good:

  • the number of returned rows is large, and there’s an ORDER BY clause. Then, both the index and the sequential scan may spend a lot of time sorting, so the speed increase may not be that much.
  • You do not have an ORDER BY clause, and there are a lot of matches. Postgres is smart because once it gets the required number of matches defined by LIMIT, it will immediately return the rows without checking the other rows. If you want a sequential scan to read all rows, you can use an ‘ORDER BY’ clause.
  • If the table size is small, postgres may choose not to use the index altogether because the analyzer may think it's faster to read all the rows.

Here are examples where index gives great speed-up from the sequential scan:

  • The number of returned rows is relatively small. Indexes can only scan these rows, while sequential scan must scan all rows.
  • There's a SELECT * ORDER BY column query where you indexed column . This is a huge speedup because not only sequential scan needs to read all columns, it must also sort them. While if indexed, postgres can streamline the rows.

However, I think it’s still best to index your columns if used for searching since your table will generally contain distinct enough values for indexes to work.

10. Closing

In my opinion, indexes are what separates real apps from toy apps. When creating indexes, you are also thinking of the application’s scaling. Therefore, you think that a lot of people will use your application. PostgreSQL has many different indexes to choose from. Knowing when to use what index is crucial.

The advantages of using indexes rather than a separate technology such as elastic search or meilisearch are that you have full control over how you want to search and maybe lessen your application's development and maintenance cost.

References

--

--