The concepts behind the built-in indexes of PostgreSQL

There may be an index better than the default for your database.

Rain Wu
Random Life Journal
4 min readJun 3, 2021

--

As the amount of data in the database increases, the performance of the query may start to be unsatisfactory. However, we definitely can’t let the service continue to slow down because of this bottleneck, a suitable solution is necessary.

Photo by Denny Müller on Unsplash

Using table index is one of the most common solutions to slow reading, it could be built byCREATE INDEX or similar commands in ORM. In most cases, if the column we used to build is the same as we used for search, the performance will be significantly improved.

But, did I use index properly in this way?

Although all of the problems of slow reading speed seem to be similar, the content of the data and the business scenario may be very different, B-tree is not the master key for all of them.

This article will simply talk about the difference between some common indexes and their respective suitable scenarios.

The Concepts of Database Index

In modern relational databases, indexes act as a faster way to search while searching by a particular column.

For example, a binary search tree for the license plate number in the user profile. While locating the user who owns the license plate with the id XD-2871, using a binary search tree will definitely faster than linear search.

However, while improving the query speed, there will also be additional space occupation and additional maintenance costs.

  • In addition to the essential data table, the binary search tree also needs to be stored.
  • While create or update a record, the essential data table and binary search tree were both needed to be modified.

Base on the concept above, each kind of RDB provides a variety of indexes to meet various search optimization needs.

Commonly Used Indexes in PostgreSQL

At the time of writing this article, the latest version of PostgreSQL is 13, and the following content is based on this version. There are 5 indexes were arranged on the official document currently: B-Tree, GIST, SP-GIST, GIN, and Brin.

B-Tree

The B-Tree index is the default option of index creation, it would be used if no options specified. Any data type that can be sorted into a well-defined linear order can be indexed by a B-Tree index.

It works well in most linear-based searches, and rarely encounters extreme cases of poor performance. An obvious limitation of this index is data type must be sortable, and it is also somewhat uncertain for non-built-in types.

Usage scenarios:

  • identity card number, UUID, license plate id

GIST

GIST stands for Generalized Search Tree, it is also a tree structure model but more like a template for the custom data type. With the appropriate access methods, it can make up for the former’s type restriction.

A lot of extensions of the PostgreSQL index are base on it because of the considerable flexibility for data with special types.

Usage scenarios:

  • full-text search, vector, distance

Sp-GIST

Although GIST already provides a flexible interface for implementation, it is still not enough due to the balanced structure of R-Tree.

Some unbalanced structures like quad-tree and suffix-tree can repeatedly divide the search space into partitions of different sizes, this could be very helpful since we can locate a record step by step with more meaningful and efficient branches.

Usage scenarios:

  • phone number, map grid, IP address

GIN

GIN stands for Generalized Inverted Index, the inverted relationship makes it perform excellent while fetching all records that contain a particular element. Most of the time the columns applied by this index are composite values, such as arrays and text.

Usage scenarios:

  • keyword searching, search by tag

Brin

Brin is the abbreviation of the Block Range index, which mainly depends on the statistics information for each block of records. It can provide better performance while the data has a high correlation with physical storage structure.

Usage scenarios:

  • time-series data, ZIP code, geometry bounding box

Wrap Up

When I am thinking about which index to choose, it will be very helpful to get clues through the statistical information of the target tables, such as correlation:

postgres=# select correlation from pg_stats where tablename='house' and attname='zip_code'

In addition, a sandbox experiment was highly recommended, it can give us more reliable evidence since the data inside most of the productional database are not as clean as the sample data.

Hope this article helps you further understand PostgreSQL indexes :)

--

--

Rain Wu
Random Life Journal

A software engineer specializing in distributed systems and cloud services, desire to realize various imaginations of future life through technology.