Introduction To Database Indexing

6 min readOct 20, 2023

What is index ?

A book index is a user-friendly reference tool typically found at the end, listing keywords or terms in alphabetical order along with corresponding page numbers where those terms are discussed in the book. It serves as a valuable navigational aid, helping readers quickly locate specific information, making it essential for researchers, students, and anyone seeking to access relevant content efficiently. In the digital age, book indexes are also available in electronic formats for e-books, maintaining their functionality in aiding readers in search and navigation within digital texts.

First, Let’s ask How Data is stored in database?

Data is typically stored on storage devices in fixed-size units called pages or blocks. These units can range in size from a few kilobytes to several megabytes (PostgreSQL default is 8k), depending on the storage system and configuration. The DBMS reads and writes data in these units.

If the database engine lacks any extra metadata regarding the data.. It will need to do full scan for every query. Imagine having millions or billions of records. It can take seconds.

Why Index?

Index act as quick reference guide for a database. it helps the database Engine to find specific data in a table more rapidly, similar to an index in a book that helps you quickly locate information on a particular topic or keyword. Indexes are essential for improving query performance in a database.

We have two Main types of index:

Clustered Index (Primary key):

Determines the physical order of data in a table.
Each table can have only one clustered index.
When you create a clustered index on a table, the data rows are rearranged in the order of the indexed column(s).
records are stored in the leaf nodes.

Non Clustered Index (or Secondary Index):

Doesn’t change the physical order of data in a table.
Table can have multiple non-clustered indexes
Stored in a separate data structure that includes a copy of the indexed column(s) along with a pointer to the its address in the storage

So in general indexes make read queries faster

But if indexes make queries faster why don’t we create index for every column in the table ?

to answer this we need know how indexes works under the hood.

Database indexes are akin to data structures stored on disk. These structures can take various forms, including B+ Tree indexes, Hash indexes, Bitmap indexes, and more
Every time a record is updated, and a change is made to the value in a column that is part of an index, the database engine swings into action. It’s compelled to access the index pages, modify them to reflect the updated data, and then write these modified index pages back to the storage. This process is essential to keep the indexes in sync with the underlying data, ensuring that queries continue to perform efficiently even as the dataset evolves.
So while the indexes enhance the efficiency of data retrieval operations. They provide swift access to the desired information.they comes at a cost when it comes to write operations (inserting, updating, or deleting data).

To visualize this, let’s see how clustered and non-clustered B+ tree indexes are stored in the database:

if you don’t know about B+ tree you can check this link
https://www.programiz.com/dsa/b-tree#:~:text=B%2Dtree%20is%20a%20special,%2Dbalanced%20m%2Dway%20tree
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Clustered Index:

Clustered index arrange the physical location of records in order, these ordered data rows are stored in the leaf nodes of the index

The following figure shows how a clustered B+ index on the id column. The database builds a B+ tree using the id values of the records as keys. The records themselves are stored in the leaf nodes of the B+ tree.

Fig1 showing very simple B+ trees as you expect the leaf nodes are stored on storage in unit called Page , But not only the records are stored but also the other nodes and in reality of course the amount of data will be significantly larger than what we have in this simplified diagram. when we add new record with id of 16 the B+ tree, structure will change to be like this:

Figure 2: Clustered index on Id after adding 16

The database engine performs a series of operations when adding a new record to the B+ tree. It begins by retrieving the relevant index pages, then proceeds to restructure the B+ tree to accommodate the new record. Once the restructuring is complete, the engine saves the updated pages and writes them back to the storage, ensuring that the B+ tree remains in its optimized state.

Non clustered Index:

Non Clustered index doesn’t physically order data rows according to the index value. the leaf nodes contains the record id or pointer to page or block in the storage (depending on the database engine).
The following figure shows how a non-clustered B+ index on the email column. The database builds a B+ tree using the email values of the records as keys.

Figure 3 shows a B+ tree with email as the key, which is very similar to Figures 1 and 2, but with one important difference: the records are not stored in the leaf nodes. Instead, the leaf nodes only contain the ID or location of the record in storage, depending on the database. Most databases use the ID as the value, but PostgreSQL uses the location of the record on storage. When we add a new record with email q@e.com and ID 16 to the B+ tree, the structure will change to be like this:

Figure 4: Non-clustered index on email after adding q@email.com

So as you saw every time we insert or update or delete, The database engine performs a series of operations when adding a new record to the B+ tree. It begins by retrieving the relevant index pages, then proceeds to restructure the B+ tree to accommodate the new record. Once the restructuring is complete, the engine saves the updated pages and writes them back to the storage, ensuring that the B+ tree remains in its optimized state

That is why it is not a good idea to add index to every column.

but it make perfect sense to create index on the columns that are most used in read queries.

Are all indexes stored in B+ tree?

No, there are multiple types of data structure that are used to build indexes
the common indexes types are :

B+ Tree Indexes:
. B+ Tree (Balanced Tree) is one of the most common index structures.
. It’s a self-balancing tree structure that allows for efficient data retrieval, insertion, and deletion
. n a B+ Tree index, each node contains multiple keys and pointers to child nodes, with the properties that make it balanced.
. The root node is at the top, and leaf nodes at the bottom contain actual data pointers.
. Well-suited for range queries
Hash Indexes:
. Hash indexes use a hash function to map keys to locations in the index.
. They are very fast for exact-match queries but are less efficient for range queries.
. Hash indexes are typically used in memory-based databases or as a secondary index in some databases.
Bitmap Indexes:
. Typically used for low-cardinality columns, which are columns with a relatively small number of distinct values.
. Excellent for read-heavy workloads and complex queries, they can be less efficient for write operations.
Sparse Indexes:
. Stores information about only a subset of the rows in a table, rather than every row. They are designed to optimize query performance for specific use cases where indexing all rows is not necessary.
. save space compared to traditional indexes because they don’t need to store index entries for every row in the table.

There are many other indexes types like: LSM Tree, GiST, Gin, R-Tree, and SSTable. but I am still reading about them.

Finally:

Understanding database indexing is vital when working on the fundamental aspects of system design. Many improvements in your applications depend on this detailed knowledge. Selecting the right index can greatly enhance your application’s performance.

References:

https://blog.bytebytego.com/p/database-indexing-strategies
https://www.youtube.com/watch?v=-qNSXK7s7_w
https://www.youtube.com/watch?v=7g6PG71puX4&t=229s