Things every developer absolutely, positively needs to know about database indexing — Kai Sassnowski

Jolly Fish
Technically True🚀
4 min readJul 15, 2020

This article is based on the talk by Kai Sassnowski at LARCON (2018)

Things every developer absolutely, positively needs to know about database indexing — Kai Sassnowski

Indexing Databases

Indexing improves query performance by ordering the indexed field of a database (not the entire database) such that retrieval based on such fields is highly optimized. Ordering is required to apply efficient algorithms, such as Binary Search, for searching the database. In database indexing, the indexed fields are ordered as B-Trees (Balanced Trees) which guarantees that all queries (based on indexed fields) take the same number of steps

The leaf nodes of the B-Tree form a doubly linked list which links adjacent leaf nodes bidirectionally

The fields that are indexed are in-memory for fast access. Non-indexed fields are stored on disk. The indexed fields are stored in-memory along with an internal Row-ID which links the indexed fields to non-indexed data on disk

Why don’t we use Index on all fields?

As a rule of thumb, you want to have as many indices on the database as necessary, however, as few as possible

Although using more indices improves read speeds, however, it brings down write speeds as the B-Trees are to be re-balanced after every insertion

Viewing the Access Mechanism

To view how the database is going to access the data and use the indices, we can run the query prefixed with a EXPLAIN keyword. The access-method might be ALL(entire DB being scanned), INDEX(all indices in-memory being scanned one by one, no algorithm used), RANGE(ranged scanning done to the indices), CONST(algorithmic lookup within the indices). Further, the estimated number of rows to be scanned for the query is also shown

Scenario

Assume a table employees with fields date, group, & salary. The goal is to calculate the sum of entries for which year is 2013. One simple query to perform this would be:

SELECT SUM(salary) FROM employees WHERE YEAR(date) = 2013;

The above runs quite slow (with ALL access type) as no indexing has been done so far. If we do add indexing to the date field and re-run the same query, the performance does not improve. The reason for this is Pitfall #1: function usage

Pitfall #1: Function Usage

When a function is applied to the indexed field, the indexing on the field is ignored. This is because the order of the output of the indexed field (after applying the function) might not be related to the original order. Hence, indexing does not work in such a case.

Thus, in our case, we can substitute the function call YEAR by

SELECT SUM(salary) FROM employees WHERE 
date BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59';

Although the above query introduces an index-able field (RANGE type access), however, the performance becomes worse (if the index is forced, otherwise MySQL automatically default to ALL type access). The cause of this is the data that is being stored on disk. The salary field is now stored on disk and not in-memory (as it would have been in the ALL type access). For every valid record, the DB has to access the salary field from the disk which degrades performances significantly.

Thus, we can put the salary field on the index as well, such that now our indices are: date > salary. This greatly improves performance.

Scenario (Continued)

If we have to return the salary sum for a specific group, say group = 14, we can alter the query to

SELECT SUM(salary) FROM employees WHERE 
date BETWEEN '2013-01-01 00:00:00' AND '2013-12-31 23:59:59'
AND group = 14;

This results in an ALL access type once again as group is not indexed and must be fetched from disk. If we add group to the index, we will have the following ordering of indices date > salary > group. This, however, does not reduce the search space as much as expected because of Pitfall #2: unordered multi column indices

Pitfall #2: Multi-Column Indices

When multiple fields are added to the index, the indexing is done left to right. Thus, the first column is ordered, then the next column is ordered based on the previous, i.e. to decide order of duplicates in the first column, the second column is used, and so on. Thus, in our case, the ordering is done by date first, then by salary and finally by group. Hence, a filter using the date and group will be limited to the ordering of date as group is unordered with respect to date

Thus, to address this, we have to index in the order: date > group > salary

Although the reasoning behind this is sound, the performance remains the same. The reason being Pitfall #3: In-equality operator use

Pitfall #3: In-equality operator use

Although the indexing can be used from left to right, however, as soon as there is an inequality operator (such as BETWEEN) on any of those columns, the index stops at and beyond that point. Therefore, the fixed version would involve changing index order to group > date > salary

However, this reduces the performance of the query in the first part of the scenario (INDEX type access) as the date field is the second field in the order of indices. Thus, the choice of indices and their order is highly influenced by the use case

The combination of all these steps results in a highly optimized performance.

--

--