Squeezing Performance from SQLite: Indexes? Indexes!

Are your queries taking too long to run? In this episode of the series, I will explain how indexes work in SQLite, and how to design indexes that can take your most important queries to the next level of performance.

Luckily we don’t need to understand the Dewey Decimal System to understand how SQLite Indexes work!

Primary Keys, Query Planning, and Binary Search

When you define a table in SQLite, you have the option of denoting an INTEGER column as the AUTO_INCREMENT PRIMARY KEY, or in lieu of that: using the default rowid primary key column that is supplied for you. When you add new records to that table, they are assigned an automatically incremented, unique value for the primary key column and are stored in increasing order of that primary key in the database file.

The benefit of storing the records in an order sorted by the primary key is that when you want to locate one of the records with its id, SQLite can perform a binary search in O(log n) time, as opposed to a full O(n) scan of the table.

Comparison of Binary Search: O(log n) to Full Scan: O(n)

Imagine we’ve got a table that contains information about tags, defined with the following CREATE TABLE statement:

CREATE TABLE tags (
title TEXT,
description TEXT,
created TEXT
);

This table doesn’t explicitly specify a PRIMARY KEY column, so rowid is used instead.

When we run SELECT title FROM tags WHERE rowid = 6 SQLite does a binary search, first looking at the middle record with rowid = 9/2 = 4, realizes that’s too low, and tries rowid = (9-4)/2 + 4 = 2 + 4 =6 and bingo! That took 2 steps, and is a lot faster than iterating through each record until it finds the 6th record (which clearly would’ve taken 6 steps).

But wait, why couldn’t SQLite just immediately jump to the 6th record? Well, even though the rowid is auto-incrementing and unique, we could’ve deleted rows in the table and thereby created gaps where the 6th record might not be the record with rowid = 6.

The algorithms SQLite uses to determine how to execute a query are grouped up into a module called the “Query Planner”, and performing that determination is called “query planning”.

Indexes

So far, we’ve only been talking about primary keys. What if we don’t necessarily know the rowid for the tag we need, but instead wanted to search for the “Kotlin” tag:

SELECT * FROM tags WHERE title = "Kotlin";

As of the current state of our schema, SQLite will need to iterate one-by-one through the whole table until it finds the record where title = "Kotlin" holds true: an O(n) operation.

If you prefix the query from above with EXPLAIN QUERY PLAN you can verify that SQLite will need to perform a full scan of the table just to find the “Kotlin” tag:

EXPLAIN QUERY PLAN SELECT * FROM tags WHERE title = "Kotlin";

An “index” on a table, in SQL parlance, is an ordered data structure which can be used to find a record or its primary key in O(log n) time. When you create an index for a column or a set of columns, SQLite maintains an ordered list of the data within the index’s columns as well as their records’ primary key values.

Let’s add an index of the title column to the tags table:

CREATE INDEX tag_titles ON tags (title);
Representation of the tag_titles index

Now, when we run SELECT * FROM tags WHERE title = "Kotlin", SQLite can use the tag_titles index to perform a binary search on the title values to determine that the record we’re looking for has rowid = 5 and can then use that value in a second binary search against the actual tags table to return the full record.

One binary search to find the rowid for “Kotlin”, another to find the data.

In short: using the tag_titles index brings the query’s performance from O(n) to O(log n + log n) = O(log n) which is similar to if we were to query on the primary key.

You can prefix the query with EXPLAIN QUERY PLAN to verify that the index is used.

EXPLAIN QUERY PLAN SELECT * FROM tags WHERE title = "Kotlin";

“Covering Indexes”

In situations where you tend to do a lot of querying for subsets of your table’s data, you can utilize what are called “Covering Indexes” to eliminate that second binary search from the query planning operation.

Say our application needs to be able to display the description as well as the title of a particular tag when you search for it.

SELECT title, description FROM tags WHERE title = "Kotlin";

We saw that with the index described in the last section: one binary search is done to find the rowid of the record where the title is “Kotlin”, and a second is done to find the remainder of the data that was required. Let’s create a new index which includes the description as well as the title:

CREATE INDEX tag_title_desc ON tags (title, description);

Now when you run the query, SQLite’s query planner is smart enough to recognize that we only need data which is contained within the index and can return that data immediately after the index search without resorting to looking at the actual records in the tags table.

EXPLAIN QUERY PLAN 
SELECT title, description FROM tags WHERE title = "Kotlin";

The tag_title_desc is considered a “covering index” because it covers all the fields needed to satisfy the query.

One O(log n) operation is better than two, especially when the data set is huge.
- some smart person

Side Note: Although the tag_title_desc defines description as a field on the index, it won’t help you if you’re trying to query the tags table by description. This is because the index is sorted by title first, and description second, and there’s no good way to binary-search the descriptions using an index like that. If we need that functionality, we should consider adding a new index on the description field.

Using Indexes with ORDER BY

When you use ORDER BY without an index on the column to be sorted, SQLite needs to build up a temporary data structure that contains the sorted results each time the query is executed. That data structure will use up space in memory (or on disk, depending on the circumstances) and will be torn down after the query is executed. This can be a lot of extra work that might noticeably degrade performance.

However, in addition to helping with quickly finding records: indexes are used by the SQLite query planner to speed up sorting when the ORDER BY clause references indexed columns. For example:

SELECT * FROM tags ORDER BY title ASC;

Here, SQLite doesn’t need to build up the temporary data structure and can simply iterate through the tag_titles index. When fetching the data for the result set it will perform binary searches to retrieve each tags record.

Both sorting with a non-covering index and without are O(n log n) operations, but eliminating the need to generate the temporary data structure cuts out a lot of work.

When a covering index can be used, the sort is even faster because SQLite can omit those binary searches all-together and the query becomes O(n) because it only needs to iterate through the index and return the covered columns from the index.

When should I consider creating a new index?

Because indexes are sorted data structures and their benefit comes from how binary search works, it’s important to ensure that your indexed columns have what is called “high cardinality”. All this means is that the indexed data has a lot of uniqueness.

Why is this important? Well, without a lot of uniqueness to your data, the binary search can end up becoming an O(n) linear scan anyway.

The Costs

Additionally, it’s important to remember two important costs to creating and maintaining indexes on your tables:

  1. Size
    Indexes need to be stored in the database file, and their structure is similar to tables. You should remember that adding an index on a column or a set of columns effectively doubles the storage used to keep the data those columns represent.
  2. Insert/Update/Delete Performance
    When you insert, update, or delete records in your indexed tables, it’s important that the indexes are kept up to date as well as in the correct order. Each of these operations will take slightly longer when there are indexes on the table than when there are no indexes. If you’re going to be doing a lot of updates, consider dropping your indexes before the operation and re-creating them afterwards.

Conclusion

Indexes are a powerful tool at your disposal to boost performance of your application’s data retrieval. They help with querying and sorting the data, and when covering indexes are used correctly can make lookups incredibly fast. Covering indexes take the whole idea to the next level.

Indexes also come with some overhead: they will use additional space, and can increase the time it takes to insert or update records because SQLite needs to keep the indexes up to date. It’s important to weigh the pros and cons before you add an index to your table.

This post touches on the basics of indexes in SQLite, but you can read a whole lot more about how SQLite’s query planner works and its use of indexes in the official documentation. Additionally, for the sake of making this post more approachable, I purposefully glossed over how the indexes are built and maintained. You can read more about how SQLite (and other databases) maintain indexes by learning about B-trees.

Finally, a lot of what we discussed in this post can be applied to how indexes work in other database systems.

If you enjoyed this post, please tap that heart button to recommend it to your followers. Also, you should check out the other posts in my series: “Squeezing Performance from SQLite”: