SQL Indexing for Devs — Part 2: Practical Indexing Advice
Part 1 of this post dug into execution plans to explain how indexes can make SQL queries run more efficiently. In this part, I’ll stick to extremely simplified practical guidelines: which columns to index & how to write SQL queries to take advantage of those indexes. This guide is purposefully not exhaustive: SQL indexing is both a science & an art with many detailed resources dedicated to it. Instead, I hope to provide some generally applicable advice that, at the very least, can point to productive indexing strategies.
Your mileage on generic advice about SQL indexing will vary significantly depending on which database engine you use, the distribution of your data (# of rows, # of unique values, presence or absence of NULL
s, etc.), and the types of SQL statements normally executed against that data. Database engines usually let you inspect execution plans (see Microsoft’s SQL Server Management Studio’s option to Include Actual Execution Plan or PostgreSQL’s ANALYZE
command). Use the output from those tools to confirm that your index tuning & query optimization efforts are headed in productive directions.
Finally, as explained below, indexes come with a maintenance cost. This post focuses on the benefits of indexing. That does NOT mean that you should index every last column you’re using in your queries.
It’s impossible to overemphasize this: there is no one-size-fits-all to SQL indexing. Use these guidelines as rules of thumb.
TL;DR
Indexing basics:
- Don’t design indexes in a vacuum
- Don’t index columns that are frequently updated
- Indexes can include multiple columns & the order of columns in your index definition matters
- If your query doesn’t need a column, get it out of your query
Focus on columns used in WHERE
clauses:
- Start with columns used in
WHERE
equality comparisons - Then consider the columns used in
WHERE
range comparisons - Don’t forget about SARGability
Then, move on to the other parts of your queries:
- Review columns used in
JOIN
clauses - Review columns that appear in
GROUP BY
&ORDER BY
clauses - Finally, an index that covers your
SELECT
columns can keep your query from ever accessing the underlying table
How to create an index
The SQL syntax for creating an index looks something like this:
CREATE INDEX
ix_index_name
ON
table_name (column1 [, column2, ...]);
An index is defined on one or more columns from a single table. For reasons we’ll cover below, the order of the columns in the index definition matters (a lot).
Advice
#0a: Don’t design indexes in a vacuum
Indexes are designed to optimize the specific ways your application accesses specific records in your database. That means that they are closely tied to the queries your application executes against a database. For example, if you built your application to always access a users
table by filtering on a first_name
column, indexing on first_name
makes sense. But if you change those queries to filter that table using user_id
instead, it’s up to you to update those indexes so they continue to match your application’s database access patterns.
#0b: Don’t index columns that are frequently updated
As explained in Part 1, indexes aren’t free. They are their own data structure separate from the table they reference. Anytime the data in a column referenced by an index is updated in the database, the index needs to be updated as well. Write-heavy columns are therefore not the greatest candidates for indexing.
#0c: Indexes can include multiple columns & the order of columns in your index definition matters
While you can define multiple indexes on the same table, when you query that table, the query optimizer will only choose one of those indexes for that particular query — it won’t run that query by somehow combining separate indexes that cover different columns. This means that if you’re filtering a query with multiple columns in a WHERE
clause, you should consider defining a single index that covers all of those columns.
The order that those columns appear in your index definition has a significant impact on which queries can use it. In general, a query can only benefit from an index if the columns in the query’s WHERE
clause appear in the leading columns of the index definition. If any of the leading columns in the index definition aren’t used by the query, the index won’t be very helpful. This is because the keys of the index are stored in the same order that the columns are defined in the index’s definition, so the second column in the index is only useful if we’ve filtered by the first one, and the third is only useful if we’ve filtered by both the second and the first one, etc. [3, 4]. See the appendix down below for an example.
In general, it’s better to have one slightly more ‘generic’ index that multiple queries can take advantage of than multiple indexes that are highly specialized for specific queries. If you can tweak the column order or add a column to an existing index to have it cover multiple queries, strongly consider doing that rather than incurring the overhead of adding an additional index. Don’t forget that column order matters: adding a rightmost column to an index definition may not have a detrimental impact on queries that are already using that index while potentially allowing others to benefit from it. And finally, don’t go wild with your indexing (see #0b). You can absolutely have too many indexes on a table & too many columns in an index.
An example
The classic example for thinking about the importance of column order in an index is to consider an alphabetized phone book. We can think of a phone book as an index where the “keys” are last names followed by first names and the “values” are phone numbers. Phone books are dope if you need to find a phone number for a specific last name and first name. They’re also pretty useful if you’re not sure about the first name but know the last name. In each of these “queries”, we’re relying on the leading columns of our phone book index. However, trying to look up a phone number by first name alone is… painful. The phone book is sorted by last name, so you have to painstakingly look through each record to find the numbers with the first name you’re looking for. Each record in the book might as well be randomized! Database indexes work the same way.
#0d: If your query doesn’t need a column, get it out of your query
This one has nothing to do with indexes and may seem obvious. If you’re SELECT
ing a bunch of columns but only using one or some of them, you’re making the database engine do more work than it needs to by retrieving those columns & passing that data from the database, across the network, to your application.
Where can this come up?
- Any
SELECT
queries with a*
. Using the*
in aSELECT
can be fine when you’re running ad hoc queries to explore the contents of a table, but it shouldn’t be in your production code. - Object Relational Mapping (ORM) frameworks. While ORMs hide the SQL your application uses to interact with databases, they’re still generating SQL statements that go through the same execution process we reviewed in Part 1. It’s worth spending some time tracing the SQL queries these helpful frameworks are executing to ensure they’re not returning all the columns in a table when you may only need one or two.
#1: Focus on columns used in WHERE
clauses
#1a: Start with columns used in WHERE
equality comparisons
Indexes shine brightest when they help the database engine find the exact rows it’s looking for without scanning every row in a table (this is why they’re closely tied to the design of specific queries). If you’re using =
to compare a column to a value in a WHERE
clause, it’s a great candidate for indexing:
...
WHERE
column1 = 'The greatest ever'
...
This is a common pattern with primary keys. Some database engines create indexes on the columns used in primary keys by default.
Before blindly indexing a column just because it’s used in a WHERE
clause, you should consider its selectivity— the fraction of rows it can identify based on the uniqueness of its values out of the total number of rows in a table [1]. That is, if you have an index on a primary key, you’re guaranteed to have a highly selective index. Every value in it will uniquely identify a row in a table. On the other hand, BIT
columns can only have two values: 0
or 1
. An index here will contain many rows with the same two values, so it won’t be as helpful when trying to narrow down a result set. (Consider using filtered indexes in this case!)
Prioritize indexing the highly selective columns that appear in equality comparisons in WHERE
clauses [2] [7].
# 1b: Then consider the columns used in WHERE
range comparisons
In addition to equality comparisons, WHERE
statements can also be used with operators like >
, <
, & BETWEEN
. These operators let us compare a value to range of values. After prioritizing the columns used in equality comparisons, consider indexing the columns used in range comparisons.
# 1c: Don’t forget about SARGability
The intimidating word SARGable means “Search ARGument-able.” When a comparison in a WHERE
clause can take advantage of an index, it’s considered SARGable [5]. The implication there is that you can inadvertently write a WHERE
clause that can’t use an index. To maintain SARGability & therefore your query’s ability to benefit from an index:
- Don’t wrap columns in your
WHERE
comparisons in a function. - Only use the wildcard
%
operator at the end of a string. [6]
If you’re stuck with a query that calls a function in a WHERE
clause, try to see if you can call that function on a constant value instead of a column. Other workarounds include using computed columns or calling those functions in the index definition. This stack overflow post has great examples of the Dos & Don’ts of maintaining SARGability.
#2: Then consider the columns used in JOIN
clauses
As we saw in Part 1, JOIN
clauses are one of the first parts of a query logically processed by the query optimizer. When writing most JOIN
s, we need to specify which columns to JOIN
ON
. The columns that appear in these JOIN
clauses are great candidates for indexing.
Oftentimes, they’ll be primary keys, so they may get indexed by default depending on your database engines.
Foreign keys also frequently appear in JOIN
s & so should be considered for indexing.
#3: After that, review columns that appear in GROUP BY
& ORDER BY
clauses
Up until now, we’ve been using indexes to find specific rows of a table or limit result sets based on specific values, the exact sort of things that FROM
, JOIN
, and WHERE
clauses do. But the usefulness of indexes doesn’t stop at finding & filtering. They also help with sorting & grouping. We’ve talked about how indexes are a sorted representation of specific columns in a table. If the columns in an index appear in an ORDER BY
or GROUP BY
clause, the database can use that index to figure out what order to return our results in [8].
#4: Finally, an index that covers your SELECT
columns can keep your query from ever accessing the underlying table
If every column referenced in your query, including the ones in your SELECT
statement, are in your index, you have what’s called a covering index. Without a covering index, the database may use an index to figure out which rows of a table it needs to read data from, but it still needs to somehow access those table rows to retrieve that data. Covering indexes are cool because they let your database do all its work on the index — it doesn’t have to take that relatively expensive step of reading data from the underlying table since it can get all the data it needs from the index. Instead of just using the index to find the relevant rows, the database can use the index to both find the rows & get the column data. Pretty sweet if the maintenance costs of a larger index aren’t too bad.
There are two places these columns should go in an index if you see a benefit to including them:
- the rightmost columns in your index definitions, i.e. after columns used to filter, join, sort, or group your results.
- in the
INCLUDE
clause of your index definition if your database engine supports it. [5]
#5: Other Considerations: UNIQUE
and filtered indexes
An index can be defined as UNIQUE
which not only creates a database-enforced uniqueness constraint on a column, but also allows the query optimizer to optimize queries that use the UNIQUE
column(s) included in the index. [9]
Filtered indexes can also be created by defining the index with a WHERE
clause, in which case the index will only be used if a query has a matching WHERE
condition. These can be great for columns that have a small number of unique values, only some of which are ever commonly queried [10].
References
[1] https://blog.toadworld.com/2018/09/05/unselective-indexes-selectivity
[2] https://stackoverflow.com/questions/2292662/how-important-is-the-order-of-columns-in-indexes
[3] https://sqlhints.com/tag/is-order-of-column-order-in-index-matters/
[4] https://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys
[6] https://stackoverflow.com/questions/799584/what-makes-a-sql-statement-sargable
[8] https://dba.stackexchange.com/questions/11031/order-by-column-should-have-index-or-not
[9] https://stackoverflow.com/questions/20515501/unique-vs-non-unique-index