SQL Indexing for Devs — Part 2: Practical Indexing Advice

Published in

Launch School

9 min readFeb 4, 2020

Part 1 of this post dug into execution plans to explain how indexes can make SQL queries run more efficiently. In this part, I’ll stick to extremely simplified practical guidelines: which columns to index & how to write SQL queries to take advantage of those indexes. This guide is purposefully not exhaustive: SQL indexing is both a science & an art with many detailed resources dedicated to it. Instead, I hope to provide some generally applicable advice that, at the very least, can point to productive indexing strategies.

Your mileage on generic advice about SQL indexing will vary significantly depending on which database engine you use, the distribution of your data (# of rows, # of unique values, presence or absence of NULLs, etc.), and the types of SQL statements normally executed against that data. Database engines usually let you inspect execution plans (see Microsoft’s SQL Server Management Studio’s option to Include Actual Execution Plan or PostgreSQL’s ANALYZE command). Use the output from those tools to confirm that your index tuning & query optimization efforts are headed in productive directions.

Finally, as explained below, indexes come with a maintenance cost. This post focuses on the benefits of indexing. That does NOT mean that you should index every last column you’re using in your queries.

It’s impossible to overemphasize this: there is no one-size-fits-all to SQL indexing. Use these guidelines as rules of thumb.

TL;DR

Indexing basics:

Don’t design indexes in a vacuum
Don’t index columns that are frequently updated
Indexes can include multiple columns & the order of columns in your index definition matters
If your query doesn’t need a column, get it out of your query

Focus on columns used in WHERE clauses:

Start with columns used in WHERE equality comparisons
Then consider the columns used in WHERE range comparisons
Don’t forget about SARGability

Then, move on to the other parts of your queries:

Review columns used in JOIN clauses
Review columns that appear in GROUP BY & ORDER BY clauses
Finally, an index that covers your SELECT columns can keep your query from ever accessing the underlying table

How to create an index

The SQL syntax for creating an index looks something like this:

CREATE INDEX
  ix_index_name
ON
  table_name (column1 [, column2, ...]);

An index is defined on one or more columns from a single table. For reasons we’ll cover below, the order of the columns in the index definition matters (a lot).

Advice

#0a: Don’t design indexes in a vacuum

Indexes are designed to optimize the specific ways your application accesses specific records in your database. That means that they are closely tied to the queries your application executes against a database. For example, if you built your application to always access a users table by filtering on a first_name column, indexing on first_name makes sense. But if you change those queries to filter that table using user_id instead, it’s up to you to update those indexes so they continue to match your application’s database access patterns.

#0b: Don’t index columns that are frequently updated

As explained in Part 1, indexes aren’t free. They are their own data structure separate from the table they reference. Anytime the data in a column referenced by an index is updated in the database, the index needs to be updated as well. Write-heavy columns are therefore not the greatest candidates for indexing.

#0c: Indexes can include multiple columns & the order of columns in your index definition matters

While you can define multiple indexes on the same table, when you query that table, the query optimizer will only choose one of those indexes for that particular query — it won’t run that query by somehow combining separate indexes that cover different columns. This means that if you’re filtering a query with multiple columns in a WHERE clause, you should consider defining a single index that covers all of those columns.

The order that those columns appear in your index definition has a significant impact on which queries can use it. In general, a query can only benefit from an index if the columns in the query’s WHERE clause appear in the leading columns of the index definition. If any of the leading columns in the index definition aren’t used by the query, the index won’t be very helpful. This is because the keys of the index are stored in the same order that the columns are defined in the index’s definition, so the second column in the index is only useful if we’ve filtered by the first one, and the third is only useful if we’ve filtered by both the second and the first one, etc. [3, 4]. See the appendix down below for an example.

In general, it’s better to have one slightly more ‘generic’ index that multiple queries can take advantage of than multiple indexes that are highly specialized for specific queries. If you can tweak the column order or add a column to an existing index to have it cover multiple queries, strongly consider doing that rather than incurring the overhead of adding an additional index. Don’t forget that column order matters: adding a rightmost column to an index definition may not have a detrimental impact on queries that are already using that index while potentially allowing others to benefit from it. And finally, don’t go wild with your indexing (see #0b). You can absolutely have too many indexes on a table & too many columns in an index.

An example

The classic example for thinking about the importance of column order in an index is to consider an alphabetized phone book. We can think of a phone book as an index where the “keys” are last names followed by first names and the “values” are phone numbers. Phone books are dope if you need to find a phone number for a specific last name and first name. They’re also pretty useful if you’re not sure about the first name but know the last name. In each of these “queries”, we’re relying on the leading columns of our phone book index. However, trying to look up a phone number by first name alone is… painful. The phone book is sorted by last name, so you have to painstakingly look through each record to find the numbers with the first name you’re looking for. Each record in the book might as well be randomized! Database indexes work the same way.

#0d: If your query doesn’t need a column, get it out of your query

This one has nothing to do with indexes and may seem obvious. If you’re SELECTing a bunch of columns but only using one or some of them, you’re making the database engine do more work than it needs to by retrieving those columns & passing that data from the database, across the network, to your application.

Where can this come up?

Any SELECT queries with a *. Using the * in a SELECT can be fine when you’re running ad hoc queries to explore the contents of a table, but it shouldn’t be in your production code.
Object Relational Mapping (ORM) frameworks. While ORMs hide the SQL your application uses to interact with databases, they’re still generating SQL statements that go through the same execution process we reviewed in Part 1. It’s worth spending some time tracing the SQL queries these helpful frameworks are executing to ensure they’re not returning all the columns in a table when you may only need one or two.

#1: Focus on columns used in `WHERE` clauses

#1a: Start with columns used in `WHERE` equality comparisons

Indexes shine brightest when they help the database engine find the exact rows it’s looking for without scanning every row in a table (this is why they’re closely tied to the design of specific queries). If you’re using = to compare a column to a value in a WHERE clause, it’s a great candidate for indexing:

...
WHERE
  column1 = 'The greatest ever'
...

This is a common pattern with primary keys. Some database engines create indexes on the columns used in primary keys by default.

Before blindly indexing a column just because it’s used in a WHERE clause, you should consider its selectivity— the fraction of rows it can identify based on the uniqueness of its values out of the total number of rows in a table [1]. That is, if you have an index on a primary key, you’re guaranteed to have a highly selective index. Every value in it will uniquely identify a row in a table. On the other hand, BIT columns can only have two values: 0 or 1. An index here will contain many rows with the same two values, so it won’t be as helpful when trying to narrow down a result set. (Consider using filtered indexes in this case!)

Prioritize indexing the highly selective columns that appear in equality comparisons in WHERE clauses [2] [7].

# 1b: Then consider the columns used in `WHERE` range comparisons

In addition to equality comparisons, WHERE statements can also be used with operators like >, <, & BETWEEN. These operators let us compare a value to range of values. After prioritizing the columns used in equality comparisons, consider indexing the columns used in range comparisons.

# 1c: Don’t forget about SARGability

The intimidating word SARGable means “Search ARGument-able.” When a comparison in a WHERE clause can take advantage of an index, it’s considered SARGable [5]. The implication there is that you can inadvertently write a WHERE clause that can’t use an index. To maintain SARGability & therefore your query’s ability to benefit from an index:

Don’t wrap columns in your WHERE comparisons in a function.
Only use the wildcard % operator at the end of a string. [6]

If you’re stuck with a query that calls a function in a WHERE clause, try to see if you can call that function on a constant value instead of a column. Other workarounds include using computed columns or calling those functions in the index definition. This stack overflow post has great examples of the Dos & Don’ts of maintaining SARGability.

#2: Then consider the columns used in `JOIN` clauses

As we saw in Part 1, JOIN clauses are one of the first parts of a query logically processed by the query optimizer. When writing most JOINs, we need to specify which columns to JOIN ON. The columns that appear in these JOIN clauses are great candidates for indexing.

Oftentimes, they’ll be primary keys, so they may get indexed by default depending on your database engines.

Foreign keys also frequently appear in JOINs & so should be considered for indexing.

#3: After that, review columns that appear in `GROUP BY` & `ORDER BY` clauses

Up until now, we’ve been using indexes to find specific rows of a table or limit result sets based on specific values, the exact sort of things that FROM, JOIN, and WHERE clauses do. But the usefulness of indexes doesn’t stop at finding & filtering. They also help with sorting & grouping. We’ve talked about how indexes are a sorted representation of specific columns in a table. If the columns in an index appear in an ORDER BY or GROUP BY clause, the database can use that index to figure out what order to return our results in [8].

#4: Finally, an index that covers your `SELECT` columns can keep your query from ever accessing the underlying table

If every column referenced in your query, including the ones in your SELECT statement, are in your index, you have what’s called a covering index. Without a covering index, the database may use an index to figure out which rows of a table it needs to read data from, but it still needs to somehow access those table rows to retrieve that data. Covering indexes are cool because they let your database do all its work on the index — it doesn’t have to take that relatively expensive step of reading data from the underlying table since it can get all the data it needs from the index. Instead of just using the index to find the relevant rows, the database can use the index to both find the rows & get the column data. Pretty sweet if the maintenance costs of a larger index aren’t too bad.

There are two places these columns should go in an index if you see a benefit to including them:

the rightmost columns in your index definitions, i.e. after columns used to filter, join, sort, or group your results.
in the INCLUDE clause of your index definition if your database engine supports it. [5]

#5: Other Considerations: `UNIQUE` and filtered indexes

An index can be defined as UNIQUE which not only creates a database-enforced uniqueness constraint on a column, but also allows the query optimizer to optimize queries that use the UNIQUE column(s) included in the index. [9]

Filtered indexes can also be created by defining the index with a WHERE clause, in which case the index will only be used if a query has a matching WHERE condition. These can be great for columns that have a small number of unique values, only some of which are ever commonly queried [10].