Well-designed database indexes can make SQL queries run faster. However, they aren’t free: indexes require both storage space and maintenance as data in a database changes. This article will try to explain how to maximize the value of indexes given those costs from a developer’s perspective.
Here’s a rough mental model: indexes can speed up some reads (
SELECTs) while slowing down some writes (
Refining that: indexes can improve a database’s ability to process
JOIN clauses — regardless of whether they’re used in a read or write operation — while increasing the amount of work required to write changes to a column covered by an index since that index now needs to be updated as well. Indexes can also squeeze efficiencies out of
GROUP BY clauses,
ORDER BY clauses, and the retrieval of columns in
In part one of this article, I’ll try to explain why it’s important to first understand a database’s execution plans to understand why indexes might be helpful.
Disclaimer: this article only applies to relational databases like PostgreSQL and Microsoft’s SQL Server, not NoSQL databases. Even then, every database engine is different, so this article will stick to broad generalizations rather than dive into implementation details that differ between database engines.
The Exciting Life of a SQL Statement
To understand how indexes can make our applications run faster, we should start by talking about the journey a single SQL statement takes when it leaves our application to head to a database.
Step 1: Declarative Instructions
SQL is a declarative language. It describes “what to do” rather than “how to do it” . Because you don’t specify how to return or update data using SQL, just which data you’d like to work with, the choice of how to follow your instructions is left up to a part of the database engine called the query optimizer. The query optimizer’s job is to figure out what you want it to do and how to do it.
So, our humble SQL statement starts off as a set of declarative instructions that we send to the query optimizer: “Query Optimizer, I want you to do the following thing and don’t care how you do it.” We’re kind of like a hands-off boss with a vision.
Here’s what this looks like:
Step 2: Logically Processed Instructions
How does the obedient query optimizer translate our declarative SQL instructions into an imperative execution plan?
One of the cool things about SQL is that we can basically read queries line-by-line as if they were written in plain English: “Give me the first name and favorite color of every user in the state of Illinois, sorting the results by the date the user joined our service.”
While this syntax is nice for us humans, it’s not how the query optimizer sees SQL. Here’s how it interpreted our query: “Go to the
users table and find all the rows that have a value of ‘IL’ in their
state column. Then, get rid of every column except for
date_joined. Sort the results by
date_joined, then get rid of that column too so all that’s left is
That sounds awfully similar to our original query, but its order got all jumbled. Normally, if you give someone instructions and they do Step 5 before Step 3, that’s a problem. You don’t bake the pizza before taking it out of its box, right? Thankfully, we gave declarative instructions. We said we needed something but didn’t care how we got it. That gave the query optimizer free reign to interpret our request however it likes.
The query optimizer always does this to our instructions. It changes their order to logically process the query. Here’s what that order looks like :
If the declarative nature of SQL is such a benefit of the language, why should we care how the database chooses to follow our instructions? The insight we gain from digging down to that level is that if we can make any one of those steps easier on the optimizer, every subsequent step it takes will be that much easier — the efficiencies cascade down and make everyone’s lives better. As we’ll see later, finding ways to make the
JOIN , and
WHERE clauses of a query more efficient can have the biggest impact on the performance of our queries.
Step 3: Execution Plans
Once the query optimizer has mulled over our SQL and turned it into a set of logical steps, it starts figuring out how it can take each of those steps by putting together an execution plan. It needs to do this because it has multiple tools at its disposal to take any individual step. What does that mean? Well, if you’re
JOINing two tables, the optimizer can choose to implement that
JOIN by building a hash table, or it can use a nested loop, or it can do something called a sort merge . Filtering a table in the
WHERE clause? Should we perform a table scan to find the rows we can keep? Can we do a seek instead? Oh, but what if we use an index? Should we scan or seek from there? Can we even do that? These are the types of questions the optimizer has to ask for each step of your query on its quest to choosing the ‘right’ execution plan.
Like some of us, the query optimizer wants to do as little work as possible — in this case, the ‘right’ execution plan is the ‘easiest’ execution plan.
‘Easiest’ means different things to different query optimizers. One of the ways query optimizers think about ‘easy’ is by assigning costs to potential execution plans. Each step in the plan will get some kind of cost depending on how much work the optimizer thinks will be required to take that step. The plan with the lowest total cost, the ‘cheapest’ plan, will be declared the ‘easiest’ one and therefore the one that gets executed .
So where do those costs come from? In general, three places :
- The complexity of your SQL statements. Every
JOIN, every part of a
ORDER BYis a step that the query optimizer has to consider. The more things your query is trying to do, the more steps the optimizer will need to take, the more expensive the plans it needs to consider.
- Statistics about the distribution of data in your tables. How many rows are in the table you’re querying against? Are there lots of unique values in the table or are most of them the same? The optimizer uses these statistics to estimate how much work any individual step of a plan will take so it doesn’t actually have to process that particular step to find out.
- The presence or absence of indexes. (I told you we’d get to these!)
As developers, if we’re trying to help the query optimizer do as little work as possible so that we can get our query results as quickly as possible, we can really only focus on the 1st and 3rd items on the list above: simplify our queries or index our tables. (Database statistics are kind of outside of our control unless our applications are storing unnecessary data in the database. Responsibility for the maintenance and accuracy of statistics typically falls on a database administrator.)
We’ve watched our fledgling SQL statement become a beautiful execution plan. We’ve talked about how that plan depended on the specific instructions in the original SQL statement, and how the query optimizer had to choose the plan from a number of potential plans, each of which had a cost associated with it. We can finally talk about the magic of indexes: indexes can unlock better execution plans for a given SQL statement.
Unlocking Efficient Execution Plans
How can indexes impact execution plans?
Let’s go back to the logical order of query processing we reviewed above. We can think of the first steps in the logically processed order as accessing and then filtering our data: the
FROM clause specifies which table’s data to access, while the different conditions of the
WHERE clause instruct the database which rows it should continue to process. These rows will get grouped if there’s a
GROUP BY clause. Any columns not referenced in the
ORDER BY clauses will be thrown away.
Without the help of a well-designed index, the database engine has to perform the relatively slow process of a full table scan to evaluate each row in the table against the conditions in the
WHERE clause. Depending on the query and the data, this might just be the most efficient way to get this particular job done.
In case it’s not, we can turn to indexes.
What is an index?
Indexes can be thought of as an ordered list of key-value pairs that make it easy for a database engine to find rows in a table that have specific values :
- The keys of the index are defined as the values from specific columns in a SQL table. When we create an index, we explicitly state which columns to use as its keys. Indexes are defined on columns of specific tables: that is, an index’s keys can only refer to the columns from a single table.
- The values of the index are usually one of two things depending on how the index is defined. They can either be pointers to the rows of the table which contain the values saved in the key OR they can be values from other columns on those rows.
Very often, indexes are organized into Balanced-Trees (B-Trees) to optimize searching.
An important thing to remember is that indexes are separate data structures from the SQL tables they reference. This is why they can hurt the performance of writes to a database: they need to be updated any time the data in the columns they reference is updated.
How do indexes impact execution plans?
In our sad, non-indexed world, the only way the database engine could fulfill a query was by traversing through each row of a table, evaluating its contents against the conditions of the
WHERE clause. Indexes change that. They give us another way to get to the relevant rows. If our index key includes the columns specified in the
WHERE clause, we can use it to quickly figure out which rows we want to work with without slowly scanning the entire table! The query optimizer loves when it can do this because searching through an index is usually a much cheaper operation than evaluating every row of a table. We’ve unlocked an easier plan! 
In Part 2, I’ll share some guidelines about how to structure both our queries and our indexes to give the query optimizer the best shot at finding the cheaper plan.
 Designing Data-Intensive Applications, p. 42–43