SQL Indexing for Devs — Part 1: Execution Plans

Ido Moskovich
Feb 4, 2020 · 8 min read

Well-designed database indexes can make SQL queries run faster. However, they aren’t free: indexes require both storage space and maintenance as data in a database changes. This article will try to explain how to maximize the value of indexes given those costs from a developer’s perspective.

Here’s a rough mental model: indexes can speed up some reads (SELECTs) while slowing down some writes (INSERTs, DELETEs, & UPDATEs).

Refining that: indexes can improve a database’s ability to process WHERE & JOIN clauses — regardless of whether they’re used in a read or write operation — while increasing the amount of work required to write changes to a column covered by an index since that index now needs to be updated as well. Indexes can also squeeze efficiencies out of GROUP BY clauses, ORDER BY clauses, and the retrieval of columns in SELECT statements.

In part one of this article, I’ll try to explain why it’s important to first understand a database’s execution plans to understand why indexes might be helpful.

Disclaimer: this article only applies to relational databases like PostgreSQL and Microsoft’s SQL Server, not NoSQL databases. Even then, every database engine is different, so this article will stick to broad generalizations rather than dive into implementation details that differ between database engines.

Photo by Hanna Morris on Unsplash

The Exciting Life of a SQL Statement

Step 1: Declarative Instructions

So, our humble SQL statement starts off as a set of declarative instructions that we send to the query optimizer: “Query Optimizer, I want you to do the following thing and don’t care how you do it.” We’re kind of like a hands-off boss with a vision.

Here’s what this looks like:

Step 2: Logically Processed Instructions

One of the cool things about SQL is that we can basically read queries line-by-line as if they were written in plain English: “Give me the first name and favorite color of every user in the state of Illinois, sorting the results by the date the user joined our service.”

While this syntax is nice for us humans, it’s not how the query optimizer sees SQL. Here’s how it interpreted our query: “Go to the users table and find all the rows that have a value of ‘IL’ in their state column. Then, get rid of every column except for first_name, favorite_color and date_joined. Sort the results by date_joined, then get rid of that column too so all that’s left is first_name and favorite_color.”

That sounds awfully similar to our original query, but its order got all jumbled. Normally, if you give someone instructions and they do Step 5 before Step 3, that’s a problem. You don’t bake the pizza before taking it out of its box, right? Thankfully, we gave declarative instructions. We said we needed something but didn’t care how we got it. That gave the query optimizer free reign to interpret our request however it likes.

The query optimizer always does this to our instructions. It changes their order to logically process the query. Here’s what that order looks like [4]:

  1. FROM, including JOINs
  2. WHERE
  3. GROUP BY
  4. SELECT
  5. ORDER BY
  6. TOP / LIMIT

If the declarative nature of SQL is such a benefit of the language, why should we care how the database chooses to follow our instructions? The insight we gain from digging down to that level is that if we can make any one of those steps easier on the optimizer, every subsequent step it takes will be that much easier — the efficiencies cascade down and make everyone’s lives better. As we’ll see later, finding ways to make the FROM , JOIN , and WHERE clauses of a query more efficient can have the biggest impact on the performance of our queries.

Step 3: Execution Plans

Like some of us, the query optimizer wants to do as little work as possible — in this case, the ‘right’ execution plan is the ‘easiest’ execution plan.

‘Easiest’ means different things to different query optimizers. One of the ways query optimizers think about ‘easy’ is by assigning costs to potential execution plans. Each step in the plan will get some kind of cost depending on how much work the optimizer thinks will be required to take that step. The plan with the lowest total cost, the ‘cheapest’ plan, will be declared the ‘easiest’ one and therefore the one that gets executed [2].

So where do those costs come from? In general, three places [3]:

  1. The complexity of your SQL statements. Every JOIN, every part of a WHERE clause, every GROUP BY and ORDER BY is a step that the query optimizer has to consider. The more things your query is trying to do, the more steps the optimizer will need to take, the more expensive the plans it needs to consider.
  2. Statistics about the distribution of data in your tables. How many rows are in the table you’re querying against? Are there lots of unique values in the table or are most of them the same? The optimizer uses these statistics to estimate how much work any individual step of a plan will take so it doesn’t actually have to process that particular step to find out.
  3. The presence or absence of indexes. (I told you we’d get to these!)

As developers, if we’re trying to help the query optimizer do as little work as possible so that we can get our query results as quickly as possible, we can really only focus on the 1st and 3rd items on the list above: simplify our queries or index our tables. (Database statistics are kind of outside of our control unless our applications are storing unnecessary data in the database. Responsibility for the maintenance and accuracy of statistics typically falls on a database administrator.)

We’ve watched our fledgling SQL statement become a beautiful execution plan. We’ve talked about how that plan depended on the specific instructions in the original SQL statement, and how the query optimizer had to choose the plan from a number of potential plans, each of which had a cost associated with it. We can finally talk about the magic of indexes: indexes can unlock better execution plans for a given SQL statement.

Unlocking Efficient Execution Plans

Let’s go back to the logical order of query processing we reviewed above. We can think of the first steps in the logically processed order as accessing and then filtering our data: the FROM clause specifies which table’s data to access, while the different conditions of the WHERE clause instruct the database which rows it should continue to process. These rows will get grouped if there’s a GROUP BY clause. Any columns not referenced in the SELECT or ORDER BY clauses will be thrown away.

Without the help of a well-designed index, the database engine has to perform the relatively slow process of a full table scan to evaluate each row in the table against the conditions in the WHERE clause. Depending on the query and the data, this might just be the most efficient way to get this particular job done.

In case it’s not, we can turn to indexes.

What is an index?

  • The keys of the index are defined as the values from specific columns in a SQL table. When we create an index, we explicitly state which columns to use as its keys. Indexes are defined on columns of specific tables: that is, an index’s keys can only refer to the columns from a single table.
  • The values of the index are usually one of two things depending on how the index is defined. They can either be pointers to the rows of the table which contain the values saved in the key OR they can be values from other columns on those rows.

Very often, indexes are organized into Balanced-Trees (B-Trees) to optimize searching.

An important thing to remember is that indexes are separate data structures from the SQL tables they reference. This is why they can hurt the performance of writes to a database: they need to be updated any time the data in the columns they reference is updated.

How do indexes impact execution plans?

In Part 2, I’ll share some guidelines about how to structure both our queries and our indexes to give the query optimizer the best shot at finding the cheaper plan.

References

[2] https://www.red-gate.com/simple-talk/sql/sql-training/the-sql-server-query-optimizer/

[3] https://gerardnico.com/data/type/relation/engine/optimizer

[4] https://www.microsoftpressstore.com/articles/article.aspx?p=2201633&seqNum=3

[5] https://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#BABDGFIE

[6] https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=sql-server-ver15

[7] https://www.red-gate.com/simple-talk/sql/learn-sql-server/sql-server-index-basics/

Launch School

Publications of the Launch School Community