How to choose a table index for your SQL database

Published in

The Software Firehose

7 min readAug 10, 2016

A lot of people seem to be working with SQL databases without much technical training recently, and I figured many such people might want some help optimizing their queries for read, write, and update operations on the database without having to pick up a lot of theory.

This one’s for the absolute beginners to indexes. So if you know how a database index works, you probably shouldn’t be reading any further.

RDBMS are very robust and powerful systems if used in the right way, even for the amounts which many people wrongly consider ‘big’ data. Choosing the right table index for your database is one of the easiest ways to optimize database performance, and it’s also the primary database optimization that a developer should be concerned with.

The person who can optimize queries by creating an index is the person who knows what the queries are.

This also means that if you want to future-proof your work, you should also be able to anticipate what future queries your application will be making to the database.

If you want to know how database indexes work in detail but without all the other fluff, www.use-the-index-luke.com is an excellent free resource. It won’t do justice to that resource if I tried to cover the details that it explains so well, so check it out if you’re curious (and you can stop reading this post right away).

The Levels of Optimization

Before we start, it helps to know that you can always optimize performance at many levels, and even with your SQL database, that can happen at at least two levels:

Application Level: By examining exactly what data you need for your application, and requesting only that data without any overhead or extra data.
Query Level: By choosing better indexes for your most common queries, and sometimes by restructuring your query in more efficient ways to use existing indexes or a simpler operation.

Throughout this quick ‘tutorial’, we’ll be focusing entirely on query level optimization using indexes.

What is an index?

Let’s try to understand an index using an analogy. Say we have a table called ‘customers’ in your database, with their names and phone numbers.

To build our analogy, let’s imagine an equivalent solution: we have a book/register containing a large list of customers and their phone numbers with the respective row number.
For e.g., a book with hundreds of thousands of entries in this format:

...  ...                      ...
678. Hunter, Alex             9123220322
679. Rhodes, William          9873344323
680. Mata, Juan               9234232234
681. Cardoza, Johann          9858389233
...  ...                      ...

Imagine, for some reason, that our business’s service department has to often find out the customer’s name given his phone number. You can imagine how hard a task this is, having to go through each number from beginning to end sequentially and check if it matches that number.

You’re probably thinking: why don’t we just write down all these number-name pairs in ascending order of the numbers, i.e. why don’t we just sort the phone numbers?

The issue with that is quite evident: writing new data becomes impossible. Why? Because the physical order is dependent on when the data is entered, so we’d have to change the physical order every time new data is entered. (It’s also interesting to note that by simply entering data as and when it comes, the row number is automatically sorted by time of entry.)

So we try a different solution.

We buy another book/register, and list down all the phone numbers in ascending order along with the row number of that data in our original customer book, leaving enough space to add new data in the future if required. This is your ‘index’ (yes, just like a textbook’s index at the back).

...  ...                      ...
9123220322                    187
9123224195                    892
9123298322                    14
...  ...                      ...

Now it becomes very easy for us to find the customer name by using both books. This may be slower in some cases (e.g. for small tables where sequential checking is faster due to the overhead of keeping an index), but on average this will definitely be faster.

Now, I’d like you to note that the part about ‘leaving enough space’ is not technically accurate in terms of how it’s implemented with databases, and was used purely for lack of a better analogy.

Most disk-based databases use a B-tree (or balanced tree) data structure to store indexes — this allows quicker searching of data on average. Visit http://use-the-index-luke.com/sql/anatomy/the-tree to learn more.

When you define an index on the column of a database table, the database management system (DBMS, like MySQL, Postgres etc.) is essentially creating another ‘table’ with the data of that particular column sorted, allowing quicker search time. How you create the index is specific to your DBMS. Note that the overhead introduced for maintaining indexes may not be worth it for tables with a small number of rows.

Now, getting to the point of this article:

So, which column of the table should I define the index on?

Make a list of all the queries your application is making to the particular table, and figure out which column is the most commonly used WHERE clause or JOIN ON clause. That’s the column you should define your index on.

For e.g., if WHERE phone = 923123121 or JOIN ON customers.phone = foo.bar are cases that appear often in your query, then you should (in most cases) define your index on the ‘phone’ column.

Multi-column Indexes

When your WHERE (or JOIN ON) clauses contain more than one condition, involving more than one column, then you could define an index on two columns. For e.g. if you define an index on the (age, phoneno) columns of a table, then the index will be stored in order of sorted age, and for each age, the phone-number will be sorted. This quickens up two types of queries:

ones with ‘WHERE age = xx AND phoneno = xxxxxxxxx’ and
one with just ‘WHERE age = xx’

However, a query with WHERE phoneno = xxxxxxxxx would still be slow, because we defined the index on (age, phoneno), and not (phoneno, age). Thus, the order of definition of multicolumn indexes matters.

Primary Keys

Primary keys are simply columns with
- an index defined on them,
- a UNIQUE constraint (not duplicates) and,
- a NOT NULL constraint

Concluding Remarks

A few things I’d like you to note:

Time your queries. High performance = low query time, from an application developer’s perspective.
The size/optimization tradeoff: As mentioned earlier, the overhead of introducing indexes will cause your queries to be slower if your table isn’t big enough.
Trade-offs: With every optimization, there can be various ‘gotchas’ — optimizing one type of query can slow down the other and so on. If you want to understand the nuances, or even just figure out why your queries are slow, learn more details from a resource like the one mentioned at the end of this post.

Performance Gains: A Real Example

I’ll narrate an actual example from personal experience when database optimization helped us. For our product Engaze, once we had developed a barebones/minimal functioning user interface and backend, we realized that requests to our data API was taking quite a bit of time.

Looking at the timings for requests in the Chrome console, one request revealed itself as the main culprit, the ActivitySuggestions request— it was fetching a list of activities for the user after some heavy data analytics.

To quantify things, ActivitySuggestions was taking 18 seconds! That’s a lot of time, honestly.

Running the Flask profiler, we figured more than 95% of that time was spent in one particular function. Guess what? That function had 6 SQL queries to our database that was slowing things down.

My colleague Abhit performed the first (application) level of optimization, by combining information in 2 queries neatly into one better query. Now, the ActivitySuggestions request took only 8 seconds, a significant improvement!

I continued to examine what was wrong, and realized that two of the remaining queries took ~2.87 seconds to complete. Well, I put all the queries in a list, looked at their WHERE clauses, and it became quite obvious to me what the index had to be for our table of about ~30 million rows (then). So I defined a multi-column index for that table.

Can you guess how much time the 8 second request took now? 700ms! Yes, 0.7s. The 2.87 second SQL query itself took only 0.13 seconds. That’s at least a 50 times increase purely from SQL query optimization!

If you’re still curious about how indexes work, I recommend you read the articles on http://use-the-index-luke.com. It’s quite a lot of fun if you’re just a beginner using databases and care only about the theory that has relevance for your performance-optimization needs as an application developer.

The author is an education enthusiast, an apprentice flaneur and works as a Software Engineer at Nilenso Software, a boutique software consultancy and employee-owned software cooperative. His other publications include:

The Software Firehose

Unrefined, rapid-fire snippets of practical software engineering

medium.com

Hacking Learning

Accelerating learning, cultivating curiosity, and other explorations in scaling happiness through education

medium.com

How to choose a table index for your SQL database

The Levels of Optimization

What is an index?

So, which column of the table should I define the index on?

Concluding Remarks

Performance Gains: A Real Example

The Software Firehose

Unrefined, rapid-fire snippets of practical software engineering

Hacking Learning

Accelerating learning, cultivating curiosity, and other explorations in scaling happiness through education

Written by Shaaz Ahmed