Database Evolution Part 1: Evolving indexes

Published in

Bazaar Engineering

6 min readMay 9, 2022

A significant thing to keep in mind when creating scalable software is that your database schema is your bread and butter, a system can only scale as much as the schema and storage engine allows it. This applies to indexes as well, an index created a long time ago based on a particular usage pattern might not be useful if the usage pattern and field cardinality has changed.

Indexes are our first and primary line of defense against scalability issues. So it is important to understand how to evolve them with your schema. Because indexes can always evolve, just like database schemas or software architecture.

A brief summary of indexes and their need

An index is a data structure used to reduce disk seeks when querying some records from a database, the trade-off is that they require extra storage and may increase write transaction latency.

The need for indexes is simple, as a business scales, data increases, and table lookups to access that data take longer, this is not scalable, we need a way to avoid disk accesses as much as possible and indexes are a very prominent and surefire way to accomplish this goal.

A car sales use case

Imagine you work for a company that advertises cars for sale. So far, the utmost priority for your business has been giving potential customers the ability to search cars by name and year, and compare the prices to make their bet. A schema for a table car_make would look something like this:

+---------+---------------------+--------------+-------+
| make_no |        name         | release_year | price |
+---------+---------------------+--------------+-------+
|      11 | Audi R8             |         2008 | 20000 |
|      12 | Volkswagen Scirocco |         2018 | 10000 |
+---------+---------------------+--------------+-------+

Following is the query most used to lookup car makes:

SELECT name, release_year, price FROM car_make WHERE release_year = :year:

We have an index on release_year and this has served us well so far.

Let’s assume the requirements have slightly changed and the business now wants support for a color filter as well, giving the users ability to search on both year and color at the same time:

+---------+---------------------+--------------+-------+-------+
| make_no |        name         | release_year | color | price |
+---------+---------------------+--------------+-------+-------+
|      11 | Audi R8             |         2008 | white | 20000 |
|      12 | Volkswagen Scirocco |         2018 | black | 10000 |
+---------+---------------------+--------------+-------+-------+

Now our query will look something like this:

SELECT name, release_year, color, price FROM car_make WHERE release_year = :year: and color = :color:

The problem

If we look at this query, now we have two filtering fields: release_year and color. This implies something about our data, now we can have multiple colors against a single release_year. This will decrease the lookup time of our index since the cardinality for release_year will drop.

Composite indexes to the rescue

A composite index is an index made as a combination of two or more fields. We get the following benefits when using a composite index.

Fewer I/O operations than single index because of increased selectivity of the index.
Lower data retrieval latency.
Ability to index multiple fields instead of individual fields so the query optimizer does not have to choose which index to use(will expand on this below)

Proof of concept

Let’s apply the index to our data using the below query:

CREATE INDEX release_year_color ON TABLE car_make (release_year, color)

Let’s test out this index on some sample data and analyze the query using the EXPLAIN keyword.

So these are our current indexes, as you can see we have the new composite index release_year_color added

Right now we only have two car makes, each with a unique year and color.

Let’s try out the query and analyze which index is used.

What’s this? the old release_year index was used, but why is that?

Enter the query optimizer

As SQL is a declarative language, we only provide the database engine with the set of instructions on what we need, and the execution plan is decided by the query optimizer, this involves several choices, but let’s focus on the ones relevant to our use case:

Whether to use table scan or index scan.
Which index to use.

So it’s apparent that the query optimizer decided that the release_year index is a better choice for this query, why is that?

This comes back to cardinality, if we look at our current data, we have 2 rows and cardinality for release_year = 2, this is as high as cardinality as we can get given the current data so the query optimizer felt no need to use the composite index as it would not provide any significant performance gains over the release_year index.

Remember our prediction that the release_year column will likely be repeated for new colors, in that case, the composite index will be used. Let’s test that out. This is our new sample data as per the predicted evolution.

We just have a new car_make entry for each car but with a new color and same year. Let’s run the same query on this dataset and analyze it.

It seems now the optimizer decided to go with our new index.

Takeaway

When you first introduce a new index, and it is not being used, you should question if the data will evolve to use the new index. If you’re certain, you can remove the old indexes or wait for the natural adoption of the new index.

When NOT to use a composite index

Composite indexes significantly increase write latencies, so let’s say you have a logs table that has significantly more write operations than reads, indexing might not be a good option, It is better to have two versions of the log datastore, one optimized for writes and one for reads.
If the query using the composite index has very low usage frequency and you have to decide between another single field index with higher usage, then better to opt for a single field index. This comes down to usage patterns.

Composite index performance

Let’s increase the data somewhat and check the difference between performances.

Using the release_year index we get 1.8 ms

Using the release_year_color composite index we get 1.6 ms

Not much difference since this is a fairly small dataset, but even in such a small dataset, seeing any performance difference at all is a good indication.

In conclusion

We discussed the importance of scalable schema design and how indexes can help achieve that.
We analyzed a car sale use case and understood when to go for a composite index.
We observed how the query optimizer decides which type of index to use based on cardinality.
We tested this out by customizing our data in a way that the optimizer would pick up our new index.
We contemplated when not to go for a composite index.
We tested out performance gains on a dataset that conforms to ur expected usage pattern.

Disclaimer:

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages it’s colleagues and friends to share knowledge, experiences and opinions in written form on it’s medium publication, in a hope that some people across the globe might find the content helpful. However the content shared in this post and other posts on this medium publication mostly describe and highlight the opinions of the author, which might or might not be the actual and official perspective of Bazaar Technologies.