Part 2 — Common Indexing Mistakes That Make Your Queries Slow

Rehmat Sayany
Published in Westwing Tech Blog
Sep 27, 2022 · 7 min read

This is the second part of my blog series. If you haven't read the first part, please have a look at https://tech.westwing.de/part-1-what-is-indexing-why-its-important-to-know-a61aa18dbdf1 first, as a basic understanding of query execution plans is essential before optimizing your queries.

Let's start with an example. Assume you are working on a point-of-sale system and you have a payment table as shown below.

Payment table of Point of sale system.
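For reference, here is a minimal sketch of such a payment table. Only payment_date, amount, and staff_id appear in the queries below; the remaining columns and types are assumptions.

-- Minimal sketch of the assumed payment table (types and extra columns are illustrative only).
CREATE TABLE payment (
  payment_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  staff_id     TINYINT UNSIGNED NOT NULL,
  amount       DECIMAL(8,2) NOT NULL,
  payment_date DATETIME NOT NULL
);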

Your manager wants a sales report of all the sales in the year 2022. It looks simple, so you write a query:

SELECT SUM(amount) FROM payment WHERE YEAR(payment_date) = 2022;

It took 8 seconds to fetch the data and we got 202249.53. Looks good, right?

Result of Query.

The next morning you show the result to your manager. He says the result looks good, but the query is slow; can you please optimize it?

So you think: I need to improve the read performance, so I should add an index on payment_date, as it's the only column in the WHERE clause. You add an index on payment_date and now you want to see the execution plan.
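A minimal sketch of those two steps could look like this (the index name idx_payment_date is just an assumption for illustration):

-- Add an index on payment_date (the index name is illustrative).
CREATE INDEX idx_payment_date ON payment (payment_date);

-- Check how MySQL plans to execute the original query.
EXPLAIN SELECT SUM(amount) FROM payment WHERE YEAR(payment_date) = 2022;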

Execution Plan

Wait. What? For some reason, we have an index on the payment_date column, but it hasn't been considered. The access_type is ALL, which means a full table scan has been performed.

This brings me to the first mistake that we made.

FUNCTION USED ON AN INDEXED COLUMN

If we look carefully at our query above, the index we have created is not effective. Indeed, any time an indexed column is wrapped in a function, the index cannot be used. The reason is that we have an index on the column payment_date, not on the return value of the function YEAR().

To optimize the query, the workaround is to rewrite it so that the indexed column is isolated from the function.

Let’s test the following equivalent query:

SELECT SUM(amount) FROM payment WHERE payment_date BETWEEN '2022-01-01 00:00:00' AND '2022-12-31 23:59:59';

Let's see the execution plan again in the image below.

Whaaattttt?? The access_type is still ALL, which means a full table scan is still being performed. But wait, something has changed: we can now see our index in the possible_keys column. At least the query sees our index now, but it's still not using it.

If we look carefully at the query, it refers to the sum of amount, which is not in the index, so MySQL has to read the amount column from disk for ALL the rows (64506). That's bad!

So if that is the case, let's extend the index with the amount column and see what the execution plan looks like now.

Now the current index looks like this: payment(payment_date, amount).
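Assuming the single-column index from the previous step was called idx_payment_date, one way to get to this composite index is a sketch like the following (index names are illustrative):

-- Replace the single-column index with a composite index that also covers amount.
DROP INDEX idx_payment_date ON payment;
CREATE INDEX idx_payment_date_amount ON payment (payment_date, amount);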

Execution Plan

Finally, it's voluntarily choosing the index because it doesn't have to read from disk anymore. This type of query is also called an index-only scan, because the index alone contains every column the query needs (a covering index).

The next day your manager applauds your performance and says: we have a feature request. We want the same report, but this time we want the sales for a single staff member.

Ohhh, that's easy now because we know how these indexes work. Let's add staff_id to the index, update our query, and see the execution plan.

Now the current index looks like this: payment(payment_date, amount, staff_id).
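Carrying over the assumed index names from the sketches above, extending the index could look like this:

-- Replace the two-column index with one that also includes staff_id.
DROP INDEX idx_payment_date_amount ON payment;
CREATE INDEX idx_payment_date_amount_staff ON payment (payment_date, amount, staff_id);

With that in place, we check the plan of the updated query: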

EXPLAIN SELECT SUM(amount) FROM payment WHERE payment_date BETWEEN '2022-01-01 00:00:00' AND '2022-12-31 23:59:59' AND staff_id = 1;

So it looks fine again. The access_type value is range, which is good, but something is still not right: the rows column. The query still thinks it has to look at 32253 rows, which is the same as when we were not limiting the query by staff_id.

Since we are limiting our result set with the constraint that we only want 2022 sales for staff_id = 1, this row count should also decrease. This problem takes us to the second pitfall.

COLUMN ORDER MATTERS

An index on (A, B) is not the same as an index on (B, A). So let's change the order of our multi-column index to payment(staff_id, payment_date, amount) and rewrite the query:

SELECT sum(amount) FROM payment where staff_id=1 AND payment_date between '2022-01-01 00:00:00' AND '2022-12-31 23:59:59'

The row count decreases and the query is optimized, because now we first narrow down to the single staff member with id = 1 and then to that staff member's sales in 2022.
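Sticking with the assumed index names from the earlier sketches, reordering the columns and re-checking the plan could look like this:

-- Put the equality column (staff_id) first, then the range column, then amount.
DROP INDEX idx_payment_date_amount_staff ON payment;
CREATE INDEX idx_staff_date_amount ON payment (staff_id, payment_date, amount);

EXPLAIN SELECT SUM(amount) FROM payment WHERE staff_id = 1 AND payment_date BETWEEN '2022-01-01 00:00:00' AND '2022-12-31 23:59:59';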

Let's take another example to understand this. Let's suppose you have a customer table.

Consider the index:

CREATE INDEX idx_fl ON customer (first_name, last_name);

The query below will work fine, as we are filtering by first_name and then last_name.

SELECT * FROM sakila.customer WHERE first_name = 'LINDA' AND last_name = 'WILLIAMS';

But this one will not be optimal:

SELECT * FROM sakila.customer WHERE last_name = 'WILLIAMS';

Since we skip the first_name column in this query, our index will not work here: as soon as you skip a column of the index, the index stops being usable from that point on. If you run the execution plan, it will look like the image below, which means no index is used and a full table scan is performed in the background.

Execution plan

By now you have a good idea of how to optimize a query, so let's take this as a challenge. Continuing the customer example, we now want to write a query where the first name is not LINDA and the last name is WILLIAMS.

That's easy, right?

SELECT * FROM customer WHERE first_name != 'LINDA' AND last_name = 'WILLIAMS';

But if you look at the execution plan now:

It's still doing a Full Table Scan. This takes us to the third pitfall.

INEQUALITY OPERATOR

Yes, an index can be used from left to right, but as soon as there is an inequality operator on one of its columns, the index stops being useful beyond that column.

So try to avoid the inequality operator on indexed columns, or make sure the columns you filter with equality come first in the index. That way, if you have an index that starts with last_name, the query below can use it for the equality condition and apply the inequality afterwards:

SELECT * FROM sakila.customer WHERE last_name = 'WILLIAMS' AND first_name != 'LINDA';
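If such an index doesn't exist yet, a minimal sketch of creating it and double-checking the plan could look like this (the index name idx_last_first is just an assumption):

-- Equality column first, inequality column last.
CREATE INDEX idx_last_first ON customer (last_name, first_name);
EXPLAIN SELECT * FROM sakila.customer WHERE last_name = 'WILLIAMS' AND first_name != 'LINDA';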

There are many more common mistakes we make while writing queries. Let's discuss them so we can tune our queries more efficiently.

SELECT Specific fields instead of SELECT *

The SELECT statement is used to retrieve data from the database. For large tables, it is not recommended to retrieve all columns, because querying a huge volume of data you don't actually need wastes resources.

If you want to retrieve all actors who appear in Love Story, don’t write the query this way:

SELECT * FROM sakila.actor
INNER JOIN sakila.film_actor USING (actor_id)
INNER JOIN sakila.film USING (film_id)
WHERE sakila.film.title = 'Love Story';

That returns all columns from all three tables. Instead, write the query as follows:

SELECT actor.* FROM sakila.actor
INNER JOIN sakila.film_actor USING (actor_id)
INNER JOIN sakila.film USING (film_id)
WHERE sakila.film.title = 'Love Story';
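If you only need a few fields, go one step further and list them explicitly, for example (first_name and last_name are columns of the sakila actor table):

SELECT actor.first_name, actor.last_name
FROM sakila.actor
INNER JOIN sakila.film_actor USING (actor_id)
INNER JOIN sakila.film USING (film_id)
WHERE sakila.film.title = 'Love Story';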

Wildcard String Searches

MySQL cannot use an index when the search pattern starts with a wildcard. If you change the pattern to 'anyLetter%' (no leading wildcard), MySQL will be able to use the index.

So if you want to search for all customers whose first_name starts with 'j', the query below will work fine and will use your index.

SELECT * FROM customer WHERE first_name LIKE 'j%';

But this one will not be optimal:

SELECT * FROM customer WHERE first_name LIKE '%j%';

As it will do a full table scan.
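You can verify this yourself with EXPLAIN on both variants, assuming an index whose leading column is first_name (for example the idx_fl index created earlier):

-- The prefix pattern can use the index; the leading-wildcard pattern cannot.
EXPLAIN SELECT * FROM customer WHERE first_name LIKE 'j%';
EXPLAIN SELECT * FROM customer WHERE first_name LIKE '%j%';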

WHERE Clause Optimization

You might be tempted to rewrite your queries to make arithmetic operations faster while sacrificing readability. Because MySQL does similar optimizations automatically, you can often avoid this work, and leave the query in a more understandable and maintainable form.

Removal of unnecessary parentheses: Avoid unnecessary and complex parentheses.

WHERE ((a AND b) AND c OR (((a AND b) AND (c AND d)))) ...

Change it to something like this.

WHERE (a AND b AND c) OR (a AND b AND c AND d) ...

Constant folding: simplifying constant comparisons helps the MySQL optimizer process your clause faster.

(a<b AND b=c) AND a=5

Change it to

b>5 AND b=c AND a=5

Counts: COUNT(*) on a single table without a WHERE is retrieved directly from the table information for MyISAM and MEMORY tables.

For example, for a MyISAM or MEMORY table, the query below is answered directly from the stored row count and doesn't need to be optimized.

SELECT COUNT(*) FROM customer;

IS NULL

MySQL performs the optimization for the IS NULL operator in the same way that it does for the equal (=) operator.

So this means both of the queries below will use the index if we have an index on first_name:

SELECT * FROM customer WHERE first_name IS NULL;
SELECT * FROM customer WHERE first_name = '';
