Parag Garg
Jul 8 · 6 min read

Table of Contents

In Tokopedia we follow microservice architecture and almost every service has its own database. Data on each service is increasing exponentially over the time. Therefore, having the right database, schema and proper indexes is paramount to match Tokopedia’s scale. In this post, I want to share insights from some of the challenges we have faced while working on MySQL database in Tokopedia. Let’s unleash some MySQL chi!!

Adding pagination to an API.

Let’s take an example where we need to create a paginated API which returns users data. For this we need to write SQL query, a basic version of query for this example would look like below :

SELECT id, name, company FROM users ORDER  BY id ASC LIMIT  10 offset 10;

In the above Query we are using limit and offset to paginate over the result. But will it work efficiently for 10000000 offset ? The answer is no, because MySQL will have to scan all 10000000 records and then return 10 records. If data is very large this paginated query will start slowing down for large offsets. To solve this problem instead of passing offset (page number), we can pass last id(primary key) visited. MySQL has index on id as it is a primary key. Therefore, MySQL can quickly find id in this index and will fetch us first 10 rows. explain plan will show that number of rows scanned are 10 only. So the updated query will look like below:

SELECT id, name, company FROM users WHERE id > 10 ORDER BY id ASC LIMIT 10;

Not Equal Query in MySQL

Let’s take another example where we need to find all the records with != (Not equal) constraint. Indexes in MySQL do not work on queries with != constraint . Performance of != query on large data (for millions of record) can be very poor, because MYSQL will iterate each and every row and check if condition is true or not. Here is the example query below.

SELECT id, name, company FROM users WHERE company != 'tokopedia' LIMIT  10;

One way to solve above problem is to categorise the users. So we can add new column category with possible values 1 (Tokopedia), 2 (Non Tokopedia) user and create index on category. So updated query for above problem can be seen below.

SELECT id, name, company FROM users WHERE category = 2 LIMIT 10;

In the above query, we have index on category so it will return first 10 rows very efficiently.

Optimising OR Query

OR queries are not supported by indexes and therefore can be expensive if not optimised correctly. Let’s take an example of OR query and then optimise it to use indexes:

INNER JOIN company c
ON = u.company_id
WHERE c.address_id = 1
Index is not used and all rows are scanned (total dummy rows in company are 4)

Now even if we have index on address_id and name separately, all the rows will be scanned because of OR in where clause. We can check this by usingexplain command.To optimise above query we can rewrite above query as below:

JOIN company c
ON = u.company_id
AND c.address_id = 1
JOIN company c2
AND = u.company_id;
Index is used and total rows to be scanned are 1 for each table.

Above we are trying to Join Users table on two different conditions with company table so in both cases index is used .We can see from the explain command total rows scanned in each table is 1 and both idx_address_id and idx_company_name are used. Alternatively, we can use UNIONalso for above case as given below:

JOIN company c
ON = u.company_id
AND c.address_id = 1
JOIN company c
ON = u.company_id

Count(1) vs Count(*) efficiency

Most of the time, to count total rows we use Select Count(*) from table_name. But this query is not that efficient because MySQL has to get all the columns before actually calculating the count. We can replace above query with below query to make it more efficient:

SELECT Count(primary_key_column) FROM table_name

Since, primary_key_column (Column of type primary key) is already indexed and indexes are stored in memory count query will be fast. We can use any column which is indexed and is not null type.

Covering Index

Covering index is a special type of composite index where all the columns to be selected exist in the index. For example instead of writing Select * from user where id in (1,2,3) where we only need to know the first_name and last_name of the user we can create index i_f_l(id,first_name,last_name) and rewrite query to Select id, first_name, last_name from user where id in (1,2,3). Since all the selected columns are already available in the index, MySQL does not need to further lookup on the disk. It makes you query very fast!.

Note : While covering index can help, creating it on large number of columns would decrease the performance.

Why ??

  1. Index is stored in RAM and takes lot of space. When number of indexed columns increase, index size also increases. If total size of index increases more than RAM, then? In that case, all indexes will not be present in RAM and rest of the index will be stored on disk. When MySQL loads index from disk to RAM, page swaps increase. As a result, disk IO increases which ultimately increases CPU.
  2. On each insert/update/delete operation, index has to be changed as well. This makes these operations more expensive.

Sargable Queries

Just because we add an index to our table doesn’t mean we will get its benefit. Query running against that table needs to be written in such a way that it takes advantage of that index.

SARGable, or “Search Argument-able”, queries therefore are queries that are capable of utilising indexes.

We Should not use functions or Cast statements on the columns which have index. For Example:

SELECT ...FROM   ...WHERE  created_at + interval 30day < now();

In this query MySQL optimiser can’t use an index on created_at, even if one exists. It will literally have to evaluate created_at + INTERVAL 30day for every row of the table. We can rewrite this query to use index by removing the expression on created_at, so updated query is :

SELECT ...FROM   ...WHERE created_at < NOW() — INTERVAL 30days;

Composite Index

Composite index consists of multiple columns (e.g., index a_b(a,b) is a composite index which contains 2 columns in one index). Let’s have a look at few peculiar cases and see how MySQL makes use of the index a_b.

No Index used:
Case 1. where b=1
Why? Because the order in which columns are declared is very important. For the above index a_b, a,b is the order which means that a has to be used in where clause.

Partial index used (only a column of the index):
Case 2. where a = 1
Case 3. where a > 1 and b = 2
Why? Because if first column in index (a in this case) uses range expression like (<, <=, >, >=), then the remaining columns in index (b in this case) will not be used by the index while executing query.

Full index used (both columns a and b of the index):
Case 4. where a = 1 and b = 1
Case 5. where b = 1 and a = 1
Why? Because = has been used everywhere in clause, therefore columns can come in any order in where clause)
Case 6. where b > 1 and a = 1
Why? Because a is the first column in index and uses = in where clause. Read case 3 above again if you are confused.

We have shared some of the common problems faced while optimising queries in a database and how to solve them. if you have any suggestion or problem optimising a query please reach us by commenting below.

Tokopedia Engineering

Story from people who build Tokopedia

Thanks to Rajya Vardhan Mishra

Parag Garg

Written by

Senior Software Engineer @ Tokopedia |

Tokopedia Engineering

Story from people who build Tokopedia

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade