Following is a sequence of few events which happened with me on a normal work day as a software engineer at ClearTax.
It was a normal day, I was working on some new feature additions in the software and suddenly an issue pops up.
The Problem At Hand
Few customers have suddenly started to feel slowness and intermittent failures in a particular feature. I immediately checked our logs and noticed that a particular query was taking much longer than expected for a few users, and even timing out for few others.
Pretext to understand the issue: We use MySQL 5.7(InnoDb) as our main database. The table being talked about is a frequently used table, having around 300M rows approximately, with a lot of additions and deletions from the table happening on a regular basis.
Some more details after initial triaging
I started wondering what happened suddenly that query performance degraded. When I dug deeper into the issue using the EXPLAIN command, I noticed that this query was not using index which it should be using.
Since this feature was being used extensively in production from past few months without much issues and no change pertaining to this feature went in last few weeks, there is no way the given query was wrong. But to verify, I checked the query on other databases with similar schemas (QA and dev environment) and even on replica database. The same query used the required index on these databases, which I got to know using EXPLAIN command.
At this point, I was literally clueless about why this is happening and I panicked and rushed towards the coffee dispenser, my usual escape when I am stuck with some problem and had quite a few cups of coffee and regained my consciousness. Then I started reading more about why this can be happening.
The Root Cause
Before running any query, MySQL optimiser makes decisions about the execution plan (which can be seen via EXPLAIN), which in turn uses table statistics and conditions in clauses like WHERE, ORDER BY, etc. So, when MySQL optimizer is not picking up the right index in EXPLAIN, it could be caused by outdated table statistics. Table Statistics can become outdated if there are a lot of insertions and deletions on the given table (which is true in our case).
We found the root cause to be the Outdated Table Statistics. As per the MySQL documentation, these statistics for a table can be re-estimated explicitly by using the following methods:
- ANALYZE Operation. This clears the current statistics and new statistics are collected when the table is accessed next time. Statistics are collected using some random sample pages from new data in the table
- OPTIMIZE Operation. This rebuilds/defragments the table data so that data is stored in more compact way, updates the indexes accordingly and then does an ANALYZE on the table.
In my opinion, the latter option (OPTIMIZE) should be used here, since there are a lot of deletions on this table which could have resulted in table data to be highly fragmented. This can help us in using storage more efficiently along with providing the solution to the above problem.
OPTIMIZE operation uses Online DDL for InnoDB tables, which reduces the downtime for concurrent DML operations. This operation is completed “in place” and an exclusive table lock is taken briefly, only during the prepare and commit phase of the operation. During the prepare phase, a lock is taken briefly over metadata and a new table(intermediate) is created. During the commit phase, table metadata changes are committed.
OPTIMIZE operation cannot be done using INPLACE algorithm when FULLTEXT indexes are present. There are some FULLTEXT indexes on this table. So, table copy method is used instead. which implies having a longer downtime window for concurrent DML operations.
Of course, downtime cannot be an option here since this is a key table and this will mean downtime for the complete application, which is not acceptable.
I was literally stuck here for some time wondering how to achieve this without impacting the user experience and avoiding downtime.
How to achieve this without Downtime?
Then I started thinking, can I do something similar to what OPTIMIZE operation does internally in case of Online DDL which allows concurrent DML operations. Basically I was thinking to go ahead with the following steps and I should be able to avoid downtime.
- Create a new table similar to the existing table
- Copy the table data into the new table
- Run an ANALYZE on the new table
- Swap the current table with the new table in an atomic operation with a one time lock, which will be done in perceivably zero downtime.
The DML operations will not be blocked on the main table while this process is going on. All the writes on this table can be copied to new table using triggers which I’ll create prior to beginning the data copy operation.
💡At this point, it struck my mind that I’m already familiar with a tool which can take care of most of these steps for me.💡
Percona Toolkit (pt-online-schema-change)
Percona toolkit is a collection of advanced command-line tools, which can be used to perform a variety of tasks which can be too cumbersome to perform manually. This toolkit contain multiple tasks which can be used as a “one-off” script or some tasks which can be required to be done on a daily basis. These can be ideal, since these are production tested tools with comprehensive documentation.
They have a tool for online schema migrations called pt-online-schema-change. This tool helps in altering the table structure without blocking concurrent DML operations. This tool emulates how MySQL internally performs ALTER operation but on a copy of the original table. That will imply that the read and writes on the original table are never blocked.
This works by creating a copy of the table without any data, modifying the table as desired and then copying the data from the original table into this new table. When the copy is complete, it replaces the original table with the new table (it also drops the old table by default), using an atomic RENAME TABLE operation which require a one-time lock on the table for a very brief period of time. Any modifications to data in the original tables during the copy will be reflected in the new table, because the tool creates triggers on the original table to update the corresponding rows in the new table.
Since this creates a new table does a one time copy of data, the table will not need to be defragmented again. So just running an ANALYZE will be enough.
Using this tool for the problem at hand
The only catch here is that pt-online-schema-change is built to run database migrations, so it expects an ‘ALTER TABLE’ command.
This is not a big concern, since I can obviously find a column to modify which will not have much impact on performance. There was a field with type as
VARCHAR(500) which can be modified to something like
VARCHAR(501), and will not have much performance impact. If this is not possible in a particular case, one can obviously add a new column and then remove that column in two different operations.
The following command can be used to achieve our desired output:
pt-online-schema-change --alter="<YOUR-COMMAND-HERE>" --analyze-before-swap --host=<DB-HOSTNAME> --port=<DB-PORT> --user=<DB-USER> --password=<DB-PASSWORD> --execute D=<DB-NAME>,t=<TABLE-NAME>;
Here, the parameter
--analyze-before-swap is used to tell percona toolkit to run an ANALYZE operation on the new table before swapping the table with the old one.
Time taken by this command to run depends upon the amount of data present in the table. This will take a hell lot of time since the amount of data in this table is significantly huge but will not have a downtime.
PS: I hope you like the article, please do share your feedback and correct me if I’m wrong at any point. This may not be a new thing for many of you, but I just wanted to share my experience, which may help somebody out who is stuck in a similar situation.
Also, if you like the article please share it with the folks who would be interested in this and please click the applause icon to increase it’s visibility for others.