Write Spike Wars, Episode III: A New Hope

Improving monitoring and query tuning techniques.

Israel Fonseca
mercos-engineering
8 min read · Aug 29, 2018


If all those problems with write throughput were actually caused by unoptimized SQL, we'd better improve our skills at monitoring and handling those queries. In this post, the last part of the trilogy, I'll give some advice on how to detect and fix those nasty queries. With a bit of skill (and luck) this post may aid you in the quest for an extremely well optimized MySQL database.

Your database after some optimizations.

Step #1: Find the smoke before the fire

First of all, you need some kind of monitoring to detect weird behavior in your database, and preferably one that isn't your customers calling the support team (but that may work too). Two metrics are very useful for that matter:

CPU Spikes (11:00 and 17:00)
Write Throughput Spikes (17:00 and 21:00)

Set up an alarm for those metrics, remembering to use the maximum value, not the average value; that way you'll be notified more easily. In these examples, alarm thresholds of 40% CPU and 6 MB of write throughput could be a good start, because those values are above what's normally expected.

Let's follow this occurrence:

At 10:20 I received an alert of a spike that went over 50%.

Bonus Tip #1: sometimes we want more info to pinpoint the exact time of the spike (this will be useful in the next step). For that you can enable the performance_schema, which can give you lots of interesting data. With RDS you can enable Enhanced Monitoring and get that data too. The UI for this feature is very slow and unresponsive; my advice is: click and wait (a long time) for every action, and don't keep clicking at everything because it will not work. If you survive that, you are gifted with some extra and very useful metrics. Here are my "best picks" to help you get better insight:

Useful metrics to check when things go wrong

The explanation for each metric can be found in the official RDS documentation.

Bonus Tip #2: Another good metric is IOPS, but lately I’m not using it that much to trace weird database behavior.
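If you prefer a SQL prompt over dashboards, the performance_schema tables can answer "what was eating time around the spike" directly. Here is a minimal sketch, assuming performance_schema is enabled (it can't be turned on at runtime; on RDS it's a Parameter Group change plus a reboot):

SHOW VARIABLES LIKE 'performance_schema'; # must be ON, needs a restart to change

# top 10 statement digests by total execution time (timers are in picoseconds)
SELECT DIGEST_TEXT,
       COUNT_STAR AS calls,
       ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_seconds,
       SUM_ROWS_EXAMINED AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;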

Know your enemy: Slow Query Log

The alarm has triggered; now you must find the root cause. As you may remember from my previous post, slow queries can affect write throughput. So first of all, here is my advice: don't bother (too much) looking at other causes of problems while YOU KNOW that some queries need optimization, because they are probably the root cause. So how can we get this slow query log? First you should make sure it's enabled (it usually isn't by default); on RDS it's just a few clicks in the Parameter Group, but this depends on your installation, so Google it, no big deal. After that you must download it from the database, which also depends on your installation and can probably be done through the UI if you host your DB with some cloud provider, but again: Google it (easiest blog post of my life).
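If you manage MySQL yourself, you can check and enable it straight from a SQL session. A minimal sketch (on RDS these variables live in the Parameter Group, so set them there instead of with SET GLOBAL):

SHOW VARIABLES LIKE 'slow_query_log%'; # is it on, and where is the file?
SHOW VARIABLES LIKE 'long_query_time'; # current slowness threshold in seconds
SET GLOBAL slow_query_log = 'ON'; # enable it
SET GLOBAL long_query_time = 1; # log anything slower than 1 second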

On AWS, just find the latest and download it.

But here is some useful knowledge. To parse that file, MySQL ships a tool called mysqldumpslow that isn't ideal but may help you eventually; my suggestion is that you download percona-toolkit, which has the fabulous pt-query-digest that summarizes the data inside the slowquery.log. Using it is very simple:

pt-query-digest mysql-slowquery.log --since '2018-07-27 13:20:28' --until '2018-07-27 13:22:42'

Choose the time range of the event, remembering that your database is probably storing its log data in UTC, so in my case I had to add 3 hours to 10:20. The output will look like this:

We have an overview of all queries executed during the time slice; for example, a total of 331.19M rows were examined. Below that, we have a ranking of the most time-consuming queries (the sum of all occurrences) and some columns giving more information (full explanation in the official docs). My recommendation: catch queries that have fewer calls but represent a big percentage of the total time in the report. For example, the query ranked #6 was executed only 9 times (the others ran around 300 times) while accounting for almost the same share of total time (3.4%). Those abnormalities are usually the ones that cause "random spikes" of behavior in the database. After that, just scroll down the report to grab the query itself. Now we can start the fun part: query tuning.

One last piece of advice: never underestimate any of the queries in this report. Following my advice you will likely find the culprit earlier, but every query in this report is a possible source of problems; after all, it's called the slow query log for a reason.

Query Explaining

We got our query to tune, but what to do? The first step is to run the EXPLAIN statement on that query to check if the indexes are working as expected. Here is an example of a query being explained:

explain select *
from Orders orders
join Itens itens on orders.id = itens.order_id
where seller = 'Bond' and creation_date = curdate();
Explain output

There is a lot going on here. First check the "rows" column, which shows how many rows will probably be scanned per step. Remember that the total number of rows scanned is roughly the product of the "rows" value across every row of the explain output (well, not always), so in this case it would be 3 x 1 = 3. Things start to get worse when you get things like 324 x 4567 =~ 1.4 million. After that, the "key" column shows which index is actually being used, and "possible_keys" is self-explanatory.

This is just a quick overview; the subject is much broader, so I recommend these slides for a better introduction and the MySQL docs to really dive deep.
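If the output shows no "key" and a scary "rows" count, the usual first move is a composite index over the filtered columns. Here is a minimal sketch for the example query above (the index names are invented, and you should check the real data distribution before copying this):

CREATE INDEX ix_orders_seller_creation_date ON Orders (seller, creation_date); # hypothetical
CREATE INDEX ix_itens_order_id ON Itens (order_id); # hypothetical
# re-run the EXPLAIN and check that "key" now points to the new indexes

Remember that creating an index on a big table is itself an expensive write operation, so pick a calm hour to do it.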

Query Profiling

We used EXPLAIN and now we know beforehand the possible issues with the query. But we can go further and actually profile every tiny aspect of the query execution, which can be very useful sometimes. (Note that SHOW PROFILE is marked as deprecated in recent MySQL versions in favor of the Performance Schema, but it still works.) Using it is quite simple:

SET profiling = 1; # activate profiling
<YOUR QUERY>; # run the query
SET profiling = 0; # disable profiling
SHOW PROFILES; # show every query that was run during the profile
SHOW PROFILE ALL FOR QUERY <QUERY_ID_FROM_PREVIOUS_STEP>; # profile details for the selected query

That last statement will show a detailed information like this:

Given that the query execution by itself took 22 s 642 ms, we can see that "Sending data", which means the server is reading and processing rows (usually with lots of disk reads), is really the issue with our query. For a description of every state, check the docs.

Remember that all the states are listed in order of occurrence, so in that example the "Creating sort index" happened later, and so on. This can be helpful to find odd behaviors, which may give you some hints about which aspect of the query you should tune.

The minimum that you should know about indexes

There's a lot to know, but I'll give some general guidance.

  • Indexes usually use a B-Tree, which is not the same as a Binary Tree, to hold their data.
  • B-Trees are very efficient, O(log n) for search. For 1 Million rows, at most 20 comparisons will occur. And there are plenty of optimizations on top of that.
  • Indexes are stored "as a separate table". So if you have a table with 10 million rows and 5 indexes on it, you will have 5 "auxiliary tables", each holding the columns used in the index and up to the same number of rows (no duplicates). Keep that in mind to understand the impact on IOPS, database size and write throughput during the creation of new indexes, as it may affect your production environment.
  • The order of the columns in the WHERE clause doesn't matter for an index, but the order of the columns in the index itself does if you are not using all of them. For example, index(a, b) works the same with WHERE a = ? AND b = ? and with WHERE b = ? AND a = ?, but it will not work with WHERE b = ? alone: the restriction must contain column a, because the index definition starts with a (see the sketch after this list).
  • The LIKE statement can be optimized by an index if the wildcard is at the end. For example LIKE "James%" is fine, LIKE "%James%" is useless.
  • Indexes also help other operations like GROUP BY and ORDER BY, because the index itself is ordered. When you use an ORDER BY and have an index on that column, the "ordering operation" is almost free of charge.
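To make the column-order rule concrete, here is a minimal sketch with a made-up table and index; run EXPLAIN on each query and check whether the "key" column is filled:

CREATE INDEX ix_a_b ON my_table (a, b); # hypothetical composite index

EXPLAIN SELECT * FROM my_table WHERE a = 1 AND b = 2; # uses ix_a_b
EXPLAIN SELECT * FROM my_table WHERE b = 2 AND a = 1; # same thing, WHERE order doesn't matter
EXPLAIN SELECT * FROM my_table WHERE a = 1; # still uses ix_a_b (leftmost prefix)
EXPLAIN SELECT * FROM my_table WHERE b = 2; # can't use ix_a_b, column a is missing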

There's much more on the subject. MUCH more. I recommend the following: Use The Index Luke and Rick James's site, which is very ugly but very rich in content (take a look at this one about composite indexes).

Binary logs

Finally, it will not always be easy to find the culprit with all those tools, and we may resort to another one which was not designed for monitoring: the binary log. The binary log is a big file registering everything that changed data in the database, and it is used for MySQL replication. We can use that file to track down which inserts happened and at what time, and possibly find some interesting behavior. We wrote a simple app to parse those binary log files and print some useful data. Give it a look!
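For a quick peek without any extra tooling, MySQL can list and read its own binary logs from a SQL session. A minimal sketch (the file name below is hypothetical; with row-based replication you'll see low-level row events instead of readable SQL, in which case the mysqlbinlog CLI with --verbose gives a friendlier dump):

SHOW BINARY LOGS; # list the available binary log files and their sizes
SHOW BINLOG EVENTS IN 'mysql-bin-changelog.000123' LIMIT 50; # peek at the events of one file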

Conclusion

So that's it. I hope this advice helps you track down those nasty slow queries. If you think I made some mistake, or that I'm missing some important aspect, tell me! I've finally put an end to this blog post trilogy about databases, and now I'm free to write about other subjects, so keep an eye out. ;)

And we are hiring Infrastructure Engineers / DevOps / Other Fancy Names. Join us and help me handle our database issues.

Join us!
