Write Spike Wars, Episode II: The Attack of the Reads

First and foremost, this is not like Star Wars that you may watch the Episode IV before the Episode I. I DEMAND that you first read the actual Episode I of this series of blogposts: Write Spikes Wars, Episode I: The AWS RDS m3.2x.large Menace. Come back here when you are done and together we shall rule the database.

Sorry, it's too hard to change the gif subtitle.

So we were still having some write spikes, but now it wasn't enough to actually harm our users. We knew from our monitoring that it only happened in weekdays, never on weekends or holidays, but we still couldn't figure out what was triggering it. Knowing only that it was indeed an "excessive writing" problem, we created some hypothesis of the cause:

  1. We allow tons of products (over 30k) to be input in the system through an XLS file. Not only products, but actually some other related entities as categories and alike.
  2. Our public API that allows our partners to create almost anything in our system. Orders, products, customers, categories and so on.
  3. MySQL misconfiguration.

Let's see what we did about every one of those suspects:

1. The "big sheet"

Our code responsible for the XLS import is far from perfection. Through the time, we worked to change it from a "dangerous portion of legacy code" to a "somewhat well tested place that I rather be very far away". For historical reasons, some portions of our code don't handle transactions properly, which means auto commit is enabled by default at lot of places, and of course that our beloved file importer was no different.

Would one single transaction for every product/category in a huge file, in a highly concurrent environment be an overhead capable of causing our spikes? After improving our logging system, we noticed that it was indeed common (but not a rule) to have big file import during those spikes. So let's fix this transaction management issue, which is as simply as wrapping a transaction over the entire import process, it should be simple. And it was, but after the fix we were still having those spikes. Next!

2. The Public API

And what about our public API? In our business model it is expected that our customers use our API for keeping their ERPs in sync with our system. But if you ever deployed something like it, you know that not every client will implement this integration in the “optimal way”. We had very weird cases of people parsing JSON "manually" byte per byte, so when one day we changed the spacing of the JSON output we broke their integration. And of course, there were some edge cases like people checking for updates in the system using constructs likewhile true: api.get(). DDoS should not be a feature.

Even so, our monitoring didn't catch signs of something off, like surges of requests during the write spike periods. So far we didn't have an throttling system in our API so we implemented it as it seemed to be a good time for it. As you imagine, the spikes were still happening, but at least it solved a lot of other problems regarding our API protection.

3. MySQL misconfiguration

Our database is old, and it has been tweaked along the years, tweaks that in the early days were not documented and we don't know anymore why they were applied. Some of them were changed to try fix some problems with IOPS that we had in the past along with the old trick (but not quite advised) to increase the RDS disk size to get more IOPS for a smaller price. And you know what? Maybe those IOPS problems were already a presage for this write problem.

So we went back to factory default just to see what happened. And well… nothing happened. Dead end. At least now we knew that our database didn't have any weird configuration magic applied to it.

Our Custom Tooling to track the problems down

And well, along all of this we actually wrote one app to read MySQL's binlogs to facilitate tracking portions of the application which were actually doing more inserts/updates in our database. It didn't help us to find the problem of write throughput, but it was actually very useful for benchmarking some operations and seeing which accounts of our app were doing more inserts/updates (and at which table). It is very interesting, give a look at https://pypi.org/project/mysql-binlog-explorer/, it may be useful for you one day.

Time to ask for help: The Consulting Company

We waited until the very end to use this last resort, after all there is always the chance to pay a lot to get an answer like: "change this config from 0 to 1 and all your problems disappear". That's why we tried everything before getting a third party in the game. But after so many work, paying for that kind of answer would be very welcome anyway. So we hired Nazar.IO.

We told to them basically what I've written so far (with more details). After around 2 weeks of monitoring and trying to follow the lead of a "MySQL" problem they moved forward to profiling some slow queries.

And one day they came back with one query…

The Root of All Evil (Right below our noses edition)

So here it is, the simplified version of the infamous query. Two points in here:

The Kraken
  1. It was fast actually "fast", around 2,5 seconds of total execution time. Some reports were much slower than that.
  2. Running the above statement 8 times in parallel PUT THE WHOLE SYSTEM DOWN.
And finally the true evil have shown its true form.

And what is this query about? It's triggered by a search field where you can type down an name/email/id from a account in the system and you go to the its details. It's a feature that only our internal support team have access, its not used by our customers. Remember when I said that the spikes didn't happen on weekend or holidays? That's why. And why this query was so destructive? If we use an explain command we can see that it's using file sort (which besides the name not always means disk sort, but in that case it was) and if we profile it, there's a lot of data being moved internally. The fix was easy: just add an index to company.name column and another to user.email which both are used by one like statement. We had to remove the leading wildcard to be able to use the indexes too, small business impact though. No more downtimes.

And the question: Why we didn't notice that earlier?

  1. The query wasn't that slow and it wasn't commonly executed to show up easily in the slowquerylog. Remembering: it was only used by our internal support team. The spikes only occurred when by luck (or lack of) two or more members of the support team used the search at the same time.
  2. During the spikes/downtimes/slowdowns every other query was slow too, so it gets really hard to differ cause from consequence.
  3. Prejudice: It was a WRITE problem. Even knowing that bad queries could go to disk, we definitely didn’t imagine that bad queries could make that huge impact on write throughput. Living and learning.

This last aspect, the prejudice, indeed was the biggest problem. We really underestimated the impact of query performance in the system. Knowing that our approach would be very different. Now we are hunting down all slowqueries that shows up. I'll tell about those techniques and how we are approaching that in the next blogpost. Our lesson here was: always keep an eye on your slow queries, even apparently harmless queries can take you down.

Escaping from our own doom.

P.S: My thanks to Nazar.IO team, which helped us to better understand the impact of slow queries.

P.S #2: We are hiring!