Write Spike Wars, Episode I: The AWS RDS m3.2xlarge Menace

I'm writing this to warn AWS users who, like us, are running an m3.2xlarge RDS instance in production. Short story: don't wait any longer, move to m4.2xlarge (or newer, if available) tonight. The m3.2xlarge has half the write bandwidth of its newer sibling, and our database was around 30% slower on it. Now, if you have some time to spend, join me in the incredible story of a team of SREs trying to solve the problem of random waves of write throughput spikes.

Have you ever visited Tsingy de Bemaraha in Madagascar? I haven't, but looking at our monitoring tools, it felt like I had (those monkeys trying to survive aren't us, but they could be).

First, A Little Bit of Context

Our product, MeusPedidos, is a SaaS application that handles the processing of thousands of orders per second. Those orders, placed by industries, sales representatives, and distributors, come from users of our web app, users of our mobile apps, third-party integrations with our public API, and finally from some huge XLS sheets imported directly into our web app.

So, with tons of data coming in every second, our app seemed like a great place to observe disk write spikes. And indeed it was:

Sorry, the historical data only comes in 1-hour periods, but believe me, that was 62.1MB in a time frame of just a few seconds.
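If you want to look at spikes like these at finer granularity before CloudWatch rolls them up, you can pull the raw per-minute WriteThroughput datapoints yourself. A minimal sketch with boto3 (the instance identifier is hypothetical, and you need AWS credentials configured):

```python
from datetime import datetime, timedelta, timezone


def to_mb_per_s(bytes_per_second):
    """CloudWatch reports RDS WriteThroughput in bytes/second."""
    return bytes_per_second / (1024 * 1024)


def fetch_write_throughput(instance_id, minutes=60):
    """Pull per-minute WriteThroughput datapoints for an RDS instance."""
    import boto3  # imported here so the helper above works without AWS set up

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="WriteThroughput",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,  # 1-minute resolution, while CloudWatch still retains it
        Statistics=["Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])


# Example (requires AWS credentials and a real instance identifier):
# for p in fetch_write_throughput("my-production-db"):
#     print(p["Timestamp"], f"{to_mb_per_s(p['Maximum']):.1f} MB/s")
```

Note that CloudWatch aggregates old datapoints into coarser periods over time, which is why only the 1-hour view survived for the screenshot above.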

Sometimes it didn't come alone, but with a bunch of other spiky friends:

Spikes everywhere. That 62MB spike started to become recurrent.

Spooky (no pun intended).

Time to Track Down the Root Cause… or Not

We figured something was so wrong in our app that we were actually hitting our write throughput limit of 62MB/s… but wait a second. Is that the expected performance for an m3.2xlarge? Let's check the docs:

Straight from the docs. Is that right? It's the m3.xlarge (one level down) that should have that performance.

Weird. But our instance has Multi-AZ enabled, so maybe it cuts the performance in half to keep the data in different Availability Zones synced through EBS. Then again, I'm just a Python developer with an SRE title who thinks he knows something about AWS architecture, so let's do some real tests. Here's the plan: create a column in a very large table, which generates a huge write operation (MySQL needs to rebuild the entire table), on an m3.xlarge, an m3.2xlarge, and an m4.2xlarge, which should have 60MB/s, 125MB/s, and 125MB/s of bandwidth respectively. Here are the results:

Multi-AZ Enabled

The m3.2xlarge is the only one not matching what the docs say. Let's disable Multi-AZ.

Multi-AZ Disabled

Now it's working as expected, so it was indeed Multi-AZ cutting the performance in half… but wait, if that's true, why do the m4.2xlarge and the m3.xlarge still show the same performance? Let's confirm with our beloved AWS Support:

[…] I would like to inform you that irrespective of the single-AZ setup or multi-AZ setup db.m3.2xlarge will have a maximum bandwidth of 125MB/s. — AWS Support
Indeed.

Yikes! Something smells; let's open a support ticket and check what's going on…
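By the way, the column-creation test is easy to reproduce. Here's a minimal sketch of what we ran (the table name, the dummy column, and the connection details are all hypothetical, and it assumes the mysql-connector-python package; it's a rough probe, not a proper benchmark tool):

```python
import time


def effective_mb_per_s(table_size_mb, seconds):
    """Rough write bandwidth: MySQL rewrites the whole table on this ALTER."""
    return table_size_mb / seconds


def time_column_creation(host, user, password, database, table):
    """Time an ALTER TABLE that forces a full table rebuild."""
    import mysql.connector  # assumes mysql-connector-python is installed

    conn = mysql.connector.connect(
        host=host, user=user, password=password, database=database
    )
    cursor = conn.cursor()
    start = time.monotonic()
    # Adding a column to a large table makes MySQL copy the entire table,
    # which is the "huge write operation" the tests relied on.
    cursor.execute(f"ALTER TABLE {table} ADD COLUMN bench_dummy INT NULL")
    elapsed = time.monotonic() - start
    # Clean up so the probe can be re-run.
    cursor.execute(f"ALTER TABLE {table} DROP COLUMN bench_dummy")
    conn.close()
    return elapsed
```

Run the same ALTER against identical data on each instance class, then compare `effective_mb_per_s(table_size_mb, elapsed)` across them; the absolute number is noisy, but the ratio between instance classes is what exposed the m3.2xlarge.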


The Incredible Story of the Ticket That Took 3 Months to Be Closed

And that's not a joke; it really took that incredible amount of time. Tons of experiments, phone calls, attachments, and trying to get refunds for our own testing to prove the problem. There were even email responses so big that they must have created a write spike on Google's mail servers (I can't confirm this). Anyway, in the end, what I can say from my experiments is that the m3.2xlarge has some serious performance problems built into it.

Here is one of my last experiments to demonstrate the problem, showing how similar the performance of the m3.2xlarge and the m3.xlarge is:

Whole execution

The m4.2xlarge, which actually delivers the 125MB/s bandwidth, finished 13 minutes earlier, and as you can see in the chart, those are the same 13 minutes the m3.2xlarge and the m3.xlarge had to spend doing extra work to compensate for the halved bandwidth.

The "write intensive operation" ends here for the m4.2xlarge, but keeps going for the others.
It ends 13 minutes later for the half-bandwidth (62.5MB/s) guys.

So I'm pretty sure that the m3.2xlarge, at least for very write-intensive operations, has the same performance as its smaller brother, the m3.xlarge, when Multi-AZ is enabled. And that's odd.
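The 13-minute gap is consistent with simple bandwidth arithmetic: writing the same number of bytes at half the bandwidth takes twice as long, so a write phase that takes 13 minutes at 125MB/s takes 26 minutes at 62.5MB/s, i.e. 13 extra minutes. A back-of-envelope check (the data volume below is hypothetical, chosen only to illustrate the observed gap):

```python
def write_phase_minutes(total_mb, bandwidth_mb_per_s):
    """How long a pure sequential write phase takes at a given bandwidth."""
    return total_mb / bandwidth_mb_per_s / 60


# Assume the table rebuild writes ~97,500 MB (an illustrative figure;
# the real table size wasn't recorded).
TOTAL_MB = 97_500

full = write_phase_minutes(TOTAL_MB, 125.0)  # what the m4.2xlarge delivers
half = write_phase_minutes(TOTAL_MB, 62.5)   # the throttled m3.2xlarge
print(f"{full:.0f} min vs {half:.0f} min -> {half - full:.0f} min slower")
# prints "13 min vs 26 min -> 13 min slower"
```

Whatever the real data volume was, the ratio is fixed: half the bandwidth means the write phase doubles, which matches the charts above.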

Here's the thing: AWS actually encourages its users to migrate to newer database classes, and you pay a lower price for better performance when you move up a generation. That's what we did in the first days after opening the infamous "3-Month Ticket"; we were sure something very wrong was happening with the m3.2xlarge. And well, after moving to the m4.2xlarge, we noticed an improvement of at least 30% in our database processing time. I actually think it was around 50%, but unfortunately I didn't store that New Relic metric to show here (excellent SRE job on my part 🙄). Why is the m4 faster? I don't know the details, but one of its selling points is so-called enhanced networking.

So we got a really significant speed boost and doubled our write threshold. With that in place, the downtimes we used to have from hitting that threshold became very rare, so we could finally start to handle the…

Attack of the Write Throughput Spikes

Thanks, but we're just enduring it better; we still don't know what the hell is going on.

I talked a lot about why staying on the m3.2xlarge is a bad idea, but what about the spikes? What was the actual root cause? Well, this blog post is already too big, and I need more views, so let's split this story in half (like the m3.2xlarge did with its performance) and get together again in the upcoming blog post called…

Write Spike Wars, Episode II: The Attack of the Reads

But you'll be able to, young Padawan.

One last disclaimer: in the end, AWS never confirmed that the m3.2xlarge was at fault; instead, they spent a lot of time (3 months) trying to convince us that the m3.2xlarge didn't have any problem. They said they opened a CloudWatch bug for showing wrong values in the write throughput metric, which was the only actual problem they could find. Besides that, they dismissed my tests, saying that a column creation wouldn't be accurate enough to benchmark write performance, given that it involves tons of factors. Maybe I'm too dumb to really understand what's going on (I'm no AWS expert), but no one could ever justify why my tests and premises were wrong. I can't say the support quality for this case was bad, since they even allowed some phone calls that weren't "expected for our support plan", but they really didn't convince me and my colleagues that everything is right with the m3.2xlarge. If you're reading this and actually understand the inner workings of RDS/MySQL, I'd be very glad to hear the actual explanation. See you in the next episode.