The moment of truth when a massive traffic spike hits — “Case M”

Aller Media Tech blog · Jun 6, 2019
Photo by Jake Blucker on Unsplash

How do you know when the relentless work of improving your service's performance and reliability has paid off? This is usually one of the most difficult things to verify, since the “moment of truth” tends to be a situation where everything is expected to break down but things keep running smoothly anyway.

In our context, high performance and high availability are bread and butter, since we serve approximately 2.5 million users weekly, day and night. This would be an easy task if usage of our service were spread evenly across the 24-hour period. However, due to the nature of our business, user volumes tend to peak without warning when news breaks or some other phenomenon outside our control occurs.

In this article we walk through “Case M”, which caused one of these “moment of truth” situations and turned out to be proof that our relentless work on infrastructure evolution has paid off.

DevOps transformation

For the past three years we have been focusing on moving to auto-scaling containers and a full DevOps model. There has been a tremendous change in how developers see themselves as responsible for operational performance, which I truly think is the main goal of starting the DevOps journey in the first place. This in turn has enabled us to accelerate our infrastructure and application evolution far beyond what we were able to achieve before, which is critical to our ability to handle massive traffic spikes without any issues. Comparing us to other actors in the market, I can say with confidence that we are in a good position relative to our competitors.

I will let Jani Sinkkonen, one of the champions in our product teams, run you through a more technical description of the moment of truth in question.

The technical perspective

Before getting to the more or less epic traffic spike that “Case M” caused, here is a short list of things done along the way.

Starting point

Our service in question has been running on AWS Elastic Container Service (ECS) since 2016.

Some details of the setup we started with:

  • A basic CloudFront setup, just enough to get going
  • Drupal 8 for both back end and front end
  • RDS db.r4.large (MySQL 5.7)
  • Drupal internal cache stored in the database
  • Containers with 512 CPU units and 1024 MiB memory reserved (see the sketch after this list)
  • No dashboards of any sort
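
For reference, reservations like those above are set per container in the ECS task definition. The following is only a minimal sketch assuming boto3, with a hypothetical family name, image, region and container name, not our actual task definition:

```python
# A simplified sketch of how CPU/memory reservations look in an ECS task
# definition (family, image, region and container name are placeholders).
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

ecs.register_task_definition(
    family="example-drupal",
    containerDefinitions=[
        {
            "name": "drupal",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/example-drupal:latest",
            "cpu": 512,                 # CPU units (1024 units = one vCPU)
            "memoryReservation": 1024,  # soft memory limit in MiB
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        }
    ],
)
```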

With that setup we faced constant slowness and roughly five-minute breaks in the service when deploying new versions and clearing Drupal caches (which needs to be done quite often).

Present day

Ever since starting the DevOps journey we have been relentlessly working on the speed and reliability of the application, even down to daily infrastructure changes.

A few details about the setup at the time of writing:

  • CloudFront behavior improvements
  • An AWS CloudWatch dashboard
  • Drupal internal cache moved to Redis (AWS ElastiCache)
  • Containers downscaled to 256 CPU units and 512 MiB memory reserved
  • RDS migrated to an Aurora cluster, with instances downscaled to db.t2.medium

CloudWatch dashboards gave us a much clearer view of what is happening in the application, which helped us make fixes and improvements before issues caused any real trouble. They also led us to investigate better caching options for Drupal, for which we chose AWS ElastiCache Redis. That in turn revealed we had reserved too many resources in ECS, so the reservations per container could be halved.
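
Dashboards like this can also be managed as code. The snippet below is only a sketch assuming boto3, with hypothetical cluster, service, database and region identifiers; it creates a single widget that plots ECS CPU utilization next to RDS database connections.

```python
# A minimal sketch (not our production dashboard): create a CloudWatch
# dashboard with one widget tracking ECS CPU utilization and RDS connections.
# Cluster, service, DB cluster and region names are placeholders.
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "App CPU vs. DB connections",
                "region": "eu-west-1",
                "stat": "Average",
                "period": 60,
                "metrics": [
                    ["AWS/ECS", "CPUUtilization",
                     "ClusterName", "example-cluster", "ServiceName", "example-service"],
                    ["AWS/RDS", "DatabaseConnections",
                     "DBClusterIdentifier", "example-aurora-cluster"],
                ],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="example-service-overview",
    DashboardBody=json.dumps(dashboard_body),
)
```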

After Redis was implemented, our database instances were massively overpowered, so we ended up downsizing them as well and moving to an Aurora cluster. We also thought we would easily take advantage of read replicas, but alas, that is another story, maybe for another day.

The “Case M”

One gray Monday morning in February we got a serious unplanned “regression test” on our production site. Traffic spiked to 100x and the request count to 35x, roughly ten times the request count of a normal high-intensity period. To make things worse, the spike hit during a low-traffic period, right before the application would normally have started to scale up along with the increasing number of users.

One of the big reasons for surviving such spikes is AWS CloudFront, as it reduced the requests reaching the application to just ~4% of all requests. The rest were served directly from cached resources on CloudFront itself.
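
To put a number on that offload, one approach (a sketch, not necessarily how we measure it) is to compare CloudFront edge requests with the requests that actually reached the origin. The example below assumes boto3 and that the origin sits behind an Application Load Balancer; the distribution ID, load balancer name, region and time window are placeholders. CloudFront metrics are published in us-east-1 under Region=Global.

```python
# Rough sketch: estimate how much of a spike CloudFront absorbed by comparing
# edge request counts with origin (ALB) request counts over the same window.
# All identifiers, regions and dates below are placeholders.
from datetime import datetime, timezone
import boto3

start = datetime(2019, 2, 4, 6, 0, tzinfo=timezone.utc)   # placeholder window
end = datetime(2019, 2, 4, 9, 30, tzinfo=timezone.utc)

# CloudFront metrics live in us-east-1, with the Region dimension set to Global.
cf_cw = boto3.client("cloudwatch", region_name="us-east-1")
edge = cf_cw.get_metric_statistics(
    Namespace="AWS/CloudFront",
    MetricName="Requests",
    Dimensions=[
        {"Name": "DistributionId", "Value": "EXAMPLE123"},
        {"Name": "Region", "Value": "Global"},
    ],
    StartTime=start, EndTime=end, Period=3600, Statistics=["Sum"],
)

alb_cw = boto3.client("cloudwatch", region_name="eu-west-1")
origin = alb_cw.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/example-alb/0123456789abcdef"}],
    StartTime=start, EndTime=end, Period=3600, Statistics=["Sum"],
)

edge_total = sum(p["Sum"] for p in edge["Datapoints"])
origin_total = sum(p["Sum"] for p in origin["Datapoints"])
print(f"Share of requests hitting the origin: {100 * origin_total / edge_total:.1f}%")
```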

Looking at the graph of total requests, the spike is quite evident; the graph shows the scale of that half hour and the following three hours of the morning.

CloudFront total request count

A similar story can be seen in the status code graph for the same period.

CloudFront total HTTP response codes

The interesting facts about the status codes during the spike are:

  • 2xx: 95.82%
  • 3xx: 1.19%
  • 4xx: 0.33%
  • 5xx: 2.66%

In other words, we had only a 2.66% error rate during a sudden 100x traffic spike. In the old way of running similar services, the ratio would probably have been the other way around. What makes this even better is that the CloudWatch dashboard showed us the situation in real time throughout the whole incident. In the worst case, with the traditional way of running the service, it might have taken us the whole morning to figure out what was actually happening.

However, this all raises the question: why did we have that 2.66% error rate at all?

Looking at the dashboards, it was almost immediately obvious what caused the errors, as we saw a spike in the database connections graph: when scaling up that fast, we hit the database's maximum connection limit as new containers and workers were all trying to connect at the same time. However, the situation fixed itself automatically once the scaling operations finished and connection pooling kicked in.
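
One way to see this kind of connection exhaustion coming (a sketch under assumptions, not necessarily our setup) is a CloudWatch alarm on the DatabaseConnections metric. The identifiers, region, SNS topic and threshold below are placeholders; the threshold should sit somewhat below the instance's actual max_connections.

```python
# Minimal sketch: alarm when the Aurora cluster's connection count approaches
# its limit, so a scale-up event that exhausts connections is visible before
# it turns into errors. All identifiers and the threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="example-aurora-connections-near-limit",
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "example-aurora-cluster"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=300,            # assumption: roughly 80% of max_connections
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:example-alerts"],
)
```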

Looking a bit deeper at the first moments of the spike as the scaling operations started: even with the smaller database instances in RDS, CPU peaked at only 23%, while average select latency rose from 0.1 ms to 23.6 ms. Average response time went up from 0.06 s to 0.582 s. Percentage-wise that is a poor result, but half a second for a few minutes, at a scale where the request count to the application suddenly increases to 35x the normal, is something I consider a good result.

Bottom line: the time invested over the past two years in constant improvements to the service has really paid off. We have been able to avoid the notorious “2 AM calls” completely. If something weird is discovered or reported, there is no panic and it usually gets resolved automatically. If manual work is required, it is quickly handled with tools that point you to where to look for the cause, without having to spend hours hunting for anomalies.

— By Juha Kuokka, Head of Development at Aller One IT and
Jani Sinkkonen, Senior Developer at Aller Media Finland
