Failing responsibly

Patryk Zawadzki
Published in Blog | Mirumee · Oct 18, 2016 · 5 min read

Working with demanding environments is very rewarding when everything goes according to plan. But it also involves dealing with failures. The goal of this post is to show you how to handle problems in a responsible manner.

Given enough time, a complex system is bound to break in one way or another. We need to react in time and restore services. We have to act carefully to avoid causing further damage. And once all lights turn green, it’s important to learn from the experience. Skipping this step is throwing away a valuable chunk of insight into the product. Will things break tomorrow? What about the next holiday, when the traffic doubles? Did we go deep enough to fix the root cause? And if we can’t answer these questions, how do we expect the customer to trust us?

There is a special type of report that we prepare for such occasions. The industry calls these reports “postmortems”, and their goal is to:

  • help our peers understand key events and factors,
  • point out the cause of failure,
  • assess the damage,
  • explain the solutions applied,
  • list lessons learned that can prevent similar problems from arising in the future, and
  • convince the customer that we control the situation.

Below is an incident report I had to write a couple of weeks ago. In this particular case, the environment was an auto-scaled Elastic Beanstalk application: tens of AWS EC2 instances behind an Elastic Load Balancer. I’ve redacted the document to avoid disclosing sensitive data. The important part is the structure, not the numbers.

This failure report represents the knowledge available to Mirumee Software about the incident that took place on the █████████, 2016.

Created by: Patryk Zawadzki, Development Lead
Approved by: Mirek Mencel, CEO

Mirumee Software
Tęczowa 7, 53-601 Wroclaw, Poland

AWS CloudWatch traffic analysis. The blue line represents incoming requests as seen by the load balancer; the orange line shows connections rejected by the load balancer because there were not enough servers available in the balancing pool.

Timeline

7:45 UTC ¹

Large traffic spike appears. AWS attempts to scale up from the pre-defined ██ machines. ████ requests are observed in the first two minutes before a newly spawned machine appears. By the time the ██ machine starts answering requests and ██ is commissioned, the traffic first throttles down to around ███ rpm, then slowly increases to around ███ rpm.

7:55 UTC ²

Average response time approaches five seconds. Due to a configuration error, five seconds is also the health check timeout threshold. Slow health check responses cause the load balancer to take ██ out of ██ machines out of the balancing pool.

Over the next twenty minutes, machines bounce in and out of service as their health checks fail once traffic increases and start passing again once they stop seeing new requests.

During that time, up to half of the traffic is met with an error page, as there are not enough servers in the load balancing pool to handle the connections.

8:15 UTC ³

████ contacts Michał over ███████, pointing to slow site response times.

Michał confirms there is a problem, but at this point the team is not aware of the health check misconfiguration.

Traffic keeps increasing and eventually stabilizes around ███ rpm.

8:40 UTC ⁴

Health check configuration is fixed so that the timeout threshold sits safely above the expected latency.

8:45 UTC ⁵

As health checks start passing, all of the machines rejoin the load balancing pool. Dropped connection count starts going down. The machines are fast enough to handle the new traffic, but there are many queued connections. The team decides to scale up temporarily to increase the draining rate.

8:50 UTC ⁶

Max balancing pool size is increased to ██, but a limit is hit: the AWS account’s EC2 quota is ██ concurrently running instances, and production and staging environments already have ██ EC2 instances combined.

Staging environment is reduced to a single running instance. The team decides not to shut it down entirely to keep a safe testing ground for all changes.

8:55 UTC ⁷

██ new machines join the load balancing pool. The draining rate increases and the environment status returns to green.

9:25 UTC ⁸

The fraction of rejected connections drops to zero. Traffic starts increasing.

9:35 UTC ⁹

Traffic peaks at ███ rpm. Error rate is still at zero. Average response time climbs slightly to around one second.

9:50 UTC ¹⁰

As traffic starts to slow down, the team decides to scale back down to ██ machines to gain operational capacity to redeploy the staging environment on a higher-end EC2 instance. ████ asks the team to keep excess resources provisioned, as according to daily traffic profiles the highest peak is yet to come in the evening.

10:00 UTC ¹¹

Staging is successfully redeployed and tested on a ████████ EC2 instance.

Production environment continues to run with expected performance. Error rate remains at zero.

10:20 UTC ¹²

Production environment is upgraded to the same ████████ instance type. The change rolls out over the next 15 minutes as machines are replaced, one by one, by their more powerful equivalents.

Failure analysis

The first problem we hit was a misconfigured health check. A five-second timeout threshold was not enough to keep machines healthy under heavy load and caused machines handling many concurrent connections to be erroneously taken out of service by the load balancer. This put even more strain on the remaining machines and caused them, in turn, to fail the health check.

We are investigating why this did not surface during the last round of load testing. Our first suspicion is that the CPU profile of the code changed slightly with the recently introduced changes to how pricing is handled. It’s possible that the overhead introduced by those changes was enough to change how the health checks behaved under heavy load.

Solution: the health check configuration was altered to allow longer health check response times before a machine is assumed to be broken. The health check function itself will be profiled to see if its responsiveness can be improved when the server is operating under heavy load. The team wants to run another round of load testing and profiling before the next sale.
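For illustration, this kind of change can be scripted against the Elastic Beanstalk API with boto3. This is only a sketch; the environment name, region and threshold values below are placeholders, not the ones used in production:

# Sketch: relax the ELB health check thresholds of an Elastic Beanstalk
# environment. Environment name, region and values are placeholders.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

eb.update_environment(
    EnvironmentName="production",
    OptionSettings=[
        # Give slow responses under load more time before failing the check.
        {"Namespace": "aws:elb:healthcheck", "OptionName": "Timeout", "Value": "15"},
        # The check interval must stay greater than the timeout.
        {"Namespace": "aws:elb:healthcheck", "OptionName": "Interval", "Value": "30"},
        # Require several consecutive failures before removing an instance.
        {"Namespace": "aws:elb:healthcheck", "OptionName": "UnhealthyThreshold", "Value": "5"},
    ],
)

Keeping the same options in the environment’s saved configuration ensures that freshly created environments start with safe defaults instead of the ones that caused the flapping.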

The second problem is that the instance quota of the AWS account is too low. There is not enough room for expansion, and it’s currently not possible to fully scale out the staging environment without affecting the production environment’s ability to scale.

Solution: the team will use the AWS panel to request a higher instance quota.
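The quota increase itself goes through a support request in the AWS panel, but knowing how much headroom is left before the next sale helps. A quick boto3 sketch, with the region as a placeholder:

# Sketch: compare running EC2 instances against the account's instance limit.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# "max-instances" is the account-wide limit on concurrently running instances.
attributes = ec2.describe_account_attributes(AttributeNames=["max-instances"])
limit = int(attributes["AccountAttributes"][0]["AttributeValues"][0]["AttributeValue"])

# Count every instance that is currently running, across all environments.
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
running = sum(len(r["Instances"]) for r in reservations)

print("{0} of {1} instances in use, {2} left for scaling".format(
    running, limit, limit - running))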

The third problem is that, due to an error, there were no alerts configured for the production environment. The team was not aware of the problem until contacted about it. Immediate notifications would have allowed the team to understand the problem faster and severely limit the impact.

Solution: the team will correctly set up AWS alerts and notifications.
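A minimal sketch of one such alarm with boto3, assuming a classic load balancer; the balancer name, region and SNS topic are placeholders. SpilloverCount is the ELB metric that counts connections rejected because no healthy instance was available, which is the failure mode described above:

# Sketch: page the team as soon as the load balancer rejects connections.
# Load balancer name, region and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="production-elb-spillover",
    AlarmDescription="Load balancer is rejecting connections",
    Namespace="AWS/ELB",
    MetricName="SpilloverCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "awseb-production"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # The SNS topic fans out to e-mail, SMS or the team's chat integration.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:production-alerts"],
)

A second alarm on the UnHealthyHostCount metric would flag flapping health checks even before connections start being dropped.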
