Sunday 21 February 2016

Partial Medium Outage

Medium Engineering
Postmortems
Feb 26, 2016

A bug was introduced Thursday that subtly broke caching for a subset of our database requests. This, combined with a lack of caching on some requests from logged-out users, aggressive retry logic, and an increase in traffic from Talia Jane’s open letter, increased pressure on our web tier on Sunday morning.

For 30 minutes, from 8:55am to 9:25am PST, Medium’s primary web servers returned 504s for 20% of requests. On-call engineers were paged and were online within five minutes. It took ten minutes to identify the root cause and another fifteen to fully resolve it.

Timeline

  • 02/17 11:00am PST
    A change in how tags are cached is deployed to production.
  • 02/21 8:55am PST
    The tags table starts throttling and on-call engineers are paged.
  • 02/21 9:00am PST
    On-call engineers respond online.
  • 02/21 9:10am PST
    The root cause is identified and on-call engineers begin increasing the read capacity of the tags table.
  • 02/21 9:15am PST
    The first increase in read capacity is fully pushed to production. At this point the error rate has dropped, but errors are still occurring in production.
  • 02/21 9:20am PST
    On-call engineers bump read capacity on the table for the second time.
  • 02/21 9:25 PST
    The second increase in read capacity is fully pushed to production and the site is recovered.

Explanation of root cause

Medium uses Amazon’s DynamoDB database for serving much of the site’s content. We also have a Redis cache in front of Dynamo to help smooth out our traffic load and prevent spikes that would otherwise result in DynamoDB throttling our request load.

On Thursday, a change was made that was intended to make the fetching of cached post tags more efficient. An unintended side effect of this change was that we stopped caching an empty list for posts with no tags. With this change in production, requesting the list of tags on a post without any tags started hitting DynamoDB 100% of the time.
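
As a simplified illustration (not our production code), the sketch below shows a read-through cache for post tags in which the empty list is cached as well; the bug amounted to skipping that last step for untagged posts. The key schema, Redis client, region, and TTL here are assumptions for the sake of the example.

```javascript
// Minimal read-through cache sketch for post tags (illustrative, not Medium's code).
// The key point: cache the empty list too, or every untagged post
// falls through to DynamoDB on every request.
const AWS = require('aws-sdk');
const Redis = require('ioredis');

const dynamo = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' }); // hypothetical region
const cache = new Redis();

const CACHE_TTL_SECONDS = 300; // hypothetical TTL

async function getPostTags(postId) {
  const cacheKey = `post:tags:${postId}`; // hypothetical key format

  const cached = await cache.get(cacheKey);
  if (cached !== null) {
    return JSON.parse(cached); // may be [], which is a perfectly valid cached value
  }

  const result = await dynamo.query({
    TableName: 'tags',                           // table named in this postmortem
    KeyConditionExpression: 'postId = :p',       // hypothetical key schema
    ExpressionAttributeValues: { ':p': postId },
  }).promise();

  const tags = result.Items || [];

  // Cache the result even when it is empty. Skipping this step for untagged
  // posts is the class of bug described above.
  await cache.set(cacheKey, JSON.stringify(tags), 'EX', CACHE_TTL_SECONDS);

  return tags;
}
```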

The number of DynamoDB requests for post tags.

This issue didn’t surface until the traffic spike on Sunday morning, when we were paged for user-facing errors and throttled DynamoDB requests.

Percent of 200s & 504s served [left] and count of throttled DynamoDB requests [right] on Sunday morning.

Under the hood, the Node AWS SDK automatically retries requests to DynamoDB — including throttled requests — up to 10 times. This led to an increase in requests being processed concurrently on the Medium web servers, and subsequently an increase in latency. Our proxy servers compensated by shedding load and returning 504s. From 8:55am to 9:25am, approximately 20% of the requests to medium.com and other domains hosted on our platform returned 504s.
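
For context, the Node AWS SDK exposes client-level settings that bound this retry behavior. The sketch below shows one way to cap retries and per-request timeouts on a DynamoDB client; the specific values and region are illustrative, not our production configuration.

```javascript
// Sketch: bounding the Node AWS SDK's automatic retries for DynamoDB.
// Lowering maxRetries limits how long a throttled call can tie up a web
// server request instead of retrying with backoff up to 10 times.
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB({
  region: 'us-east-1',   // hypothetical region
  maxRetries: 2,         // fail fast instead of retrying many times
  httpOptions: {
    timeout: 1000,       // hypothetical per-request socket timeout (ms)
  },
  retryDelayOptions: {
    base: 50,            // hypothetical base delay (ms) for exponential backoff
  },
});
```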

Increase in average and 95th percentile web server reported DynamoDB query latency.

Explanation of resolution

Because the Dynamo table did not have the full protection of our cache, it became more vulnerable to increases in traffic. Temporarily bumping the read capacity Sunday morning helped us weather the spike, but was not a long-term resolution.
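
For readers unfamiliar with DynamoDB’s provisioned-throughput model, a capacity bump of this kind boils down to an UpdateTable call along these lines; the capacity values shown are placeholders rather than the numbers we actually used.

```javascript
// Sketch of the capacity bump described above, assuming a provisioned-throughput
// table named "tags". Values are illustrative.
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB({ region: 'us-east-1' }); // hypothetical region

async function bumpReadCapacity(readCapacityUnits, writeCapacityUnits) {
  await dynamo.updateTable({
    TableName: 'tags',
    ProvisionedThroughput: {
      ReadCapacityUnits: readCapacityUnits,   // increased to absorb the cache-miss traffic
      WriteCapacityUnits: writeCapacityUnits, // must be supplied even if unchanged
    },
  }).promise();
}
```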

Preventative measures taken

  • Tuning the DynamoDB retry logic will decrease the perceived latency of throttled Dynamo queries. With fewer requests running concurrently, the load caused by throttling shouldn’t spill over to other requests on the system the way it did Sunday morning.
  • Increasing test coverage will help detect the class of bug where caching was bypassed.
  • Adding alerts that detect when access patterns to tables change will allow us to identify issues before users are affected (one possible shape for such an alert is sketched after this list).
  • Reevaluating Dynamic Dynamo as a mechanism for adjusting provisioning levels more quickly than humans can respond.
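
As an example of the alerting item above, the sketch below sets a CloudWatch alarm on the consumed read capacity of the tags table; the alarm name, threshold, and notification target are hypothetical.

```javascript
// Sketch: a CloudWatch alarm that pages when read traffic to the tags table
// spikes well above its normal level. Names and numbers are placeholders.
const AWS = require('aws-sdk');

const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' }); // hypothetical region

async function createTagsReadAlarm() {
  await cloudwatch.putMetricAlarm({
    AlarmName: 'tags-table-read-capacity-spike',        // hypothetical alarm name
    Namespace: 'AWS/DynamoDB',
    MetricName: 'ConsumedReadCapacityUnits',
    Dimensions: [{ Name: 'TableName', Value: 'tags' }],
    Statistic: 'Sum',
    Period: 300,                                        // 5-minute windows
    EvaluationPeriods: 1,
    Threshold: 100000,                                  // hypothetical: units per 5 minutes
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:oncall-page'], // hypothetical SNS topic
  }).promise();
}
```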

The Medium Engineering team has committed to publishing a technical postmortem for serious outages to Medium’s core services, in order to build trust with our users and hold ourselves accountable to them. More background on this program is available.
