Monday 31 October 2016
Outage on post pages
On Monday lunchtime a change was pushed to production that caused post pages to become unavailable. Due to an increase in load, a small percentage of traffic to other parts of the site was also intermittently unavailable, though at a much smaller scale. The CMS was unaffected.
The error triggered alerts and our on-call team reverted the bad build. The incident lasted 18 minutes, from 1:08pm to 1:26pm PST.
Timeline
- 1:03pm PST: Health checks on the new build pass.
- 1:06pm PST: The deployment completes.
- 1:08pm PST: Alerts indicate abnormally high pressure on an internal service.
- 1:16pm PST: Alerts indicate server responses are too slow.
- 1:19pm PST: On-call triggers a rollback to the previous build.
- 1:26pm PST: The rollback succeeds; all pages are again fully accessible.
Explanation of root cause
On Monday a new feature that changed the rendered output of post pages was rolled out to 100% of users, due in part to a desire to provide a consistent experience for all users.
The data used to render Medium’s post pages comes from a number of backend services, supported by a number of different kinds of caches. During development of this new feature, a misconception about how cache abstractions work meant that the new feature queried Hopper (a backend data service) for every rendering of every post page.
As a result, once the deploy completed, Hopper saw a substantial spike in traffic, and was unable to handle the additional load. The machines backing Hopper rapidly approached 100% CPU, which caused requests to back up. Post pages in turn began to timeout, and the reverse proxies shed traffic.
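Shedding traffic at the proxy layer can be sketched as a concurrency cap: once too many requests are in flight, new ones are rejected immediately rather than queued behind a saturated backend. This is an assumed, simplified model, not Medium's proxy configuration.

```python
import threading

class LoadShedder:
    """Reject requests beyond a fixed in-flight limit instead of queueing."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_fn):
        # Non-blocking acquire: if all slots are busy, shed the request.
        if not self._slots.acquire(blocking=False):
            return "503 Service Unavailable"
        try:
            return request_fn()
        finally:
            self._slots.release()
```

Fast rejection keeps queues from growing without bound, so the requests that are served still complete within their timeouts.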
The additional load on Hopper caused cascading slowdowns in other parts of Medium that relied on it, causing intermittent, though less noticeable, failures throughout the rest of the site. Successful responses returned by Medium frontends dropped to 25% of their pre-incident rate.
This class of issue is usually prevented through use of variants to roll new features out to progressively larger audiences. We start with Medium staff, then typically go to 10%, 25%, then 100% of Medium users. This is done both to assess the effectiveness of the feature, and to understand the impact of the feature on infrastructure. The Monday afternoon change bypassed this progressive rollout.
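A common way to implement this kind of progressive rollout is to hash each user id to a stable bucket and enable the feature for buckets below the current rollout percentage. The sketch below is a generic illustration of the technique, not Medium's variants system.

```python
import hashlib

def rollout_bucket(user_id: str) -> int:
    """Hash a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(user_id: str, rollout_percent: int) -> bool:
    """A user sees the feature once the rollout reaches their bucket."""
    return rollout_bucket(user_id) < rollout_percent
```

Because the bucket is a pure function of the user id, each user's experience is consistent across requests, and widening the rollout from 10% to 25% only adds users, so infrastructure impact can be observed at each step.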
Explanation of resolution
We resolved the short-term issue by turning off the new feature. A permanent caching fix then reduced the load on Hopper. The feature was re-enabled for 5% of users and increased in small increments, allowing the team to understand the required capacity and add machines to the Hopper fleet as necessary.
Preventative measures
- Restatement of policy to avoid going from 0–100% of users in one step. We know this to be best practice, and we made a mistake by not following it. In particular, launch plans should account for the fact that we can't jump to 100% for anything that touches the post page.
- Implement graceful degradation of non-critical features.
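One way to realize graceful degradation is to wrap each non-critical fetch so that a failure yields a fallback value instead of failing the whole page. This is a hypothetical sketch of the idea, not a committed design.

```python
def render_with_fallback(fetch_fn, fallback_value):
    """Call a non-critical data fetch; on failure, degrade to a fallback
    instead of letting the error take down the entire page render."""
    try:
        return fetch_fn()
    except (TimeoutError, ConnectionError):
        # The critical content still renders; only this section is missing.
        return fallback_value
```

Under this pattern, an overloaded dependency would cost users one degraded section of the page rather than an outage of post pages.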
The Medium Engineering team has committed to publishing a technical postmortem for serious outages of Medium core services, in order to build trust and hold us accountable to our users. More background on this program is available.