Monday 31 October 2016

Outage on post pages

Medium Engineering
Postmortems
Nov 3, 2016

On Monday lunchtime a change was pushed to production that caused post pages to become unavailable. Due to the resulting increase in load, a small percentage of traffic to other parts of the site was also intermittently unavailable, though at a much smaller scale. The CMS was unaffected.

The error triggered alerts and our on-call team reverted the bad build. The incident lasted 18 minutes, from 1:08pm to 1:26pm PST.

Timeline

  • 1:03pm PST
    Health checks on the new build pass.
  • 1:06pm PST
    The deployment completes.
  • 1:08pm PST
    Alerts indicate abnormally high pressure on an internal service.
  • 1:16pm PST
    Alerts indicate server responses are too slow.
  • 1:19pm PST
    On-call triggers a rollback to the previous build.
  • 1:26pm PST
    The rollback succeeds, all pages are again fully accessible.

Explanation of root cause

On Monday a new feature that changed the rendered output of post pages was rolled out to 100% of users in a single step, due in part to a desire to provide a consistent experience for all users.

The data used to render Medium’s post pages comes from a number of backend services, supported by a number of different kinds of caches. During development of this new feature, a misconception about how cache abstractions work meant that the new feature queried Hopper (a backend data service) for every rendering of every post page.
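The sketch below illustrates this failure mode; it is hypothetical, not Medium's actual code, and the names (HopperClient, getHighlights, renderNewPostFeature) are illustrative only. The bug amounts to calling the backing service directly on every render while assuming a cache sits in between.

```typescript
// Hypothetical sketch of the failure mode; names are illustrative.
interface HopperClient {
  getHighlights(postId: string): Promise<unknown[]>;
}

// The misconception: assuming calls through this path were already
// memoized by an upstream cache abstraction, so calling it on every
// render seemed safe.
async function renderNewPostFeature(postId: string, hopper: HopperClient) {
  // In reality nothing cached this call, so every render of every
  // post page produced a fresh request to Hopper.
  const highlights = await hopper.getHighlights(postId);
  return { highlightCount: highlights.length };
}
```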

As a result, once the deploy completed, Hopper saw a substantial spike in traffic and was unable to handle the additional load. The machines backing Hopper rapidly approached 100% CPU, which caused requests to back up. Post pages in turn began to time out, and the reverse proxies shed traffic.

The additional load on Hopper caused cascading slowdowns in other parts of Medium that relied on it, causing intermittent, though less noticeable, failures throughout the rest of the site. The rate of successful responses returned by Medium frontends dropped to 25% of its pre-incident level.

This class of issue is usually prevented through the use of variants to roll new features out to progressively larger audiences. We start with Medium staff, then typically go to 10%, 25%, and finally 100% of Medium users. This is done both to assess the effectiveness of the feature and to understand its impact on infrastructure. The Monday afternoon change bypassed this progressive rollout.
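As a minimal sketch of how percentage-based variant bucketing of this kind can work: the function names, feature key, and hashing scheme below are assumptions, not Medium's actual variant system.

```typescript
import { createHash } from "crypto";

// Bucket a user into a feature variant by hashing (feature, user) so the
// assignment is stable per user and independent across features.
function isInVariant(userId: string, featureKey: string, rolloutPercent: number): boolean {
  const digest = createHash("sha1").update(`${featureKey}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}

// Usage: widen rolloutPercent in stages (staff-only, 10, 25, 100) while
// watching dashboards for the backend services the feature touches.
const showNewPostPage = isInVariant("user-123", "new-post-page-render", 10);
```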

[Charts from the incident: Medium FE latency, Hopper goroutine count, Hopper free memory, Hopper CPU]

Explanation of resolution

We resolved the short-term issue by turning off the new feature. A permanent caching fix then reduced the load on Hopper. The feature was rolled out again to 5% of users and then in small increments, allowing the team to understand the required capacity and to add machines to the Hopper fleet as necessary.
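The following is a sketch of the kind of read-through cache that shields a backend like Hopper from per-render traffic; the class, method names, and 60-second TTL are assumptions for illustration, not the actual fix that shipped.

```typescript
// Sketch of an in-process, read-through TTL cache in front of the
// Hopper client so repeated renders of the same post reuse one fetch.
interface HopperClient {
  getHighlights(postId: string): Promise<unknown[]>;
}

class CachedHopperClient implements HopperClient {
  private cache = new Map<string, { value: unknown[]; expiresAt: number }>();

  constructor(private inner: HopperClient, private ttlMs = 60_000) {}

  async getHighlights(postId: string): Promise<unknown[]> {
    const hit = this.cache.get(postId);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // served from cache, no Hopper request
    }
    const value = await this.inner.getHighlights(postId); // single backend fetch
    this.cache.set(postId, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```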

Preventative measures

  • Restatement of policy to avoid going from 0–100% of users in one step. We know this to be best practice, and we made a mistake by not following it. In particular, launch plans should be designed to account for the fact that we can't jump to 100% for anything that touches the post page.
  • Implement graceful degradation of non-critical features.
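One way graceful degradation can look in practice is sketched below: give the non-critical call a strict time budget and fall back to an empty result instead of failing the page. The 200ms budget and the names are assumptions, not a description of Medium's implementation.

```typescript
// Reject the wrapped promise if it does not settle within the budget.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("timeout")), ms),
    ),
  ]);
}

async function loadOptionalHighlights(
  postId: string,
  hopper: { getHighlights(postId: string): Promise<unknown[]> },
): Promise<unknown[]> {
  try {
    return await withTimeout(hopper.getHighlights(postId), 200);
  } catch (err) {
    // Degrade: the post page still renders, just without this feature.
    return [];
  }
}
```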

The Medium Engineering team has committed to publishing a technical postmortem for serious outages to Medium's core services, in order to build trust and hold ourselves accountable to our users. More background on this program.
