Friday 26 January 2018

Medium Engineering
Jan 29, 2018 · 1 min read

Summary

On January 26th, 2018, Medium had a rolling outage from 4:51PM until 5:32PM.

Impact

A majority of requests to medium.com failed (including visits from a web browser, interactions from our native apps, and API requests).

Timeline

  • 4:51PM: Server errors began
  • 5:31PM: A bad cache instance was identified and restarted
  • 5:32PM: The last sustained server errors ended

Root cause

At Medium, we run experiments that affect the behavior of the site, the apps, emails, and many other things — all with the goal of refining the Medium experience to make it better for our users. When a server processes a request or handles an event, it checks the status of all running experiments to ensure the correct response is delivered.

This outage was caused by the failure of a cache that stored metadata for currently running experiments. The resulting surge of requests that failed over to our database exceeded its provisioned capacity, and requests began to fail.
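To make the failure mode concrete, here is a minimal sketch of a read-through lookup that fails over to the database when the cache is unreachable. Everything in it is an assumption for illustration: the `Cache`, `Database`, and `load_experiments` names, the `"experiments"` key, and the capacity figure are hypothetical and are not Medium's actual code or numbers.

```python
class CacheDown(Exception):
    """Raised by the hypothetical cache client when its instance is unreachable."""


class Cache:
    """Toy stand-in for a remote cache instance."""

    def __init__(self):
        self.healthy = True
        self.data = {}

    def get(self, key):
        if not self.healthy:
            raise CacheDown(key)
        return self.data.get(key)

    def set(self, key, value):
        if self.healthy:
            self.data[key] = value


class Database:
    """Toy database that rejects reads beyond a fixed capacity (illustrative number)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reads = 0

    def read_experiments(self):
        self.reads += 1
        if self.reads > self.capacity:
            raise RuntimeError("database over provisioned capacity")
        return {"example_experiment": "on"}


def load_experiments(cache, db):
    """Read-through lookup: try the cache first, fail over to the database."""
    try:
        cached = cache.get("experiments")
        if cached is not None:
            return cached
    except CacheDown:
        pass  # cache instance is down, so this request goes to the database
    value = db.read_experiments()
    cache.set("experiments", value)
    return value


def simulate(requests, cache, db):
    """Return how many of `requests` lookups fail."""
    failures = 0
    for _ in range(requests):
        try:
            load_experiments(cache, db)
        except RuntimeError:
            failures += 1
    return failures
```

With a healthy cache, only the first lookup reaches the database. Once `cache.healthy` is set to `False`, every lookup does, and the toy database starts rejecting reads as soon as its capacity is used up.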

Action taken

We have taken steps to make the cache more resilient to the failure of instances that serve hot cache keys.
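This post does not describe the specific steps taken. As one common way to soften the loss of an instance serving a hot key, the sketch below keeps the last good value in process memory and serves it when a refresh fails. It is an assumption for illustration, not Medium's fix: `LocalFallback`, `loader`, and `ttl_seconds` are hypothetical names, and the class is not thread-safe as written.

```python
import time


class LocalFallback:
    """Keeps the last good value in process memory and serves it if a refresh fails."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader          # hypothetical callable that reads the remote cache
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the behavior can be tested
        self._value = None
        self._has_value = False
        self._next_refresh = 0.0

    def get(self):
        now = self._clock()
        # Serve the local copy until it is due for a refresh.
        if self._has_value and now < self._next_refresh:
            return self._value
        try:
            value = self._loader()
        except Exception:
            if not self._has_value:
                raise  # nothing to fall back on yet
            # Serve the stale copy and wait a full TTL before retrying,
            # so a failed backend is not hit on every request.
            self._next_refresh = now + self._ttl
            return self._value
        self._value = value
        self._has_value = True
        self._next_refresh = now + self._ttl
        return value
```

The trade-off is staleness: each process may serve experiment metadata that is up to one TTL old, or older while the backend is down, in exchange for limiting backend reads to roughly one per process per TTL.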


The Medium Engineering team has committed to publishing a technical postmortem for serious outages of Medium core services, in order to build trust and hold ourselves accountable to our users. More background on this program.

