On January 26th, 2018, Medium had a rolling outage from 4:51PM until 5:32PM.
A majority of requests to medium.com failed (including visits from a web browser, interactions from our native apps, and API requests).
- 4:51PM: Server errors began
- 5:31PM: A bad cache instance was identified and restarted
- 5:32PM: The last sustained server errors ended
At Medium, we run experiments that affect the behavior of the site, the apps, emails, and many other things — all with the goal of refining the Medium experience to make it better for our users. When a server processes a request or handles an event, it checks the status of all running experiments to ensure the correct response is delivered.
This outage was caused by the failure of a cache that stored metadata for currently running experiments. The resulting surge of requests that failed over to our database exceeded its provisioned capacity, and requests began to fail.
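To make the failure mode concrete, here is a minimal sketch of a cache-aside lookup in which every cache error falls through to the database. It is an illustration only, not Medium's actual code: the names `FlakyCache`, `Database`, `load_experiments`, `simulate`, the key `experiments:metadata`, and the capacity figure are all hypothetical.

```python
EXPERIMENTS_KEY = "experiments:metadata"  # hypothetical hot cache key


class FlakyCache:
    """Hypothetical stand-in for a single shared cache instance."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.data = {}

    def get(self, key):
        if not self.healthy:
            raise ConnectionError("cache instance unavailable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.healthy:
            raise ConnectionError("cache instance unavailable")
        self.data[key] = value


class Database:
    """Hypothetical database that rejects reads beyond a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reads = 0

    def read_experiments(self):
        self.reads += 1
        if self.reads > self.capacity:
            raise RuntimeError("database over provisioned capacity")
        return {"new_homepage": "enabled"}


def load_experiments(cache, db):
    """Cache-aside read: try the cache, fall back to the database."""
    try:
        cached = cache.get(EXPERIMENTS_KEY)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache is down: fall through to the database
    value = db.read_experiments()
    try:
        cache.set(EXPERIMENTS_KEY, value)
    except ConnectionError:
        pass  # cannot repopulate a dead cache
    return value


def simulate(requests, cache_healthy, db_capacity):
    """Return (database reads, failed requests) for a burst of requests."""
    cache = FlakyCache(healthy=cache_healthy)
    db = Database(capacity=db_capacity)
    failures = 0
    for _ in range(requests):
        try:
            load_experiments(cache, db)
        except RuntimeError:
            failures += 1
    return db.reads, failures
```

With a healthy cache, only the first request in a burst reaches the database. With the cache instance down, every request does, and any request past the database's capacity fails.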
We have taken steps to ensure that the cache is more resilient against the failures of instances serving hot cache keys.
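This summary does not describe the specific steps taken. As one common way to soften the loss of an instance serving a hot key, a server can keep a short-lived in-process copy of the value and serve the stale copy when a refresh fails. The sketch below assumes that approach purely for illustration; `LocalFallback`, its parameters, and the TTL value are hypothetical and are not taken from Medium's implementation.

```python
import time


class LocalFallback:
    """Hypothetical in-process copy of one hot value, with stale-on-error."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch          # callable that reads the shared cache
        self._ttl = ttl_seconds      # how long a local copy is trusted
        self._clock = clock          # injectable clock, useful in tests
        self._value = None
        self._has_value = False
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._has_value and now < self._expires_at:
            return self._value       # fresh local copy, no remote call
        try:
            self._value = self._fetch()
            self._has_value = True
        except Exception:
            if not self._has_value:
                raise                # nothing to fall back on
            # Refresh failed: keep serving the stale copy.
        # Wait one TTL before the next refresh attempt, success or not.
        self._expires_at = now + self._ttl
        return self._value
```

The trade-off is staleness: a change to experiment metadata may take up to one TTL to reach each server, and longer while the shared cache is unavailable.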
The Medium Engineering team has committed to publishing a technical postmortem for serious outages affecting Medium core services, in order to build trust and hold ourselves accountable to our users. More background on this program.