On July 11th, 2017, Medium.com did not serve requests from 4:10 pm to 4:14 pm PDT due to a deployment issue caused by a miscommunication.
No one could access https://medium.com for 4 minutes.
At Medium, all services are deployed following a continuous integration and continuous deployment model: deployments of our services happen automatically and at high frequency (this service, in particular, is deployed dozens of times a day). To make this process safe, our deployment pipeline runs a health check on a canary to make sure there is no obvious regression.
If the health check step passes, all production traffic is redirected to a fleet of servers running the new build.
If anything bad is discovered, the deployment pipeline stops and notifies the on-call team. The team then investigates the possible regression and decides between retrying the health check step or abandoning a build.
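The gating logic described above can be sketched roughly as follows. This is a minimal illustration, not Medium's actual pipeline code; all function names (`run_health_check`, `promote`, `notify_on_call`) are hypothetical.

```python
# Sketch of a canary-gated deployment step.
# All names are illustrative assumptions, not Medium's real internals.

def deploy(build, run_health_check, promote, notify_on_call):
    """Gate promotion of `build` on a canary health check."""
    if run_health_check(build):
        # Health check passed: shift all production traffic
        # to the fleet running the new build.
        promote(build)
        return "promoted"
    # Health check failed: stop the pipeline and page the on-call
    # team, who decide whether to retry the check or abandon the build.
    notify_on_call(build)
    return "held"
```

The key property is that promotion and notification are mutually exclusive outcomes of a single decision point: a build either passes the canary check and is promoted, or it is held for human review.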
In this case, the health check step did not pass, and two different recovery protocols were followed simultaneously, causing an outage.
We took action to prevent such issues from happening again: we added a lock so an abandoned build can never be retested, and we added extra checks prior to promoting a build to verify that enough hosts are running the service before traffic is switched to the new fleet.
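The two safeguards just described might look something like the sketch below. Class and function names, and the host threshold, are assumptions for illustration only.

```python
# Illustrative sketch of the two post-incident safeguards:
# (1) a lock that keeps abandoned builds out of the pipeline,
# (2) a host-count check before traffic is switched.
# Names and thresholds are hypothetical, not Medium's actual code.

class BuildRegistry:
    def __init__(self):
        self._abandoned = set()

    def abandon(self, build_id):
        # The "lock": once abandoned, a build is permanently
        # barred from re-entering the health-check/promotion path.
        self._abandoned.add(build_id)

    def can_retest(self, build_id):
        return build_id not in self._abandoned


def safe_to_promote(healthy_hosts, required_hosts):
    # Pre-promotion check: require enough healthy hosts running the
    # new build before switching production traffic to the new fleet.
    return healthy_hosts >= required_hosts
```

Together these close the failure mode from this incident: a terminated build cannot be re-promoted, and promotion cannot proceed against an under-provisioned fleet.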
- 4:10 pm PDT
Health check in our deployment pipeline mistakenly allows a terminated build to be promoted, taking down https://medium.com
- 4:14 pm PDT
On-call engineer rolls back the bad fleet and returns Medium to service