Thursday 3 March 2016
Medium Outage
On Thursday morning a change was made to how EC2 instances are provisioned. This change, combined with a lack of termination protection on key long-lived instances and loose IAM policies, resulted in instances being removed in a key availability zone. This impacted core infrastructure required for Medium’s operation.
The outage lasted 31 minutes, from 11:25am to 11:56am PST. Error rates remained elevated until 12:35pm PST.
Timeline
- 11:20am PST: A change to how we provision EC2 instances is made and executed.
- 11:26am PST: Average response times increase, coupled with a drop in traffic to application servers.
- 11:27am PST: Server errors are elevated. On-call engineers are paged and respond immediately.
- 11:32am PST: On-call engineers identify that several reverse proxies are down and begin re-provisioning them.
- 11:44am PST: Instances for several other services are found to be offline. The process of rebuilding the fleet begins.
- 11:55am PST: Core services are operational. medium.com and hosted sites are accessible.
- 12:09pm PST: The source of the terminations is identified as within our network. Production access is tightened for the remainder of the incident.
- 12:35pm PST: All services are operational and stable. Offline tasks are backed up, but this is not affecting the user experience.
- 12:56pm PST: Root cause identified. Restrictions on production access are lifted.
Explanation of root cause
Medium uses Ansible for configuration management and orchestration. The EC2 module has a feature which can enforce a specific number of instances based on the uniqueness of EC2 tags. This can be done with either tag/value pairs or based on the presence of a tag.
For example, the following guarantees a single instance with a given “Name” tag:
count_tag:
  Name: "{{ fullname }}"
exact_count: 1
For a particular internal service, we don’t want more than one instance running, regardless of availability zone. At the time, it wasn’t clear to us that the Ansible EC2 module’s count checks apply within a given zone when one is specified, and the zone was being set automatically elsewhere.
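As an illustration of that interaction (the values and variable names here are hypothetical, and the zone variable stands in for the value that was being set automatically), a task like the following enforces exact_count only within the specified zone, not across the region:

```yaml
# Sketch of an ec2 task. Because "zone" is set (here via a hypothetical
# variable), the exact_count check is scoped to that availability zone
# rather than to the region as a whole.
- ec2:
    image: "{{ ami_id }}"
    instance_type: m3.medium
    zone: "{{ assigned_zone }}"  # populated automatically elsewhere
    count_tag:
      Name: "{{ fullname }}"
    exact_count: 1
```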
This leads us to the root cause of Thursday’s issue.
While debugging this particular problem, the configuration was set to:
count_tag: Name
exact_count: 1
This tells the module to allow only one instance that has a Name tag, regardless of its value.
Unfortunately, this wasn’t executed in dry-run mode (-C), and it ended up removing all but one instance in an availability zone.
We distribute most services across all availability zones, which meant we lost about 25% of our fleet.
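In check mode, Ansible reports what would change without acting on it, which would have surfaced the pending terminations before they happened. As a sketch (the playbook name is hypothetical):

```shell
# --check (-C) reports pending changes, including instance terminations,
# without executing them. The playbook name here is hypothetical.
ansible-playbook -C provision-ec2.yml
```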
Explanation of resolution
The immediate resolution was to re-provision the instances we lost.
Preventative measures
- Turning on termination protection for all our key long-lived instances, and making it the default behavior for new instances, will reduce the impact of a similar incident in the future.
- Enabling the dry-run (-C) flag by default will give more insight into what will change and help avoid accidental changes while debugging configurations.
- Creating an isolated testing environment for config changes will protect production services from bad config changes.
- Tightening up IAM role policies will help ensure that destructive actions (e.g. ec2:StopInstances) are restricted to automation systems that require them.
- Re-running disaster planning for loss of an availability zone will identify actions that will reduce the impact of such an event on core services.
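For the termination-protection measure above, the attribute can be enabled on an existing instance with the AWS CLI (the instance ID below is a placeholder):

```shell
# Enable termination protection on an existing instance.
# The instance ID is a placeholder.
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --disable-api-termination
```

With this attribute set, API-initiated termination requests fail until it is explicitly cleared with `--no-disable-api-termination`.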
The Medium Engineering team has committed to publishing a technical postmortem for serious outages to Medium core services, in order to build trust and hold ourselves accountable to our users. More background on this program.