Thursday 3 March 2016

Medium Outage

Medium Engineering
Postmortems
Mar 8, 2016

On Thursday morning a change was made to how EC2 instances are provisioned. This change, combined with lack of termination protection on key long-lived instances and loose IAM policies, resulted in instances being removed in a key availability zone. This impacted core infrastructure required for Medium’s operation.

The outage lasted 31 minutes, from 11:25am to 11:56am PST. Error rates remained elevated until 12:35pm PST.

Timeline

  • 11:20am PST
    A change to how we provision EC2 instances was made and executed.
  • 11:26am PST
    Increase in average response times coupled with a drop in traffic to application servers.
  • 11:27am PST
    Elevated server errors. On-call engineers are paged and respond immediately.
  • 11:32am PST
    On-call engineers identify that several reverse proxies are down, begin re-provisioning.
  • 11:44am PST
    Instances for several other services are offline. Process begins to rebuild fleet.
  • 11:55am PST
    Core services operational. medium.com and hosted sites are accessible.
  • 12:09pm PST
    Source of terminations identified as originating within our network. Production access tightened for the remainder of the incident.
  • 12:35pm PST
    All services operational and stable. Offline tasks are backed up, but not affecting user experience.
  • 12:56pm PST
    Root cause identified. Restrictions on production access lifted.
[Charts: application server requests, internal server errors, and average latency during the incident.]

Explanation of root cause

Medium uses Ansible for configuration management and orchestration. The EC2 module has a feature that can enforce a specific number of instances based on the uniqueness of EC2 tags. The match can be based either on tag/value pairs or simply on the presence of a tag.

For example, the following guarantees a single instance with a given “Name” tag:

count_tag:
  Name: "{{ fullname }}"
exact_count: 1

For a particular internal service, we don’t want more than one instance, regardless of zone constraints. At the time, it wasn’t clear to us that the Ansible EC2 module scopes these count checks to a single availability zone when a zone is specified, and the zone was being set automatically elsewhere in our configuration.
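
As a rough sketch of the pattern (the image, instance type, and variable names are illustrative, not our actual playbook), a provisioning task scoped to a zone looks something like this; because zone is set on the task, the exact_count check only considers instances in that zone:

- ec2:
    image: "{{ ami_id }}"              # illustrative variable
    instance_type: m3.medium           # illustrative
    zone: "{{ aws_zone }}"             # in our case, set automatically elsewhere
    instance_tags:
      Name: "{{ fullname }}"
    count_tag:
      Name: "{{ fullname }}"
    exact_count: 1                     # counted only within the specified zone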

This leads us to the root cause of Thursday’s issue.

While debugging this particular problem, the configuration was set to:

count_tag: Name
exact_count: 1

This tells Ansible to allow only one instance with a Name tag, regardless of its value.

Unfortunately, this wasn’t executed in dry-run mode (-C), and it ended up removing all but one instance in that availability zone.
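
For reference, check mode is enabled on the command line with the -C flag (the playbook name below is hypothetical):

ansible-playbook -C provision.yml     # report changes without applying them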

We distribute most services across all availability zones, which meant we lost about 25% of our fleet.

Explanation of resolution

The immediate resolution was to re-provision the instances we lost.

Preventative measures

  • Turning on termination protection for all our key long-lived instances, and making it the default behavior for new instances, will reduce the impact of a similar incident in the future (see the sketch after this list).
  • Enabling the dry-run (-C) flag by default will give us more insight into what will change and help avoid accidental changes while debugging configurations.
  • Creating an isolated testing environment for config changes will protect production services from bad config changes.
  • Tightening up IAM role policies will help ensure that destructive actions (e.g. ec2:StopInstances) are restricted to automation systems that require them.
  • Re-running disaster planning for loss of an availability zone will identify actions that will reduce the impact of such an event on core services.
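
As a minimal sketch of the first measure, assuming the termination_protection parameter in the version of the Ansible EC2 module we run (other values are illustrative), protection can be switched on at launch so it becomes the default for new instances:

- ec2:
    image: "{{ ami_id }}"              # illustrative
    instance_type: m3.medium           # illustrative
    termination_protection: yes        # assumed parameter; blocks API termination until explicitly disabled
    instance_tags:
      Name: "{{ fullname }}"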

The Medium Engineering team has committed to publishing a technical postmortem for serious outages to Medium’s core services, in order to build trust and hold ourselves accountable to our users. More background on this program is available.
