Wednesday 22 June 2016

Outage on publication home pages

Medium Engineering
Postmortems
Jun 24, 2016

On Wednesday morning a change was pushed to production that caused some publication home pages to throw errors, including landing pages for hosted domains. Direct traffic to medium.com, CMS, and post pages was unaffected.

The error triggered alerts and our on-call team reverted the bad build. The incident lasted 21 minutes, from 10:04am to 10:25am PST.

Timeline

  • 9:42am PST
    Health checks pass, despite errors on publication home pages.
  • 10:04am PST
    The deployment completes.
  • 10:09am PST
    Alerts indicate several major publication pages are offline.
  • 10:10am PST
    On-call triggers a rollback to the previous build.
  • 10:25am PST
    The rollback succeeds; publication home pages are accessible again.

Explanation of root cause

Medium’s publication home pages are divided into “sections” which display stories from a variety of sources (e.g. the latest stories in the publication, featured stories, or stories from a tag). On Medium, stories can be tagged with topics that help readers discover related stories. Publications can use these tags to divide up their homepage on a per-topic basis.

Medium assumes that every tag uses title case by default. For certain popular tags that should not use title case, we manually override the name of the tag with the correct casing (for example, we use “MLB” instead of “Mlb”). Unfortunately, we weren’t using the overridden names when rendering sections on publication home pages. While fixing this issue, we introduced a bug by assuming that a tag object would always exist for the corresponding slug:

section.postListMetadata.tagName = tagsBySlug[tagSlug].name

When the corresponding tag object does not exist, tagsBySlug[tagSlug] evaluates to undefined. We then attempt to access the name property of undefined, which throws a TypeError.
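The failure mode is easy to reproduce in isolation. The following is a minimal sketch, not Medium's code: the contents of tagsBySlug and the helper names readTagName and tryReadTagName are made up for illustration, and only the unguarded lookup mirrors the line above.

```javascript
// Hypothetical data: a tag object exists for "mlb" but not for "no-such-tag".
const tagsBySlug = {
  mlb: { slug: 'mlb', name: 'MLB' },
};

// Same shape as the faulty line: the lookup result is used without a check.
function readTagName(tagSlug) {
  return tagsBySlug[tagSlug].name;
}

// Wrapper (illustrative only) that reports whether the lookup threw.
function tryReadTagName(tagSlug) {
  try {
    return { ok: true, name: readTagName(tagSlug) };
  } catch (err) {
    return { ok: false, errorType: err.constructor.name };
  }
}

const found = tryReadTagName('mlb');
const missing = tryReadTagName('no-such-tag');

console.log(found);   // { ok: true, name: 'MLB' }
console.log(missing); // { ok: false, errorType: 'TypeError' }
```

A single missing slug is enough to abort rendering of whatever code path calls the unguarded lookup, which is consistent with only some publication home pages being affected.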

Errors like this are ordinarily caught by unit tests, or during the “health check” phase of Medium’s deployment process. During this final phase, we test a single machine with production traffic before deploying the build to the rest of the fleet. We track the errors on this machine and halt the build if we encounter any unexpected errors.

The Wednesday morning build succeeded because of a false negative in the health checks: they passed even though publication home pages were throwing errors. As a result, our deployment process treated the build as safe and rolled it out to production.
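This postmortem does not say why the health check missed the errors, so the following is only a sketch of the general technique, under stated assumptions. It shows a gate that halts a build when the test machine reports errors, and that also refuses to promote when a page type was never exercised, since an unobserved page type cannot produce errors to count. The group names, the sample format, and the evaluateCanary function are all hypothetical.

```javascript
// Hypothetical page types that must be exercised before a build is promoted.
const REQUIRED_GROUPS = ['post', 'publication-home', 'hosted-domain'];

// samples: [{ group: string, status: number }] collected from the test machine.
// maxErrorRate: highest tolerated fraction of 5xx responses per group.
function evaluateCanary(samples, maxErrorRate = 0) {
  const stats = new Map(
    REQUIRED_GROUPS.map((group) => [group, { total: 0, errors: 0 }])
  );

  for (const { group, status } of samples) {
    const entry = stats.get(group);
    if (!entry) continue; // ignore groups this gate does not track
    entry.total += 1;
    if (status >= 500) entry.errors += 1;
  }

  const reasons = [];
  for (const [group, { total, errors }] of stats) {
    if (total === 0) {
      // No observations is treated as a failure, not as "no errors".
      reasons.push(`${group}: no traffic observed`);
    } else if (errors / total > maxErrorRate) {
      reasons.push(`${group}: ${errors}/${total} requests failed`);
    }
  }
  return { promote: reasons.length === 0, reasons };
}

const healthy = evaluateCanary([
  { group: 'post', status: 200 },
  { group: 'publication-home', status: 200 },
  { group: 'hosted-domain', status: 200 },
]);
const broken = evaluateCanary([
  { group: 'post', status: 200 },
  { group: 'publication-home', status: 500 },
  { group: 'hosted-domain', status: 200 },
]);
const blind = evaluateCanary([{ group: 'post', status: 200 }]);

console.log(healthy.promote, broken.reasons, blind.reasons);
```

Treating missing observations as a reason to halt trades some deployment speed for safety: a quiet test machine blocks the build instead of passing it by default.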

Explanation of resolution

We resolved the initial bug by checking whether the tag object exists before attempting to access its properties. We also verified that the case where the tag object didn’t exist was safe, and corrected that case in a later fix.
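A guarded lookup along these lines might look like the sketch below. It is not the actual fix: resolveTagName and titleCaseFromSlug are hypothetical names, and falling back to a title-cased slug is an assumption based on the title-case default described earlier, not something this postmortem confirms.

```javascript
// Hypothetical helper: turns "data-science" into "Data Science".
function titleCaseFromSlug(slug) {
  return slug
    .split('-')
    .filter(Boolean) // drop empty pieces from doubled or trailing hyphens
    .map((word) => word[0].toUpperCase() + word.slice(1))
    .join(' ');
}

function resolveTagName(tagsBySlug, tagSlug) {
  // Guard: read .name only when the map itself holds a tag object for the slug.
  // The own-property check avoids inherited keys such as "constructor".
  const hasTag = Object.prototype.hasOwnProperty.call(tagsBySlug, tagSlug);
  const tag = hasTag ? tagsBySlug[tagSlug] : undefined;
  if (tag && typeof tag.name === 'string') {
    return tag.name;
  }
  // Assumed fallback, not confirmed by the postmortem.
  return titleCaseFromSlug(tagSlug);
}

const tags = { mlb: { name: 'MLB' } };
const overridden = resolveTagName(tags, 'mlb');
const fallback = resolveTagName(tags, 'data-science');

console.log(overridden); // MLB
console.log(fallback);   // Data Science
```

Keeping the guard in one helper means each call site gets the same behavior for missing tags, and the helper is small enough to cover with a unit test for both the present and absent cases.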

Preventative measures

  • Upgraded health check process with more probes for publication home pages and hosted domains.
  • Hardened the health check process to prevent false negatives for this class of bug.
  • Will add additional unit tests for affected areas.
  • Will add additional end-to-end tests for publication features.
  • Identified steps to speed up rollback following an event such as this.

The Medium Engineering team have committed to publishing a technical postmortem for serious outages to Medium core services, in order to build trust and hold us accountable to our users. More background on this program.
