It Was the Intern. Or Was It? Blameless Postmortems

Daniel Stack
Published in IndigoAg.Digital · Sep 13, 2021 · 3 min read

On June 17, pretty much everyone on HBO Max’s email list received an empty email with the subject “Integration Test Email #1”. A number of jokes followed on Twitter and other social media platforms, with many people wondering whether an intern had accidentally run a test in a production environment.

Later that evening, HBO Max tweeted the following:

We mistakenly sent out an empty test email to a portion of our HBO Max mailing list this evening. We apologize for the inconvenience, and as the jokes pile in, yes, it was the intern. No, really. And we’re helping them through it. ❤️

It was a cute reply, and I didn’t sense any malice behind it. But I’d like to talk about what I hope occurred afterward at HBO Max — and how I’ve seen such incidents handled both at previous companies and here at Indigo.

In my past life, I was responsible for leading postmortems whenever an incident caused a customer outage. Our goals were to learn what happened, why it happened, what was required to fix it, and what sorts of improvements we could put in place to prevent that and other incidents from happening again.

Here at Indigo, we have postmortems after every severity 1 incident — and they are encouraged for incidents of lesser severity. In them, we work through what happened, the timeline of our responses, the root cause, and, perhaps most importantly, takeaways and action items.

Incidents can be broadly defined — they don’t always need to be issues that customers experience. I’ve seen postmortems for projects that go well over timeline and/or budget, even when everything works fine.

Out of postmortems, I have seen a variety of improvements made. Runbooks to help support teams. Additional logging. Certain thresholds for releases. Protective measures for databases. Code review processes. Rules of engagement for error handling. Environmental checks around testing (handy if you want to prevent running a test in your production environment).
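To make that last item concrete, here is a minimal sketch of what such an environmental check might look like in a pytest suite. The APP_ENV and DATABASE_URL conventions are assumptions for the sake of illustration, not a description of any particular project’s setup.

```python
# conftest.py -- hypothetical guardrail that refuses to run the test suite
# against anything that looks like a production environment.
# APP_ENV and DATABASE_URL are assumed conventions for this sketch.
import os

import pytest

PRODUCTION_MARKERS = ("prod", "production")


def _looks_like_production() -> bool:
    env = os.environ.get("APP_ENV", "").lower()
    db_url = os.environ.get("DATABASE_URL", "").lower()
    return env in PRODUCTION_MARKERS or any(m in db_url for m in PRODUCTION_MARKERS)


def pytest_configure(config):
    # Abort the whole session before a single test touches real data.
    if _looks_like_production():
        pytest.exit(
            "Refusing to run tests: environment looks like production "
            "(set APP_ENV to a non-production value to proceed).",
            returncode=1,
        )
```

A guard like this turns “please don’t run tests against production” from something people must remember into something the tooling enforces.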

One danger in this process is that very often there will be someone who looks the most at fault. The developer who wrote an offending line of code. The person who signed off on a less reliable power supply. The data scientist who accidentally dropped the live production database. The intern who ran a test in a production environment.

If a person is terrified their job is on the line, they are unlikely to be a useful participant in understanding what went wrong. Moreover, in my nearly three decades of professional experience as a software developer, I can’t recall a single incident that could be “blamed” on just one person. A power supply is chosen because of cost constraints. A developer makes a fix unaware that another developer is making a code change that renders that fix problematic. A test suite is able to run in a production environment because there are no guardrails to confirm that is truly what’s wanted.

I’ve lived my career on the assumption that most people are operating with the best of intentions. There is, of course, always the possibility of deliberate malice, but even then there should be mechanisms in place to prevent a single bad actor from having such power.

A blameless postmortem is a wonderful thing in an environment where everyone has the best of intentions. Atlassian describes their motivations and goals around blameless postmortems in a well-thought-out blog post of their own. One vital quote:

In a blameless postmortem, it’s assumed that every team and employee acted with the best intentions based on the information they had at the time. Instead of identifying — and punishing — whoever screwed up, blameless postmortems focus on improving performance moving forward.

Ironically, over the course of writing this article, I managed to introduce a minor break of my own: I forgot to tag a commit with the semantic-release comment that bumps the version number. In this repository we don’t always need a version bump, but this was one of the times we did. It was a minor hiccup, easily fixed. But we still performed a quick postmortem and decided to adjust our deployment process to fail if we neglect to update the version in such scenarios.
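Our actual fix isn’t shown here, but a deployment guard along those lines can be quite small. The sketch below assumes a hypothetical project that keeps its version in a VERSION file and tags releases as v<major>.<minor>.<patch>; the filename, tag format, and script name are all made up for illustration.

```python
# check_version_bump.py -- hypothetical CI/deploy guard: fail the pipeline
# if the version in VERSION has not moved past the most recent git tag.
# The VERSION file and "v1.2.3"-style tags are assumptions for this sketch.
import pathlib
import subprocess
import sys


def latest_tag() -> str:
    out = subprocess.run(
        ["git", "describe", "--tags", "--abbrev=0"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().lstrip("v")


def current_version() -> str:
    return pathlib.Path("VERSION").read_text().strip().lstrip("v")


def as_tuple(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def main() -> int:
    tagged, declared = latest_tag(), current_version()
    if as_tuple(declared) <= as_tuple(tagged):
        print(f"VERSION is {declared}, but the latest tag is {tagged}: "
              "bump the version before deploying.")
        return 1
    print(f"Version bump detected: {tagged} -> {declared}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The specific script matters less than the principle: the action item turned “remember to bump the version” into a build that fails loudly when we forget.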

Ideally, postmortems happen after a relatively benign incident like that. But whatever their circumstances, they are opportunities to learn and improve.

Daniel Stack (He/Him/His) is a Boston area software developer with opinions on topics both geeky and serious.