New session type at Reversim Summit: Postmortems

This year Reversim Summit introduces a new type of session: Postmortems.

We think postmortems can be fascinating, inspiring and educational, so we want to encourage you to share your postmortem stories.

Reversim Summit 2019

What are postmortems?

A postmortem is the analysis of a failure, for example an outage, data loss or a security breach. Different companies use different names (“incident review”, “root cause analysis”, etc.), but most tech companies run them in a similar manner. The goal of a postmortem is not only to fix the specific piece that broke, but also to generalize the lessons for the benefit of other systems in the organization and fix many potential issues at the same time.

For example: was there a security breach? Ask yourself what process led to allowing this breach, and where else that process was applied, since other breaches may exist there as well. Was there a communication problem? An ambiguous playbook? Could we have identified the flaw beforehand?

One important aspect of a good postmortem is that it is blameless: we don’t care whose fault it is, and typically it really isn’t one person’s fault. We care about the process that made the fault possible and about how to fix the process, not the person.

Our experience shows that it’s useful, fascinating, inspiring and educational to learn from each other’s postmortems across companies. We therefore ask you to submit and come tell us about your glorious, infamous postmortems: what went wrong, and how did you fix it?

Suggested structure:

  • What was the baseline? What’s the normal state/layout of your systems?
  • What happened?
  • How did it affect your systems?
  • How did you react?
  • How was the problem mitigated?
  • How did you analyze the incident?
  • What were your takeaways?
  • What was the followup process?

Example submission

Title:

A 30-hour intermittent service outage at Example.com

Abstract:

Last fall, Example.com suffered its most severe service outage, rendering some of its customers either completely unable to access their accounts or incapable of processing transactions. We explain the cause of this outage and its side effects, such as a growing backlog which threatened to exceed our computing capacity. We discuss the different effects on customers based in the US vs. Europe and describe our efforts to restore the service without loss of data.
We will illustrate the incident mitigation timeline and the tools our teams found useful to collaborate across the globe.

Extra notes

  • As with blameless postmortems, we, too, are not interested in placing blame on a company, a team or an individual. Please keep your presentation professional and educational.
  • By nature, postmortems expose company internals: technical, logistical, management, human communications. Please make sure your company agrees to this exposure.
  • We are only interested in stories where you had some role.
  • Postmortem session duration is 15 minutes.