Death and DevOps

Dealing with unfortunate things shapes a culture.

Dormain Drewitz
Built to Adapt
6 min read · Dec 19, 2017

UPDATE: You can now watch a talk version of this article from Monktoberfest 2018. See link at the end.

There are certain universal human experiences that transcend cultures. The birth of a child. The coupling of two people in marriage. And death. From baptisms to burials, how we handle these moments is a reflection of our culture.

For better or worse, how death is dealt with leaves a lasting legacy that we can study. You can learn a lot about a culture — a mindset — by how people treat their dead. In medieval Europe, bodies were buried intact, facing east, so that they could rise facing Jerusalem upon the Resurrection. In Judaism, the dead are typically buried within a day and are never left unattended until burial. And, if you saw Coco, you saw a version of how families remember and honor the dead in Mexico.

Image from Pixar’s Coco.

So, what does that all have to do with DevOps?

Well, even in the DevOps world, things die. Servers die. VMs die. Applications die. With the right kind of automation and abstraction, these events matter less. But complex events and human error still cause outages, even at the most sophisticated of cloud-native companies. So, what happens when there’s an outage?

In his breakout at SpringOne Platform, Matt Curry rattled off four steps for dealing with an outage:

I noticed that, of all my live-tweets from SpringOne Platform, that tweet was getting more reactions than the rest. This got me wondering: Why does this four-step process resonate with people? Then it hit me: this is a concrete example of DevOps “culture.” And we need those concrete examples.

What are your rituals?

At SpringOne Platform, many speakers said cultural change was the hardest part of their transformation. As Niki Allen from Boeing quoted, “Culture will eat strategy for lunch.” But for all the talk of how important culture is for transformation, examples were scarce — there were a few ping-pong references and photos of open office spaces.

But, as Ben Horowitz and Jason Rosenthal discussed on this great podcast episode, culture really matters when things go wrong. Culture has a lot to do with how we deal with the hard stuff, from the day-to-day to the life-changing. Most societies deal with hard stuff through a set of rituals. Rituals help us know what to do in really stressful situations (like a death) when it can be hard to think clearly.

Google has a lot of ritual around failure. Listening to Andrew Clay Shafer’s digest of the Google SRE book, I got the sense that the culture around learning from failure was borderline obsessive. It’s like a mourning process, one that ultimately lets people move on.

You need to have a set of rituals for when things go wrong. They may not look like Matt’s list and you may not even be aware of them. These rituals may vary from department to department, but they exist. They also may not be healthy and may be creating friction to change.

For example, your rituals for an outage may involve an urgent conference call. That turns into a blamestorming session, where a couple of people or teams get thrown under the bus. Once the issue is resolved, a manager writes an email with a high-level overview and sends it only to execs. After that, the issue is considered buried and put to rest, only spoken of again on the next blamestorming call.

Defining a new way to mourn the outage

Unfortunately, you can’t just declare a new set of rituals and claim a cultural change. I mean, you could, but it probably wouldn’t work very well. If you want to have any hope of changing a culture, you have to first understand it. Then you can introduce new rituals to change the culture by teaching and practicing.

1) Learn and observe.

Abigail Stason has an insightful framework for what she calls Conscious Commitment. It’s intended (to my knowledge) for individuals, but can apply to teams and organizations as well. She starts with noticing patterns of what’s already happening, then allowing time to really study the behavior.

Observe what happens when something goes wrong. Include the good, bad, and ugly. Document the steps. Figure out what triggers different stages in the process. From there, you will be in a better position to recognize this pattern and, if need be, head it off once you’ve committed to a new set of rituals.

2) Define some new rituals and iterate.

Matt’s four steps are great: short and simple, making them easy to absorb as a multi-step process. They also emphasize humility and empathy, which can be easier said than done. That takes practice.

You can also take inspiration from analysis of public outages, like one at GitLab. Google’s SRE book sounds like a good source of ideas (full disclosure: I haven’t read it). But, be careful not to adopt concepts, like “blameless post-mortem,” without internalizing what they mean.

I’m going to go out on a limb here and say there’s a lean approach to defining new cultural rituals as well. Don’t try to craft the perfect outage response guide. Identify a couple of things that you think would be an improvement and start to try them out. See how they go. Learn from them and change or add to them over time. When part of your culture is continuous improvement, iterating on solutions, you move faster because you don’t get stuck in analysis paralysis trying to nail it the first time.

3) Teach and practice.

So, now you have a couple new things you want to see happen in the event of an outage. How do you make that happen? You need to educate people. You can’t expect them to know any other way than what they are already doing.

We aren’t born knowing what to do at a funeral. For example, I had my first experience with a Jewish funeral this summer. There were little, printed handouts at all the seats of the temple. Each one explained the practices, along with the translated Hebrew prayers. The rabbi explained a lot of what was happening and why as the funeral progressed. Elder members of the community leaned over to fill me in on this or that. Having never participated in that part of the culture, there were several ways for me to learn.

Write down the “this is what we do when there’s an outage” list or manifesto. You’re (probably) not writing it in stone, so it can change. Don’t wait for an outage to spring the list on people. To lift an idea from Adrian Cockcroft, you may even want to run a few fire drills where you can reinforce new rituals.

UPDATE: Replay of a talk version of this post (with updates) at Monktoberfest 2018

Enough death. What are some happy rituals?

Okay, okay. I get that death and mourning are heavy topics. I’m asking you to put your cultural anthropologist hat on (looks like this). There’s something to be learned from cultural response to death and your culture around outages. Both are stressful events that are a question of when, not if, they will happen.

But we can also apply the same thinking to other moments. What about the “birth” of a new product? How do you celebrate its launch? How does the community support the new “parents”? These are questions for another post, but I’ll leave you with this thought: If you go to observe your rituals for new products and can’t find any, is your birth rate too low?

What are some of your team’s rituals around outages, new products, or something else? Let me know by responding to this post.

Change is the only constant, so individuals, institutions, and businesses must be Built to Adapt. At Pivotal, we believe change should be expected, embraced, and incorporated continuously through development and innovation, because good software is never finished.

Dormain Drewitz

History nerd, ex-equities analyst, student of IT trends, printmaker, mom, goofball @dormaindrewitz