Failure is the Only Option

Why do we spend so much time and money trying to prevent failure when failure is inevitable? Is it not true that to err is human? We often forget the second part of that quote: “to forgive, divine.” We focus so much on trying to defeat our very nature that we often fail to forgive failures. A blameless culture, built on forgiveness, facilitates learning and understanding.

Accepting failure as a given is our first step toward forgiveness. These moments of failure should be a time for learning and for repairing our systems based on new knowledge. It would be great if we could always predict and prevent every way a system can fail, but we can’t. There’s no way to know when a hard drive will fail, but we can mirror that drive or back up its data regularly. Distributed technologies even allow for replication across multiple hosts, so that losing an entire system doesn’t degrade the integrity of the data.

This doesn’t mean that we should stop protecting against failures, but rather that our goal should not be zero failures. Our goals should focus on time to recovery. If it takes seconds to recover from a failure, do we care if we fail in some cases? If we automate a recovery, we likely won’t even notice the failure when it happens. Our monitoring systems will record it, of course, but we won’t find out until the morning when we’re back at work.
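To make the idea of automated recovery concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `run_with_recovery`, `task`, `recover`, and `incident_log` are hypothetical names, and a real system would lean on its own supervisor or orchestrator rather than a hand-rolled loop. The point is only that the failure gets repaired right away and written down, so we can read about it in the morning.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_recovery(
    task: Callable[[], T],
    recover: Callable[[], None],
    max_attempts: int = 3,
    incident_log: Optional[list] = None,
) -> T:
    """Run `task`; on failure, run `recover`, record the incident, and retry.

    All names here are hypothetical. `recover` stands in for whatever
    repairs the system (restart a process, fail over to a replica, ...).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            started = time.monotonic()
            recover()
            # Record the failure instead of paging anyone: it is there to
            # be reviewed later, without blame.
            if incident_log is not None:
                incident_log.append(
                    {
                        "attempt": attempt,
                        "error": repr(exc),
                        "recovery_seconds": time.monotonic() - started,
                    }
                )
    # Automation has run out of options; now a human needs to know.
    raise RuntimeError(f"task still failing after {max_attempts} attempts")
```

Only when the automated attempts are exhausted does the failure surface as an error. Until then, the log is the only trace that anything went wrong.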

This is an important area of change and one in which we’ll need to invest time and effort. However, that time and effort will pay dividends on the other side. This type of change requires strong leadership that understands the results of the upfront investment will take time to appear, and will arrive in smaller amounts over a long period. This isn’t something that fits well into a standard goal or business case. However, the evidence of this work will show up in goals related to uptime and mean time to recovery (MTTR). These are our real concerns.
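Both of those numbers are simple to compute once we record when each failure started and when it was resolved. The sketch below assumes incidents are stored as (failed_at, recovered_at) pairs; the function names and the sample data are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average duration of the given (failed_at, recovered_at) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum((end - start for start, end in incidents), timedelta(0))
    return total / len(incidents)


def uptime_fraction(incidents, window):
    """Fraction of `window` (a timedelta) not spent in a failed state."""
    downtime = sum((end - start for start, end in incidents), timedelta(0))
    return 1 - downtime / window


# Hypothetical data: three short outages in a 30-day window.
t0 = datetime(2024, 1, 1)
incidents = [
    (t0, t0 + timedelta(seconds=30)),
    (t0 + timedelta(days=5), t0 + timedelta(days=5, seconds=90)),
    (t0 + timedelta(days=20), t0 + timedelta(days=20, seconds=60)),
]

print(mean_time_to_recovery(incidents))  # 0:01:00
print(f"{uptime_fraction(incidents, timedelta(days=30)):.5%}")
```

Three failures in a month sounds bad if the goal is zero failures. With a one-minute MTTR, it adds up to three minutes of downtime.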

To solve these issues, we’ll also need to get better at working as a unified team. With a blameless culture, we’ll build trust and be able to interact with one another empathetically. Trust is a key factor in a blameless culture. It isn’t just earned when you are successful. It can also be earned when you address failure with an open mind and without an accusatorial mindset. Failure is not someone’s fault, but rather a fault of the system we all belong to. Something in the process has failed, and it is the duty of everyone to fix it. This may be a change in documentation, code, or automation. It may be a change in process. We won’t know until we can investigate. Then we can fix the fault in our system.

I know we all hate meetings, but a meeting may be the best way to quickly investigate an issue. The broader community calls these blameless postmortems. Everyone involved in the system congregates in a room and walks through the process. Everything is done in the open, with all contributors having access to the data related to the failure. Of course, a bout of finger-pointing may erupt, so it’s often necessary to include a facilitator who can help keep tensions at bay.

This facilitator should be someone without a direct connection to the incident who can remain objective. They should not contribute to the discussion, but rather guide it. They need to be assertive so they aren’t overpowered by those in higher positions. This role is very important in low-trust environments where blaming and an us-versus-them mentality exist. As organizations mature past this, the role of the facilitator can fade away.

Failure is an inherent problem in our nature. We will never stop failure, but we can focus on mitigating and recovering from failure by learning from previous failures. Every failure should be viewed as a chance to improve the system and remove another failure vector. So let’s stop blaming and start working together to improve our systems.

Story by Daniel Barker.