Blameless Post Mortem
Most of the senior engineers at my company have never had any other job. They joined the company fresh out of college when it was just a startup and built the product we have today. They did an amazing job. We are a leader in our industry and are growing extremely fast.
A few months ago we hired a senior engineer who has worked for a couple of different companies in the Boston area. He introduced me to the concept of a Blameless Post Mortem after I deployed a change to our Rails app that contained a bug and took the app down for a couple of minutes. I fell in love with the concept right away, but the other senior engineers argued that people’s time is too valuable to spend on additional meetings.
I understand the argument about not wanting any more meetings, but as our company grows we need some way to communicate systemic vulnerabilities. The bug that caused the error was created by a custom Rails engine that we use at my company. When a senior engineer called me into his office to discuss what happened, the conversation went something like this:
SE: So it was your commit that caused the app to go down, but don’t worry, you didn’t do anything wrong.
Me: Oh, I’m sorry. Everything was working fine for me when I was running the code locally. I didn’t even do anything out of the ordinary. I’m surprised what I did could crash the whole app. What was wrong with the code?
SE: Yeah, it looks like you used our whatchyamacallit engine to generate the scaffold for the feature you were building.
Me: Yeah, I thought that’s how we were supposed to do it according to our internal documentation.
SE: Right. The problem is that sometimes it creates an extra file that we don’t need, which causes the app to crash when we run it in production. It’s a known bug with the generator. It will run fine locally, but when we compile the changes for production it throws an error that prevents the entire app from booting up. It won’t be a problem once we update the app to Rails 5. When this happens, the most important thing to do is to roll back your commit as soon as possible, and then we can debug the code locally.
The solution was as simple as deleting a couple of extra files that did not need to be created. When I talked to other engineers about how I accidentally crashed our app, I got two kinds of responses.
From engineers who have been with the company for 2+ years, the response was always something along the lines of “Oh yeah, I’ve done that before. You used the whatchyamacallit engine, didn’t you?”
Newer engineers would respond with “Wow, that’s how it happened? I had no idea. What is the protocol for when the app crashes?”
It sucks being the engineer who crashed the application. Even when you work with nice senior engineers who tell you it’s not your fault, you feel like it’s your fault. The only thing that made me feel better was being able to explain to other engineers what happened to prevent them from making the same mistake in the future, and what the protocol is when you do crash the app.
This will not be the last time a piece of code crashes one of our apps. When the team was small, information could be communicated quickly among the engineers in person. As our company continues to grow, it becomes harder to explain to each person individually what happened, how, and why each time an app crashes.
I would prefer to learn lessons together as a team and benefit from each other’s mistakes. My post about GitLab is a good example of the unintended benefits that being transparent about bugs can have. Is it possible to convince other engineers in my company that doing something like a Blameless Post Mortem is worth the time investment? If so, how?
I would love to hear your feedback if you have ever found yourself in a similar situation at a company. Even if you haven’t, any comments or suggestions would be greatly appreciated. My Twitter handle and email are available at the bottom of this post.
Thanks for reading.