A post-mortem at Arena
A problem that led to an innovation
An accidental deployment
One morning, one of our engineers intended to deploy code to our staging environment, but chose an incorrect build plan and deployed it to the production environment. The deployed code came from an older branch with unfixed bugs.
Within a few moments our production monitoring system signaled that something was wrong. We jumped into action, investigated and fixed the issue within fifteen minutes. The impact on the production environment was low, but it had the potential to disrupt our clients. It warranted a post-mortem.
Post-mortem at Arena
In our post-mortems, we discuss issues in a blame-free way: we consider what happened, why it happened, and what processes can be put in place to prevent similar issues from happening in the future. It’s a nod to our human nature: we all make mistakes, even when being careful.
The meeting takes about one hour, with some preparation before and follow-up after. It’s scheduled as soon as possible after an incident, so people clearly remember the details. The preparation is usually organized by one person, and includes:
- Constructing a timeline — so everyone can agree on the sequence of events that led to the issue
- Coming up with problems suggested by the timeline
At the beginning of the meeting, everyone reads through the timeline once more to refresh their memory. Then, the group works through the list of possible problems using the 5 Whys technique to dig into the underlying causes. If new problems are identified, they are added to the list, sometimes leading to a “tree of whys”. Once the causes have been mapped, we discuss actions that can be taken to prevent the problems from happening again.
Following the meeting, the organizer translates the notes into a wiki doc and makes tickets for each action item. The attendees assign owners to the tickets and the wiki is shared so that everyone in the company can benefit from the lessons learned.
A problem that led to an innovation
The production environment is unforgiving when errors are made, requiring people to be more mindful when making changes. Our main concern after the accidental deployment was that it was too easy to deploy code to that environment without checks or safeguards. Also, as we grow, we’ll have more build plans and more chances of something like this happening again.
Our timeline and the 5 Whys discussion revealed three key problems:
- Our system didn’t notify people of what was being deployed to which environment
- There were no safeguards to keep people from deploying to the wrong environment
- There was no traceability of what had been deployed where
The discussion went back and forth. At first, people suggested individual solutions for each problem. For notifications, we thought about sending Slack messages whenever a deploy was made. As a safeguard, we considered pair programming on every deploy. We then realized that we already had a system in place that ticks all these boxes: our peer review process. Building on it, we designed a new process called “Deploy by Code”.
It consists of one folder per environment, and inside each folder there’s one file per application. Some path examples:
In each file, the first sequence of non-space characters is the commit hash to be deployed. Everything else is ignored.
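The parsing rule above can be sketched in a few lines of Python. The function name `extract_commit_hash` and the hex-format check are assumptions made for illustration; the only rule stated here is that the first run of non-space characters is the commit hash.

```python
import re

# Assumption: hashes are abbreviated or full Git SHA-1 hashes (7 to 40 hex chars).
_HASH_RE = re.compile(r"[0-9a-f]{7,40}")


def extract_commit_hash(deploy_file_text: str) -> str:
    """Return the first sequence of non-space characters in a deploy file."""
    tokens = deploy_file_text.split()  # splits on any whitespace, including newlines
    if not tokens:
        raise ValueError("deploy file is empty")
    candidate = tokens[0]
    # Extra sanity check that is not part of the described rule:
    # reject anything that does not look like a Git commit hash.
    if not _HASH_RE.fullmatch(candidate):
        raise ValueError(f"not a commit hash: {candidate!r}")
    return candidate
```

Everything after the first token, on the same line or on later lines, is ignored by this function, which is what leaves room for free-form comments in the file.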
Those deploy files are under version control and the process works like this:
- The hash of the next commit to be deployed is prepended to a deploy file and the change is submitted for peer review
- The reviewer validates the changes and approves the deploy
- The commit is merged, triggering an automatic deploy. The environment and application are inferred from the path; the commit, from the file.
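As a rough sketch of the last step, the function below turns one changed deploy file into a deploy request. It assumes a hypothetical layout of `<root>/<environment>/<application>`; the `Deploy` type, the function name `plan_deploy`, and the example path are all invented for illustration and are not taken from our actual tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Deploy(NamedTuple):
    environment: str
    application: str
    commit: str


def plan_deploy(path: str, file_text: str) -> Deploy:
    """Build a deploy request from a changed deploy file.

    Assumption: the last two path components are <environment>/<application>.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        raise ValueError(f"path has no environment folder: {path!r}")
    environment, application = parts[-2], parts[-1]

    tokens = file_text.split()
    if not tokens:
        raise ValueError(f"deploy file is empty: {path!r}")
    # The commit comes from the file: its first run of non-space characters.
    return Deploy(environment, application, tokens[0])


# Hypothetical path and placeholder hash, for illustration only.
request = plan_deploy("deploys/staging/billing-api", "abc1234 * Deploy alarms\n")
```

A merge hook could call a function like this for every deploy file touched by the merged commit and hand the resulting requests to the deployment tool.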
Only the first sequence of non-space characters in the file is used for the deployment, so it can be followed by additional comments. As deploys are made, the file accumulates the history of deployed commit hashes, for example:
* Deploy alarms X, Y and Z
* Deploy changes to endpoint
* Updating library XYZ
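Preparing a deploy then amounts to putting a new line at the top of the file. The helper below is a minimal sketch of that edit; the `<hash> * <comment>` line format, the function name `prepend_deploy`, and the hashes used are assumptions for illustration.

```python
def prepend_deploy(existing_text: str, commit: str, comment: str = "") -> str:
    """Return new deploy-file text with `commit` as its first token."""
    if not commit or any(ch.isspace() for ch in commit):
        raise ValueError("commit hash must be a single non-empty token")
    # Collapse the comment onto one line so the history stays one entry per line.
    comment = " ".join(comment.split())
    line = f"{commit} * {comment}" if comment else commit
    return line + "\n" + existing_text


# Placeholder hashes, not real commits.
old = "abc1234 * Deploy alarms X, Y and Z\n"
new = prepend_deploy(old, "def5678", "Updating library XYZ")
```

Because earlier lines are left untouched, the file's version-control history and the file itself both record what was deployed and in which order.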
This solution reduced the likelihood of mistakes and gave us:
- A broadcast system where people or teams are notified whenever there are changes to those files
- Confirmation through peer review for deployments
- Traceability of deploys
One downside of the current implementation is that it doesn’t track deployment dependencies. If two applications need to be deployed at the same time, it’s up to the engineer to submit a Pull Request modifying both deploy files in the same deploy commit. Also, people spend a bit more time preparing the deployments, but going through the extra step of peer review makes them more mindful of what they are doing.
Additionally, we added new automation incrementally. One example is a blocking reviewer specifically for production deploy commits, which adds an extra level of awareness at little cost: with a single click, the developer confirms that the action is intended before the deploy proceeds.
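The rule behind such a check can be very small. The sketch below decides whether a change set touches a production deploy file; the folder names `deploys` and `production` and the function name are hypothetical, and wiring the result into a review tool is left out because it depends on the tool in use.

```python
from pathlib import PurePosixPath
from typing import Iterable

DEPLOY_ROOT = "deploys"        # hypothetical root folder for deploy files
PRODUCTION_DIR = "production"  # hypothetical production folder name


def needs_blocking_reviewer(changed_paths: Iterable[str]) -> bool:
    """Return True if any changed path is a production deploy file."""
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) >= 3 and parts[0] == DEPLOY_ROOT and parts[1] == PRODUCTION_DIR:
            return True
    return False
```

Staging-only changes pass through unchanged, so the extra confirmation is only asked for where a mistake would be costly.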
Once we recognize that we are human and make mistakes, we can create a safe space for discussion and reflection that leads to meaningful improvements. The eventful day of the accidental deploy gave us the opportunity to revisit and revamp our deploy process. We haven’t had a post-mortem on accidental deploys since then.