How to avoid making mistakes
Errors, mistakes and failures are part of shipping products. We all make mistakes, and as much as we would like to avoid them, they do happen.
Walking away from any activity that can lead to errors isn’t the right solution; identifying the root cause of an error and addressing it is.
The problem is that we rarely record our mistakes, their causes, their impact and the corrective actions that would prevent them, so we risk repeating the same mistakes.
Successful companies solve this by documenting a post-mortem or COE (correction of errors). Rather than explain what these are, here is what such a document should look like:
2–3 line description of the incident.
The customer and business impact of the outage.
What metrics did this outage affect?
A detailed timeline of when the issue was introduced, when it started affecting users, when we learnt about it and when we fixed it. This shows how quickly we identify and fix issues.
Resolution steps taken
What actions were taken to resolve the issue?
Why did this happen?
A 5-whys exercise to understand the root cause of this issue. Read about 5-whys
What worked well?
What helped us either minimize the impact or get the issue resolved.
What didn’t work well?
Unexpected or broken things that hindered our progress.
List the action items coming out of the post-incident analysis which will help us avoid this incident in the future.
Add appendix, notes and other stuff here
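The timeline part of the document lends itself to a small calculation. Here is a minimal sketch, in Python, of how the four timestamps could be turned into turnaround numbers. The class name, field names and the sample times are all made up for illustration; use whatever fits your own process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """The four timestamps from the timeline (names are illustrative only)."""
    introduced_at: datetime    # when the issue got introduced
    user_impact_at: datetime   # when it started affecting users
    detected_at: datetime      # when we learnt about it
    resolved_at: datetime      # when we fixed it

    def time_to_detect(self) -> timedelta:
        # How long users were affected before we noticed.
        return self.detected_at - self.user_impact_at

    def time_to_fix(self) -> timedelta:
        # How long the fix took once we knew about the issue.
        return self.resolved_at - self.detected_at

    def total_user_impact(self) -> timedelta:
        # How long users were affected overall.
        return self.resolved_at - self.user_impact_at


# Made-up sample incident, not real data.
timeline = IncidentTimeline(
    introduced_at=datetime(2024, 1, 10, 9, 0),
    user_impact_at=datetime(2024, 1, 10, 9, 30),
    detected_at=datetime(2024, 1, 10, 10, 15),
    resolved_at=datetime(2024, 1, 10, 11, 0),
)

print(timeline.time_to_detect())     # 0:45:00
print(timeline.time_to_fix())        # 0:45:00
print(timeline.total_user_impact())  # 1:30:00
```

Keeping detection time and fix time separate makes it easier to see whether the action items should focus on monitoring or on the repair process itself.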
So is this a document for engineers only? No. Use it for design, marketing, legal, finance or whatever your role is. The corrective process is the same, and so is the return on investment: fewer errors.