What’s your “problem”?
by Hugo Miguel Dias, Engineering Lead
According to the ITIL v4 definition (source: ITIL Foundation — ITIL 4 Edition), a problem is a cause or potential cause of one or more incidents. In a continuously evolving platform, it is inevitable that incidents will occur. If we don’t experience failures from time to time, we aren’t able to develop or move forward. Teams that don’t fear failure are better suited to take more significant risks and achieve what’s never been done. So, why are problems such a hard thing to admit in our daily work?
First of all, fear of failure is ingrained in our minds. It’s challenging to admit that a decision or action we’ve taken led to a problem and even harder to communicate to everyone about it.
Secondly, it’s hard to see something go wrong with a system we all work so hard on. This gives us the terrible sensation of uncertainty and vulnerability. However, these are the situations that drive us to go the extra step, to improve the reliability and robustness of the system.
Take the exceptional example of Toyota. Over the years, Toyota has been praised for its outstanding performance. The secret? Continuous improvement. Why is this related to problem management? As outlined in Toyota Kata, a management book by Mike Rother, Toyota’s continuous improvement rests on finding problems and working towards solutions that remove them. Every employee at Toyota has the responsibility to raise awareness of potential problems occurring on the shop floor. In this environment, problems are seen as a means for improvement rather than as something that needs to be avoided at all costs (source: Toyota Kata, by Mike Rother).
To help us understand how vital and effective problem management can be, I’ve identified five steps that one should consider when dealing with problems.
Step 1 — Acknowledgement
Dr Phil once said, “You can’t change what you don’t acknowledge.” With this in mind, the first step, and possibly the most important one, is recognising that a problem exists and embracing it. We will be able to remove our bias and start working to solve the problem as soon as we acknowledge that fact. A problem should be seen as an opportunity to find gaps or loopholes that weren’t previously detected when designing or maintaining a system.
Let’s be honest with ourselves. In our line of work, it’s tough to keep track of every detail in a complex, fast-paced system when designing new solutions or even performing maintenance tasks. Even with complete and up-to-date documentation about the system, the cognitive load is tremendous. Sporadic failures are to be expected. Don’t be afraid. Embrace them as a continuous improvement task.
Step 2 — Visibility
A problem should be visible to everyone. Don’t hide it. Cultivate a blameless culture to empower people to focus their minds on problem identification and resolution, instead of wasting time pointing fingers.
Providing visibility of the problem to the organisation will allow us to:
- Ensure the stakeholders are aware that the platform is experiencing instability and that the teams are working on a solution. This also sends a message of transparency, which raises the stakeholders’ confidence in the teams
- Share the situation with other areas that might potentially experience similar problems in the future, allowing them to apply preventive measures before it is too late
- Empower the organisation with crucial information about the health of the platform. This can ensure the resolution of the issue is prioritised rather than focusing on the delivery of new features/changes
Step 3 — Workaround
The implementation of workarounds is of absolute importance in the problem management process, because it allows us to minimise the impact on the business. Workarounds can take several forms:
- a manual process that is documented in a step-by-step guide
- a fully automated process
- or, a temporary code change
Bear in mind that a workaround is a short-term solution created to quickly restore the system to a stable state. This provides the necessary time for the teams to find a definitive solution to resolve the problem.
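As a rough illustration of the “temporary code change” form, a workaround is often a guarded switch that routes requests away from the failing path until the definitive fix lands. The sketch below is a minimal, hypothetical example: `WORKAROUND_ENABLED`, `fetch_price_live`, `fetch_price_cached` and the sample data are all assumptions made for illustration and do not describe any real system.

```python
# Hypothetical sketch: every name and value here is an illustrative assumption.

# Temporary switch; it should be removed once the root cause is fixed.
WORKAROUND_ENABLED = True

# Stand-in for a last-known-good data source.
_CACHED_PRICES = {"sku-1": 100}


def fetch_price_live(sku: str) -> int:
    """Stand-in for the component that is currently failing."""
    raise RuntimeError("pricing service unavailable")


def fetch_price_cached(sku: str) -> int:
    """Workaround path: serve the last known good value."""
    return _CACHED_PRICES[sku]


def fetch_price(sku: str) -> int:
    """Route around the failing component while the workaround is active."""
    if WORKAROUND_ENABLED:
        return fetch_price_cached(sku)
    return fetch_price_live(sku)
```

Like the barriers around the hole in the pavement, the switch only contains the impact. The failing path still needs its definitive fix, after which the switch should be removed.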
To illustrate the workaround lifecycle, take the following example: A person is walking on a sidewalk and suddenly falls due to a hole in the pavement. A short-term solution to prevent other people from falling in the same spot is to isolate the area with barriers. Why is this a short-term solution? Although the barriers prevent people from harming themselves, they might tumble in bad weather, leaving the hole exposed once again and waiting for its next victim. The danger still exists, and to completely remove it, the degraded pavement needs to be fixed. Only then will walking on the sidewalk be safe again.
Step 4 — Investigation and Resolution
The investigation of a problem should have a holistic approach. Every piece of evidence should be analysed, even if outside specific boundaries. A problem is like an onion. Besides the fact that it makes you cry, it might contain several layers that hide the actual origin of the issue and can mislead the investigation direction.
After gathering all the evidence and pinpointing the root cause or causes of the problem, the next step is to implement the solution to eliminate the problem completely. Why the emphasis on “completely”? Fixing the root cause may not be enough. In various situations, problems may have caused collateral damage that needs to be identified, and corrective measures should be put in place to fix it; otherwise, the negative impact on the platform will persist (potentially leading to revenue losses). The problem resolution is only complete when all the affected parts of the system have been dealt with.
Step 5 — Continuous Improvement
A critical aspect of encountering a problem is acknowledging the lessons it taught us and understanding what we can do to prevent it or similar problems from occurring again.
To close the loop, the final step is to share the findings and resolution of the problem. This feedback raises awareness for specific situations that have the potential to happen in other parts of the system, allowing the teams to implement preventative measures.
To highlight the importance of this, let me share a real-life example: A long time ago, circa 1940, the number of aviation accidents was skyrocketing. As a result, the Chicago Convention on International Civil Aviation, which took place in 1944, proposed a set of procedures to investigate plane accidents. The sole purpose of these procedures was to ensure that everyone in the industry could learn from their mistakes and apply measures to prevent accidents. This allowed the number of accidents to fall from about 1,000 in 1944 to about 100 in 2020 (source: https://www.baaa-acro.com).
Let me tell you a secret: incidents will keep happening in a fast-paced, growing, complex platform. They are inevitable. The key here is to fail fast and learn from it. To embrace the fail-fast culture, one must focus on:
- Adding measurement to the platform as part of initiatives, to provide faster and more reliable information about the health of the platform
- Minimising the technical debt that has the potential to cause future problems (establishing a preventive maintenance culture)
- Defining feature rollout/cutover plans that cover all the scenarios with the potential to cause incidents, and preparing adequate countermeasures
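The first point, adding measurement, can start very small. As a minimal sketch under stated assumptions, the hypothetical `HealthTracker` below keeps the outcomes of the most recent requests and reports an error rate. The class name, the window size and the 5% threshold are all illustrative choices, not something prescribed by any framework.

```python
from collections import deque


class HealthTracker:
    """Tracks the outcomes of the most recent requests (hypothetical helper)."""

    def __init__(self, window: int = 100) -> None:
        # Keep only the last `window` outcomes; older ones drop off.
        self._outcomes = deque(maxlen=window)

    def record(self, success: bool) -> None:
        """Store the outcome of one request."""
        self._outcomes.append(success)

    def error_rate(self) -> float:
        """Fraction of failed requests in the window (0.0 when empty)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def is_healthy(self, threshold: float = 0.05) -> bool:
        # The 5% default threshold is an arbitrary assumption.
        return self.error_rate() <= threshold
```

A team could call `record` after each request and alert when `is_healthy()` turns false. In practice, a dedicated monitoring stack would usually take the place of a hand-rolled class like this one.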
Finally, wherever there is a problem, there is an opportunity to move towards a more reliable system.