The Art of Root Cause Analysis: The way to become an IT Superhero is knowing that HOW an issue is fixed matters more than simply fixing it

--

Question: What steps or methodology do you use to resolve IT issues? Do they give you confidence that you and/or the team can fix anything as long as you use those steps and methodology?

In IT, when something is broken, you can’t just fix it. Every IT area has steps defined for HOW to fix issues. Why? Solutions are intricate. When something breaks, you need to find out why in a particular WAY, then fix it in a particular WAY. Otherwise, you will attempt to fix an issue and end up making it worse. You could even cause a major outage that takes days or weeks to fix.

For example, if your television at home starts to flicker, you simply turn it off and on or unplug the cord and plug it back in. This typically resolves the issue.

In IT, if users report that a particular application screen is flickering, a specific methodology is followed.

  • They call the customer support line or help desk. Teams of highly trained people, backed by extensive online and printed information, help isolate the cause of the issue by asking questions such as:
  • Is more than one user affected?
  • Is only one application affected?
  • When did the issue start?
  • What types of devices are experiencing the issue?
  • As needed, the issue is escalated to 2nd level support, where more highly trained experts apply more advanced analysis to isolate the issue and propose solutions.
  • As needed, the issue is sent to one or more IT teams to either continue isolating the issue or resolve it.
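The isolation questions above can be captured as a small structured record, so every call is triaged the same way. The following is only a sketch: the `TriageReport` fields, the `suggest_support_level` function, and the escalation rule are hypothetical illustrations, not a standard help desk policy.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TriageReport:
    """Answers to the help desk's isolation questions (hypothetical structure)."""
    affected_users: int               # Is more than one user affected?
    affected_applications: list[str]  # Is only one application affected?
    started_at: datetime              # When did the issue start?
    device_types: list[str]           # What types of devices are affected?


def suggest_support_level(report: TriageReport) -> int:
    """Return 1 to stay with the help desk, 2 to escalate to 2nd level support.

    Assumed rule of thumb for illustration: escalate when the issue spans
    more than one user, application, or device type.
    """
    widespread = (
        report.affected_users > 1
        or len(report.affected_applications) > 1
        or len(report.device_types) > 1
    )
    return 2 if widespread else 1
```

A real help desk would tune the rule to its own escalation policy; the point is that the same questions get asked and recorded for every issue.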

Long before a fix is proposed and implemented, Root Cause Analysis may take hours or, in worst-case scenarios, days. Skipping it to save time can backfire:

  • If you simply press the On/Off button on application servers or pull the plug, you could make the issues worse. Some fixes need to be applied before a server is rebooted, or you could get stuck with a server that will not boot up at all.
  • If you shut down a firewall because users are receiving virus-laden emails, you could let even more security threats through and damage many more devices.
  • If you rush to implement an untested application code fix without regression testing the key functions of the application, your fix could resolve an issue with processing invoices yet unknowingly block the ability to take credit card payments, causing a major outage and financial impact.

To be an IT Superhero, use the following approach to Root Cause Analysis.

  • Gather as much information as possible. Get users who are having the issue on the line, or have them send screenshots. The team needs to see exactly which area of the system has the issue.
  • Coordinate closely by including the right people at the beginning of the Root Cause Analysis. Open a web and/or conference call meeting and bring in resources from the areas of IT where the issue has been identified. Other areas can be added as the analysis continues.
  • Communicate to the affected business leaders and get confirmation that they have informed the relevant customers.
  • Once a fix has been suggested, ask the team to identify all relevant Regression Test Scripts and make sure that all of the system's major functions are tested.
  • Ask the Operations Team to carefully monitor the system after the fix is deployed. Sometimes additional fixes are needed that the team could not identify earlier.
  • Carefully document all issues and fixes.
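One way to keep the approach above honest during an incident is to track it as a checklist. This is a minimal sketch: the step wording is paraphrased from the list above, and the names `RCA_STEPS`, `outstanding_steps`, and `ready_to_close` are hypothetical.

```python
# Steps paraphrased from the approach above, in the order they are listed.
RCA_STEPS = [
    "Gather information from affected users (calls, screenshots)",
    "Bring the right IT teams onto a web or conference call",
    "Inform business leaders and confirm customers were notified",
    "Run regression test scripts covering the major functions",
    "Have Operations monitor the system after deployment",
    "Document all issues and fixes",
]


def outstanding_steps(completed: set[str]) -> list[str]:
    """Return the steps not yet completed, in their original order."""
    return [step for step in RCA_STEPS if step not in completed]


def ready_to_close(completed: set[str]) -> bool:
    """An incident is ready to close only when no step is outstanding."""
    return not outstanding_steps(completed)
```

For example, `outstanding_steps(set(RCA_STEPS[:5]))` returns only the documentation step, which is often the one that gets skipped once the system is healthy again.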

By working this way, you follow a thorough, ordered approach in which HOW the fix was determined matters more than just deploying it. This gives leadership confidence that everything that could be done WAS done to return the system to a healthy state. It also gives you, the leader, and the team confidence to address any issue, since a proven methodology is available.

--

Diane Edwards, PMP - Senior Project Manager

Diane is an author and podcast host who has managed hundreds of projects, including for CBS, Showtime, IBM, Verizon, Avis/Budget, and CIGNA.