Why is IT Operations management even harder than before?

Ruchi Mahindru
Published in IBM Cloud
3 min read · Jun 29, 2022
Figure: Complexity introduced to IT Operations management due to digitization explosion (photo by IBM)

Cloud service providers manage data centers that serve a wide variety of applications for a large number of customers, and they are expected to do so at low cost and with minimal downtime. To scale their operations, they rely on general practitioners and invest in AI-based solutions, such as AI for IT Operations.

Below, I share some of the challenges and inefficiencies that I have observed first-hand while working with Site Reliability Engineers (SREs) over the last couple of years.

First, my understanding is that SREs are often general practitioners with expertise in certain domains. As a result, they may not be efficient at diagnosing and remediating problems in applications where they lack expertise.

Second, by the nature of their work, SREs deal with complex systems that generate large volumes of log and metric data, which may be hard for a non-domain expert to interpret. As an example, consider this sample anomalous log line:

DMxZ0302E: RAException occurred. Error code is: RMFAIL (-7). Exception is: . . . RA exception: . . . Invalid operation: Connection is closed. ERRORCODE=-4x8, STATE=0834

Here, DMxZ0302E is the MessageCode, and "RAException occurred. Error code is: RMFAIL. Exception is: . . . RA exception: . . ." is the MessageString. As this example shows, it would be challenging for a non-domain expert to comprehend such a log line, let alone diagnose and resolve the underlying problem.
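As a rough illustration of the MessageCode and MessageString split described above, the following Python sketch separates a log line at the colon that follows its leading token. The names split_log_line and ParsedLog are hypothetical, and the assumption that a message code is a leading alphanumeric token followed by a colon is only a guess based on this one sample, not a description of any product's log format. It also treats everything after the code as message text, which is a simplification: the MessageString above covers only the leading part of that text.

```python
import re
from typing import NamedTuple, Optional


class ParsedLog(NamedTuple):
    """Hypothetical container for the two parts of a log line."""
    message_code: str
    message_text: str


# Assumption: the message code is a leading alphanumeric token
# immediately followed by a colon (as in the sample "DMxZ0302E: ...").
_CODE_AND_TEXT = re.compile(r"^\s*([A-Za-z0-9]+):\s*(.*)$", re.DOTALL)


def split_log_line(line: str) -> Optional[ParsedLog]:
    """Split a log line into (message code, remaining text).

    Returns None when the line does not start with a code-like token.
    """
    match = _CODE_AND_TEXT.match(line)
    if match is None:
        return None
    return ParsedLog(match.group(1), match.group(2).strip())


# The sample log line from the text, copied verbatim (elisions included).
sample = (
    "DMxZ0302E: RAException occurred. Error code is: RMFAIL (-7). "
    "Exception is: . . . RA exception: . . . Invalid operation: "
    "Connection is closed. ERRORCODE=-4x8, STATE=0834"
)

parsed = split_log_line(sample)
if parsed is not None:
    print("MessageCode:", parsed.message_code)
    print("Message text:", parsed.message_text)
```

Even with the line split this way, the code and the text remain opaque without domain knowledge, which is the point of the example.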

These challenges typically lead to the following inefficiencies:

Inadequate Query Formulation for Resolution Retrieval: To diagnose and resolve a problem, SREs need to search through repositories, which in turn requires formulating a query. Without a thorough understanding of the problem domain, however, creating a viable query can be a challenging and often ineffective task. This may lead to highly inefficient, trial-and-error problem diagnosis and resolution retrieval.

Incorrect Problem Routing: One of the key pain points that SREs report is that problems get bounced from one team to another because of insufficient details and a misunderstanding of the problem. In turn, this may lead to increased Mean-Time-To-Resolve (MTTR) and violations of business Service Level Objectives (SLOs).

Ad-hoc Problem Diagnosis and Resolution Application: If the problem is not well understood, the formulated queries will be incomplete or incorrect. The resolutions retrieved will therefore be irrelevant, if not incorrect. This leads to ad-hoc problem diagnosis and remediation actions being performed.

Incomplete Resolutions: Historical incidents are typically poorly documented; they may not contain all the end-to-end steps that SREs need to perform to solve the underlying problem. For instance, resolutions may consist of partial phrases such as “restarted server”, “increase/decrease memory”, or “change configuration variable”, or worse, they may simply say “forwarded to team”, and so lack the full context and details of the remediation actions.
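To make the MTTR and SLO impact mentioned under Incorrect Problem Routing more concrete, here is a minimal Python sketch of how the two figures could be computed from incident timestamps. The incident records are hypothetical, made-up values for illustration only, and treating the SLO as a maximum resolution time per incident is an assumption; real SLOs are often defined differently.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records as (opened, resolved) pairs.
# These are illustrative values, not real data.
incidents = [
    (datetime(2022, 1, 1, 9, 0), datetime(2022, 1, 1, 10, 0)),   # 1 hour
    (datetime(2022, 1, 2, 9, 0), datetime(2022, 1, 2, 13, 0)),   # 4 hours
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 11, 0)),   # 2 hours
]


def mean_time_to_resolve(records) -> timedelta:
    """Average time between an incident being opened and resolved."""
    seconds = [(resolved - opened).total_seconds() for opened, resolved in records]
    return timedelta(seconds=mean(seconds))


def count_slo_violations(records, slo: timedelta) -> int:
    """Count incidents whose resolution time exceeded the target.

    Assumption: the SLO is a per-incident maximum resolution time.
    """
    return sum(1 for opened, resolved in records if resolved - opened > slo)


mttr = mean_time_to_resolve(incidents)
violations = count_slo_violations(incidents, slo=timedelta(hours=3))
print("MTTR:", mttr)
print("SLO violations:", violations)
```

Every extra hand-off between teams lengthens the gap between the opened and resolved timestamps, so misrouted incidents raise the average and push more incidents past the target.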

From these observations, it is clear to me that SREs need a solution that helps them understand the problem better and recommends resolutions that they can trust and apply with confidence. To address some of the inefficiencies discussed above, we have worked on a solution for Explainability and Resolution Recommendation for Log-based Alerts, which will be discussed in an upcoming blog. Stay tuned!
