AI-powered Problem Remediation in IBM Cloud Pak for Watson AIOps 3.5

Published in IBM Cloud, Oct 3, 2022

Authors: Ruchi Mahindru, Meenakshi Madugula, Neil Boyette

IBM Cloud Pak for Watson AIOps helps an IT Operations (ITOps) team respond to, understand, and resolve incidents faster. It does this by reducing noise, prioritizing focus, providing guidance on resolution options, and eventually automating the resolutions.

Most client environments consist of a mix of homegrown and off-the-shelf components. It is challenging, however, to be an expert in every one of them, especially in operations where 24x7 coverage is required. These challenges are described in more detail in a previous post.

ITOps teams are looking for ready-to-use solutions that require no training and provide valuable context and insights on Day 0. In IBM Cloud Pak for Watson AIOps 3.5, we have addressed this requirement for several products within the middleware domain, such as IBM WebSphere Application Server, IBM WebSphere Liberty, and IBM MQ.

Using IBM’s deep expert knowledge, IBM Cloud Pak for Watson AIOps has introduced domain-specific automatic log anomaly detection, alert explanation, and resolution recommendation. Such ready-to-use models are crucial for faster detection and resolution of issues.

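This capability is delivered via a pipeline of five stages: Data Connector, Log Data Preparation, Log Anomaly Detection, Noise Reduction, and Anomaly Enhancement. Each stage is described below.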

Data Connector:

IBM Cloud Pak for Watson AIOps supports log ingestion from all the leading aggregators, including Mezmo (formerly LogDNA), CrowdStrike Falcon LogScale (formerly Humio), Splunk, and ELK, along with support for custom loggers. Once the Data Connector of your choice is connected to IBM Cloud Pak for Watson AIOps, logs are continuously processed and analyzed in near real time.

Log Data Preparation:

IBM products produce standardized metrics, logs, and traces. Each log contains a designated message ID, a log level, a type, and other metadata. This information helps identify the source product. During the Log Data Preparation stage, if the system detects that the log messages come from supported products such as WebSphere or MQ, then entities such as message IDs and log levels are automatically extracted. We use our prior expert knowledge of which message IDs and log levels indicate abnormal system behaviour. The extracted entities are then passed to the log anomaly detector module.
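As a rough illustration of the extraction step, the sketch below pulls a message ID and a log level out of a single log line. It is not the product's parser. The regular expression, the `LEVELS` mapping, the `extract_entities` helper, and the sample log line are all assumptions made for this example, based on the common WebSphere convention of a letter prefix, four digits, and a one-letter severity suffix.

```python
import re
from typing import Optional

# Assumed shape of a WebSphere/Liberty-style message ID: a 4-5 letter
# component prefix, 4 digits, and a one-letter severity suffix.
MESSAGE_ID = re.compile(r"\b([A-Z]{4,5})(\d{4})([A-Z])\b")

# Hypothetical mapping from the suffix letter to a log level.
LEVELS = {"I": "INFO", "A": "AUDIT", "W": "WARNING", "E": "ERROR"}


def extract_entities(line: str) -> Optional[dict]:
    """Return the first message ID and its log level found in a line, if any."""
    match = MESSAGE_ID.search(line)
    if match is None:
        return None
    prefix, number, suffix = match.groups()
    return {
        "message_id": f"{prefix}{number}{suffix}",
        "level": LEVELS.get(suffix, "UNKNOWN"),
    }


# Made-up log line, used only to exercise the function.
sample = "[10/3/22 9:15:02:123 UTC] 0000004a webapp E SRVE0255E: A WebGroup/Virtual Host has not been defined."
print(extract_entities(sample))
```

Lines that do not carry a recognizable message ID return `None`, so a caller could route them to a different, product-agnostic path.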

Log Anomaly Detection:

The extracted entities are then processed to build a product-specific statistical baseline model for log anomaly detection. The trained model can detect anomalies as soon as it is online, and it is automatically retrained every 30 minutes to relearn the baseline. The baseline model uses prior expert knowledge to automatically identify the erroneous entities among the extracted entities. The statistical baseline model is hands-off and fully automated, so it provides immediate value to the ITOps team. For more information on how these models work, read Predictions in 30 mins using new Cloud Pak for Watson AIOps.
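The post does not spell out the statistics behind the baseline, so the following is only a generic sketch of the idea: learn a mean and standard deviation from recent window counts, then flag a new window that deviates too far above them. The z-score rule, the `threshold` default, and the sample counts are assumptions for illustration, not the product's model.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[int], current: int, threshold: float = 3.0) -> bool:
    """Flag `current` if it is more than `threshold` std devs above the baseline mean."""
    baseline_mean = mean(history)
    baseline_std = pstdev(history)
    if baseline_std == 0:
        # Flat baseline: any increase over the constant level counts as anomalous.
        return current > baseline_mean
    return (current - baseline_mean) / baseline_std > threshold


# Hypothetical counts of error-level message IDs in past time windows.
error_counts = [2, 3, 1, 2, 4, 2, 3, 2]
print(is_anomalous(error_counts, 3))   # a count close to the baseline
print(is_anomalous(error_counts, 15))  # a count far above the baseline
```

In this picture, periodic retraining simply means recomputing the baseline from the most recent windows, so gradual shifts in normal behaviour are absorbed rather than flagged.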

Because multiple message IDs may be detected within a log anomaly window, IBM Cloud Pak for Watson AIOps uses a novel algorithm that takes historical context into account to identify the significant message ID. The log anomaly detector enriches the anomaly with the identified significant message ID, which is then used in the Noise Reduction module, described next.
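The algorithm itself is not described in this post. One simple way to picture "taking historical context into account" is to prefer the message ID that is most over-represented in the anomaly window compared with its historical rate. The sketch below does exactly that and nothing more. The `significant_message_id` helper, the smoothing scheme, and the sample counts are hypothetical, and the message IDs are used purely as sample strings.

```python
from collections import Counter


def significant_message_id(window_ids: list[str], historical_counts: dict[str, int]) -> str:
    """Pick the ID in a non-empty window that is most over-represented versus history."""
    window_counts = Counter(window_ids)
    total_history = sum(historical_counts.values())

    def surprise(message_id: str) -> float:
        # Add-one smoothing so IDs never seen before get a small, nonzero rate.
        historical_rate = (historical_counts.get(message_id, 0) + 1) / (total_history + 1)
        window_rate = window_counts[message_id] / len(window_ids)
        return window_rate / historical_rate

    return max(window_counts, key=surprise)


# Hypothetical historical counts and one anomaly window.
history = {"CWWKZ0001I": 900, "SRVE0255E": 5, "CWWKF0011I": 95}
window = ["CWWKZ0001I", "CWWKZ0001I", "SRVE0255E", "SRVE0255E", "SRVE0255E"]
print(significant_message_id(window, history))
```

An ID that is common in the window but also common historically scores low, while a historically rare ID that suddenly dominates the window scores high.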

Noise Reduction:

To avoid overwhelming the ITOps teams with a stream of events and anomalies, IBM Cloud Pak for Watson AIOps has several noise reduction techniques.

The detected anomaly is first deduplicated using the identified significant message ID. This allows persistent anomalies to be reduced to a single unique alert. The alerts are then grouped into stories using multiple temporal, topological, and scope-based algorithms. Alerts are temporally grouped when they occur within a short time of each other. Alerts are topologically grouped when they occur on resources within a predefined section of the network topology. Alerts are also grouped together when they occur within a configurable time window on an administrator-defined scope, such as a location, service, or resource. For example, if several anomalies all occurred on the same resource within N minutes of each other, where N is the configured time window, they are assumed to share a cause and are grouped together.
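The deduplication and scope-based grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's implementation: the `Alert` class, the `deduplicate` and `group_by_scope` helpers, and the sample events are hypothetical, the scope is taken to be the resource name, and the time window is measured from the first alert in each story.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    message_id: str
    resource: str
    first_seen: float  # seconds since an arbitrary start
    event_count: int = 1


def deduplicate(events: list[tuple[str, str, float]]) -> list[Alert]:
    """Collapse events sharing (message_id, resource) into one alert each."""
    alerts: dict[tuple[str, str], Alert] = {}
    for message_id, resource, timestamp in sorted(events, key=lambda e: e[2]):
        key = (message_id, resource)
        if key in alerts:
            alerts[key].event_count += 1
        else:
            alerts[key] = Alert(message_id, resource, timestamp)
    return list(alerts.values())


def group_by_scope(alerts: list[Alert], window_seconds: float) -> list[list[Alert]]:
    """Group alerts on the same resource that start within the window of the story's first alert."""
    stories: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: (a.resource, a.first_seen)):
        last = stories[-1] if stories else None
        if (last and last[0].resource == alert.resource
                and alert.first_seen - last[0].first_seen <= window_seconds):
            last.append(alert)
        else:
            stories.append([alert])
    return stories


# Made-up events: (message ID, resource, timestamp in seconds).
events = [
    ("SRVE0255E", "app-server-1", 0.0),
    ("SRVE0255E", "app-server-1", 30.0),
    ("CWWKE0701E", "app-server-1", 120.0),
    ("AMQ9999E", "mq-1", 200.0),
]
alerts = deduplicate(events)
stories = group_by_scope(alerts, window_seconds=300)
print(len(events), "events ->", len(alerts), "alerts ->", len(stories), "stories")
```

Temporal and topological grouping would follow the same pattern with a different grouping key, for example time proximity alone or membership in a topology section.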

The screenshot below reveals that 3 alerts were grouped together into 1 story. Each alert further comprised around 25 deduplicated events.

All these different techniques are combined to give a holistic view of the incident, including all the evidence (alerts), context and insights.

Anomaly Enhancement:

This stage consists of three sub-tasks: Explainability, Anomaly Distribution, and Resolution Recommendation.

Explainability:

Typically, system-generated log data is deeply technical, so the anomalies are enriched to help SREs gain a better understanding of the problem. Such explanations are extracted from a variety of data sources like

Anomaly Distribution:

The ITOps team needs to understand how log anomalies are distributed over a period of time. Therefore, the message IDs, along with their associated frequencies, are listed for user analysis.
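A frequency listing of this kind can be pictured with a few lines of Python. The `anomaly_distribution` helper and the sample message IDs are hypothetical and only show the shape of the output, not how the product computes or renders it.

```python
from collections import Counter


def anomaly_distribution(message_ids: list[str]) -> list[tuple[str, int]]:
    """Return (message ID, frequency) pairs, most frequent first."""
    return Counter(message_ids).most_common()


# Made-up message IDs observed across an anomaly period.
observed = ["SRVE0255E", "CWWKE0701E", "SRVE0255E", "SRVE0255E", "CWWKE0701E", "AMQ9999E"]
for message_id, frequency in anomaly_distribution(observed):
    print(f"{message_id}: {frequency}")
```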

Resolution Recommendation:

Recommendations must be on target to restore the system quickly. This component was built by bringing together expertise in Knowledge Engineering, Data Science, and AI with the deep know-how of the IBM Support team.

During the offline build process, a ready-to-use augmented knowledge base is built by tapping into IBM’s wealth of existing information, which is spread across siloed data sources such as historical case data, asset reuse manager (a taxonomy of problem categories and sub-categories maintained by IBM Support Engineers), and knowledge centre articles. The objective of using support data is to exploit the problem-solving knowledge of Subject Matter Experts (SMEs) embedded in it.

During runtime, this pre-built knowledge base is queried for explainability and resolution recommendation. A story is created from multiple alerts based on detected log anomalies. Each alert is created with the most significant message ID and augmented with explanations, along with the top three relevant recommended resolutions. This allows the ITOps team to follow the SMEs’ recommendations with trust and confidence when resolving the incident.
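The runtime lookup described above can be pictured as a query against a pre-built store that returns the three most relevant entries for an alert's significant message ID. The sketch below is only a stand-in: the `Resolution` class, the `KNOWLEDGE_BASE` dictionary, the relevance scores, and the placeholder titles are all hypothetical, and the real knowledge base is built from IBM Support data as described in the text.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    title: str
    relevance: float  # hypothetical score assigned during the offline build


# Hypothetical pre-built knowledge base keyed by message ID.
KNOWLEDGE_BASE: dict[str, list[Resolution]] = {
    "SRVE0255E": [
        Resolution("Placeholder resolution A", 0.62),
        Resolution("Placeholder resolution B", 0.91),
        Resolution("Placeholder resolution C", 0.40),
        Resolution("Placeholder resolution D", 0.78),
    ],
}


def recommend(message_id: str, top_k: int = 3) -> list[Resolution]:
    """Return the top_k most relevant resolutions for a message ID."""
    candidates = KNOWLEDGE_BASE.get(message_id, [])
    return sorted(candidates, key=lambda r: r.relevance, reverse=True)[:top_k]


for resolution in recommend("SRVE0255E"):
    print(f"{resolution.relevance:.2f}  {resolution.title}")
```

A message ID with no entry yields an empty list, which a caller could treat as "no recommendation available" instead of guessing.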

Conclusion:
Without this capability, the ITOps team's time-to-restore cycle is much longer, because no team can be expert in every domain and product. Additional problems can arise from the delayed, reactive process of problem diagnosis, manual searching of disaggregated knowledge bases, and potentially unreliable attempts at remediation.

IBM Cloud Pak for Watson AIOps can automatically surface log anomalies, clearly explain the problem, and identify resolution recommendations in real time as the problem manifests. Further, the resolution recommendations are valuable and reliable because they have been validated and consistently used by Subject Matter Experts on similar past anomalies. This leads to reduced Mean-Time-To-Restore (MTTR) and increased customer satisfaction.

See the detailed tutorial, AIOps explained: Out-of-the-Box (OOB) models.

Our sincere thanks to the Collaborators: Aishwarya Guda, Ashish Ghodasara, Andy Tu, Bob Gibson, Colin Butler, Don Bourne, Haibin Liu, Harshit Kumar, Kevin Ng, Michael McCurry, Miles Woollacott, Owen Jeffs, Pujitha Kara, Seville Mostafa, Srajan Dube, Xiaotong Liu

Written by Meenakshi Madugula