How IBM Cloud Pak for AIOps can help you to predict (detect, avoid, and resolve) incidents

Ian Manning
Published in IBM Cloud
4 min read · Aug 28, 2024

Recently a customer asked me the following question: “What is IBM Cloud Pak for AIOps’ stance on the predictability of an incident?” It is a question that I have been asked many times, so I thought that it would be a good idea to share my answer here.

First, I need to be open about my dislike of the term ‘predict’, which conjures up notions of magic rather than mathematics. In IBM Cloud Pak for AIOps, we use machine learning algorithms for anomaly detection, alert seasonality, and alert correlation, all of which can help to detect and avoid incidents. There is no crystal ball involved!

How IBM Cloud Pak for AIOps can help detect and avoid incidents with anomaly detection

IBM Cloud Pak for AIOps analyzes logs and metrics to learn normal patterns of behavior, and then raises an alert if those patterns vary significantly or something unexpected occurs.
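To make the idea of "learning normal and flagging deviations" concrete, here is a minimal sketch using a trailing-window z-score. This is my own illustration and not the algorithm that IBM Cloud Pak for AIOps uses; the function name `find_anomalies`, the window size, the threshold, and the sample latency values are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the trailing-window
    baseline by more than `threshold` standard deviations.
    (Illustrative only; window and threshold are assumed values.)"""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]   # the "normal" learned so far
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical latency samples in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 103, 100, 250, 101, 99]
print(find_anomalies(latency_ms))  # prints [7]
```

A real system would also need to handle trends, seasonality, and log data, which a simple z-score does not.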

However, anomalies don’t always relate to incidents. This is especially true if a problem is resolved before an incident develops. Rewarding teams who avoid incidents, rather than teams who resolve them, is fundamental to improving overall quality and stability.

It is important to note that an anomaly can also be the result of an expected change that never posed any risk of causing an incident. In this case, the anomaly is useful confirmation that the change took effect as intended, without any negative side effects.

Then there are unexpected changes. A digger cuts through a network cable, a power outage occurs, a configuration change has unintended consequences, there is a poor-quality release or patch, or someone hacks the system. All anomalies are worthy of investigation to assess whether action is needed to avoid an emerging incident before users or customers are impacted.

In IBM Cloud Pak for AIOps, an anomaly is raised as an alert and passed to our alert algorithms, where it is assessed alongside other incoming alerts.

In the following screenshot multiple co-occurring anomalies are grouped, which gives a clearer picture of the problem, its impact, and its cause: high latency caused by high memory usage and pod restarts.

Forecasting

IBM Cloud Pak for AIOps has a forecasting capability that projects time series values into the future. The forecast includes confidence intervals, so you can understand how likely it is that future values fall within a given range. In this way, IBM Cloud Pak for AIOps provides guidance on how quickly a change is occurring and whether action is needed now or can wait.
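As a rough illustration of what a forecast with a range can tell you, the sketch below fits a straight line to a metric and projects it forward. It is not the product's forecasting model. The function names, the sample disk-usage values, and the 90% limit are assumptions, and the band of ±1.96 residual standard deviations is only a crude stand-in for a proper confidence interval (it assumes independent, normally distributed residuals and does not widen with the forecast horizon).

```python
import math
from statistics import mean

def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...
    Returns (slope, intercept, residual_std). Needs at least 3 points."""
    n = len(values)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, values)]
    resid_std = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, resid_std

def forecast(values, steps_ahead, z=1.96):
    """Return (low, point, high) for the value `steps_ahead` samples
    after the last observation, using an approximate +/- z*std band."""
    slope, intercept, resid_std = fit_trend(values)
    x = len(values) - 1 + steps_ahead
    point = intercept + slope * x
    return point - z * resid_std, point, point + z * resid_std

def steps_until(values, limit):
    """Samples remaining until the trend line reaches `limit`,
    or None if the trend is flat or falling."""
    slope, intercept, _ = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))

# Hypothetical daily disk usage in percent.
disk_used_pct = [50, 53, 53, 57, 58]
print(forecast(disk_used_pct, steps_ahead=3))
print(steps_until(disk_used_pct, limit=90))
```

The second call is the "act now or wait" question in numeric form: how many samples are left before the trend reaches a limit you care about.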

Using alert occurrence patterns to predict future occurrences

IBM Cloud Pak for AIOps has an Alert seasonality algorithm that learns when alerts typically occur: the time of day, day of the week, day of the month, and so on. By looking at the learned policies, you can understand which alerts are likely to occur on a particular day, and then either fix the underlying cause or suppress the alert so that nobody has to act on it when it recurs. In the following screenshot, a weekly backup alert is created every Friday. Maybe it is time to suppress this alert so that it doesn’t appear on future Fridays?
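The weekly case can be sketched in a few lines: count on which weekday an alert has fired and report a pattern only when the history is long enough and concentrated enough. This is a simplified illustration, not the product's seasonality algorithm; `weekly_pattern`, its thresholds, and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def weekly_pattern(timestamps, min_occurrences=4, min_share=0.9):
    """Return the weekday name if at least `min_share` of the alert's
    occurrences fall on the same weekday, otherwise None.
    (Thresholds are assumed values for illustration.)"""
    if len(timestamps) < min_occurrences:
        return None  # not enough history to call it a pattern
    counts = Counter(ts.weekday() for ts in timestamps)
    day_index, hits = counts.most_common(1)[0]
    if hits / len(timestamps) >= min_share:
        return DAYS[day_index]
    return None

# Hypothetical backup alert that fired at 02:00 on four consecutive Fridays.
first = datetime(2024, 8, 2, 2, 0)
backup_alerts = [first + timedelta(weeks=i) for i in range(4)]
print(weekly_pattern(backup_alerts))  # prints Friday
```

The same counting approach extends to hour of day or day of month by changing the key passed to `Counter`.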

Using alert groups to predict alerts that are likely to occur

The Temporal grouping algorithm learns which alerts tend to co-occur. If one of the alerts in a group signals a service-impacting incident, then the other alerts are likely to be good predictors of that incident. The alert grouping policy also indicates how much time typically passes between alerts in the group, and therefore how much time remains until the incident is likely to occur. You can use this information to get ahead of recurring problems.

In the following screenshot, high error rates are a good predictor of synthetic test failures. The user interface shows the history of these alerts occurring together over time, and the amount of time that typically elapses between them.
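To show how co-occurrence history turns into a usable lead time, the sketch below pairs each incident alert with the most recent predictor alert that preceded it within a window, then reports how often that happened and the typical gap. It is an illustration under my own assumptions and not the Temporal grouping algorithm itself; the function names, the one-hour window, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

def lead_times(predictor_times, incident_times, max_gap=timedelta(hours=1)):
    """For each incident, the gap in minutes back to the latest
    predictor alert seen within `max_gap` before it (if any)."""
    gaps = []
    for incident in incident_times:
        earlier = [p for p in predictor_times
                   if timedelta(0) <= incident - p <= max_gap]
        if earlier:
            gaps.append((incident - max(earlier)).total_seconds() / 60)
    return gaps

def summarize(predictor_times, incident_times, max_gap=timedelta(hours=1)):
    """Return (share of incidents preceded by the predictor,
    median lead time in minutes or None)."""
    gaps = lead_times(predictor_times, incident_times, max_gap)
    coverage = len(gaps) / len(incident_times) if incident_times else 0.0
    typical = median(gaps) if gaps else None
    return coverage, typical

# Hypothetical history: error-rate alerts and synthetic test failures.
day = datetime(2024, 8, 1)
error_rate_alerts = [day.replace(hour=h) for h in (10, 14, 18)]
synthetic_failures = [
    day.replace(hour=10, minute=10),
    day.replace(hour=14, minute=20),
    day.replace(hour=18, minute=15),
    day.replace(hour=22),            # no error-rate alert beforehand
]
print(summarize(error_rate_alerts, synthetic_failures))  # prints (0.75, 15.0)
```

In this made-up history, three of the four failures were preceded by an error-rate alert, typically about 15 minutes earlier, which is the window you would have to act.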

In summary
With IBM Cloud Pak for AIOps, you get intelligent analytics and machine learning across all of your data to help detect, avoid, and resolve incidents. It is the next best thing to a crystal ball!
