IBM Cloud Pak for AIOps tips and tricks: Leverage your existing root cause analysis from Netcool (NOI) and other event sources

Jack Buggins
IBM Cloud Pak for AIOps
Jul 24, 2024

In this short post I want to share an example of how you can leverage root cause information configured for alerts sourced from NOI or elsewhere. The aim is to help you make the most of your past investments while exploring the valuable new capabilities available to you in the IBM Cloud Pak for AIOps offering.

The Cloud Pak is packed full of novel zero-configuration, out-of-the-box root cause analysis capabilities, which you can tune as required, so you do not necessarily need to make any customisations to get the benefits of probable cause. For more detail on these, check out this fantastic article written by my colleague Jonathon Settle, which outlines the 5 methods for probable cause that are available as standard.

While there are a number of new methods available, when you are using the Cloud Pak for AIOps, you can be sure that you can continue to leverage your existing tribal and situational incident knowledge just as you always have. Why fix what isn’t broken?

Today I’ll guide you through one small step you can take to do this. Let’s get into it with a bit of background about a data point associated with alerts that has long been referred to as CauseWeight…

Injecting tribal and domain knowledge for probable root cause in NOI

If you’re a seasoned IBM Netcool Operations Insight user, administrator or architect, feel free to skip to the next section, but carry on if you are interested in the background of this capability.

Since Netcool Operations Insight 1.3, a field named CauseWeight has existed as an optional add-on capability that allows you to manually define the relative cause likelihood of an alert.

The purpose of this capability is to help operators or SREs determine the probable root cause alert while they are diagnosing and resolving incidents. When the Netcool system automatically correlates your alerts, the CauseWeight can also be used to automate key incident workflow steps.

What’s the value proposition of root-cause based ranking?

This helps operators slice through noise quickly and guides them to the most important alerts in an incident. It can even be used to order automations that run in sequence, starting from the most likely cause alarm, helping you resolve incidents quickly.

Ironically, it is sometimes an alarm with lower severity that most needs your attention in order to restore system functionality. On other occasions, we cannot simply rely on the timing of alerts, since monitoring systems can be delayed in sending alarms to the management system after the events have taken place.

CauseWeight based causal ranking helps us avoid these potential pitfalls, which translates directly into a reduction in time-to-resolve.
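
To make the idea concrete, here is a minimal JSONata sketch of ranking by cause weight rather than by severity. It is purely illustrative and not how the Cloud Pak computes its ranking: the $alerts sample data is hypothetical, the weights are held as strings under details (matching the mapping shown later in this post), and I am assuming that a higher CauseWeight means a more likely cause.

```jsonata
(
  /* Hypothetical sample alerts, invented for illustration only. */
  $alerts := [
    { "summary": "Web checkout latency high", "severity": 6, "details": { "CauseWeight": "10" } },
    { "summary": "Database disk 95% full",    "severity": 4, "details": { "CauseWeight": "90" } },
    { "summary": "Payment API returning 5xx", "severity": 5, "details": { "CauseWeight": "40" } }
  ];

  /* Order by the numeric weight, highest first, and keep only the summaries.
     Assumption: a higher CauseWeight means a more likely root cause. */
  $alerts^(>$number(details.CauseWeight)).summary
)
```

With this sample data the disk alert comes first even though it has the lowest severity of the three, which is exactly the situation described above.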

You can bring this context along with your alerts and leverage it directly from the alert and incident viewers in the Cloud Pak for AIOps.

How is this information leveraged to assist operators in the IBM Cloud Pak for AIOps incident view?

Where cause weight values are present on alerts, you will see a clear ranking indicator showing how the alerts have been ranked by their CauseWeight values.

Even if you do not have CauseWeight based ranking, the Cloud Pak will automatically provide ranking details for you when you configure word-based probable cause ranking, or have a topology application/group associated with the alerts.

CauseWeight based ranking in the alerts view of an incident

Since the ranking can be formed from many factors, including the probable root cause capabilities that are shipped as standard in the IBM Cloud Pak for AIOps, we do not display the raw CauseWeight value in the ranking itself. However, the raw CauseWeight value can be seen in the details of the alert.

Raw details of an Alert displaying CauseWeight

The top 3 alerts are also shown in the incident overview page, alongside any topology that has been associated with the impacted resources.

CauseWeight based ranking in the incident overview page

How do I configure propagation to the CauseWeight field for the Netcool connector?

When you connect a Netcool ObjectServer, the integration’s configuration menu lets you fully customise how your alert fields are mapped into the Cloud Pak.

The CauseWeight field must be populated in the details object and cast to a string value to comply with the alert schema. You can do the same with any raw payloads you ingest. The last entry in the configuration example below shows exactly how you need to structure this data:

(
  /* Helper: true when the value looks like a dotted-quad IPv4 address */
  $isIPAddr := function($i){ $contains($i, /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) };
  {
    "summary": alert.@Summary,
    "deduplicationKey": alert.@Identifier,
    "sender": {
      "service": alert.@Agent ? alert.@Agent : undefined,
      "name": alert.@Manager ? alert.@Manager : undefined
    },
    "resource": {
      "name": alert.@Node = "" ? alert.@NodeAlias = "" ? "Node" : alert.@NodeAlias : alert.@Node,
      "location": alert.@Location = "" ? undefined : alert.@Location,
      "ipAddress": $isIPAddr(alert.@NodeAlias) ? alert.@NodeAlias : $isIPAddr(alert.@Node) ? alert.@Node : undefined,
      "hostname": $not($isIPAddr(alert.@Node)) ? alert.@Node : undefined,
      "sourceId": alert.@BSM_Identity = "" ? undefined : alert.@BSM_Identity,
      "service": alert.@Service = "" ? undefined : alert.@Service,
      "port": alert.@PhysicalPort = 0 ? undefined : alert.@PhysicalPort,
      "physicalslot": alert.@PhysicalSlot = 0 ? undefined : alert.@PhysicalSlot,
      "physicalcard": alert.@PhysicalCard = "" ? undefined : alert.@PhysicalCard,
      "scopeId": alert.@ScopeID = "" ? undefined : alert.@ScopeID
    },
    "type": {
      "eventType": alert.@Type = 1 ? "problem" : (alert.@Type = 2 ? "resolution" : (alert.@Type = 13 ? "information" : (alert.@Type = 0 ? "problem"))),
      "classification": alert.@EventId = "" ? alert.@AlertGroup : alert.@EventId
    },
    "eventCount": alert.@Tally,
    "signature": alert.@Identifier,
    "firstOccurrenceTime": alert.@FirstOccurrence,
    "lastOccurrenceTime": alert.@LastOccurrence,
    "severity": alert.@Severity <= 0 ? undefined : alert.@Severity = 1 ? 1 : alert.@Severity < 6 ? alert.@Severity + 1 : alert.@Severity >= 6 ? 6,
    "state": alert.@Severity = 0 ? "clear" : "open",
    "acknowledged": alert.@Acknowledged = 1 ? true : false,
    "expirySeconds": alert.@ExpireTime = 0 ? undefined : alert.@ExpireTime,
    /* CauseWeight goes into details and must be cast to a string */
    "details": { "CauseWeight": $string(alert.@CauseWeight) }
  }
)

Note that all extra fields you choose to add to the details section of an alert must be encoded as strings!

How do I configure propagation of numeric root-cause likelihood indicators for a custom connection?

We simply need to configure the mapping to set the details.CauseWeight field. If the source field could be empty, I would advise adding handling for the undefined case using the expressive JSONata syntax. In the Cloud Pak for AIOps alert schema, the details.CauseWeight field must be encoded as a string, so you may need to cast the value when the incoming payload encodes it as a number.

Here is an example of a simple configuration for a Generic Webhook integration containing this mapping:

(
  {
    "severity": risk.severity = "MINOR" ? 3 : risk.severity = "MAJOR" ? 4 : risk.severity = "CRITICAL" ? 6 : risk.severity = "UNKNOWN" ? 1 : 2,
    "summary": details,
    "resource": {
      "name": target.displayName,
      "sourceId": target.uuid
    },
    "type": {
      "classification": target.className,
      "eventType": actionState = "CLEARED" ? "resolution" : actionState = "SUCCEEDED" ? "resolution" : "problem",
      "condition": risk.subCategory
    },
    "sender": {
      "name": "Back Up and Restore Operator Monitor",
      "type": "Webhook Connector"
    },
    "details": { "CauseWeight": $string(insights.numericScoreIndicator) }
  }
)
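
The mapping above assumes the score is always present in the payload. If it might be missing, null or empty, a guarded variant could look something like the sketch below. The $score and $hasScore variable names are my own, and I am assuming the connector accepts an alert whose details object is omitted, so verify that against your environment before relying on it.

```jsonata
(
  /* Hypothetical guard: only emit details.CauseWeight when the source value is usable. */
  $score := insights.numericScoreIndicator;
  $hasScore := $exists($score) and $score != null and $score != "";
  {
    /* Merge this entry into the full mapping shown above; the other fields are unchanged. */
    "details": $hasScore ? { "CauseWeight": $string($score) } : undefined
  }
)
```

This follows the same pattern the Netcool mapping uses elsewhere: returning undefined leaves the key out of the resulting alert, rather than casting a null or empty value into a string that carries no ranking information.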

What else can I do with probable cause?

As I shared in the introduction, my colleague Jonathon Settle, a master inventor at IBM, has written a fantastic blog entry on this exact topic. Within this blog post you can find an overview of the methods of probable cause available, along with the full technical details on how to configure word-based probable cause if you are seeking an approach for deep customisation within your new management platform.

Wrap up

Thanks for taking the time to check out this article. I hope this helps you along your journey in streamlining your operations environment! As usual, if you have any questions or comments about this post, or the Cloud Pak for AIOps in general, don’t be shy. I’d love to hear your feedback and suggestions for what you would like to learn more about in future posts. Please also see the AIOps focused IBM Community for more how-tos, best practices, and use cases.
