Run Book Automation with Watson AIOps: a tutorial.

Julius Wahidin
IBM Cloud Pak for AIOps
Mar 22, 2021

SREs (Site Reliability Engineers) rely on automation to perform their tasks, so they need to manage a large amount of automation efficiently. IBM Watson AIOps has a component that can help: RunBook Automation (RBA).

RBA is useful because it can activate runbooks based on an event. A runbook step can be a Linux/Windows script called through SSH, a web service call through HTTP(S), a BigFix call, or an Ansible Tower automation. RBA also provides life cycle management for the runbook itself. It allows the user to rate a runbook and provide feedback after it has run, giving the developer suggestions for improving it.

Before the internet was common, IT operations used a runbook in the form of a physical book that recorded the sequence of actions an operator needed to perform to resolve an incident. The operator ran the commands as described in the book, hence the term “run” book. RBA supports this type of runbook digitally. If preferred, RBA can provide a user interface that guides the user through the manual actions in sequence. We will focus on an automatic runbook that watches for certain events and then starts executing the operational steps.

Historically, RBA started as an IBM Cloud SaaS offering. It was later made available in another offering, CEM (Cloud Event Management). About two years ago, most CEM features were merged into NOI (Netcool Operations Insight). At the time of this writing (the first quarter of 2021), RBA, along with NOI, is part of Watson AIOps Event Manager.

The user interaction and screenshots in this blog are taken from a deployment of Watson AIOps version 2.1 with its Event Manager (NOI 1.6.3), the latest release at the time of writing. We will use the term event to mean alert/alarm/event; it refers to a row in the OMNIbus event database that comes with Watson AIOps. You can also perform the RBA configuration through its REST interface instead of the GUI.

Automation, RunBook, Library, and Trigger.

First, let us look at four RBA terms: Automation, RunBook, Library, and Trigger.

An automation is a unit of programmatic instructions in RBA. An automation can be a script run through an SSH session on a remote system, an HTTP(S) API call, a BigFix call, or an Ansible Tower Job or Job Workflow call. A runbook is formed by combining one or more automations. Once you have defined your runbook, you publish it to a collection called the library. You can activate the runbook manually from the library or through an event-based policy, which is called a trigger.

Thus, to define runbook automation in RBA, you first create the automation and then the runbook. You can then run the runbook manually from the library or automatically by defining a trigger.

Let us go through a simple example: the hello world from RBA.

The script

To start with, let us create a simple bash script that prints “hello world”. We placed the script on a Linux server:

$ cat rba/helloworld.sh 
#!/bin/bash
echo "Hello World."

The automation

Using the user interface, we create the automation [UI Menu sequence: Event Manager > Automations > RunBooks > Automations] and name it “helloworld”. It is a script-type automation with one entry: rba/helloworld.sh. The UI looks as follows:

An example of a script automation: a bash script located at rba/helloworld.sh

There are two default parameters associated with the automation, the user and host. You can enter the default value for your automation here, but we will specify this later, so we can leave the default as is.

The runbook

After the automation is created, we can use it by defining a runbook [UI Menu sequence: Event Manager > Automations > RunBooks > Library > New Runbook]. In this first example, the runbook has only one entry, the helloworld automation that we just defined. Remember the two parameters from the automation earlier? We now set a specific user and host, as shown in the following screenshot:

Configuration screen to specify the target host and user ID to run the runbook

Authentication and Authorisation.

To execute the command, RBA needs to establish an SSH session with the target host. To allow RBA to log in without a password, we need to copy the RBA public key into the authorized keys file of the user on the target host. We can get the RBA public key from the Watson AIOps Event Manager user interface [UI Menu sequence: Administration > Integration with other systems > automation type > script > edit], as shown in the following image.

SSH public key of the script integration of RBA.

So that is what we do: we append the public key to ~/.ssh/authorized_keys of that user on the target host.
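As a minimal sketch of that step, the following bash function appends a public key to an authorized keys file only if it is not already there, and sets the restrictive permissions that OpenSSH normally expects. The function name install_key and the optional directory argument are our own illustration, not something RBA provides.

```shell
#!/bin/bash
# Append a public key to authorized_keys, skipping it if already present.
# Usage: install_key "<public key line>" [ssh_dir]
# install_key is a hypothetical helper name used only for this sketch.
install_key() {
  local pubkey="$1"
  local ssh_dir="${2:-$HOME/.ssh}"
  local auth_file="$ssh_dir/authorized_keys"

  # sshd usually refuses keys when the directory or file is too permissive.
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$auth_file"
  chmod 600 "$auth_file"

  # -F: fixed string, -x: whole-line match, -q: quiet.
  if ! grep -qxF -- "$pubkey" "$auth_file"; then
    printf '%s\n' "$pubkey" >> "$auth_file"
  fi
}
```

On the target host, logged in as the user that the runbook will use, you would paste the key copied from the UI as the first argument, for example install_key "ssh-rsa AAAA... rba". Running it twice leaves a single entry in the file.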

Manual test

Now we can test the runbook. Click Run next to the runbook name.

Click the “Run” actions to test the runbook from the Library

And lo and behold. RBA says, “Hello world”.

Output of the hello world runbook.

Once we have rated and completed the runbook, the runbook details in the library get updated.

Library of runbooks

Trigger

To be more useful, we want to run the automation when a certain event condition occurs. Three simulated SNMP events have been injected into the environment to support the use case. The events are shown in the following Event List.

Injected events for the use case

One of the events has the following message “This is a device test alerts. Device Disk Threshold Exceeded”. As a scenario, we want to run the runbook when this event happens. Let us define a trigger to execute the helloworld runbook when RBA finds an event with “Device Disk Threshold Exceeded” as part of the Summary field.

The following shows screenshots of the helloworld policy [UI Menu sequence: Event Manager > Automations > RunBooks > Triggers > Create new trigger].

Runbook trigger conditions configuration screen. Note the use of pattern matching on the Event’s summary field for this trigger.

Note the trigger’s condition on the event’s Summary field. It looks for events whose Summary field matches a string pattern: any text ending with “Device Disk Threshold Exceeded”.
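To make the "anything ending with" semantics concrete, here is a small bash sketch of the same kind of match. It only illustrates the idea; it is not the syntax RBA uses for its trigger conditions, and the second summary text ("Device Link Down") is a made-up example.

```shell
#!/bin/bash
# Succeed when the summary ends with the suffix we are watching for.
summary_matches() {
  local summary="$1"
  local suffix="Device Disk Threshold Exceeded"
  # The leading * matches any prefix; the quoted suffix is matched literally.
  [[ "$summary" == *"$suffix" ]]
}

# The summary from our injected event matches.
summary_matches "This is a device test alerts. Device Disk Threshold Exceeded" \
  && echo "match"

# A hypothetical summary for a different event does not.
summary_matches "Device Link Down" || echo "no match"
```

Because the pattern is anchored at the end, a summary that merely contains the text somewhere in the middle would not match.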

Once defined, you can test the pattern matching by clicking the test button. The figure above shows that one event matches the string pattern. Next, we need to select the runbook, and we selected the helloworld runbook. As we want to watch the execution, rather than running it automatically, we toggled the execution switch to manual.

After we saved and activated the trigger, RBA searched for matching events and attached the runbook to them. From the event viewer, we can see that the event that matched the trigger’s condition now has a runbook associated with it. Events that have an associated runbook show a dot in the Runbook column of the Event List. Clicking on the event and expanding the runbook, we can see from the following screenshot that the correct helloworld runbook was assigned.

The dot on the runbook column shows one of the events has a runbook associated with it.

To recap, we have created a runbook that is built on an automation and is attached to matching events by a trigger but, for the time being, is started manually.

After testing the automation behavior and being happy with it, we can toggle the setting to automatic. The runbook will then be executed automatically upon matching events.

Automation with parameters

An automation should be reusable for a similar process on different resources. When you define an automation to restart a server, you want the same automation to work on similar servers: server_A, server_B, or server_C. This reusability is achieved through parameters. So let us work through an example, starting with the following script.

$ cat rba/parameter.sh 
#!/bin/bash
echo "Program name: ${0}"
echo "Parameter 1: ${1}"
echo "Parameter 2: ${2}"

The script prints the first two parameters to the standard output. As before, we will automate the execution of this script. We will define an automation, a runbook, and the trigger.
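Before wiring the script into RBA, it can help to see the parameter passing locally. The sketch below recreates the script in a scratch directory and calls it once per server, which is the reuse that parameters are meant to give us. The helper name run_for_servers and the alert group value "Disk Alerts" are our own assumptions for illustration.

```shell
#!/bin/bash
# Recreate the tutorial's parameter.sh in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/parameter.sh" <<'EOF'
#!/bin/bash
echo "Program name: ${0}"
echo "Parameter 1: ${1}"
echo "Parameter 2: ${2}"
EOF

# Call the same script once per server; only the parameters change.
# run_for_servers is a hypothetical helper, not part of RBA.
run_for_servers() {
  local alert_group="$1"; shift
  local server
  for server in "$@"; do
    # Quoting keeps a value with spaces as a single positional parameter.
    bash "$workdir/parameter.sh" "$server" "$alert_group"
  done
}

run_for_servers "Disk Alerts" server_A server_B server_C
```

Note the quoting around each argument: without it, a value such as "Disk Alerts" would be split into two parameters and the script would print only "Disk" as Parameter 2.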

Here is the screenshot of the definition of the automation:

Runbook definition showing the script and its parameters.

In the automation definition, we have four parameters. Target and user are built-in parameters; $Node and $AlertGroup are parameters that we defined. A user-defined parameter is preceded by a $ sign.

A runbook can have multiple automations with varying numbers of parameters, so the number of parameters of a runbook can differ from the number of parameters of any one of its automations. For this simple example, we will define a runbook with one automation, so we keep the same number of parameters and the same names.

Once we have published and tested the runbook, we can assign it to a trigger. The following shows part of the definition of the trigger. We assigned the trigger conditions to match all three events.

Trigger condition test result showing the three events matched.

During the automation parameter definition, we can choose to assign a fixed value, ask the user, or substitute the value of one of the event’s fields. Furthermore, if we use the event’s value, we can use the Netcool/Impact function rextract to take only part of that value. For this example, we will use the whole field value. Again, we toggled the execution to manual so that we can start the runbook ourselves and observe the results.
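The idea of taking only part of a field value with a regular expression can be sketched in bash as follows. This is an analogy only: it does not reproduce the syntax or exact semantics of the Netcool/Impact function, and the Node value "fw01.example.com" and the helper name extract_part are hypothetical.

```shell
#!/bin/bash
# Print the first capture group of a regex applied to a field value.
# Returns non-zero when the value does not match.
extract_part() {
  local value="$1" regex="$2"
  if [[ "$value" =~ $regex ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Hypothetical Node value: keep only the host part before the first dot.
extract_part "fw01.example.com" '^([^.]+)\.'
```

Passing only the extracted part to the script keeps the script itself free of event-specific parsing.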

The runbook automation parameters.

After we have saved the trigger, we can see that all three events now have a runbook attached to them. Selecting one of the events and running the runbook produced the expected results showing the parameters are passed correctly.

All three events have a runbook associated with them assigned by the trigger.
The runbook successfully called the scripts, passing two parameters that originated from the fields in the event.

Extending the scenario.

We have created a runbook with parameters, passed the parameters to a script on a remote host, and run it through SSH. Using the same procedure, we can perform network automation, such as healing a Software Defined Network (SDN) component when we receive a certain event. Here is an example scenario. We have a Fortinet firewall, a VM image running on OpenStack, provisioned using the orchestration software TNC-O (Telco Network Cloud - Orchestration).

The following diagram shows one sequence of actions involving the firewall, Watson AIOps, and TNC-O.

Fortinet firewall process flows, illustrating the use of runbook automation.

When the firewall encounters issues (such as disk full), it sends SNMP traps to the Event Manager, which triggers the runbook. The runbook performs an automatic heal, using TNC-O, the Orchestration Software that builds the firewall.

The TNC-O API heal function requires two parameters. The first is the device name, which is contained in the SNMP alert. The second is the service name, which is obtained by querying the results of ASM (Agile Service Manager) discovery. ASM is another component of Watson AIOps.

Triggered by the “disk full” alerts received from the Fortinet firewall, the Event Manager (Watson AIOps) automatically executes the heal function, which performs the following:

  1. SSH to a workstation in the same network domain as the OCP (OpenShift Container Platform) cluster that hosts TNC-O, and then run oc login.
  2. Extract the TNC-O username and password from a secret in the OCP cluster.
  3. Call the TNC-O REST interface, passing the username and password from step 2, to get a bearer token.
  4. Use the bearer token and the two parameters from the runbook (service name and component name) to perform the orchestration heal function.
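
Steps 2 to 4 can be sketched as a bash script along the following lines. This is not the script used in our environment. Every name in it is a hypothetical placeholder: the namespace, the secret name and its username/password keys, both URLs, the basic-auth token request, the "access_token" response field, and the JSON field names in the heal request are assumptions, not the real TNC-O API. It also assumes that oc login has already been done (step 1).

```shell
#!/bin/bash
# Sketch of heal steps 2-4. All names below are hypothetical placeholders,
# not the real TNC-O API; override them through environment variables.
NAMESPACE="${NAMESPACE:-tnco}"                           # placeholder namespace
SECRET_NAME="${SECRET_NAME:-tnco-credentials}"           # placeholder secret
TOKEN_URL="${TOKEN_URL:-https://tnco.example.com/token}" # placeholder URL
HEAL_URL="${HEAL_URL:-https://tnco.example.com/heal}"    # placeholder URL

# Step 2: read one base64-encoded field from the secret.
get_secret_field() {
  oc get secret "$SECRET_NAME" -n "$NAMESPACE" -o "jsonpath={.data.$1}" | base64 -d
}

# Step 3 helper: pull the token out of a JSON response read from stdin.
# Assumes the response carries an "access_token" field.
extract_token() {
  sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Step 4 helper: build the request body from the two runbook parameters.
# The field names are placeholders; values are not JSON-escaped.
build_heal_payload() {
  printf '{"serviceName":"%s","componentName":"%s"}' "$1" "$2"
}

# Run steps 2-4 in order. Usage: heal <service name> <component name>
heal() {
  local service="$1" component="$2" user pass token
  user=$(get_secret_field username) || return 1
  pass=$(get_secret_field password) || return 1
  token=$(curl -sS -u "$user:$pass" -X POST "$TOKEN_URL" | extract_token)
  [ -n "$token" ] || { echo "no token returned" >&2; return 1; }
  curl -sS -X POST "$HEAL_URL" \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" \
    -d "$(build_heal_payload "$service" "$component")"
}
```

In a runbook, the two arguments to heal would come from the runbook parameters, in the same way that $Node and $AlertGroup were passed to parameter.sh earlier.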

Here is the output result of the runbook. In the background, TNC-O heals the firewall.

The Fortinet SNMP traps in the Event List. Note the dot in the Topology and Runbook column, denoting a topology view is available on this device, and there is a runbook associated with it.
Runbook output showing the token and successful execution.

As this is the first tutorial on RBA, we do not go into the runbook and automation definitions for this scenario in detail. The aim here is to show that the same runbook definition process that we have discussed can be used to drive a more complex scenario.

Summary

We have just begun to explore the RBA capability to build a runbook with parameters. RBA can support more automation interfaces and more complex business rules. Watson AIOps can help SREs with their automation journey.

Julius Wahidin is a member of the IBM Watson AIOps Elite team. The team’s goal is to help design and implement Watson AIOps. All stories and comments are my own.