Auto Remediating Kubernetes alerts: Epiphani’s “on-premise” solution

Somnath Mani
Published in epiphani
Sep 25, 2020 · 6 min read

Background

In a previous Medium blog post, I covered how you can use the Epiphani playbook engine to automate the remediation of Kubernetes alerts.

At a high level, it involves a few simple steps:

  • Choose from a rich set of connectors to create playbooks that automate a sequence of remediation actions, which could include, for example, getting diagnostic information from the Kubernetes cluster, redeploying pods, sending Slack messages, creating Jira tickets, triggering PagerDuty alerts, and so on.
  • Use the connectors to create playbooks: first, a set of playbooks that serve as handlers for one or more alert types, and second, a parent playbook that intelligently routes alerts to their respective alert handler playbooks.
  • Configure alert rules in Prometheus.
  • Set up a webhook configuration in Alertmanager to forward the alerts to Epiphani’s playbook engine service (a minimal example is sketched below).

… and that’s it! You are in business. As alerts arrive from Alertmanager, the appropriate handler playbook is triggered, closing out the alerting loop in Kubernetes.
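
For reference, the Alertmanager side of the last step is just a standard webhook receiver. A minimal sketch might look like the following; the URL is a placeholder, not Epiphani’s actual endpoint.

# alertmanager.yml — minimal sketch; the webhook URL below is a placeholder,
# not Epiphani’s actual playbook engine endpoint
route:
  receiver: epiphani-playbook-engine
receivers:
  - name: epiphani-playbook-engine
    webhook_configs:
      - url: "https://<your-epiphani-endpoint>/alerts"
        send_resolved: true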

So what’s new?

Users now have the capability to take charge of the uptime of their Kubernetes infrastructure with our on-premise solution. To begin with, the solution can be deployed in the user’s AWS VPC. Users will soon have the option of deploying it in a location of their own choice, be it an AWS VPC, Google Cloud, a private datacenter, or even their own laptops for that matter! You can find details regarding how to install it here.

In addition, for GUI-wary developers, you can now create and execute Epiphani playbooks from the comfort of your terminal using the “Ecube” CLI. Here is a link to the Git repository.

Alert Remediation Playbooks

Let’s jump into some details regarding how all of this comes together by looking at sample playbooks created via the CLI. Before we dive into the nitty-gritty, here are a few housekeeping commands to assist with playbook creation and execution using the CLI.

List Available Playbooks:

List Available Connectors:

Above is just a truncated list of available connectors.

To list the details of a particular connector, including its parameters, config and output context path, simply add --name “<connector-name>” at the end of the command, like so:

e3 playbook connectors --name SplunkPy

In addition to the above, one can:

Create playbooks:

e3 playbook create --directory <file path to playbook yaml>

Example:

e3 playbook create --directory samples/playbooks/oneNode

List playbooks:

e3 playbook show

Execute playbooks:

e3 playbook run --name getEC2

Playbooks are defined in YAML. The Ecube CLI interprets the playbook YAML file and instantiates the corresponding playbook in the backend.

Examining our example from the previous Medium post, we define two playbooks: first, the top-level “Alert Router” playbook, which is responsible for routing the alerts to the correct alert handler playbook, and second, the alert handler playbook itself.

Alert Router playbook:

For convenience, I am also displaying how the playbook created via the CLI is rendered in the Epiphani playbook engine GUI.

The playbooks are broken down into sections:

  • Arguments: These specify the arguments to the overall playbook
  • Plays: In this section, we spec out the actual nodes (connectors); this contains the heart of the playbook logic. As can be seen above, we have defined five nodes, including the “start” and “end” nodes.
  • Links: This section contains the topology of how the connectors are linked.
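
To make that structure concrete, here is a minimal, hypothetical skeleton of a playbook YAML. The section names follow the description above, but the exact schema is an assumption on my part, not the authoritative Ecube format.

# Hypothetical playbook skeleton — field names assumed, not the official Ecube schema
arguments:          # inputs to the overall playbook
  - alertPayload
plays:              # the nodes (connectors) and the playbook logic
  start: {}
  # ... connector nodes go here ...
  end: {}
links:              # topology: how the nodes are wired together
  - from: start
    to: end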

Plays:

Digging into how plays are specified:

  • connector: We specify the connector to use here. In our case, it is the ‘Alerts — Parser’ connector.
  • action: Each connector can support multiple actions. We specify the ‘epiphani-parse-prometheus-alerts’ action.
  • arguments: We specify connector-level arguments here. We have specified a variable argument using Jinja styling. In our case, this variable is retrieved from the actual alert message that triggered this playbook run.
  • rules: We use rules to specify match criteria and corresponding actions. This is how we decide which alert handler playbook to invoke based on the alert name/type. In our case, we specify that if the alert name is “PodHighCpuLoad” (match), then take the “green” execution branch (action). One of the downstream nodes/connectors in the “green” execution path is the “Restart Pod Playbook” node, which is the handler for this alert. We also store “alert name”, “alert message” and “alert pod” in variables to be used later.
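
Putting those pieces together, a play for the parser node might look roughly like this. This is a hedged sketch: the field names mirror the description above, and the values are illustrative assumptions rather than the exact playbook from the post.

plays:
  AlertParser:
    connector: "Alerts — Parser"
    action: epiphani-parse-prometheus-alerts
    arguments:
      # variable pulled from the incoming alert payload, Jinja style
      alertBody: "{{ alert.message }}"
    rules:
      - match:
          alertName: PodHighCpuLoad
        action: green                 # take the "green" execution branch
        store:                        # assumed mechanism for saving variables for later plays
          alert_name: "{{ alertName }}"
          alert_message: "{{ alertMessage }}"
          alert_pod: "{{ alertPod }}"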

Links:

  • Here we specify how the nodes are linked to each other.
  • We use “fromPort” to identify links when there are multiple links exiting out of the same node. The “green” link in our example specifies the link that leads to the downstream nodes for handling “PodHighCpuLoad” (see rules above).
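
Continuing the same hypothetical sketch (field names assumed, not the official schema), the links section could look like:

links:
  - from: start
    to: AlertParser
  - from: AlertParser
    fromPort: green               # the branch selected by the rule above
    to: RestartPodPlaybook
  - from: RestartPodPlaybook
    to: end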

Alert Handler Playbook:

  • This playbook handles the actual remediation logic
  • As an example, we use the “ssh” connector to log in to a host from which we can execute kubectl commands.
  • The login credentials are stored in “secure stash” and are retrieved in the backend at execution time
  • The corresponding GUI rendering of the playbook is also shown
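
To give a rough idea of what the remediation step might contain, here is a hedged sketch of an ssh-based restart play. The connector is the “ssh” connector mentioned above, but the action name, argument names and kubectl command are my own illustrative assumptions.

plays:
  RestartNginxPod:
    connector: ssh
    action: run-command             # assumed action name
    arguments:
      host: "{{ kubectl_host }}"    # credentials are resolved from the secure stash at execution time
      # deleting the pod lets its Deployment/ReplicaSet re-create it
      command: "kubectl delete pod {{ alert_pod }} -n default"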

Remediation in action

  • We defined a simple alert rule in Prometheus to trigger an alert when the CPU usage of the nginx container breached a certain threshold (a hypothetical example of such a rule is sketched after this list).
  • The playbook engine page shows the playbooks being triggered in response to an alert
  • You can choose to follow along the live execution of the playbook in the playbook engine web page
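
For reference, a Prometheus alerting rule along these lines might look like the following; the expression, threshold and labels are illustrative assumptions, not the exact rule used in the demo.

groups:
  - name: epiphani-demo
    rules:
      - alert: PodHighCpuLoad
        # cAdvisor CPU usage of the nginx container; 0.8 cores is an arbitrary threshold
        # (the container label name can vary with the Kubernetes/cAdvisor version)
        expr: sum by (pod) (rate(container_cpu_usage_seconds_total{container="nginx"}[5m])) > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on pod {{ $labels.pod }}"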

Result

The playbook execution results are available in the playbook engine page. You can also retrieve the results via CLI by executing:

e3 playbook results --name AlertRouterCLI17

  • The result output is shown here. For example, the output of the “Get-Pod_list” node in the restart handler playbook has been expanded for illustration. It lists the pods after the original nginx pod was re-deployed to remediate the alert.
  • The resulting “enriched” PagerDuty page is shown here. At the end of our remediation playbook, we send a page to the “on-call” resource.

In Conclusion

You now have the capability to download and install the Epiphani playbook engine locally. You can also create and execute playbooks and capture their results using the CLI.

Hopefully this blog post shows how easy and, more importantly, how powerful the playbook engine can be in closing the Kubernetes alerting loop by auto-remediating Kubernetes alerts.

Feedback

Your feedback is extremely valuable in improving the playbook engine. Kindly provide feedback: feedback@epiphani.ai
