Automate Kubernetes Alerts Response using Epiphani Playbook Engine

Somnath Mani
Published in epiphani
Aug 31, 2020 · 5 min read

Introduction

Operationalizing Kubernetes is hard, and you can use all the help you can get. You want to monitor your Kubernetes system closely, trigger alerts based on defined rules, and finally automate the response to those alerts.

Alerting Framework

Prometheus provides the means to monitor Kubernetes and trigger alerts based on defined rules. Alertmanager, in turn, consumes these alerts from Prometheus and provides the capability to group, inhibit, and silence them, as well as route them to configured receivers.

Automated Response

One can configure Slack, PagerDuty, etc. as the receivers for these alerts, in which case the alert notification is delivered as a message to the recipients. But what you truly want is the ability to automate the response to these alerts. Instead of manual intervention, you ideally want automation to auto-remediate where possible. At a minimum, you want to enrich the message with diagnostic information related to the alert before sending it out to the on-call users, so that much of the legwork required to collect this information is performed upfront and the user can focus on what’s important: resolving the issue expeditiously.

Closing out the Kubernetes monitoring loop using Epiphani Playbooks

Configuring Epiphani as the receiver for alerts in Alertmanager enables automated alert response. You can now easily auto-remediate alerts, essentially closing the monitoring loop.

Let’s walk through an example that illustrates how this can be achieved.

Kubernetes Setup:

  • We have deployed Kubernetes on Minikube
  • An nginx application serves as the workload
  • Prometheus and Alertmanager are deployed in the monitoring namespace (a command-line sketch of this setup follows the list)
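For reference, here is one way to reproduce a comparable setup. This is a minimal sketch only: the kube-prometheus-stack Helm chart and the release name kube-prom are assumptions, not the exact installation used in this article.

    # Start a local cluster and create a monitoring namespace
    minikube start
    kubectl create namespace monitoring

    # Install Prometheus and Alertmanager via the community kube-prometheus-stack chart (assumption)
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install kube-prom prometheus-community/kube-prometheus-stack -n monitoring

    # Deploy the nginx workload; kubectl create deployment labels the pods app=nginx
    kubectl create deployment nginx --image=nginx
    kubectl expose deployment nginx --port=80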

Please refer to the following link for more information regarding installing Prometheus and Alertmanager and setting up rules and alert configuration.

Prometheus Rule:

A simple rule has been configured to generate an alert when nginx CPU usage breaches a defined threshold.
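As a sketch, such a rule can be expressed as a PrometheusRule resource; the group name, label selector, and threshold below are illustrative, not the exact values used here.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: nginx-cpu-rules
      namespace: monitoring
    spec:
      groups:
      - name: nginx.rules
        rules:
        - alert: PodHighCpuLoad
          # Fire when an nginx pod sustains more than 0.5 CPU cores over 5 minutes (illustrative threshold)
          expr: sum(rate(container_cpu_usage_seconds_total{pod=~"nginx-.*"}[5m])) by (pod) > 0.5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High CPU load on pod {{ $labels.pod }}"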

Alertmanager Configuration:

Using the webhook configuration, we specify the Epiphani webhook endpoint. We are in the process of integrating the Epiphani configuration natively into Alertmanager.
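A minimal sketch of the corresponding Alertmanager configuration is shown below; the webhook URL is a placeholder for the endpoint provided by Epiphani, and the grouping and timing values are illustrative.

    # alertmanager.yml (sketch)
    route:
      receiver: epiphani
      group_by: ['alertname']
      group_wait: 30s
      repeat_interval: 4h
    receivers:
    - name: epiphani
      webhook_configs:
      # Placeholder URL; substitute the endpoint provided with your Epiphani account
      - url: 'https://<your-epiphani-webhook-endpoint>'
        send_resolved: true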

Epiphani Alert Routing:

  • Create an alert routing playbook that routes incoming alerts from Alertmanager to a corresponding handler playbook
  • A playbook consists of connectors, each performing a specific action. Connectors can be 3rd-party connectors such as Slack, AWS EC2, PagerDuty, etc., or they can be in-house scripts. Epiphani provides a rich set of connectors.
  • In our example, a PodHighCpuLoad alert causes execution to follow the red path out of the Alert Parser node (the first node), whereas a ServiceDown alert takes the green path. Blue is the default path. This is achieved by configuring match/action rules inside the nodes, keyed off fields of the incoming alert (see the sample payload after this list).
  • Add an alert handler playbook as a node in the alert routing playbook. In our alert routing playbook shown above, the “Restart Pod Playbook” and “Service Down Playbook” nodes invoke the respective handler playbooks for the PodHighCpuLoad and ServiceDown alerts. We are using the “nested playbooks” capability here to invoke a child playbook from within a parent playbook.
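For context, the Alert Parser works against Alertmanager’s standard webhook payload. An abbreviated example is shown below (the pod name is a placeholder); match rules such as the ones above key off fields like alerts[].labels.alertname.

    {
      "version": "4",
      "status": "firing",
      "receiver": "epiphani",
      "groupLabels": { "alertname": "PodHighCpuLoad" },
      "alerts": [
        {
          "status": "firing",
          "labels": { "alertname": "PodHighCpuLoad", "pod": "nginx-<pod-id>", "severity": "warning" },
          "annotations": { "summary": "High CPU load on pod nginx-<pod-id>" },
          "startsAt": "2020-08-31T12:00:00Z"
        }
      ]
    }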

Alert Handler Playbook:

  • Create an alert handler playbook to auto-remediate a specific alert, PodHighCpuLoad in our example. These playbooks are triggered by the alert routing playbook, as explained in the previous step.
  • For this example we have used shell scripts that invoke kubectl commands, along with the Slack and PagerDuty connectors.
  • As an auto-remediation example, we check whether the pod exists in the cluster. If it does, we follow the green execution path: send a Slack message, restart the pod, and finally send an “enriched” page to the on-call person with the current state of the cluster (see the script sketch after this list).
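The kubectl portion of that remediation can be sketched as a small shell script. The argument handling below is an assumption (in the playbook, the pod name and namespace come from the parsed alert labels), and the Slack and PagerDuty steps are handled by their connectors rather than by this script.

    #!/bin/sh
    # Sketch of the remediation logic behind the handler playbook's shell-script connectors.
    POD="${1:?pod name required}"     # would come from the alert's pod label
    NAMESPACE="${2:-default}"

    # Green path: the pod still exists in the cluster
    if kubectl get pod "$POD" -n "$NAMESPACE" >/dev/null 2>&1; then
      # "Restart" the pod by deleting it; the Deployment's ReplicaSet re-spawns it
      kubectl delete pod "$POD" -n "$NAMESPACE"
      # Collect the current pod state to enrich the PagerDuty page
      kubectl get pods -n "$NAMESPACE" -o wide
    else
      echo "Pod $POD not found in namespace $NAMESPACE" >&2
    fi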

At this point the setup to handle alerts for our example use case, PodHighCpuLoad, is ready. As simple as that!

Let’s follow along with the execution…

We log in to the nginx container and run an inline script to cause a CPU spike, which triggers the alert.
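One way to do this is sketched below; it assumes the Deployment-created app=nginx label and that the container image ships the timeout utility (true for the stock Debian-based nginx image).

    # Pick the nginx pod and burn CPU inside it for ~2 minutes
    POD=$(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}')
    # timeout exits non-zero when it kills the busy loop; that is expected here
    kubectl exec "$POD" -- timeout 120 sh -c 'while :; do :; done'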

This, in turn, triggers execution of the alert routing playbook.

You can follow the playbook execution live. The color of the triangle in the top-left corner of each connector indicates that connector’s execution state: blue indicates the action completed successfully, and orange indicates it is still in process. As expected, execution follows the red path and triggers the alert handler playbook for PodHighCpuLoad.

The alert handler playbook execution proceeds along the green path as expected, triggering the Slack message and the pod restart (i.e., deletion of the pod; Kubernetes re-spawns it). Finally, an enriched page is sent to the user.

Slack messages sent as part of the playbook execution

“Enriched” PagerDuty notification with information regarding the incident, the corrective action taken, and the current state of the pods.

As you can see, in a few simple steps, you can create a robust, customized and automated response to Kubernetes alerts.

I truly hope you found this useful. If you would like to know more about creating playbooks, please visit https://epiphani.ai

If you are interested in alert response automation, you can sign up for the alpha trial at https://epiphani.ai/#signup

Finally, we would love to hear your thoughts on this article. Kindly provide feedback at feedback@epiphani.ai
