Automate Kubernetes Alert Response Using the Epiphani Playbook Engine

Somnath Mani
Aug 31 · 5 min read

Introduction

Operationalizing Kubernetes is hard, and you can use all the help you can get. You want to monitor your Kubernetes system closely, trigger alerts based on defined rules and, finally, automate the response to those alerts.

Alerting Framework

Prometheus provides the means to monitor Kubernetes and trigger alerts based on defined rules. Alertmanager consumes these alerts from Prometheus; it can group, inhibit and silence them, and route them to configured receivers.

Automated Response

You can configure Slack, PagerDuty, etc. as receivers for these alerts, in which case the alert notification is delivered as a message to the recipients. What you truly want, though, is to automate the response. Rather than relying on manual intervention, you ideally want automation to remediate the alerts where possible. At a minimum, you want to enrich the message with diagnostic information related to the alert before sending it to the on-call users, so that much of the legwork of collecting this information is done upfront and they can focus on what matters: resolving the issue quickly.

Closing out the Kubernetes monitoring loop using Epiphani Playbooks

Configuring Epiphani as the alert receiver in Alertmanager lets you automate the response to alerts. You can now auto-remediate alerts with little effort, closing the monitoring loop.

Let’s walk through an example that illustrates how this can be achieved.


Kubernetes Setup:

  • Kubernetes deployed on Minikube
  • An nginx application
  • Prometheus and Alertmanager deployed in the monitoring namespace

Please refer to the following link for more information on installing Prometheus and Alertmanager and on setting up rules and alert configuration.


Prometheus Rule:

We configured a simple rule that generates an alert when nginx CPU usage breaches a certain threshold.
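The rule itself appears only as a screenshot in the original post. As a sketch of what such a rule could look like, the Python below builds a standard Prometheus rule file (a list of groups, each with a list of rules) and prints it as JSON, which YAML parsers accept. The PromQL expression, the 0.5-core threshold, the default namespace, the nginx.* pod pattern and the 1m "for" duration are all assumptions rather than the author's values; container_cpu_usage_seconds_total is the cAdvisor metric usually scraped from the kubelet. If Prometheus was installed through the Prometheus Operator, the same groups list would instead sit under spec.groups of a PrometheusRule resource.

```python
import json

# Assumed values: the article's actual rule is only shown in a screenshot.
CPU_THRESHOLD_CORES = 0.5   # hypothetical threshold, in CPU cores
NAMESPACE = "default"       # hypothetical namespace of the nginx pods


def build_rule_file(threshold: float = CPU_THRESHOLD_CORES,
                    namespace: str = NAMESPACE) -> dict:
    """Return a Prometheus rule file (groups -> rules) as a plain dict."""
    # Per-pod CPU usage in cores, averaged over 1m, compared to the threshold
    expr = (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}", pod=~"nginx.*", container!=""}}[1m]))'
        f" > {threshold}"
    )
    return {
        "groups": [
            {
                "name": "nginx.rules",
                "rules": [
                    {
                        "alert": "PodHighCpuLoad",
                        "expr": expr,
                        "for": "1m",  # condition must hold this long before firing
                        "labels": {"severity": "warning"},
                        "annotations": {
                            # Prometheus template syntax, kept literal here
                            "summary": "High CPU usage on pod {{ $labels.pod }}",
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # JSON output can be saved as a rules file referenced by rule_files
    print(json.dumps(build_rule_file(), indent=2))
```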


Alertmanager Configuration:

Using Alertmanager's webhook configuration, we specify the Epiphani webhook endpoint as the receiver. We are in the process of integrating Epiphani configuration natively into Alertmanager.
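The article's Alertmanager configuration is also a screenshot. A minimal sketch of the relevant part, assuming a single catch-all route, is below: the receiver uses Alertmanager's standard webhook_configs block, and the URL is a placeholder, not a real Epiphani endpoint. The grouping intervals are illustrative. As before, the dict is printed as JSON, which can be saved as alertmanager.yml.

```python
import json

# Placeholder: substitute the webhook URL that Epiphani provides.
EPIPHANI_WEBHOOK_URL = "https://example.invalid/epiphani/alertmanager-webhook"


def build_alertmanager_config(webhook_url: str = EPIPHANI_WEBHOOK_URL) -> dict:
    """Alertmanager config that sends every alert to one webhook receiver."""
    return {
        "route": {
            "receiver": "epiphani",       # catch-all route (assumption)
            "group_by": ["alertname"],
            "group_wait": "30s",          # illustrative timings
            "group_interval": "5m",
            "repeat_interval": "3h",
        },
        "receivers": [
            {
                "name": "epiphani",
                "webhook_configs": [
                    # send_resolved also notifies the endpoint when alerts clear
                    {"url": webhook_url, "send_resolved": True},
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_alertmanager_config(), indent=2))
```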


Epiphani Alert Routing:

  • Create an alert routing playbook that routes incoming alerts from Alertmanager to the corresponding handler playbook.
  • A playbook consists of connectors, each performing a specific action. Connectors can be third-party integrations such as Slack, AWS EC2 or PagerDuty, or in-house scripts. Epiphani provides a rich set of connectors.
  • In our example, a PodHighCpuLoad alert causes execution to follow the red path out of the Alert Parser node (the first node), whereas a ServiceDown alert takes the green path. Blue is the default path. This is achieved by configuring match/action rules inside the nodes.
  • Add each alert handler playbook as a node in the alert routing playbook. In the routing playbook shown above, the “Restart Pod Playbook” and “Service Down Playbook” nodes invoke the respective handler playbooks for the PodHighCpuLoad and ServiceDown alerts. Here we use the “nested playbooks” capability to invoke a child playbook from within a parent playbook.
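Epiphani expresses this routing graphically, with match/action rules in the nodes. Purely to illustrate the logic (this is plain Python, not Epiphani's API), the sketch below parses the JSON body that Alertmanager POSTs to a webhook receiver, which carries an alerts list whose entries have a status and a labels map including alertname, and dispatches each firing alert by name. The handler functions are hypothetical stand-ins for the nested playbooks, and the fallback handler plays the role of the blue default path.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


# Hypothetical stand-ins for the nested handler playbooks
def restart_pod_playbook(alert: dict) -> str:
    return f"restart-pod:{alert.get('labels', {}).get('pod', '?')}"


def service_down_playbook(alert: dict) -> str:
    return f"service-down:{alert.get('labels', {}).get('service', '?')}"


def default_playbook(alert: dict) -> str:
    return f"notify-only:{alert.get('labels', {}).get('alertname', '?')}"


# Match/action table: alertname -> handler; anything else takes the default path
ROUTES: Dict[str, Handler] = {
    "PodHighCpuLoad": restart_pod_playbook,
    "ServiceDown": service_down_playbook,
}


def route_alerts(webhook_body: str) -> List[str]:
    """Dispatch every firing alert in an Alertmanager webhook body."""
    payload = json.loads(webhook_body)
    results = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # this sketch ignores resolved notifications
        name = alert.get("labels", {}).get("alertname", "")
        handler = ROUTES.get(name, default_playbook)
        results.append(handler(alert))
    return results


if __name__ == "__main__":
    sample = json.dumps({
        "version": "4",
        "status": "firing",
        "receiver": "epiphani",
        "alerts": [
            {"status": "firing", "labels": {"alertname": "PodHighCpuLoad", "pod": "nginx-abc"}},
            {"status": "firing", "labels": {"alertname": "DiskFull"}},
        ],
    })
    print(route_alerts(sample))  # ['restart-pod:nginx-abc', 'notify-only:DiskFull']
```

The PodHighCpuLoad alert reaches its dedicated handler, while an alert with no matching rule (the made-up DiskFull here) falls through to the default.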

Alert Handler Playbook:

  • Create an alert handler playbook to auto-remediate a specific alert, PodHighCpuLoad in our example. These playbooks are triggered by the alert routing playbook, as explained in the previous step.
  • For this example we used shell scripts that invoke kubectl commands, along with the Slack and PagerDuty connectors.
  • As an auto-remediation example, we check whether the pod exists in the cluster. If it does, we follow the green execution path: send a Slack message, restart the pod and finally send an “enriched” page to the on-call person with the current state of the cluster.
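The handler's decision logic can also be sketched in Python. This illustrates the steps described, not the author's scripts: it calls real kubectl subcommands (get pod, delete pod, get pods -o wide), while notify_slack and page_oncall are hypothetical callables standing in for the Slack and PagerDuty connectors. The command runner is injectable so the flow can be exercised without a cluster. The article doesn't describe what happens when the pod is missing; here we assume it simply pages and stops.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit_code, stdout).
Runner = Callable[[List[str]], Tuple[int, str]]


def kubectl_runner(argv: List[str]) -> Tuple[int, str]:
    """Run a real command; needs kubectl on PATH and a working kubeconfig."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def handle_pod_high_cpu(
    pod: str,
    namespace: str,
    notify_slack: Callable[[str], None],   # hypothetical Slack connector
    page_oncall: Callable[[str], None],    # hypothetical PagerDuty connector
    run: Runner = kubectl_runner,
) -> str:
    # Step 1: check whether the pod still exists
    code, _ = run(["kubectl", "get", "pod", pod, "-n", namespace])
    if code != 0:
        # Assumed behaviour for the non-green path: page and take no action
        page_oncall(f"PodHighCpuLoad: pod {namespace}/{pod} not found; no action taken")
        return "not-found"
    # Step 2: tell the team what is about to happen
    notify_slack(f"PodHighCpuLoad on {namespace}/{pod}: restarting pod")
    # Step 3: "restart" by deleting; the owning ReplicaSet creates a replacement
    run(["kubectl", "delete", "pod", pod, "-n", namespace])
    # Step 4: enrich the page with the current state of the pods
    _, pods = run(["kubectl", "get", "pods", "-n", namespace, "-o", "wide"])
    page_oncall(
        f"PodHighCpuLoad on {namespace}/{pod}: pod deleted for restart.\n"
        f"Current pods:\n{pods}"
    )
    return "restarted"
```

Injecting the runner keeps the decision logic separate from the side effects, which is roughly what a playbook does by wiring connectors together.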

At this point the setup is ready to handle alerts for our example use case, PodHighCpuLoad. As simple as that!

Let’s follow along the execution…


We log in to the nginx container and run an inline script that spikes the CPU, which triggers the alert.

This in turn triggers execution of the alert routing playbook.
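The article doesn't show the inline script used to spike the CPU; a shell busy loop such as while true; do :; done is a common choice. Keeping to Python for these examples, an equivalent bounded loop is below. It assumes a Python interpreter is available in the container, which the stock nginx image does not ship, so treat it as illustrative. One spinning loop drives roughly one core, and the run should last longer than whatever "for" duration the alert rule uses; for example, burn_cpu(300) spins for five minutes.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin one core at close to 100% for the given time; returns the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pure busy work, no sleeping
    return iterations
```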


You can follow the live execution of the playbook. The color of the triangle in the top-left corner of each connector indicates its execution state: blue means the action completed successfully and orange means it is in progress. As expected, execution follows the red path and triggers the alert handler playbook for PodHighCpuLoad.


Alert handler playbook execution proceeds along the green path as expected, sending the Slack message and restarting the pod (i.e., deleting it, after which Kubernetes re-creates it). Finally, an enriched page is sent to the on-call user.


Slack messages sent as part of the playbook execution


“Enriched” PagerDuty notification with information about the incident, the corrective action taken and the current state of the pods.
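The post doesn't say how the enriched page is assembled inside the connector. If you were to send it directly through the PagerDuty Events API v2 instead, a trigger event could carry the enrichment in payload.custom_details, as sketched below. The routing key is a placeholder for an integration key, and the dedup_key scheme and field contents are assumptions. build_enriched_page only builds the event; send_event POSTs it with urllib.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# PagerDuty Events API v2 endpoint
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_enriched_page(
    routing_key: str,
    pod: str,
    namespace: str,
    action_taken: str,
    pods_snapshot: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build a trigger event whose custom_details carry the diagnostic context."""
    now = now or datetime.now(timezone.utc)
    return {
        "routing_key": routing_key,          # placeholder integration key
        "event_action": "trigger",
        # Hypothetical dedup scheme: one open incident per pod
        "dedup_key": f"PodHighCpuLoad/{namespace}/{pod}",
        "payload": {
            "summary": f"PodHighCpuLoad on {namespace}/{pod}: {action_taken}",
            "source": f"{namespace}/{pod}",
            "severity": "warning",           # one of critical/error/warning/info
            "timestamp": now.isoformat(),
            "custom_details": {
                "alert": "PodHighCpuLoad",
                "action_taken": action_taken,
                "current_pods": pods_snapshot,
            },
        },
    }


def send_event(event: dict) -> int:
    """POST the event and return the HTTP status (PagerDuty replies 202 on success)."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Putting the pod listing and the action taken into custom_details means the responder sees them on the incident itself instead of having to collect them by hand.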

As you can see, in a few simple steps, you can create a robust, customized and automated response to Kubernetes alerts.

I truly hope you found this useful. If you would like to know more about creating playbooks, please visit https://epiphani.ai

If you are interested in alert response automation, you can sign up for the alpha trial at https://epiphani.ai/#signup

Finally, we would love to hear your thoughts on this article. Kindly provide feedback at feedback@epiphani.ai

epiphani

Automate, share and empower
