Constant Vigilance: A Step-by-Step Guide to Alerts with Helm and Prometheus

Sophia Sanchez
Curai Health Tech
Aug 14, 2019

Here at Curai, we are working on cutting-edge medical technology, such as Machine Learning (ML) models for medical diagnosis. However, ML models are only as good as they are reliable (see Martin Zinkevich’s “Best Practices for ML Engineering” and this paper on ML product readiness for more on this topic). We needed a robust alerting system to monitor and notify us of any issues within our ecosystem, running the gamut from anomalous CPU usage to high latencies for model return times.

In the Curai ecosystem and its user-facing application, First Opinion, users can interact with ML features as well as speak with healthcare professionals. We have frontend applications for users and health coaches, and servers to enable chat messages (fig. 1). There are also API servers to handle the logic around, for example, the machine learning features integrated into the product. We also rely on a Postgres database and a Redis cache, and we use Docker to containerize our services and Kubernetes to orchestrate those containers. For a more complete description of the system components, please see the Curai blog post on porting a legacy application. All of these components need to be reliable to ensure the best possible experience for our users. Alerting allows us to keep a pulse on our engineering ecosystem and react quickly to any issues, minimizing downtime and enabling high-quality interactions with the platform.

Fig. 1. High-level components of the Curai/First Opinion engineering ecosystem. For a more complete description of the system components, please see the Curai blog post on porting a legacy application. Credit to Vignesh Venkataraman for the original illustration.

It became clear early on that our initial alerting system, a series of auto-generated emails with stack traces, was not sufficient. First, there was no scheme for prioritization, escalation, or targeted alerts, so critical information was often buried in a sea of notifications. Second, writing time series and other metrics with our previous solution, Stackdriver, was constrained by quotas and rate limits, and by a lack of flexibility in converting our desired metrics into targeted alerts. Lastly, and most importantly, code-based error messages cannot alert you to the health of your pods (the sets of containers with shared resources and network, and rules about how to run them), CPU usage, or other system-wide metrics, all of which are critical for reliability. Clearly, we needed a better solution. Enter Helm and Prometheus.

Fig. 2. A spike in CPU usage worthy of a notification.

Helm and Prometheus

Helm is a package manager for Kubernetes. It helps you easily install, update, and upgrade your applications. Prometheus is an open-source system for monitoring and alerting. It comes with a lot of cool features out of the box, including a UI for viewing sets of metrics and running queries. You can write time series metrics to Prometheus, allowing you to keep track of information like end-to-end model latencies, and trigger alerts that are sent to email, Slack, PagerDuty, or other tools.

It is not a perfect solution, however. The initial setup is fairly involved, any sophisticated time series analytics requires plugging Prometheus into a system like Grafana, and it does not eliminate the need for outside logging. Even so, we found that the benefits of a performant, seamless, and open-source solution with a solid community decidedly outweighed the cons.

In the end, we decided to install Prometheus into our clusters using Helm to provide a straightforward system for monitoring and alerting throughout our engineering ecosystem. The MVP was setting up Helm and Prometheus (including its alertmanager, complete with a rule set) and sending alert notifications to Slack. While the documentation for the individual components was thorough, setting up the complete pipeline from Helm, to Prometheus, to Slack required some additional work. Below is a step-by-step outline of that process, and an explanation of how the individual pieces fit together to establish an integrated alerting system.

Step 1: Setting up Helm and Prometheus

First, you’ll need to set up Helm. The exact steps depend on your setup and environment. If you’re like us and have a macOS system with Homebrew and kubectl, the Kubernetes command-line tool, setting up Helm should be straightforward. To begin, fetch the cluster credentials. Then, check that you are in the appropriate context with:

kubectl config current-context
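If your clusters run on Google Kubernetes Engine, which we assume here given our Google Cloud setup, fetching credentials and inspecting contexts might look something like the sketch below, with hypothetical cluster, zone, and context names:

# Fetch credentials for a hypothetical GKE cluster
gcloud container clusters get-credentials my-cluster --zone us-central1-a
# List the contexts kubectl knows about and switch to the one you want
kubectl config get-contexts
kubectl config use-context gke_my-project_us-central1-a_my-cluster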

If you’re not, go ahead and switch contexts as sketched above. Once in the right context, run:

brew install kubernetes-helm

If you run into issues on install, the Helm documentation on setup can get you where you need to go. If you get error messages from the brew install, you may need to do some flavor of:

helm init --upgrade
helm repo update

Note that helm init modifies your cluster by installing the helm server, Tiller, on it.

Next, you’ll want to set up Prometheus. Helm allows you to install multiple instances of the same software by giving each release a unique name, so it’s important to pick a distinctive and descriptive one; we’ll call ours “alerting-release.” To install Prometheus to the current context:

helm install --name alerting-release stable/prometheus

You can sanity check that the alerting-release Prometheus instance was installed with a simple helm list, and also verify that the alertmanager is running with:

kubectl get pods | grep alert

If you mess up at any point, you can always start fresh with:

helm delete --purge alerting-release

At this point, you should be good to go with the helm setup.

Step 2: Setting up Slack Webhook

Setting up a Slack-based alert system requires webhooks. The instructions from Slack are relatively straightforward. Essentially, you want to “Install” webhooks and then “Add Configuration.” You’ll want to point this at your new channel. For the sake of this example, we’ll call the channel #alerting (which you need to create in your Slack workspace as well). At the end of this setup, you should have a functional webhook URL.

To test that the webhook works, you can send a simple cURL as follows:

curl -X POST --data-urlencode "payload={\"channel\": \"#MYCHANNELNAME\", \"username\": \"sanitybot\", \"text\": \"Just a sanity check that slack webhook is working.\", \"icon_emoji\": \":ghost:\"}" MY_WEBHOOK_URL

Step 3: Setting up Alertmanager and Rules

Now that you have alerting-release set up, you need to specify some instructions. The next step is to create a values.yaml file that specifies 1) what the alert rules are, 2) what the Prometheus targets are (i.e., the definition of what to scrape and how) along with any jobs for Prometheus, and 3) where alerts should be routed (in this case, Slack).

It is worth noting that the alertmanager allows for many other useful integrations, such as email and PagerDuty. For the sake of brevity and clarity, I will focus on Slack, but I would highly recommend perusing the Alertmanager configuration documentation to get a sense of your options. This leaves us with three configuration tasks (alert rules, Prometheus targets, and alert routing) to trigger an alert and send a Slack message.
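For orientation, here is a rough skeleton of where those three pieces live in values.yaml, assuming the layout of the stable/prometheus chart; the empty mappings are placeholders, and filled-in sketches follow in the next sections:

serverFiles:
  alerts: {}              # 1) alert rules
  prometheus.yml: {}      # 2) scrape targets and jobs (leave untouched to keep the chart defaults)
alertmanagerFiles:
  alertmanager.yml: {}    # 3) alert routing, e.g. to the Slack webhook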

Let’s take these one at a time.

Alert Rules

The alert rules can test anything Prometheus has metrics for. For example, kube_pod_container_status_waiting_reason can tell you a lot about the health of your pods. As a first pass, it’s usually a good idea to alert on whether an instance is down; the expression for that is up == 0. To that end, I’ve set up an alert called InstanceDown, which is labeled as critical.
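The exact YAML shape depends on your chart version; here is a minimal sketch of that rule, assuming a version of the stable/prometheus chart that accepts Prometheus 2.x rule groups under serverFiles.alerts:

serverFiles:
  alerts:
    groups:
      - name: instance-health
        rules:
          - alert: InstanceDown
            # Fires when any scraped instance has been unreachable for a minute
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} has been unreachable for more than 1 minute."

The severity: critical label is what the routing configuration below keys off of.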

Prometheus Targets

Next, you will set up the Prometheus targets and specify the jobs. Fortunately, Kube scraping is already set up out of the box, so for our purposes, no additional action is required for this step.

There are a number of default jobs and specs that come with Prometheus. I defer to the chart’s wonderfully verbose default values.yaml file, and I would highly recommend poking around it to better understand the default configuration. Take a look specifically at everything with “job_name”. Please see the official Prometheus configuration docs for more on adding additional targets.
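For reference, a plain Prometheus scrape job looks like the sketch below; the job name and target are hypothetical, and exactly where it slots into the chart’s values (for example, an extraScrapeConfigs-style key or a prometheus.yml override) depends on your chart version:

scrape_configs:
  - job_name: my-api-server              # hypothetical job name
    metrics_path: /metrics               # the default path, shown for clarity
    scrape_interval: 30s
    static_configs:
      - targets: ['my-api-service:8080'] # hypothetical Kubernetes service and port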

Alert Routing

The last step is to tell the alertmanager where to send alerts. In this case, we want to send them to our Slack #alerting channel. At the same level as “serverFiles”, you’ll want to add in the alertmanager config. This part of the file says: for triggered alerts that are marked as “critical”, send them to the #alerting channel with the specified text description. Here, it is set to repeat the alert every minute, and also to send a notification once the issue has been resolved. There are many default values specified in the Prometheus configuration that are not necessarily obvious at first glance. For example, unless you specify otherwise, a notification for a still-firing alert will only be re-sent every four hours.
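As a concrete illustration, the routing described above might look something like the sketch below; the webhook URL is a placeholder, the receiver names are made up, and the key layout assumes the stable/prometheus chart:

alertmanagerFiles:
  alertmanager.yml:
    global:
      slack_api_url: MY_WEBHOOK_URL      # the webhook URL from Step 2
    route:
      receiver: default-receiver         # catch-all for anything not matched below
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 1m                # re-send unresolved alerts every minute
      routes:
        - receiver: slack-alerting
          match:
            severity: critical           # only critical alerts go to Slack
    receivers:
      - name: default-receiver           # no configs, so it notifies no one
      - name: slack-alerting
        slack_configs:
          - channel: '#alerting'
            send_resolved: true          # also notify when the issue resolves
            title: '{{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'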

There is a lot of extra work you can accomplish with the alertmanager specifications, such as grouping sets of alerts and sending different levels of alerts to different places. As a first pass, simply checking for the instance being down is a good start.

Once your values.yaml file is prepared, you’re ready to upgrade.

helm upgrade -f values.yaml alerting-release stable/prometheus

For the sake of the tutorial, the install and upgrade have been broken up into separate steps. Once you’ve got the hang of helm, you can combine the helm install with the upgrade command like so:

helm install -f values.yaml --name alerting-release stable/prometheus

Step 4: Sanity check everything.

To troubleshoot, you’ll want a thorough sanity check of all the pieces of your new system. First, check that the ConfigMaps look like what you specified in the values.yaml file:

kubectl describe configmap alerting-release-prometheus-alertmanager
kubectl describe configmap alerting-release-prometheus-server

You’ll also want to check the Kubernetes deployment logs to make sure the new configuration was passed in without error; parsing errors, for example, are a common problem. It’s also a good idea to check the status of “alerting-release-prometheus-server” and “alerting-release-prometheus-alertmanager” and make sure everything is green and running.
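A couple of commands along these lines can help; the label selectors match the chart’s server and alertmanager pods, and the container names are assumptions based on the chart’s defaults, so adjust if yours differ:

# Check that the server and alertmanager pods are Running
kubectl get pods | grep alerting-release-prometheus
# Tail the container logs to catch configuration or parsing errors
kubectl logs -l "app=prometheus,component=server" -c prometheus-server --tail=50
kubectl logs -l "app=prometheus,component=alertmanager" -c prometheus-alertmanager --tail=50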

Lastly, you’ll want to start up the built-in UI. To see the server dashboard, run:

POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9090

If you navigate to localhost:9090, click on “Graph”, write “up” in the expression box, and click “Execute,” you should see the Prometheus graph UI, with the up series plotted as a flat line at 1 for each healthy target.

Admittedly, a straight line is not particularly interesting. It’s a good idea to test out other more interesting expressions, such as:

kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", container="MYCONTAINER"} == 1

If there have been any CrashLoopBackOffs since you kicked off Prometheus, you should see it displayed on the graph. Alternatively, you can see all the kube_pod metrics available in the search bar if you start typing, and pick one that regularly fluctuates. Once you pick your expression, you should go ahead and update expr in values.yaml.

As a final step, you will want to trigger your alert rule, for example by taking your instance down for more than a minute, or by forcing a CrashLoopBackOff on a dummy pod, as sketched below. If all goes well, your reward will be a Slack message in #alerting with your title and text summary of choice.
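For example, one low-tech way to force a CrashLoopBackOff (assuming you have a rule on kube_pod_container_status_waiting_reason like the expression above, with the container filter adjusted to match) is to launch a throwaway workload whose container exits immediately:

# A container that exits immediately will be restarted and backed off repeatedly
kubectl run crashloop-test --image=busybox --restart=Always -- /bin/sh -c "exit 1"
# After a few restarts the pod should report CrashLoopBackOff
kubectl get pods | grep crashloop-test
# Clean up once the alert has fired (older kubectl versions create a deployment instead of a bare pod, so delete that instead if needed)
kubectl delete pod crashloop-test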

Final Thoughts

This is just the tip of the Prometheus iceberg when it comes to alerting. You can also write metrics using a tool such as the Python client, triage alerts based on severity, and integrate alert data with a dashboard such as Grafana. We have found the combination of Helm, Prometheus, and Slack webhooks to be a good fit for the alerting and monitoring needs at our stage.

Alerting infrastructure helps us provide the highest quality experience for our users. And, when it comes to scaling the world’s best healthcare, users and health professionals depend on the reliability of the system to provide high-quality care. If you’d like to learn more about our work at Curai, please reach out directly or check out our careers page here.

Thank you to Sindhu Vijaya-Raghavan, Vignesh Venkataraman, and Matt Willian for their guidance and collaboration on alerting infrastructure and input on this post.
