Monitoring Kubernetes with Prometheus

Aymen El Amri

Published in The MetricFire Blog

3 min read · Mar 24, 2020

Introduction
What’s Broken, and Why?
Monitoring Distributed Systems: The Four Golden Signals
Latency
Traffic
Errors
Saturation
Prometheus and the Four Golden Signals
Conclusion

Introduction

In part I of this blog series, we saw that monitoring a Kubernetes cluster is a challenge that can be overcome with the right tools. We also saw that the default Kubernetes dashboard lets us monitor the different resources running inside our cluster, but it is very basic. We suggested some tools and platforms, such as cAdvisor, kube-state-metrics, Prometheus, Grafana, Kubewatch, Jaeger, and MetricFire.

In this blog post, we are going to look at the Four Golden Signals of building an observable system, and then see how Prometheus can help us apply them.

To get started, sign up for the MetricFire free trial, where you can try out our Hosted Prometheus with almost no setup.

What’s Broken, and Why?

A big part of a DevOps team’s job is to empower development teams to take part in operational responsibility. DevOps is based on cooperation between the various IT players around good practices in order to design, develop, and deploy applications more quickly, less expensively, and with higher quality. It aligns the development and operations teams around the famous principle given to us by Werner Vogels, CTO of Amazon: “you build it, you run it”. The people who make up the team are, therefore, one of DevOps’ main assets.

Taking responsibility for running an application also requires the DevOps team to get involved in related tasks, such as monitoring the application. This makes choosing the right metrics to watch in production a critical task. What you monitor, and the data you see, will shape your DevOps approach.

Involving the team in traditional monitoring tasks is not enough, however. With monitoring, you can discern what is happening in your production infrastructure, for example whether there is a high activity volume on a server or a pool of servers. With observability (or white-box monitoring), you can detect a problem before it becomes a serious issue.

“Your monitoring system should address two questions: what’s broken, and why? The ‘what’s broken’ indicates the symptom; the ‘why’ indicates a (possibly intermediate) cause. ‘What’ versus ‘why’ is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.” ~ Google SRE Book.

There are no ready-to-use methodologies when it comes to choosing the right metrics; everything depends on your team’s technical and business needs. However, the following approaches may inspire you:

The Four Golden Signals from the Google SRE book
USE Method from Brendan Gregg
RED Method from Tom Wilkie

We will try to understand some of the essential and most common metrics to watch in a Kubernetes-based production system, based on Google’s Four Golden Signals.

Monitoring Distributed Systems: The Four Golden Signals

In chapter 6 of the famous Google SRE book, “Monitoring Distributed Systems”, Google defines the four main signals to be constantly observed. These four signals are called the four golden signals: latency, traffic, errors, and saturation.

These signals are extremely important, as they are essential to ensure high application availability. Let’s briefly take a look at what each one means.
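As a rough illustration of what the four signals measure, here is a minimal Python sketch that summarizes one observation window of requests into latency, traffic, errors, and saturation. It is not taken from the SRE book or from Prometheus: the `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, the use of 5xx responses as errors, and CPU usage against a limit as the saturation measure are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request (illustrative only)."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned to the client


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used_cores: float,
                   cpu_limit_cores: float) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    if not requests or window_s <= 0 or cpu_limit_cores <= 0:
        raise ValueError("need requests, a positive window and a positive limit")

    n = len(requests)
    durations = sorted(r.duration_s for r in requests)

    # Latency: nearest-rank 95th percentile of request durations.
    # Integer arithmetic avoids floating-point surprises in the rank.
    rank = (95 * n + 99) // 100  # ceil(0.95 * n)
    latency_p95 = durations[rank - 1]

    # Traffic: demand on the system, here requests per second.
    traffic = n / window_s

    # Errors: fraction of requests that failed (assumed here: HTTP 5xx).
    error_ratio = sum(1 for r in requests if r.status >= 500) / n

    # Saturation: how "full" the service is (assumed here: CPU used / CPU limit).
    saturation = cpu_used_cores / cpu_limit_cores

    return {
        "latency_p95_s": latency_p95,
        "traffic_rps": traffic,
        "error_ratio": error_ratio,
        "saturation": saturation,
    }


if __name__ == "__main__":
    # Nine fast, successful requests and one slow, failed one in a 5 s window.
    sample = [Request(0.1, 200)] * 9 + [Request(2.0, 500)]
    print(golden_signals(sample, window_s=5.0,
                         cpu_used_cores=1.5, cpu_limit_cores=2.0))
```

In a real cluster these values would normally come from a monitoring system rather than from hand-built records; the sketch only shows how the four definitions differ from one another.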

…

To finish reading this article, check out the full post on the MetricFire website.
