Going open-source in monitoring, part III: 10 most useful Grafana dashboards to monitor Kubernetes and services

Sergey Nuzhdin
4 min readJul 21, 2017

--

There are dozens of ready dashboards available on grafana.net/dashboards and Github. Many of them just work, but many do not. In most cases, you just need to fix template variables, but some require deeper involvement. For example dashboard for PostgreSQL monitoring. After import, it welcomes you with the error message “Datasource named ${DS_PROMETHEUS} was not found”. Setting the correct datasource name in dashboard settings does not help because it has an error in __inputs declaration. The easiest way to fix it is to edit JSON before import.

It takes time to find useful dashboards, make them work for you and then choose one. For example, if you’ll search for Kubernetes dashboards with datasource Prometheus you’ll get, among others, 5 results with the same name.

In this post, I want to cover some of the most useful dashboards available to help you monitor your Kubernetes cluster and services deployed on it.

This post is one of a series of posts about monitoring of infrastructure and services. Other posts in the series:

0. Intro

1.Deploying Prometheus and Grafana to Kubernetes (this article)

2. Creating the first dashboard in Grafana

3. 10 most useful Grafana dashboards to monitor Kubernetes and services

4. Configuring alerts in Prometheus and Grafana

5. Collecting errors from production using Sentry

6. Making sense of logs with ELK stack

7. Replacing commercial APM monitoring

8. SLA, SLO, SLI and other useful abstractions

Cluster view

Kubernetes cluster overview #1621 or #315

This dashboard usually the first one you search for after deploying Kubernetes + Prometheus. Among the 5 shown in the screenshot above only 2 is really good. Actually, it is the same dashboard saved by two people.

Node view

Detailed node overview — #1860

This one is great when you need to understand what is wrong with the particular node. It has much more stats compared to the cluster overview.

Deployment metrics monitoring

Kubernetes Deployment metrics #741

Next level of monitoring after the node — deployment. This dashboard will show you everything about selected deployment.

Pod metrics monitoring

Kubernetes Pod Metrics #747

Pod level monitoring. Shows mostly the same stats as the deployment dashboard but on Pod level. Pod info row shows some random values in Pod container and Pod IP Address places, but other values seem fine.

Application metrics monitoring

Kubernetes App Metrics #1471

This one is good as an example of how to monitor your application deployed to Kubernetes. It has metrics like request rate, error rate and response times for your application alongside with different resource usage stats.

PostgreSQL

Postgres Overview #455

After you fix the broken variable declaration — it just works :) I saved fixed one here

ElasticSearch

Elasticsearch cluster overview #266

The best one so far is 266. It supports most of the metrics needed and required least amount of tweaks to make it work. There are several other options — 2322, 2347, 718. But they either lack graphs or did not work at all for me.

Redis

Prometheus Redis #763

Memcached

Memcached node — #37

Etcd

Monitoring of Etcd is covered on CoreOS website here. Dashboard suggested in this manual has hardcoded datasource, so you have to edit it before import. So I made datasource configurable and saved it in the repo here. By default, Etcd is not monitored by Prometheus at all. So, you need to tell it where your Etcd lives. If you have self-hosted Etcd annotations to the Etcd service will work, otherwise, you need to update prometheus config with the location on your Etcd nodes.

- job_name: etcd
static_configs:
- targets: ['10.240.0.32:2379','10.240.0.33:2379','10.240.0.34:2379']

Wrap Up

That’s it. At this point, we can monitor health of the Kubernetes cluster and several common services.

Sources for PostgreSQL and Etcd dashboards, with changes I did, are available on GitHub.

Stay tuned.

Like this article?

Click the 💚 below so other people will see it here on Medium.
Subscribe to get new stories delivered to your inbox or follow me on twitter.

Originally published at blog.lwolf.org on July 21, 2017.

--

--