Adding Grafana visualisation to a Kubernetes cluster with Prometheus
Monitoring your services is vital and should be treated as part of the underlying infrastructure for your services. You should put it in place ahead of creating and deploying your services. In this article I look at how to deploy Grafana on a Kubernetes cluster to visualise the metrics collected by Prometheus.
This article follows my previous articles on creating a Kubernetes cluster and installation of Prometheus on Kubernetes.
I am assuming that you have followed both of these articles and have a Kubernetes cluster set up with Prometheus installed.
In those articles I describe why monitoring and alerting is so important and why you should set it up before development of your services. I now look at adding Grafana to the cluster to provide visualisation and alerting.
Prometheus and Grafana
If you have done any investigations into setting up monitoring and alerting for your Kubernetes cluster, you may have come across Prometheus and Grafana as options and you may have been confused about their roles.
This may be because the two packages overlap in the features they provide.
For example, as shown in the diagram above, both applications provide alerting, both also provide querying of the underlying metrics and both provide a user interface that allows graphs to be used to spot trends.
With help from Google, you will find many articles that discuss the pros and cons of each platform.
For this article we are going to create this setup:
If you have been following my previous articles, you will have a blank Kubernetes cluster without any application services but with Prometheus installed.
- We use Prometheus to collect relevant metrics. In the world of Prometheus, this is known as scraping the target.
- We will use Grafana to display the information collected (known as visualisation) and also to generate and deliver alerts via Slack.
In this architecture, we will run Grafana on the cluster itself. As we do not want to lose any metrics or configurations, we will back it with a Persistent Volume.
Architecture
From my previous articles, you should have a Kubernetes cluster that looks like this:
You will have Prometheus installed in the cluster and it will be monitoring the nfs-server and gw servers. We will now install Grafana onto this cluster, using a Persistent Volume backed by the nfs-server.
Starting with a Kubernetes cluster backed by an NFS server and persistent storage, we will add:
- A Persistent Volume (PV)
- A Persistent Volume Claim (PVC)
- Grafana
All these components will be added to a Kubernetes namespace called monitoring, which should already exist and should contain Prometheus.
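If you want to double-check that the namespace and the Prometheus release from the previous article are in place before continuing, a quick look with kubectl is enough:
kubectl get namespace monitoring
kubectl get pods -n monitoring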
We will then install Grafana, add alerts and deliver those alerts via Slack using a webhook.
You will see that Prometheus is capable of providing sophisticated alerting via its Alertmanager module, but I have decided to use Grafana as it is more user friendly and lends itself more to ad hoc changes, allowing you to experiment with alerting rules and levels as you learn how your system behaves.
Setting up our PVC
We need a place for Grafana to store its data and config safely so it doesn’t get lost if the pod is killed and rescheduled. We do this using a Persistent Volume (PV), which the application claims through a Persistent Volume Claim (PVC). I have written about creating PVs and PVCs here.
Creating the PV
I would strongly suggest that you create a separate share for Grafana. If you have followed my previous articles, you will need to add this share. Log in to your nfs-server and modify this file as root (keep any other changes you may have made). I have included the line from the Prometheus article.
/etc/exports
/pv-share *(rw,async,no_subtree_check)
/pv-share/grafana *(rw,async,no_subtree_check)
/pv-share/prometheus *(rw,async,no_subtree_check)
Before we load these into NFS, we have to create the folder:
sudo mkdir /pv-share/grafana
sudo chmod 777 /pv-share/grafana
Note that these file permissions are weak and should not be used for production. For this article I am showing an example to get you started.
Now load these shares and ensure the service starts correctly.
sudo systemctl restart nfs-server
sudo systemctl status nfs-server
You can now use these shares.
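As an aside, if you prefer not to restart the whole NFS service, you can re-read /etc/exports and list what is being exported using the standard NFS utilities. These are not part of the original steps, just a convenience for checking your changes:
sudo exportfs -ra
showmount -e localhost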
Log in to your k8s-master and create the following file (I am assuming here that you are accessing your cluster via kubectl on your master node; if not, use whatever access you typically use to deploy to your cluster):
Remember to replace any fields between < and > with your own values.
grafana-pv.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
spec:
  capacity:
    storage: 10Gi
  storageClassName: grafana-class
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /pv-share/grafana
    server: <nfs-server IP address>
  persistentVolumeReclaimPolicy: Retain
You may decide to change the overall size of this PV, which I have set to 10Gi.
Now create the PV and check it has been created:
kubectl create -f grafana-pv.yml
kubectl get pv
You should see your PV is now available to the cluster:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
grafana-pv 10Gi RWO Retain Available grafana-class 22s
prometheus-pv 10Gi RWO Retain Available prometheus-class 30s
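If the PV does not appear, or shows an unexpected status, describing it will show the NFS path and server it points at, which is where typos usually hide:
kubectl describe pv grafana-pv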
We now need to create a PVC. We do this by creating this file:
grafana-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  storageClassName: grafana-class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
Now create it and check that it is bound to the PV we created above:
kubectl create -f grafana-pvc.yml
kubectl get pvc -n monitoring
This should immediately show the PVCs bound to their PV equivalent:
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
monitoring grafana-pvc Bound grafana-pv 10Gi RWO grafana-class 11s
monitoring prometheus-pvc Bound prometheus-pv 10Gi RWO prometheus-class 21s
This is now ready to be connected to your Grafana pod as a mounted volume.
Deploying Grafana
I am assuming that you have installed Helm from the previous article.
With Helm installed, we can use it to deploy Grafana into the cluster with all the configuration in place for monitoring it.
First add the Grafana repository to Helm:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
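If you want to confirm that the repository was added and see which chart version you will be installing, a quick search works (this step is purely optional):
helm search repo grafana/grafana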
Before we install Grafana, there are a few values we need to override. Create the following values override file, remembering to replace the < > fields with your own values:
grafana-values.yml
persistence:
  enabled: true
  type: pvc
  existingClaim: grafana-pvc
initChownData:
  enabled: false
service:
  type: NodePort
  nodePort: 31000
  externalIPs:
    - <k8s-master IP address>
serviceMonitor:
  enabled: true
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-monitoring-server.monitoring.svc.cluster.local:9090
This does four things:
- persistence: connects Grafana to our previously created PVC
- initChownData: stops Grafana from setting the ownership of the persistent data (this is not directly supported by NFS) and allows the pod to start
- service: changes the ClusterIP service that the chart creates by default into a NodePort service so we can access it externally
- datasources: connects Grafana to Prometheus as a data source
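If you would like to see the full set of defaults these values override, Helm can print the chart's values file. The output file name here is just my choice:
helm show values grafana/grafana > grafana-default-values.yml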
We now install Grafana:
helm install grafana-monitoring grafana/grafana -f grafana-values.yml -n monitoring
Verify that the pod has started:
kubectl get pods -n monitoring
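You can also check that the NodePort service was created with the port we asked for. I am assuming here that the service is named after the release, grafana-monitoring, which matches the secret name used below; adjust if yours differs:
kubectl get svc -n monitoring grafana-monitoring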
You should now be able to access the UI on the k8s-master IP address at port 31000. You should see the login screen. Log in with the username admin and a password given by:
kubectl get secret --namespace monitoring grafana-monitoring -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Note that if you delete and reinstall Grafana, I find that the default admin password can fail, and the only way I have found of resetting it is to delete the contents of your Grafana share and reinstall. I would suggest you change the password once you log in successfully and ensure you always log out before reinstalling.
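If you do get locked out, one thing that may be worth trying before wiping the share is Grafana's own password reset command, run inside the pod. I have not verified this against this exact chart version, so treat it as a suggestion rather than a guaranteed fix:
kubectl exec -it -n monitoring <grafana pod name> -- grafana-cli admin reset-admin-password <new password>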
Connecting Grafana to our data sources
Through the values file, Prometheus is connected to Grafana as a data source. This allows metrics about the deployed servers, collected by Prometheus, to be displayed on a dashboard. These metrics include all metrics from:
- The cluster itself
- The Gateway Node Exporter
- The NFS Server Node Exporter
We can now start to visualise these metrics. After logging in, go to the main menu and:
- Select Dashboards
- Click New
- Select Dashboard
- Select Prometheus as the data source
- Enter node_memory_Active_anon_bytes for the metric
- Click Run query
You should now see a graph with memory usage on all the nodes and the two additional servers. You can narrow this down by selecting Job under Select label. Then, under Select value, you can select a particular exporter node or the Kubernetes cluster itself.
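You can achieve the same narrowing directly in the query by adding a label matcher. The job name is whatever appears under Select value on your cluster, so treat the value below as a placeholder:
node_memory_Active_anon_bytes{job="<your job name>"}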
So now you can monitor the metrics of your system through Grafana dashboards.
You can import Grafana dashboards through numeric IDs or by importing a JSON dashboard definition. These can show you the capability of Grafana and can act as a starting point for your own dashboards.
Alerting via Slack
I will now show you how to use your Grafana deployment to alert you via Slack. Grafana is a feature-rich tool and I only intend to show you enough to get you up and running.
Testing the monitoring
When testing monitoring, you really need something that you can control. For a simple test, I will show you a case of alerting when a pod fails. We will do this by running an interactive BusyBox pod called debug in the default namespace using:
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
When you log out from BusyBox, the pod it created is deleted (that is what the --rm flag does). This is useful as you can control when your service is up or down, which makes it a great way to test your monitoring.
Create Slack app
To Alert via Slack, you need to have a Slack account. You can sign up for free at https://slack.com.
Once you have an account, you can go to https://api.slack.com/apps?new_app=1. From here click Create New App and then:
- Select From scratch
- Give it a name, eg Grafana
- Pick a workspace
- Click Create App
- Scroll down to Incoming Webhooks
- Switch on Activate Incoming Webhooks
- Click Add New Webhook to Workspace
- Select a channel to receive your alerts
- Click Allow
- Copy the URL shown under Add New Webhook to Workspace
The URL you copy contains your access token that you will need for Grafana. Don’t lose it or you will have to recreate it.
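If you want to confirm the webhook works before configuring Grafana, you can post a test message to it from any machine with curl; the message text is arbitrary:
curl -X POST -H 'Content-type: application/json' --data '{"text":"Test message for Grafana alerting"}' <your webhook URL>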
Setting up alerts in Grafana
There are a number of steps involved in setting up an alert in Grafana.
- Define the contact point for the alert (in our case this is Slack)
- Define the rules for when an alert should be sent to a given contact point (Notification Policy)
- Set up the rule to determine if the alert is normal or firing (a change from normal to firing, or back again, will send a Slack message)
Each of these actions is carried out via the main Grafana menu under Alerting. Note that the order above is slightly different to the menu order; that is because steps #1 and #2 are generally carried out once for a number of rules, so the menu places the rules at the top.
Define a Contact Point
The contact point is the channel through which an alert is to be sent. In our case we want to set up a Slack contact point. Go to Alerting under the Home menu and select Contact Points.
- Click + Add contact point
- Give your contact point a Name (eg: Slack)
- Under Integration, select Slack
- Under Recipient, choose a Slack channel to receive the alert
- Ignore Token as we are using a webhook
- Under Webhook, enter the URL you copied earlier (note that once entered you will not be able to access it again)
- Click Test and ensure a test message arrives in your Slack channel
- Click Save contact point
Now that you have created a contact point, you can send alerts to it!
Set up a Notification Policy
A notification policy determines which alerts should be sent to which contact points.
These policies are nested, with each lower level overriding the level above. When you enter Notification policies under the Home -> Alerting menu for the first time, you will see a Default policy which will handle all alerts that are not matched further down.
Under this, click + New nested policy.
Now you enter the conditions which will trigger this policy. For now you can enter this as an example:
- Label: severity
- Operator: =
- Value: alert
Choose Slack as the contact point and click Save policy.
Now you have a way to tell Grafana to send an alert to Slack: just define a label called severity and set it to alert.
Set up an alerting rule
Now we get to the difficult part: defining the rules for your alert. You can do this through the Alert rules menu, but that can be confusing, so I recommend you start from a Dashboard.
By starting with a Dashboard, you can visualise your metric, optimise it and then set the alerting conditions around what you are seeing.
Set up a Dashboard
All your alerting rules need to be placed under an alert group; we will create one when we configure the alert rule below.
- Go to the Home -> Dashboards menu option
- Click New -> New Dashboard
- Click + Add Visualisation
- Select Prometheus as the data source
What you will see will seem pretty crazy if you have never used Grafana before:
The page is split into three main components:
- A graph that shows you the result of your query (top left)
- A description and settings for your graph (right side)
- Your query (bottom left)
Before we continue, I thought I would describe the scenario I am going to set up. Basically, we will graph the number of pods in the ready state over time. We will then start BusyBox and see the graph change, and generate an alert when the BusyBox pod terminates.
So let’s start.
- Give your visualisation a Title (#1), such as BB Availability
- Select the kube_pod_status_ready metric (#2)
- Click Run queries (#3)
On my cluster, I see a graph showing 60 pods in the ready state.
When I start a BusyBox instance and refresh the dashboard, I see this jump to 63.
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
Remember, monitoring takes time due to polling. Prometheus only scrapes its targets every minute. You may set up your alerts to only check every minute, and you may decide not to alert until a condition has held for a minute. This means that you may not see an alert for up to 3 minutes, which can seem like a lifetime!
When you log out of the BusyBox instance, you should see the number of pods fall by 3 (remember to refresh the dashboard).
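As an aside, if you would rather see a single count than one line per pod, you can aggregate the metric. This is a variation on the query above rather than what the dashboard steps produce, and it assumes the usual condition label exposed by kube-state-metrics:
sum(kube_pod_status_ready{condition="true"})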
This is important: save your query/visualisation panel (#4), as you can lose your work when you set up an alert. You will also be asked to save the dashboard.
Add alert from Dashboard
You can now click on the Alert tab (#5). Click on Create alert rule from this panel.
- Give your alert a Name (eg: BB Availability)
- The query should be set up from your Dashboard query, which gives you a time series (ie: a set of values over time)
- Under Expressions, a reducer should have been set up for you. It reduces the time series to a single value by selecting the last value (note that it refers to input A, which is your dashboard query)
- A threshold will also have been set up for you, with an input of B (the reducer output)
- Set the threshold to IS BELOW 63 (in my case 63 is the number of pods with the BusyBox instance running; adjust the number for your system)
- Under evaluation behaviour you will need to create a folder to store your alert (eg: add a new folder called pod alerts)
- Add a new evaluation group, which sets the evaluation update period; call it pods and set it to 20s
- Under add annotations, give it a summary (this will appear in your Slack message)
- Under configure notifications, add the key severity and the value alert, which you should remember is the condition to send your alert via Slack
- Click Preview routing to check that it will be delivered via Slack
This is a very basic alert setup. You can see there are many options, such as templating, evaluation criteria, silencing rules, etc. The aim here was to show you a starting point.
Testing your alert
Now that we have set up our alert, we can start and stop our BusyBox pod and see our alert trigger and resolve. In my case this is what I see in Slack:
Not very user friendly but now you have a way of testing and refining your alerts.
No Data
You may wonder why I do not add a filter to only include pods from the default namespace. We could do this and get a number on our graph of 0 or 3. We could even go further and check for the debug container.
The thing is, once the BusyBox pod terminates, the filtered metric disappears. It becomes No Data, and that introduces a new dimension to the alerting configuration. Any evaluation that is based on one or more values that are No Data produces a No Data output. To avoid that complexity, I chose just to count all pods and avoid the No Data challenge.
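To make that concrete, a filtered query such as the one below (my own example, assuming only the debug pod is running in the default namespace) returns 3 while the pod exists, one series per ready condition, and No Data once it is gone, because the underlying series simply stop being reported:
count(kube_pod_status_ready{namespace="default"})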
Summary
In this article we looked at how to add visualisation to our Prometheus monitoring application using Grafana.
We updated our NFS share configuration to provide persistent storage for Grafana using a Persistent Volume and Persistent Volume Claim.
After providing overrides for the default chart values, we installed Grafana, which automatically connected Prometheus as a data source. This allowed us to start creating dashboards so we can gain visibility of our cluster.
We then set up an alerting contact point via Slack and set up a basic alert to trigger Slack messages when a BusyBox instance was started and terminated.
There is much more to the Grafana story but that can wait for another article.
If you found this article of interest, please give me a clap as that helps me identify what people find useful and what future articles I should write. If you have any suggestions, please add them in the comments section.