Adding Grafana visualisation to a Kubernetes cluster with Prometheus
Monitoring your services is vital and should be treated as part of the underlying infrastructure for your services. You should put it in place ahead of creating and deploying your services. In this article I look at how to deploy Grafana on a Kubernetes cluster to visualise the metrics collected by Prometheus.
This article follows my previous articles on creating a Kubernetes cluster and installation of Prometheus on Kubernetes.
I am assuming that you have followed both of these articles and have a Kubernetes cluster set up with Prometheus installed.
In those articles I describe why monitoring and alerting is so important and why you should set it up before development of your services. I now look at adding Grafana to the cluster to provide visualisation and alerting.
Prometheus and Grafana
If you have done any investigations into setting up monitoring and alerting for your Kubernetes cluster, you may have come across Prometheus and Grafana as options and you may have been confused about their roles.
This may be because the two packages overlap in the features they provide.
For example, as shown in the diagram above, both applications provide alerting, both also provide querying of the underlying metrics and both provide a user interface that allows graphs to be used to spot trends.
With help from Google, you will find many articles that discuss the pros and cons of each platform.
For this article we are going to create this setup:
If you have been following my previous articles, you will have a blank Kubernetes cluster without any application services but with Prometheus installed.
- We use Prometheus to collect relevant metrics. In the world of Prometheus, this is known as scraping the target.
- We will use Grafana to display the information collected (known as visualisation) and also to generate and deliver alerts via Slack.
In this architecture, we will run Grafana on the cluster itself. As we do not want to lose any metrics or configurations, we will back it with a Persistent Volume.
Architecture
From my previous articles, you should have a Kubernetes cluster that looks like this:
You will have Prometheus installed in the cluster and it will be monitoring the nfs-server and gw servers. We will now install Grafana onto this cluster, using a Persistent Volume backed by the nfs-server.
Starting with a Kubernetes cluster backed by an NFS server and persistent storage, we will add:
- A Persistent Volume (PV)
- A Persistent Volume Claim (PVC)
- Grafana
All these components will be added to a Kubernetes namespace called monitoring, which should already exist and should contain Prometheus.
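If you want to double-check that the namespace and the Prometheus release from the previous article are in place before continuing, a quick look with kubectl is enough:
kubectl get namespace monitoring
kubectl get pods -n monitoring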
We will then install Grafana, add alerts and deliver those alerts via Slack using a webhook.
You will see that Prometheus is capable of providing sophisticated alerting via its Alertmanager module, but I have decided to use Grafana as it is more user friendly and lends itself more to ad hoc changes, allowing you to experiment with alerting rules and levels as you learn how your system behaves.
Setting up our PVC
We need a place for Grafana to store its data and config safely so it doesn’t get lost if the pod is killed and rescheduled. We do this using a Persistent Volume (PV), which the application claims through a Persistent Volume Claim (PVC). I have written about creating PVs and PVCs here.
Creating the PV
I would strongly suggest that you create a separate share for Grafana. If you have followed my previous articles, you will need to add this share. Log in to your nfs-server and modify this file as root (keep any other changes you may have made). I have included the line from the Prometheus article.
/etc/exports
/pv-share *(rw,async,no_subtree_check)
/pv-share/grafana *(rw,async,no_subtree_check)
/pv-share/prometheus *(rw,async,no_subtree_check)
Before we load these into NFS, we have to create the folder:
sudo mkdir /pv-share/grafana
sudo chmod 777 /pv-share/grafana
Note that these file permissions are weak and should not be used for production. For this article I am showing an example to get you started.
Now load these shares and ensure the service starts correctly.
sudo systemctl restart nfs-server
sudo systemctl status nfs-server
You can now use these shares.
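As an aside, if you prefer not to restart the whole NFS service, you can re-read /etc/exports and list what is being exported using the standard NFS utilities. These are not part of the original steps, just a convenience for checking your changes:
sudo exportfs -ra
showmount -e localhost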
Log in to your k8s-master and create the following file (I am assuming here that you are accessing your cluster via kubectl on your master node; if not, use whatever access you typically use to deploy to your cluster):
Remember to replace any fields between < and > with your own values.
grafana-pv.yml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
spec:
  capacity:
    storage: 10Gi
  storageClassName: grafana-class
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /pv-share/grafana
    server: <nfs-server IP address>
  persistentVolumeReclaimPolicy: Retain
You may decide to change the overall size of this PV, which I have set to 10Gi.
Now create the PV and check it has been created:
kubectl create -f grafana-pv.yml
kubectl get pv
You should see your PV is now available to the cluster:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
grafana-pv 10Gi RWO Retain Available grafana-class 22s
prometheus-pv 10Gi RWO Retain Available prometheus-class 30s
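If the PV does not appear, or shows an unexpected status, describing it will show the NFS path and server it points at, which is where typos usually hide:
kubectl describe pv grafana-pv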
We now need to create a PVC. We do this by creating this file:
grafana-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  storageClassName: grafana-class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
Now create it and check that it is bound to the PV we created above:
kubectl create -f grafana-pvc.yml
kubectl get pvc -n monitoring
This should immediately show the PVCs bound to their PV equivalent:
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
monitoring grafana-pvc Bound grafana-pv 10Gi RWO grafana-class 11s
monitoring prometheus-pvc Bound prometheus-pv 10Gi RWO prometheus-class 21s
This is now ready to be connected to your Grafana pod as a mounted volume.
Deploying Grafana
I am assuming that you have installed Helm from the previous article.
With Helm installed, we can use it to deploy Grafana into the cluster with all the configuration in place for monitoring it.
First add the Grafana repository to Helm:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
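If you want to confirm that the repository was added and see which chart version you will be installing, a quick search works (this step is purely optional):
helm search repo grafana/grafana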
Before we install Grafana, there are a few values we need to override. Create the following values override file, remembering to replace the < > fields with your own values:
grafana-values.yml
persistence:
  enabled: true
  type: pvc
  existingClaim: grafana-pvc
initChownData:
  enabled: false
service:
  type: NodePort
  nodePort: 31000
  externalIPs:
    - <k8s-master IP address>
serviceMonitor:
  enabled: true
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-monitoring-server.monitoring.svc.cluster.local:9090
This does four things:
- persistence: connects Grafana to our previously created PVC
- initChownData: stops Grafana from setting the ownership of the persistent data (this is not directly supported by NFS) and allows the pod to start
- service: changes the ClusterIP service that the chart creates by default into a NodePort service so we can access it externally
- datasources: connects Grafana to Prometheus as a data source
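If you would like to see the full set of defaults these values override, Helm can print the chart's values file. The output file name here is just my choice:
helm show values grafana/grafana > grafana-default-values.yml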
We now install Grafana:
helm install grafana-monitoring grafana/grafana -f grafana-values.yml -n monitoring
Verify that the pod has started:
kubectl get pods -n monitoring
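You can also check that the NodePort service was created with the port we asked for. I am assuming here that the service is named after the release, grafana-monitoring, which matches the secret name used below; adjust if yours differs:
kubectl get svc -n monitoring grafana-monitoring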
You should now be able to access the UI on the k8s-master IP address at port 31000. You should see the login screen. Log in with the username admin and a password given by:
kubectl get secret --namespace monitoring grafana-monitoring -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Note that if you delete and reinstall Grafana, I find that the default admin password can fail, and the only way I have found of resetting it is to delete the contents of your Grafana share and reinstall. I would suggest you change the password once you log in successfully and ensure you always log out before reinstalling.
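If you do get locked out, one thing that may be worth trying before wiping the share is Grafana's own password reset command, run inside the pod. I have not verified this against this exact chart version, so treat it as a suggestion rather than a guaranteed fix:
kubectl exec -it -n monitoring <grafana pod name> -- grafana-cli admin reset-admin-password <new password>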
Connecting Grafana to our data sources
Through the values file, Prometheus is connected to Grafana as a data source. This allows metrics about the deployed servers, collected by Prometheus, to be displayed on a dashboard. These metrics include all metrics from:
- The cluster itself
- The Gateway Node Exporter
- The NFS Server Node Exporter
We can now start to visualise these metrics. After logging in, go to the main menu and:
- Select Dashboards
- Click New
- Select Dashboard
- Select Prometheus as the data source
- Enter node_memory_Active_anon_bytes for the metric
- Click Run query
You should now see a graph with memory usage on all the nodes and the two additional servers. You can narrow this down by selecting Job under Select label. Then, under Select value, you can select a particular exporter node or the Kubernetes cluster itself.
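You can achieve the same narrowing directly in the query by adding a label matcher. The job name is whatever appears under Select value on your cluster, so treat the value below as a placeholder:
node_memory_Active_anon_bytes{job="<your job name>"}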
So now you can monitor the metrics of your system through Grafana dashboards.
You can import Grafana dashboards through numeric IDs or by importing a JSON dashboard definition. These can show you the capability of Grafana and can act as a starting point for your own dashboards.
Alerting via Slack
I will now show you how to use your Grafana deployment to alert you via Slack. Grafana is a feature-rich tool and I only intend to show you enough to get you up and running.
Testing the monitoring
When testing monitoring, you really need something that you can control. For a simple test, I will show you a case of alerting when a pod fails. We will do this by running an interactive BusyBox pod called debug in the default namespace using:
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
When you log out from BusyBox, the pod it created is deleted (that is what the --rm flag does). This is useful as you can control when your service is up or down, which makes it a great way to test your monitoring.
Create Slack app
To Alert via Slack, you need to have a Slack account. You can sign up for free at https://slack.com.
Once you have an account, you can go to https://api.slack.com/apps?new_app=1. From here click Create New App and then:
- Select From scratch
- Give it a name, eg Grafana
- Pick a workspace
- Click Create App
- Scroll down to Incoming Webhooks
- Switch on Activate Incoming Webhooks
- Click Add New Webhook to Workspace
- Select a channel to receive your alerts
- Click Allow
- Copy the URL shown under Add New Webhook to Workspace
The URL you copy contains your access token that you will need for Grafana. Don’t lose it or you will have to recreate it.
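If you want to confirm the webhook works before configuring Grafana, you can post a test message to it from any machine with curl; the message text is arbitrary:
curl -X POST -H 'Content-type: application/json' --data '{"text":"Test message for Grafana alerting"}' <your webhook URL>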
Setting up alerts in Grafana
There are a number of steps involved in setting up an alert in Grafana.
- Define the contact point for the alert (in our case this is Slack)
- Define the rules for when an alert should be sent to a given contact point (Notification Policy)
- Set up the rule to determine if the alert is normal or firing (a change from normal to firing, or back again, will send a Slack message)
Each of these actions is carried out via the main Grafana menu under Alerting. Note that the order above is slightly different to the menu order; that is because steps #1 and #2 are generally carried out once for a number of rules, so the menu places the rules at the top.
Define a Contact Point
The contact point is the channel through which an alert is to be sent. In our case we want to set up a Slack contact point. Go to Alerting under the Home menu and select Contact Points.
- Click + Add contact point
- Give your contact point a Name (eg: Slack)
- Under Integration, select Slack
- Under Recipient, choose a Slack channel to receive the alert
- Ignore Token as we are using a webhook
- Under Webhook, enter the URL you copied earlier (note that once entered you will not be able to access it again)
- Click Test and ensure a test message arrives in your Slack channel
- Click Save contact point
Now that you have created a contact point, you can send alerts to it!
Set up a Notification Policy
A notification policy determines which alerts should be sent to which contact points.
These policies are nested, with each lower level overriding the level above. When you enter Notification policies under the Home -> Alerting menu for the first time, you will see a Default policy which will handle all alerts that are not matched further down.
Under this, click + New nested policy.
Now you enter the conditions which will trigger this policy. For now you can enter this as an example:
- Label: severity
- Operator: =
- Value: alert
Choose Slack as the contact point and click Save policy.
Now you have a way to tell Grafana to send an alert to Slack: just define a label called severity and set it to alert.
Set up an alerting rule
Now we get to the difficult part: defining the rules for your alert. You can do this through the Alert rules menu, but that can be confusing, so I recommend you start from a Dashboard.
By starting with a Dashboard, you can visualise your metric, optimise it and then set the alerting conditions around what you are seeing.
Set up a Dashboard
All your alerting rules need to be placed under an alert group; we will create one when we configure the alert rule below.
- Go to the Home -> Dashboards menu option
- Click New -> New Dashboard
- Click + Add Visualisation
- Select Prometheus as the data source
What you will see will seem pretty crazy if you have never used Grafana before:
The page is split into three main components:
- A graph that shows you the result of your query (top left)
- A description and settings for your graph (right side)
- Your query (bottom left)
Before we continue, I thought I would describe the scenario I am going to set up. Basically, we will graph the number of pods in the ready state over time. We will then start BusyBox and see the graph change, and generate an alert when the BusyBox pod terminates.
So let’s start.
- Give your visualisation a Title (#1), such as BB Availability
- Select the kube_pod_status_ready metric (#2)
- Click Run queries (#3)
On my cluster, I see a graph showing 60 pods in the ready state.
When I start a BusyBox instance and refresh the dashboard, I see this jump to 63.
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
Remember, monitoring takes time due to polling. Prometheus only scrapes its targets every minute. You may set up your alerts to only check every minute, and you may decide not to alert until a condition has held for a minute. This means that you may not see an alert for up to 3 minutes, which can seem like a lifetime!
When you log out of the BusyBox instance, you should see the number of pods fall by 3 (remember to refresh the dashboard).
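As an aside, if you would rather see a single count than one line per pod, you can aggregate the metric. This is a variation on the query above rather than what the dashboard steps produce, and it assumes the usual condition label exposed by kube-state-metrics:
sum(kube_pod_status_ready{condition="true"})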
This is important: save your query/visualisation panel (#4), as you can lose your work when you set up an alert. You will also be asked to save the dashboard.
Add alert from Dashboard
You can now click on the Alert tab (#5). Click on Create alert rule from this panel.
- Give your alert a Name (eg: BB Availability)
- The query should be set up from your Dashboard query, which gives you a time series (ie: a set of values over time)
- Under Expressions, a reducer should have been set up for you. It reduces the time series to a single value by selecting the last value (note that it refers to input A, which is your dashboard query)
- A threshold will also have been set up for you, with an input of B (the reducer output)
- Set the threshold to IS BELOW 63 (in my case 63 is the number of pods with the BusyBox instance running; adjust the number for your system)
- Under evaluation behaviour you will need to create a folder to store your alert (eg: add a new folder called pod alerts)
- Add a new evaluation group, which sets the evaluation update period; call it pods and set it to 20s
- Under add annotations, give it a summary (this will appear in your Slack message)
- Under configure notifications, add the key severity and the value alert, which you should remember is the condition to send your alert via Slack
- Click Preview routing to check that it will be delivered via Slack
This is a very basic alert setup. You can see there are many options, such as templating, evaluation criteria, silencing rules, etc. The aim here was to show you a starting point.
Testing your alert
Now that we have set up our alert, we can start and stop our BusyBox pod and see our alert trigger and resolve. In my case this is what I see in Slack:
Not very user friendly but now you have a way of testing and refining your alerts.
No Data
You may wonder why I do not add a filter to only include pods from the default namespace. We could do this and get a number on our graph of 0 or 3. We could even go further and check for the debug container.
The thing is, once the BusyBox pod terminates, the filtered metric disappears. It becomes No Data, and that introduces a new dimension to the alerting configuration. Any evaluation that is based on one or more values that are No Data produces a No Data output. To avoid that complexity, I chose just to count all pods and avoid the No Data challenge.
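To make that concrete, a filtered query such as the one below (my own example, assuming only the debug pod is running in the default namespace) returns 3 while the pod exists, one series per ready condition, and No Data once it is gone, because the underlying series simply stop being reported:
count(kube_pod_status_ready{namespace="default"})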
Summary
In this article we looked at how to add visualisation to our Prometheus monitoring application using Grafana.
We updated our NFS share configuration to provide persistent storage for Grafana using a Persistent Volume and Persistent Volume Claim.
After providing overrides for the default chart values, we installed Grafana, which automatically connected Prometheus as a data source. This allowed us to start creating dashboards so we can gain visibility of our cluster.
We then set up an alerting contact point via Slack and set up a basic alert to trigger Slack messages when a BusyBox instance was started and terminated.
There is much more to the Grafana story but that can wait for another article.
If you found this article of interest, please give me a clap as that helps me identify what people find useful and what future articles I should write. If you have any suggestions, please add them in the comments section.