GPU metrics per pod via Grafana

Amir A
6 min read · Nov 20, 2021


TL;DR

Just paste these queries into your Grafana dashboard/panel (this works with multiple GPUs on the same node):

# PromQL queries for the GPU per pod panels:

# GPU utilization:
DCGM_FI_DEV_GPU_UTIL{instance=~"${instance}",exported_pod=~"${pod}",gpu=~"${gpu}"}

# GPU memory utilization (DCGM reports framebuffer usage in MiB; dividing by 1024 gives GiB):
DCGM_FI_DEV_FB_USED{instance=~"${instance}",exported_pod=~"${pod}",gpu=~"${gpu}"} / 1024
[Screenshot: GPU utilization per pod, with pod name.]
[Screenshot: GPU memory utilization per pod, with pod name.]

The longer version

Do you need to show GPU metrics in Grafana for your workloads? Do your users want to view GPU utilization per pod? My bet is: 100%, for sure.

The needed stack:

  • kubernetes cluster
  • Prometheus
  • Grafana
  • Nvidia GPU(s)
  • a nice DevOps engineer who believes knowledge sharing is part of being DevOps

I very often run into requests for more visibility, or for dashboards that are not there out of the box. Don't get me wrong, the Kubernetes Grafana package is amazing, with its pod/namespace/cluster metrics! But there will always be a need for custom dashboards and panels.

In this article, we’ll talk about the NVIDIA per pod metrics. Those panels are not included in the official DCGM dashboard (https://grafana.com/grafana/dashboards/12239). Although that is a pretty impressive dashboard, it was designed for the cluster/infrastructure admin, not for the end user who just wants to see their per pod, per GPU utilization.

*** If you are new to this and want to get NVIDIA GPUs running in your cluster, make sure you follow the NVIDIA device plugin guide: https://github.com/NVIDIA/k8s-device-plugin.

Assuming you have the NVIDIA device plugin daemon set, Prometheus, and Grafana up and running, let's install the NVIDIA DCGM exporter. This daemon set will export the metrics from each node in your cluster. If only a few nodes in your cluster have GPUs, make sure you use a node selector label so the daemon set only runs on the targeted nodes.

DCGM-exporter by Nvidia: https://github.com/NVIDIA/dcgm-exporter

$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update
$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter
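If only some of your nodes have GPUs, the chart accepts a nodeSelector. A minimal values-file sketch — the `nvidia.com/gpu.present` label is what NVIDIA's GPU Feature Discovery applies, but substitute whatever label marks your GPU nodes:

```yaml
# values.yaml for the dcgm-exporter chart
# (the label below is an example; use your own GPU-node label)
nodeSelector:
  nvidia.com/gpu.present: "true"
```

Then pass it at install time: `helm install --generate-name -f values.yaml gpu-helm-charts/dcgm-exporter`.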

Make sure all pods are in the Ready state and running.
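A quick way to check, as a sketch — the label below is the chart's default (`app.kubernetes.io/name=dcgm-exporter`), and the namespace flag is an assumption; adjust both to your install:

```shell
# List the exporter pods (add -n <namespace> if you installed into one):
kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter"

# Tail the logs of one exporter pod (pod name is a placeholder):
kubectl logs <dcgm-exporter-pod-name>
```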

The logs on each pod should show the exporter starting its web server and beginning to serve metrics.

As we can see, these pods run a web server, and that web server exposes the metrics that Prometheus will pull (or scrape).
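Before wiring up Grafana, it's worth eyeballing the raw scrape output yourself. DCGM-exporter serves metrics on port 9400 at /metrics by default; the sample lines below are illustrative (values and UUID are made up), showing the label names the dashboard queries rely on:

```shell
# Against a live cluster you would port-forward and curl (pod name is a placeholder):
#   kubectl port-forward <dcgm-exporter-pod> 9400:9400 &
#   curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# Illustrative sample of what the scrape output looks like:
metrics='DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx",exported_pod="train-job-0"} 85
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-xxxx",exported_pod="train-job-0"} 4096'

# Filter for the utilization metric, the same series Prometheus will store:
printf '%s\n' "$metrics" | grep DCGM_FI_DEV_GPU_UTIL
```

Note the `exported_pod` label — that is the label the panel queries and template variables below match on.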

Next stop, Grafana! Let's add the NVIDIA dashboard and data source.

Log in to Grafana, click Configuration > Data sources, then click Add data source. The URL should be the Prometheus Kubernetes service in your cluster.

The Kubernetes service (it does not have to be a NodePort; ClusterIP works too):
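For reference, the data source can also be defined declaratively. A minimal provisioning-file sketch — the service name, namespace, and port here are assumptions (a common Prometheus Helm install), so adjust the URL to match `kubectl get svc` in your cluster:

```yaml
# Grafana data source provisioning file (sketch)
apiVersion: 1
datasources:
  - name: Prometheus-dcgm        # name referenced by the dashboard JSON below
    type: prometheus
    access: proxy
    # URL assumes a service named prometheus-server in the monitoring namespace:
    url: http://prometheus-server.monitoring.svc.cluster.local:9090
```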

Save and test! Testing is important. Now let's create a dashboard and a panel: go to Dashboards, click Add dashboard, then open the JSON Model and paste this section:

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "This dashboard is to display the metrics from DCGM Exporter on a Kubernetes (1.13+) cluster",
"editable": true,
"gnetId": 12239,
"graphTooltip": 0,
"id": 17,
"iteration": 1637409044872,
"links": [],
"panels": [
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 28,
"panels": [],
"title": "",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus-dcgm",
"description": "\n\nDevops",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 13,
"w": 10,
"x": 0,
"y": 1
},
"hiddenSeries": false,
"id": 20,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "DCGM_FI_DEV_GPU_UTIL{instance=~\"${instance}\",exported_pod=~\"${pod}\",gpu=~\"${gpu}\"}",
"interval": "",
"legendFormat": "{{exported_pod}} {{exported_namespace}} {{gpu}} {{instance}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "GPU Per Pod-- GPU Utilization ",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}, {
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus-dcgm",
"fieldConfig": {
"defaults": {
"custom": {},
"unit": "none"
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 13,
"w": 10,
"x": 10,
"y": 1
},
"hiddenSeries": false,
"id": 22,
"legend": {
"avg": false,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": true,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "7.3.5",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "DCGM_FI_DEV_FB_USED{instance=~\"${instance}\",exported_pod=~\"${pod}\",gpu=~\"${gpu}\"} /1024",
"interval": "",
"legendFormat": "pod {{exported_pod}}",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "GPU Per Training -- GPU Memory Utilization ",
"tooltip": {
"shared": false,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "none",
"label": "GB",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "decmbytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": false,
"schemaVersion": 26,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": null,
"current": {
"selected": true,
"text": [
"All"
],
"value": [
"$__all"
]
},
"datasource": "Prometheus-dcgm",
"definition": "label_values(DCGM_FI_DEV_POWER_USAGE,instance)",
"error": null,
"hide": 0,
"includeAll": true,
"label": null,
"multi": true,
"name": "instance",
"options": [],
"query": "label_values(DCGM_FI_DEV_POWER_USAGE,instance)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"selected": true,
"text": [
"All"
],
"value": [
"$__all"
]
},
"datasource": "Prometheus-dcgm",
"definition": "label_values(gpu)",
"error": null,
"hide": 0,
"includeAll": true,
"label": null,
"multi": true,
"name": "gpu",
"options": [],
"query": "label_values(gpu)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"selected": true,
"text": [
"All"
],
"value": [
"$__all"
]
},
"datasource": "Prometheus-dcgm",
"definition": "label_values( DCGM_FI_DEV_GPU_UTIL,exported_pod)",
"error": null,
"hide": 0,
"includeAll": true,
"label": null,
"multi": true,
"name": "pod",
"options": [],
"query": "label_values( DCGM_FI_DEV_GPU_UTIL,exported_pod)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
]
},
"timezone": "",
"title": "GPU Graphs - example",
"uid": "5g8jvktnz",
"version": 5
}

At this point we should have a beautiful dashboard with two panels, one for GPU Utilization and one for GPU Memory Utilization. Both are designed to show one pod (a small number of pods is also fine). Let's start by choosing one pod from the top drop-down menu:
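For reference, once you pick values in the drop-downs, Grafana substitutes the `${instance}`, `${pod}`, and `${gpu}` template variables and sends a concrete query. The values below are hypothetical, just to show the shape:

```
DCGM_FI_DEV_GPU_UTIL{instance=~"10.0.0.12:9400",exported_pod=~"train-job-0",gpu=~"0"}
```

Selecting "All" expands a variable to a regex matching every value, which is why the queries use the `=~` regex matcher rather than `=`.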

[Screenshot: GPU utilization per pod, with pod name.]
[Screenshot: GPU memory utilization per pod, with pod name.]

Summary

Following this article, you installed the NVIDIA DCGM exporter and built a dashboard with two panels showing per pod GPU utilization and GPU memory utilization.
