Self-managed GPU Monitoring Stack on Google Cloud with DCGM, Prometheus, and Grafana
Overview
On Google Cloud, a growing number of Kubernetes workloads require accelerators such as GPUs. Monitoring GPU usage on GKE can be challenging and often calls for NVIDIA Data Center GPU Manager (DCGM). Google provides two approaches for using DCGM: a self-managed approach via the DCGM exporter, and a managed add-on that collects and displays metrics.
However, some users want more flexibility and prefer to pair DCGM with other open-source tools such as Prometheus and Grafana.
This tutorial outlines the setup of a self-managed, open-source GPU monitoring stack built from Prometheus exporters, Google Cloud Managed Service for Prometheus (GMP), and Grafana. With it, you can monitor and troubleshoot performance issues in GPU-accelerated workloads, optimize resource utilization, and keep GPU systems healthy and available.
Specifically, the monitoring stack comprises the following components:
- DCGM Prometheus exporter: This collects a wide range of metrics from GPU devices and systems, including utilization, temperature, power consumption, and memory usage.
- Google Cloud Managed Service for Prometheus (GMP): This platform provides a hosted Prometheus instance, simplifying the deployment and management of the monitoring stack without requiring extensive infrastructure setup.
- Visualization and alerting via Grafana: These offer visual representations of collected metrics, empowering users to monitor GPU performance, identify trends, and detect anomalies.
Create a GKE cluster with a GPU node pool
This tutorial requires a GKE cluster with a GPU node pool, plus a sample workload to generate GPU activity. The following steps walk you through the process. For more details about the example, you can read this doc.
You need to run all the following commands in Cloud Shell.
First, let’s clone the sample repository:
git clone https://github.com/GoogleCloudPlatform/ai-on-gke/ ai-on-gke
cd ai-on-gke/tutorials-and-examples/gpu-examples/training-single-gpu
Configure the variables:
export PROJECT_ID=[YOUR-PROJECT-ID]
export REGION=us-central1
export CLUSTER_NAME=gke-gpu-cluster
export NAMESPACE=gke-ai-namespace
export K8S_SA_NAME=gpu-k8s-sa
export BUCKET_NAME=${PROJECT_ID}-gke-gpu-bucket
Configure the environment and enable the needed services:
gcloud config set project "$PROJECT_ID"
gcloud config set compute/region "$REGION"
gcloud services enable compute.googleapis.com container.googleapis.com
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
Create the GKE cluster:
gcloud container clusters create ${CLUSTER_NAME} \
  --addons GcsFuseCsiDriver \
  --location=${REGION} \
  --num-nodes=1 \
  --workload-pool=${PROJECT_ID}.svc.id.goog
Create the GKE node pool:
gcloud container node-pools create gke-gpu-pool-1 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --machine-type=n1-standard-16 \
  --num-nodes=1 \
  --location=${REGION} \
  --cluster=${CLUSTER_NAME}
Create a storage bucket:
gcloud storage buckets create gs://${BUCKET_NAME} \
  --uniform-bucket-level-access
Create the namespace and service account in GKE:
kubectl create namespace ${NAMESPACE}
kubectl create serviceaccount ${K8S_SA_NAME} --namespace=${NAMESPACE}
Use workload identity to allow access:
gcloud storage buckets add-iam-policy-binding gs://${BUCKET_NAME} \
  --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${K8S_SA_NAME}" \
  --role "roles/storage.objectUser"
Copy the data:
gcloud storage cp src/tensorflow-mnist-example gs://${BUCKET_NAME}/ --recursive
Start the training workload:
envsubst < src/gke-config/standard-tf-mnist-train.yaml | kubectl -n ${NAMESPACE} apply -f -
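The envsubst command substitutes the environment variables exported earlier into the manifest before kubectl applies it. A quick illustration of the mechanism (the bucketName field here is hypothetical, not taken from the actual manifest):
echo 'bucketName: ${BUCKET_NAME}' | envsubst
# Prints: bucketName: <your-project-id>-gke-gpu-bucket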
You can either run the following command or view the logs from the console:
kubectl logs -f jobs/mnist-training-job -c tensorflow -n ${NAMESPACE}
DCGM Prometheus exporter
The NVIDIA DCGM exporter is a Prometheus exporter designed to collect and expose metrics from NVIDIA GPUs using the NVIDIA Data Center GPU Manager (DCGM).
Install DCGM exporter
There are multiple ways to install the DCGM exporter. In this tutorial, you can just run the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/examples/nvidia-dcgm/exporter.yaml
You can read this doc for more details regarding the installation of the DCGM exporter.
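For orientation, the manifest deploys the exporter as a DaemonSet, so one exporter pod runs on each node and scrapes its local GPUs. The following is a trimmed sketch of the key elements, not the exact manifest; the real file in the prometheus-engine repository carries additional labels, environment variables, and security settings, and you should pin a current image tag:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-dcgm-exporter
    spec:
      containers:
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04  # illustrative tag; use a current release
        ports:
        - name: metrics
          containerPort: 9400  # the exporter's default metrics port
The PodMonitoring resource in the next step scrapes this metrics port.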
Define a PodMonitoring resource
For GMP to scrape metrics from the exporter, you also need to create a PodMonitoring custom resource that selects the DCGM exporter pods by namespace and labels.
Run the following command to do so:
kubectl apply -n gmp-public -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/examples/nvidia-dcgm/pod-monitoring.yaml
You can find more details about the example here. Once the command runs successfully, the DCGM exporter shows up on the Workloads page in the console.
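The applied PodMonitoring resource looks roughly like the following; the label selector shown is an assumption based on the exporter sketch above, so check the repository for the exact definition:
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: nvidia-dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s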
Google Cloud Managed Service for Prometheus (GMP)
To verify that the configuration and metric collection are working, run a PromQL query such as DCGM_FI_DEV_GPU_UTIL{cluster="gke-gpu-cluster"} in the Metrics Explorer. You can read this doc for more details.
Alternatively, you can run the same query from the command line with curl:
curl -s https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/global/prometheus/api/v1/query \
  -d 'query=DCGM_FI_DEV_GPU_UTIL{cluster="gke-gpu-cluster"}' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" | jq
You should see results like the following.
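The response follows the standard Prometheus HTTP API format. Here is a trimmed, illustrative example; your labels, timestamp, and values will differ:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "DCGM_FI_DEV_GPU_UTIL",
          "cluster": "gke-gpu-cluster",
          "gpu": "0",
          "modelName": "Tesla T4"
        },
        "value": [1700000000, "87"]
      }
    ]
  }
}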
Grafana
If you don’t have a Grafana instance, you need to install it or use the SaaS version. Grafana offers several installation methods, catering to different environments and preferences:
- Grafana Cloud: The easiest way to get started is through Grafana Cloud, their managed service. This eliminates the need for manual installation and maintenance. Simply create a free account and start using Grafana immediately.
- Package Managers: Grafana provides packages for various operating systems, like Debian, Ubuntu, RPM-based distros, and Homebrew for macOS. These packages simplify the installation process, handling dependencies and configuration.
- Docker: If you prefer containerization, Grafana offers official Docker images. This allows for easy deployment and management within Docker environments.
- Binary Installation: You can also download the Grafana binary directly and run it. This method provides flexibility but requires manual configuration and setup.
- Kubernetes: For Kubernetes users, Grafana provides Helm charts to deploy and manage Grafana within Kubernetes clusters.
Choose the method that best suits your needs and environment. Refer to the official Grafana documentation for detailed instructions on each installation method:
- Grafana Installation Overview: https://grafana.com/docs/grafana/latest/setup-grafana/installation
- Grafana Download: https://grafana.com/grafana/download
Configure Grafana
Grafana can use GMP as a Prometheus data source directly, which means you can continue using any community-created or personal Grafana dashboards without any changes.
Note: The community has created a patch that allows us to use the Cloud Monitoring data source to run PromQL. However, this is not recommended due to its limited functionality.
One challenge in configuring the Prometheus data source is setting up authentication. Using the data source syncer is recommended, and you can follow the steps in this documentation to configure the Prometheus data source in Grafana.
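For orientation, a minimal data source definition in Grafana's standard provisioning format would look like the sketch below; replace PROJECT_ID with your project. The data source syncer's job is then to keep the credentials on this data source fresh, since an access token from gcloud auth print-access-token expires after roughly an hour:
apiVersion: 1
datasources:
  - name: Managed Service for Prometheus
    type: prometheus
    access: proxy
    # GMP's Prometheus-compatible API, the same base URL as the curl example above
    url: https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus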
Explore the DCGM metrics in Grafana
- Click the menu icon and, in the sidebar, click Explore. A dropdown menu for the list of available data sources is on the upper-left side. The Prometheus data source will already be selected. If not, choose Prometheus.
- Confirm that you’re in code mode by checking the Builder/Code toggle at the top right corner of the query panel.
- In the query editor, where it says Enter a PromQL query…, enter sum(avg_over_time(DCGM_FI_DEV_GPU_TEMP[5m])) and then press Shift + Enter. A graph should appear.
- In the top right corner, click the dropdown arrow on the Run Query button, and then select 5s. Grafana runs your query and updates the graph every 5 seconds.
You just made your first PromQL query! PromQL is a powerful query language that lets you select and aggregate time series data stored in Prometheus.
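Here are a few more DCGM queries worth exploring; exactly which metrics are available depends on the exporter's counter configuration, but these are among the defaults:
# Per-GPU utilization, averaged over the last 5 minutes (0-100)
avg by (gpu) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Framebuffer memory currently in use, in MiB
DCGM_FI_DEV_FB_USED

# Total reported power draw per node, in watts
sum by (instance) (DCGM_FI_DEV_POWER_USAGE)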
Grafana Dashboard
You can choose to build your own dashboards or import an existing one. For example, to import the NVIDIA DCGM Exporter Dashboard from Grafana's community dashboards, click Import under Grafana's Dashboards menu, then either use the dashboard ID or paste the JSON definition.
Because GMP differs slightly from a self-hosted Prometheus, you may need to change some dashboard variables. For the DCGM exporter dashboard, change the GPU variable query from label_values(gpu) to label_values(DCGM_FI_DEV_GPU_TEMP, gpu).
After updating the variable, save the change and refresh the dashboard. You should see the populated DCGM exporter dashboard.