Monitoring Dask + RAPIDS with Prometheus + Grafana
Prometheus is a popular monitoring tool within the cloud community. It has out-of-the-box integration with popular platforms including Kubernetes, OpenStack, and the major cloud vendors, and integrates with dashboarding tools like Grafana.
In this post, we will explore installing and configuring a minimal Prometheus service with Grafana as our front end and using it to monitor RAPIDS.
Prometheus overview
At its core, Prometheus is a time-series database for storing system and application metrics. It gathers metrics by polling metric exporters periodically and then allows you to query those metrics with PromQL.
It also has additional services such as pushgateway, for short-lived jobs, and alertmanager for notifying operators of issues based on metric rules.
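For example, a short-lived batch job may finish before Prometheus ever scrapes it, so it pushes its metrics to the pushgateway instead. A minimal sketch with the official Python client, assuming a pushgateway is running at localhost:9091 (an illustrative address):

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime',
          'Last time the batch job finished successfully',
          registry=registry)
g.set_to_current_time()

# Push once at the end of the job; Prometheus scrapes the gateway later
push_to_gateway('localhost:9091', job='example_batch_job', registry=registry)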
Exporting metrics
Exporting metrics from a system or application is done either by a standalone exporter or by the application itself.
Commonly, metrics are made available in a text format that is accessible at a /metrics endpoint. Applications that are already serving HTTP traffic can make this available directly with the help of client libraries, while other services may need a companion exporter which runs a web server and gathers data in a native way before exporting.
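For instance, a Python application can expose its own /metrics endpoint with the official prometheus_client library. A minimal sketch (the port, metric name, and workload are illustrative):

import time

from prometheus_client import Counter, start_http_server

# A counter our hypothetical application increments as it does work
REQUESTS = Counter('app_requests_total', 'Total requests handled')

if __name__ == '__main__':
    # Serves the default registry at http://localhost:8000/metrics
    start_http_server(8000)
    while True:
        REQUESTS.inc()
        time.sleep(1)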
Examples of standalone exporters include node_exporter, which queries hardware and OS-level metrics from *nix systems. Another is mysqld_exporter, which runs alongside a MySQL database and uses a database connection to gather metrics about the database server itself.
To instrument RAPIDS we care about exporting three sets of metrics:
- System metrics via the node_exporter.
- GPU metrics via the DCGM-exporter.
- Dask metrics that are natively exposed via the Dask dashboard’s web server.
Service discovery
When running Prometheus at scale it is common to use service discovery to allow Prometheus to automatically discover metrics endpoints.
When running Prometheus on Kubernetes, for example, the Kubernetes service discovery will use the Kubernetes API to discover all running HTTP services and will attempt to call the /metrics endpoint on each service looking for metrics.
It is also possible to write custom service discovery using either DNS records, a key/value store like Consul, or simply a text file that is periodically updated with endpoints.
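As a hedged sketch of the text-file approach, an external process could periodically rewrite a JSON file in Prometheus' file_sd format, which a file_sd_configs entry in prometheus.yml then watches; the file name and endpoints here are hypothetical:

import json

# Endpoints discovered by some external process (hypothetical values)
endpoints = ['10.51.100.43:8787', '10.51.100.43:9100']

# file_sd expects a list of target groups; Prometheus reloads the file
# automatically whenever it changes
with open('targets.json', 'w') as f:
    json.dump([{'targets': endpoints, 'labels': {'cluster': 'rapids'}}], f)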
For simplicity, we will manually configure Prometheus to monitor a single host running RAPIDS.
Installing our components
For this example, we will run RAPIDS on an Ubuntu 20.04 workstation with two NVIDIA GPUs, the latest NVIDIA drivers, and NVIDIA Docker installed.
RAPIDS
To make deployment simple, here we will be using Docker and Docker Compose. Let’s start by creating a compose file for RAPIDS.
version: "3.9"

services:
  rapids:
    image: rapidsai/rapidsai:0.18-cuda11.0-runtime-ubuntu16.04-py3.8
    ports:
      - "8888:8888" # Jupyter
      - "8786:8786" # Dask communication
      - "8787:8787" # Dask dashboard
    environment:
      JUPYTER_FG: "true"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
Here we are defining one container running the RAPIDS image, exposing all the necessary ports, setting Jupyter to run as our foreground process, and allowing access to all our GPUs.
Then we can get RAPIDS up and running.
docker-compose up -d
Now we should be able to access port 8888 in our browser to view Jupyter Lab.
Next, let’s start our Dask cluster. You can do this in a notebook (sketched below) or via the Dask Jupyter Lab Extension. Let’s click the Dask logo on the side and click NEW.
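If you prefer the notebook route, a minimal sketch using dask-cuda (which ships in the RAPIDS container) looks like this:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# LocalCUDACluster starts one worker per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# The dashboard (and its /metrics endpoint) is served from this address
print(client.dashboard_link)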
Now if we visit port 8787 in our browser, we will see the Dask dashboard.
And if we visit the /metrics endpoint, we will see text-format metrics that we can scrape with Prometheus.
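As a quick sanity check (assuming you run this on the workstation itself), we can fetch the endpoint from Python and print the first few lines of the text exposition:

import urllib.request

with urllib.request.urlopen('http://localhost:8787/metrics') as resp:
    body = resp.read().decode()

# Show a sample of the metrics Prometheus will scrape
print('\n'.join(body.splitlines()[:10]))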
Start Prometheus
Next, let’s start Prometheus and have it scrape these Dask metrics.
In our current directory, we will create a new directory with mkdir prometheus and, within that, a config file called prometheus.yml with the following contents.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: rapids
    static_configs:
      - targets: ['10.51.100.43:8787']
The IP here is the address of the workstation on our LAN; update it to match your own.
Then let’s add another service to our Docker Compose file.
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/:/etc/prometheus/
    ports:
      - "9090:9090"
Then run docker-compose up -d again.
Now we can head to port 9090 on our system and try out some PromQL queries in Prometheus. For example, we can get the number of Dask workers in our cluster with dask_scheduler_workers{job="rapids"}. Our system has two GPUs, so we can see two workers reported here.
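The same query can be issued programmatically against Prometheus' HTTP API. A hedged sketch, assuming Prometheus is reachable at localhost:9090:

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'query': 'dask_scheduler_workers{job="rapids"}'})
with urllib.request.urlopen(f'http://localhost:9090/api/v1/query?{params}') as resp:
    result = json.load(resp)

# Each element pairs a label set with a [timestamp, value] sample
for series in result['data']['result']:
    print(series['metric'], series['value'])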
Collecting more metrics
In addition to our Dask cluster metrics, we also want to collect system and GPU metrics, so let’s add those exporters as services in our docker-compose.yml file.
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    volumes:
      - '/:/host:ro,rslave'

  gpu_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
Then run docker-compose up -d again.
Now we need to update our Prometheus configuration to include these two exporters.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: rapids
    static_configs:
      - targets: ['10.51.100.43:8787'] # Dask dashboard
      - targets: ['10.51.100.43:9100'] # node_exporter
      - targets: ['10.51.100.43:9400'] # dcgm-exporter
And then we need to restart Prometheus with docker-compose restart prometheus.
Now if we head back to the Prometheus dashboard, we can perform a query like DCGM_FI_DEV_GPU_TEMP to get our GPU temperatures.
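To see these GPU metrics move, we can generate some load from the RAPIDS notebook; a small illustrative cuDF workload (sizes are arbitrary):

import cudf
import numpy as np

# Allocate and compute on the GPU while watching DCGM_FI_DEV_GPU_TEMP
df = cudf.DataFrame({
    'x': np.random.rand(10_000_000),
    'y': np.random.rand(10_000_000),
})
df['z'] = df['x'] * df['y']
print(df['z'].mean())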
Grafana dashboards
Now that we have all our metrics being collected by Prometheus, let’s install Grafana so we can make plots and dashboards.
We need to create a directory to store the Grafana config with mkdir grafana and give the Grafana user ownership of it with sudo chown -R 472 grafana.
Then let’s add one last section to our docker-compose.yml.
  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana:/var/lib/grafana
    ports:
      - "3000:3000"
And start the service with docker-compose up -d.
Now we can visit port 3000, log in with the credentials admin:admin, and run through the Grafana first-time setup.
To tell Grafana about Prometheus, click “Add your first data source” and choose a Prometheus source.
Then input the URL of the Prometheus server and click “Save & Test”. Since Grafana and Prometheus share the Docker Compose network, the service name also resolves, so http://prometheus:9090 works here too.
Then we can head back to the home page by clicking the Grafana logo and click “Create your first Dashboard”.
This gives us a new dashboard with one empty panel; click “Add an empty panel”.
Now we can enter a query and configure our plot. For this first example, let’s query the number of connected Dask workers and display it as a Stat panel. Once you’re happy with it, click Apply.
We can then keep clicking the “Add Panel” button to create plots for all the metrics we want to have on our dashboard. Take some time to experiment here and see how we can visualize all the data collected by Prometheus.
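Dashboards can also be created programmatically through Grafana’s HTTP API. A hedged sketch that creates an empty dashboard named “RAPIDS”, assuming Grafana is at localhost:3000 with the default admin:admin credentials:

import base64
import json
import urllib.request

# 'id': None asks Grafana to create a new dashboard rather than update one
payload = json.dumps({
    'dashboard': {'id': None, 'title': 'RAPIDS', 'panels': []},
    'overwrite': True,
}).encode()

req = urllib.request.Request(
    'http://localhost:3000/api/dashboards/db',
    data=payload,
    headers={
        'Content-Type': 'application/json',
        # Default credentials from the first-time setup; change in production
        'Authorization': 'Basic ' + base64.b64encode(b'admin:admin').decode(),
    },
)

with urllib.request.urlopen(req) as resp:
    print(json.load(resp))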
Some good resources when designing dashboards are:
- Prometheus Official Best Practices
- Grafana Official Best Practices
- Tips for Designing Grafana Dashboard by Percona
- Creating the perfect Grafana dashboard by Logz.io
- Grafana dashboards from basic to advanced by MetricFire
Conclusion
In this post, we used Prometheus and Grafana to instrument our RAPIDS deployment and display useful metrics on a dashboard. This allows us to gain more insight into our workflows and how they are performing on our system.
You can find the full example config files in this GitHub Gist and an interactive example dashboard on RainTank.
Monitoring with Prometheus scales from a single node to multi-node clusters. While we only covered setting up monitoring on a single node in this post, we intend to cover multi-node Kubernetes deployments in the future.