Monitoring Nvidia GPUs using API

Oleksandr
DevOops World … and the Universe
14 min read · Dec 5, 2019

This is the first of two articles related to accessing Nvidia GPUs using their APIs.

Photo credit: Channels (by Susan Hiller)

The current part covers GPU monitoring and:

  • includes an overview of the available official APIs;
  • describes certain functions and data structures they provide;
  • gives an example of monitoring GPU temperature, power consumption and fan speed in C++;
  • shows monitoring results and interprets them with comments.

Test device, drivers and libraries

All tests and examples used in this article were run on Nvidia’s GeForce GTX 950 (GV-N950XTREME C-2GD, shown below) using drivers of version 440, NVML of version 10.440 and NVAPI of version R435.

Figure 1 — test device, Nvidia’s GeForce GTX 950 on Gigabyte’s board

This device has semi-active air cooling, which means active cooling is turned on only after GPU temperature reaches a certain threshold, a nice feature which eliminates extra noise in case the GPU is loaded only occasionally. For this particular device the threshold is around 60°C. Hardware, like the design of the radiators, of course plays a big role in implementing this feature, but it is not a solely hardware feature: automatic cooling management is provided by software, namely the graphics driver.

Available APIs

First things first, it’s important to know there are 2 official APIs:

  • NVML (NVIDIA Management Library);
  • NVAPI.

Let’s take a brief look at their key abilities and differences.

NVML

This API is provided specifically for monitoring and management purposes. It is used by the nvidia-smi command-line utility, which can give reference metrics when validating correctness of API usage, and this is one of its major advantages. An example of SMI output is shown below:

Figure 2 — Example output of nvidia-smi

The output is split into 2 parts: a header with a GPU list, and a process list. The header contains information about the versions of drivers and APIs used. The GPU list enumerates detected GPU devices and reports several metrics per GPU:

  • intended fan speed (in % of max RPM) for all fans present on a device;
  • actual chip temperature (in °C);
  • actual power used and power cap (in W) for GPU, device’s memory and surrounding circuitry;
  • actual memory utilization (in %);
  • actual GPU utilization (in %).

These are the core metrics and they are enough for the example part of this article.

If the target platform is a Linux-based system, NVML is the only option. Availability on Linux-based systems is another major advantage of this API.

As for distribution, on Windows NVML is available as a .dll and already comes installed with the graphics drivers. It’s usually available at:

C:\Program Files\NVIDIA Corporation\NVSMI\nvml.dll

The command-line utility nvidia-smi is located in the same directory.

As for Linux-based systems, nvidia-smi also comes installed with the graphics drivers. NVML is available as a .so, but it has to be installed separately. For example, on CentOS 7 this library is provided by the nvidia-x11-drv-libs package, which is available in the ELRepo repository. After installation, the library for 64-bit systems is usually available at:

/usr/lib64/libnvidia-ml.so

The actual name may vary, though, as the driver’s version is appended after the .so extension. E.g., with drivers of version 440.31 the full path will be:

/usr/lib64/libnvidia-ml.so.440.31

Hence, another advantage of NVML is that it always matches the installed drivers, including the most recent ones.

Another plus is that all functions described in the documentation are exported symbols of the shared library. Hence, they can be listed directly by examining the library, and they can be loaded by name, which is pretty convenient.
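For instance, here is a minimal sketch of loading a single NVML function purely by its exported name (error handling trimmed; on Linux, link with -ldl and note that the versioned file name, like libnvidia-ml.so.440.31, may be needed if the plain .so symlink is absent):

#include <iostream>

#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// NVML reports status codes as integers; 0 means success.
using nvmlInit_t = int (*)();

int main() {
#ifdef _WIN32
    // Assumes the NVSMI directory is on the DLL search path.
    HMODULE lib = LoadLibraryA("nvml.dll");
    if (lib == nullptr) { std::cerr << "cannot load nvml.dll\n"; return 1; }
    auto nvmlInit = reinterpret_cast<nvmlInit_t>(GetProcAddress(lib, "nvmlInit_v2"));
#else
    void* lib = dlopen("libnvidia-ml.so", RTLD_LAZY);
    if (lib == nullptr) { std::cerr << "cannot load libnvidia-ml.so\n"; return 1; }
    auto nvmlInit = reinterpret_cast<nvmlInit_t>(dlsym(lib, "nvmlInit_v2"));
#endif
    if (nvmlInit == nullptr) { std::cerr << "nvmlInit_v2 not found\n"; return 1; }

    std::cout << "nvmlInit_v2 returned " << nvmlInit() << '\n';
    return 0;
}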

Additionally, Nvidia provides official Perl and Python bindings for NVML, available on CPAN and PyPI as nvidia-ml-pl and nvidia-ml-py respectively. There are no docs for the Python package, though, so one will have to download its sources and examine them, at least to get the correct package name to use in import statements.

However, in spite of having many advantages, NVML has at least a couple of drawbacks.

One such drawback is the absence of header files with definitions of data types and functions. One will have to write these definitions manually, which can become pretty annoying. Fortunately, all the info needed is described in the reference docs, so there’s no need to guess what to write.
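For example, a hand-written subset of such declarations might look roughly like this (reproduced from the signatures in the reference docs; a sketch, not a complete header):

// Hand-written subset of NVML declarations, following the reference docs.
using nvmlReturn_t = int;                    // 0 == NVML_SUCCESS
using nvmlDevice_t = struct nvmlDevice_st*;  // opaque device handle

// Sensor selector for nvmlDeviceGetTemperature.
enum nvmlTemperatureSensors_t { NVML_TEMPERATURE_GPU = 0 };

// Function pointer types matching the documented signatures,
// to be filled in via dlsym()/GetProcAddress().
using nvmlInit_t                   = nvmlReturn_t (*)();
using nvmlShutdown_t               = nvmlReturn_t (*)();
using nvmlDeviceGetCount_t         = nvmlReturn_t (*)(unsigned int* count);
using nvmlDeviceGetHandleByIndex_t = nvmlReturn_t (*)(unsigned int index, nvmlDevice_t* device);
using nvmlDeviceGetTemperature_t   = nvmlReturn_t (*)(nvmlDevice_t device,
                                                      nvmlTemperatureSensors_t sensor,
                                                      unsigned int* temperatureC);
using nvmlDeviceGetFanSpeed_t      = nvmlReturn_t (*)(nvmlDevice_t device, unsigned int* speedPercent);
using nvmlDeviceGetPowerUsage_t    = nvmlReturn_t (*)(nvmlDevice_t device, unsigned int* milliwatts);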

Another drawback of NVML is that it does not allow manual fan control: one can look at fan speeds but cannot touch them. That’s quite awful, as NVML is the only API available on Linux-based systems.

A digression on Nvidia drivers and Linux-based systems
Just in case one is going to spawn a Linux-based server and utilize an Nvidia GPU for computation, there’s one little thing to keep in mind:

To make certain functionality work, e.g. automatic fan control, X server must be installed and running.

That’s a pretty big slip. Alas.

NVAPI

If one were asked to describe NVAPI in a couple of words, the answer would be the following:

  • it’s available both as a static and as a dynamic library;
  • its dynamic library provides hidden undocumented functionality which is not available in the static version;
  • certain measurement units are different from those used in NVML;
  • it’s more fine-grained and more featureful than NVML for certain use-cases;
  • it’s exclusively for Windows.

Firstly, a little note on distribution of the libraries: Nvidia’s download center allows people registered as developers to download an SDK. It provides all the necessary C headers and static libraries precompiled for both 32- and 64-bit architectures. All data types and functions defined in the headers are documented and are pretty self-explanatory. All of that makes for a pleasant development experience.
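As a rough illustration of what working with the SDK looks like, here is a small sketch (it assumes nvapi.h from the SDK is on the include path and the matching static library, e.g. nvapi64.lib, is linked; treat it as an outline rather than a reference implementation):

#include <cstdio>

#include "nvapi.h"  // from the NVAPI SDK

int main() {
    if (NvAPI_Initialize() != NVAPI_OK) {
        std::fprintf(stderr, "NvAPI_Initialize failed\n");
        return 1;
    }

    NvPhysicalGpuHandle gpus[NVAPI_MAX_PHYSICAL_GPUS] = {};
    NvU32 gpuCount = 0;
    if (NvAPI_EnumPhysicalGPUs(gpus, &gpuCount) == NVAPI_OK && gpuCount > 0) {
        // Absolute tachometer reading in RPM (contrast with NVML's percentage).
        NvU32 rpm = 0;
        if (NvAPI_GPU_GetTachReading(gpus[0], &rpm) == NVAPI_OK) {
            std::printf("fan speed: %u RPM\n", static_cast<unsigned int>(rpm));
        }
    }

    NvAPI_Unload();
    return 0;
}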

What one is unlikely to learn from official documents is the existence of NVAPI’s dynamic library, nvapi.dll. This DLL is installed with the graphics driver and is typically available as:

C:\Windows\System32\nvapi.dll

Obviously, no docs and no headers are available, and there would be nothing wrong with calling the usage of nvapi.dll a mild form of hacking. Or masochism.

The only function exported by nvapi.dll is nvapi_QueryInterface:

Microsoft (R) COFF/PE Dumper Version 14.23.28106.4
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file nvapi.dll

File Type: DLL

  Section contains the following exports for nvapi.dll

    00000000 characteristics
    5DC35FBD time date stamp Thu Nov  7 02:05:17 2019
        0.00 version
           1 ordinal base
           1 number of functions
           1 number of names

    ordinal hint RVA      name

          1    0 0009FDE0 nvapi_QueryInterface

The function nvapi_QueryInterface is used to load other internal functions by a memory address or offset. For example, the function NvAPI_Initialize has to be loaded as follows to be used:

typedef int (*NvAPI_Initialize_t)();
NvAPI_Initialize_t NvAPI_Initialize =
    (NvAPI_Initialize_t)(*NvAPI_QueryInterface)(0x0150E828);

A “magic number” 0x0150E828 is used as a memory address. One will have to get a list of such memory addresses from… somewhere… to be able to load other functions. It’s pretty hard to tell who can guarantee the stability of code relying on such magic numbers while using dynamic libraries. But people are doing this for real.
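For completeness, here is a minimal sketch of how nvapi_QueryInterface itself is usually obtained before such calls (nothing here is official; on 64-bit builds the DLL may be named nvapi64.dll instead):

#include <windows.h>
#include <cstdio>

// The single export hands out the other functions by "magic number".
using NvAPI_QueryInterface_t = void* (*)(unsigned int id);
using NvAPI_Initialize_t     = int (*)();

int main() {
    HMODULE lib = LoadLibraryA("nvapi.dll");  // nvapi64.dll for 64-bit builds
    if (lib == nullptr) return 1;

    auto NvAPI_QueryInterface = reinterpret_cast<NvAPI_QueryInterface_t>(
        GetProcAddress(lib, "nvapi_QueryInterface"));
    if (NvAPI_QueryInterface == nullptr) return 1;

    // 0x0150E828 is the "magic number" for NvAPI_Initialize mentioned above.
    auto NvAPI_Initialize = reinterpret_cast<NvAPI_Initialize_t>(
        NvAPI_QueryInterface(0x0150E828));
    if (NvAPI_Initialize == nullptr) return 1;

    std::printf("NvAPI_Initialize returned %d\n", NvAPI_Initialize());
    return 0;
}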

Despite the obvious risks of using nvapi.dll and magic numbers, such an approach can be pretty tempting, or really the only one, as it provides access to certain functions that are not described by the official documentation and are not available in NVML.

For example, the official GPU Cooler Interface describes only a single function, NvAPI_GPU_GetTachReading, which allows one to get the GPU’s fan speed. The docs are silent regarding functions that allow controlling fan speed, but there’s a hidden function at address 0x891FA0AE for that.

Another detail to point out is measurement units: NVML and NVAPI can use different units of measure for similar values. For example, nvmlDeviceGetFanSpeed returns relative fan speed in % of max RPM, e.g. 42%, while NvAPI_GPU_GetTachReading gives absolute speed in RPM, e.g. 1291.

Next, NVML and NVAPI can provide different levels of granularity. For example, nvmlDeviceGetTemperature is the only NVML function for reading GPU temperature and it returns a single scalar per device. In contrast, NVAPI provides a function NvAPI_GPU_GetThermalSettings which reads the temperature of a specific sensor of a specific device, and currently there can be up to 3 thermal sensors per device.
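To make the difference tangible, here is a sketch of reading every thermal sensor of the first GPU through NVAPI (SDK required; the struct and enum names are paraphrased from nvapi.h and should be verified against the header):

#include <cstdio>

#include "nvapi.h"  // from the NVAPI SDK

int main() {
    if (NvAPI_Initialize() != NVAPI_OK) return 1;

    NvPhysicalGpuHandle gpus[NVAPI_MAX_PHYSICAL_GPUS] = {};
    NvU32 gpuCount = 0;
    if (NvAPI_EnumPhysicalGPUs(gpus, &gpuCount) != NVAPI_OK || gpuCount == 0) return 1;

    // Query all thermal sensors of the first GPU at once.
    NV_GPU_THERMAL_SETTINGS thermal = {};
    thermal.version = NV_GPU_THERMAL_SETTINGS_VER;
    if (NvAPI_GPU_GetThermalSettings(gpus[0], NVAPI_THERMAL_TARGET_ALL, &thermal) == NVAPI_OK) {
        for (NvU32 i = 0; i < thermal.count; ++i) {
            std::printf("sensor %u: %d C\n",
                        static_cast<unsigned int>(i),
                        static_cast<int>(thermal.sensor[i].currentTemp));
        }
    }

    NvAPI_Unload();
    return 0;
}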

Finally, again, NVAPI is for Windows only. *Sigh*.

API usage example

This section provides an example of a GPU monitor written in C++17, which polls the following core metrics:

  • temperature;
  • fan speed;
  • power consumption;
  • GPU utilization;
  • GPU’s memory utilization.

The GPU API used in the example is provided by the NVML library. This choice is driven by the need not to be limited to Windows only, so the example works both on Windows and on Linux-based systems.

Only a brief overview of the monitor will be provided here to avoid throwing too much code at readers. For those curious, there’s a full version in the nvidia-gpu-monitoring repository on GitHub.

Despite being written in C++, the logic is not limited to C++: one can use the aforementioned Python or Perl bindings to achieve the same results.

Monitor’s structure

The layout of the example’s source modules reflects the following components:

  • monitor — main component which controls the execution flow;
  • nvml — wrapper for low-level NVML functions and data structures;
  • dlib — wrapper for OS-dependent dynamic library management;
  • utils — utilities reused by multiple modules.

Associations between these components are depicted on the component diagram below:

Figure 3 — Monitor components diagram

The most important module here is nvml, so it’s worth looking into its internals. The class diagram below covers the main data types residing inside the module.

Figure 4 — nvml module classes diagram

There are 3 classes here:

  • NVML — manages the NVML dynamic library and wraps the low-level API;
  • NVMLDevice — represents a single GPU device, allows refreshing the device’s metrics and querying the device’s info;
  • NVMLDeviceManager — keeps track of a collection of GPU devices, allows getting the number of detected devices and iterating over them.

Both NVMLDevice and NVMLDeviceManager utilize an instance of NVML as an API client. NVML provides public get_* methods for querying specific information like device name, temperature, etc.
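For orientation, a condensed interface sketch of these wrappers is given below (names and shapes are illustrative and heavily simplified, not copied verbatim from the repository):

#include <cstddef>
#include <string>
#include <vector>

// Illustrative shapes only; the real classes live in the nvml module
// of the nvidia-gpu-monitoring repository.
class NVML {
public:
    NVML();   // loads the shared library, resolves functions by name, calls nvmlInit
    ~NVML();  // calls nvmlShutdown and unloads the library

    std::string get_driver_version() const;
    std::string get_nvml_version() const;
    unsigned int get_device_count() const;
    std::string get_device_name(unsigned int index) const;
    unsigned int get_device_temperature(unsigned int index) const;  // °C
    unsigned int get_device_fan_speed(unsigned int index) const;    // % of max RPM
    unsigned int get_device_power_usage(unsigned int index) const;  // mW
};

class NVMLDevice {
public:
    NVMLDevice(NVML& api, unsigned int index);
    void refresh_metrics_or_halt();  // re-reads all metrics through the NVML wrapper
private:
    NVML& api_;
    unsigned int index_;
};

class NVMLDeviceManager {
public:
    explicit NVMLDeviceManager(NVML& api);  // detects devices and builds NVMLDevice objects
    std::size_t devices_count() const;
    std::vector<NVMLDevice>::iterator begin();
    std::vector<NVMLDevice>::iterator end();
private:
    std::vector<NVMLDevice> devices_;
};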

Monitor’s behaviour

After an overview of the structure of the components, it’s time to look at their interaction and behaviour. The whole execution flow can be split into two parts:

  • initialization;
  • metrics polling.

The initialization part is responsible for loading and initializing the NVML library, loading its functions and detecting GPU devices. A sequence diagram for it is given below:

Figure 5 — Sequence diagram for initialization part of execution flow

Despite looking a bit longish, it’s just 26 lines of code from the monitor component’s perspective:

Listing 1 — Initialization of monitor
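For reference, a stripped-down standalone sketch of the same flow is shown below; it uses hand-declared NVML functions and links directly against the library (e.g. g++ init.cpp -lnvidia-ml on Linux) instead of loading it at run time as the monitor does, so it is an approximation rather than the actual Listing 1:

#include <cstdio>

// Hand-written declarations, as the driver does not ship an official header.
extern "C" {
using nvmlReturn_t = int;  // 0 == NVML_SUCCESS
using nvmlDevice_t = struct nvmlDevice_st*;

nvmlReturn_t nvmlInit_v2();
nvmlReturn_t nvmlShutdown();
nvmlReturn_t nvmlSystemGetDriverVersion(char* version, unsigned int length);
nvmlReturn_t nvmlSystemGetNVMLVersion(char* version, unsigned int length);
nvmlReturn_t nvmlDeviceGetCount_v2(unsigned int* count);
nvmlReturn_t nvmlDeviceGetHandleByIndex_v2(unsigned int index, nvmlDevice_t* device);
nvmlReturn_t nvmlDeviceGetName(nvmlDevice_t device, char* name, unsigned int length);
}

int main() {
    if (nvmlInit_v2() != 0) {
        std::fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }

    char driver[96] = {};
    char nvml[96] = {};
    nvmlSystemGetDriverVersion(driver, sizeof(driver));
    nvmlSystemGetNVMLVersion(nvml, sizeof(nvml));

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    std::printf("driver version: %s\nNVML version: %s\ndevices_count: %u\ndevices:\n",
                driver, nvml, count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t device = nullptr;
        char name[96] = {};
        nvmlDeviceGetHandleByIndex_v2(i, &device);
        nvmlDeviceGetName(device, name, sizeof(name));
        std::printf("  - index: %u\n    name: %s\n", i, name);
    }

    nvmlShutdown();
    return 0;
}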

The output of the initialization part is a simple report on the driver’s and API’s versions, along with info about the detected devices, in a YAML-like format:

driver version: 441.20
NVML version: 10.441.20
devices_count: 1
devices:
  - index: 0
    name: GeForce GTX 950
    fan_speed: 0%
    temperature: 46C
    power_usage: 12404mW
    gpu_utilization: 0%
    memory_utilization: 19%

Next comes the main part, where the metrics of each device are polled periodically. A pretty straightforward sequence diagram below shows its logic:

Figure 6 — Sequence diagram for metrics polling part of execution flow

The main interaction here is calling the refresh_metrics_or_halt() method of NVMLDevice, the same method used by NVMLDevice internally during its initialization. The listing below shows the implementation of this main part:

Listing 2 — Implementation of metrics polling
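For reference, a standalone sketch of such a polling loop is shown below (same hand-declared NVML functions and direct linking as before; a fixed number of samples and a 250 ms period stand in for the monitor’s configuration, so this is an approximation of Listing 2, not its actual code):

#include <chrono>
#include <cstdio>
#include <thread>

// Hand-written declarations for the metric getters used by the loop.
extern "C" {
using nvmlReturn_t = int;  // 0 == NVML_SUCCESS
using nvmlDevice_t = struct nvmlDevice_st*;
struct nvmlUtilization_t { unsigned int gpu; unsigned int memory; };  // % busy

nvmlReturn_t nvmlInit_v2();
nvmlReturn_t nvmlShutdown();
nvmlReturn_t nvmlDeviceGetHandleByIndex_v2(unsigned int index, nvmlDevice_t* device);
nvmlReturn_t nvmlDeviceGetFanSpeed(nvmlDevice_t device, unsigned int* speedPercent);
nvmlReturn_t nvmlDeviceGetTemperature(nvmlDevice_t device, int sensor, unsigned int* temperatureC);
nvmlReturn_t nvmlDeviceGetPowerUsage(nvmlDevice_t device, unsigned int* milliwatts);
nvmlReturn_t nvmlDeviceGetUtilizationRates(nvmlDevice_t device, nvmlUtilization_t* utilization);
}

int main() {
    if (nvmlInit_v2() != 0) return 1;

    nvmlDevice_t device = nullptr;
    if (nvmlDeviceGetHandleByIndex_v2(0, &device) != 0) return 1;

    std::puts("timestamp_ms,device_index,fan_speed,temperature,power_usage,"
              "gpu_utilization,memory_utilization");

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();

    for (int sample = 0; sample < 20; ++sample) {
        unsigned int fan = 0, temp = 0, power = 0;
        nvmlUtilization_t util = {};
        nvmlDeviceGetFanSpeed(device, &fan);
        nvmlDeviceGetTemperature(device, /*NVML_TEMPERATURE_GPU=*/0, &temp);
        nvmlDeviceGetPowerUsage(device, &power);  // reported in milliwatts
        nvmlDeviceGetUtilizationRates(device, &util);

        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            clock::now() - start).count();
        std::printf("%lld,0,%u,%u,%u,%u,%u\n",
                    static_cast<long long>(ms), fan, temp, power, util.gpu, util.memory);

        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }

    nvmlShutdown();
    return 0;
}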

The output produced is in a quite valid CSV format:

timestamp_ms,device_index,fan_speed,temperature,power_usage,gpu_utilization,memory_utilization
38944219,0,0,46,12404,0,19
38944471,0,0,46,12404,0,19
38944723,0,0,46,12404,2,20
38944975,0,0,46,12314,0,19
38945228,0,0,46,12314,0,19
38945481,0,0,46,12314,0,19
38945733,0,0,46,12510,0,19
38945988,0,0,46,27397,75,12
38946240,0,0,46,30026,26,12
38946492,0,0,46,30235,2,3
38946745,0,0,46,30136,3,3
38946999,0,0,46,30039,0,2
38947252,0,0,46,29942,1,2
38947505,0,0,46,30061,0,2
38947764,0,0,46,20521,0,11
38948020,0,0,46,29078,92,17
38948272,0,0,46,30512,7,6

That’s pretty much it. The only thing left is to visualize and interpret the collected metrics, which is the topic of the following sections.

Visualization of collected metrics

The example’s repository nvidia-gpu-monitoring contains a bunch of utilities for visualization of collected metrics. Among them:

  • data_extractor — strips header of monitor’s output and leaves only CSV-formatted data;
  • device_data_filter — filters data of a device specified by device_index, which defaults to 0;
  • data_visualizer — produces PNG figures out of device’s data.

To extract metrics data from the monitor’s output, one can run the following:

./data_extractor/run.py monitor.log > metrics_all.csv

Next, data for device 0 can be filtered as follows:

./device_data_filter/run.py metrics_all.csv > metrics_0.csv

Both commands can be piped together to avoid creating intermediary files:

./data_extractor/run.py monitor.log | ./device_data_filter/run.py > metrics_0.csv

Finally, the metrics need to be visualized. This is handled by data_visualizer, which produces a figure of the whole data set plus a set of subfigures representing a sequence of time windows at a smaller scale. Hence, the visualizer takes the data set path as input, along with a file path template used for the output figures:

data_visualizer/run.py monitor.{suffix}.png monitor.csv

For example, data collected during 35 minutes is visualized as a single figure plus 12 subfigures, each covering a subsequent 3-minute window:

Drawing full window to 'monitor.full.png'
Drawing subwindow 1 of 12 to 'monitor.1.png'
Drawing subwindow 2 of 12 to 'monitor.2.png'
Drawing subwindow 3 of 12 to 'monitor.3.png'
Drawing subwindow 4 of 12 to 'monitor.4.png'
Drawing subwindow 5 of 12 to 'monitor.5.png'
Drawing subwindow 6 of 12 to 'monitor.6.png'
Drawing subwindow 7 of 12 to 'monitor.7.png'
Drawing subwindow 8 of 12 to 'monitor.8.png'
Drawing subwindow 9 of 12 to 'monitor.9.png'
Drawing subwindow 10 of 12 to 'monitor.10.png'
Drawing subwindow 11 of 12 to 'monitor.11.png'
Drawing subwindow 12 of 12 to 'monitor.12.png'

Resulting figures are 4-dimensional, as they include 1 horizontal axis for time domain and 3 vertical axes for the following domains:

  • fan speed;
  • device temperature;
  • device power consumption.

An example of full data set visualization is shown in figure 7 below:

Figure 7 — Example of visualization of captured fan speeds, device temperatures and power consumption

Is that good or bad device behaviour? Well, although the figure above can convey certain useful information, it is pretty hard to comprehend what’s going on at such a top-level scale. Hence, the next section examines several specific cases in more detail.

Examination of collected metrics

Now it’s time to take a look at what can be seen and told from the collected data. This section gives an overview of a couple of observations worth attention.

Case #1 — Stress test

The first case is related to behaviour of the pre-cooled test GPU device during a stress test:

Figure 8 — Highlighted time window of the stress test

Here, power consumption looks like a regular square waveform, just as expected when the workload changes from minimum to maximum and back:

Figure 9 — Power consumption change during the stress test

The stress test lasts for ≈1 minute and heats the device up by ≈17°C in ≈18s while active cooling is disabled:

Figure 10 — Temperature change during the stress test

As mentioned previously, the given test device has semi-active cooling. It’s clearly visible that the fans are disabled until the temperature reaches a threshold of ≈60°C. This threshold is reached pretty quickly, in just ≈18s, which indicates very rapid heating.

Next, active cooling kicks in and tries to blow the temperature back down to the threshold. During the remaining ≈47s of the test the temperature keeps growing, but significantly more slowly.

Finally, it takes ≈46 more seconds to cool the GPU down to the original stable level of ≈44°C. This is almost ×3 longer than the active heating time. Interestingly, the fans are turned off at a temperature well below the threshold, and it’s not obvious whether there’s another threshold, or the algorithm remembers the previous cool state, or something else.

Another point of interest is behaviour of fans:

Figure 11 — Fan speed change during the stress test

As is clearly visible, several weird things are going on here.

The overshooting which happens right after the fans are turned on is not nice but understandable: an additive error is likely to accumulate while the fans are idle.

But there are also significant random spikes of up to 80% of max RPM which look absolutely anomalous. These anomalies last only for a fraction of a second, and it’s really hard to tell what causes them.

There’s one little thing to keep in mind also:

Fan speed reported by NVML is not an actual fan speed, but an intended one. NVAPI has to be used to get real tachometer values.

Case #2 — Slow heating

Another interesting case to look at is how the temperature stays at a stable level with a minimal load applied to the device, but starts slowly increasing under a slightly bigger load.

Figure 12 — Temperature change during slow heating

There’s nothing odd about that; one just has to remember during testing that it can take a significant amount of time for a device to reach high temperatures, whether because of a small load or because of a slight difference between the energy consumed and the energy radiated.

Case #3 — Vivid fan instability

Finally, there’s the most interesting case, which demonstrates how odd the stock implementation of automatic fan control can be. The figure below shows an example where a pretty constant workload is accompanied by really odd fan behaviour.

Figure 13 — Fan instability during constant workload

Certain anomalies are the same as in case #1: overshooting and random spikes. What makes this example different is that the workload is not that big, just like in case #2. Hence, the temperature changes really slowly and the fan speed should not change significantly or rapidly.

Instead, huge and random spikes are observed, causing flip-flopping. The first issue here is that such a voltage pattern will definitely not prolong the life of the fan’s drive. The second issue is the noise pattern: bursts of noise interleaved with silence can be quite annoying. Every spike looks similar to the overshooting at the beginning of fan operation, and the spikes are as random as the anomalous needle-like spikes seen while the fans are running.

Additionally, the fan behaviour between the anomalies also looks suspicious: it has the form of a smooth wide bulge, with fan speed varying from ≈10% to ≈18%, while the workload is slight and constant.

Looking at all of that together gives a strong feeling that there’s something wrong with the stock automatic fan control.

Conclusion

This article has covered the topic of monitoring core metrics of Nvidia’s GPUs using the official API libraries.

There are two main libraries from Nvidia: NVML and NVAPI. Their purposes are different in general, but for certain use cases, like basic monitoring, both of them can be used. The former one is available both for Windows and Linux-based systems while the latter one is available for Windows only. Also, NVAPI provides a better development experience when used as a static library from SDK and has hidden functionality when used as a shared library. On the other hand, the correctness of usage of NVML can be validated against the output of official tools like nvidia-smi. In several cases both libraries can provide a slightly different level of granularity and can use different units of measurement for similar functionality.

If Nvidia representatives are reading this, here’s a question for them:

When can we expect to see fan management functionality, like fan speed control, in the management library NVML?

Next, an overview of an example GPU monitor was given. It’s not hard to implement in C++, and in the case of NVML it is not limited to C++: Python or Perl can be used as well.

Finally, sample metrics from the test device were collected, visualized and examined. It’s visible that there are issues with stock automatic fan control.

With a monitor implemented and reference metrics collected, the second article on this topic will try to cover the implementation of custom fan control, which is needed for headless Linux-based systems. Stay tuned.
