Explained Output of Nvidia-smi Utility

Published in Analytics Vidhya, 4 min read, Dec 16, 2019

Hey learners!

Image from https://www.nvidia.com/en-us/about-nvidia/partners/

As Machine Learning and Deep Neural Nets evolved, computations on the CPU took a long time or could not be completed in time at all. GPUs, which were already being used for gaming, were then adopted for these workloads. To read more about GPUs and their monitoring, have a quick glance at this blog.

NVIDIA GPUs became widely used for many Machine Learning and Deep Learning models, and a multi-GPU setup needs to be monitored and managed to get its full benefit. Well, good news then! The command-line utility “nvidia-smi” is a savior. Let’s learn about it.

Nvidia-smi

Nvidia-smi (also NVSMI) is a command-line utility that monitors and manages NVIDIA GPUs such as Tesla, Quadro, GRID, and GeForce. It is installed along with the NVIDIA driver and provides you with meaningful insights.

Below is the output of the “nvidia-smi” command.

Two tables are generated as the output. The first reflects information about all available GPUs (the example above shows 1 GPU). The second table tells you about the processes using the GPUs.
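
The fields of the first table can also be read from a script instead of from the formatted table. Below is a minimal sketch, assuming nvidia-smi is on the PATH and accepts the `--query-gpu` and `--format=csv,noheader,nounits` options. The helper names and the sample line are made up for illustration and are not taken from the output above.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi (see `nvidia-smi --help-query-gpu`).
FIELDS = ["index", "name", "temperature.gpu", "pstate",
          "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Turn CSV text from nvidia-smi into a list of dicts, one per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in rows if row]


def query_gpus():
    """Run nvidia-smi (must be installed) and return the parsed rows."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout)


# Hypothetical sample line, so the parser can be tried without a GPU.
SAMPLE = "0, Tesla V100-SXM2-16GB, 44, P0, 13, 1024, 16160\n"
gpus = parse_gpu_csv(SAMPLE)
print(gpus[0]["temperature.gpu"], gpus[0]["utilization.gpu"])
```

On a machine with an NVIDIA driver installed, `query_gpus()` would return the same structure with live values, one dict per GPU.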

Let’s go one by one.

Table I

Let’s dig into it more.

Temp: Core GPU temperature in degrees Celsius. If your GPUs run in a cloud datacentre such as AWS, the provider controls the cooling, so you only need to watch this value on your own hardware. The “44C” shown in the table above is normal, but raise an alarm when it reaches 90+ C.

Perf: Denotes the GPU’s current performance state. It ranges from P0 (maximum performance) to P12 (minimum performance).

Persistence-M: The value of the Persistence Mode flag, where “On” means that the NVIDIA driver will remain loaded (persist) even when no active client such as Nvidia-smi is running. This reduces the driver load latency for dependent apps such as CUDA programs.
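
As a rough sketch of how this flag can be switched from a script: the code below only builds the command line, assuming the `-pm` (persistence mode) and `-i` (GPU id) options of nvidia-smi on Linux. The function names are hypothetical, and actually running the command typically needs root privileges.

```python
import subprocess


def persistence_mode_cmd(enable, gpu_index=None):
    """Build the nvidia-smi command that sets persistence mode.

    gpu_index=None targets all GPUs; otherwise only the given index.
    """
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    cmd += ["-pm", "1" if enable else "0"]
    return cmd


def set_persistence_mode(enable, gpu_index=None):
    """Run the command (needs nvidia-smi and usually root on Linux)."""
    subprocess.run(persistence_mode_cmd(enable, gpu_index), check=True)


# Only prints the command; nothing is changed on the machine.
print(" ".join(persistence_mode_cmd(True, gpu_index=0)))
```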

Pwr:Usage/Cap: Refers to the GPU’s current power usage out of its total power capacity, reported in watts.

Bus-Id: The GPU’s PCI bus id as “domain:bus:device.function”, in hex format, which can be used to filter the stats of a particular device.

Disp.A: Display Active is a flag that indicates whether a display is initialized on the GPU, i.e. whether memory on the GPU device is allocated for display. Here, “Off” indicates that there isn’t any display using the GPU device.
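
To make the “domain:bus:device.function” layout concrete, here is a small sketch that splits such an id into its four hexadecimal parts. The `PciBusId` type, the `parse_bus_id` helper and the sample id are illustrative assumptions, not part of nvidia-smi.

```python
import re
from typing import NamedTuple


class PciBusId(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# domain:bus:device.function, every part written in hexadecimal.
_BUS_ID = re.compile(
    r"^([0-9A-Fa-f]+):([0-9A-Fa-f]{2}):([0-9A-Fa-f]{2})\.([0-9A-Fa-f])$")


def parse_bus_id(text):
    """Split a PCI bus id string into integer components."""
    match = _BUS_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI bus id: {text!r}")
    return PciBusId(*(int(part, 16) for part in match.groups()))


# Hypothetical id; device "1E" is hexadecimal for 30.
bus_id = parse_bus_id("00000000:00:1E.0")
print(bus_id)
```

The unparsed string is what you would pass to the `-i` option of nvidia-smi to restrict its output to that device.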

Memory-Usage: Denotes the memory allocated on the GPU out of its total memory. Tensorflow or Keras (TensorFlow backend) automatically allocates nearly all of the memory when launched, even if it does not need it. Hence, have a glance at GPU on Keras and Tensorflow, which targets the solution with more interesting information.
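
The cell is printed as “used / total” in MiB, so the share of memory in use is easy to derive. A minimal sketch, assuming the cell text looks like the hypothetical string in the code; `memory_usage` is an illustrative helper name.

```python
import re

# Matches text such as "1024MiB / 16384MiB".
_MEMORY = re.compile(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB")


def memory_usage(cell):
    """Return (used_mib, total_mib, percent_used) from a Memory-Usage cell."""
    match = _MEMORY.search(cell)
    if match is None:
        raise ValueError(f"unrecognised Memory-Usage cell: {cell!r}")
    used, total = (int(group) for group in match.groups())
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent


# Hypothetical cell content, not taken from the table above.
used, total, percent = memory_usage("1024MiB / 16384MiB")
print(f"{used} of {total} MiB in use ({percent:.2f}%)")
```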

Volatile Uncorr. ECC: ECC stands for Error Correction Code, which verifies data by locating and correcting errors. NVIDIA GPUs provide a count of ECC errors. Here, the volatile counter tracks the number of uncorrected errors detected since the driver was last loaded.

GPU-Util: Indicates the percent of GPU utilization, i.e. the percent of time over the sample period during which one or more kernels were executing on the GPU. The sample period may be between 1 second and 1/6 second. For instance, the output in the table above shows 13%. A low percentage means the GPU was under-utilized, for example when the code spends its time reading data (mini-batches) from disk.
More detailed reference: https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t
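
To illustrate that definition with arithmetic: the sketch below computes the share of a sample period covered by kernel activity. It only mirrors the definition and is not how the driver measures utilization. The interval values are invented so that the result comes out at 13%, as in the table.

```python
def gpu_util_percent(busy_intervals, period_start, period_end):
    """Percent of [period_start, period_end) covered by busy intervals.

    busy_intervals: list of (start, end) times during which at least one
    kernel was running; intervals may overlap or extend past the period.
    """
    # Keep only the parts of each interval that fall inside the period.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in busy_intervals
        if end > period_start and start < period_end
    )
    busy = 0.0
    cursor = period_start
    for start, end in clipped:
        start = max(start, cursor)  # skip time that was already counted
        if end > start:
            busy += end - start
            cursor = end
    return 100.0 * busy / (period_end - period_start)


# Hypothetical kernel activity in milliseconds over a 1-second period.
intervals = [(0, 50), (40, 100), (500, 530)]
print(gpu_util_percent(intervals, 0, 1000))
```

The first two intervals overlap and together cover 100 ms, the third adds 30 ms, so 130 ms of the 1000 ms period were busy.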

Compute M.: The Compute Mode of a specific GPU refers to its shared access mode; the compute mode is reset to “Default” after each reboot. The “Default” value allows multiple clients to access the GPU at the same time.

So, that was about the GPUs being used by processes. Now, let’s walk through the second table, which gives an idea about each process using a GPU.

Table II

GPU: Indicates the GPU index, beneficial in a multi-GPU setup. It shows which process is utilizing which GPU. This index represents the NVML index of the device.

PID: The ID of the process using the GPU.

Type: Refers to the type of process: “C” (Compute), “G” (Graphics), or “C+G” (Compute and Graphics context).

Process Name: Self-explanatory

GPU Memory Usage: Memory of the specific GPU utilized by each process.

Other metrics and detailed descriptions are given on the Nvidia-smi manual page.
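
The per-process rows can be aggregated in a script as well. A minimal sketch, assuming CSV output from the `--query-compute-apps` option of nvidia-smi (which lists compute processes only); the helper names and sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Command that would produce this CSV (assumes nvidia-smi is installed):
#   nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name,used_memory
#              --format=csv,noheader,nounits
APP_FIELDS = ["gpu_bus_id", "pid", "process_name", "used_memory"]


def parse_apps(text):
    """Turn the CSV text into a list of dicts, one per process."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(APP_FIELDS, row)) for row in rows if row]


def memory_by_gpu(apps):
    """Sum the memory (MiB) used by compute processes on each GPU."""
    totals = defaultdict(int)
    for app in apps:
        value = app["used_memory"]
        if value.isdigit():  # skip non-numeric entries such as "[N/A]"
            totals[app["gpu_bus_id"]] += int(value)
    return dict(totals)


# Hypothetical rows, so the helpers can be tried without a GPU.
SAMPLE = (
    "00000000:00:1E.0, 1234, python, 1024\n"
    "00000000:00:1E.0, 5678, python, 512\n"
)
apps = parse_apps(SAMPLE)
totals = memory_by_gpu(apps)
print(totals)
```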

Happy Reading!

You can get in touch with me via LinkedIn.


Written by Shachi Kaul