Minimal cuDNN C++ Example
cuDNN is a library developed by Nvidia that provides optimised GPU implementations of neural network primitives (convolutions, activations, etc.). cuDNN is used under the hood by most popular high-level neural network libraries, including PyTorch and TensorFlow. While Nvidia ships some sample code with the library installation (including examples of RNNs, convolutions, and MNIST), these samples are relatively large (1000+ lines of code) and span multiple source files. This article contains a small (< 70 lines of code) cuDNN example that:
- creates a tensor
- applies the sigmoid function on it
- and prints the output
The full code for this is:
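The listing below is a sketch of that program, reconstructed from the walkthrough that follows using only standard CUDA runtime and cuDNN calls (`cudnnCreate`, `cudnnSetTensor4dDescriptor`, `cudnnSetActivationDescriptor`, `cudnnActivationForward`); this sketch includes `cuda_runtime.h` (which declares the `cudaGetDevice*` runtime calls), and the exact line numbers referenced in the walkthrough may differ slightly from it.

```cpp
// hw.cpp -- minimal cuDNN example: create a tensor, apply sigmoid, print it.
// Error checking is omitted for brevity.
#include <iostream>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
    // Report how many GPUs are connected and select GPU 0.
    int numGPUs;
    cudaGetDeviceCount(&numGPUs);
    std::cout << "Found " << numGPUs << " GPUs." << std::endl;
    cudaSetDevice(0);

    // Print the compute capability of the selected GPU.
    int device;
    cudaGetDevice(&device);
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, device);
    std::cout << "Compute capability:" << devProp.major << "."
              << devProp.minor << std::endl;

    // Create a cuDNN handle, needed by all subsequent cuDNN calls.
    cudnnHandle_t handle;
    cudnnCreate(&handle);
    std::cout << "Created cuDNN handle" << std::endl;

    // Describe the tensor: NCHW layout, float, 1 image, 1 channel, 1 row, 10 columns.
    cudnnTensorDescriptor_t xDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 1, 1, 10);

    // Allocate the tensor in GPU-accessible (managed) memory and fill it.
    float *x;
    cudaMallocManaged(&x, 10 * sizeof(float));
    for (int i = 0; i < 10; ++i) x[i] = static_cast<float>(i);
    std::cout << "Original array:";
    for (int i = 0; i < 10; ++i) std::cout << " " << x[i];
    std::cout << std::endl;

    // Describe the activation function: sigmoid.
    cudnnActivationDescriptor_t actDesc;
    cudnnCreateActivationDescriptor(&actDesc);
    cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_SIGMOID,
                                 CUDNN_PROPAGATE_NAN, 0.0);

    // Apply the sigmoid in place: x = alpha * sigmoid(x) + beta * x.
    float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, actDesc, &alpha, xDesc, x, &beta, xDesc, x);

    // Destroy the handle, wait for the GPU, and print the result.
    cudnnDestroy(handle);
    std::cout << "Destroyed cuDNN handle." << std::endl;
    cudaDeviceSynchronize();
    std::cout << "New array:";
    for (int i = 0; i < 10; ++i) std::cout << " " << x[i];
    std::cout << std::endl;

    // Free the GPU memory.
    cudaFree(x);
    return 0;
}
```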
What the code is doing:
- Lines 1–3 include the headers we'll need: `iostream` for general IO, `cuda.h` for interacting with the GPU, and `cudnn.h`.
- Lines 11–13 print how many GPUs are found connected to your computer. In this example, I tell it to use GPU 0 (line 14).
- Lines 15–19 use the CUDA library functions `cudaGetDevice` and `cudaGetDeviceProperties` to print the compute capability of the selected GPU.
- The first thing to do when using cuDNN is to create a handle (lines 21–23). You will need to use this handle in all subsequent cuDNN functions.
- In cuDNN, whenever we create a tensor, we need to provide cuDNN with information about it. This is done by creating a descriptor for our tensor (lines 26–32). The cuDNN documentation explains what the `NCHW` data format is.
- Lines 35–39 create the tensor and print it to screen. The array created is a simple one: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`. When using cuDNN, all data/tensors must be allocated on the GPU.
- Similar to how we created a descriptor that explains what our tensor looks like, we need to create a descriptor for the activation function too (lines 42–48).
- Line 50 has the function call that does the sigmoid activation (finally): `cudnnActivationForward`. You can see what each argument is supposed to be in the cuDNN API reference.
- Line 61 destroys the handle that we created in lines 21–23.
- Lines 62–65 print the processed array.
- Line 66 frees the memory that we allocated on the GPU.
To compile the code, copy it, save it in a file `hw.cpp`, and use these bash commands:
g++ -I/usr/local/cuda/include -I/usr/local/cuda/targets/ppc64le-linux/include -o hw.o -c hw.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -o hw hw.o -I/usr/local/cuda/include -I/usr/local/cuda/targets/ppc64le-linux/include -L/usr/local/cuda/lib64 -L/usr/local/cuda/targets/ppc64le-linux/lib -lcublasLt -lcudart -lcublas -lcudnn -lstdc++ -lm
Some small things:
- I ran this code on a machine running Ubuntu 16.04 LTS with Nvidia driver version 465.19.01, CUDA version 11.3, and cuDNN 8.2.1.32-1, on a Tesla V100 GPU. The V100 has a compute capability of 7.0 (you can check the compute capability of your GPU model on the Nvidia website). Based on the compute capability of your GPU, you might have to change `arch=compute_70,code=sm_70` in the second bash command. For example, if your compute capability is 5.0, you'll have to change it to `arch=compute_50,code=sm_50` instead.
- The first bash command compiles the C++ code but doesn't link the libraries (the `-c` flag prevents the linking from happening).
- The second bash command uses `nvcc` to compile, link, and create the final executable.
- Run the executable by running `./hw`.
The output that I got was:
Found 4 GPUs.
Compute capability:7.0
Created cuDNN handle
Original array: 0 1 2 3 4 5 6 7 8 9
Destroyed cuDNN handle.
New array: 0.5 0.731059 0.880797 0.952574 0.982014 0.993307 0.997527 0.999089 0.999665 0.999877
To check whether the answer we got is correct, let's quickly verify in Python:
>>> import numpy as np
>>> arr = np.asarray([0,1,2,3,4,5,6,7,8,9])
>>> np.reciprocal(np.exp(-1 * arr) + 1)
array([0.5 , 0.73105858, 0.88079708, 0.95257413, 0.98201379,
0.99330715, 0.99752738, 0.99908895, 0.99966465, 0.99987661])
>>>
So, that’s it — a simple cuDNN program!