CUDA Optimization Design Tradeoff for Autonomous Driving

CNN inference speed is critical for autonomous driving: any delay in inference processing might cause an accident, so optimizing inference speed matters in practice.

Convolutional Layer Optimization Design Tradeoff

For the source code referred to in this article, go to my GitHub site. You can find the unoptimized Python implementation of the convolution layer here.

The most computation-intensive part of a CNN is the convolution operation. To optimize it, one can turn the convolution into a matrix multiplication. The optimized CUDA matrix multiplication library cuBLAS can then be used to perform the matrix multiplication on the GPU.

If using cuBLAS, one follows the steps below to optimize the convolution layer:

1. Stretch (flatten) each filter into a column vector and concatenate the columns into the Kernel Matrix B.

2. Use im2col to stretch the features at different locations of the input to column vectors and concatenate them into the Input Features Matrix A.

3. Transpose the Input Features Matrix A. Then use cublasSgemm [4] in the cuBLAS library to perform matrix multiplication between the Input Features Matrix and the Kernel Matrix. This generates the Output Features Matrix. See the Matrix Multiplication CUDA C++ code here.

4. Use col2im to reshape the Output Features Matrix to the proper output dimensions.
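
The four steps above can be sketched on the CPU with NumPy. This is a minimal illustration, not the code from my GitHub site: the function names im2col and conv2d_gemm are hypothetical, the input is assumed to be a single image of shape (channels, height, width) with no padding, and the NumPy matrix product stands in for the cublasSgemm call that would run on the GPU.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Stretch every kh x kw patch of x (C, H, W) into one column of matrix A."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = x[:, top:top + kh, left:left + kw]
            cols[:, i * out_w + j] = patch.reshape(-1)
    return cols, out_h, out_w

def conv2d_gemm(x, filters, stride=1):
    """Convolution (cross-correlation, no padding) as one matrix multiplication.

    x: (C, H, W), filters: (K, C, kh, kw). Returns (K, out_h, out_w).
    """
    k, c, kh, kw = filters.shape
    # Step 1: each filter becomes one column of the Kernel Matrix.
    kernel_matrix = filters.reshape(k, c * kh * kw).T
    # Step 2: each input patch becomes one column of the Input Features Matrix A.
    a, out_h, out_w = im2col(x, kh, kw, stride)
    # Step 3: transpose A and multiply (cublasSgemm would do this on the GPU).
    out = a.T @ kernel_matrix              # shape: (num_patches, K)
    # Step 4: rearrange the result into the output feature maps.
    return out.T.reshape(k, out_h, out_w)
```

With a (3, 6, 5) input and four 3 x 3 filters, conv2d_gemm returns an array of shape (4, 4, 3). The Python loops in im2col are only there for readability; they are exactly the part a GPU implementation parallelizes.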

However, this matrix multiplication approach trades memory for speed: because neighboring filter locations overlap, the same input values are duplicated many times in the Input Features Matrix A.
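
To get a feel for the size of this duplication, the sketch below compares the number of elements in the Input Features Matrix A with the number of elements in the original input. The function name im2col_blowup is hypothetical, and the formula assumes no padding.

```python
def im2col_blowup(c, h, w, kh, kw, stride=1):
    """Ratio of elements in the im2col matrix A to elements in the input (C, H, W)."""
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    input_elems = c * h * w
    # One column of length C*kh*kw for every output location.
    matrix_elems = (c * kh * kw) * (out_h * out_w)
    return matrix_elems / input_elems

# A 3 x 3 filter with stride 1 on a 64 x 56 x 56 input.
ratio = im2col_blowup(64, 56, 56, 3, 3)
print(f"A holds about {ratio:.1f}x as many values as the input")
```

For a 3 x 3 filter with stride 1, the ratio approaches 9 as the input grows, since almost every input value appears in nine overlapping patches. A 1 x 1 filter causes no duplication at all.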

cuDNN mitigates this memory issue by using a “lazy tile fetching” trick:

Fixed-size submatrices of the Input Features Matrix A and the Kernel Matrix B are successively read into on-chip memory and are then used to compute a submatrix of the Output Features Matrix C. cuDNN computes on tiles of A and B while fetching the next tiles of A and B from off-chip memory into on-chip caches and other memories. This technique hides the memory latency associated with the data transfer, allowing the matrix multiplication computation to be limited only by the time it takes to perform the arithmetic. [5]
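
The tiling pattern itself can be sketched in NumPy as below. This is only an illustration of computing C one submatrix at a time from fixed-size tiles of A and B; it does not reproduce what cuDNN actually does on the GPU, where the next tiles are fetched while the current ones are being multiplied and the tiles of A are built on the fly rather than read from a fully materialized matrix. The function name tiled_matmul and the tile size of 32 are assumptions for the example.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Compute C = A @ B one output tile at a time from tiles of A and B."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions of A and B must match")
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # Accumulator for one submatrix of C (kept in registers on a GPU).
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=c.dtype)
            for p in range(0, k, tile):
                # On a GPU these tiles would sit in on-chip memory.
                a_tile = a[i:i + tile, p:p + tile]
                b_tile = b[p:p + tile, j:j + tile]
                acc += a_tile @ b_tile
            c[i:i + tile, j:j + tile] = acc
    return c
```

Only one tile of A and one tile of B are needed at any moment, which is why the full Input Features Matrix never has to exist in off-chip memory at once.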

You can call cudnnConvolutionForward in the cuDNN library to perform all of the cuBLAS steps mentioned above with good memory usage; cuDNN does all of this behind the scenes.

Theano [6], TensorFlow [7], Torch [15], and Caffe [9] all support cuDNN to optimize their deep learning functions.

In Theano, if cuDNN is available, by default, it will replace all theano.tensor.nnet.conv2d operations with theano.sandbox.cuda.dnn.dnn_conv (the Theano wrapper function for cudnnConvolutionForward); otherwise it will fall back to using the cublasSgemm version (slower than cuDNN in most cases and uses more memory) [10].

In TensorFlow, the tf.nn.conv2d function uses cuDNN by default (use_cudnn_on_gpu = True by default) [11]. TensorFlow is also capable of running the CNN on multiple GPUs. Two optimization design tradeoffs are considered here [12]:

  1. Using asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, using fully synchronous updates will be as slow as the slowest model replica. To solve this issue, one can place an individual model replica on each GPU and update model parameters synchronously by waiting for all GPUs to finish processing a batch of data. Each GPU computes inference as well as the gradients for a unique batch of data. This setup effectively permits dividing up a larger batch of data across the GPUs.
  2. However, this approach requires that all GPUs share the model parameters. This leads to another design tradeoff issue: transferring data to and from GPUs is quite slow. To solve this issue, one can store and update all model parameters on the CPU (see green box). A fresh set of model parameters is transferred to each GPU when a new batch of data is processed by all GPUs. The GPUs are synchronized in operation. All gradients are accumulated from the GPUs and averaged (see green box). The model parameters are updated with the gradients averaged across all model replicas.
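
The synchronous scheme described in these two points can be sketched in plain NumPy. This is not TensorFlow code: the replicas are simulated one after another on the CPU, the model is a toy linear regression, and the names replica_gradient and synchronous_step are hypothetical. It only shows the data flow: split a batch across replicas, give each a fresh copy of the parameters, average the gradients, and update once.

```python
import numpy as np

def replica_gradient(w, x, y):
    """Mean-squared-error gradient of a linear model on one replica's shard."""
    err = x @ w - y
    return 2.0 * x.T @ err / len(y)

def synchronous_step(w, x_batch, y_batch, num_replicas, lr=0.1):
    """One synchronous update with parameters held centrally (the CPU's role)."""
    # Divide the larger batch into one unique shard per simulated GPU.
    x_shards = np.array_split(x_batch, num_replicas)
    y_shards = np.array_split(y_batch, num_replicas)
    # Each replica receives a fresh copy of the same parameters.
    grads = [replica_gradient(w.copy(), xs, ys)
             for xs, ys in zip(x_shards, y_shards)]
    # Wait for every replica, then average the gradients and update once.
    avg_grad = np.mean(grads, axis=0)
    return w - lr * avg_grad

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
y = x @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = synchronous_step(w, x, y, num_replicas=4)
```

When the shards have equal size, the averaged gradient equals the gradient of the full batch, so no replica ever works from stale parameters. The price is that every step waits for the slowest replica.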

In Caffe, you have the option to enable cuDNN by setting USE_CUDNN := 1 during the installation [9]. Otherwise, its default convolution implementation [14] uses the cublasSgemm approach, and the default implementation is known to have certain memory issues [8].

In Torch, the cuDNN bindings [15] expose two tuning flags. By default, cudnn.benchmark is set to “false”, so the convolution algorithm is chosen by built-in heuristics rather than by the cuDNN auto-tuner. Setting it to “true” will improve performance, at the expense of using more memory. Likewise, cudnn.fastest is set to “false” by default. It should be set to “true” if memory is not an issue and you want the fastest performance.

So which framework has the best overall performance in both memory and speed for training and inference? It really depends on the network architecture you are implementing. Whether the framework has a large community to support it also matters when choosing one. I will compare the different deep learning frameworks for autonomous driving in another blog post.

This part of the blog post is incomplete. It will be updated soon…

References

[1] High Performance Convolutional Neural Networks for Document Processing

[2] cuBLAS :: CUDA Toolkit Documentation — NVIDIA Documentation

[3] NVIDIA cuDNN | NVIDIA Developer

[4] cuBLAS library reference

[5] Chetlur et al., cuDNN: Efficient Primitives for Deep Learning

[6] Theano with cuDNN

[7] TensorFlow with cuDNN

[8] Caffe convolution with GPU

[9] Caffe with cuDNN

[10] Theano convolution with GPU

[11] TensorFlow convolution with GPU

[12] TensorFlow convolution with multiple GPUs

[13] convnet-benchmarks

[14] convolution layer in Caffe

[15] Torch with cuDNN

[16] Comparative Study of Deep Learning Software Frameworks

[17] Benchmarking State-of-the-Art Deep Learning Software Tools

[18] Lenet

[19] AlexNet

[20] Deep Residual Learning for Image Recognition

[21] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning