CUDA Optimization Design Tradeoff for Autonomous Driving
CNN inference speed is critical for autonomous driving in practice: any delay in inference processing might cause an accident, so optimizing inference speed is important.
Convolutional Layer Optimization Design Tradeoff
The most computation-intensive part of a CNN is the convolution operation. One way to optimize it is to turn it into a matrix multiplication, which can then run on the GPU using cuBLAS, the optimized CUDA matrix multiplication library.
If using cuBLAS, one follows the steps below to optimize the convolution layer:
1. Use im2col to stretch the filters into column vectors and concatenate them into the Kernel Matrix B.
2. Use im2col to stretch the features at different locations of the input to column vectors and concatenate them into the Input Features Matrix A.
3. Transpose the Input Features Matrix A. Then use cublasSgemm in the cuBLAS library to perform matrix multiplication between the Input Features Matrix and the Kernel Matrix. This generates the Output Features Matrix. See the Matrix Multiplication CUDA C++ code here.
4. Use col2im to shape the Output Features Matrix to the proper output dimensions.
However, this matrix multiplication approach trades memory for speed: because neighboring patches of the input overlap, the same input value is copied into the Input Features Matrix A many times.
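The steps above, and the memory cost they incur, can be sketched in plain Python. This is only an illustration under simplifying assumptions: a single-channel image, stride 1, no padding, and a triple-loop matmul standing in for cublasSgemm. The helper names (im2col, matmul, conv2d_im2col, conv2d_direct) are made up for this example, and the matrix layout (one filter per row, one patch per column) is just one of several equivalent choices.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column.

    Returns a matrix with kh*kw rows and out_h*out_w columns
    (stride 1, no padding), plus the output height and width.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = [[0.0] * (out_h * out_w) for _ in range(kh * kw)]
    for oy in range(out_h):
        for ox in range(out_w):
            col = oy * out_w + ox
            for ky in range(kh):
                for kx in range(kw):
                    cols[ky * kw + kx][col] = image[oy + ky][ox + kx]
    return cols, out_h, out_w


def matmul(a, b):
    """Plain matrix product; a stand-in for cublasSgemm in this sketch."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def conv2d_im2col(image, filters):
    """Apply several kh x kw filters to one image with a single matmul."""
    kh, kw = len(filters[0]), len(filters[0][0])
    # Step 1: flatten each filter into one row of the kernel matrix.
    kernel_matrix = [[v for row in f for v in row] for f in filters]
    # Step 2: unroll the input patches into the input features matrix.
    patches, out_h, out_w = im2col(image, kh, kw)
    # Step 3: one matrix multiplication computes every output position.
    out = matmul(kernel_matrix, patches)
    # Step 4: reshape each output row into an out_h x out_w feature map.
    maps = [[r[y * out_w:(y + 1) * out_w] for y in range(out_h)] for r in out]
    return maps, patches


def conv2d_direct(image, f):
    """Reference sliding-window implementation, used only as a check."""
    kh, kw = len(f), len(f[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * f[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)] for y in range(out_h)]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
filters = [[[1.0, 0.0], [0.0, -1.0]], [[0.5, 0.5], [0.5, 0.5]]]
maps, patches = conv2d_im2col(image, filters)

input_elems = len(image) * len(image[0])        # values in the original input
patch_elems = len(patches) * len(patches[0])    # values after im2col
print(maps[0] == conv2d_direct(image, filters[0]))
print(input_elems, patch_elems)
```

For this 4x4 input and 2x2 filters, the unrolled matrix holds 36 values against 16 in the original image. The duplication grows with the filter size, which is the memory cost described above.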
cuDNN mitigates this memory issue with a “lazy tile fetching” trick:
Fixed-size submatrices (tiles) of the Input Features Matrix A and the Kernel Matrix B are successively read into on-chip memory and used to compute a submatrix of the Output Features Matrix C. While it computes on the current tiles of A and B, it fetches the next tiles of A and B from off-chip memory into on-chip caches and other memories. This technique hides the memory latency associated with the data transfer, so the matrix multiplication is limited only by the time it takes to perform the arithmetic.
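The tiling idea can be illustrated with a blocked matrix multiplication in plain Python. This is a rough sketch, not cuDNN's actual implementation: the function name tiled_matmul and the tile size are made up for the example, and sequential Python cannot show the overlap of computation with prefetching, only the fact that each step touches small tiles of A and B rather than the whole matrices.

```python
def tiled_matmul(a, b, tile=2):
    """Blocked matrix product C = A * B, computed one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for p0 in range(0, k, tile):  # walk along the shared dimension
                # "Fetch" one tile of A and one tile of B. On a GPU these
                # copies would land in on-chip memory, and the next pair
                # would be fetched while this pair is being multiplied.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Accumulate the partial product into the matching C tile.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j]
                            for p in range(len(b_tile)))
    return c


a = [[float(i + j) for j in range(5)] for i in range(3)]      # 3 x 5
b = [[float(i * j + 1) for j in range(4)] for i in range(5)]  # 5 x 4
naive = [[sum(a[i][p] * b[p][j] for p in range(5)) for j in range(4)]
         for i in range(3)]
print(tiled_matmul(a, b, tile=2) == naive)
```

The slicing handles matrix dimensions that are not multiples of the tile size, so the edge tiles are simply smaller.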
You can call cudnnConvolutionForward in the cuDNN library to get the result of all the cuBLAS steps above with better memory usage; cuDNN does all of this behind the scenes.
Theano, TensorFlow, Torch, and Caffe all support cuDNN to optimize their deep learning functions.
In Theano, if cuDNN is available, by default it replaces all theano.tensor.nnet.conv2d operations with theano.sandbox.cuda.dnn.dnn_conv (the Theano wrapper function for cudnnConvolutionForward); otherwise it falls back to the cublasSgemm version, which is slower than cuDNN in most cases and uses more memory.
In TensorFlow, the tf.nn.conv2d function uses cuDNN by default (use_cudnn_on_gpu = True by default). TensorFlow is also capable of running the CNN on multiple GPUs. Two optimization design tradeoffs are considered here:
- Using asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, using fully synchronous updates will be as slow as the slowest model replica. To solve this issue, one can place an individual model replica on each GPU and update model parameters synchronously by waiting for all GPUs to finish processing a batch of data. Each GPU computes inference as well as the gradients for a unique batch of data. This setup effectively permits dividing up a larger batch of data across the GPUs.
- However, this approach requires all GPUs to share the model parameters. This leads to another design tradeoff: transferring data to and from GPUs is quite slow. To solve this issue, one can store and update all model parameters on the CPU (see green box). A fresh set of model parameters is transferred to each GPU when a new batch of data is processed by all GPUs. The GPUs are synchronized in operation. All gradients are accumulated from the GPUs and averaged (see green box). The model parameters are then updated with the gradients averaged across all model replicas.
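The synchronous scheme described in the two points above can be simulated in a few lines of plain Python. This is a toy sketch, not TensorFlow code: the "replicas" are ordinary function calls in one process, the model is a one-variable linear regression, and the names gradient and sync_step are made up for the example. It only shows the data flow: one master copy of the parameters, one shard of the batch per replica, and a single update from the averaged gradients.

```python
def gradient(w, b, shard):
    """Mean-squared-error gradient of y = w*x + b over one shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def sync_step(params, batch, num_replicas, lr):
    """One synchronous update: shard the batch, average gradients, update."""
    w, b = params  # master copy, playing the role of the CPU-side parameters
    size = len(batch) // num_replicas
    shards = [batch[i * size:(i + 1) * size] for i in range(num_replicas)]
    # Every replica gets the same fresh parameters and its own shard.
    grads = [gradient(w, b, shard) for shard in shards]
    # Barrier: all replicas are done, so average their gradients.
    gw = sum(g[0] for g in grads) / num_replicas
    gb = sum(g[1] for g in grads) / num_replicas
    # A single update of the master copy from the averaged gradients.
    return w - lr * gw, b - lr * gb


data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # samples of y = 3x + 1
params = (0.0, 0.0)
for _ in range(5000):
    params = sync_step(params, data, num_replicas=4, lr=0.01)
print(params)
```

With equally sized shards, the average of the per-replica gradients equals the gradient over the whole batch, which is why this setup behaves like training on one larger batch split across the GPUs.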
In Caffe, you have the option to enable cuDNN by setting USE_CUDNN := 1 during the installation. Otherwise, its default convolution implementation uses the cublasSgemm approach, and the default implementation is known to have certain memory issues.
In Torch, the cudnn bindings do not run cuDNN's auto-tuner by default (cudnn.benchmark is set to “false”), so convolution algorithms are chosen by built-in heuristics that are not always the fastest. Setting it to “true” improves performance at the expense of using more memory. Likewise, cudnn.fastest is set to “false” by default. It should be set to “true” if memory is not an issue and you want the fastest performance.
So which framework has the best overall performance in both memory and speed for training and inference? It really depends on the network architecture you are implementing, and also on whether the framework has a large community to support it. I will compare the different deep learning frameworks for autonomous driving in another blog post.
This part of the blog post is incomplete. It will be updated soon…
Caffe with cuDNN
Torch with cuDNN