On Vectorization of Convolution Layer in Convolution Neural Networks (CNNs)
Unboxing the black box!
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks. CNNs have shown remarkable state-of-the-art performance in many applications, such as image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, and natural language processing.
In this article, I will focus only on vectorizing a single convolution layer, not the whole convolutional neural network.
Convolution Operation
In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.
A convolution is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.
In mathematics (in particular, functional analysis), convolution is a mathematical operation on two functions (f and g) that produces a third function (f * g) that expresses how the shape of one is modified by the other. The general form of matrix convolution is given by:
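The equation image from the original article is not reproduced here; one standard form of the discrete 2D convolution of an image F with a kernel K, consistent with the description above, is:

```latex
(K * F)[x, y] = \sum_{i} \sum_{j} K[i, j]\, F[x - i,\; y - j]
```

In practice, deep-learning frameworks implement cross-correlation (the kernel is not flipped), which differs from this form only by the sign of the indices i and j.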
Intuitively, a convolution allows for weight sharing — reducing the number of effective parameters — and image translation (allowing for the same feature to be detected in different parts of the input space).
In 3D, the convolution operation can be visualized as the kernel sliding over the width and height of the input volume, producing one output value at each position.
Depending on the type of kernel, different features can be extracted from the input image.
The output of the convolution operation is generally termed a "feature map".
This convolution operation is the backbone of convolutional neural networks.
Let’s code the Convolution Operation
Python code for a simple convolution operation on a single input image will look something like this:
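The original code listing is not reproduced here; a minimal sketch of a naive single-channel, stride-1, "valid" convolution (implemented as cross-correlation, as in most deep-learning frameworks) might look like this. The function name and signature are hypothetical:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Naive 2D convolution of one single-channel image with one kernel.
    image:  (H, W) array, kernel: (kH, kW) array.
    Returns the (H-kH+1, W-kW+1) feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h, out_w = H - kH + 1, W - kW + 1  # "valid" padding, stride 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch it sits on,
            # then sum the products into a single output value.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out
```

The two explicit loops over output positions make the sliding-window behaviour obvious, but they are exactly what vectorization will later remove.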
The above operation will give the following results,
But the above implementation performs the convolution on only one input image. What if multiple images are fed into the network (e.g., a mini-batch)? The simplest approach that comes to mind is to loop the above operation over the whole mini-batch. A 2D black-and-white image has 1 channel (e.g., (28, 28, 1)), whereas a colored image has 3 channels (RGB) (e.g., (28, 28, 3)). Hence, the code for this setting will look something like this:
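Again, the original listing is omitted; a hypothetical sketch of the looped mini-batch version, using the (N, H, W, C) layout described below, could be:

```python
import numpy as np

def conv2d_batch_loop(images, kernels):
    """Looped convolution over a mini-batch.
    images:  (N, H, W, C_in), e.g. (60000, 28, 28, 1)
    kernels: (kH, kW, C_in, C_out), e.g. (3, 3, 1, 32)
    Returns: (N, H-kH+1, W-kW+1, C_out)."""
    N, H, W, C_in = images.shape
    kH, kW, _, C_out = kernels.shape
    out_h, out_w = H - kH + 1, W - kW + 1
    out = np.zeros((N, out_h, out_w, C_out))
    for n in range(N):                  # loop over every image in the batch
        for f in range(C_out):          # loop over every filter
            for i in range(out_h):
                for j in range(out_w):
                    patch = images[n, i:i + kH, j:j + kW, :]
                    out[n, i, j, f] = np.sum(patch * kernels[:, :, :, f])
    return out
```

Four nested Python loops is precisely the pattern that makes this approach slow on large datasets.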
Let’s apply this to the handwritten digits — the MNIST dataset. MNIST consists of 60,000 training images of size 28 x 28, so the input size will be (60000, 28, 28, 1). After applying a convolution layer with a kernel size of 3 x 3 x 1 and 32 filters (output channels), the output of the layer will be (60000, 26, 26, 32).
The output of the convolution layer with 32 different filters will look something like this,
But the above implementation runs very slowly because it loops over every input image in the dataset (60,000 images for MNIST). Hence, this method is not the preferred one, and vectorization is the key to a better and more parallel implementation: key steps in training and testing deep CNNs can be abstracted as matrix and vector operations, upon which parallelism is easily achieved.
For a mini-batch of 5,000 images, the above implementation takes 658.169 seconds, and the runtime keeps growing rapidly as the number of input images increases.
Strategy to vectorize convolution
Vectorization refers to the process that transforms the original data structure into a vector representation so that the scalar operators can be converted into a vector implementation.
Consider an input image of 3 x 3 with 3 different filters of 2 x 2. During the convolution operation, each filter runs over the same input image and produces an output feature map of 2 x 2; hence, 3 different filters result in 3 feature maps. There is a way to vectorize this operation.
The steps to vectorize the above operation are:
- Convert all kernels/filters to rows to get a kernel matrix.
- Split the input (image) into slices for convolution, then convert the slices to columns to get an input matrix. Other inputs (images) can be appended to form a mini-batch.
- Multiply the input matrix with the kernel matrix. In the result matrix, each row is one feature map.
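The three steps above can be sketched on the toy 3 x 3 input with three 2 x 2 filters described earlier (the filter values here are hypothetical, chosen only to illustrate the shapes):

```python
import numpy as np

image = np.arange(9.0).reshape(3, 3)          # toy 3x3 input
filters = np.stack([np.ones((2, 2)),          # three hypothetical 2x2 filters
                    np.eye(2),
                    np.array([[1., -1.], [1., -1.]])])  # shape (3, 2, 2)

# Step 1: flatten each kernel into a row -> kernel matrix of shape (3, 4)
kernel_matrix = filters.reshape(3, -1)

# Step 2: slice the input into 2x2 patches and flatten each patch into
# a column -> input matrix of shape (4, 4), one column per output pixel
patches = [image[i:i + 2, j:j + 2].ravel() for i in range(2) for j in range(2)]
input_matrix = np.stack(patches, axis=1)

# Step 3: a single matrix multiply; each row of the result is one
# flattened 2x2 feature map
result = kernel_matrix @ input_matrix          # shape (3, 4)
feature_maps = result.reshape(3, 2, 2)         # three 2x2 feature maps
```

All three feature maps fall out of one matrix multiplication instead of three separate sliding-window passes.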
A similar strategy was used in the implementation. The code for an input of size (5000, 28, 28, 1) with 32 different filters, producing an output feature map of (5000, 26, 26, 32), will look something like this:
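The original listing is not shown; a hedged sketch of a vectorized convolution built on this im2col-style strategy (function name and exact layout are assumptions) could be:

```python
import numpy as np

def conv2d_vectorized(images, kernels):
    """Vectorized convolution via the patch-to-column strategy above.
    images:  (N, H, W, C_in), kernels: (kH, kW, C_in, C_out).
    Returns: (N, H-kH+1, W-kW+1, C_out)."""
    N, H, W, C_in = images.shape
    kH, kW, _, C_out = kernels.shape
    out_h, out_w = H - kH + 1, W - kW + 1

    # Build the input matrix: one flattened kH*kW*C_in patch per output
    # position, for all N images at once (loops only over kernel-sized
    # offsets, not over images).
    cols = np.zeros((N, out_h * out_w, kH * kW * C_in))
    for i in range(out_h):
        for j in range(out_w):
            patch = images[:, i:i + kH, j:j + kW, :]   # (N, kH, kW, C_in)
            cols[:, i * out_w + j, :] = patch.reshape(N, -1)

    # Kernel matrix: (kH*kW*C_in, C_out). A single batched matrix multiply
    # replaces the per-image, per-filter loops.
    kernel_matrix = kernels.reshape(-1, C_out)
    out = cols @ kernel_matrix                         # (N, out_h*out_w, C_out)
    return out.reshape(N, out_h, out_w, C_out)
```

For a (5000, 28, 28, 1) input with 32 filters of size 3 x 3 x 1, `conv2d_vectorized(images, kernels)` would return the expected (5000, 26, 26, 32) output, with the heavy lifting done by one highly optimized matrix product.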
The above implementation produced the same output, but the runtime was reduced to only 1.511 seconds.
For further details, I recommend that readers have a look at the paper “On Vectorization of Deep Convolutional Neural Networks for Vision Tasks” by Jimmy SJ. Ren and Li Xu.
Summary
In this article, I elaborated on how vectorization of the convolution layer in deep convolutional neural networks (CNNs) works. Vectorization is the key to reducing the total runtime.
References
[1] https://en.wikipedia.org/wiki/Kernel_(image_processing)
[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
[3] Jimmy SJ. Ren and Li Xu. On Vectorization of Deep Convolutional Neural Networks for Vision Tasks.