On Vectorization of Convolution Layer in Convolution Neural Networks (CNNs)

Sanghvirajit
Published in Analytics Vidhya · Feb 7, 2021

Unboxing the black box!

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks. CNNs have shown remarkable state-of-the-art performance in many applications, such as image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, and natural language processing.

Figure 1: Convolution Neural Network.

In this article, I will only focus on vectorizing the single convolution layer and not the whole convolution neural network.

Convolution Operation

In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.

A convolution is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.

In mathematics (in particular, functional analysis), convolution is a mathematical operation on two functions (f and g) that produces a third function (f * g) expressing how the shape of one is modified by the other. The general form of matrix convolution is given by,

Figure 2: Matrix convolution [1].

Intuitively, a convolution allows for weight sharing — reducing the number of effective parameters — and image translation (allowing for the same feature to be detected in different parts of the input space).

A 3D visualization of the convolution operation can be seen as follows,

Figure 3: 3D visualization of the convolution operation.

Depending on the type of kernel, different features can be extracted from the input image.

Figure 4: Common Filters [1].

The outputs of the convolution operation are generally termed “feature maps”.

Figure 5: Convolved Feature.

This convolution operation is the backbone of the convolution neural networks.

Let’s code Convolution Operation

Python code for a simple convolution operation on a single input image will look something like this,

Figure 6: Code for Convolution operation for a single Input image.
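The operation in Figure 6 can be sketched as a minimal NumPy version, assuming a single-channel input, stride 1, and no padding (“valid” convolution); the function name `convolve2d` is my own choice:

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid' convolution of a single-channel 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # output shrinks by kernel size - 1
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the current patch with the kernel, then sum.
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output
```

A Gaussian blur like the one in Figure 7 would then be produced by calling this with a Gaussian kernel.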

The above operation will give the following results,

Figure 7: Original Image (Left) and Gaussian Blur (Right).

But the above implementation performs the convolution on only one input image; what if multiple images are fed into the network (e.g., a mini-batch)? Obviously, the simplest thing that comes to mind is to loop the above operation over the whole mini-batch. A 2D black and white image has 1 channel (e.g., (28, 28, 1)), in comparison to a colored image that has 3 channels (RGB) (e.g., (28, 28, 3)). Hence, the code for the above settings will look something like this,

Figure 8: Code for Convolution operation for multiple Input images.
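A loop-based version along the lines of Figure 8 might look like this; this is a sketch assuming NHWC layout (batch, height, width, channels) and kernels of shape (kh, kw, C_in, C_out), with names of my own choosing:

```python
import numpy as np

def convolve_batch(images, kernels):
    """Loop-based convolution over a mini-batch.

    images:  (N, H, W, C_in)       e.g. (N, 28, 28, 1) for MNIST
    kernels: (kh, kw, C_in, C_out) one 3D filter per output channel
    """
    N, H, W, C_in = images.shape
    kh, kw, _, C_out = kernels.shape
    oh, ow = H - kh + 1, W - kw + 1
    output = np.zeros((N, oh, ow, C_out))
    for n in range(N):                # loop over every image in the mini-batch
        for f in range(C_out):        # loop over every filter
            for i in range(oh):
                for j in range(ow):
                    patch = images[n, i:i + kh, j:j + kw, :]
                    output[n, i, j, f] = np.sum(patch * kernels[:, :, :, f])
    return output
```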

Let’s apply this to the handwritten digits of the MNIST dataset, which consists of 60,000 training images of size 28 x 28. The input size will be (60000, 28, 28, 1). After applying a convolution layer with a kernel size of 3 x 3 x 1 and 32 filters (output channels), the output of the layer will be (60000, 26, 26, 32).
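The output spatial size follows from the usual “valid” convolution arithmetic (no padding, stride 1): output = input − kernel + 1. For MNIST:

```python
H, kernel_size, num_filters = 28, 3, 32
out_size = H - kernel_size + 1                    # 28 - 3 + 1 = 26
print((60000, out_size, out_size, num_filters))   # (60000, 26, 26, 32)
```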

The output of the convolution layer with 32 different filters will look something like this,

Figure 9: Output of the Convolution Layer with 32 filters.

But the above implementation will be very slow, since it loops over all the input images in the dataset (60,000 images for MNIST). Hence, it is not the most efficient approach, and vectorization is the key to a better and parallel implementation: key steps in training and testing deep CNNs can be abstracted as matrix and vector operations, upon which parallelism is easily achieved.

Considering a mini-batch of 5,000 images, the above implementation takes 658.169 sec. As the number of input images increases, the runtime grows proportionally.

Strategy to vectorize convolution

Vectorization refers to the process that transforms the original data structure into a vector representation so that the scalar operators can be converted into a vector implementation.

Considering an input image of 3 x 3 with 3 different filters of 2 x 2, during the convolution operation each filter slides over the same input image and results in an output feature map of 2 x 2. Hence, 3 different filters result in 3 “feature maps”. There is a way to vectorize this operation.

Steps to vectorize the above operation include,

  1. Convert all kernels/filters to rows and get a kernel matrix.
  2. Split your input (image) into the slices used for convolution, then convert them to columns and get an input matrix. You can append other inputs (images) to form a mini-batch.
  3. Multiply the input matrix with the kernel matrix. In the resulting matrix, each row is one feature map.
Figure 10: Strategy to vectorize convolution. Illustration of the way to convolve a 3x3 input image with three 2x2 kernels and generates three 2x2 feature maps [2].
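The three steps above can be sketched for exactly the example in Figure 10 (a 3 x 3 input, three 2 x 2 kernels); `im2col` is my own naming for the helper that performs step 2:

```python
import numpy as np

def im2col(image, kh, kw):
    """Step 2: turn every kh x kw slice of a 2D image into a column."""
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    cols = np.zeros((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    return cols

image = np.arange(9.0).reshape(3, 3)        # 3 x 3 input
kernels = np.arange(12.0).reshape(3, 2, 2)  # three 2 x 2 kernels
kernel_matrix = kernels.reshape(3, -1)      # step 1: each kernel becomes a row -> (3, 4)
input_matrix = im2col(image, 2, 2)          # step 2: one column per slice     -> (4, 4)
result = kernel_matrix @ input_matrix       # step 3: a single matrix multiply
feature_maps = result.reshape(3, 2, 2)      # each row is one 2 x 2 feature map
```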

A similar strategy was used for the implementation; the code for an input of size (5000, 28, 28, 1) with 32 different filters, resulting in an output feature map of (5000, 26, 26, 32), will look something like this,

Figure 11: Flattening the input image
Figure 12: Code for Vectorizing the Convolution Operation for multiple input images.
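The idea in Figures 11 and 12 can be sketched as follows, again assuming NHWC inputs: the im2col gather loops only over the kh x kw kernel positions (9 iterations for a 3 x 3 kernel) instead of over all images, and the convolution itself collapses into one matrix multiply (names are mine):

```python
import numpy as np

def conv_vectorized(images, kernels):
    """Vectorized convolution via im2col and a single matrix multiply.

    images:  (N, H, W, C_in)       e.g. (5000, 28, 28, 1)
    kernels: (kh, kw, C_in, C_out) e.g. (3, 3, 1, 32)
    returns: (N, oh, ow, C_out)    e.g. (5000, 26, 26, 32)
    """
    N, H, W, C_in = images.shape
    kh, kw, _, C_out = kernels.shape
    oh, ow = H - kh + 1, W - kw + 1
    # im2col over the whole batch: only kh * kw Python iterations.
    cols = np.empty((N, oh, ow, kh, kw, C_in))
    for i in range(kh):
        for j in range(kw):
            cols[:, :, :, i, j, :] = images[:, i:i + oh, j:j + ow, :]
    cols = cols.reshape(N * oh * ow, kh * kw * C_in)   # input matrix
    kernel_matrix = kernels.reshape(-1, C_out)         # each column is one flattened filter
    out = cols @ kernel_matrix                         # the whole mini-batch in one multiply
    return out.reshape(N, oh, ow, C_out)
```

Flattening both the patches and the kernels in the same (kh, kw, C_in) order is what makes the single matrix multiply equivalent to the looped convolution.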

The above implementation produces the same output, but the runtime is now reduced to only 1.511 sec,

Figure 13: Output of vectorized Convolution Layer with 32 filters.

For further details, I recommend the readers have a look at the paper “On Vectorization of Deep Convolutional Neural Networks for Vision Tasks” by Jimmy SJ. Ren and Li Xu.

Summary

In this article, I elaborated on how vectorization of the convolution layer in deep convolutional neural networks (CNNs) works. Vectorization is the key to reducing the total runtime.

References

[1] https://en.wikipedia.org/wiki/Kernel_(image_processing)

[2] https://en.wikipedia.org/wiki/Convolutional_neural_network

[3] Jimmy SJ. Ren and Li Xu. On Vectorization of Deep Convolutional Neural Networks for Vision Tasks.
