Convolutional Neural Networks — ‘What Is What’ Understanding By a Naive Kid

Published in The Startup, Sep 17, 2020

This article explains the common terms used in Convolutional Neural Networks in a ‘what is what’ style, in plain English and with as little mathematics and statistics as possible.

You can refer to my other blog for ‘What is what’ understanding of neural networks.

Let's start…

Convolutional Neural Network:
Also called a CNN or ConvNet, this is a type of deep learning model used mainly for learning from and analyzing visual data such as images.

Key applications:
Object recognition and detection
Face recognition

Image Representation: An image can be represented as a matrix in which each cell is a pixel. The hand-written number 1 can be represented in matrix form as below (the image is taken from the MNIST dataset). Each pixel that is part of the stroke holds a value between 0 and 1 depending on its intensity; all the remaining cells are 0.
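As a rough illustration of this matrix view, the sketch below builds a tiny 5x5 grayscale "1" by hand with NumPy and scales its 8-bit intensities into the 0 to 1 range. The pixel values are made up for the example and are not taken from MNIST.

```python
import numpy as np

# A made-up 5x5 grayscale image of a vertical stroke (a crude "1").
# Values are 8-bit intensities: 0 = background, 255 = strongest stroke.
digit_8bit = np.array([
    [0, 0, 128, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 128, 0, 0],
], dtype=np.uint8)

# Scale to the 0..1 range described in the text.
digit = digit_8bit.astype(np.float32) / 255.0

print(digit.shape)   # (5, 5)
print(digit)
```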

Colored images are represented using Red, Green and Blue (RGB) channels. The actual color of a pixel is defined by the concentration of each of the three colors at that pixel. A colored image can therefore also be represented in matrix form, with one matrix per channel.
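A minimal sketch of the same idea for color, assuming the common height x width x channels layout in NumPy. The 2x2 image here is hypothetical.

```python
import numpy as np

# A hypothetical 2x2 color image stored as height x width x channels (R, G, B).
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]      # pure red
rgb[0, 1] = [0, 255, 0]      # pure green
rgb[1, 0] = [0, 0, 255]      # pure blue
rgb[1, 1] = [255, 255, 0]    # red + green = yellow

red_channel = rgb[:, :, 0]   # one 2x2 matrix per channel

print(rgb.shape)             # (2, 2, 3)
print(red_channel)
```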

Convolution: is a mathematical operation on two functions that produces a third one, expressing how one function modifies the other. In a CNN, the two functions are the input image and a kernel.

Kernel: is a small matrix used to extract features from images. It is also called a filter, mask or operator. It is multiplied element-wise with each patch of the input matrix it covers, and the products are summed to give one output value. The Sobel edge detector is one of the most commonly used kernels for detecting edges in an image.

After convolution and normalization, the output of the Sobel edge detector shows white lines where the image has edges. One Sobel kernel responds to horizontal edges and the other to vertical edges, and the closer an edge is to the orientation a kernel detects, the brighter its line.
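To make the kernel idea concrete, here is a small sketch of a Sobel kernel sliding over a made-up 5x5 image with a vertical edge. The helper name `convolve2d_valid` is hypothetical, and like most deep learning libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation. No normalization is applied.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the
    element-wise products at each position. The kernel is not flipped."""
    n_rows, n_cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((n_rows - k_rows + 1, n_cols - k_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(patch * kernel)
    return out

# Sobel kernel that responds to vertical edges (left-to-right intensity change).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Made-up 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

edges = convolve2d_valid(image, sobel_x)
print(edges.shape)   # (3, 3)
print(edges)         # zero over the flat region, positive where the edge is
```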

Output matrix after Convolution:
(N x N) x (K x K) → (N-K+1) x (N-K+1)
N is the dimension of input image; K is the dimension of the kernel

Padding: Adding extra rows and columns to the top, bottom, left and right of the input image matrix, typically so that the output matrix after convolution has the same size as the input. Zero padding, which fills the added cells with 0, is the most commonly used.

Output matrix after Convolution with Padding:
(N x N) x (K x K) → (N-K+2P+1) x (N-K+2P+1)
N is the dimension of input image; K is the dimension of the kernel; P is the number of padding layers
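A quick sketch of zero padding with NumPy's `np.pad`, using a made-up 5x5 input. With a 3x3 kernel and one padding layer, the formula above gives an output the same size as the input.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # made-up 5x5 input

# Add one layer of zeros on every side (P = 1).
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

n, k, p = 5, 3, 1
output_size = n - k + 2 * p + 1

print(padded.shape)   # (7, 7)
print(output_size)    # 5
```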

Strides: The number of positions by which the kernel is shifted at each step of the convolution. If the stride is 1, the kernel is shifted by 1 position at a time. A larger stride reduces the size of the output matrix.

Output matrix after Convolution with Padding and Stride:
(N x N) x (K x K) → (floor((N-K+2P)/S)+1) x (floor((N-K+2P)/S)+1)
N is the dimension of input image; K is the dimension of the kernel; P is the number of padding layers; S is the stride; floor rounds down when the division is not exact
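The three output-size formulas can be folded into one small helper, sketched below. The function name `conv_output_size` is made up for this example, and rounding down when the stride does not divide evenly is an assumption that matches the usual library convention.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output width/height for an n x n input, k x k kernel, padding p, stride s."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    if n + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    # Floor division: positions where the kernel would overhang are dropped.
    return (n - k + 2 * p) // s + 1

print(conv_output_size(28, 3))             # 26: no padding, stride 1
print(conv_output_size(28, 3, p=1))        # 28: padding keeps the size
print(conv_output_size(28, 3, p=1, s=2))   # 14: stride 2 roughly halves it
```

With p=0 and s=1 this reduces to N-K+1, and with s=1 it reduces to N-K+2P+1, so the earlier formulas are special cases of the last one.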

Pooling: progressively reduces the spatial size of the input to reduce the number of parameters and the amount of computation in the network. It can also help make the model less sensitive to small changes in location, scale or rotation. Max pooling is the most commonly used technique. Here, for each 2x2 window of the input, the maximum value is copied to the corresponding cell of the output matrix. Other variants are average pooling and global pooling.
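A minimal sketch of 2x2 max pooling with stride 2 in NumPy, on a made-up 4x4 input. The helper `max_pool_2x2` is hypothetical and assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = x.shape
    # Split into 2x2 blocks, then take the maximum inside each block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])

print(max_pool_2x2(x))
# [[6 4]
#  [7 9]]
```

Swapping `max` for `mean` in the helper would give average pooling over the same windows.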

Flatten: converts the input data into a 1-dimensional array.
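In NumPy terms, flattening a small made-up feature map looks like this:

```python
import numpy as np

feature_map = np.array([[6, 4],
                        [7, 9]])

flat = feature_map.flatten()   # read row by row into one long array

print(flat)         # [6 4 7 9]
print(flat.shape)   # (4,)
```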

Data Augmentation: helps make models robust to variations such as rotation, scaling, cropping, flipping and noise. It adds modified copies of existing images to the dataset to cover these variations. An example of data augmentation is shown below:
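As a rough sketch only (not the author's original figure), a few simple augmentations can be produced with plain NumPy. The 3x3 image, the noise level of 0.05 and the fixed seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the noise is repeatable

# Made-up 3x3 grayscale image: dark on the left, bright on the right.
image = np.array([[0.0, 0.5, 1.0],
                  [0.0, 0.5, 1.0],
                  [0.0, 0.5, 1.0]])

flipped = np.fliplr(image)    # mirror left-to-right
rotated = np.rot90(image)     # rotate 90 degrees counter-clockwise
# Add small Gaussian noise, then clip back into the valid 0..1 range.
noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)

# One original image has become four training examples.
augmented = [image, flipped, rotated, noisy]
print(len(augmented))   # 4
```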

Fully Connected Network: Every neuron in the previous layer is connected to every neuron in the next layer.
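A minimal sketch of one fully connected layer as a matrix multiplication, with made-up weights and no activation function applied. The helper name `dense` is hypothetical.

```python
import numpy as np

def dense(x, weights, bias):
    """Fully connected layer: every input contributes to every output."""
    return weights @ x + bias

x = np.array([1.0, 2.0, 3.0])            # 3 activations from the previous layer
weights = np.array([[0.1, 0.2, 0.3],     # one row of weights per output neuron
                    [0.0, -1.0, 1.0]])
bias = np.array([0.5, 0.0])

y = dense(x, weights, bias)

print(y.shape)        # (2,)
print(weights.size)   # 6 connections: 3 inputs x 2 outputs
```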

Typical CNN architecture for MNIST dataset:

That is all for now, friends. I will come up with a super simple article soon to get you started on the deep learning journey. Your feedback will help me improve my knowledge, so please do comment. I can also be reached at anupkkumar@gmail.com.

References:

Neural Networks — ‘What is what’ understanding by a naive kid (medium.com)

Sobel operator (en.wikipedia.org)

CS231n Convolutional Neural Networks for Visual Recognition (cs231n.github.io)