Demystifying Convolutional Neural Networks

Vijay Choubey
8 min read · Jul 26, 2020

In this article, we will look at how a Convolutional Neural Network (CNN) works and then build a short implementation in Python. We will go through the basics first, so let's start by understanding what a CNN is.

[Figure: CNN illustration]

A CNN (Convolutional Neural Network), also called a ConvNet, is a feed-forward neural network: information moves in one direction, from one layer to the next. Its hidden layers apply convolution and pooling operations, in addition to an activation function that introduces non-linearity.

CNNs are mainly used for image recognition. A CNN first learns to recognize the simple components of an image (e.g. lines, corners, curves, shapes, textures) and then learns to combine these components in deeper layers to recognize larger structures (e.g. faces, objects).

Layers in CNN

1. Convolutional Layer

2. ReLU Layer

3. Pooling Layer

4. Normalization Layer

5. Fully Connected Layer

Computers see an input image as an array of pixels. The numerical representation of the pixels is processed through the many layers of a CNN. Each input image passes through a series of hidden layers: convolutional layers with filters (kernels), ReLU layers, pooling layers and fully connected layers. These hidden layers perform feature extraction from the image.

Convolutional Layer

The convolutional layer is the first layer to extract features from an input image. It uses a matrix filter and performs a convolution operation to detect patterns in the image. Convolving an image with different filters can perform operations such as edge detection, blurring and sharpening.

Convolution is a mathematical operation between two matrices (the image matrix and a filter, or kernel) that produces a third matrix as output (the convolved matrix). This output is also called the feature map.
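To make the operation concrete, here is a minimal NumPy sketch of a stride-1, no-padding convolution. The function name `convolve2d_valid`, the toy image and the vertical-edge kernel are illustrative assumptions, not part of any library. Like deep learning frameworks, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    n_rows, n_cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((n_rows - k_rows + 1, n_cols - k_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the current patch element-wise with the kernel and sum
            patch = image[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 4 x 4 image: dark left half (0), bright right half (1)
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hand-written vertical-edge detector (in a CNN these values are learned)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # (2, 2)
print(feature_map)        # every entry is 3.0: each window contains the edge
```

Every window of this small image straddles the dark-to-bright edge, so the filter responds strongly everywhere. On a uniform image the same filter would output zeros.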

Conv2D (Convolution layer)
Arguments:

  • filters : the number of feature detectors (and therefore the number of output feature maps).
  • kernel_size : the shape of the feature detector. (3,3) denotes a 3 x 3 matrix.
  • strides : how many units the filter shifts at each step.
  • input_shape : the shape of the input image (height, width, channels); it is only needed on the first layer.
  • padding (optional argument) : controls the size of the convolved feature relative to the input. It can be either valid (no padding), which reduces the dimensions, or same (zero padding), which preserves the dimensions when the stride is 1.

The dimension of the output of a convolution layer is:
o = floor((N - K + 2P) / s) + 1
where,
N is the width (or height) of the input image
K is the kernel size
P is the number of pixels of padding added on each side (0 for valid padding)
s is the stride
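The formula is easy to check in a few lines of Python. This is a small sketch; the helper name `conv_output_size` is an assumption made for illustration.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output width/height for input size n, kernel size k, padding p per side, stride s."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(4, 3))         # 2  (4 x 4 input, 3 x 3 filter, no padding)
print(conv_output_size(4, 3, p=1))    # 4  (one pixel of zero padding keeps the size)
print(conv_output_size(28, 3))        # 26 (an MNIST-sized input)
print(conv_output_size(28, 3, s=2))   # 13 (stride 2 roughly halves the size)
```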

Filters (Kernels)

Filters act as pattern detectors. They help find edges, curves, corners, textures, colors, and dark and light areas in the image, along with many other details. Kernels slide over the entire image to extract its different components or patterns. Filters in the initial convolutional layers learn to extract simple features, and filters in deeper layers become more sophisticated and pick out complex patterns.

We slide this filter over an input matrix and get an output of smaller dimensions.

Formula: Consider that our input matrix dimension is n X n and the filter size is f X f. Then our output matrix would be (n - f + 1) X (n - f + 1) for a stride of 1 with no padding. Just replace n with 4 and f with 3, and observe that the output matrix comes out to be 2 X 2.

Padding

We can observe that the input size is reduced from 4 X 4 to 2 X 2 after one convolution with a 3 X 3 filter. This can be a problem: we may lose some information about the edges and corners of the image. To preserve this information, we use padding.

Types of Padding: We have two types of padding: Zero Padding and Valid Padding (no padding).

  1. Zero Padding: Pad the image with zeros so that we don’t lose any information about edges and corners.

If we pad the 4 X 4 input with a one-pixel border of zeros, it becomes 6 X 6. Now, if we use a 3 X 3 filter over this, we get a 4 X 4 output matrix (no reduction in dimensions) instead of 2 X 2.

  2. Valid Padding: Drop the part of the image where the filter does not fit. This is called valid padding because it keeps only the valid part of the image. In this case, we accept losing some edge information. We get only a 2 X 2 matrix in the example above. We can go with this approach if we know that the information at the edges is not very useful and can safely be ignored.
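The two padding choices can be compared with `np.pad`. This is a minimal sketch on a made-up 4 X 4 input; the helper `valid_output_shape` is a hypothetical name that applies the (n - f + 1) rule.

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)

# Zero padding: add a one-pixel border of zeros around the image
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

def valid_output_shape(arr, f):
    """Shape after sliding an f x f filter with stride 1 and no further padding."""
    return (arr.shape[0] - f + 1, arr.shape[1] - f + 1)

print(padded.shape)                   # (6, 6)
print(valid_output_shape(image, 3))   # (2, 2) -> valid padding shrinks the output
print(valid_output_shape(padded, 3))  # (4, 4) -> zero padding keeps the input size
```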

ReLU Layer

ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is ƒ(x) = max(0,x). ReLU’s main purpose is to introduce non-linearity into the ConvNet. It operates element-wise and sets negative values to zero.

The ReLU function is applied to the output matrix (feature map) of the convolutional layer and converts it into a rectified feature map.

We can also use other activation functions such as tanh and sigmoid, but ReLU generally performs better in many scenarios, so it is the usual default choice.
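As a quick illustration, the three activation functions can be applied element-wise to a small, made-up feature map with NumPy:

```python
import numpy as np

feature_map = np.array([[-3.0, 1.5],
                        [0.0, -0.5]])

relu = np.maximum(0, feature_map)         # negative values become 0
tanh = np.tanh(feature_map)               # squashes values into (-1, 1)
sigmoid = 1 / (1 + np.exp(-feature_map))  # squashes values into (0, 1)

print(relu)  # [[0.  1.5]
             #  [0.  0. ]]
```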

Stride

Stride is the number of pixels by which the filter shifts over the input matrix. In other words, it is how far we want the filter to move at each step as it slides across the image. When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. We will use this concept in the pooling layer.
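A small sketch shows where the filter lands along one axis for different strides. The helper `window_starts` is a hypothetical name, and the input size of 7 is just an example.

```python
def window_starts(n, k, stride):
    """Start positions along one axis where a size-k filter fits in a size-n input."""
    return list(range(0, n - k + 1, stride))

print(window_starts(7, 3, 1))  # [0, 1, 2, 3, 4] -> 5 output values
print(window_starts(7, 3, 2))  # [0, 2, 4]       -> 3 output values
```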

Pooling Layer

A pooling layer is added after a convolutional layer; the output of the convolutional layer acts as the input to the pooling layer. The pooling layer down-samples the feature map, which reduces its dimensions while retaining the important information. In this way, memory requirements are also reduced.

It summarizes the features detected in each small region of the feature map (edges, corners etc.) into a single value.

It converts the rectified feature map into a pooled feature map.

Pooling Types

1. Max Pooling: It takes the maximum value from each window of the rectified feature map.

2. Min Pooling: It takes the minimum value from each window of the rectified feature map.

3. Average Pooling: It takes the average of all the elements in each window of the rectified feature map.

4. Sum Pooling: It takes the sum of all the elements in each window of the rectified feature map.

We can also specify a padding parameter in the pooling layer, just like in the convolutional layer.
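The four pooling types differ only in the function applied to each window. Here is a minimal NumPy sketch with a 2 X 2 window and a stride of 2; the function `pool2d` and the sample values are assumptions made for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, reducer=np.max):
    """Apply `reducer` to each size x size window of the feature map."""
    rows = (feature_map.shape[0] - size) // stride + 1
    cols = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            r, c = i * stride, j * stride
            out[i, j] = reducer(feature_map[r:r + size, c:c + size])
    return out

rectified = np.array([[1, 3, 2, 0],
                      [5, 6, 1, 2],
                      [0, 2, 4, 4],
                      [1, 1, 0, 8]], dtype=float)

max_pooled = pool2d(rectified, reducer=np.max)   # [[6, 2], [2, 8]]
min_pooled = pool2d(rectified, reducer=np.min)   # [[1, 0], [0, 0]]
avg_pooled = pool2d(rectified, reducer=np.mean)  # [[3.75, 1.25], [1.0, 4.0]]
sum_pooled = pool2d(rectified, reducer=np.sum)   # [[15, 5], [4, 16]]
print(max_pooled)
```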

Advantages of Pooling Layer

1. Reduces the resolution and dimensions, and hence the computational complexity.

2. Helps in reducing overfitting.

Normalization Layer

Normalization is a technique used to improve the performance and stability of neural networks. It rescales the inputs to a layer so that their mean is zero and their standard deviation is one.
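The core computation can be sketched in NumPy as below. The sample values and the small constant `eps` (added to avoid division by zero) are illustrative assumptions. Keras's `BatchNormalization` layer additionally learns a scale and a shift, which this sketch leaves out.

```python
import numpy as np

activations = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
eps = 1e-5  # small constant for numerical stability

normalized = (activations - activations.mean()) / np.sqrt(activations.var() + eps)

print(activations.mean(), activations.std())  # 5.0 2.0
print(normalized.mean(), normalized.std())    # approximately 0 and 1
```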

Fully Connected Layer

In a fully connected layer, every neuron in one layer is connected to all the neurons in the next layer. We flatten the pooled feature map into a vector and then feed that vector into a fully connected layer.
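In NumPy terms, flattening followed by a fully connected layer is a reshape followed by a matrix multiplication. The shapes and random weights below are made up for illustration; in a real network the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.random((2, 2, 3))           # pooled feature map: height, width, channels
vector = pooled.flatten()                # flattened to shape (12,)

weights = rng.standard_normal((12, 4))   # one weight per (input, output neuron) pair
bias = np.zeros(4)
output = vector @ weights + bias         # fully connected layer with 4 neurons

print(vector.shape, output.shape)        # (12,) (4,)
```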

Hyperparameters in CNN

1. Number of convolutional layers

2. Number of kernels / filters in a convolutional layer

3. Kernel / filter size in a convolutional layer

4. Padding in a convolutional layer (zero or valid padding)

Pooling layer problem in CNN

A pooling layer down-samples the data, so a lot of information is lost. These layers reduce the spatial resolution, so their outputs are invariant to small changes in the inputs. This is a problem when detailed information must be preserved throughout the network. Capsule Networks (CapsNets) are an alternative: with CapsNets, detailed pose information (such as precise object position, rotation, thickness, skew, size, and so on) is preserved throughout the network. Small changes to the inputs result in small changes to the outputs, so information is preserved. This is called “equivariance.”
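The invariance of pooling can be seen in a tiny NumPy experiment. This is a sketch with made-up inputs: two feature maps that differ only by a one-pixel shift of a single activation produce the same max-pooled output, so the exact position is lost.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 (assumes even height and width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0            # feature detected at the top-left pixel

b = np.zeros((4, 4))
b[1, 1] = 1.0            # same feature shifted one pixel diagonally

print(max_pool_2x2(a))   # [[1. 0.]
                         #  [0. 0.]]
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```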

Implementation of the CNN model

Code:

  1. Dataset used

MNIST dataset: a database of handwritten digits with 60,000 images for training and 10,000 images as a test set. The digits have been size-normalized and centered.

Availability: It is available through the TensorFlow Keras datasets API, which saves a lot of time otherwise spent loading the data. Follow the code below to see how to access it!

2. Importing the libraries and packages:

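import numpy as np
import tensorflow as tf
# Import Keras through TensorFlow so the layers match tf.keras.datasets used below
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from tensorflow.keras.layers import AveragePooling2D, MaxPooling2D, Dropout
from tensorflow.keras.models import Model, Sequential
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
# Jupyter/IPython magic: renders plots inline (remove when running as a plain script)
%matplotlib inline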

3. Loading the dataset:

(X_train,Y_train),(X_test,Y_test) = tf.keras.datasets.mnist.load_data()

4. Visualizing a sample:

image_index = 7777
print(Y_train[image_index])
# The label is 8
plt.imshow(X_train[image_index], cmap='Greys')

Output:

5. A Little Preprocessing

# Reshape the images to (samples, height, width, channels) so they can be fed into the model
X_train = X_train.reshape(X_train.shape[0],28,28,1)
X_test = X_test.reshape(X_test.shape[0],28,28,1)
# Scale pixel values from [0, 255] to [0, 1]
X_train = X_train/255
X_test = X_test/255

6. Model Construction

model = Sequential()
model.add(Conv2D(28,(3,3),strides=(1,1),input_shape=X_train.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D((2,2)))
model.add(Flatten()) # flattens the input
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
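To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer settings shown above (28 filters of size 3 X 3 with valid padding, 2 X 2 max pooling, then Dense layers with 128 and 10 units); the helper names are hypothetical.

```python
def conv2d_params(filters, kernel_size, in_channels):
    """Weights per filter (k * k * channels) plus one bias, times the number of filters."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units

conv = conv2d_params(filters=28, kernel_size=3, in_channels=1)  # 280
conv_out = 28 - 3 + 1            # 26 x 26 feature maps (valid padding, stride 1)
pooled = conv_out // 2           # 13 x 13 after 2 x 2 max pooling
flat = pooled * pooled * 28      # 4732 values after Flatten
dense1 = dense_params(flat, 128)   # 605824
dense2 = dense_params(128, 10)     # 1290

total = conv + dense1 + dense2
print(total)  # 607394
```

Almost all of the parameters sit in the first Dense layer, not in the convolutional layer.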

7. Model Training

# model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# model fitting
history = model.fit(x=X_train,y=Y_train, epochs=10)
history
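For intuition about what the softmax output and the sparse categorical cross-entropy loss compute, here is a minimal NumPy sketch. The function names mirror the Keras loss name, but the implementations and the sample scores are illustrative assumptions, not the Keras code.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true class (integer labels)."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

# Two samples, three classes (made-up scores)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])   # integer class labels, like Y_train

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.sum(axis=1))    # each row sums to 1
print(loss)                 # lower is better; 0 would mean perfect confidence
```

"Sparse" here means that the labels are plain integers (0 to 9 for MNIST) instead of one-hot vectors.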

8. Model Evaluation
Let’s evaluate the performance of our model on the test data.

model.evaluate(X_test, Y_test)

Output:
Loss: 0.05
Accuracy: 98.73%
This is an impressive result, even though the model was trained for just 10 epochs.

9. Visualizing loss and accuracy

# The key matches the name passed to metrics=['accuracy'] (older Keras versions used 'acc')
acc = history.history['accuracy']
loss = history.history['loss']
plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(acc, label='Training Accuracy')
plt.legend(loc='lower right')
plt.ylabel('Accuracy')
plt.ylim([min(plt.ylim()),1])
plt.title('Training Accuracy')
plt.subplot(2, 1, 2)
plt.plot(loss, label='Training Loss')
plt.legend(loc='upper right')
plt.ylabel('Cross Entropy')
plt.ylim([0,max(plt.ylim())])
plt.title('Training Loss')
plt.show()

Output:

Why Do CNNs Work So Well?

  • CNNs are hugely popular because of their architecture: the best thing is that there is no need for manual feature extraction. The system learns to do the feature extraction itself. The core concept of a CNN is that it convolves the image with filters to generate invariant features that are passed on to the next layer. The features in the next layer are convolved with different filters to generate more invariant and abstract features, and the process continues until we get the final feature/output.
  • A CNN is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture fits image datasets better because of the reduction in the number of parameters involved and the reusability of weights. In other words, the network can be trained to understand the sophistication of the image better.
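The effect of weight re-use on the parameter count can be estimated with simple arithmetic. The sketch below compares a convolutional layer with a hypothetical fully connected layer that produces the same number of outputs from a 28 X 28 image. The sizes are chosen to match the MNIST example and are assumptions for illustration.

```python
inputs = 28 * 28                 # pixels in one MNIST image
outputs = 26 * 26 * 28           # 28 feature maps of size 26 x 26

# Fully connected: a separate weight for every (input, output) pair, plus biases
dense_weights = inputs * outputs + outputs

# Convolutional: each of the 28 filters re-uses the same 3 x 3 weights everywhere
conv_weights = (3 * 3 * 1 + 1) * 28

print(dense_weights)                   # 14858480
print(conv_weights)                    # 280
print(dense_weights // conv_weights)   # 53066 times fewer parameters
```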

Thanks for reading! I am going to be writing more Deep Learning articles in the future too. Follow me on Medium to be informed about them. I am also a freelancer, so if there is some freelancing work on data-related projects, feel free to reach out over LinkedIn. Nothing beats working on real projects!

Clap if you liked the article!

Vijay Choubey

Data Scientist @ Accenture AI|| Medium Blogger || NLP Enthusiast || Freelancer LinkedIn: https://www.linkedin.com/in/vijay-choubey-3bb471148/