Introduction to Convolutional Neural Nets and Implementation in TensorFlow

Bhavesh Singh Bisht
Analytics Vidhya
Aug 14, 2020

Introduction:

Convolutional Neural Networks are deep neural networks designed specifically to handle image datasets. When we are dealing with pixels, generalization becomes extremely difficult if we flatten an image and feed all the pixels directly to a fully connected network. Imagine a dataset of RGB images, each of size 360*360: a single image would produce an input vector of 360*360*3 = 388,800 pixels to feed to a multilayer perceptron. Hence, there is a need to highlight the features, get rid of noisy pixels, and perform dimensionality reduction.

Dealing with Images Directly with Ordinary Neural Nets

For a better understanding, let's begin by training a model without convolutions. Here we'll use the Fashion-MNIST dataset, which consists of 70,000 Zalando article images. Each image is a 28*28 grayscale image associated with one of 10 classes.

The figure illustrates the fashion-mnist dataset

Refer: https://github.com/zalandoresearch/fashion-mnist

Let's load the data using the following commands:

The figure demonstrates the commands to load fashion-mnist data-set and perform normalization
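Since the code in the original post appears as an image, here is a minimal sketch of what the loading and normalization step might look like, assuming the standard tf.keras loader (variable names are my own):

```python
import tensorflow as tf

# Load the Fashion-MNIST dataset bundled with Keras
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Normalize pixel values from [0, 255] to [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0
```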

Now, let's define our model layers sequentially. The first layer is the input layer, where we flatten the (28*28) matrix into a vector. Next is the first hidden layer, which consists of 512 neurons. In the output layer we specify 10 neurons, one for each of the 10 classes, with the "softmax" activation to perform multi-class classification. Moreover, we can use model.summary() to get the overall architecture in detail.

The figure shows different layers in the network
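Based on the description above, the layer definitions might look like the sketch below (the hidden layer's ReLU activation is an assumption, since the figure is not reproduced here):

```python
# A sketch of the fully connected model described above
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # flatten the 28x28 matrix into a 784-vector
    tf.keras.layers.Dense(512, activation='relu'),   # hidden layer (activation assumed)
    tf.keras.layers.Dense(10, activation='softmax')  # one neuron per class
])
model.summary()  # print the overall architecture
```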

Next, we compile the model by specifying different parameters. Here we use "adam" as the optimizer, "sparse_categorical_crossentropy" as the loss function, and "accuracy" as the evaluation metric. Finally, we call model.fit to begin training for 10 epochs.

The figure shows the commands to compile and train the model
The figure illustrates the training with loss and accuracy in each epoch
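A sketch of the compile-and-train step; passing the test set as validation_data is an assumption, made because the article reports a validation accuracy:

```python
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train for 10 epochs, holding out the test set for validation
model.fit(train_images, train_labels,
          epochs=10,
          validation_data=(test_images, test_labels))
```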

Here, the dataset consisted of 28*28 grayscale images, so training was quite fast and resulted in decent training accuracy; however, validation accuracy was noticeably lower than training accuracy, at about 0.87.

Shortcomings

  1. When we have higher-resolution images, computation becomes expensive.
  2. We can't feed an image directly into this network; the 2D grid of pixels must be converted into a vector, and in doing so we lose the spatial relationships within the data.
  3. Feature extraction becomes difficult when features are sparsely located within images belonging to the same dataset.

What Actually Is a Convolutional Neural Network?

ConvNets are similar to typical neural nets: they also compute dot products (w1*x1 + w2*x2 + …), use the same activation functions (ReLU, sigmoid, etc.) to add non-linearity, and use similar loss functions. However, the input is an image rather than a flat vector of pixels. Moreover, a CNN sees the image through a small window that slides over it, rather than looking at the entire image at once. Therefore, ConvNets capture sophisticated dependencies within the image pixels and fit the data better, since they work with fewer parameters and retain only the important features. The filters also contain an additional dimension, "depth", which gives them a block-like structure.

The figure shows the difference between traditional Neural Nets and ConvNets

Different Layers in ConvNets

  1. Convolutional Layer
  2. Pooling Layer
  3. Fully Connected Layer

1. Convolutional Layer:

This layer consists of filters whose values are learnable. A typical filter has a width, a height, and a depth (matching the depth of its input). During the forward pass, each filter slides over the input image according to the stride, which is the number of pixels the filter shifts as it moves over the image. At each position, the filter computes a dot product with the underlying image patch, producing a 2D activation map; the layer's output is formed by stacking the activation maps of all its filters along the depth dimension.

Eventually, the network learns filter values that activate when the filters see some type of visual feature, such as an edge of some orientation or a blotch of some color on the first layer. The main function of these filters is to summarize the presence of features in an image.

As a filter is applied repeatedly across the input array, the result is a two-dimensional array of output values that represents a filtering of the input. This two-dimensional output array is called a "feature map". You might notice a reduction in dimension when applying this layer, which can be avoided by using padding.
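As a small illustration (not from the original post), the sketch below shows how stride and padding affect the output shape of a Conv2D layer on a hypothetical 28*28 grayscale input:

```python
import tensorflow as tf

# Hypothetical example: one 3x3 filter sliding over a 28x28 grayscale image
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3),
                              strides=1, padding='valid')
image = tf.random.normal([1, 28, 28, 1])   # (batch, height, width, depth)
feature_map = conv(image)
print(feature_map.shape)                   # (1, 26, 26, 1): 'valid' padding shrinks the output

# With 'same' padding, the spatial size is preserved
conv_same = tf.keras.layers.Conv2D(1, (3, 3), padding='same')
print(conv_same(image).shape)              # (1, 28, 28, 1)
```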

2. Pooling Layer

The main function of the pooling layer is to downsample the feature map. It creates a lower-resolution version of the input that still contains the large or important structural elements, without the fine detail that may not be as useful to the task, thus removing noise from the image.
There are two types of pooling:

  1. Max Pooling
In max pooling, the largest value in the region of the feature map covered by the filter is selected, so the output keeps the most prominent features. Max pooling is the most widely used form, as it also removes noisy activations from the image.

The figure demonstrates the max pooling layer operation

2. Average Pooling
In average pooling, the average of all the values in the region of the feature map covered by the filter is taken. Max pooling throws data away, whereas average pooling preserves everything, so it retains more localization information. Both operations are sketched in the code below.

The figure illustrates the average pooling filter operation
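A short sketch (not from the original post) contrasting the two pooling layers on a hypothetical feature map:

```python
import tensorflow as tf

feature_map = tf.random.normal([1, 26, 26, 16])  # hypothetical conv output

# Max pooling keeps the largest value in each 2x2 window
max_pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(feature_map)
print(max_pooled.shape)   # (1, 13, 13, 16)

# Average pooling takes the mean of each 2x2 window instead
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))(feature_map)
print(avg_pooled.shape)   # (1, 13, 13, 16)
```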

3. Fully Connected Layer

The task of the fully connected layer is to take the results of the convolution process and perform the prediction. Its first layer is a flatten layer, which converts the feature maps into a single one-dimensional vector. The fully connected part works like a regular neural network, going through forward- and back-propagation iterations in which the weight parameters are learned. Each hidden layer has an activation function to add non-linearity.

The figure illustrates an example of ConvNets with all the layers involved

Implementation in TensorFlow

The figure illustrates the different layers in ConvNets

Here, we define a Sequential model with three convolution layers, each followed by pooling, finally feeding into a fully connected network with one hidden layer. keras.layers.Conv2D is used to define a convolution layer: the first parameter (16) is the number of filters, the second is the filter size (3, 3), the third is the activation function, and the last parameter, input_shape, gives the dimensions of the input image. keras.layers.MaxPooling2D is used to define a max pooling layer with its filter dimensions.
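The figure's code is not reproduced in text; below is a sketch consistent with the description. The first Conv2D's 16 filters of size (3, 3) come from the article, while the later filter counts (32, 64) and the 128-neuron hidden layer are assumptions:

```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    # Three convolution layers, each followed by max pooling
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                           input_shape=(28, 28, 1)),   # 16 filters of size 3x3 (from the article)
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),  # filter count assumed
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),  # filter count assumed
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Fully connected head with one hidden layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),           # hidden layer size assumed
    tf.keras.layers.Dense(10, activation='softmax')
])
model.summary()
```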

Finally, we compile the model and call model.fit, specifying the number of epochs. Subsequently, we can use a callback to stop training once the validation metric crosses a desired threshold, which helps regularize the model and prevent over-fitting.

The figure shows the model.compile command
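A sketch of the compile/fit step with such a callback; the StopAtThreshold class and its 0.90 threshold are hypothetical, illustrating the idea the article describes:

```python
# Hypothetical callback: stop training once validation accuracy
# exceeds a chosen threshold (0.90 here is an assumption)
class StopAtThreshold(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if logs.get('val_accuracy', 0) > 0.90:
            self.model.stop_training = True

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Conv2D expects a channel dimension, so reshape the grayscale
# images loaded earlier before training
train_x = train_images.reshape(-1, 28, 28, 1)
test_x = test_images.reshape(-1, 28, 28, 1)

model.fit(train_x, train_labels,
          epochs=10,
          validation_data=(test_x, test_labels),
          callbacks=[StopAtThreshold()])
```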
