Architecture and Training of Convolutional Neural Networks (7 points)

Ayantika Sarkar · Published in Analytics Vidhya · Aug 10, 2020

This post details the architecture of a Convolutional Neural Network (CNN), the function and training of each layer, and ends with a summary of how a CNN is trained.

  1. The basic CNN architecture consists of: Input -> (Conv + ReLU) -> Pool -> (Conv + ReLU) -> Pool -> Flatten -> Fully Connected -> Softmax -> Output
  2. Feature extraction is carried out in the convolutional (+ReLU) and pooling layers, and classification is carried out in the fully connected and softmax layers. A minimal model sketch follows this list.
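To make this pipeline concrete, below is a minimal sketch in Keras (TensorFlow). The input size, filter counts, kernel sizes and the 10-class output are illustrative assumptions, not values from this post.

```python
from tensorflow.keras import layers, models

# Minimal sketch of Input->(Conv+ReLU)->Pool->(Conv+ReLU)->Pool->
# Flatten->Fully Connected->Softmax with assumed (toy) layer sizes.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),               # assumed 32x32 RGB input
    layers.Conv2D(8, (3, 3), activation="relu"),   # Conv + ReLU
    layers.MaxPooling2D((2, 2)),                   # Pool
    layers.Conv2D(16, (3, 3), activation="relu"),  # Conv + ReLU
    layers.MaxPooling2D((2, 2)),                   # Pool
    layers.Flatten(),                              # Flatten
    layers.Dense(64, activation="relu"),           # Fully Connected
    layers.Dense(10, activation="softmax"),        # Softmax over an assumed 10 classes
])
model.summary()
```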

3. First Convolutional Layer:

  • The primary purpose of this layer is to extract features from the input image.
  • Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data, which is why it is used for feature extraction.
  • The convolutional layer has the following attributes:

i) Convolutional neurons/kernels/filters defined by a width and height (hyper-parameters).

ii) The number of input channels and output channels (hyper-parameters).

iii) The depth/number of channels of the Convolutional filter/kernel must be equal to the depth/number of channels of the input.

  • The filter/kernel/neuron (a matrix of weights/parameters) slides over the input image (a matrix of pixel values, usually with a depth of 3 for the three colors red, green and blue), starting from the top-left corner of the input image and covering, at each position, as many pixels as there are weights in the filter/kernel/neuron.
  • The outcome of each convolution is stored in a matrix known as the feature map (also called the convolved feature or activation map), whose depth is equal to the number of filters/kernels/neurons used.
  • The dimensions of the feature map/convolved feature/activation map can be determined as follows (assuming a stride of 1 and no padding; a small numerical check of this formula appears at the end of this point):
  • Input image * Filter = Feature Map/Activation Map

[n x n x nc] * [f x f x nc] = [(n - f + 1) x (n - f + 1) x m], where

n is the dimension of the matrix of image pixels, f is the dimension of the matrix of weights, nc is the depth (number of channels) of the image and m is the number of filters used.

  • The greater the number of filters, the more features can be extracted from the image.
  • The feature map is then made non-linear by applying ReLU.
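As a rough numerical check of the output-size formula above, the toy NumPy loop below convolves an assumed 6 x 6 x 3 image with 4 assumed 3 x 3 x 3 filters (stride 1, no padding) and confirms that the resulting feature map is 4 x 4 x 4.

```python
import numpy as np

# Toy check of [n x n x nc] * [f x f x nc] = [(n-f+1) x (n-f+1) x m]
# with assumed values n=6, f=3, nc=3, m=4 (stride 1, no padding).
n, f, nc, m = 6, 3, 3, 4
image = np.random.rand(n, n, nc)          # input image, n x n x nc
filters = np.random.rand(m, f, f, nc)     # m filters, each f x f x nc

feature_map = np.zeros((n - f + 1, n - f + 1, m))
for k in range(m):                        # one output channel per filter
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            # element-wise multiply the image patch with the filter and sum
            feature_map[i, j, k] = np.sum(image[i:i + f, j:j + f, :] * filters[k])

print(feature_map.shape)                  # (4, 4, 4) = (n-f+1, n-f+1, m)
```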

4. ReLU:

  • ReLU (Rectified Linear Unit) is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero.
  • The purpose of ReLU is to introduce non-linearity into the ConvNet/Convolutional Neural Network: most of the real-world data a ConvNet learns from is non-linear, while the convolution carried out in the first convolutional layer is a linear operation, so non-linearity is introduced with a non-linear function such as ReLU.
  • The output of ReLU is ƒ(x) = max(0, x).
  • Other non-linear functions such as tanh or sigmoid (used in the single-layer perceptron) can also be used instead of ReLU, but most practitioners choose ReLU since it generally performs better than the other two. A small sketch of the operation follows.
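A minimal NumPy sketch of this element-wise operation on a made-up 2 x 2 feature map:

```python
import numpy as np

# ReLU applied element-wise: every negative value becomes zero, f(x) = max(0, x).
feature_map = np.array([[ 1.5, -0.3],
                        [-2.0,  0.7]])
relu_output = np.maximum(0, feature_map)
print(relu_output)   # [[1.5 0. ]
                     #  [0.  0.7]]
```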

5. Pooling Layer:

  • The pooling layer reduces the dimensions of the data by combining the outputs of neuron/filter/kernel clusters at one layer into a single neuron/filter/kernel in the next layer.
  • Convolutional networks/ConvNet may include local and global pooling layers.
  • The hyperparameters for this layer are: 1) filter size (f), 2) stride (s).
  • Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each map but retains important information. It can be of different types:

i) Max pooling- This technique generally works better than the other techniques and is therefore used most often. Here, depending on the hyperparameters, clusters are formed in the feature map, the maximum of each cluster is taken, and the resultant matrix is built from these maximum values. The number of channels/depth of the resultant matrix is the same as that of the feature map. There is no padding here. A small sketch appears after the three pooling types.

ii) Average pooling- Here, depending on the hyperparameters, clusters are formed in the feature map, the average of each cluster is taken, and the resultant matrix is built from these average values. The number of channels/depth of the resultant matrix is the same as that of the feature map.

iii) Sum pooling- Here, depending on the hyperparameters, clusters are formed in the feature map, the sum of each cluster is taken, and the resultant matrix is built from these sums. The number of channels/depth of the resultant matrix is the same as that of the feature map.
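A minimal max-pooling sketch in NumPy, assuming a toy 4 x 4 single-channel feature map with filter size f = 2 and stride s = 2:

```python
import numpy as np

# Max pooling: each 2x2 cluster of the feature map is replaced by its maximum.
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])
f, s = 2, 2
pooled = np.zeros((feature_map.shape[0] // s, feature_map.shape[1] // s))
for i in range(pooled.shape[0]):
    for j in range(pooled.shape[1]):
        cluster = feature_map[i * s:i * s + f, j * s:j * s + f]
        pooled[i, j] = cluster.max()
print(pooled)   # [[6. 5.]
                #  [8. 9.]]
```

Replacing cluster.max() with cluster.mean() or cluster.sum() turns the same loop into average or sum pooling over the same clusters.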

  • Functions of pooling:

a) Makes the input representations (feature dimension) smaller and more manageable.

b) Reduces the number of parameters and computations in the network, therefore, controlling overfitting.

c) Makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in input will not change the output of pooling — since we take the maximum / average value in a local neighborhood).

d) Helps us arrive at an almost scale invariant representation of our image (the exact term is “equivariant”).

6. Fully-Connected Layer:

  • This layer takes the input volume from its preceding layer and outputs an N-dimensional vector, where N is the number of classes the program has to choose from. Each number in the N-dimensional vector represents the score for a certain class, which the softmax then turns into a probability.
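A rough NumPy sketch of this step: the pooled volume is flattened into a vector and mapped to N class scores by a weight matrix. All shapes and the 3-class output below are assumptions for illustration.

```python
import numpy as np

# Flatten the (assumed) output of the last pooling layer and apply a fully
# connected (affine) layer to obtain an N-dimensional vector of class scores.
pooled = np.random.rand(5, 5, 16)        # assumed pooled volume
x = pooled.flatten()                     # 5*5*16 = 400-dimensional vector
N = 3                                    # assumed number of classes
W = np.random.rand(N, x.size) * 0.01     # weights of the fully connected layer
b = np.zeros(N)                          # biases
scores = W @ x + b                       # one score per class
print(scores.shape)                      # (3,)
```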

7. Softmax:

  • Softmax (also known as softargmax, the normalized exponential function, or multi-class logistic regression) is a function that turns a vector of K real values into a vector of K values that sum to 1. A small sketch follows this list.
  • The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities.
  • If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1.
  • The softmax is very useful because it converts the outcomes of the previous layer to a normalized probability distribution, which can be displayed to a user or used as input to other systems. For this reason it is usual to append a softmax function as the final layer of the convolutional neural network.
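A minimal NumPy sketch of the softmax function, with made-up input scores:

```python
import numpy as np

# Softmax: turns a vector of K real scores into K values in (0, 1) that sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])   # made-up outputs of the previous layer
probs = softmax(scores)
print(probs, probs.sum())             # approx. [0.705 0.259 0.035] 1.0
```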

The overall training process of the Convolutional Neural Network may be summarized as follows:

  • Step 1: We initialize all filters and parameters / weights with random values
  • Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
  • Step 3: Calculate the total error at the output layer:
  • Total Error = ∑ ½ (target probability − output probability)²
  • Step 4: Use backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameter values to minimize the output error. A minimal end-to-end training sketch follows these steps.
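A minimal end-to-end sketch of Steps 1-4 using Keras (TensorFlow). The architecture, the SGD optimizer, the mean-squared-error loss (which averages rather than sums the squared differences in Step 3) and the dummy data are all illustrative assumptions; compile() and fit() handle the random initialization, forward pass, backpropagation and gradient-descent updates.

```python
import numpy as np
from tensorflow.keras import layers, models

# Step 1: weights/filters are randomly initialized when the layers are built.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Step 3: squared-error style loss; Step 4: gradient descent via SGD.
model.compile(optimizer="sgd", loss="mse", metrics=["accuracy"])

# Dummy training data (assumed shapes): 100 images, 10 one-hot classes.
x_train = np.random.rand(100, 32, 32, 3)
y_train = np.eye(10)[np.random.randint(0, 10, 100)]

# Step 2 + Step 4: forward pass, error, backpropagation and weight updates.
model.fit(x_train, y_train, epochs=2, batch_size=16)
```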

For queries, feel free to write in the comment💬 section below. You can connect with me on LinkedIn!

Thank you for reading! Have a great day ahead😊
