Convolutional Neural Networks (CNN, or ConvNets)

Convolutional Neural Networks allow computers to see; in other words, ConvNets are used to recognize images by transforming the original image through a series of layers into class scores. CNNs were inspired by the visual cortex. Every time we see something, a series of layers of neurons is activated, and each layer detects a set of features such as lines and edges. Higher-level layers detect more complex features in order to recognize what we saw.

This article will present my brief notes about the elements that constitute Convolutional Neural Networks.

A ConvNet has two parts: feature learning (Conv, ReLU, and Pool) and classification (FC and softmax).

Reference: CS231n notes

Input (the training data):

  • The input layer or input volume is an image with dimensions [width x height x depth]. It is a matrix of pixel values.
  • Example: Input: [32x32x3] => (width=32, height=32, depth=3). The depth here represents the R, G, B channels.
  • The input size should be divisible by 2 many times. Common numbers include 32, 64, 96, 224, 384, and 512.

CONV layer:

The objective of a Conv layer is to extract features from the input volume.

A demo of a Conv layer with K = 2 filters, each with a spatial extent F = 3, moving at a stride S = 2, and input padding P = 1. (Reference: CS231n notes.)

Only a part of the image is connected to each neuron of the next Conv layer, because connecting every pixel of the input to the Conv layer would be too computationally expensive. Instead, we compute a dot product between a receptive field and a filter across all of the input's depth. The outcome of this operation is a single number in the output volume (feature map). Then we slide the filter over the next receptive field of the same input image by a stride and again compute the dot product between the new receptive field and the same filter. We repeat this process until we have gone through the entire input image. The output becomes the input for the next layer.
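The sliding dot product described above can be sketched in a few lines of NumPy. This is a naive, unoptimized illustration of one filter convolving over one image; the function name and shapes are mine, not from the notes.

```python
import numpy as np

def conv2d_single(image, filt, stride=1, padding=0):
    """Slide one filter over the image, computing a dot product
    at each receptive field (a naive sketch, not an optimized conv)."""
    # Zero-pad only the spatial dimensions of the input.
    image = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    H, W, D = image.shape
    F = filt.shape[0]                       # filter spatial extent (F x F x D)
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))          # one 2-D activation map per filter
    for i in range(out_h):
        for j in range(out_w):
            field = image[i*stride:i*stride+F, j*stride:j*stride+F, :]
            out[i, j] = np.sum(field * filt)  # dot product over all dimensions
    return out
```

A real Conv layer runs K such filters and stacks the K activation maps along the depth axis.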

Filter, Kernel, or Feature Detector is a small matrix used for feature detection. A typical filter on the first layer of a ConvNet might have a size of [5x5x3].

Convolved Feature, Activation Map or Feature Map is the output volume formed by sliding the filter over the image and computing the dot product.

Receptive field is a local region of the input volume that has the same size as the filter.

Depth is the number of filters.

Depth column (or fibre) is the set of neurons that are all pointing to the same receptive field.

Stride has the objective of producing spatially smaller output volumes. For example, with stride = 2, the filter shifts by 2 pixels as it convolves around the input volume. Normally, we set the stride so that the output volume size is an integer and not a fraction. Common strides: 1 or 2 (smaller strides work better in practice); uncommon: 3 or more.

Zero-padding adds zeros around the border of the input volume so that the convolution's output has the same spatial size as its input. Without padding, information at the borders would be lost after each Conv layer, the volumes would shrink quickly, and performance would suffer.
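For a stride of 1, the padding that preserves the spatial size is P = (F − 1) / 2, which follows from the output-size formula below. A minimal check (the array values are arbitrary):

```python
import numpy as np

F, S = 3, 1
P = (F - 1) // 2            # "same" padding for stride 1
x = np.arange(25.0).reshape(5, 5)
x_padded = np.pad(x, P)     # zeros added around the border -> 7x7
W_out = (x.shape[0] - F + 2 * P) // S + 1
# W_out == 5: the output keeps the input's spatial size
```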

How to compute the output volume[W2xH2xD2]?


  • W2=(W1−F+2P)/S+1
  • H2=(H1−F+2P)/S+1
  • D2=K


  • [W1xH1xD1] : input volume size
  • F: receptive field size
  • S: stride
  • P: amount of zero padding used on the border.
  • K: depth


What is the output volume of the first Convolutional Layer of Krizhevsky et al. architecture that won the ImageNet challenge in 2012?

Input size: [227x227x3], with W1=H1=227, F=11, S=4, P=0, and K=96.


  • (227 - 11) / 4 + 1 = 55
  • The size of the Conv layer output volume is [55x55x96].
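The formulas above are easy to wrap in a small helper; the function name is mine, and the check reproduces the Krizhevsky et al. numbers from the worked example.

```python
def conv_output_shape(W1, H1, F, S, P, K):
    """Apply the Conv output-size formulas; the stride must divide evenly."""
    assert (W1 - F + 2 * P) % S == 0, "hyperparameters don't fit the input"
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K          # depth of the output equals the number of filters

# First Conv layer of Krizhevsky et al. (2012):
conv_output_shape(227, 227, F=11, S=4, P=0, K=96)   # -> (55, 55, 96)
```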

Parameter Sharing (shared weights): the assumption is that if a feature is useful at one position, it is also useful to look for it everywhere in the image. However, sharing the same weights does not always make sense. For example, if the training data contains only centered faces, we don't need to look for eyes at the bottom or the top of the picture.

Dilation is a newer hyperparameter of the Conv layer: a dilated filter has spaces between its cells. For example, take a one-dimensional filter w of size 3 and an input x:

  • Dilation of 0: w[0]*x[0] + w[1]*x[1] + w[2]*x[2].
  • Dilation of 1: w[0]*x[0] + w[1]*x[2] + w[2]*x[4].
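The two bullets above can be checked directly; with dilation d, tap i of the filter reads x[i * (d + 1)]. The helper name is mine and the numbers are illustrative.

```python
import numpy as np

def dilated_dot(w, x, dilation=0):
    """One output of a 1-D filter w applied with the given dilation."""
    gap = dilation + 1
    return sum(w[i] * x[i * gap] for i in range(len(w)))

w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dilated_dot(w, x, dilation=0)   # w[0]*x[0] + w[1]*x[1] + w[2]*x[2] = 14.0
dilated_dot(w, x, dilation=1)   # w[0]*x[0] + w[1]*x[2] + w[2]*x[4] = 22.0
```

Note that dilation widens the receptive field (5 input cells instead of 3) without adding weights.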

ReLU layer:

The ReLU layer applies an elementwise activation function, max(0, x), which turns negative values into zeros (thresholding at zero). This layer does not change the size of the volume, and it has no hyperparameters.
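In NumPy, the whole layer is one elementwise call (the sample values are arbitrary):

```python
import numpy as np

x = np.array([[-2.0, 0.5],
              [ 3.0, -1.0]])
relu = np.maximum(0, x)   # elementwise max(0, x): negatives become zeros
# relu == [[0.0, 0.5], [3.0, 0.0]]; the shape is unchanged
```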

POOL layer:

The Pool layer reduces the spatial dimensions of the input and the computational complexity of the model, and it also helps control overfitting. It operates independently on every depth slice of the input. There are different pooling functions, such as max pooling, average pooling, and L2-norm pooling. Max pooling is the most widely used: it keeps only the maximum value (the brightest pixel) in each window of the input volume.

Example of max pooling with a 2x2 filter and stride = 2: for each window, max pooling takes the maximum of the 4 pixels.

Max pooling with 2x2 filter and stride = 2. Ref: Wikipedia

The Pool layer doesn’t have parameters (the weights and biases of neurons) and uses no zero-padding, but it has two hyperparameters: filter size (F) and stride (S). More generally, given an input of size W1×H1×D1, the pooling layer produces a volume of size W2×H2×D2 where:

  • W2=(W1−F)/S+1
  • H2=(H1−F)/S+1
  • D2=D1

A common form of max pooling uses filters of size 2x2 applied with a stride of 2. Pooling with larger filters is too destructive and usually leads to worse performance.
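The common 2x2, stride-2 case can be sketched on one depth slice with a reshape trick; the function name is mine, and it assumes even spatial dimensions.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice (H and W even)."""
    H, W = x.shape
    # View the array as non-overlapping 2x2 windows, then take each window's max.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 6, 1, 1]], dtype=float)
max_pool_2x2(x)   # -> [[4., 5.], [6., 3.]]
```

Each output value is the maximum of one 2x2 window, so the spatial size is halved while the depth stays the same.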

Many practitioners dislike the pooling layer because it throws away information, and they sometimes replace it with a Conv layer that uses a larger stride.

Fully-Connected Layer (FC):

Fully connected layers connect every neuron in one layer to every neuron in another layer. The last fully-connected layer uses a softmax activation function for classifying the generated features of the input image into various classes based on the training dataset.
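The softmax mentioned above turns the final FC layer's raw class scores into probabilities that sum to 1; a minimal sketch (the scores are arbitrary, and subtracting the max is a standard numerical-stability trick):

```python
import numpy as np

def softmax(scores):
    """Turn FC-layer class scores into probabilities that sum to 1."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# The largest score gets the largest probability; the class with the
# highest probability is the network's prediction.
```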

Example of a ConvNet architecture:

CIFAR-10 classification [INPUT — CONV — RELU — POOL — FC]
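Tracing volume shapes through this pipeline with the output-size formulas ties the pieces together. The Conv hyperparameters below (F=5, S=1, P=2, K=12) are illustrative choices of mine, not from the notes:

```python
# Shape trace through INPUT -> CONV -> RELU -> POOL -> FC on CIFAR-10.
W, D = 32, 3                      # INPUT: [32x32x3]
F, S, P, K = 5, 1, 2, 12          # illustrative Conv hyperparameters
W = (W - F + 2 * P) // S + 1      # CONV: [32x32x12] (padding preserves size)
D = K
# RELU leaves the volume size unchanged: still [32x32x12]
W = (W - 2) // 2 + 1              # POOL (2x2, stride 2): [16x16x12]
fc_inputs = W * W * D             # FC sees a flattened vector of 3072 values
num_classes = 10                  # FC output: one score per CIFAR-10 class
```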