Demystifying Convolutional Neural Networks

Aegeus Zerium
Sep 2, 2018 · 10 min read

An Intuitive Explanation of Convolutional Neural Networks.


Simply put, a Convolutional Neural Network (ConvNet) is a deep learning model, a specialized kind of multilayer perceptron, that is most commonly applied to analyzing visual imagery. The founding father of Convolutional Neural Networks is Yann LeCun, the well-known computer scientist working at Facebook, who was the first to use them to recognize handwritten digits, the problem now associated with the famous MNIST dataset.

Yann LeCun

Convolutional Neural Networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex.

Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

Computer vision vs Human vision

As you can see, we can’t talk about any type of neural network without mentioning a little neuroscience: the human body (especially the brain) and its functions have been the primary inspiration for various deep learning models.

The architecture of ConvNets:

ConvNet architecture

As you can see in the illustration above, a ConvNet architecture is very similar to a regular ANN architecture, especially in the last layers of the network (the fully connected layers). You will also notice that a ConvNet accepts a volume as input instead of a vector.

Let’s now explore the layers that constitute a ConvNet and the mathematical operations it performs to process and classify pictures based on the features it has learnt during training.

Input Layer:

The input is usually an n × m × 3 RGB (short for Red, Green and Blue) image, unlike a regular Artificial Neural Network, which is fed an n × 1 vector. Nothing hard to grasp here.

RGB image

Convolution Layer:

In the Convolution layer we compute the dot product between an area of the input image and a weight matrix called a filter. The filter slides across the whole image, repeating the same dot product at each position. Two things should be mentioned:

  • The filter must have the same number of channels as the input image.
  • It’s common to use more filters the deeper you go into the network; the intuition is that more filters let the network detect more edges and features.
Forward Convolution Operation
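
The sliding dot product described above can be sketched in a few lines of NumPy. This is a deliberately slow, loop-based illustration rather than how any framework implements it; the function name `conv2d_forward`, the array layout (height, width, channels) and the random inputs are assumptions made for the example.

```python
import numpy as np

def conv2d_forward(image, filters, stride=1, padding=0):
    """Slide each filter over the image and take a dot product at every position.

    image:   (H, W, C) array
    filters: (K, Fh, Fw, C) array, i.e. K filters with the same channel count C
    returns: (out_h, out_w, K) array, one output channel per filter
    """
    H, W, C = image.shape
    K, Fh, Fw, Fc = filters.shape
    assert Fc == C, "filter channels must match image channels"

    # Zero-pad height and width only, never the channel axis.
    padded = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    out_h = (H - Fh + 2 * padding) // stride + 1
    out_w = (W - Fw + 2 * padding) // stride + 1
    out = np.zeros((out_h, out_w, K))

    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                top, left = i * stride, j * stride
                patch = padded[top:top + Fh, left:left + Fw, :]
                # Dot product between the image patch and the k-th filter.
                out[i, j, k] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # illustrative 6 x 6 RGB input
filters = rng.random((4, 3, 3, 3))   # four 3 x 3 filters, 3 channels each
feature_maps = conv2d_forward(image, filters)
print(feature_maps.shape)  # (4, 4, 4)
```

Note that the last dimension of the result is 4 because four filters were used, not because the input had a particular number of channels.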

We calculate the dimensions of the output of the Convolution layer as follows:

Output Width = (W − Fw + 2P) / S + 1

Output Height = (H − Fh + 2P) / S + 1


  • W : the width of the input image
  • H : the height of the input image
  • Fw : the width of the filter or kernel
  • Fh : the height of the filter
  • P : padding
  • S : stride

The number of channels in the output of the Convolution layer equals the number of filters used in the convolution operation.
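
As a quick worked example, the standard output-size formula, (input size − filter size + 2 × padding) / stride + 1, can be wrapped in a small helper. The helper name `conv_output_size` and the sample numbers (a 28 × 28 input, a 5 × 5 filter, 6 filters) are illustrative assumptions; floor division is used for the case where the stride does not divide evenly.

```python
def conv_output_size(size, filter_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one spatial axis."""
    return (size - filter_size + 2 * padding) // stride + 1

# Illustrative values: 28 x 28 input, 5 x 5 filter, no padding, stride 1.
out_w = conv_output_size(28, 5)
out_h = conv_output_size(28, 5)
num_filters = 6  # the output depth equals the number of filters

print((out_h, out_w, num_filters))              # (24, 24, 6)
print(conv_output_size(28, 5, padding=2))       # 28, padding of 2 keeps the size
```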

Why Convolutions?

You are probably asking yourself why we use convolutions in the first place. Why not flatten the input images from the beginning? If we did that, we would end up with a massive number of parameters to train, and most of us don’t have the computational power to handle such an expensive task quickly. In addition, because ConvNets have fewer parameters, they are less prone to overfitting.
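
To put rough numbers on this, here is a back-of-the-envelope parameter count. The sizes are assumptions chosen only for illustration: a 224 × 224 RGB image, a dense layer with 1,000 units, and a convolution layer with 64 filters of size 3 × 3.

```python
# Flattening a 224 x 224 x 3 image and feeding it to a dense layer of 1000 units:
inputs = 224 * 224 * 3
dense_units = 1000
dense_params = inputs * dense_units + dense_units   # weights + biases

# A convolution layer with 64 filters of size 3 x 3 x 3 on the same image:
num_filters = 64
conv_params = 3 * 3 * 3 * num_filters + num_filters  # weights + biases

print(dense_params)  # 150529000
print(conv_params)   # 1792
```

The convolution layer reuses the same small filters at every position of the image, which is why its parameter count does not depend on the image size.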

Pooling Layer:

There are two widely used types of pooling, average pooling and max pooling, with the latter being the more common of the two. The pooling layer reduces the spatial dimensions (width and height), but not the depth, of its input. With max pooling we take the highest value (the most responsive area) in each n × m region of the input, whereas with average pooling we take the mean of the region instead.

Max Pooling
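
Both pooling variants can be sketched with the same loop, changing only the reduction. The function name `pool2d`, the 2 × 2 window with stride 2, and the small input matrix are assumptions made for the example.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool an (H, W, C) array over size x size windows; depth C is unchanged."""
    H, W, C = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = x[top:top + size, left:left + size, :]
            out[i, j, :] = reduce(window, axis=(0, 1))
    return out

# A single-channel 4 x 4 input, reshaped to (4, 4, 1).
x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]], dtype=float)[:, :, None]

max_pooled = pool2d(x, mode="max")[:, :, 0]    # rows: 6, 4 and 8, 9
avg_pooled = pool2d(x, mode="mean")[:, :, 0]   # rows: 3.75, 2.25 and 4.5, 4.0
print(max_pooled)
print(avg_pooled)
```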

Why Pooling?

One of the core goals of the pooling layer (max pooling in this case) is to provide spatial invariance, which simply means that you or the machine will be capable of recognizing an object as an object even when its appearance varies in some way. For a more in-depth explanation of the pooling layer, check this rigorous paper by Yann LeCun.

Non-linearity Layer:

In the Non-linearity layer we use the ReLU activation function most of the time, rather than the Sigmoid or Tanh activation functions. ReLU returns 0 for every negative value in its input and returns the value unchanged for every positive value (for a more in-depth explanation of activation functions, please check this article of mine).

ReLU activation function
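
In NumPy, ReLU is a one-liner. The sample feature map below is made up for illustration.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negatives become 0, positives pass through unchanged."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 0.5],
                        [3.0, -0.1]])
activated = relu(feature_map)
print(activated)
```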

Fully Connected Layer:

In the FC layer we flatten the output of the last Convolution layer and connect every node of the current layer to every node of the next layer. A fully connected layer is just another name for a regular artificial neural network, as you will see in the image below. The operations in the fully connected layer are exactly the same as in any artificial neural network:

Flattening of the Convolution layer
Fully connected layer
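
A minimal sketch of flattening followed by one fully connected layer is shown below. The shapes (a 5 × 5 × 16 volume and 10 output classes), the random weights, and the softmax at the end are assumptions for the example, not details taken from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative output of the last convolution/pooling stage.
feature_maps = rng.random((5, 5, 16))

# Flatten the volume into a single vector of 5 * 5 * 16 = 400 values.
x = feature_maps.reshape(-1)

# One fully connected layer: every input is connected to every output node.
W = rng.standard_normal((10, x.size)) * 0.01   # hypothetical weights
b = np.zeros(10)                               # hypothetical biases
z = W @ x + b

# Softmax turns the 10 scores into probabilities (an assumed choice of output).
exp = np.exp(z - z.max())
probs = exp / exp.sum()

print(x.shape, probs.shape)
```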

The layers and operations discussed above are the core components of every Convolutional neural network.

Now that we’ve discussed the operations that a ConvNet goes through in a forward pass, let’s jump to the operations it goes through in a backward pass.


Fully Connected Layer:

In the fully connected layer, backpropagation works exactly as in any regular artificial neural network. With gradient descent as the optimization algorithm, we need the partial derivative of the loss function with respect to the weights. To calculate it we use a well-known operation in calculus called the chain rule: we multiply the derivative of the loss function w.r.t. the activated output by the derivative of the activated output w.r.t. the non-activated output, and then by the derivative of the non-activated output w.r.t. the weights.

I was going to write the mathematical notation next to each partial derivative, but unfortunately Medium does not natively support mathematical expressions, which is quite infuriating to be honest. Anyway, the expression below summarizes what we’ve just said in plain text:

Backpropagation illustration
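
The three-factor chain rule product can be checked numerically on a single neuron. The sigmoid activation, the squared-error loss and the sample values are assumptions picked to keep the sketch short; the finite-difference comparison at the end is only a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative input, weights and target for one neuron.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
y = 1.0

def loss_fn(weights):
    a = sigmoid(weights @ x)
    return 0.5 * (a - y) ** 2

# Forward pass.
z = w @ x          # non-activated output
a = sigmoid(z)     # activated output

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = a - y
da_dz = a * (1 - a)
dz_dw = x
grad = dL_da * da_dz * dz_dw

# Sanity check with central finite differences.
eps = 1e-6
numeric = np.array([
    (loss_fn(w + eps * np.eye(3)[k]) - loss_fn(w - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad, numeric))  # True
```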

After calculating the gradient, we scale it by the learning rate and subtract it from the current weights to get the newly optimized ones:

θi+1 = θi − α ∇J(θi)


  • θi+1 : optimized weights
  • θi : initial weights
  • α : learning rate
  • ∇J(θi) : gradient of the loss function
Gradient descent

In the animation below, gradient descent is applied to linear regression. You can clearly see that the more the cost function is minimized, the better the linear model fits the data.

Gradient descent applied to linear regression
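
Here is a small sketch of gradient descent fitting a line. The synthetic data (a line with slope 2 and intercept 1 plus noise), the learning rate of 0.5 and the 500 steps are all made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data around the line y = 2x + 1 (illustrative only).
x = rng.uniform(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, 50)

X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
theta = np.zeros(2)                        # [intercept, slope]
alpha = 0.5                                # learning rate

def cost(params):
    """Half of the mean squared error."""
    return np.mean((X @ params - y) ** 2) / 2

history = []
for step in range(500):
    grad = X.T @ (X @ theta - y) / len(y)  # gradient of the cost w.r.t. theta
    theta = theta - alpha * grad           # gradient descent update
    history.append(cost(theta))

print(theta)                    # close to [1, 2]
print(history[0], history[-1])  # the cost shrinks as the fit improves
```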

Note that you should be careful when choosing the learning rate: a very high learning rate can cause gradient descent to overshoot the target minimum.

Small learning rate vs Big learning rate
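
The effect is easy to reproduce on a toy cost function. The sketch below minimises J(θ) = θ², whose gradient is 2θ; the function, the starting point and the two learning rates are assumptions chosen to make the contrast obvious.

```python
def gradient_descent(alpha, theta=1.0, steps=20):
    """Minimise J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

small = gradient_descent(alpha=0.1)  # each step multiplies theta by 0.8
large = gradient_descent(alpha=1.1)  # each step multiplies theta by -1.2

print(small)  # close to the minimum at 0
print(large)  # far from 0: every step overshoots and the error grows
```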

In all optimization tasks, whether in physics, economics or computer science, partial derivatives are used overwhelmingly. A partial derivative measures the rate of change of a dependent variable f(x, y, z) with respect to one of its independent variables while the other variables are held constant. For example, imagine you own a share of a company whose stock goes up or down based on multiple factors (security, politics, sales revenue, etc.). To apply partial derivatives to this situation, you would calculate how much the stock price changes when one factor (security, for example) changes while the other factors remain constant, and then repeat the same process for every other factor.
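
The stock example can be mimicked with a toy function of three variables. The formula inside `price` is entirely hypothetical; it exists only so that each partial derivative can be estimated by nudging one factor while holding the others fixed.

```python
def price(security, politics, revenue):
    """Hypothetical toy model of a stock price; not a real pricing formula."""
    return 10.0 + 2.0 * security - 1.5 * politics ** 2 + 0.5 * security * revenue

def partial(f, args, index, eps=1e-6):
    """Estimate the partial derivative of f w.r.t. args[index] by central differences."""
    up = list(args)
    down = list(args)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2 * eps)

point = (1.0, 2.0, 4.0)  # (security, politics, revenue)

d_security = partial(price, point, 0)  # analytically 2 + 0.5 * revenue = 4.0
d_politics = partial(price, point, 1)  # analytically -3 * politics = -6.0
d_revenue = partial(price, point, 2)   # analytically 0.5 * security = 0.5

print(d_security, d_politics, d_revenue)
```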

Pooling Layer:

In the Max Pooling layer the gradient is backpropagated through the maximum values only, since slightly changing the non-maximum values won’t affect the output. In practice we build a mask that is 1 at the position of each maximum (before max pooling) and 0 at every non-maximum position, then use the chain rule to multiply the gradient by that mask.

Backpropagation through the Pooling layer

Unlike the max pooling layer, in the average pooling layer the gradient passes through all the inputs (before average pooling), the maximum and the non-maximum ones alike.
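
For a single 2 × 2 pooling window, the two backward passes look like this. The window values and the upstream gradient of 2.0 are illustrative assumptions.

```python
import numpy as np

window = np.array([[1.0, 3.0],
                   [5.0, 6.0]])  # one 2 x 2 pooling window from the forward pass
upstream = 2.0                   # gradient arriving from the next layer (made up)

# Max pooling: the mask is 1 at the maximum and 0 elsewhere,
# so only the maximum receives the gradient.
max_mask = (window == window.max()).astype(float)
grad_max = max_mask * upstream

# Average pooling: every input contributed 1/4 of the output,
# so every input receives an equal share of the gradient.
grad_avg = np.full_like(window, upstream / window.size)

print(grad_max)
print(grad_avg)
```

If several entries tie for the maximum, this simple mask marks all of them; implementations differ in how they handle that case.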

Convolution layer:

You are probably asking yourself right now: if the forward pass of a convolution layer is a convolution, then what is its backward pass? Luckily, its backward pass is also a convolution (as you can clearly see below), so you don’t need to worry about learning a new set of hard-to-grasp mathematical operations.

Backpropagation through the Convolution layer


  • ∂hij: the derivative of the loss function w.r.t the output of the convolution layer

This is in a nutshell how backpropagation works in a Convolution layer.
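
The claim that the backward pass is itself a convolution can be checked on a small single-channel case with stride 1 and no padding. In this sketch (names and sizes are assumptions) the gradient w.r.t. the filter is obtained by sliding the output gradient over the input, and the gradient w.r.t. the input by sliding the flipped filter over the zero-padded output gradient.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide kernel k over x (no padding, stride 1), taking a dot product at each position."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.random((5, 5))        # input
w = rng.random((3, 3))        # filter
h = correlate2d_valid(x, w)   # forward pass, shape (3, 3)
dh = rng.random(h.shape)      # stand-in for the loss gradient w.r.t. the output

# Gradient w.r.t. the filter: slide dh over the input.
dw = correlate2d_valid(x, dh)

# Gradient w.r.t. the input: pad dh by (filter size - 1) and slide the flipped filter.
dx = correlate2d_valid(np.pad(dh, 2), w[::-1, ::-1])

# Finite-difference check of one filter entry, using loss = sum(h * dh).
eps = 1e-6
w_shift = w.copy()
w_shift[1, 2] += eps
numeric_dw = (np.sum(correlate2d_valid(x, w_shift) * dh) - np.sum(h * dh)) / eps

print(dw.shape, dx.shape)                 # (3, 3) (5, 5)
print(np.isclose(numeric_dw, dw[1, 2]))   # True
```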

Now that you have a robust theoretical understanding of Convolutional Neural Networks, let’s build our first ConvNet with TensorFlow.

Convolutional Neural Network with TensorFlow:

What is TensorFlow?

TensorFlow is an open source software library for numerical computation using data-flow graphs. It was originally developed by the Google Brain Team within Google’s Machine Intelligence research organization for machine learning and deep neural networks research.

What is a Tensor?

A tensor is an organized multidimensional array of numerical values. The order (also degree or rank) of a tensor is the dimensionality of the array needed to represent it.

Types of tensors
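
NumPy arrays make the idea of rank concrete: `ndim` reports the number of dimensions. The shapes below are illustrative; the rank-4 example assumes a batch of 32 grayscale 28 × 28 images.

```python
import numpy as np

scalar = np.array(5.0)                    # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])        # rank 1: a list of numbers
matrix = np.ones((2, 3))                  # rank 2: rows and columns
image_batch = np.zeros((32, 28, 28, 1))   # rank 4: batch, height, width, channels

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("image_batch", image_batch)]:
    print(name, "rank:", tensor.ndim, "shape:", tensor.shape)
```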

What is a Computational Graph?

Computational graphs are a powerful formalism that has been extremely fruitful in deriving algorithms and software packages for neural networks and other machine learning models. The basic idea is to express a model (for example, a feedforward neural network) as a directed graph describing a sequence of computational steps. Each step corresponds to a vertex in the graph and performs a simple operation that takes some inputs and produces an output as a function of them.

In the illustrated graph below we have two inputs, w1 = x and w2 = y. The inputs flow through the graph, where each node is a mathematical operation, to give the following outputs:

  • w3 = cos(x) where the operation is the Cosine trigonometric function
  • w4 = sin(x) where the operation is the Sine trigonometric function
  • w5 = w3∙w4 where the operation is multiplication
  • w6 = w1/w2 where the operation is division
  • w7 = w5+w6 where the operation is addition
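
The graph above can be evaluated node by node. The following is a framework-free sketch in plain Python, not TensorFlow code; the function name `evaluate_graph` and the sample inputs are assumptions.

```python
import math

def evaluate_graph(x, y):
    """Evaluate the graph one node at a time, in dependency order."""
    w1, w2 = x, y        # inputs
    w3 = math.cos(w1)    # cosine node
    w4 = math.sin(w1)    # sine node
    w5 = w3 * w4         # multiplication node
    w6 = w1 / w2         # division node
    w7 = w5 + w6         # addition node (final output)
    return w7

result = evaluate_graph(math.pi / 4, 2.0)
print(result)  # cos(pi/4) * sin(pi/4) + (pi/4) / 2
```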

Now that we understand what a computational graph is, let’s build the same one in TensorFlow.


Visualization with TensorBoard:

What is TensorBoard?

TensorBoard is a suite of web applications for inspecting and understanding your TensorFlow runs and graphs. It’s one of the biggest edges that Google’s TensorFlow has over Facebook’s PyTorch.

The previous code visualized in TensorBoard

Now that you have a robust understanding of ConvNets, TensorFlow and TensorBoard, let’s build our first ConvNet that will recognize handwritten digits using the MNIST dataset.

MNIST dataset

The architecture of our ConvNet will be a set of convolution, max-pooling and non-linearity layers, similar to the LeNet-5 architecture.
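
To get a feel for how shapes evolve in a LeNet-5-style stack, the sketch below traces a 28 × 28 grayscale MNIST image through two convolution and two pooling stages. The layer settings (5 × 5 filters, 6 then 16 filters, 2 × 2 pooling, padding of 2 on the first convolution) follow the classic LeNet-5 layout and are assumptions; they are not necessarily the exact settings used in this article's network.

```python
def conv_out(size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (size - filter_size + 2 * padding) // stride + 1

# (kind, filter size, padding, stride, number of filters) -- assumed LeNet-5-like values
layers = [
    ("conv", 5, 2, 1, 6),
    ("pool", 2, 0, 2, None),
    ("conv", 5, 0, 1, 16),
    ("pool", 2, 0, 2, None),
]

size, channels = 28, 1  # MNIST images are 28 x 28 with one channel
shapes = []
for kind, f, p, s, k in layers:
    size = conv_out(size, f, p, s)
    if kind == "conv":
        channels = k    # convolution sets the depth; pooling leaves it unchanged
    shapes.append((size, size, channels))

flattened = size * size * channels
print(shapes)     # [(28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16)]
print(flattened)  # 400 values feed the fully connected layers
```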

Convolutional Neural Network 3D Simulation


The code is lengthy but shouldn’t be intimidating if you break it down piece by piece.

If you run the program, your results should look like this:

We have just finished building our first convolutional neural network. As you can see in the results above, the accuracy increased dramatically from the first step to the last, but there is still room to improve our ConvNet.

Let’s now visualize our ConvNet in TensorBoard:

ConvNet Visualization
Accuracy and Loss evaluation


Convolutional Neural Networks are powerful deep learning models that are applied in a wide range of fields, such as radiology. The use of ConvNets will only increase as datasets get bigger and problems become more sophisticated and challenging.


You can find the Jupyter Notebook of this article in:


Written by Aegeus Zerium

A finance and banking university student occasionally writing about cryptocurrencies and machine learning.