Convolutional Neural Networks — Simplified

An un-convoluted explanation of a powerful algorithm widely used in image recognition tasks

Take a moment to observe your surroundings. Even if you are sitting still on your chair or lying on your bed, your brain is constantly trying to analyze the dynamic world around you. Without your conscious effort, your brain is continuously making predictions and acting upon them.

Photo by Adrienn

After just a brief look at this photo you identified that there are humans and objects in the scene. You immediately identified some of the objects in the scene as wine glasses, plate, table, lights etc. You probably also guessed that the ladies in the photograph are enjoying their meal. How were you able to make those predictions? How did you identify the numerous objects in the picture?

It took nature millions of years of evolution to achieve this remarkable feat. Our eyes and our brain work in perfect harmony to create such beautiful visual experiences. The system that makes this possible comprises the eye, the visual pathway and the visual cortex inside our brain.

Primary visual pathway

Our eyes capture the light and colors falling on the retina. The receptors on the retina pass these signals to the optic nerve, which carries them to the brain to make sense of this information. The eye and the visual cortex form a very complex, hierarchical structure. The whole visual pathway plays an important role in the process of understanding and making sense of what we see around us. It is this system inside us that allows us to make sense of the picture above, the text in this article and all the other visual recognition tasks we perform every day.

We’ve been doing this since our childhood. We were taught to recognize an umbrella, a dog, a cat or a human being. Can we teach computers to do so? Can we make a machine which can see and understand as well as humans do?

Computers “see” the world in a different way than we do. They can only “see” anything in the form of numbers.

Each number on the right represents a pixel

Teaching computers to make sense of this bewildering array of numbers is a challenging task. Computer scientists have spent decades building systems, algorithms and models which can understand images. Today, in the era of Artificial Intelligence and Machine Learning, we have achieved remarkable success in identifying objects in images, identifying the context of an image, detecting emotions etc. One of the most popular algorithms used in computer vision today is the Convolutional Neural Network, or CNN.

Inspiration behind CNN

CNN is a type of neural network which loosely draws inspiration from the workings and hierarchical structure of the primary visual pathway of the brain.


In the 1950s and 1960s, David Hubel and Torsten Wiesel conducted experiments on the brains of mammals and suggested a model for how mammals perceive the world visually. They recorded activity from neurons in the visual cortex of a cat as they moved a bright line across its retina. During their recordings, they noticed a few interesting things:

  1. The neurons fired only when the line was in a particular place on the retina
  2. The activity of these neurons changed depending on the orientation of the line
  3. Sometimes the neurons fired only when the line was moving in a particular direction


In their paper, they described two basic types of visual neuron cells in the brain that each act in a different way: simple cells (S cells) and complex cells (C cells) which are arranged in a hierarchical structure.

The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle. The complex cells have larger receptive fields, and their output is not sensitive to the specific position in the field. A complex cell continues to respond to a certain stimulus even though its absolute position on the retina changes. I’ve used some jargon here, so let us try to understand what a receptive field is.

Receptive field for sensory neurons

In the image above, three primary neurons have their own receptive fields: the blue neuron will be activated only if there is a stimulus in the blue region, the yellow primary neuron will be activated only if there is a stimulus in the yellow region, and so on. If there is a stimulus in an overlap region, all the neurons associated with that region will get activated.

Apart from simple and complex cells the hierarchical structure of the brain plays an important role in storing and making sense of information. In 1980 Kunihiko Fukushima proposed a hierarchical neural network called Neocognitron which was inspired by the simple and complex cell model. The neocognitron was able to recognize patterns by learning about the shapes of objects.

Later, in 1998, LeCun, Bottou, Bengio and Haffner introduced Convolutional Neural Networks. Their first Convolutional Neural Network was called LeNet-5 and was able to classify digits from hand-written numbers.

That was a brief history of CNNs. You can read more about their history and evolution all over the internet.


Convolutional Neural Networks

If you have a basic idea about multi-layer perceptron and neural networks you already understand a small part of the whole structure of a CNN.

CNN is composed of two major parts:

  1. Feature Extraction
  2. Classification
Basic components of a CNN. Source: ResearchGate

Don’t worry about the perplexing squares and lines inside the red dotted region; we will break it down later. The green circles inside the blue dotted region, labeled classification, form the neural network or multi-layer perceptron which acts as a classifier. The inputs to this network come from the preceding part, feature extraction.

1) Feature Extraction

This is the part of the CNN architecture from which the network derives its name. Convolution is the mathematical operation which is central to the efficacy of this algorithm. Let’s understand on a high level what happens inside the red enclosed region. The input to the red region is the image which we want to classify, and the output is a set of features. Think of features as attributes of the image; for instance, an image of a cat might have features like whiskers, two ears, four legs etc. A handwritten digit image might have features such as horizontal and vertical lines, or loops and curves. Let’s see how we extract such features from the image.

Feature Extraction: Convolution

Convolution in a CNN is performed on an input image using a filter, or kernel. To understand filtering and convolution, make a small peephole with the help of your index finger and thumb by rolling them together as you would to make a fist. Now look at your screen through this peephole; you can see only a very small part of the screen. You would have to scan the screen starting from the top left, moving right, then down a bit after covering the width of the screen, and repeat the same process until you are done scanning the whole screen.

Convolution of an image with a kernel works in a similar way. The kernel, or filter, which is a small matrix of values, acts as the peephole: it performs a mathematical operation on the image while scanning over it in the same way. For instance, suppose the input image and the filter look like —

The filter (green) slides over the input image (blue) one pixel at a time starting from the top left. The filter multiplies its own values with the overlapping values of the image while sliding over it and adds all of them up to output a single value for each overlap.

Source: Applied deep learning

In the above animation the value 4 (top left) in the output matrix (red) corresponds to the filter overlap on the top left of the image which is computed as —

Similarly, we compute the other values of the output matrix. Note that the top-left value, 4, in the output matrix depends only on the 9 values (3x3) in the top left of the original image matrix. It does not change even if the rest of the values in the image change. This is the receptive field of this output value, or neuron, in our CNN. Each value in our output matrix is sensitive to only a particular region of our original image. Scroll up to the overlapping receptive field diagram; do you notice the similarity? Each adjacent value (neuron) in the output matrix has an overlapping receptive field, like our red, blue and yellow neurons in the earlier picture. The animation below will give you a better sense of what happens in convolution.
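To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch of the operation described above. The 5x5 input and 3x3 filter below are the matrices commonly used in this well-known animation; I am assuming they match the original image.

```python
import numpy as np

# Example 5x5 input image (blue) and 3x3 filter (green), assumed from the animation
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

def convolve2d(image, kernel):
    """Slide the kernel over the image one pixel at a time ('valid' mode):
    multiply overlapping values element-wise and sum them into one output cell."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

print(convolve2d(image, kernel))
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

The top-left output is 4, matching the value discussed above. Strictly speaking, this operation (without flipping the kernel) is cross-correlation, which is what most deep learning libraries implement under the name "convolution".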

OK, so that is the basic idea of the convolution operation. What does performing this operation on the image achieve? What is the output of it?

Let’s say we have a handwritten digit image like the one below, and we want to extract only the horizontal edges or lines from the image. We will use a filter, or kernel, which, when convolved with the original image, dims out all those areas which do not have horizontal edges.

Notice how the output image contains only the horizontal white line, while the rest of the image is dimmed. The kernel here acts like a peephole with a horizontal slit. Similarly, for a vertical edge extractor the filter is like a vertical-slit peephole, and the output would look like —
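As a rough sketch of what such an edge-extracting filter does, consider a tiny image containing a single horizontal white line. The Sobel kernel below is a standard horizontal-edge detector and an assumption of mine; the article's figures may use a different kernel.

```python
import numpy as np

# Tiny image: a single horizontal white line on a black background
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# Horizontal-edge (Sobel) kernel: responds where brightness changes vertically
horizontal_edge = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
])

def convolve2d(image, kernel):
    # 'Valid' sliding-window convolution, as in the earlier example
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

print(convolve2d(image, horizontal_edge))
# [[ 4.  4.  4.]
#  [ 0.  0.  0.]
#  [-4. -4. -4.]]
```

The filter responds strongly only along the top and bottom edges of the line; convolving the same image with the transposed (vertical-edge) kernel would produce all zeros, since the image has no vertical edges.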

Feature Extraction: Non-Linearity

If you go back and read about basic neural networks, you will notice that each successive layer of a neural network is a linear combination of its inputs. Introducing non-linearity through an activation function allows us to classify our data even when it is not linearly separable.

Left: Linearly separable vs. Right: Not linearly separable

This leads us to another important operation: non-linearity, or activation. After sliding our filter over the original image, the output we get is passed through another mathematical function, called an activation function. The activation function most commonly used in CNN feature extraction is ReLU, which stands for Rectified Linear Unit. ReLU simply converts all negative values to 0 and keeps the positive values unchanged.

After passing the outputs through ReLU functions they look like below —

So for a single image, by convolving it with multiple filters we can get multiple output images. For the handwritten digit here, we applied a horizontal edge extractor and a vertical edge extractor and got two output images. We can apply several other filters to generate more such output images, which are also referred to as feature maps.

Feature Extraction: Pooling

After a convolution layer, once you get the feature maps, it is common to add a pooling or sub-sampling layer. Pooling reduces the dimensionality of the feature maps, which reduces the number of parameters and the amount of computation in the network. This shortens training time and controls over-fitting.

The most common type of pooling is max pooling, which takes the maximum value in each window. The window slides over the feature map much like our earlier kernel. This decreases the feature map size while keeping the significant information.

Max pooling Source: CS231
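A minimal max-pooling sketch, using the 4x4 example popularized by the CS231n notes (a 2x2 window with stride 2; I am assuming the matrix matches the figure above):

```python
import numpy as np

feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

def max_pool(x, size=2, stride=2):
    """Take the maximum over each size x size window, moving by `stride`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for r in range(oh):
        for c in range(ow):
            out[r, c] = x[r*stride:r*stride+size, c*stride:c*stride+size].max()
    return out

print(max_pool(feature_map))
# [[6 8]
#  [3 4]]
```

Each 2x2 block of the 4x4 input collapses to its single largest value, halving both dimensions.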

2) Classification

Alright, so now we have all the pieces required to build a CNN: convolution, ReLU and pooling. The output of max pooling is fed into the classifier we discussed initially, which is usually a multi-layer perceptron, a.k.a. a fully connected layer. In CNNs these layers are usually used more than once, i.e. Convolution -> ReLU -> Max-Pool -> Convolution -> ReLU -> Max-Pool and so on. We won’t discuss the fully connected layer in this article; you can read this article for a basic intuitive understanding of it.
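Putting the pieces together, a single Convolution -> ReLU -> Max-Pool stage can be sketched in plain NumPy as below. The 8x8 input and random kernel are placeholders of mine; the point is only the shape bookkeeping as data flows through the stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(image, kernel):
    # 'Valid' sliding-window convolution
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    return np.array([[x[r*stride:r*stride+size, c*stride:c*stride+size].max()
                      for c in range(ow)] for r in range(oh)])

image = rng.standard_normal((8, 8))   # placeholder 8x8 "image"
kernel = rng.standard_normal((3, 3))  # placeholder filter (learned, in a real CNN)

x = convolve2d(image, kernel)  # 8x8 -> 6x6
x = relu(x)                    # same shape, negatives zeroed
x = max_pool(x)                # 6x6 -> 3x3
print(x.shape)  # (3, 3)
```

In a real CNN the kernel values are learned by backpropagation, several such stages are stacked, and the final flattened output feeds the fully connected classifier.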

Final Remarks

CNN is a very powerful algorithm which is widely used for image classification and object detection. Its hierarchical structure and powerful feature extraction capabilities make it very robust across various image and object recognition tasks. There are numerous architectures of Convolutional Neural Networks, like LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, ResNet etc., but the basic idea behind these architectures remains the same.

LeNet architecture

In this article I have not dealt with the training of these networks and their kernels. Training them is similar to training a multi-layer perceptron using backpropagation, but the mathematics is a bit more involved because of the convolution operations.

I’ve touched upon the very basics of the CNN architecture, its building blocks and its inspirations. Hopefully it has slightly demystified and eased your understanding of CNN architectures like the one above. This article is intended to spark your curiosity to explore and learn further, not because your boss has asked you to learn about CNNs, but because learning is fun!


Recommendations & References

X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it. We publish an article on such simplified AI concepts every Friday. If you liked this or have some feedback or follow-up questions please comment below.

Thanks for Reading!