
Introduction to how CNNs Work

Simran Bansari
Feb 12, 2019 · 5 min read

Convolutional neural networks (CNNs) are one of the main classes of neural networks. They are widely used for image recognition and classification tasks such as detecting objects and recognizing faces. CNNs are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes the result through an activation function, and returns an output.
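As a rough illustration of a single neuron, the sketch below computes a weighted sum of its inputs plus a bias and passes it through an activation function. The sigmoid activation and all the numeric values are assumptions chosen for the example, not taken from any particular network.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(inputs, weights) + bias
    return sigmoid(z)

# Illustrative values only (assumptions for this sketch).
x = np.array([1.0, 2.0, 3.0])       # inputs
w = np.array([0.5, -0.25, 0.25])    # learnable weights
b = 0.5                             # learnable bias

out = neuron(x, w, b)
print(out)
```

During training, the weights and bias are the values that get adjusted; the structure of the computation stays the same.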

CNNs are primarily used to classify images, cluster them by similarity, and perform object recognition. Many algorithms built on CNNs can identify faces, street signs, animals, and so on.


How do CNNs work?

CNNs operate on volumes and take multi-channel images as input. A person sees an image as flat, with only width and height, but a CNN also sees depth. Digital color images use red-green-blue (RGB) encoding, in which those three colors are mixed to produce the spectrum humans perceive.

A convolutional network ingests such an image as three separate layers of color stacked one on top of the other. A color image is therefore treated as a rectangular box whose width and height are measured in pixels and whose depth is three, one layer for each of the RGB colors. These depth layers are referred to as channels.
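To make the idea of channels concrete, here is a small sketch that represents a color image as a NumPy array of shape height x width x channels. The 4 x 6 image size and the pixel values are made up for the example.

```python
import numpy as np

# A hypothetical 4 x 6 pixel color image: height x width x channels.
height, width, channels = 4, 6, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

image[..., 0] = 255   # fill the red channel completely
image[..., 2] = 128   # add some blue; green stays at 0

# Each channel on its own is a flat 2D grid of pixel intensities.
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

print(image.shape)  # (4, 6, 3)
print(red.shape)    # (4, 6)
```

Note that some libraries store the channel axis first instead of last; the layout here is just one common convention.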


The first layer in a CNN is the CONVOLUTIONAL LAYER, which is the core building block and does most of the computational heavy lifting. The data or image is convolved using filters, also called kernels. Filters are small units that we apply across the data through a sliding window. The depth of the filter is the same as the depth of the input: for a color image, whose RGB depth is 3, a filter of depth 3 is applied. The process involves taking the element-wise product of the filter and the patch of the image under it, then summing those values, at every position of the sliding window. The output of convolving a color image with a single 3D filter is a 2D matrix.
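The sliding-window operation described above can be sketched in a few lines of NumPy. This is a minimal, unoptimized version that assumes a stride of 1 and no padding; the function name `convolve` and the all-ones test data are illustrative. Like most deep learning libraries, it does not flip the kernel before applying it.

```python
import numpy as np

def convolve(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding); return a 2D map."""
    ih, iw, ic = image.shape
    kh, kw, kc = kernel.shape
    if ic != kc:
        raise ValueError("filter depth must match image depth")
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]   # region under the filter
            out[i, j] = np.sum(patch * kernel)     # element-wise product, then sum
    return out

image = np.ones((6, 6, 3))     # toy 6 x 6 RGB image
kernel = np.ones((3, 3, 3))    # one 3 x 3 filter with depth 3
feature_map = convolve(image, kernel)

print(feature_map.shape)   # (4, 4)
print(feature_map[0, 0])   # 27.0
```

Even though both the image and the filter are 3D, each window position collapses to a single number, which is why the result is 2D.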

The best way to picture a convolutional layer is to imagine a flashlight shining over the top left of the image, covering a 5 x 5 area. Now imagine this flashlight sliding across all areas of the input image. The flashlight is called a filter (sometimes also referred to as a neuron or a kernel), and the region it is shining over is called the receptive field. The filter is itself an array of numbers, which are called weights or parameters.
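To see how many places the flashlight can stop, the sketch below lists the top-left corner of every receptive field. The 32 x 32 input size and the stride of 1 are assumptions for the example; only the 5 x 5 filter size comes from the text.

```python
def output_size(input_size, filter_size, stride=1):
    """Number of positions a filter can occupy along one dimension (no padding)."""
    return (input_size - filter_size) // stride + 1

def receptive_fields(input_size, filter_size, stride=1):
    """Yield the (row, col) top-left corner of every receptive field."""
    n = output_size(input_size, filter_size, stride)
    for r in range(n):
        for c in range(n):
            yield (r * stride, c * stride)

positions = list(receptive_fields(32, 5))   # 32 x 32 input is an assumption

print(output_size(32, 5))            # 28
print(len(positions))                # 784
print(positions[0], positions[-1])   # (0, 0) (27, 27)
```

Each of those 784 positions produces one number, so under these assumptions a single 5 x 5 filter yields a 28 x 28 output.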


Second is the ACTIVATION LAYER, which applies the ReLU (Rectified Linear Unit) function. In this step we apply the rectifier function, which replaces every negative value with zero, to increase non-linearity in the CNN. This matters because images are made of different objects that are not linear to each other.
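The rectifier itself is a one-liner. The sketch below applies it to a small made-up feature map; the values are illustrative only.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keep positive values, replace negatives with 0."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.0, -0.5]])
activated = relu(feature_map)

print(activated.tolist())  # [[0.0, 1.5], [0.0, 0.0]]
```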

Third is the POOLING LAYER, which downsamples the features. It is applied independently to every depth slice of the 3D volume. Typically this layer has two hyperparameters:

  1. Spatial extent: the value n such that each n x n window of the feature representation is mapped to a single value
  2. Stride: how many features the sliding window skips along the width and height

A common POOLING LAYER uses a 2 x 2 max filter with a stride of 2, which makes it a non-overlapping filter. A max filter returns the maximum value among the features within the region. For example, given a 26 x 26 x 32 volume, a max pool layer with 2 x 2 filters and a stride of 2 reduces it to a 13 x 13 x 32 feature map.
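The 26 x 26 x 32 example can be checked with a short NumPy sketch. The function name `max_pool` and the random input values are assumptions for illustration; the pooling is applied to each depth slice independently, so the depth of 32 is unchanged.

```python
import numpy as np

def max_pool(volume, size=2, stride=2):
    """Max pooling over height and width, applied to each depth slice."""
    h, w, d = volume.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w, d), dtype=volume.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = volume[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max per depth slice
    return out

volume = np.random.default_rng(0).random((26, 26, 32))   # random stand-in data
pooled = max_pool(volume)

print(volume.shape)  # (26, 26, 32)
print(pooled.shape)  # (13, 13, 32)
```

Because the stride equals the window size, every input value belongs to exactly one window, which is what makes the filter non-overlapping.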


Last is the FULLY CONNECTED LAYER, which begins with flattening. This involves transforming the entire pooled feature map matrix into a single column, which is then fed to the neural network for processing. The fully connected layers combine these features to create a model. Finally, an activation function such as softmax or sigmoid is applied to classify the output.
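The final stage can be sketched as a flatten step, one fully connected layer, and a softmax. Everything numeric here is an assumption for illustration: the 13 x 13 x 32 input reuses the pooling example, the 10 output classes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = rng.random((13, 13, 32))   # stand-in for pooled feature maps

flat = pooled.reshape(-1)           # flatten into a single vector

num_classes = 10                    # hypothetical number of classes
weights = rng.normal(0, 0.01, size=(num_classes, flat.size))  # untrained
bias = np.zeros(num_classes)

scores = weights @ flat + bias      # fully connected layer
probs = softmax(scores)             # class probabilities

print(flat.shape)    # (5408,)
print(probs.shape)   # (10,)
```

In a trained network, the class with the highest probability would be taken as the prediction; with random weights, as here, the output is meaningless beyond showing the shapes.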

Key Takeaways

  • Each input image is passed through a series of convolution layers with filters
  • Digital color images use red-green-blue (RGB) encoding, so CNNs read them as three stacked channels
  • The Convolutional Layer, Activation Layer, Pooling Layer, and Fully Connected Layer are connected in sequence so that CNNs can process data and classify images

Thank you for reading my article. If you enjoyed the read, please clap and leave any feedback in the comments below. If you want to reach out to me, you can connect with me through LinkedIn.
