Convolutional Neural Networks (CNN) Tutorial

Onur Akköse

Published in

Analytics Vidhya

4 min read · Nov 8, 2020

I prepared this tutorial for beginners.

This tutorial is a guide to understanding what a CNN is.

Computer Vision Problems:

Image Classification
Object Detection or Recognition
Neural Style Transfer

Types of layers in a CNN:

Convolution
Pooling
Fully Connected

Convolutional Neural Networks

1. Convolution Operation

The convolution operation is one of the fundamental building blocks of a CNN.
‘*’ is the notation for convolution.
We have an input matrix (the input picture) and a filter (feature detector).
The filter is usually a 3x3 matrix, but this is not a rule.
A filter detects features such as horizontal or vertical lines and convex shapes in the picture. For example, in a picture of a person, filters can help find the ears, the nose, etc.
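To make the sliding-filter idea concrete, here is a minimal NumPy sketch of the operation. The function name `conv2d` and the toy values are assumptions made for illustration, not part of any library. Like most deep learning frameworks, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out_h = (n_h - f_h) // stride + 1
    out_w = (n_w - f_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + f_h, c:c + f_w]   # region currently under the filter
            out[i, j] = np.sum(patch * kernel)    # element-wise product, then sum
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "picture" with values 0..15
kernel = np.ones((3, 3))                          # toy 3x3 filter that just sums each patch
feature_map = conv2d(image, kernel)
print(feature_map)  # 2x2 feature map: [[45, 54], [81, 90]]
```

In a real CNN the filter values are not chosen by hand; they are learned during training.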

Edge Detection

Suppose we are given a picture and want to figure out what object is in it. The first thing we might do is detect the vertical or horizontal edges.

In this example, we have a 6x6 input matrix and a 3x3 filter. At the end of the convolution operation we get a 4x4 matrix. The reason is:

nxn (input matrix) * fxf (filter) with stride s = 1 => (n - f) / s + 1 = (6 - 3) / 1 + 1 = 4, so the output is 4x4
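The 6x6 case can be checked with a small NumPy sketch. The toy image (bright left half, dark right half) and the particular vertical edge filter are assumptions chosen for illustration; other edge filters exist.

```python
import numpy as np

# 6x6 toy image: bright (10) on the left half, dark (0) on the right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# A common 3x3 vertical edge filter: positive left column, negative right column.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]], dtype=float)

n, f = image.shape[0], vertical_filter.shape[0]
out_size = n - f + 1  # 6 - 3 + 1 = 4 (stride 1, no padding)

edges = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        edges[i, j] = np.sum(image[i:i + f, j:j + f] * vertical_filter)

print(edges)  # every row is [0, 30, 30, 0]
```

The large values in the middle columns mark the place where the image changes from bright to dark, which is the vertical edge.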

If we increase the stride, the result will change. For example:

for nxn (n = 6) * fxf (f = 3) with stride s = 2 => floor((n - f) / s) + 1 = floor(3 / 2) + 1 = 2, so the output is 2x2
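The output size can also be computed with a small helper. This is a sketch of the standard formula floor((n + 2p - f) / s) + 1; the function name `conv_output_size` and the `padding` argument (the number of border pixels added on each side, 0 when there is none) are assumptions added for illustration.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Side length of the output for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(6, 3))            # 4 -> 4x4 output
print(conv_output_size(6, 3, stride=2))  # 2 -> 2x2 output
```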

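When we apply edge detection this way, the output is smaller than the input (and even smaller with a larger stride), so we lose some information, but the model runs faster.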

Padding

After edge detection, we often need padding. In the edge detection step, we saw that a 6x6 input and a 3x3 filter produce a 4x4 matrix. Every time we apply the convolution operation, our image shrinks, so we can repeat the operation only two or three times before the image becomes really small. We are also throwing away information near the edges of the image, because edge pixels are covered by fewer filter positions than pixels in the middle. To solve these problems, we can pad the image. If we pad the 6x6 image with a one-pixel border, we get an 8x8 image. Applying the 3x3 filter to the padded image gives a 6x6 matrix (8 - 3 + 1 = 6). So, we preserve the original input size.
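A minimal sketch of this padding step with NumPy's `np.pad`, assuming the border is filled with zeros (the usual choice, although the text above does not specify a fill value):

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 image

# Add a one-pixel border of zeros on every side: 6x6 -> 8x8.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (8, 8)

f = 3                               # filter size
out_size = padded.shape[0] - f + 1  # 8 - 3 + 1
print(out_size)                     # 6, the same as the original input size
```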

2. Pooling Operation

We apply pooling to reduce the size of the feature maps in the network and to speed up the computation. We can apply average pooling or max pooling. Let’s suppose we have a 4x4 input matrix. If we apply max pooling with a 2x2 filter and a stride of 2, the output will be a 2x2 matrix: each output value is the maximum of one 2x2 region of the input. Pooling has two hyperparameters, filter size (f) and stride (s).
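Here is a small NumPy sketch of max pooling on a 4x4 input, assuming a 2x2 filter and a stride of 2. The function name `max_pool2d` and the input values are made up for illustration.

```python
import numpy as np

def max_pool2d(x, f=2, stride=2):
    """Keep the largest value in each f x f window of `x`."""
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + f, c:c + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

pooled = max_pool2d(x)
print(pooled)  # [[9, 2], [6, 3]]
```

Replacing `.max()` with `.mean()` in the loop would turn this into average pooling.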

Flattening

Flattening converts the output of the convolutional layers into a one-dimensional array so that it can be fed into the next layer, which is a fully connected layer.
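In NumPy, flattening is just a reshape. The sketch below assumes a hypothetical output of two 2x2 feature maps; real networks usually have many more.

```python
import numpy as np

# Hypothetical output of the last pooling layer: 2 feature maps of size 2x2.
feature_maps = np.arange(8, dtype=float).reshape(2, 2, 2)

flat = feature_maps.reshape(-1)  # 2 * 2 * 2 = 8 values in one long vector
print(flat.shape)  # (8,)
```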

3. Fully Connected Layer

Fully connected (FC) layers are the last layers in a network. The input to the first FC layer is the flattened output of the final convolution or pooling layer.
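As a rough sketch, a fully connected layer multiplies the flattened vector by a weight matrix and adds a bias. The softmax at the end is an assumption (a common choice for classification, not something stated above), and the random weights stand in for values that a real network would learn.

```python
import numpy as np

def dense(x, weights, bias):
    """Fully connected layer: every output is connected to every input."""
    return weights @ x + bias

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(8)                # hypothetical flattened features
W = rng.standard_normal((3, 8))  # 3 output classes, 8 inputs (random, untrained)
b = np.zeros(3)

probs = softmax(dense(x, W, b))
print(probs.shape)  # (3,)
```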

Classic Networks

1. LeNet-5 was created in the 1990s to recognize handwritten characters.

2. AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge in 2012.

3. VGG16 was introduced in 2014 and trained on ImageNet. It has 16 weight layers.

Data Augmentation

Data augmentation is a technique to improve the performance of computer vision systems. It expands the size of a dataset by creating modified versions of the images in the dataset. Ways to modify images include mirroring, random cropping, rotation, shearing, color shifting, etc.
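A few of these modifications can be sketched with plain NumPy. The helper names (`mirror`, `random_crop`, `rotate90`) are made up for illustration, and the sketch assumes a single-channel image stored as a 2D array; deep learning libraries ship their own, more complete augmentation tools.

```python
import numpy as np

def mirror(image):
    """Flip the image left to right."""
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng):
    """Cut out a crop_h x crop_w window at a random position."""
    top = rng.integers(0, image.shape[0] - crop_h + 1)
    left = rng.integers(0, image.shape[1] - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def rotate90(image):
    """Rotate the image by 90 degrees counter-clockwise."""
    return np.rot90(image)

rng = np.random.default_rng(0)
image = np.arange(36).reshape(6, 6)  # toy 6x6 grayscale image

augmented = [mirror(image), random_crop(image, 4, 4, rng), rotate90(image)]
print([a.shape for a in augmented])  # [(6, 6), (4, 4), (6, 6)]
```

Each modified copy keeps the label of the original image, so one labeled picture turns into several training examples.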