CNN’s Building Blocks

Güldeniz Bektaş
7 min read · Jun 1, 2021


Photo by Greg Rakozy on Unsplash

In our era, we are trying to train machines to do what we do on a daily basis. Not perfectly yet, but we are developing machines that help us and make us better and faster at important tasks. Computer vision is one of the most important tasks we are working on. Seeing is a gift for the human race: we can see, interpret, and analyze everything around us. When we look at a picture of a dog eagerly chasing a ball, we can describe it just as I wrote. We can perceive the emotions in the picture, what is happening at that exact moment, and what might happen just after the shot. Computer vision is being developed for this purpose, because human vision can sometimes be inadequate or delayed, and sometimes instant responses are required, such as tumor diagnosis or intervention at the time of a crime.

Computer vision is the science that enables computers to analyze and interpret videos and images the way the human brain does. Convolutional Neural Networks (CNN or ConvNet), a class of deep neural networks, are widely used in computer vision algorithms.

What is Convolutional Neural Network?

A CNN is a neural network: an algorithm used to recognize patterns in data. More specifically, it is a type of DNN (deep neural network) designed for working with two- or more-dimensional image data. A CNN takes its name from its ‘convolutional’ layers, which perform an operation called ‘convolution’.

Architecture of a traditional CNN

🥁 Now, let’s break down every layer of a CNN in detail.

1. Convolution Layer


A convolutional layer involves the multiplication of a set of weights with the input, just like a traditional neural network, but for CNNs the input is two-dimensional. The multiplication is performed between an array of input data and a two-dimensional array of weights, called a filter or a kernel.

This filter or kernel is always smaller than the input, and it moves all over the image matrix. It multiplies its values by the original pixel values, and all of these multiplications are summed up into a single number. The filter then moves to the right and down in steps of n pixels (this can vary). The resulting matrix is (usually) smaller than the input matrix.
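The slide-multiply-sum operation described above can be sketched in plain NumPy (this is an illustrative toy implementation, not code from the article; real frameworks use heavily optimized versions):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding); at each position,
    multiply element-wise and sum everything into one number."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 filter
print(conv2d(image, kernel).shape)  # (3, 3) — smaller than the input
```

Note how a 4x4 input and a 2x2 filter with stride 1 give a 3x3 feature map, exactly the shrinking effect described above.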


The output array produced by this operation between the input and the filter is called a ‘feature (or activation) map’.


Its hyperparameters include the filter size F and the stride (step size) S.

Color images consist of three channels (red-green-blue, or RGB), so the convolution is performed over all three channels: the filter has the same number of channels as its input. Each filter produces one feature map, so the number of channels in the output equals the number of filters in the layer.

After every convolution layer, there is a non-linear layer where a non-linear activation function such as ReLU is applied to each value in the feature map to add non-linearity to the data. The filter values are the weights of the layer, updated by backpropagation. A bias value (b) is added to the output of the convolution before the activation is applied.


By increasing the non-linearity, the network becomes complex enough to find new patterns in the images.

Understanding Hyperparameters

  1. Padding — lets us control the size difference between the input and output matrices after the convolution. Symmetrically adding zeros (zero padding) around the edges of the input matrix is the most widely used padding method because of its performance, simplicity, and computational efficiency (it was used in AlexNet).
  • Output size after padding — (W − F + 2P)/S + 1, for input volume size W, filter size F, stride S, and amount of zero padding used on the border P. For example, with a 7x7 input, a 3x3 filter, stride 1, and padding 0: (7 − 3 + 2×0)/1 + 1 = 5, so the output shape would be 5x5.
  • Setting the zero padding to P = (F − 1)/2 when the stride is 1 ensures that the input volume and output volume will have the same size.

2. Kernel Size — refers to the dimensions of the filter sliding over the input. Small filters extract more fine-grained information from the input matrix and cause a smaller reduction in layer dimensions, which often performs better.

3. Stride — indicates how many pixels the kernel is shifted at a time. With a stride of 1, the filter moves to the right by one pixel for every operation. The smaller the stride, the more data is extracted, but the larger the output.
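The output-size formula above is easy to turn into a small helper (a sketch; the parameter names simply follow the W, F, S, P symbols used in the formula):

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial size of the feature map: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# 7x7 input, 3x3 filter, stride 1, no padding -> 5x5 output
print(conv_output_size(7, 3, S=1, P=0))  # 5

# "Same" padding for stride 1: P = (F - 1) / 2 keeps the size at 7
print(conv_output_size(7, 3, S=1, P=1))  # 7
```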

🧩 Activation Functions

Activation functions are one of the crucial parts of deep learning designs. The activation function in the hidden layers controls how well the model learns the training dataset, while the choice of activation function in the output layer defines the type of predictions the model can make.

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network.

📌 Different activation functions may be used in different parts of the model. Hidden layers usually share the same activation function, while the output layer usually uses a different one that depends on the type of prediction the model makes.

Four activation functions that you probably will see the most are:

  1. Rectified Linear Activation (ReLU)
  2. Logistic (Sigmoid)
  3. Hyperbolic Tangent (Tanh)
  4. Softmax
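All four can be written in a few lines of NumPy (a minimal sketch for intuition; frameworks ship their own tested versions):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: zero out negatives, pass positives through
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real number into (-1, 1)
    return np.tanh(x)

def softmax(x):
    # Turns a score vector into probabilities that sum to 1;
    # subtracting the max is a standard numerical-stability trick
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))            # [0. 0. 2.]
print(softmax(x).sum())   # 1.0
```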

2. Pooling Layer

This layer reduces the spatial size of the representation to cut down the number of parameters and the amount of computation in the network, and it also helps control overfitting.

📌 This layer slides a window over the feature maps from the previous layer and summarizes the presence of features in each patch; unlike convolution, it has no learned weights.

📌 The size of the pooling filter must be smaller than the size of the feature map.

There are two main types of pooling layers. They are max pooling, and average pooling.

Max Pooling — Calculate the maximum value for each patch of the feature map.


Average Pooling — Calculate the average value for each patch on the feature map.

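Both types can share one small sketch (plain NumPy, for illustration only): the only difference between max and average pooling is the reduction applied to each patch.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Summarize each size x size patch of the feature map
    with its maximum ("max") or its mean ("avg")."""
    h, w = fmap.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            patch = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = reduce_fn(patch)
    return out

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 8., 3., 2.],
                 [7., 6., 1., 0.]])
print(pool2d(fmap, mode="max"))   # [[4. 8.] [9. 3.]]
print(pool2d(fmap, mode="avg"))   # [[2.5 6.5] [7.5 1.5]]
```

Note the 4x4 feature map shrinks to 2x2: with a 2x2 window and stride 2, each output value summarizes a non-overlapping patch.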

3. Fully-Connected Layer

Neurons in a fully-connected layer have full connections to all activations in the previous layer, as in regular neural networks. The input to the fully-connected layer is the output of the final pooling or convolutional layer, which is flattened and then fed in.

But, wait! What is flattened? Let me explain.

This layer converts a three-dimensional tensor into a one-dimensional vector to fit the input of a fully-connected layer. For example, a 6x6x3 tensor would be converted into a vector of size 108 (6 × 6 × 3) by the flatten layer.

After passing through the fully-connected layers, the final layer uses the softmax activation function (instead of ReLU) to produce the probabilities of the input belonging to each class (classification).
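Putting the last two steps together, here is a hedged NumPy sketch of flatten + fully-connected + softmax (the 6x6x3 feature maps, 10 classes, and random weights are all made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend output of the final pooling/conv layer: 6x6 spatial, 3 channels
feature_maps = rng.normal(size=(6, 6, 3))

# Flatten: 6 * 6 * 3 = 108-dimensional vector
flat = feature_maps.reshape(-1)
print(flat.shape)  # (108,)

# Fully-connected layer mapping 108 features to 10 class scores
W = rng.normal(size=(10, 108)) * 0.01  # weights (learned in practice)
b = np.zeros(10)                       # bias
logits = W @ flat + b

# Softmax turns the scores into class probabilities
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.sum())  # 1.0
```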


In theory it kinda looks easy, but the code part can be hard. So the key is: code, code, code. I think that was Daniel Bourke’s favorite line.

Well, it’s done for now 💫 You can read my other articles on Medium!
