An Introduction to Convolutional Neural Networks (CNNs)

Published in SFU Professional Computer Science

10 min read · Feb 11, 2022

Authors: Xiang Fan, Phuong Truong

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

Background

A convolutional neural network (CNN) is a type of artificial neural network mainly used to process data with a grid-like topology, such as images in recognition and classification tasks. Compared to other algorithms for image classification, the main advantage of convolutional neural networks is that they learn useful features directly from the data during training instead of relying on hand-designed feature extractors, although for classification they are typically still trained on labeled samples. Due to this advantage, the convolutional neural network has become one of the most popular deep learning networks. In recent years, convolutional neural networks have also proved to be valuable tools in other fields of data science, including big data analysis.

Most readers who are familiar with machine learning may already have some first-hand experience working with convolutional neural networks and a basic understanding of the concept. However, for those looking for a deeper understanding of the topic, the articles on the Internet tend to be either introductory overviews that provide little in-depth knowledge or advanced essays that are difficult to follow.

In this post, we discuss the architecture of the convolutional neural network, covering not just which kinds of layers exist but also what each of them does. We also present well-known examples of convolutional neural networks. We hope this helps readers understand not only the overall architecture, but also how each component works and how it has been applied.

Architecture

Like fully connected neural networks, a convolutional neural network has an input layer, an output layer, and one or more hidden layers.

The input layer is the first layer of a neural network and holds the input data; it brings the data into the network so that it can be passed to the following layers for processing. The output layer is the last layer of the network and passes the result out of the network. The hidden layers are the one or more layers between the input layer and the output layer. Each layer has one or more neurons. Neurons are the basic units of the neural network: each neuron takes input and produces output data.

In a fully connected neural network, all hidden layers are fully connected layers, together with the necessary activation functions. Because of this design, fully connected neural networks have two major disadvantages when processing large data with a grid-like topology:

The coordinate information is lost when the data matrix is reshaped into a one-dimensional array.
The large number of variables makes the network difficult to train and prone to overfitting.

The architecture of convolutional neural networks addresses both of these problems.

Unlike the better-known fully connected neural network, the hidden layers of a convolutional neural network can include not only fully connected layers and activation functions, but also convolutional layers and pooling layers. Generally, the hidden layers of a convolutional neural network are a combination of convolutional layers, pooling layers, fully connected layers, and activation functions.

1. Convolution Layers

Convolution layers are the core building blocks of the convolutional neural network and are where most of its computation happens.

A neuron in a convolution layer receives input from neurons in the previous layer and produces a single output. However, unlike in a fully connected layer, each neuron in a convolution layer does not connect to every neuron in the previous layer, but only to those neurons within its receptive field.

The convolution layers allow the neural network to learn the features in each small local area rather than over the entire input at once. This is why convolutional neural networks are so powerful in image classification: they can learn local features more efficiently than traditional neural networks. Furthermore, this also reduces the number of variables to train, making the convolutional neural network easier to train and less prone to overfitting.

To further understand the convolution layer, we will discuss some important terminology related to it.
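To make the reduction in variables concrete, the rough sketch below compares the number of parameters in a fully connected layer and in a convolution layer. The 224x224 RGB input, the 64 output units or filters, and the 3x3 kernel are assumed values chosen for illustration, not figures from any particular network.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output neuron in a fully connected layer."""
    return in_features * out_features + out_features


def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus one bias per filter in a convolution layer.

    The same kernel is reused at every position, so the count does not
    depend on the height or width of the input.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Assumed example: a 224x224 RGB image and 64 output units / filters.
height, width, channels = 224, 224, 3
fc = dense_params(height * width * channels, 64)
conv = conv_params(channels, 64, kernel_size=3)
print(fc)    # 9633856
print(conv)  # 1792
```

Under these assumptions, the convolution layer needs several thousand times fewer parameters, because each neuron only looks at a small local area and the kernel weights are shared across positions.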

Receptive field

In a neural network, the receptive field of a neuron is the area of the previous layer from which it takes input.

In a fully connected layer, the receptive field of each neuron is the entire previous layer. In a convolution layer, the receptive field is usually a square (e.g. an n*n block of pixels on an image matrix), and its size is determined by the size of the kernel.

Sometimes, the input datasets are 3-dimensional rather than 2-dimensional. In this case, by adding a depth dimension to the kernel, receptive fields can become 3-dimensional and learn 3D features just as they would on a 2D dataset.

During training, each neuron in a convolution layer learns the features within its receptive field.
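As a small illustration of the idea, the sketch below cuts one receptive field out of a 2D input. The function name `receptive_field` and the 4x4 example values are our own, chosen only for illustration.

```python
def receptive_field(data, row, col, size):
    """Return the size x size patch whose top-left corner is (row, col).

    `data` is a 2D list (height x width). For 3D input, each cell can
    itself be a list of channel values, and the patch simply keeps them.
    """
    return [r[col:col + size] for r in data[row:row + size]]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
patch = receptive_field(image, 1, 1, 2)
print(patch)  # [[6, 7], [10, 11]]
```

A neuron with this receptive field only ever sees these four values, no matter how large the full input is.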

Kernel

A kernel is used for feature extraction in a convolution layer. It is usually a square matrix of weights.

In each convolution layer, the kernel of the layer scans the input data from the previous layer, moving from left to right and then from top to bottom. At each position, the kernel covers an area of the previous layer that has the same size as the kernel (this area is the receptive field of a neuron) and produces a single output based on the inputs and the weights of the kernel. The resulting output of a convolution layer is called the convolution features.

The weight in each cell of the kernel is the multiplication factor for the corresponding data point in the receptive field. The products are summed to give the value of the corresponding data point in the matrix of convolution features.
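The scanning process described above can be sketched in plain Python as follows. This is a minimal illustration with a stride of 1 and no padding; the function name `convolve2d` and the example values are hypothetical, and real libraries use much faster implementations.

```python
def convolve2d(data, kernel):
    """Slide a square `kernel` over `data` (stride 1, no padding).

    Each output value is the sum of element-wise products between the
    kernel weights and the receptive field under the kernel. Like most
    deep learning libraries, this does not flip the kernel.
    """
    k = len(kernel)
    out_h = len(data) - k + 1
    out_w = len(data[0]) - k + 1
    features = []
    for i in range(out_h):          # top to bottom
        row = []
        for j in range(out_w):      # left to right
            total = 0
            for a in range(k):
                for b in range(k):
                    total += data[i + a][j + b] * kernel[a][b]
            row.append(total)
        features.append(row)
    return features


image = [
    [1, 2, 0],
    [0, 1, 3],
    [2, 1, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
print(convolve2d(image, kernel))  # [[2, 5], [1, 2]]
```

Here the 3x3 input and 2x2 kernel give a 2x2 matrix of convolution features, one value per kernel position.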

Padding

Because the kernel can only scan areas inside the dataset and cannot move outside it, the data points on the border of the dataset are covered by fewer kernel positions and cannot be learned as well as those in the center. In addition, we sometimes want the dataset to remain the same size after it is processed by the convolution layer.

To solve these problems, we can use padding, which creates additional data points outside the border of the dataset to increase its size. These data points typically have a value of 0.
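A minimal sketch of zero padding on a 2D input is shown below. The helper name `zero_pad` is our own choice for illustration.

```python
def zero_pad(data, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(data[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in data]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom


image = [
    [1, 2],
    [3, 4],
]
for row in zero_pad(image, 1):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```

With a 3x3 kernel and a stride of 1, one ring of zeros like this keeps the output the same size as the input, since (n + 2 - 3) + 1 = n.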

Stride

When the kernel of a convolution layer scans the previous layer, the stride determines how far the kernel moves at each step. If the stride is n, the kernel moves n pixels from its current position each time it moves. The stride is commonly set to 1 by default.
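Kernel size, padding, and stride together determine how large the output of a convolution layer is. The sketch below uses the standard formula for one dimension; the function name `conv_output_size` and the example sizes are assumptions for illustration.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Number of kernel positions along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(7, 3))                       # 5
print(conv_output_size(7, 3, stride=2))             # 3
print(conv_output_size(7, 3, stride=1, padding=1))  # 7
```

A larger stride skips positions and shrinks the output, while padding adds positions around the border.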

2. Pooling Layers

Pooling layers reduce the size of the data by combining multiple input values into a single output value.

A pooling layer is usually added after several consecutive convolution layers. Its purpose is to reduce the size of the data as well as the number of variables, thus reducing the amount of computation required.

There are several types of pooling algorithms:

Max pooling: the maximum input value is the output.
Average pooling: the average of all input values is the output.
L2-norm pooling: the L2 norm of all input values is the output.
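The three pooling algorithms above can be sketched with one helper that applies a different reduction to each window. We assume non-overlapping square windows here (the window moves by its own size); the names `pool2d`, `average`, and `l2_norm` are our own.

```python
import math


def pool2d(data, size, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(data) - size + 1, size):
        row = []
        for j in range(0, len(data[0]) - size + 1, size):
            window = [data[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 3, 4],
    [0, 2, 0, 0],
]
print(pool2d(feature_map, 2, max))      # [[4, 2], [2, 4]]
print(pool2d(feature_map, 2, average))  # [[2.5, 1.0], [0.5, 1.75]]
print(pool2d(feature_map, 2, l2_norm))
```

In each case the 4x4 input is reduced to a 2x2 output, so later layers have a quarter as many values to process.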

3. Fully Connected Layers

As mentioned above, fully connected layers are layers where every neuron connects to every neuron of the previous layer. In a convolutional neural network, the main purpose of a fully connected layer is to learn the features of the entire input rather than the features of a local cluster. For example, in classification problems, the fully connected layers can be trained to determine which type of animal appears in the input image.
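A fully connected layer can be sketched as a weighted sum over every input value. The weights, biases, and three output classes below are hypothetical values for illustration; in a real network they would be learned during training.

```python
def flatten(feature_map):
    """Turn a 2D feature map into the 1D list a dense layer expects."""
    return [value for row in feature_map for value in row]


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of every input plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


features = flatten([[1, 2], [3, 4]])   # [1, 2, 3, 4]
weights = [
    [1, 0, 0, 1],   # hypothetical weights for class 0
    [0, 1, 1, 0],   # hypothetical weights for class 1
    [1, 1, 1, 1],   # hypothetical weights for class 2
]
biases = [0, 1, -2]
scores = fully_connected(features, weights, biases)
print(scores)  # [5, 6, 8]
```

Each score depends on all four input values, which is what lets this kind of layer combine local features into a decision about the whole input.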

4. Activation Function

The activation functions in a convolutional neural network are usually nonlinear functions that transform the input of a neuron into its output. They are used to introduce nonlinearity into the neural network.
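For reference, here is a short sketch of the activation functions that appear in the example networks later in this post (ReLU, sigmoid, tanh, and softmax), written with only the Python standard library.

```python
import math


def relu(x: float) -> float:
    """Keep positive values and replace negative values with zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)                       # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
print(math.tanh(0.0))          # 0.0
print(softmax([1.0, 1.0]))     # [0.5, 0.5]
```

ReLU, sigmoid, and tanh are applied to each neuron separately, while softmax is applied to a whole list of scores, which is why it is typically used in the output layer of a classifier.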

Examples of Convolutional Neural Networks

In this section, we will introduce some examples of convolutional neural networks, which include LeNet, AlexNet, VGGNet, and GoogLeNet.

LeNet

LeNet (or LeNet-5) was one of the first and simplest convolutional neural networks. It was introduced in 1998 by Yann LeCun et al. in the paper “Gradient-Based Learning Applied to Document Recognition”. It was primarily used to recognize simple handwritten characters and digit images, such as those in the MNIST dataset. The figure below shows the architecture of LeNet-5:

LeNet-5 has the basic components of a convolutional neural network: 3 convolutional layers, 2 sub-sampling layers and 2 fully connected layers. The tanh activation function is used in the hidden layers, and a softmax activation function is commonly used in the last layer in modern implementations.

LeNet-5 is basic, straightforward, and easy to understand. It laid the groundwork for many later models; however, it is not very effective at image recognition compared to more advanced models such as AlexNet or GoogLeNet. It remains a good starting point for beginners learning about convolutional neural networks.
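To see how the data shrinks as it moves through the network, the sketch below traces the feature map sizes of LeNet-5. The layer names and sizes (a 32x32 input, 5x5 kernels, 2x2 sub-sampling) follow the description in the original paper as we understand it; the helper `conv_size` is our own.

```python
def conv_size(size, kernel, stride=1):
    """Output size along one dimension, with no padding."""
    return (size - kernel) // stride + 1


# (name, type, kernel size, number of output feature maps)
layers = [
    ("C1", "conv", 5, 6),
    ("S2", "pool", 2, 6),
    ("C3", "conv", 5, 16),
    ("S4", "pool", 2, 16),
    ("C5", "conv", 5, 120),
]

size = 32  # LeNet-5 takes a 32x32 input image
shapes = []
for name, kind, kernel, maps in layers:
    # Sub-sampling windows do not overlap, so their stride equals their size.
    stride = kernel if kind == "pool" else 1
    size = conv_size(size, kernel, stride)
    shapes.append((name, maps, size, size))
    print(f"{name}: {maps} maps of {size}x{size}")
# C1: 6 maps of 28x28
# S2: 6 maps of 14x14
# C3: 16 maps of 10x10
# S4: 16 maps of 5x5
# C5: 120 maps of 1x1
```

After C5, the 120 values feed a fully connected layer of 84 neurons, followed by an output layer with one neuron per digit class.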

AlexNet

AlexNet is a convolutional neural network architecture designed by Alex Krizhevsky and his colleagues in 2012, and the paper that introduced it is considered one of the most influential in computer vision. The model was trained on a subset of the ImageNet dataset, which as a whole has more than 14 million images across roughly 20,000 categories. The figure below shows the architecture of AlexNet:

AlexNet contains 8 layers: 5 convolutional layers and 3 fully connected layers. It used the ReLU activation function in every layer except the output layer. To avoid overfitting, AlexNet used data augmentation and dropout layers. What made AlexNet stand out was the use of ReLU instead of tanh or sigmoid functions: in the paper's CIFAR-10 experiment, a network with ReLU reached a 25% training error rate six times faster than an equivalent network with tanh. Moreover, it was one of the first major models to use multiple GPUs for training, which also sped up the process significantly. The result of AlexNet was impressive: the model achieved top-1 and top-5 error rates of 37.5% and 17.0% on the ILSVRC-2010 test set, which surpassed the best-performing models at that time.
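The top-1 and top-5 error rates mentioned above measure how often the true label is missing from the model's one or five highest-scoring guesses. The sketch below shows the idea on a tiny made-up example; the function name `top_k_error`, the scores, and the labels are all hypothetical.

```python
def top_k_error(scores_per_image, true_labels, k):
    """Fraction of images whose true label is not among the k highest scores."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)),
                        key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Hypothetical scores for 4 images over 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],   # true class 1: best guess is correct
    [0.5, 0.3, 0.1, 0.1],   # true class 1: second guess is correct
    [0.4, 0.3, 0.2, 0.1],   # true class 3: missed by top 1 and top 2
    [0.1, 0.1, 0.1, 0.7],   # true class 3: best guess is correct
]
labels = [1, 1, 3, 3]
print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=2))  # 0.25
```

A larger k is more forgiving, which is why a model's top-5 error is always less than or equal to its top-1 error.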

AlexNet is an effective model for high-resolution image classification, which can be useful in many fields such as medical computer vision and speech recognition.

VGGNet

VGGNet is a convolutional neural network developed by Karen Simonyan and Andrew Zisserman from the University of Oxford in 2014. VGG is very effective at understanding and extracting features from images and is widely used in deep learning. It was trained on the ImageNet dataset. It consists of five configurations that contain from 11 to 19 weight layers. Instead of a few convolutional layers with large filters, it stacks multiple convolutional layers with small 3x3 filters. It used the ReLU activation function in all hidden layers, and softmax in the final layer. The training process used in VGG is similar to that of AlexNet. The table below describes the network’s architectures:

VGGNet performed much better than the models from previous competitions. Its best single network achieved a top-5 error rate of 7.0%. The increase in the depth of the model led to a significant improvement in accuracy. However, the model takes much longer to train than other models because of its depth and its very large number of parameters (133 to 144 million).
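The benefit of stacking small filters can be checked with a little arithmetic. The sketch below compares three stacked 3x3 convolutions with a single 7x7 convolution; the channel count of 64 and the function names are assumptions for illustration, and biases are ignored.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weight count when every layer has `channels` inputs and outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # assumed channel count for illustration
print(stacked_receptive_field(3, 2))            # 5
print(stacked_receptive_field(3, 3))            # 7
print(conv_weights(3, channels, num_layers=3))  # 110592
print(conv_weights(7, channels))                # 200704
```

Three 3x3 layers cover the same 7x7 area as one large filter while using roughly half the weights, and they also apply an activation function three times instead of once.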

GoogLeNet

GoogLeNet (or Inception V1) is a type of convolutional neural network developed by researchers at Google and several universities in 2014. It is based on the Inception architecture and is 22 layers deep when counting only layers with parameters, or 27 layers when pooling layers are included. The table below shows the details of GoogLeNet architecture:

GoogLeNet won first place at the ILSVRC 2014 challenge with an impressive top-5 error of 6.67% in classification performance. Even though the model is complicated to implement, it has significantly fewer parameters, a lower error rate, and a shorter training time than AlexNet or VGG. GoogLeNet is a powerful model for image classification and detection and can be widely used in computer vision tasks.

Summary

In this blog post, we briefly introduced the architecture of convolutional neural networks as well as several of the best-known examples. As one of the most important tools in the field of deep learning, convolutional neural networks have many potential applications in big data analysis. If you are interested in learning more, please check the articles in the ‘External Link’ section of this blog post.