Convolutional Neural Networks for Image Recognition

Nghi Huynh
6 min read · Mar 11, 2022


Lesson 3 notes from the DeepMind lecture series in 2020

Most images presented here are from DeepMind lecture 3’s slides.

Background:

From my notes on Lesson 2, we know what neural networks are (Figure 1). Now, we want to use these networks to detect, analyze, and classify images, enabling the automation of specific tasks.

Figure 1: A recap of a neural network with 2 hidden layers

For example: given the tree image below (Figure 2), our goal is to train a neural network to recognize and classify it as a tree.

Figure 2: A tree image. On the left is the original digital image; on the right is a simplified, pixelated version

We know that a neural network receives a vector of numbers as input. So, how do we feed images to it?

Simply put, a digital image is a 2D grid of pixels. Each pixel records the intensity of the light that forms the image (Figure 2). One way to represent these pixels as a vector of numbers is to flatten them out, row by row (Figure 3).

Figure 3: A representation of a vector of numbers from the image
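To make this concrete, here is a minimal NumPy sketch (the pixel values are made up purely for illustration) of flattening a 2D pixel grid into a vector, row by row:

```python
import numpy as np

# A toy 4 x 4 grayscale "image": each entry is a pixel intensity in [0, 1].
# (The values are made up purely for illustration.)
image = np.array([
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 1.0, 0.9],
    [0.8, 0.7, 0.6, 0.5],
])

# Flatten row by row (row-major order) into a vector of 16 numbers,
# which can then be fed to a neural network as input.
vector = image.flatten()
print(image.shape, "->", vector.shape)  # (4, 4) -> (16,)
```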

Now, we know how to represent images as vectors of numbers for our neural networks. Let’s explore some building blocks of a convolutional neural network (CNN or ConvNet).

Note: the inputs and outputs of a ConvNet are tensors: 3D objects of size width x height x channels (Figure 4).

Figure 4: Inputs and outputs are tensors
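For example, a 32 x 32 color (RGB) image has 3 channels, so it is a tensor of shape 32 x 32 x 3; a minimal sketch:

```python
import numpy as np

# A 32 x 32 RGB image as a width x height x channels tensor.
# (Random values stand in for real pixel intensities.)
image = np.random.rand(32, 32, 3)
print(image.shape)  # (32, 32, 3)

# Note: frameworks differ on memory layout; PyTorch, for instance,
# expects channels first, i.e. (batch, channels, height, width).
```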

Building blocks:

There are three main types of layers used to build ConvNet architectures (a minimal code sketch of all three follows the list):

1. Fully-connected layer: a traditional layer that connects every element of the input vector to every hidden unit (neuron) in the layer (Figure 5).

Figure 5: Fully connected layer

2. Convolutional layer: a core building block of a ConvNet that does most of the computational heavy lifting. This layer applies weight sharing to preserve the topology of the image: a kernel (a filter, or small sliding window) moves across the image and produces an output value at each position. Convolving the image with multiple kernels yields multiple feature maps, or channels (Figure 6).

Figure 6: Convolutional layer

Variants of the convolution operation (Table 1):

Table 1: Convolution operations

3. Pooling layer: a layer commonly inserted between successive convolutional layers. It computes a mean or max over small windows to reduce the resolution, thereby reducing the number of parameters and the amount of computation in the network (Figure 7).

Figure 7: Pooling layer
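Here is the minimal sketch promised above, written in PyTorch (my choice of framework, not the lecture's). It shows all three layer types, plus the padded (“same”) and strided variants of the convolution operation:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a dummy batch: one 32 x 32 RGB image

# 1. Fully connected layer: every input element connects to every unit,
#    so the image must first be flattened into a vector.
fc = nn.Linear(in_features=3 * 32 * 32, out_features=128)
print(fc(x.flatten(start_dim=1)).shape)  # torch.Size([1, 128])

# 2. Convolutional layer: 6 kernels of size 5 x 5 slide across the image,
#    sharing weights across positions; each kernel yields one feature map.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv(x).shape)         # torch.Size([1, 6, 28, 28]): 32 - 5 + 1 = 28

# A "same" convolution pads the input so the resolution is preserved:
conv_same = nn.Conv2d(3, 6, kernel_size=5, padding=2)
print(conv_same(x).shape)    # torch.Size([1, 6, 32, 32])

# A strided convolution skips positions, downsampling the output:
conv_strided = nn.Conv2d(3, 6, kernel_size=5, stride=2, padding=2)
print(conv_strided(x).shape) # torch.Size([1, 6, 16, 16])

# 3. Pooling layer: a max (or mean) over small windows reduces the
#    resolution without changing the number of channels.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(conv(x)).shape)   # torch.Size([1, 6, 14, 14])
```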

What are convolutional neural networks?

A convolutional neural network (CNN or ConvNet) is a sequence of layers, and each layer of a ConvNet transforms one volume of activations to another through a differentiable function. ConvNets are often used for image classification and other computer vision tasks.

Now, let’s stack these building blocks to create a simple convolutional neural network.

First, let’s recall how to represent a neural network as a computational graph from the previous notes (Figure 8).

Figure 8: A neural network as a computational graph

Then, we simplify the diagram above so that the parameters and the loss are implicit (Figure 9).

Figure 9: A simplified diagram: implicit parameters and loss

Finally, we alternate convolutional and pooling layers to create a ConvNet, specifically LeNet-5, a convnet for handwritten digit recognition (Figure 10).

Figure 10: The architecture of LeNet-5

Note: Subsampling = pooling

So, the network has five layers in total: three convolutional layers and, at the end, two fully connected layers. The convolutional layers alternate with the pooling layers. The output layer uses the Softmax activation function to classify the images into their respective classes.

Let’s understand the architecture in more detail (Figure 11).

Figure 11: LeNet-5 in detail

The first layer is the input layer with a feature map size of 32 x 32 x 1 (32 x 32 grayscale image).

Then, we have the first convolutional layer, with 6 filters of size 5 x 5 and stride 1. The activation function used in this layer is tanh. The resulting feature map is 28 x 28 x 6 (since 32 − 5 + 1 = 28).

Next, we have the first average pooling layer with a filter size of 2 x 2 and stride 2. This layer reduces the image’s resolution without affecting the number of channels. The resulting feature map is 14 x 14 x 6.

Next is the second convolution layer of 16 filters with a filter size of 5 x 5 and stride 1. The resulting feature map is 10 x 10 x 16. The activation used in this layer is also tanh.

Then, we have the second average pooling layer, with a filter size of 2 x 2 and stride 2. The feature map is reduced to 5 x 5 x 16.

The final convolution layer has 120 filters with a filter size of 5 x 5 and stride 1. Again, the tanh activation function is used in this layer. The output feature map size is 1 x 1 x 120.

Next, we have the first fully connected layer with 84 neurons. We also use tanh as the activation function in this layer. The output size is 84.

Finally, we have the last fully connected layer, with 10 neurons. This layer uses the Softmax activation function to give the probability of the data point belonging to each of the 10 classes; the class with the highest probability is selected.
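Putting the pieces together, here is a minimal PyTorch sketch of LeNet-5 as described in this walk-through (the original 1998 network differs in small details, such as a sparse connection scheme between its second pooling and third convolutional layers):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5 as described above: three convolutional layers alternating
    with average pooling, followed by two fully connected layers."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),         # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # -> 10x10x16
            nn.AvgPool2d(kernel_size=2, stride=2),         # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # -> 1x1x120
        )
        self.classifier = nn.Sequential(
            nn.Linear(120, 84), nn.Tanh(),                 # -> 84
            nn.Linear(84, num_classes),                    # -> 10 class scores
        )

    def forward(self, x):
        x = self.features(x)               # convolutional feature extractor
        x = torch.flatten(x, start_dim=1)  # 1x1x120 -> vector of 120
        return self.classifier(x)          # raw class scores (logits)

model = LeNet5()
digits = torch.randn(4, 1, 32, 32)  # a dummy batch of 4 grayscale images
print(model(digits).shape)          # torch.Size([4, 10])
```

Note that, as is conventional in PyTorch, the Softmax is folded into the loss function during training (nn.CrossEntropyLoss expects raw scores); applying torch.softmax to the output recovers the per-class probabilities.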

So, that’s the entire architecture of the LeNet-5 model. Now, let’s dive deeper into other models.

Case studies:

Let’s explore some state-of-the-art CNNs that won the ImageNet Large Scale Visual Recognition Challenge (Figure 12).

Figure 12: Top-5 classification error rates of the competition winners in the ImageNet challenge

  • AlexNet (2012): has an architecture similar to LeNet-5, but deeper and bigger, with five convolutional layers stacked on top of each other, followed by three fully connected layers (Figure 13).

Figure 13: The architecture of AlexNet

  • VGGNet (2014): is a very deep convnet. It stacks many convolutional layers before each pooling layer and uses “same” convolutions to avoid reducing the resolution. The architecture has up to 19 layers, using only 3 x 3 kernels and 2 x 2 pooling (Figure 14).

Figure 14: The architecture of VGGNet

  • ResNet (2015): is a residual network, which features special skip (residual) connections and heavy use of batch normalization layers. The residual connections make it much easier to train very deep networks. ResNets are state-of-the-art convolutional neural network models and serve as the backbone of many vision systems in practice (a minimal sketch of a residual block follows Figure 15).

Figure 15: A residual block of ResNet
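Here is the minimal residual-block sketch mentioned above, again in PyTorch (the placement of batch normalization and the handling of shape changes in the shortcut vary across ResNet variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A basic residual block: the input skips over two convolutional
    layers and is added back to their output, so the block only has to
    learn a residual correction to the identity."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # the skip connection: add the input back

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56]): shape is preserved
```

For practical work, torchvision.models ships pretrained implementations of AlexNet, VGGNet, and ResNet, among others.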

Beyond image recognition:

Besides image recognition, we can use convnets to perform other computer vision tasks (Figure 16), such as:

Figure 16: Other computer vision tasks beyond image recognition

  • Image recognition: identifying or classifying one or more objects in an image.

  • Object detection: localizing and classifying one or more objects in an image, producing a bounding box for each detected object.

  • Semantic segmentation: labeling specific regions of interest in an image; multiple objects within a single category are treated as one entity.

  • Instance segmentation: detecting and delineating each object of interest in an image, identifying individual objects even within the same category.

What’s next?

Coming next are my notes from Lecture 4 in DeepMind’s deep learning series: Vision beyond classification, Task I: Object Detection.
