Convolutional Neural Networks
Course 4 in deep learning specialization (1st-week notes)
Update: If you have not read my previous articles on the notes of the other three deep learning specialization courses, then please do check out the series: article 1, article 2, article 3, article 4, article 5, article 6, article 7, article 8 and article 9.
Nowadays, deep learning is proving useful in several domains such as self-driving cars, healthcare, object detection, image recognition, and many more. In image recognition, a neural network with a small number of hidden units performs well when the images are small. But if the image is, for example, 1000 x 1000 pixels, then the input for one image has 1000 x 1000 x 3 features, i.e. 3 million, where 3 is the number of RGB color channels. If the first hidden layer has 1000 units, the weight matrix W has shape (1000, 3M), which is about 3 billion parameters. For computer vision applications, you don’t want to be stuck using only tiny images; you want to use large ones. To do that, we need the convolution operation, which is one of the fundamental building blocks of convolutional neural networks.
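The arithmetic behind that parameter count can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are chosen for illustration and bias terms are ignored.

```python
# Parameter count for a fully connected first layer on a 1000 x 1000 RGB image.
height, width, channels = 1000, 1000, 3
hidden_units = 1000

input_features = height * width * channels     # 3,000,000 input features
weight_shape = (hidden_units, input_features)  # shape of the weight matrix W
num_weights = hidden_units * input_features    # 3,000,000,000 weights

print(f"input features: {input_features:,}")
print(f"W shape: {weight_shape}, weights: {num_weights:,}")
```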
Convolution Operation Example
As we know, the early layers of a neural network might detect vertical and horizontal edges, some later layers might detect parts of objects, and even later layers might detect complete objects such as people’s faces.
To understand how a convolutional neural network works, let's first understand the convolution operation.
Consider a 6 x 6 matrix as shown in the image below. To detect vertical edges, create a 3 x 3 matrix whose columns are (1, 1, 1), (0, 0, 0) and (-1, -1, -1), so that every row reads 1, 0, -1. The latter matrix is known as the filter or kernel. The first column of the filter corresponds to high pixel values, i.e. the lighter region, and the last column corresponds to low pixel values, i.e. the darker region.
The asterisk sign (*) denotes the convolution operation. In programming languages, the same symbol usually denotes multiplication or element-wise multiplication, so the two should not be confused. Convolving the 6 x 6 matrix with the 3 x 3 filter results in a 4 x 4 matrix.
As a first step, we overlay the filter on the top-left 3 x 3 region of the target matrix, multiply element-wise, and sum the products. As shown in the image below, 3*1 + 1*1 + 2*1 + 0*0 + 5*0 + 7*0 + 1*(-1) + 8*(-1) + 2*(-1) = -5, and this value goes into the top-left cell of the resulting matrix.
Then the filter moves one position at a time in the horizontal direction, and then down row by row, performing the same multiplication and sum at each position.
The complete convolution operation generates the 4 x 4 matrix, which represents the edges in the image.
Because we started from a small 6 x 6 matrix, the result is hard to interpret, but when the input is much larger, for example an image that is 1000 pixels on each side, the output does show the edges clearly.
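The sliding-window procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an optimized implementation. The function name conv2d is made up for this example, and because the figure is not reproduced here, only the top-left 3 x 3 patch of the first matrix comes from the worked arithmetic; the remaining entries are placeholder values. The second input is a synthetic image, bright on the left and dark on the right, chosen so that the edge is visible even at 6 x 6.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution as used in deep learning (no padding, no kernel flip)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the current window and the filter, then sum.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Vertical edge filter: every row reads 1, 0, -1.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# Top-left 3 x 3 patch matches the worked example; other values are placeholders.
image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
result = conv2d(image, vertical_edge)
print(result.shape)   # (4, 4)
print(result[0, 0])   # -5.0

# Synthetic image: left half bright (10), right half dark (0).
half_bright = np.zeros((6, 6))
half_bright[:, :3] = 10
edges = conv2d(half_bright, vertical_edge)
print(edges)          # every row is [0, 30, 30, 0]
```

The band of 30s in the middle columns of the second output marks where the bright region meets the dark one.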
If we rotate the above filter matrix by 90 degrees, we can detect horizontal edges as well. Different filters allow you to find vertical and horizontal edges. It turns out that the 3 x 3 vertical edge detection filter we’ve used is just one possible choice. Historically, in the computer vision literature, there was a fair amount of debate about which set of numbers is best. Another option is 1, 2, 1, 0, 0, 0, -1, -2, -1, which is called a Sobel filter. Its advantage is that it puts a little more weight on the central row. Another example is 3, 10, 3, 0, 0, 0, -3, -10, -3, which is called a Scharr filter and has slightly different properties again. In deep learning we do not need computer vision researchers to handpick these numbers: we can treat the nine numbers of the 3 x 3 filter as parameters and learn them through backpropagation.
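For reference, the three hand-designed filters can be written out as NumPy arrays. The layout below assumes the numbers are listed column by column, as in the worked example earlier, so each is a vertical edge filter; the variable names are illustrative.

```python
import numpy as np

# Vertical edge filters, written row by row.
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]])
scharr = np.array([[3, 0, -3],
                   [10, 0, -10],
                   [3, 0, -3]])

# Rotating a vertical edge filter by 90 degrees (clockwise here)
# gives the corresponding horizontal edge filter.
horizontal = np.rot90(vertical, k=-1)
sobel_horizontal = np.rot90(sobel, k=-1)
print(horizontal)
print(sobel_horizontal)
```

In a trained network these nine values would not be fixed like this; they would start from random values and be updated by backpropagation.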
When an n x n image is convolved with an f x f filter, the output matrix has dimension (n-f+1) x (n-f+1). There are two downsides to this approach.
- Every time the convolution operator is applied, the image shrinks, which we do not want to happen at every layer.
- The pixels at the edges of the image are used far less often than the pixels in the middle, so a lot of useful information near the edges may be wasted.
To fix both problems, we can pad the image before applying the convolution operation. For example, if we pad the 6 x 6 image with a one-pixel border, making it an 8 x 8 image, and then apply a 3 x 3 convolution, the result is a 6 x 6 image, which preserves the original size. If p is the number of pixels of padding added on each side, the dimension of the resulting matrix becomes (n+2p-f+1) x (n+2p-f+1).
In terms of how much to pad, there are two common choices: valid convolutions and same convolutions. A valid convolution means no padding, and a same convolution means padding so that the output matrix has the same size as the input. For a same convolution the padding is p = (f-1)/2, where the filter size f is an odd number.
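The padding rule can be tried out with np.pad. This is a small sketch assuming zero padding (the text does not say which values are used to fill the border; zeros are the usual choice), and same_padding is a hypothetical helper name.

```python
import numpy as np

def same_padding(f):
    """Padding that keeps the output size equal to the input size (odd f only)."""
    if f % 2 == 0:
        raise ValueError("same padding as defined here needs an odd filter size")
    return (f - 1) // 2

n, f = 6, 3
p = same_padding(f)                         # 1 for a 3 x 3 filter

image = np.arange(n * n).reshape(n, n)      # placeholder 6 x 6 image
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

valid_size = n - f + 1                      # no padding: 4
same_size = padded.shape[0] - f + 1         # with padding: n + 2p - f + 1 = 6

print(padded.shape, valid_size, same_size)  # (8, 8) 4 6
```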
In a strided convolution, rather than moving the filter one position at a time, we move it by the stride. With stride = 2, for example, the filter jumps two cells at each step, both horizontally and vertically, as shown below.
The input and output dimensions are related by the following formula: with an n x n input, an f x f filter, padding p and stride s, the output is (floor((n+2p-f)/s) + 1) x (floor((n+2p-f)/s) + 1).
If (n+2p-f)/s is not an integer, we round it down.
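The usual form of this rule is floor((n + 2p - f) / s) + 1 for each spatial dimension. The helper below is a minimal sketch of it; the name conv_output_size is made up for this example, and the inputs are assumed to be non-negative integers with f no larger than n + 2p.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for an n x n input, f x f filter, padding p, stride s."""
    # Integer floor division implements the "round down" rule.
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4: valid convolution, stride 1
print(conv_output_size(6, 3, p=1))       # 6: same convolution
print(conv_output_size(7, 3, s=2))       # 3: stride 2, divides evenly
print(conv_output_size(6, 3, s=2))       # 2: (6 - 3) / 2 = 1.5 is rounded down
```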
Convolutions Over Volume
We saw how to perform the convolution operation on 2D images. Now we will see how it is performed on 3D volumes, where the third dimension of the image holds the RGB color channels. One thing to keep in mind is that the number of channels in the image and in the filter must be the same. The output matrix is 2D if only one filter is used; with several filters, the 2D outputs are stacked, one per filter.
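A convolution over a volume can be sketched the same way as the 2D case, with the sum running over all channels at once. The function name conv_volume, the random input values and the choice of two filters are assumptions made for this illustration.

```python
import numpy as np

def conv_volume(volume, kernel):
    """Valid convolution of an (H, W, C) volume with an (f, f, C) filter."""
    n_h, n_w, n_c = volume.shape
    f_h, f_w, k_c = kernel.shape
    if n_c != k_c:
        raise ValueError("image and filter must have the same number of channels")
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the f x f x C window by the filter and sum over everything.
            out[i, j] = np.sum(volume[i:i + f_h, j:j + f_w, :] * kernel)
    return out

rng = np.random.default_rng(0)
volume = rng.integers(0, 10, size=(6, 6, 3))                    # 6 x 6 RGB input
filters = [rng.integers(-1, 2, size=(3, 3, 3)) for _ in range(2)]

single = conv_volume(volume, filters[0])                        # one filter: 2D
stacked = np.stack([conv_volume(volume, k) for k in filters], axis=-1)

print(single.shape)    # (4, 4)
print(stacked.shape)   # (4, 4, 2)
```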
Summary of Notation for a 3D Convolution Layer :
Suppose we have a 4 x 4 input and we want to apply max pooling to it, as shown in the image below. We can break the input into regions, shown in different colors, and take the maximum number from each region. The intuition behind max pooling is that if we consider each region as a set of features, then the strongest feature from every region is preserved in the output.
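Max pooling over a 4 x 4 input can be sketched as follows, assuming 2 x 2 regions with stride 2 so that the regions do not overlap. The input values are placeholders, since the figure is not reproduced here, and max_pool2d is a name chosen for this example.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling with an f x f window and stride s on a 2D array."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the largest value in each region.
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
pooled = max_pool2d(x)
print(pooled)
# [[9 2]
#  [6 3]]
```

Note that the window size and stride are fixed hyperparameters here; nothing in this operation is learned.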
Complete Convolutional Neural Network Example
In the following example, the input image has size 32 x 32 x 3 (an RGB image). A convolution layer and the pooling layer that follows it are together counted as one layer, because the pooling layer has no parameters of its own. After the convolution and pooling layers, the output volume is flattened into a (400, 1) vector. Fully connected layers are then applied, followed by a softmax at the last layer.
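One way to arrive at a 400-element vector from a 32 x 32 input is sketched below. The layer hyperparameters (5 x 5 filters with 6 and then 16 filters, 2 x 2 max pooling with stride 2, no padding) are assumptions in the style of LeNet-5, not values stated in the text; they are chosen because they reproduce the (400, 1) shape.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width/height for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

size = 32                                    # input is 32 x 32 spatially

size = conv_output_size(size, f=5)           # CONV1: 5 x 5, 6 filters  -> 28 x 28 x 6
size = conv_output_size(size, f=2, s=2)      # POOL1: 2 x 2, stride 2   -> 14 x 14 x 6
size = conv_output_size(size, f=5)           # CONV2: 5 x 5, 16 filters -> 10 x 10 x 16
size = conv_output_size(size, f=2, s=2)      # POOL2: 2 x 2, stride 2   -> 5 x 5 x 16

num_filters_last = 16
flattened = size * size * num_filters_last   # 5 * 5 * 16

print(size, flattened)                       # 5 400
```

The number of input channels does not appear in this calculation, because each filter spans all input channels and produces a single 2D output.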
With the end-to-end example of the convolutional neural network, we come to the end of the week one notes. Stay tuned for next week’s notes.
Happy Learning :)