Faster AI: Lesson 3 — TL;DR version of Fast.ai Part 1

Published in Deep Learning Journal

Aug 29, 2017

This is Lesson 3 of a series called Faster AI. If you haven’t read Lesson 0, Lesson 1 and Lesson 2, please go through them first.

This lesson goes deeper into the concepts of Convolutional Neural Networks and provides some techniques on how to improve your model.

For the sake of simplicity, I have divided this lesson into 3 parts:

CNN in detail [Time: 08:48]
Underfitting and Overfitting [Time: 1:12:52]
Reducing Overfitting and maximizing accuracy of the model [Time: 1:28:42]

1. CNN in detail

In our last lesson, we went through some visualizations showing what each layer of a Convolutional Neural Network can learn from the input. In this lesson, Jeremy talks about another such tool, called the Deep Visualization Toolbox. Using this tool, you can input your own image, go through each layer of a CNN model and see for yourself what each layer learned from it.

The purpose of all these tools is to give you an intuitive understanding of what a convolutional neural network does.

To better understand such networks, let's look at what convolution means.

Let us take the example of this image of the digit 7, from the MNIST dataset.

Previously, we talked about filters. A filter is a small matrix of numbers, typically 3x3. In a convolutional network, the filter slides across the input image from left to right and top to bottom until it has covered every part of the image. For example, this is one such filter:

When this filter slides across the pixels of the digit 7 image above, each 3x3 patch of pixel values is multiplied element by element with the filter, and the products are summed into a single output value. This process is called convolution.

Here is the resulting image after convolving the digit 7 image with the filter above.

If we zoom in on one part of the original digit 7 image, before the convolution, this is what we get:

Now, if we look at the pixel values of this part, we see:

As you can see, these values are multiplied with the filter above, and the resulting output for that particular part is:

In a convolutional neural network, each layer has many such filters.

Although Jeremy predefined the filter values in this case for demonstration, that is not how an actual neural network works.

In a real network, the filters are randomly initialized at first. With each training iteration, Stochastic Gradient Descent adjusts the filter values, and after enough iterations we arrive at filter values that extract useful features from the input image.
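The sliding-window operation described above can be sketched in a few lines of NumPy. This is only an illustration: the 5x5 image and the top-edge filter are made-up values, not the ones from the lesson, and the function uses no padding and a stride of 1.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product of the patch and the filter, summed to one number.
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical 5x5 image with one bright horizontal stroke (like the top bar of a 7).
image = np.zeros((5, 5))
image[1, :] = 1.0

# Hypothetical filter that responds to the top edge of a bright region.
top_edge = np.array([[-1, -1, -1],
                     [ 1,  1,  1],
                     [ 0,  0,  0]], dtype=float)

feature_map = convolve2d(image, top_edge)
print(feature_map)
```

The first output row is strongly positive because the stroke lines up with the row of 1s in the filter, and the second row is negative because the stroke lines up with the row of -1s.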

You can visit this page for more information about filters and what are some popular filters used today: http://setosa.io/ev/image-kernels/

Pooling

Pooling is another important part of a convolutional neural network. After a convolution, the filters produce feature maps, like the digit 7 image after convolution shown above. These feature maps need to be reduced in resolution so that the later layers of the network can process them faster. This process of shrinking the feature maps is called pooling.

Max pooling is one of the most widely used pooling methods in CNNs. Suppose the max pool has a size of 5x5. Like a filter, it starts at the top of the feature map and selects the maximum pixel value within the 5x5 block it is placed on. It repeats this across the whole feature map, keeping only 1 value out of each 5x5 block and reducing the resolution significantly. The new, smaller matrix is what gets fed into the subsequent layers of the network.
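As a rough sketch of the idea, here is max pooling over non-overlapping blocks in NumPy. The block size is a parameter; the example uses 2x2 on a tiny made-up 4x4 feature map so the result is easy to check by eye, but the same function works with the 5x5 size mentioned above.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop any rows or columns that do not fill a complete block.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split the map into blocks, then take the max inside each block.
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 4x4 feature map holding the values 0..15.
fm = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(fm, size=2)
print(pooled)   # a 2x2 map: each value is the max of one 2x2 block
```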

We use max pooling because convolving across high-resolution feature maps in the deeper layers of a CNN is expensive, yet the network still needs to learn from them. So each max pool reduces the resolution, and in the layer after each max pool we double the number of filters relative to the previous layer.

This lowers the resolution but increases the number of feature maps (the depth). With fewer positions to convolve over, the network can still capture richer detail about the image through these additional feature maps.

Another important aspect of max pooling is that it makes the network roughly position invariant. Suppose you want to recognize cats: regardless of whether the cat is at the top, right, bottom or middle of the image, the CNN can still recognize it. At the same time, some parts of an image need to be in a fixed position relative to each other; in a face, for example, the eyes need to be close to the nose. The network can learn these relative positions as well, so it is position invariant where that helps and position sensitive where the input requires it.

Here is an image showing resolution reduction after max pooling.

Padding

Padding, in the simplest terms, means adding a border of zeros around the image so that the output of a convolution has the same size as the input.

Suppose we have a 3x3 filter. Without padding, each convolution trims one pixel from every edge, so the output is 2 pixels smaller than the input in both height and width. Padding is used to overcome that.
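The size arithmetic can be checked with a small sketch. It assumes a stride of 1 and a 28x28 input (the MNIST image size); the helper name conv_output_size is made up for this example.

```python
import numpy as np

def conv_output_size(input_size, filter_size, padding):
    """Output width (or height) of a stride-1 convolution."""
    return input_size + 2 * padding - filter_size + 1

image = np.ones((28, 28))

# Add a one-pixel border of zeros on every side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

no_pad_size = conv_output_size(28, filter_size=3, padding=0)  # shrinks by 2
same_size = conv_output_size(28, filter_size=3, padding=1)    # stays at 28
print(padded.shape, no_pad_size, same_size)
```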

CNN Architectures

Generally, a convolutional neural network uses the following hierarchy of layers:

Input Layer
Convolutional Layers
Dense (Fully Connected) Layers

We can see such an implementation in Keras:

On the third line of the code, we define the input layer with 3 channels and a resolution of 224x224. Following that, we have 5 convolutional blocks, each taking the number of layers as its first argument and the number of filters as its second. After the convolutional blocks, we have fully connected (linear) blocks at the end.

Inside each ConvBlock, this is what we have:

Here we add zero padding and a convolutional layer for each layer of the block. At the end of the block, a max pooling layer pools the output of the last convolutional layer.
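To see what this block structure does to the data, the sketch below traces the shape of the feature maps through five blocks in plain Python. It is not the Keras code from the lesson. It assumes the standard VGG16 block configuration, zero-padded 3x3 convolutions that keep height and width unchanged, and a 2x2 max pool at the end of each block; the function name trace_shapes is made up for this example.

```python
def trace_shapes(input_shape, blocks):
    """Return the (channels, height, width) shape after each conv block."""
    channels, height, width = input_shape
    shapes = []
    for n_layers, n_filters in blocks:
        # Zero-padded 3x3 convolutions keep height and width unchanged, so the
        # number of layers in a block does not affect the shape. Only the
        # channel count changes, to the number of filters.
        channels = n_filters
        # The 2x2 max pool that closes the block halves height and width.
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes

# Assumed VGG16-style configuration: (number of layers, number of filters).
vgg16_blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

shapes = trace_shapes((3, 224, 224), vgg16_blocks)
for block, shape in zip(vgg16_blocks, shapes):
    print(block, "->", shape)
```

Under these assumptions the resolution halves after every block while the number of filters grows, which is the trade-off described in the pooling section.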

Softmax

In classification problems involving more than 2 classes, a softmax function is generally used at the last layer of the network. It converts the outputs into probabilities, one per class, that add up to 1, pushing the largest output toward a high probability and the rest toward low ones.

The class with the highest probability is the predicted class for the input image.
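A minimal NumPy sketch of softmax follows. The three raw scores are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something covered in the lesson.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # avoids overflow in exp, result unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw outputs of the last layer for a 3-class problem.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)
```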

Stochastic Gradient Descent (SGD)

The basic difference between gradient descent and SGD is that SGD computes each update on a batch or mini-batch of the training data, whereas gradient descent is generally computed on the whole training set.

If we remember the straight-line example from our last lesson, we went through the concept of gradient descent there as well.

With SGD, we calculate the derivative of the loss function with respect to the weights (a and b) and multiply these derivatives by a small number such as 0.1, 0.001 or 0.0001, called the learning rate. The resulting small number is then subtracted from the weights, which start out randomly initialized.

Repeating this process every iteration moves the weights toward their optimum values, and so toward a function that approximates the equation of the straight line.

The same concept applies to a neural network with far more parameters than the 2 in our previous case. It works the same way; only the computation is more involved. So, to understand it intuitively, we just have to know that the basic procedure is the same and deep neural networks simply have more parameters.
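The update rule for the straight-line case can be sketched as follows. The true line (a = 3, b = 8), the mean squared error loss, the batch size and the number of epochs are all assumptions chosen for this illustration, not values taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data lying exactly on the line y = 3x + 8.
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 8.0

a, b = rng.normal(size=2)   # randomly initialized weights
learning_rate = 0.1
batch_size = 20

for epoch in range(200):
    order = rng.permutation(len(x))           # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        error = (a * xb + b) - yb
        # Derivatives of the mean squared error with respect to a and b.
        grad_a = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        # Step against the gradient, scaled by the learning rate.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b

print(a, b)   # should end up close to 3 and 8
```

Each update only looks at 20 points rather than all 200, which is what distinguishes this from full-batch gradient descent.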

2. Underfitting and Overfitting

There are two common reasons for a model to perform poorly:

Underfitting
Overfitting

Underfitting happens when you use a model that is not powerful enough for the problem, or one with fewer parameters than the problem needs. For example, using a linear model to classify images.

Overfitting happens when the model has more parameters than it needs and is trained so closely on the training data that it fails to perform on the test set. It learns the details of the training data too well and fails to recognize the general patterns in test data. One way to check for it is to compare training accuracy with validation accuracy: if training accuracy is much higher, the model is overfitting.

Dropout

Dropout is one of the techniques used to reduce overfitting. It randomly sets a certain percentage of a layer's activations to zero during training, which makes it much harder for the network to overfit on a given task. Throwing away good activations at random sounds counterintuitive, yet it works surprisingly well at improving the accuracy of the network without overfitting it.

Dropout is mainly placed on the last layers because putting it on the early layers makes the network lose information about the input data, which is undesirable. Instead, we can afford to lose some activations later on to reduce overfitting.
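Here is a minimal NumPy sketch of dropout applied to a layer's activations. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; that choice, the 50% rate and the all-ones activations are assumptions made for this illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the activations while training."""
    if not training or rate == 0.0:
        return activations           # dropout is switched off at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
acts = np.ones((1000, 100))          # hypothetical layer activations
dropped = dropout(acts, rate=0.5, rng=rng)
print((dropped == 0).mean())         # fraction of activations that were dropped
```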

3. Reducing Overfitting and maximizing accuracy of the model

There are five ways other than dropout to reduce overfitting. They are:

Add more data to the network
Add Data Augmentation
Use Architectures that generalize well
Add Regularization
Reduce Architecture Complexity

Among them, let's talk about Data Augmentation.

Data Augmentation

Data Augmentation means modifying the training data on the fly by applying functions that randomly rotate it, flip it, zoom it, change its colors and alter other properties, so that the model overfits less.

Which types of augmentation to use is largely a matter of intuition and experimentation. If a certain augmentation makes sense for your data, you should definitely apply it, and keep trying different types until you find the ones that perform best.

As an example of data augmentation, suppose we have a picture of a cat:

After data augmentation, this is what we get:
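The idea of producing a slightly different version of the same picture on every pass can be sketched in NumPy. This is a simplified stand-in for a real augmentation pipeline: it only does a random horizontal flip and a small random shift, the shift wraps pixels around the border, and the 8x8 array stands in for an actual image.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a 2D image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]           # horizontal flip
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    # np.roll wraps pixels around the border: a simplification of a real shift.
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)   # hypothetical stand-in image

# Each call yields a different variant of the same underlying picture.
batch = [augment(image, rng) for _ in range(4)]
print([variant.shape for variant in batch])
```

Whether a given transformation makes sense depends on the data: a flipped cat is still a cat, but a flipped digit may no longer be a valid digit.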

Batch Normalization

To normalize an input means to subtract its mean from it and then divide the result by its standard deviation. This puts the input values of a network on the same scale and range, which keeps the weights, and therefore the activations, on a similar scale and ultimately helps the network perform well.

But normalizing the activations is equally important. Even after the input has been normalized, some activations can end up much larger than the rest, which makes the subsequent layers behave erratically and ultimately makes the network fragile.

Normalizing these activations is called Batch Normalization. It is done in a similar way to input normalization, but with 2 extra trainable parameters, a scale and a shift, which are optimized by gradient descent. They let the network adjust the normalized values on each iteration rather than being forced to a fixed scale, while still giving us well-behaved activations.
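The training-time computation can be sketched in NumPy as below. The names gamma (scale) and beta (shift) follow common convention, eps is a small constant to avoid division by zero, and the activations are random made-up values. A real layer also keeps running averages for use at test time, which this sketch leaves out.

```python
import numpy as np

def batch_norm(activations, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift it."""
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    normalized = (activations - mean) / np.sqrt(var + eps)
    # gamma and beta are the 2 extra parameters learned by gradient descent.
    return gamma * normalized + beta

rng = np.random.default_rng(0)
# Hypothetical batch of 64 examples with 4 activations each, on a large scale.
acts = rng.normal(loc=50.0, scale=10.0, size=(64, 4))

gamma = np.ones(4)    # initial scale
beta = np.zeros(4)    # initial shift
out = batch_norm(acts, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))   # roughly 0 and 1 per feature
```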

Another great thing about batch normalization is that it can speed up training considerably (around 10 times, according to the lesson) and helps reduce overfitting as well.

In Keras you can use it by simply calling BatchNormalization() after the layers of your model. It is a good idea to put it after every layer of your model.

You can view the code of above implementations here

End to End MNIST

While explaining all these fantastic techniques to reduce overfitting, Jeremy shows us how to apply them to the MNIST dataset. He does so by going through each technique sequentially, one after another, on the same model in the following way:

He first tried to classify MNIST data with a linear model
Then he added one hidden layer to that model and tried to classify with that
After that he moved to a Convolutional Neural Network
On the Convolutional Neural Network, he applied these techniques to reduce overfitting: a) Data Augmentation b) Batch Normalization + Data Augmentation c) Batch Normalization + Data Augmentation + Dropout

You can view the code of above implementation here.

Ensembling

Ensembling, in simple terms, means combining the predictions of several models to increase the overall accuracy on a particular problem.

In this case, Jeremy took all the code from above, including the different learning rates, and wrapped it in a function called fit_model().

Here, different learning rates are given to the model after a certain number of epochs. This function is then called 6 times, and the trained model from each of the six runs is put in a list called models.

Now let us get the predictions from each of these models as a list:

Now, if we check the shape of the predictions, what we get is:

Here, 6 is the number of models, 10,000 is the number of test images (the size of the MNIST test set) and 10 is the number of outputs per image.

Now, if we take the mean of these predictions and calculate the accuracy, we get this:

Around 99.7% accuracy, which is a much better result, and we can see it doesn’t take much to get to this level of accuracy.
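The averaging step can be sketched in NumPy without training anything. The predictions below are synthetic stand-ins (the true label plus independent random noise for each of 6 imaginary models), so the printed accuracies only illustrate the mechanism and are not MNIST results. The sizes and noise level are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_images, n_classes = 6, 1000, 10   # hypothetical sizes

labels = rng.integers(0, n_classes, size=n_images)
one_hot = np.eye(n_classes)[labels]

# Synthetic scores: each "model" sees the truth plus its own independent noise.
noise = rng.normal(scale=0.6, size=(n_models, n_images, n_classes))
all_preds = one_hot[None, :, :] + noise       # shape (6, 1000, 10)

# Accuracy of each model on its own.
individual_accs = (all_preds.argmax(axis=2) == labels).mean(axis=1)

# Ensemble: average the predictions over the model axis, then pick the top class.
avg_preds = all_preds.mean(axis=0)            # shape (1000, 10)
ensemble_acc = (avg_preds.argmax(axis=1) == labels).mean()

print(all_preds.shape, individual_accs.round(3), ensemble_acc)
```

Averaging helps here because the errors of the individual models are independent, so they partly cancel out. Models that make the same mistakes would gain much less from ensembling.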

All the code of above implementation is here.

Here is the link to the actual video

Here are the notes of this lesson

If you want to jump to any particular topic and watch video from there, here is the link to the video timeline

In this lesson, we went through convolutional neural networks in detail and ways to improve them. In our next lesson, we will go through different types of optimizers and learn about collaborative filtering.

See you there.