Deep Learning 2: Part 1 Lesson 2

Jan 14, 2018

My personal notes from fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lesson 2

Notebook

Review of last lesson [01:02]

We used 3 lines of code to build an image classifier.
In order to train the model, data needs to be organized in a certain way under PATH (in this case data/dogscats/):

There should be train folder and valid folder, and under each of these, folders with classification labels (i.e. cats and dogs for this example) with corresponding images in them.
The training output: [epoch #, training loss, validation loss, accuracy]

[ 0.       0.04955  0.02605  0.98975]

Learning Rate [4:54]

The basic idea of learning rate is that it is going to decide how quickly we zoom/hone in on the solution.

If the learning rate is too small, it will take a very long time to get to the bottom.
If the learning rate is too big, it could oscillate away from the bottom.
Learning rate finder (learn.lr_find) will increase the learning rate after each mini-batch. Eventually, the learning rate becomes so high that the loss gets worse. We then look at the plot of learning rate against loss, find the lowest point, go back by one order of magnitude, and choose that as the learning rate (1e-2 in the example below).
A mini-batch is a set of a few images we look at each time so that we are using the parallel processing power of the GPU effectively (generally 64 or 128 images at a time).
In Python:

By adjusting this one number, you should be able to get pretty good results. fast.ai library picks the rest of the hyper parameters for you. But as the course progresses, we will learn that there are some more things we can tweak to get slightly better results. But learning rate is the key number for us to set.
Learning rate finder sits on top of other optimizers (e.g. momentum, Adam, etc.) and helps you choose the best learning rate given what other tweaks you are using (such as advanced optimizers but not limited to optimizers).
Question: What happens for optimizers that change the learning rate during the epoch? Is this finder choosing an initial learning rate? [14:05] We will learn about optimizers in detail later, but the basic answer is no. Even Adam has a learning rate which gets divided by the average previous gradient and also the recent sum of squared gradients. Even those so-called “dynamic learning rate” methods have a learning rate.
The most important thing you can do to make the model better is to give it more data. Since these models have millions of parameters, if you train them for a while, they start to do what is called “overfitting”.
Overfitting — the model is starting to see the specific details of the images in the training set rather than learning something general that can be transferred across to the validation set.
We can collect more data, but another easy way is data augmentation.
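To make the learning rate finder procedure described above concrete, here is a toy sketch. It is not the fast.ai implementation: it minimises f(w) = w² with plain gradient descent, treats every step as one mini-batch, and raises the learning rate exponentially after each step. The function name `toy_lr_find`, the quadratic loss, and the range 1e-4 to 10 are all assumptions made for illustration.

```python
import numpy as np

def toy_lr_find(start_lr=1e-4, end_lr=10.0, n_steps=101):
    """Toy learning rate range test on f(w) = w**2 (illustration only)."""
    # Learning rates grow exponentially from start_lr to end_lr
    lrs = np.logspace(np.log10(start_lr), np.log10(end_lr), n_steps)
    w = 1.0
    losses = []
    for lr in lrs:
        losses.append(w ** 2)   # loss seen on this "mini-batch"
        w = w - lr * 2 * w      # gradient step; the derivative of w**2 is 2w
    return lrs, np.array(losses)

lrs, losses = toy_lr_find()
best_lr = lrs[np.argmin(losses)]  # bottom of the curve: already on the edge of diverging
suggested_lr = best_lr / 10       # go back one order of magnitude
print(f"lowest loss at lr={best_lr:.3g}, suggested lr={suggested_lr:.3g}")
```

In this toy the loss only starts rising once the learning rate is large enough to overshoot, so the lowest point of the curve sits right next to the unstable region. That is the reason for stepping back by an order of magnitude rather than taking the minimum itself.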

Data Augmentation [15:50]

Every epoch, we will randomly change the image a little bit. In other words, the model is going to see a slightly different version of the image each epoch.
You want to use different types of data augmentation for different types of image (flip horizontally, vertically, zoom in, zoom out, vary contrast and brightness, and many more).
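As a rough illustration of "a slightly different version each epoch", the sketch below applies a random horizontal flip and a small brightness change to an image stored as a NumPy array. The function `random_augment`, the 0.9 to 1.1 brightness range, and the random array standing in for a photo are assumptions for this example; they are not what the fast.ai transforms do internally.

```python
import numpy as np

def random_augment(img, rng):
    """Return a randomly flipped and brightness-shifted copy of `img`.

    `img` is an (H, W, C) float array with values in [0, 1].
    """
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]          # horizontal flip
    scale = rng.uniform(0.9, 1.1)      # small brightness change
    return np.clip(out * scale, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((4, 6, 3))            # stand-in for a real photo
# One freshly augmented view of the same image per epoch
epoch_views = [random_augment(img, rng) for _ in range(3)]
```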

Learning Rate Finder Questions [19:11]:

Why not pick the bottom? The point at which the loss was lowest is where the red circle is. But the learning rate at that point was actually too large and is unlikely to converge. So the one before that would be a better choice (it is always better to pick a learning rate that is too small rather than too big).

When should we run lr_find? [23:02] Run it once at the start, and maybe after unfreezing layers (we will learn it later). Also when I change the thing I am training or change the way I am training it. Never any harm in running it.

Back to Data Augmentation [24:10]

tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

transforms_side_on — a predefined set of transformations for side-on photos (there is also transforms_top_down). Later we will learn how to create custom transform lists.
It is not exactly creating new data, but allows the convolutional neural net to learn how to recognize cats or dogs from somewhat different angles.

data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 1)

Now we created a new data object that includes augmentation. Initially, the augmentations actually do nothing because of precompute=True.
Convolutional neural networks have these things called “activations.” An activation is a number that says “this feature is in this place with this level of confidence (probability)”. We are using a pre-trained network which has already learned to recognize features (i.e. we do not want to change the weights it learned), so what we can do is pre-compute the activations for the hidden layers and just train the final linear portion.

This is why when you train your model for the first time, it takes longer — it is pre-computing these activations.
Even though we are trying to show a different version of the cat each time, we had already pre-computed the activations for a particular version of the cat (i.e. we are not re-calculating the activations with the altered version).
To use data augmentation, we have to do learn.precompute=False:

learn.precompute=False
learn.fit(1e-2, 3, cycle_len=1)
[ 0.       0.03597  0.01879  0.99365]
[ 1.       0.02605  0.01836  0.99365]
[ 2.       0.02189  0.0196   0.99316]

The bad news is that accuracy is not improving. Training loss is decreasing but validation loss is not, but we are not overfitting. Overfitting is when the training loss is much lower than the validation loss. In other words, when your model is doing a much better job on the training set than it is on the validation set, that means your model is not generalizing.
cycle_len=1 [30:17]: This enables stochastic gradient descent with restarts (SGDR). The basic idea is that as you get closer and closer to the spot with the minimal loss, you may want to start decreasing the learning rate (taking smaller steps) in order to get to exactly the right spot.
The idea of decreasing the learning rate as you train is called learning rate annealing, which is very common. The most common and “hacky” way to do this is to train a model with a certain learning rate for a while, and when it stops improving, manually drop down the learning rate (stepwise annealing).
A better approach is simply to pick some kind of functional form — it turns out a really good functional form is one half of the cosine curve, which maintains the high learning rate for a while at the beginning, then drops quickly when you get closer.

However, we may find ourselves in a part of the weight space that isn’t very resilient — that is, small changes to the weights may result in big changes to the loss. We want to encourage our model to find parts of the weight space that are both accurate and stable. Therefore, from time to time we increase the learning rate (this is the ‘restarts’ in ‘SGDR’), which will force the model to jump to a different part of the weight space if the current area is “spiky”. Here’s a picture of how that might look if we reset the learning rates 3 times (in this paper they call it a “cyclic LR schedule”):

The number of epochs between resetting the learning rate is set by cycle_len, and the number of times this happens is referred to as the number of cycles, and is what we're actually passing as the 2nd parameter to fit(). So here's what our actual learning rates looked like:
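A rough sketch of how such a schedule can be computed: within each cycle the learning rate follows half a cosine wave from the maximum down towards zero, and at the start of the next cycle it jumps back to the maximum. The function `sgdr_schedule` and the figure of 100 mini-batches per epoch are assumptions for illustration, not the library's own code.

```python
import numpy as np

def sgdr_schedule(max_lr, batches_per_epoch, n_cycle, cycle_len=1):
    """Learning rate for every mini-batch under cosine annealing with restarts."""
    n_batches = cycle_len * batches_per_epoch   # mini-batches in one cycle
    t = np.arange(n_batches)
    # Half a cosine wave: starts at max_lr and decays towards 0 within the cycle
    one_cycle = max_lr / 2 * (1 + np.cos(np.pi * t / n_batches))
    # Restart: repeat the same curve once per cycle
    return np.tile(one_cycle, n_cycle)

# 3 cycles of 1 epoch each, as in learn.fit(1e-2, 3, cycle_len=1)
lrs = sgdr_schedule(max_lr=1e-2, batches_per_epoch=100, n_cycle=3, cycle_len=1)
```

Plotting `lrs` against the mini-batch index should give three identical downward sweeps, each beginning at the learning rate passed to fit().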

Question: Could we get the same effect by using random starting points? [35:40] Before SGDR was created, people used to create “ensembles” where they would relearn a whole new model ten times in the hope that one of them would end up being better. In SGDR, once we get close enough to the optimal and stable area, resetting will not actually “reset”; the weights just keep getting better. So SGDR will give you better results than just randomly trying a few different starting points.
It is important to pick a learning rate (which is the highest learning rate SGDR uses) that is big enough to allow the reset to jump to a different part of the function. [37:25]
SGDR reduces the learning rate every mini-batch, and a reset occurs every cycle_len epochs (in this case it is set to 1).
Question: Our main goal is to generalize and not end up in the narrow optima. In this method, are we keeping track of the minima and averaging them and ensembling them? [39:27] That is another level of sophistication and you see “Snapshot Ensemble” in the diagram. We are not currently doing that but if you wanted it to generalize even better, you can save the weights right before the resets and take the average. But for now, we are just going to pick the last one.
If you want to skip ahead, there is a parameter called cycle_save_name which you can add as well as cycle_len, which will save a set of weights at the end of every learning rate cycle and then you can ensemble them [40:14].

Saving model [40:31]

learn.save('224_lastlayer')
learn.load('224_lastlayer')

When you precompute activations or create resized images (we will learn about it soon), various temporary files get created, which you see under the data/dogscats/tmp folder. If you are getting weird errors, it might be because of precomputed activations that are only half completed or are in some way incompatible with what you are doing. So you can always go ahead and delete this /tmp folder to see if it makes the error go away (fast.ai equivalent of turning it off and then on again).
You will also see there is a directory called /models. That is where models get saved when you say learn.save.

Fine Tuning and Differential Learning Rate [43:49]

So far, we have not retrained any of the pre-trained features — specifically, any of those weights in the convolutional kernels. All we have done is add some new layers on top and learn how to mix and match pre-trained features.
Images like satellite images, CT scans, etc. have totally different kinds of features altogether (compared to ImageNet images), so you want to re-train many layers.
For dogs and cats, images are similar to what the model was pre-trained with, but we still may find it is helpful to slightly tune some of the later layers.
Here is how you tell the learner that we want to start actually changing the convolutional filters themselves:

learn.unfreeze()

A “frozen” layer is a layer which is not being trained/updated. unfreeze unfreezes all the layers.
Earlier layers like the first layer (which detects diagonal edges or gradients) or the second layer (which recognizes corners or curves) probably do not need to change by much, if at all.
Later layers are much more likely to need more learning. So we create an array of learning rates (differential learning rate):

lr=np.array([1e-4,1e-3,1e-2])

1e-4 : for the first few layers (basic geometric features)
1e-3 : for the middle layers (sophisticated convolutional features)
1e-2 : for layers we added on top
Why 3? Actually they are 3 ResNet blocks but for now, think of it as a group of layers.
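The array above can also be generated from the learning rate of the top group and a spacing factor. This is only a sketch: the helper `differential_lrs` and its `factor` parameter are hypothetical names for illustration and are not part of the fast.ai library.

```python
import numpy as np

def differential_lrs(top_lr, n_groups=3, factor=10.0):
    """Learning rates for layer groups, earliest group first.

    Each earlier group gets a rate `factor` times lower than the next one.
    """
    exponents = np.arange(n_groups - 1, -1, -1)   # e.g. [2, 1, 0]
    return top_lr / factor ** exponents

lr = differential_lrs(1e-2)   # approximately [1e-4, 1e-3, 1e-2]
print(lr)
```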

Question: What if I have bigger images than the model was trained with? [50:30] The short answer is, with this library and the modern architectures we are using, we can use any size we like.

Question: Can we unfreeze just specific layers? [51:03] We are not doing it yet, but if you wanted to, you can call learn.unfreeze_to(n) (which will unfreeze layers from layer n onwards). Jeremy almost never finds it helpful, and he thinks that is because we are using differential learning rates, so the optimizer can learn just as much as it needs to. The one place he found it helpful is when he is using a really big, memory-intensive model and running out of GPU memory: the fewer layers you unfreeze, the less memory and time it takes.

Using differential learning rate, we are up to 99.5%! [52:28]

learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
[ 0.       0.04538  0.01965  0.99268]
[ 1.       0.03385  0.01807  0.99268]
[ 2.       0.03194  0.01714  0.99316]
[ 3.       0.0358   0.0166   0.99463]
[ 4.       0.02157  0.01504  0.99463]
[ 5.       0.0196   0.0151   0.99512]
[ 6.       0.01356  0.01518  0.9956 ]

Earlier we said 3 is the number of epochs, but it is actually the number of cycles. So if cycle_len=2, it will do 3 cycles where each cycle is 2 epochs (i.e. 6 epochs). Then why did it run 7 epochs here? It is because of cycle_mult.
cycle_mult=2 : this multiplies the length of the cycle after each cycle (1 epoch + 2 epochs + 4 epochs = 7 epochs).
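The epoch arithmetic can be checked with a few lines of Python. The helper name `total_epochs` is made up for this sketch; it simply adds up the cycle lengths as described above.

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Number of epochs run by n_cycle cycles whose length grows by cycle_mult."""
    total, length = 0, cycle_len
    for _ in range(n_cycle):
        total += length
        length *= cycle_mult   # the next cycle is cycle_mult times longer
    return total

print(total_epochs(3, cycle_len=1, cycle_mult=2))  # 1 + 2 + 4 = 7
print(total_epochs(3, cycle_len=2))                # 2 + 2 + 2 = 6
```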

Intuitively speaking [53:57], if the cycle length is too short, it starts going down to find a good spot, then pops out, and goes down trying to find a good spot and pops out, and never actually gets to find a good spot. Earlier on, you want it to do that because it is trying to find a spot that is smoother, but later on, you want it to do more exploring. That is why cycle_mult=2 seems to be a good approach.

We are introducing more and more hyper parameters having told you that there are not many. You can get away with just choosing a good learning rate, but then adding these extra tweaks helps get that extra level-up without any effort. In general, good starting points are:

n_cycle=3, cycle_len=1, cycle_mult=2
n_cycle=3, cycle_len=2 (no cycle_mult)

Question: why do smoother surfaces correlate to more generalized networks? [55:28]

Say you have something spiky (blue line). X-axis is showing how good this is at recognizing dogs vs. cats as you change this particular parameter. For something to be generalizable, we want it to work when we give it a slightly different dataset. A slightly different dataset may have a slightly different relationship between this parameter and how cat-like vs. dog-like it is. It may instead look like the red line. In other words, if we end up at the blue pointy part, then it is not going to do a good job on this slightly different dataset. If instead we end up on the wider blue part, it will still do a good job on the red dataset.

Here is some interesting discussion about spiky minima.

Test Time Augmentation (TTA) [56:49]

Our model has achieved 99.5%. But can we make it better still? Let’s take a look at pictures we predicted incorrectly:

Here, Jeremy printed out the whole of these pictures. When we do the validation set, all of our inputs to our model must be square. The reason is kind of a minor technical detail, but the GPU does not go very quickly if you have different dimensions for different images. It needs to be consistent so that every part of the GPU can do the same thing. This is probably fixable, but for now that is the state of the technology we have.

To make it square, we just pick out the square in the middle — as you can see below, it is understandable why this picture was classified incorrectly:

We are going to do what is called “Test Time Augmentation”. What this means is that we are going to take 4 data augmentations at random as well as the un-augmented original (center-cropped). We will then calculate predictions for all these images, take the average, and make that our final prediction. Note that this is only for validation set and/or test set.

To do this, all you have to do is learn.TTA() — which brings up the accuracy to 99.65%!

log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)
accuracy(probs, y)
0.99650000000000005
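To see what the averaging step does, here is a small self-contained example with made-up numbers. The array below is a hypothetical stand-in for the log-probabilities that TTA returns, with one slice per view (the original plus two augmentations rather than four, to keep it short), two images and two classes.

```python
import numpy as np

# Hypothetical log-probabilities, shape (views, images, classes); cat = 0, dog = 1
log_preds = np.log(np.array([
    [[0.6, 0.4], [0.2, 0.8]],   # un-augmented centre crop
    [[0.3, 0.7], [0.1, 0.9]],   # augmentation 1
    [[0.2, 0.8], [0.3, 0.7]],   # augmentation 2
]))
y = np.array([1, 1])            # both images are really dogs

probs = np.mean(np.exp(log_preds), axis=0)   # average the probabilities over the views
preds = np.argmax(probs, axis=1)
tta_accuracy = (preds == y).mean()

# For comparison: predictions from the centre crop alone
single_view_preds = np.argmax(log_preds[0], axis=1)
single_view_accuracy = (single_view_preds == y).mean()
```

In this toy case the centre crop alone misclassifies the first image, while the average over all views gets both right.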

Questions on augmentation approach [01:01:36]: Why not border or padding to make it square? Typically Jeremy does not do much padding, but instead he does a little bit of zooming. There is a thing called reflection padding that works well with satellite imagery. Generally speaking, using TTA plus data augmentation, the best thing to do is try to use as large an image as possible. Also, having fixed crop locations plus random contrast, brightness, rotation changes might be better for TTA.

Question: Data augmentation for non-image dataset? [01:03:35] No one seems to know. It seems like it would be helpful, but there are very few examples. In natural language processing, people have tried replacing synonyms for instance, but on the whole the area is under-researched and under-developed.

Question: Is fast.ai library open source? [01:05:34] Yes. He then covered the reason why Fast.ai switched from Keras + TensorFlow to PyTorch.

Random note: PyTorch is much more than just a deep learning library. It actually lets us write arbitrary GPU accelerated algorithms from scratch — Pyro is a great example of what people are now doing with PyTorch outside of deep learning.

Analyzing results [01:11:50]

Confusion matrix

A simple way to look at the result of a classification is called a confusion matrix — which is used not only for deep learning but in any kind of machine learning classifier. It is particularly helpful if there are four or five classes you are trying to predict, to see which group you are having the most trouble with.

preds = np.argmax(probs, axis=1)
probs = probs[:,1]
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, preds)
plot_confusion_matrix(cm, data.classes)
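For readers who want to see what the matrix contains, the sketch below counts the entries by hand with NumPy on a handful of made-up labels. It follows the same convention as scikit-learn's confusion_matrix (rows are true classes, columns are predicted classes); `confusion_counts` is a name chosen for this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 0, 1, 1, 1, 1])   # hypothetical labels: 0 = cat, 1 = dog
y_pred = np.array([0, 0, 1, 1, 1, 1, 0])
cm = confusion_counts(y_true, y_pred, n_classes=2)
print(cm)
```

The diagonal holds the correct predictions; the off-diagonal cells show which class gets mistaken for which.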

Let’s look at the pictures again [01:13:00]

Most incorrect cats (only the left two were incorrect — it displays 4 by default):

Most incorrect dogs:

Review: easy steps to train a world-class image classifier [01:14:09]

Enable data augmentation, and precompute=True
Use lr_find() to find highest learning rate where loss is still clearly improving
Train last layer from precomputed activations for 1–2 epochs
Train last layer with data augmentation (i.e. precompute=False) for 2–3 epochs with cycle_len=1
Unfreeze all layers
Set earlier layers to 3x-10x lower learning rate than next higher layer. Rule of thumb: 10x for ImageNet like images, 3x for satellite or medical imaging
Use lr_find() again (Note: if you call lr_find having set differential learning rates, what it prints out is the learning rate of the last layers.)
Train full network with cycle_mult=2 until over-fitting

Let’s do it again: Dog Breed Challenge [01:16:37]

You can use Kaggle CLI to download data for Kaggle competitions
Notebook is not made public since it is an active competition

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

PATH = 'data/dogbreed/'
sz = 224
arch = resnext101_64
bs=16

label_csv = f'{PATH}labels.csv'
n = len(list(open(label_csv)))-1
val_idxs = get_cv_idxs(n)

!ls {PATH}

This is a little bit different from our previous dataset. Instead of a train folder which has a separate folder for each breed of dog, it has a CSV file with the correct labels. We will read the CSV file with Pandas. Pandas is what we use in Python to do structured data analysis on things like CSV files, and it is usually imported as pd:

label_df = pd.read_csv(label_csv)
label_df.head()

label_df.pivot_table(index='breed', aggfunc=len).sort_values('id', ascending=False)

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, 
                       max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', 
                 f'{PATH}labels.csv', test_name='test', 
                 val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

max_zoom — we will zoom in up to 1.1 times
ImageClassifierData.from_csv — last time, we used from_paths but since the labels are in CSV file, we will call from_csv instead.
test_name — we need to specify where the test set is if you want to submit to Kaggle competitions
val_idxs — there is no validation folder but we still want to track how good our performance is locally. So above you will see:

n = len(list(open(label_csv)))-1 : Open CSV file, create a list of rows, then take the length. -1 because the first row is a header. Hence n is the number of images we have.

val_idxs = get_cv_idxs(n) : “get cross validation indexes” — this will return, by default, a random 20% of the rows (indexes to be precise) to use as a validation set. You can also pass val_pct to get a different amount.

suffix='.jpg' — File names have .jpg at the end, but the ids in the CSV file do not. So we set suffix so it knows the full file names.
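The idea behind picking validation indexes can be sketched in plain NumPy. This is an illustration of the concept only, not the code of get_cv_idxs; the function `random_val_idxs`, the seed, and the row count of 1000 are assumptions.

```python
import numpy as np

def random_val_idxs(n, val_pct=0.2, seed=42):
    """Pick a random val_pct fraction of the row indexes 0..n-1 for validation."""
    rng = np.random.default_rng(seed)
    n_val = int(val_pct * n)
    return rng.permutation(n)[:n_val]   # shuffle all indexes, keep the first n_val

n = 1000                                # hypothetical number of labelled rows
val_idxs = random_val_idxs(n)
train_idxs = np.setdiff1d(np.arange(n), val_idxs)   # everything else is for training
```

Fixing the seed keeps the split identical between runs, so validation numbers from different experiments stay comparable.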

fn = PATH + data.trn_ds.fnames[0]; fn
'data/dogbreed/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg'

You can access the training dataset by saying data.trn_ds, and trn_ds contains a lot of things including file names (fnames).

img = PIL.Image.open(fn); img

img.size
(500, 375)

Now we check the image sizes. If they are huge, then you have to think really carefully about how to deal with them. If they are tiny, it is also challenging. Most ImageNet models are trained on either 224 by 224 or 299 by 299 images.

size_d = {k: PIL.Image.open(PATH+k).size for k in data.trn_ds.fnames}

Dictionary comprehension — key: name of the file, value: size of the image

row_sz, col_sz = list(zip(*size_d.values()))

*size_d.values() will unpack a list. zip will pair up elements of tuples to create a list of tuples.
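The same two idioms can be tried on a tiny example without any images. The file names and sizes below are made up; note that PIL reports an image's size as (width, height), so the first tuple produced here holds the widths.

```python
# Hypothetical file names with (width, height) pairs standing in for PIL's Image.size
sizes = [("a.jpg", (500, 375)), ("b.jpg", (640, 480)), ("c.jpg", (300, 400))]

size_d = {name: size for name, size in sizes}   # dictionary comprehension
row_sz, col_sz = list(zip(*size_d.values()))    # "unzip" the pairs into two tuples

print(row_sz)  # (500, 640, 300)
print(col_sz)  # (375, 480, 400)
```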

plt.hist(row_sz);

Matplotlib is something you want to be very familiar with if you do any kind of data science or machine learning in Python. Its pyplot module is conventionally imported as plt.

Question: How many images should we use as a validation set? [01:26:28] Using 20% is fine unless the dataset is small — then 20% is not enough. If you train the same model multiple times and you are getting very different validation set results, then your validation set is too small. If the validation set is smaller than a thousand, it is hard to interpret how well you are doing. If you care about the third decimal place of accuracy and you only have a thousand things in your validation set, a single image changes the accuracy. If you care about the difference between 0.01 and 0.02, you want that to represent 10 or 20 rows. Normally 20% seems to work fine.
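The arithmetic behind this rule of thumb is simple enough to write down. Both helper names below are invented for this sketch.

```python
def accuracy_step(n_val):
    """Smallest possible change in accuracy: one image out of n_val."""
    return 1 / n_val

def images_behind_difference(n_val, acc_diff):
    """How many validation images an accuracy difference corresponds to."""
    return round(n_val * acc_diff)

print(accuracy_step(1000))                   # 0.001: one image moves the third decimal
print(images_behind_difference(1000, 0.01))  # 10
```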

def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on,
                           max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', 
               f'{PATH}labels.csv', test_name='test', num_workers=4,
               val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
    return data if sz>300 else data.resize(340, 'tmp')

Here are the regular two lines of code. When we start working with a new dataset, we want everything to go super fast. So we made it possible to specify the size and start with something like 64, which will run fast. Later, we will use bigger images and bigger architectures, at which point you may run out of GPU memory. If you see a CUDA out of memory error, the first thing you need to do is restart the kernel (you cannot recover from it), then make the batch size smaller.

data = get_data(224, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 5)
[0.      1.99245 1.0733  0.76178]
[1.      1.09107 0.7014  0.8181 ]
[2.      0.80813 0.60066 0.82148]
[3.      0.66967 0.55302 0.83125]
[4.      0.57405 0.52974 0.83564]

83% for 120 classes is pretty good.

learn.precompute = False
learn.fit(1e-2, 5, cycle_len=1)

Reminder: an epoch is one pass through the data; a cycle is however many epochs you said are in a cycle.

learn.save('224_pre')
learn.load('224_pre')

Increase image size [1:32:55]

learn.set_data(get_data(299, bs))

If you trained a model on smaller size images, you can then call learn.set_data and pass in a larger size dataset. That is going to take your model, however it has been trained so far, and it is going to let you continue to train on larger images.

Starting training on small images for a few epochs, then switching to bigger images, and continuing training is an amazingly effective way to avoid overfitting.

learn.fit(1e-2, 3, cycle_len=1)
[0.      0.35614 0.22239 0.93018]
[1.      0.28341 0.2274  0.92627]
[2.      0.28341 0.2274  0.92627]

As you see, validation set loss (0.2274) is much lower than training set loss (0.28341) — which means it is underfitting. When you are underfitting, it means cycle_len=1 is too short (the learning rate is getting reset before it had the chance to zoom in properly). So we will add cycle_mult=2 (i.e. 1st cycle is 1 epoch, 2nd cycle is 2 epochs, and 3rd cycle is 4 epochs).

learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)
[0.      0.27171 0.2118  0.93192]
[1.      0.28743 0.21008 0.9324 ]
[2.      0.25328 0.20953 0.93288]
[3.      0.23716 0.20868 0.93001]
[4.      0.23306 0.20557 0.93384]
[5.      0.22175 0.205   0.9324 ]
[6.      0.2067  0.20275 0.9348 ]

Now the validation loss and training loss are about the same — this is on the right track. Then we try TTA:

log_preds, y = learn.TTA()
probs = np.exp(log_preds)
accuracy(log_preds,y), metrics.log_loss(y, probs)
(0.9393346379647749, 0.20101565705592733)

Other things to try:

Try running one more cycle of 2 epochs
Unfreezing (in this case, training convolutional layers did not help in the slightest since the images actually came from ImageNet)
Remove validation set and just re-run the same steps, and submit that — which lets us use 100% of the data.

Question: How do we deal with an unbalanced dataset? [01:38:46] This dataset is not totally balanced (between 60 and 100) but it is not unbalanced enough that Jeremy would give it a second thought. A recent paper says the best way to deal with a very unbalanced dataset is to make copies of the rare cases.
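"Making copies of the rare cases" can be sketched as building a list of row indexes in which the rare classes are repeated. The function `oversample_idxs` and the choice to pad every class up to the size of the largest one are assumptions made for this illustration; the paper mentioned above may do it differently.

```python
import numpy as np

def oversample_idxs(labels, seed=0):
    """Row indexes in which every class appears as often as the largest class.

    Rare classes are padded with randomly chosen copies of their own rows.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    out = []
    for c, n in zip(classes, counts):
        idxs = np.flatnonzero(labels == c)
        extra = rng.choice(idxs, size=target - n, replace=True)  # duplicate rare rows
        out.append(np.concatenate([idxs, extra]))
    return np.concatenate(out)

labels = np.array([0] * 8 + [1] * 2)    # hypothetical 8 : 2 imbalance
balanced = oversample_idxs(labels)      # indexes to train on instead of 0..n-1
```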

Question: Difference between precompute=True and unfreeze?

We started with a pre-trained network
We added a couple of layers on the end of it which start out random. With everything frozen and precompute=True, all we are learning is the layers we have added.
With precompute=True, data augmentation does not do anything because we are showing exactly the same activations each time.
We then set precompute=False which means we are still only training the layers we added because it is frozen but data augmentation is now working because it is actually going through and recalculating all of the activations from scratch.
Then finally, we unfreeze which is saying “okay, now you can go ahead and change all of these earlier convolutional filters”.

Question: Why not just set precompute=False from the beginning? The only reason to have precompute=True is it is much faster (10 or more times). If you are working with quite a large dataset, it can save quite a bit of time. There is no accuracy reason ever to use precompute=True .
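The trade-off can be pictured with a toy frozen "body". With precompute=True the body runs once and its outputs are reused every epoch, so changes to the input never reach the trainable head; with precompute=False the activations are recalculated from the (augmented) input each time. Everything in this sketch, including the names `frozen_body` and `augment` and the random data, is a hypothetical stand-in rather than fast.ai code.

```python
import numpy as np

rng = np.random.default_rng(0)
frozen_weights = rng.normal(size=(8, 4))       # stand-in for the pre-trained layers

def frozen_body(x):
    """Frozen feature extractor: its weights never change during training."""
    return np.maximum(x @ frozen_weights, 0)   # linear layer followed by ReLU

def augment(x):
    """Toy augmentation: add a little noise to the input."""
    return x + rng.normal(scale=0.1, size=x.shape)

images = rng.normal(size=(5, 8))               # 5 hypothetical flattened "images"

# precompute=True: run the frozen body once and reuse the result every epoch.
# The head only ever sees `cached`, so augmenting `images` has no effect.
cached = frozen_body(images)

# precompute=False: activations are recomputed from the augmented input,
# which is slower but lets the head see a different version each epoch.
recomputed = frozen_body(augment(images))
```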

Minimal steps to get good results:

Use lr_find() to find highest learning rate where loss is still clearly improving
Train last layer with data augmentation (i.e. precompute=False) for 2–3 epochs with cycle_len=1
Unfreeze all layers
Set earlier layers to 3x-10x lower learning rate than next higher layer
Train full network with cycle_mult=2 until over-fitting

Question: Does reducing the batch size only affect the speed of training? [1:43:34] Yes, pretty much. If you are showing it fewer images each time, then it is calculating the gradient with fewer images, hence less accurately. In other words, knowing which direction to go and how far to go in that direction is less accurate. So as you make the batch size smaller, you are making it more volatile. It impacts the optimal learning rate that you would need to use, but in practice, dividing the batch size by 2 vs. 4 does not seem to change things very much. If you change the batch size by much, you can re-run the learning rate finder to check.
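The "more volatile" point can be checked numerically: the average of a small random sample fluctuates more than the average of a large one. The sketch below uses made-up per-example gradients drawn from a normal distribution, which is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
per_example_grads = rng.normal(loc=1.0, scale=1.0, size=10_000)  # hypothetical gradients

def minibatch_grad_std(bs, n_trials=500):
    """Spread of the mini-batch gradient estimate for batch size `bs`."""
    estimates = [rng.choice(per_example_grads, size=bs, replace=False).mean()
                 for _ in range(n_trials)]
    return float(np.std(estimates))

std_16 = minibatch_grad_std(16)   # small batches: noisier gradient estimate
std_64 = minibatch_grad_std(64)   # larger batches: steadier estimate
print(std_16, std_64)
```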

Question: What are the grey images vs. the ones on the right?

Visualizing and Understanding Convolutional Networks

For layer 1, they are exactly what the filters look like. It is easy to visualize because its inputs are pixels. Later on, it gets harder because the inputs are themselves activations, which are combinations of activations. Zeiler and Fergus came up with a clever technique to show what the filters tend to look like on average — called deconvolution (we will learn in Part 2). The ones on the right are examples of image patches which activated that filter highly.

Question: What would you have done if the dog was off to the corner or tiny (re: dog breed identification)? [01:47:16] We will learn about it in Part 2, but there is a technique that allows you to figure out roughly which parts of an image most likely have the interesting things in them. Then you can crop out that area.

Further improvement [01:48:16]

Two things we can do immediately to make it better:

Assuming the size of images you were using is smaller than the average size of images you have been given, you can increase the size. As we have seen before, you can increase it during training.
Use better architecture. There are different ways of putting together what size convolutional filters and how they are connected to each other, and different architectures have different number of layers, size of kernels, filters, etc.

We have been using ResNet34 — a great starting point and often a good finishing point because it does not have too many parameters and works well with small datasets. There is another architecture called ResNext which was the second-place winner in last year’s ImageNet competition. ResNext50 takes twice as long and 2–4 times more memory than ResNet34.

Here is the notebook, which is almost identical to the original dogs vs. cats notebook but uses ResNext50, and which achieved 99.75% accuracy.

Satellite Imagery [01:53:01]