Deep Learning 2: Part 1 Lesson 2

Lesson 2

Review of last lesson [01:02]

  • We used 3 lines of code to build an image classifier.
  • In order to train the model, data needs to be organized in a certain way under PATH (in this case data/dogscats/):
  • There should be a train folder and a valid folder, and under each of these, folders named after the classification labels (e.g. cats and dogs for this example) containing the corresponding images.
  • The training output: [epoch #, training loss, validation loss, accuracy]
[ 0.       0.04955  0.02605  0.98975]
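The folder layout described above can be checked with a few lines of standard-library Python before training. This is only a sketch: the helper name class_folders is hypothetical (it is not part of the fastai library), and the example builds a throwaway directory tree rather than reading data/dogscats/.

```python
from pathlib import Path
import tempfile

def class_folders(path):
    """Return {split: sorted class-folder names} for a train/valid layout."""
    root = Path(path)
    return {
        split: sorted(p.name for p in (root / split).iterdir() if p.is_dir())
        for split in ("train", "valid")
    }

# Build a throwaway example tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "valid"):
        for label in ("cats", "dogs"):
            (Path(tmp) / split / label).mkdir(parents=True)
    layout = class_folders(tmp)

print(layout)
```

If train and valid list different class folders, that is worth fixing before any training run.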

Learning Rate [4:54]

  • The basic idea of learning rate is that it is going to decide how quickly we zoom/hone in on the solution.
  • If the learning rate is too small, it will take a very long time to get to the bottom.
  • If the learning rate is too big, it could oscillate away from the bottom.
  • Learning rate finder (learn.lr_find) will increase the learning rate after each mini-batch. Eventually, the learning rate becomes so high that the loss gets worse. We then look at the plot of learning rate against loss, find the lowest point, go back by one order of magnitude, and choose that as the learning rate (1e-2 in the example below).
  • A mini-batch is a small set of images we look at each time so that we use the parallel processing power of the GPU effectively (generally 64 or 128 images at a time).
  • In Python:
  • By adjusting this one number, you should be able to get pretty good results. The library picks the rest of the hyperparameters for you. As the course progresses, we will learn that there are some more things we can tweak to get slightly better results, but the learning rate is the key number for us to set.
  • Learning rate finder sits on top of other optimizers (e.g. momentum, Adam, etc.) and helps you choose the best learning rate given whatever other tweaks you are using (such as advanced optimizers, but not limited to optimizers).
  • Question: what happens for optimizers that change the learning rate during the epoch? Is this finder choosing an initial learning rate? [14:05] We will learn about optimizers in detail later, but the basic answer is no. Even Adam has a learning rate which gets divided by the average previous gradient and also the recent sum of squared gradients. Even those so-called “dynamic learning rate” methods have a learning rate.
  • The most important thing you can do to make the model better is to give it more data. Since these models have millions of parameters, if you train them for a while, they start to do what is called “overfitting”.
  • Overfitting — the model is starting to see the specific details of the images in the training set rather than learning something general that can be transferred across to the validation set.
  • We can collect more data, but another easy way is data augmentation.
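The learning rate finder idea described above can be sketched on a toy problem. This is not fastai's implementation of learn.lr_find: it assumes a small linear regression trained with plain mini-batch SGD, and the function name lr_find, the synthetic data, and the "stop when the loss is 4x the best so far" rule are all illustrative choices.

```python
import numpy as np

def lr_find(X, y, rng, start_lr=1e-5, end_lr=10.0, n_steps=100, bs=64):
    """Raise the learning rate after every mini-batch and record the loss."""
    w = np.zeros(X.shape[1])
    tried, losses, best = [], [], np.inf
    for lr in np.geomspace(start_lr, end_lr, n_steps):
        idx = rng.integers(0, len(X), size=bs)   # one random mini-batch
        err = X[idx] @ w - y[idx]
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss) or loss > 4 * best:
            break                                # loss has clearly blown up
        best = min(best, loss)
        tried.append(lr)
        losses.append(loss)
        w -= lr * 2 * X[idx].T @ err / bs        # plain SGD step
    return np.array(tried), np.array(losses)

# Synthetic regression data (an assumption made purely for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * rng.normal(size=2048)

lrs, losses = lr_find(X, y, rng)
lr_at_min = lrs[np.argmin(losses)]
chosen_lr = lr_at_min / 10   # go back one order of magnitude
print(f"lowest loss at lr={lr_at_min:.3g}, chosen lr={chosen_lr:.3g}")
```

Plotting losses against lrs on a log-scaled x axis gives the kind of curve the notes describe: flat at first, then falling, then shooting up once the learning rate is too high.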

Data Augmentation [15:50]

  • Every epoch, we will randomly change the image a little bit. In other words, the model is going to see a slightly different version of the image each epoch.
  • You want to use different types of data augmentation for different types of image (flip horizontally, vertically, zoom in, zoom out, vary contrast and brightness, and many more).
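To make "a slightly different version each epoch" concrete, here is a minimal NumPy sketch of random augmentation. It is not how fastai implements its transforms: the function random_augment, the 50% flip probability, and the brightness range are assumptions for illustration, and the "image" is just a random array.

```python
import numpy as np

def random_augment(img, rng, max_shift=0.1):
    """Randomly flip horizontally and shift brightness.

    img is an H x W x C float array with values in [0, 1].
    """
    out = img
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                    # horizontal flip
    delta = rng.uniform(-max_shift, max_shift)   # brightness shift
    return np.clip(out + delta, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((4, 6, 3))                      # stand-in for a real photo

# The same source image yields a different view in each "epoch".
epoch_views = [random_augment(img, rng) for _ in range(3)]
```

Which transforms are safe depends on the images: a vertical flip is reasonable for satellite tiles but would produce upside-down cats.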

Learning Rate Finder Questions [19:11]:

  • Why not pick the bottom? The point at which the loss was lowest is where the red circle is. But the learning rate at that point was actually too large and is unlikely to converge. So the one before that would be a better choice (it is always better to pick a learning rate that is too small than one that is too big).
  • When should we run lr_find? [23:02] Run it once at the start, and maybe after unfreezing layers (we will learn about that later). Also whenever I change the thing I am training or the way I am training it. There is never any harm in running it.

Back to Data Augmentation [24:10]

tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
  • transforms_side_on: a predefined set of transformations for side-on photos (there is also transforms_top_down). Later we will learn how to create custom transform lists.
  • It is not exactly creating new data, but allows the convolutional neural net to learn how to recognize cats or dogs from somewhat different angles.
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
…, 1)
  • Now we created a new data object that includes augmentation. Initially, the augmentations actually do nothing because of precompute=True.
  • Convolutional neural networks have these things called “activations.” An activation is a number that says “this feature is in this place with this level of confidence (probability)”. We are using a pre-trained network which has already learned to recognize features (i.e. we do not want to change the weights it learned), so what we can do is pre-compute the activations for the hidden layers and just train the final linear portion.
  • This is why when you train your model for the first time, it takes longer — it is pre-computing these activations.
  • Even though we are trying to show a different version of the cat each time, we had already pre-computed the activations for a particular version of the cat (i.e. we are not re-calculating the activations with the altered version).
  • To use data augmentation, we have to set learn.precompute=False:
…, 3, cycle_len=1)
[ 0.       0.03597  0.01879  0.99365]
[ 1. 0.02605 0.01836 0.99365]
[ 2. 0.02189 0.0196 0.99316]
  • The bad news is that accuracy is not improving. Training loss is decreasing but validation loss is not, yet we are not overfitting. Overfitting is when the training loss is much lower than the validation loss. In other words, when your model is doing a much better job on the training set than on the validation set, it is not generalizing.
  • cycle_len=1 [30:17]: This enables stochastic gradient descent with restarts (SGDR). The basic idea is that as you get closer and closer to the spot with the minimal loss, you may want to start decreasing the learning rate (taking smaller steps) in order to get to exactly the right spot.
  • The idea of decreasing the learning rate as you train is called learning rate annealing, which is very common. The most common and “hacky” way to do this is to train a model with a certain learning rate for a while, and when it stops improving, manually drop the learning rate (stepwise annealing).
  • A better approach is simply to pick some kind of functional form. It turns out a really good functional form is one half of the cosine curve, which maintains the high learning rate for a while at the beginning, then drops quickly as you get closer.
  • However, we may find ourselves in a part of the weight space that isn’t very resilient — that is, small changes to the weights may result in big changes to the loss. We want to encourage our model to find parts of the weight space that are both accurate and stable. Therefore, from time to time we increase the learning rate (this is the ‘restarts’ in ‘SGDR’), which will force the model to jump to a different part of the weight space if the current area is “spiky”. Here’s a picture of how that might look if we reset the learning rates 3 times (in this paper they call it a “cyclic LR schedule”):
  • The number of epochs between resetting the learning rate is set by cycle_len, and the number of times this happens is referred to as the number of cycles, and is what we're actually passing as the 2nd parameter to fit(). So here's what our actual learning rates looked like:
  • Question: Could we get the same effect by using random starting points? [35:40] Before SGDR was created, people used to create “ensembles” where they would relearn a whole new model ten times in the hope that one of them would end up being better. In SGDR, once we get close enough to the optimal and stable area, resetting will not actually “reset”; the weights just keep getting better. So SGDR will give you better results than randomly trying a few different starting points.
  • It is important to pick a learning rate (which is the highest learning rate SGDR uses) that is big enough to allow the reset to jump to a different part of the function. [37:25]
  • SGDR reduces the learning rate every mini-batch, and a reset occurs every cycle_len epochs (in this case it is set to 1).
  • Question: Our main goal is to generalize and not end up in the narrow optima. In this method, are we keeping track of the minima and averaging them and ensembling them? [39:27] That is another level of sophistication and you see “Snapshot Ensemble” in the diagram. We are not currently doing that but if you wanted it to generalize even better, you can save the weights right before the resets and take the average. But for now, we are just going to pick the last one.
  • If you want to skip ahead, there is a parameter called cycle_save_name which you can add as well as cycle_len, which will save a set of weights at the end of every learning rate cycle and then you can ensemble them [40:14].
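The schedule described above (half-cosine decay within a cycle, then a jump back up at each restart) can be sketched as a function of the mini-batch index. This is an illustration rather than fastai's exact formula: the name sgdr_schedule and the batches_per_epoch argument are assumptions, and the minimum learning rate is taken to be zero.

```python
import numpy as np

def sgdr_schedule(max_lr, n_cycles, cycle_len, batches_per_epoch):
    """Learning rate per mini-batch: half-cosine decay, reset every cycle."""
    steps = cycle_len * batches_per_epoch
    t = np.arange(steps) / steps                 # progress within one cycle, in [0, 1)
    one_cycle = max_lr * 0.5 * (1 + np.cos(np.pi * t))
    return np.tile(one_cycle, n_cycles)          # repeat the cycle n_cycles times

# 3 cycles of 1 epoch each, with a hypothetical 100 mini-batches per epoch.
lrs = sgdr_schedule(max_lr=1e-2, n_cycles=3, cycle_len=1, batches_per_epoch=100)
```

The learning rate starts each cycle at max_lr, which is why the notes stress picking a value large enough for the restart to jump out of a "spiky" region.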

Saving model [40:31]

…'224_lastlayer')
learn.load('224_lastlayer')
  • When you precompute activations or create resized images (we will learn about it soon), various temporary files get created, which you can see under the data/dogscats/tmp folder. If you are getting weird errors, it might be because of precomputed activations that are only half completed or are in some way incompatible with what you are doing. So you can always go ahead and delete this /tmp folder to see if it makes the error go away (the equivalent of turning it off and then on again).
  • You will also see there is a directory called /models that is where models get saved when you say

Fine Tuning and Differential Learning Rate [43:49]

  • So far, we have not retrained any of the pre-trained features, specifically any of the weights in the convolutional kernels. All we have done is add some new layers on top and learn how to mix and match pre-trained features.
  • Images like satellite images, CT scans, etc. have totally different kinds of features altogether (compared to ImageNet images), so you want to re-train many layers.
  • For dogs and cats, images are similar to what the model was pre-trained with, but we still may find it is helpful to slightly tune some of the later layers.
  • Here is how you tell the learner that we want to start actually changing the convolutional filters themselves:
  • A “frozen” layer is a layer which is not being trained/updated. unfreeze unfreezes all the layers.
  • Earlier layers like the first layer (which detects diagonal edges or gradient) or the second layer (which recognizes corners or curves) probably do not need to change by much, if at all.
  • Later layers are much more likely to need more learning. So we create an array of learning rates (differential learning rate):
  • 1e-4 : for the first few layers (basic geometric features)
  • 1e-3 : for the middle layers (sophisticated convolutional features)
  • 1e-2 : for layers we added on top
  • Why 3? They are actually 3 ResNet blocks, but for now, think of each as a group of layers.
…, 3, cycle_len=1, cycle_mult=2)
[ 0.       0.04538  0.01965  0.99268]
[ 1. 0.03385 0.01807 0.99268]
[ 2. 0.03194 0.01714 0.99316]
[ 3. 0.0358 0.0166 0.99463]
[ 4. 0.02157 0.01504 0.99463]
[ 5. 0.0196 0.0151 0.99512]
[ 6. 0.01356 0.01518 0.9956 ]
  • Earlier we said 3 is the number of epochs, but it is actually the number of cycles. So if cycle_len=2, it will do 3 cycles where each cycle is 2 epochs (i.e. 6 epochs). Then why did it run 7 epochs? It is because of cycle_mult.
  • cycle_mult=2 : this multiplies the length of the cycle after each cycle (1 epoch + 2 epochs + 4 epochs = 7 epochs).
  • n_cycle=3, cycle_len=1, cycle_mult=2
  • n_cycle=3, cycle_len=2 (no cycle_mult)
  • Here is some interesting discussion about spiky minima.
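The epoch arithmetic above is easy to check in a few lines. The helper cycle_lengths is hypothetical (it is not a fastai function); it just lists how many epochs each cycle lasts for given n_cycle, cycle_len, and cycle_mult values.

```python
def cycle_lengths(n_cycle, cycle_len, cycle_mult=1):
    """Epochs in each cycle: every cycle is cycle_mult times longer than the last."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycle)]

with_mult = cycle_lengths(n_cycle=3, cycle_len=1, cycle_mult=2)
without_mult = cycle_lengths(n_cycle=3, cycle_len=2)

print(with_mult, sum(with_mult))        # [1, 2, 4] 7
print(without_mult, sum(without_mult))  # [2, 2, 2] 6
```

This matches the seven rows of training output shown earlier for 3 cycles with cycle_len=1 and cycle_mult=2.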

Test Time Augmentation (TTA) [56:49]

log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)
accuracy(probs, y)
0.99650000000000005
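The averaging step in the code above can be illustrated without a model. The sketch below assumes that log_preds has shape (number of augmented versions, number of images, number of classes), which is consistent with taking the mean over axis 0; the random "predictions" are stand-ins, not real TTA output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_aug, n_images, n_classes = 5, 4, 2

# Stand-in for model output: log-probabilities for several versions of each image.
logits = rng.normal(size=(n_aug, n_images, n_classes))
log_preds = logits - np.log(np.exp(logits).sum(axis=2, keepdims=True))  # log-softmax

# Convert back to probabilities, then average over the augmented versions.
probs = np.mean(np.exp(log_preds), axis=0)
preds = probs.argmax(axis=1)
```

Averaging probabilities (after np.exp) rather than log-probabilities keeps each row summing to 1, so the result can be used directly as a class distribution.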

Analyzing results [01:11:50]

Confusion matrix

preds = np.argmax(probs, axis=1)
probs = probs[:,1]
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, preds)
plot_confusion_matrix(cm, data.classes)
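For readers who want to see what the confusion matrix contains, here is a small NumPy sketch that counts the same thing by hand. The function confusion_counts and the example labels are made up for illustration; rows are true classes and columns are predicted classes, the same convention as sklearn's confusion_matrix.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulate one count per (true, pred) pair
    return cm

y_true = np.array([0, 0, 0, 1, 1, 1, 1])   # hypothetical labels: 0 = cat, 1 = dog
y_pred = np.array([0, 0, 1, 1, 1, 1, 0])
cm = confusion_counts(y_true, y_pred, n_classes=2)

print(cm)                                   # [[2 1]
                                            #  [1 3]]
print(np.trace(cm) / cm.sum())              # accuracy = correct / total
```

The off-diagonal cells are the mistakes: here one cat predicted as a dog and one dog predicted as a cat.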

Let’s look at the pictures again [01:13:00]

Review: easy steps to train a world-class image classifier [01:14:09]

  1. Enable data augmentation, and precompute=True
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer from precomputed activations for 1–2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2–3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to 3x-10x lower learning rate than next higher layer. Rule of thumb: 10x for ImageNet like images, 3x for satellite or medical imaging
  7. Use lr_find() again (Note: if you call lr_find having set differential learning rates, what it prints out is the learning rate of the last layers.)
  8. Train full network with cycle_mult=2 until over-fitting

Let’s do it again: Dog Breed Challenge [01:16:37]

  • You can use Kaggle CLI to download data for Kaggle competitions
  • Notebook is not made public since it is an active competition
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
PATH = 'data/dogbreed/'
sz = 224
arch = resnext101_64
label_csv = f'{PATH}labels.csv'
n = len(list(open(label_csv)))-1
val_idxs = get_cv_idxs(n)
!ls {PATH}
label_df = pd.read_csv(label_csv)
label_df.pivot_table(index='breed', aggfunc=len).sort_values('id', ascending=False)
How many dog images per breed
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train',
                                    f'{PATH}labels.csv', test_name='test',
                                    val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
  • max_zoom — we will zoom in up to 1.1 times
  • ImageClassifierData.from_csv — last time, we used from_paths but since the labels are in CSV file, we will call from_csv instead.
  • test_name — we need to specify where the test set is if you want to submit to Kaggle competitions
  • val_idxs: there is no validation folder but we still want to track how good our performance is locally. So above you will see:
  • suffix='.jpg': file names have .jpg at the end, but the CSV file does not. So we set suffix so it knows the full file names.
fn = PATH + data.trn_ds.fnames[0]; fn
'data/dogbreed/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg'
  • You can access the training dataset with data.trn_ds; trn_ds contains a lot of things, including file names (fnames).
img = …; img
img.size
(500, 375)
  • Now we check the image size. If the images are huge, you have to think really carefully about how to deal with them. If they are tiny, that is also challenging. Most ImageNet models are trained on either 224 by 224 or 299 by 299 images.
size_d = {k: … for k in data.trn_ds.fnames}
  • Dictionary comprehension: the key is the name of the file, the value is the size of that image.
row_sz, col_sz = list(zip(*size_d.values()))
  • *size_d.values() will unpack a list. zip will pair up elements of tuples to create a list of tuples.
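The zip(*...) idiom is easier to see on a tiny example. The dictionary below uses made-up file names and sizes in place of the real size_d.

```python
# Hypothetical stand-in for size_d: file name -> (first dimension, second dimension).
size_d = {"a.jpg": (500, 375), "b.jpg": (333, 500), "c.jpg": (400, 300)}

# *size_d.values() passes each tuple as a separate argument to zip,
# which then groups all first elements together and all second elements together.
row_sz, col_sz = list(zip(*size_d.values()))

print(row_sz)  # (500, 333, 400)
print(col_sz)  # (375, 500, 300)
```

Each result is a tuple with one entry per image, ready to be turned into an array and plotted as a histogram.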
Histogram of rows
  • Matplotlib is something you want to be very familiar with if you do any kind of data science or machine learning in Python. Its pyplot module is conventionally imported as plt.
def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train',
                                        f'{PATH}labels.csv', test_name='test', num_workers=4,
                                        val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)
    return data if sz>300 else data.resize(340, 'tmp')
  • Here are the regular two lines of code, wrapped in a function. When we start working with a new dataset, we want everything to go super fast, so we made it possible to specify the size and start with something like 64, which will run fast. Later, we will use bigger images and bigger architectures, at which point you may run out of GPU memory. If you see a CUDA out of memory error, the first thing you need to do is restart the kernel (you cannot recover from it), then make the batch size smaller.
data = get_data(224, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)
…, 5)
[0.      1.99245 1.0733  0.76178]
[1. 1.09107 0.7014 0.8181 ]
[2. 0.80813 0.60066 0.82148]
[3. 0.66967 0.55302 0.83125]
[4. 0.57405 0.52974 0.83564]
  • 83% for 120 classes is pretty good.
learn.precompute = False
…, 5, cycle_len=1)
  • Reminder: an epoch is one pass through the data; a cycle is however many epochs you said are in a cycle.
…'224_pre')

Increase image size [1:32:55]

learn.set_data(get_data(299, bs))
  • If you trained a model on smaller size images, you can then call learn.set_data and pass in a larger size dataset. That is going to take your model, however it has been trained so far, and it is going to let you continue to train on larger images.

Starting training on small images for a few epochs, then switching to bigger images, and continuing training is an amazingly effective way to avoid overfitting.
…, 3, cycle_len=1)
[0.      0.35614 0.22239 0.93018]
[1. 0.28341 0.2274 0.92627]
0.28341 0.2274 0.92627]
  • As you see, validation set loss (0.2274) is lower than training set loss (0.28341), which means the model is underfitting. Underfitting here means cycle_len=1 is too short (the learning rate is getting reset before it has had the chance to zoom in properly). So we will add cycle_mult=2 (i.e. the 1st cycle is 1 epoch, the 2nd cycle is 2 epochs, and the 3rd cycle is 4 epochs).
…, 3, cycle_len=1, cycle_mult=2)
[0.      0.27171 0.2118  0.93192]
[1. 0.28743 0.21008 0.9324 ]
[2. 0.25328 0.20953 0.93288]
[3. 0.23716 0.20868 0.93001]
[4. 0.23306 0.20557 0.93384]
[5. 0.22175 0.205 0.9324 ]
[6. 0.2067 0.20275 0.9348 ]
  • Now the validation loss and training loss are about the same — this is about the right track. Then we try TTA :
log_preds, y = learn.TTA()
probs = np.exp(log_preds)
accuracy(log_preds,y), metrics.log_loss(y, probs)
(0.9393346379647749, 0.20101565705592733)
  • Try running one more cycle of 2 epochs
  • Unfreezing (in this case, training convolutional layers did not help in the slightest since the images actually came from ImageNet)
  • Remove validation set and just re-run the same steps, and submit that — which lets us use 100% of the data.
  • We started with a pre-trained network
  • We added a couple of layers on the end of it which start out random. With everything frozen and precompute=True, all we are learning is the layers we have added.
  • With precompute=True, data augmentation does not do anything because we are showing exactly the same activations each time.
  • We then set precompute=False which means we are still only training the layers we added because it is frozen but data augmentation is now working because it is actually going through and recalculating all of the activations from scratch.
  • Then finally, we unfreeze which is saying “okay, now you can go ahead and change all of these earlier convolutional filters”.
  1. Use lr_find() to find highest learning rate where loss is still clearly improving
  2. Train last layer with data augmentation (i.e. precompute=False) for 2–3 epochs with cycle_len=1
  3. Unfreeze all layers
  4. Set earlier layers to 3x-10x lower learning rate than next higher layer
  5. Train full network with cycle_mult=2 until over-fitting
Visualizing and Understanding Convolutional Networks

Further improvement [01:48:16]

  1. Assuming the size of images you were using is smaller than the average size of images you have been given, you can increase the size. As we have seen before, you can increase it during training.
  2. Use better architecture. There are different ways of putting together what size convolutional filters and how they are connected to each other, and different architectures have different number of layers, size of kernels, filters, etc.

Satellite Imagery [01:53:01]

  • transforms_top_down: since these are satellite images, they still make sense when flipped vertically.
  • Much higher learning rate: something to do with this particular dataset
  • lrs = np.array([lr/9,lr/3,lr]): the differential learning rates now change by 3x because the images are quite different from ImageNet images
  • sz=64: this helped to avoid overfitting for satellite images, but he would not do that for dogs vs. cats or dog breeds (images similar to ImageNet), as 64 by 64 is quite tiny and might destroy the pre-trained weights.
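The two rules of thumb for differential learning rates (10x between layer groups for ImageNet-like images, 3x for satellite or medical images) can be written as one small helper. The function differential_lrs is hypothetical, and the base rate of 0.2 for the satellite case is an assumed value for illustration; the notes only say the learning rate was "much higher".

```python
import numpy as np

def differential_lrs(lr, factor, n_groups=3):
    """Learning rate per layer group, earliest group first.

    The last group gets lr; each earlier group is `factor` times lower.
    """
    return np.array([lr / factor ** i for i in range(n_groups - 1, -1, -1)])

imagenet_like = differential_lrs(1e-2, factor=10)   # 1e-4, 1e-3, 1e-2
satellite = differential_lrs(0.2, factor=3)         # lr/9, lr/3, lr

print(imagenet_like)
print(satellite)
```

A smaller factor means the early layers are allowed to change more, which suits images whose low-level features differ from ImageNet's.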

How to get your AWS setup [01:58:54]




Hiromi Suenaga
