Deep Learning 2: Part 1 Lesson 7

Jan 10, 2018

My personal notes from fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Lesson 7

The theme of Part 1 is:

classification and regression with deep learning
identifying and learning best and established practices
focus is on classification and regression which is predicting “a thing” (e.g. a number, a small number of labels)

Part 2 of the course:

focus is on generative modeling which means predicting “lots of things” — for example, creating a sentence as in neural translation, image captioning, or question answering while creating an image such as in style transfer, super-resolution, segmentation and so forth.
not as much best practices but a little more speculative from recent papers that may not be fully tested.

Review of Char3Model [02:49]

Reminder: RNN is not in any way different or unusual or magical — just a standard fully connected network.

Arrows represent one or more layer operations — generally speaking a linear followed by a non-linear function, in this case matrix multiplications followed by relu or tanh
Arrows of the same color represent exactly the same weight matrix being used.
One slight difference from previous is that there are inputs coming in at the second and third layers. We tried two approaches — concatenating and adding these inputs to the current activations.

class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)

        # The 'green arrow' from our diagram
        self.l_in = nn.Linear(n_fac, n_hidden)

        # The 'orange arrow' from our diagram
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # The 'blue arrow' from our diagram
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))
        
        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h))

By using nn.Linear we get both the weight matrix and the bias vector wrapped up for free for us.
To deal with the fact that there is no orange arrow coming in for the first ellipse, we invented an empty matrix (all zeros) to use as the initial hidden state.

class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

Almost identical except for the for loop

class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

PyTorch version — nn.RNN will create the loop and keep track of h as it goes along.
We are using white section to predict the green character — which seems wasteful as the next section mostly overlaps with the current section.

We then tried splitting it into non-overlapping pieces in multi-output model:

In this approach, we threw away our h activation after processing each section and started a new one. To predict the second character from the first one in the next section, the model has nothing to go on but a default activation. Let’s not throw away h.

Stateful RNN [08:52]

class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

One additional line in the constructor: self.init_hidden(bs) sets self.h to a bunch of zeros.
Wrinkle #1 [10:51]: if we were to simply do self.h = h and trained on a document that is a million characters long, the unrolled version of the RNN would have a million layers (ellipses). A one-million-layer fully connected network is going to be very memory intensive, because to apply the chain rule we have to multiply through one million layers while remembering all one million gradients for every batch.
To avoid this, we tell it to forget its history from time to time. We can still remember the state (the values in our hidden matrix) without remembering everything about how we got there.

def repackage_var(h):
    return Variable(h.data) if type(h) == Variable else tuple(repackage_var(v) for v in h)

Grab the tensor out of Variable h (remember, a tensor itself does not have any concept of history), and create a new Variable out of that. The new Variable has the same value but no history of operations, so when back-propagation reaches it, it stops there.
forward processes 8 characters and then back-propagates through eight layers. It keeps track of the values in our hidden state but throws away its history of operations. This is called back-propagation through time (bptt).
In other words, after the for loop, just throw away the history of operations and start afresh. So we are keeping our hidden state but we are not keeping our hidden state history.
Another good reason not to back-propagate through too many layers is that if you have any kind of gradient instability (e.g. exploding or vanishing gradients), the more layers you have, the harder the network gets to train (slower and less resilient).
On the other hand, the longer bptt means that you are able to explicitly capture a longer memory and more state.
Wrinkle #2 [16:00] — how to create mini-batches. We do not want to process one section at a time, but a bunch in parallel at a time.
When we started looking at TorchText for the first time, we talked about how it creates these mini-batches.
Jeremy said we take a whole long document consisting of the entire works of Nietzsche or all of the IMDB reviews concatenated together, we split this into 64 equal sized chunks (NOT chunks of size 64).

For a document that is 64 million characters long, each “chunk” will be 1 million characters. We stack them together and then split them by bptt, so one mini-batch is a 64 by bptt matrix.
The first character of the second chunk (the 1,000,001st character) is likely to be in the middle of a sentence. But that is okay, since it only happens once every million characters.
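
The chunking described above can be sketched in plain Python. This is only an illustration of the idea, not TorchText's actual implementation: the function name `batchify` and the toy sizes are assumptions made for this example. Each mini-batch is laid out with one row per time step, which matches the way the models in these notes loop over `cs`.

```python
def batchify(text, bs, bptt):
    """Split `text` into `bs` equal chunks, then yield mini-batches of up to
    `bptt` time steps. Hypothetical helper for illustration only; the real
    libraries do this with tensors rather than Python lists.
    """
    chunk_len = len(text) // bs              # length of each of the bs chunks
    text = text[:chunk_len * bs]             # drop the remainder that does not fit
    chunks = [text[i * chunk_len:(i + 1) * chunk_len] for i in range(bs)]
    for start in range(0, chunk_len, bptt):
        window = [c[start:start + bptt] for c in chunks]
        # Row t holds the t-th character of every chunk in this window.
        yield [[w[t] for w in window] for t in range(len(window[0]))]

# 24 characters, 4 chunks of 6 characters each, 3 time steps per mini-batch.
batches = list(batchify("abcdefghijklmnopqrstuvwx", bs=4, bptt=3))
for row in batches[0]:
    print(row)
```

Each column of a mini-batch continues exactly where the same column of the previous mini-batch stopped, which is what makes carrying the hidden state across batches meaningful.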

Question: Data augmentation for this kind of dataset? [20:34]

There is no known good way. Somebody recently won a Kaggle competition by doing data augmentation which randomly inserted parts of different rows, and something like that may be useful here. But there have not been any recent state-of-the-art NLP papers doing this kind of data augmentation.

Question: How do we choose the size of bptt? [21:36]

There are a couple things to think about:

The first is that the mini-batch matrix has a size of bs (# of chunks) by bptt, so your GPU RAM must be able to fit that by your embedding matrix. If you get a CUDA out of memory error, you need to reduce one of these.
If your training is unstable (e.g. your loss suddenly shoots off to NaN), try decreasing your bptt, because there are fewer layers for gradients to explode through.
If it is too slow [22:44], try decreasing your bptt, because it does one of those steps at a time. The for loop cannot be parallelized (in the current version). There is a recent approach called QRNN (Quasi-Recurrent Neural Network) which does parallelize it, and we hope to cover it in part 2.
So pick the highest number that satisfies all these.

Stateful RNN & TorchText [23:23]

When using an existing API which expects data in a certain format, you can either change your data to fit that format or write your own dataset sub-class to handle the format your data is already in. Either is fine, but in this case we will put our data in the format TorchText already supports. The fast.ai wrapper around TorchText already lets you specify a training path and a validation path, with one or more text files in each path containing a bunch of text that gets concatenated together for your language model.

from torchtext import vocab, data
from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'
TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}
models/  nietzsche.txt  trn/  val/

%ls {PATH}trn
trn.txt

Made a copy of Nietzsche file, pasted into training and validation directory. Then deleted the last 20% of the rows from training set, and deleted everything but the last 20% from the validation set [25:15].
The other benefit of doing it this way is that it seems like it is more realistic to have a validation set that was not a random shuffled set of rows of text, but it was totally separate part of the corpus.
When you are doing a language model, you do not really need separate files. You can have multiple files but they just get concatenated together anyway.

TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)
(963, 56, 1, 493747)

In TorchText, we make this thing called Field and initially Field is just a description of how to go about pre-processing the text.
lower — we told it to lowercase the text
tokenize — Last time, we used a function that splits on whitespace that gave us a word model. This time, we want a character model, so use list function to tokenize strings. Remember, in Python, list('abc') will return ['a', 'b', 'c'] .
bs : batch size, bptt : we renamed cs , n_fac : size of embedding, n_hidden : size of our hidden state
We do not have a separate test set, so we’ll just use validation set for testing
TorchText randomizes the length of bptt a little bit each time. It does not always give us exactly 8 characters; 5% of the time, it will cut it in half, and it adds a small standard deviation to make it slightly bigger or smaller than 8. We cannot shuffle the data since it needs to be contiguous, so this is a way to introduce some randomness.
Question [31:46]: Does the size remain constant per mini-batch? Yes, we need to do matrix multiplication with h weight matrix, so mini-batch size must remain constant. But sequence length can change no problem.
len(md.trn_dl) : length of data loader (i.e. how many mini-batches), md.nt : number of tokens (i.e. how many unique things are in the vocabulary)
Once you run LanguageModelData.from_text_files, TEXT will contain an extra attribute called vocab. TEXT.vocab.itos is a list of the unique items in the vocabulary, and TEXT.vocab.stoi is the reverse mapping from each item to its number.
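
To make the itos / stoi idea concrete, here is a minimal sketch of a character-level vocabulary in plain Python. The helper `build_vocab` and the special tokens are assumptions for illustration, not TorchText's code (TorchText orders items by frequency, whereas this sketch sorts them alphabetically to keep the result easy to follow).

```python
from collections import Counter

def build_vocab(text, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for a character-level vocabulary.

    Hypothetical stand-in for what a Field builds: characters seen fewer
    than `min_freq` times are left out and map to "<unk>" when encoding.
    """
    counts = Counter(list(text.lower()))        # lower=True, tokenize=list
    kept = sorted(ch for ch, n in counts.items() if n >= min_freq)
    itos = list(specials) + kept                # index -> string
    stoi = {s: i for i, s in enumerate(itos)}   # string -> index
    return itos, stoi

itos, stoi = build_vocab("Hello hello world", min_freq=2)
# 'd' appears only once, so it falls back to the "<unk>" index.
encoded = [stoi.get(ch, stoi["<unk>"]) for ch in "hold"]
print(itos)
print(encoded)
```

The number of tokens md.nt corresponds to len(itos) in this sketch.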

class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

Wrinkle #3 [33:51]: Jeremy lied to us when he said that mini-batch size remains constant. It is very likely that the last mini-batch is shorter than the rest unless the dataset is exactly divisible by bptt times bs. That is why we check whether the second dimension of self.h is the same as the bs of the input. If it is not, we reset it to zeros with the input’s bs. This happens at the end of the epoch and at the beginning of the next epoch (setting it back to the full batch size).
Wrinkle #4 [35:44]: The last wrinkle is something that slightly sucks about PyTorch and maybe somebody can be nice enough to try and fix it with a PR. Loss functions are not happy receiving a rank 3 tensor (i.e. three dimensional array). There is no particular reason they ought to not be happy receiving a rank 3 tensor (sequence length by batch size by results — so you can just calculate loss for each of the two initial axis). Works for rank 2 or 4, but not 3.
.view will reshape rank 3 tensor into rank 2 of -1 (however big as necessary) by vocab_size. TorchText automatically changes the target to be flattened out, so we do not need to do that for actual values (when we looked at a mini-batch in lesson 4, we noticed that it was flattened. Jeremy said we will learn about why later, so later is now.)
As of PyTorch 0.3, log_softmax requires us to specify which axis we want to do the softmax over (i.e. which axis we want to sum to one). In this case we want to do it over the last axis, dim=-1.
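
The two steps at the end of forward (log-softmax over the last axis, then flattening rank 3 to rank 2) can be illustrated without PyTorch. This is a sketch on nested lists with made-up numbers; the names `log_softmax`, `outp` and `flat` are chosen for this example only.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

# Toy "rank 3" output: seq_len=2, batch_size=3, vocab_size=4 (made-up numbers).
outp = [[[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
        [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 0.0], [0.3, 0.1, 0.4, 0.2]]]

# dim=-1: apply log-softmax independently along the last (vocabulary) axis.
log_probs = [[log_softmax(scores) for scores in step] for step in outp]

# .view(-1, vocab_size): merge the first two axes into one, keep the last.
flat = [scores for step in log_probs for scores in step]

print(len(flat), len(flat[0]))   # rows = seq_len * batch_size, columns = vocab_size
```

After exponentiating, every row of `flat` sums to one, which is what "the axis we want to sum to one" means.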

m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

fit(m, md, 4, opt, F.nll_loss)

Let’s gain more insight by unpacking RNN [42:48]

We remove the use of nn.RNN and replace it with nn.RNNCell . PyTorch source code looks like the following. You should be able to read and understand (Note: they do not concatenate the input and the hidden state, but they sum them together — which was our first approach):

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

Question about tanh [44:06]: As we saw last week, tanh forces the value to be between -1 and 1. Since we are multiplying by this weight matrix again and again, we would worry that relu (since it is unbounded) might have more of a gradient explosion problem. Having said that, RNNCell lets you specify a different nonlinearity (the default is tanh), so you can ask it to use relu if you want to.

class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []
        o = self.h
        for c in cs: 
            o = self.rnn(self.e(c), o)
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

The for loop is back: we append the output of each RNNCell step to a list, which in the end gets stacked up together.
fast.ai library actually does exactly this in order to use regularization approaches that are not supported by PyTorch.

Gated Recurrent Unit (GRU) [46:44]

In practice, nobody really uses RNNCell, since even with tanh, gradient explosions are still a problem and we need to use a low learning rate and a small bptt to get them to train. So what we do is replace RNNCell with something like GRUCell.

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/

Normally, the input gets multiplied by a weight matrix to create new activations h, which get added to the existing activations straight away. That is not what happens here.
The input goes into h˜, and it doesn’t just get added to the previous activations: the previous activations first get multiplied by r (the reset gate), whose values lie between 0 and 1.
r is calculated as below: a matrix multiplication of some weight matrix and the concatenation of our previous hidden state and the new input. In other words, this is a little one-hidden-layer neural net. It gets put through the sigmoid function as well. This mini neural net learns to determine how much of the hidden state to remember (maybe forget it all when it sees a full-stop character, the beginning of a new sentence).
The z gate (update gate) determines to what degree to use h˜ (the new input version of the hidden state) and to what degree to leave the hidden state the same as before.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Linear interpolation

def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)

Above is what the GRUCell code looks like, and our new model that uses a GRU is below:

class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

As a result, we can lower the loss to 1.36 (the RNNCell one was 1.54). In practice, GRU and LSTM are what people use.
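
The "linear interpolation" in the last line of the GRUCell code can be checked with a scalar sketch. The weights below are made-up numbers standing in for the weight matrices, biases are left out for brevity, and `gru_step` is a name invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU update for a scalar input x and a scalar hidden state h.

    `w` is a dict of hypothetical scalar weights (no biases).
    """
    r = sigmoid(w["ir"] * x + w["hr"] * h)           # reset gate, in (0, 1)
    z = sigmoid(w["iz"] * x + w["hz"] * h)           # update gate, in (0, 1)
    n = math.tanh(w["in"] * x + r * (w["hn"] * h))   # candidate state (h-tilde)
    # Same form as `newgate + inputgate * (hidden - newgate)` in the code above.
    return n + z * (h - n), (r, z, n)

w = {"ir": 0.5, "hr": -0.3, "iz": 0.8, "hz": 0.1, "in": 1.2, "hn": 0.7}
h_old = 0.5
h_new, (r, z, n) = gru_step(x=1.0, h=h_old, w=w)

# n + z * (h - n) is algebraically z * h + (1 - z) * n: a weighted blend
# of the old hidden state and the candidate state.
print(h_new, z * h_old + (1 - z) * n)
```

When z is close to 1 the hidden state is carried over almost unchanged, and when z is close to 0 it is replaced by the candidate.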

Putting it all together: Long Short-Term Memory [54:09]

LSTM has one more piece of state in it called “cell state” (not just hidden state), so if you do use a LSTM, you have to return a tuple of matrices in init_hidden (exactly the same size as hidden state):

from fastai import sgdr

n_hidden=512

class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

The code is identical to the GRU one. What was added is dropout (nn.LSTM applies it to the outputs of each layer except the last), and the hidden size was doubled, in the hope that the model will be able to learn more and stay resilient as it does so.

Callbacks (specifically SGDR) without Learner class [55:23]

m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

After creating a standard PyTorch model, we usually do something like opt = optim.Adam(m.parameters(), 1e-3). Instead, we will use fast.ai LayerOptimizer which takes an optimizer optim.Adam , our model m , learning rate 1e-2 , and optionally weight decay 1e-5 .
A key reason LayerOptimizer exists is to do differential learning rates and differential weight decay. The reason we need to use it is that all of the mechanics inside fast.ai assumes that you have one of these. If you want to use callbacks or SGDR in code you are not using the Learner class, you need to use this.
lo.opt returns the optimizer.

on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)

When we call fit, we can now pass the LayerOptimizer and also callbacks.
Here, we use the cosine annealing callback, which requires a LayerOptimizer object. It does cosine annealing by changing the learning rate inside the lo object.
Concept: Create a cosine annealing callback which is going to update the learning rates in the layer optimizer lo. The length of an epoch is equal to len(md.trn_dl): the number of mini-batches in an epoch is the length of the data loader. Since it is doing cosine annealing, it needs to know how often to reset. You can pass in cycle_mult in the usual way. We can even save our model automatically, just like we did with cycle_save_name in Learner.fit.
We can run a callback at the start of training, an epoch, or a batch, or at the end of training, an epoch, or a batch.
Callbacks have been used for CosAnneal (SGDR), decoupled weight decay (AdamW), the loss-over-time graph, etc.
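
To see what the schedule looks like, here is a small sketch of cosine annealing with restarts in plain Python. It is not fast.ai's CosAnneal; the function `cos_anneal_lrs` and its arguments are assumptions for this example. It also shows why 2**4-1 epochs is a natural choice with cycle_mult=2: cycles of 1, 2, 4 and 8 epochs add up to 15.

```python
import math

def cos_anneal_lrs(init_lr, nb, n_cycles, cycle_mult=1):
    """Return (lrs, cycle_ends): one learning rate per mini-batch, plus the
    iteration counts at which each cycle finishes.

    nb is the number of mini-batches per epoch (len(md.trn_dl) in the text).
    The first cycle lasts one epoch; each later one is cycle_mult times longer.
    Hypothetical helper for illustration only.
    """
    lrs, cycle_ends = [], []
    cycle_len = nb
    for _ in range(n_cycles):
        for it in range(cycle_len):
            # Half a cosine wave: starts at init_lr and decays towards zero.
            lrs.append(init_lr / 2 * (1 + math.cos(math.pi * it / cycle_len)))
        cycle_ends.append(len(lrs))   # this is where on_cycle_end would fire
        cycle_len *= cycle_mult
    return lrs, cycle_ends

lrs, ends = cos_anneal_lrs(1e-2, nb=10, n_cycles=4, cycle_mult=2)
epochs = len(lrs) // 10
print(epochs, ends)
```

At every entry of `ends` the learning rate jumps back up to its initial value, which is the "restart" in SGDR.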

Testing [59:55]

def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

print(get_next_n('for thos', 400))
for those the skemps), or imaginates, though they deceives. it should so each ourselvess and new present, step absolutely for the science." the contradity and measuring,  the whole!  293. perhaps, that every life a values of blood of intercourse when it senses there is unscrupulus, his very rights, and still impulse, love? just after that thereby how made with the way anything, and set for harmless philos

In lesson 6, when we were testing the CharRnn model, we noticed that it repeated itself over and over. torch.multinomial, used in this new version, deals with this problem. We take p[-1] to get the final output (the triangle) and exp to convert log probabilities to probabilities. We then use the torch.multinomial function, which gives us a sample using the given probabilities. If the probabilities are [0, 1, 0, 0] and we ask it for a sample, it will always return the second item. If they are [0.5, 0, 0.5], it will give the first item 50% of the time and the third item 50% of the time (review of multinomial distribution).
To play around with training character-based language models like this, try running get_next_n at different levels of loss to get a sense of what it looks like. The example above is at 1.25, but at 1.3 it looks like total junk.
When you are playing around with NLP, particularly generative model like this, and the results are kind of okay but not great, do not be disheartened because that means you are actually very VERY nearly there!
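
The sampling step can be mimicked with the standard library. This sketch uses random.choices in place of torch.multinomial, and `sample_next` is a name invented here. It takes log probabilities as input, just like the model output, and draws an index in proportion to the exponentiated values.

```python
import math
import random

def sample_next(log_probs, rng=random):
    """Sample one index in proportion to exp(log_probs)."""
    probs = [math.exp(lp) for lp in log_probs]   # log probability -> probability
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)   # fixed seed so the run is repeatable
neg_inf = float("-inf")  # log of probability zero

certain = [neg_inf, 0.0, neg_inf, neg_inf]        # probabilities [0, 1, 0, 0]
split = [math.log(0.5), neg_inf, math.log(0.5)]   # probabilities [0.5, 0, 0.5]

seen = {sample_next(certain, rng) for _ in range(100)}
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_next(split, rng)] += 1

print(seen)     # only index 1, the second item, is ever drawn
print(counts)   # the middle item is never drawn
```

Always taking the most likely character instead of sampling is what made the earlier model loop on the same phrase.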

Back to computer vision: CIFAR 10 [1:01:58]

CIFAR 10 is an old and well known dataset in academia — well before ImageNet, there was CIFAR 10. It is small both in terms of number of images and size of images which makes it interesting and challenging. You will likely be working with thousands of images rather than one and a half million images. Also a lot of the things we are looking at like in medical imaging, we are looking at a specific area where there is a lung nodule, you are probably looking at 32 by 32 pixels at most.

It also runs quickly, so it is much better for testing out your algorithms. As Ali Rahimi mentioned at NIPS 2017, and Jeremy shares the concern, many people are not doing carefully tuned and thought-about experiments in deep learning; instead, they throw lots of GPUs and TPUs or lots of data at a problem and call it a day. It is important to test many versions of your algorithm on a dataset like CIFAR 10 rather than on ImageNet, which takes weeks. MNIST is also good for studies and experiments, even though people tend to complain about it.

CIFAR 10 data in image format is available here

from fastai.conv_learner import *
PATH = "data/cifar10/"
os.makedirs(PATH,exist_ok=True)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
stats = (np.array([ 0.4914 ,  0.48216,  0.44653]), np.array([ 0.24703,  0.24349,  0.26159]))

def get_data(sz,bs):
    tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlipXY()], pad=sz//8)
    return ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)

bs=256

classes: image labels
stats: When we use pre-trained models, we can call tfms_from_model, which creates the necessary transforms to convert our dataset into a normalized dataset based on the means and standard deviations of each channel in the data the original model was trained on. Since we are training a model from scratch, we need to tell it the mean and standard deviation of our own data to normalize it. Make sure you can calculate the mean and the standard deviation for each channel.
tfms: For CIFAR 10 data augmentation, people typically do a horizontal flip, add black padding around the edge, and randomly select a 32 by 32 area within the padded image.
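
As an exercise in calculating the per-channel statistics, here is a sketch on a tiny made-up "dataset" using only the standard library. The data layout (a list of images, each a list of three flat channels with values in [0, 1]) and the helper names are assumptions for this example; with real images you would do the same thing with array operations.

```python
from statistics import mean, pstdev

def channel_stats(images):
    """Return (means, stds) with one entry per channel, pooled over all images."""
    n_channels = len(images[0])
    means, stds = [], []
    for c in range(n_channels):
        pixels = [p for img in images for p in img[c]]   # every pixel of channel c
        means.append(mean(pixels))
        stds.append(pstdev(pixels))                      # population standard deviation
    return means, stds

def normalize(img, means, stds):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [[(p - m) / s for p in ch] for ch, m, s in zip(img, means, stds)]

# Two toy "images", three channels each, two pixels per channel (made-up values).
toy = [[[0.0, 1.0], [0.2, 0.4], [0.5, 0.5]],
       [[1.0, 0.0], [0.6, 0.8], [0.1, 0.9]]]
means, stds = channel_stats(toy)
first_normalized = normalize(toy[0], means, stds)
print(means, stds)
```

The stats tuple in the notebook has the same shape as (means, stds) here: three means followed by three standard deviations.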

data = get_data(32,bs)

lr=1e-2

From this notebook by our student Kerem Turgutlu:

class SimpleNet(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(layers[i], layers[i + 1]) for i in range(len(layers) - 1)])
        
    def forward(self, x):
        x = x.view(x.size(0), -1)
        for l in self.layers:
            l_x = l(x)
            x = F.relu(l_x)
        return F.log_softmax(l_x, dim=-1)

nn.ModuleList — whenever you create a list of layers in PyTorch, you have to wrap it in ModuleList to register these as attributes.

learn = ConvLearner.from_model_data(SimpleNet([32*32*3, 40,10]), data)

Now we step up one level of API higher: rather than calling the fit function, we create a learn object from a custom model. ConvLearner.from_model_data takes a standard PyTorch model and a model data object.

learn, [o.numel() for o in learn.model.parameters()]
(SimpleNet(
   (layers): ModuleList(
     (0): Linear(in_features=3072, out_features=40)
     (1): Linear(in_features=40, out_features=10)
   )
 ), [122880, 40, 400, 10])

learn.summary()
OrderedDict([('Linear-1',
              OrderedDict([('input_shape', [-1, 3072]),
                           ('output_shape', [-1, 40]),
                           ('trainable', True),
                           ('nb_params', 122920)])),
             ('Linear-2',
              OrderedDict([('input_shape', [-1, 40]),
                           ('output_shape', [-1, 10]),
                           ('trainable', True),
                           ('nb_params', 410)]))])

learn.lr_find()
learn.sched.plot()

%time learn.fit(lr, 2)
[ 0.       1.7658   1.64148  0.42129]
[ 1.       1.68074  1.57897  0.44131]

CPU times: user 1min 11s, sys: 32.3 s, total: 1min 44s
Wall time: 55.1 s

%time learn.fit(lr, 2, cycle_len=1)
[ 0.       1.60857  1.51711  0.46631]
[ 1.       1.59361  1.50341  0.46924]

CPU times: user 1min 12s, sys: 31.8 s, total: 1min 44s
Wall time: 55.3 s

With a simple one-hidden-layer model with about 123,000 parameters (122,880 of them in the first weight matrix), we achieved 46.9% accuracy. Let’s improve this and gradually build up to a basic ResNet architecture.
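
The parameter counts reported above follow directly from the layer sizes. A quick arithmetic check in plain Python (the helper `linear_params` is a name made up for this example):

```python
def linear_params(n_in, n_out):
    """(weights, biases) of a fully connected layer with n_in inputs and n_out outputs."""
    return n_in * n_out, n_out

sizes = [32 * 32 * 3, 40, 10]   # flattened image, hidden layer, classes
per_layer = [linear_params(a, b) for a, b in zip(sizes, sizes[1:])]
total = sum(w + b for w, b in per_layer)

print(per_layer)   # matches [122880, 40, 400, 10] from learn.model.parameters()
print(total)
```

The first layer accounts for 122,920 of the parameters (weights plus biases) and the second for 410, which agrees with the nb_params values in learn.summary().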

CNN [01:12:30]

Let’s replace the fully connected model with a convolutional model. A fully connected layer is simply doing a dot product. That is why the weight matrix is big (3072 inputs * 40 = 122880). We are not using the parameters very efficiently, because every single pixel in the input has a different weight. What we want instead is to look at groups of 3 by 3 pixels that have particular patterns to them (i.e. convolution).
We will use a filter with three by three kernel. When there are multiple filters, the output will have additional dimension.

class ConvNet(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)
            for i in range(len(layers) - 1)])
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.out = nn.Linear(layers[-1], c)
        
    def forward(self, x):
        for l in self.layers: x = F.relu(l(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

Replace nn.Linear with nn.Conv2d
First two parameters are exactly the same as nn.Linear — the number of features coming in, and the number of features coming out
kernel_size=3 , the size of the filter
stride=2 will use every other 3 by 3 area which will halve the output resolution in each dimension (i.e. it has the same effect as 2 by 2 max-pooling)

learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data)

learn.summary()
OrderedDict([('Conv2d-1',
              OrderedDict([('input_shape', [-1, 3, 32, 32]),
                           ('output_shape', [-1, 20, 15, 15]),
                           ('trainable', True),
                           ('nb_params', 560)])),
             ('Conv2d-2',
              OrderedDict([('input_shape', [-1, 20, 15, 15]),
                           ('output_shape', [-1, 40, 7, 7]),
                           ('trainable', True),
                           ('nb_params', 7240)])),
             ('Conv2d-3',
              OrderedDict([('input_shape', [-1, 40, 7, 7]),
                           ('output_shape', [-1, 80, 3, 3]),
                           ('trainable', True),
                           ('nb_params', 28880)])),
             ('AdaptiveMaxPool2d-4',
              OrderedDict([('input_shape', [-1, 80, 3, 3]),
                           ('output_shape', [-1, 80, 1, 1]),
                           ('nb_params', 0)])),
             ('Linear-5',
              OrderedDict([('input_shape', [-1, 80]),
                           ('output_shape', [-1, 10]),
                           ('trainable', True),
                           ('nb_params', 810)]))])

ConvNet([3, 20, 40, 80], 10): It starts with 3 RGB channels, then 20, 40, 80 features, then 10 classes to predict.
AdaptiveMaxPool2d: This followed by a linear layer is how you get from 3 by 3 down to a prediction of one of 10 classes, and it is now standard for state-of-the-art algorithms. In the very last layer, we do a special kind of max-pooling for which you specify the output activation resolution rather than how big an area to pool. In other words, here a 1 by 1 adaptive max-pool is equivalent to a 3 by 3 max-pool.
x = x.view(x.size(0), -1): after pooling, x has a shape of batch size by # of features by 1 by 1, so this removes the last two axes.
This model is called “fully convolutional network” — where every layer is convolutional except for the very last.
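
The shapes and parameter counts in the summary above can be reproduced with a few lines of arithmetic. This is a sketch; `conv_out` and `conv_params` are names made up for this example, and the size formula is the usual one for a convolution without dilation.

```python
def conv_out(n, kernel=3, stride=2, padding=0):
    """Output width/height of a convolution applied to an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_params(c_in, c_out, kernel=3):
    """One kernel x kernel filter per (input, output) channel pair, plus one bias per output."""
    return c_in * c_out * kernel * kernel + c_out

channels = [3, 20, 40, 80]
size, shapes, params = 32, [], []
for c_in, c_out in zip(channels, channels[1:]):
    size = conv_out(size)                  # stride 2, no padding
    shapes.append((c_out, size, size))
    params.append(conv_params(c_in, c_out))
params.append(80 * 10 + 10)                # the final nn.Linear(80, 10)

print(shapes)
print(params, sum(params))
```

The resolution goes 32, 15, 7, 3, and the per-layer counts agree with nb_params in learn.summary().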

learn.lr_find(end_lr=100)
learn.sched.plot()

The default final learning rate lr_find tries is 10. If the loss is still getting better at that point, you can override it by specifying end_lr.

%time learn.fit(1e-1, 2)
[ 0.       1.72594  1.63399  0.41338]
[ 1.       1.51599  1.49687  0.45723]                       

CPU times: user 1min 14s, sys: 32.3 s, total: 1min 46s
Wall time: 56.5 s

%time learn.fit(1e-1, 4, cycle_len=1)
[ 0.       1.36734  1.28901  0.53418]
[ 1.       1.28854  1.21991  0.56143]                       
[ 2.       1.22854  1.15514  0.58398]                       
[ 3.       1.17904  1.12523  0.59922]                       

CPU times: user 2min 21s, sys: 1min 3s, total: 3min 24s
Wall time: 1min 46s

Accuracy flattened out around 60%. That is a good result considering the model uses about 37,000 parameters (the sum of the nb_params values in the summary above), compared to 47% with 122k parameters for the previous model.
Time per epoch is about the same since both architectures are simple and most of the time is spent doing memory transfer.

Refactored [01:21:57]

Simplify the forward function by creating ConvLayer (our first custom layer!). In PyTorch, a layer definition and a neural network definition are identical. Any time you have a layer, you can use it as a neural net, and any time you have a neural net, you can use it as a layer.

class ConvLayer(nn.Module):
    def __init__(self, ni, nf):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)
        
    def forward(self, x): return F.relu(self.conv(x))

padding=1: when you do a 3 by 3 convolution without padding, the image shrinks by 1 pixel on each side. So with stride 2 it does not go from 32 by 32 to 16 by 16 but actually to 15 by 15. Padding adds a border so we can keep the edge pixel information. It is not as big a deal for a big image, but when it’s down to 4 by 4, you really don’t want to throw away a whole piece.
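The 15 versus 16 arithmetic follows from the standard convolution output-size formula, floor((n + 2·padding − kernel_size) / stride) + 1. Here is a small sketch of it; conv_out_size is a hypothetical helper written for this note, not part of PyTorch or fast.ai.

```python
def conv_out_size(n, kernel_size=3, stride=2, padding=0):
    """Width/height produced by a convolution over an n-by-n input."""
    return (n + 2 * padding - kernel_size) // stride + 1


no_pad = conv_out_size(32, padding=0)     # the unpadded ConvNet case
with_pad = conv_out_size(32, padding=1)   # the ConvLayer case

# Follow a 32x32 image through three stride-2 layers, without and with padding.
sizes_no_pad, sizes_pad = [32], [32]
for _ in range(3):
    sizes_no_pad.append(conv_out_size(sizes_no_pad[-1], padding=0))
    sizes_pad.append(conv_out_size(sizes_pad[-1], padding=1))

print(no_pad, with_pad)   # 15 16
print(sizes_no_pad)       # [32, 15, 7, 3]
print(sizes_pad)          # [32, 16, 8, 4]
```

The unpadded sequence matches the 15, 7, 3 output shapes in the earlier model summary, while padding=1 gives a clean halving at every layer.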

class ConvNet2(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([ConvLayer(layers[i], layers[i + 1])
            for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)
        
    def forward(self, x):
        for l in self.layers: x = l(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

Another difference from the last model is that nn.AdaptiveMaxPool2d does not have any state (i.e. no weights). So we can just call it as a function F.adaptive_max_pool2d .
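As a quick check of both points, the sketch below uses a made-up random tensor in place of the real 80 by 3 by 3 activations and compares the module form, the functional form, and an ordinary 3 by 3 max-pool. The tensor shape and variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 80, 3, 3)   # stand-in for a batch of final conv activations

as_module = nn.AdaptiveMaxPool2d(1)(x)          # layer object
as_function = F.adaptive_max_pool2d(x, 1)       # stateless function call
fixed_window = F.max_pool2d(x, kernel_size=3)   # ordinary 3x3 max-pool

# The pooling module holds no learnable weights.
n_weights = len(list(nn.AdaptiveMaxPool2d(1).parameters()))

# Drop the trailing 1x1 dimensions before the linear layer.
flat = as_function.view(as_function.size(0), -1)

print(as_module.shape, flat.shape, n_weights)
```

All three pooled results hold the same values here because the input is exactly 3 by 3; with a different input size only the adaptive version would still produce a 1 by 1 output.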

BatchNorm [1:25:10]

With the last model, when we tried to add more layers, we had trouble training. If we used larger learning rates, the loss would go off to NaN, and if we used smaller learning rates, training would take forever and never get a chance to explore properly. In other words, it was not resilient.
To make it resilient, we will use something called batch normalization. BatchNorm came out about two years ago and it has been quite transformative, since it suddenly makes it really easy to train deeper networks.
We can simply use nn.BatchNorm2d, but to learn about it, we will write it from scratch.
On average, the weight matrices are unlikely to leave your activations at exactly the same scale: they tend to make them keep getting smaller and smaller or bigger and bigger. It is important to keep them at a reasonable scale. So we start things off at zero mean and standard deviation one by normalizing the input. What we really want is to do this for all layers, not just the inputs.

class BnLayer(nn.Module):
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, 
                              stride=stride, bias=False, padding=1)
        self.a = nn.Parameter(torch.zeros(nf,1,1))
        self.m = nn.Parameter(torch.ones(nf,1,1))
        
    def forward(self, x):
        x = F.relu(self.conv(x))
        x_chan = x.transpose(0,1).contiguous().view(x.size(1), -1)
        if self.training:
            self.means = x_chan.mean(1)[:,None,None]
            self.stds  = x_chan.std (1)[:,None,None]
        return (x-self.means) / self.stds *self.m + self.a

Calculate the mean of each channel or each filter and standard deviation of each channel or each filter. Then subtract the means and divide by the standard deviations.
We no longer need to normalize our input because it is normalizing it per channel or for later layers it is normalizing per filter.
It turns out this is not enough since SGD is bloody-minded [01:29:20]. If SGD decides that it wants the matrix to be bigger or smaller overall, doing (x-self.means) / self.stds is not enough because SGD will undo it and try to do it again in the next mini-batch. So we add two parameters for each channel: a, the adder (initial value zeros), and m, the multiplier (initial value ones).
Parameter tells PyTorch that it is allowed to learn these as weights.
Why does this work? If SGD wants to scale the layer up, it does not have to scale up every single value in the matrix. It can just scale up this small set of numbers, self.m (one per channel). If it wants to shift it all up or down a bit, it does not have to shift the entire weight matrix; it can just shift the matching set of numbers, self.a. Intuition: we normalize the data and then say you can shift and scale it using far fewer parameters than would be necessary if you had to shift and scale the entire set of convolutional filters. In practice, it allows us to increase our learning rates, it increases the resilience of training, and it allows us to add more layers and still train effectively.
The other thing batch norm does is regularize; in other words, you can often decrease or remove dropout or weight decay. The reason is that each mini-batch has a different mean and a different standard deviation from the previous mini-batch. So they keep changing, which subtly changes the meaning of the filters and acts as noise (i.e. regularization).
The real version does not rely only on this batch’s mean and standard deviation but also keeps an exponentially weighted moving average of them; in PyTorch’s nn.BatchNorm2d those running averages are what gets used in evaluation mode.
if self.training: this is important because when you are going through the validation set, you do not want to be changing the meaning of the model. Some types of layer are sensitive to the mode of the network, that is, whether it is in training mode or evaluation/test mode. There was a bug when we implemented the mini net for MovieLens where dropout was applied during validation, which has since been fixed. In PyTorch, there are two such layers: dropout and batch norm. nn.Dropout already does the check.
[01:37:01] The key difference in fast.ai, which no other library has, concerns when these means and standard deviations get updated. In every other library they are updated as soon as you say you are training, regardless of whether that layer is set to trainable or not. With a pre-trained network, that is a terrible idea. If you have a network pre-trained for specific values of those means and standard deviations in batch norm, changing them changes the meaning of those pre-trained layers. In fast.ai, by default, it will not touch those means and standard deviations if your layer is frozen. As soon as you un-freeze it, it will start updating them unless you set learn.bn_freeze=True. In practice, this often seems to work a lot better for pre-trained models, particularly if you are working with data that is quite similar to what the pre-trained model was trained with.
Where do you put the batch-norm layer? We will talk more about it in a moment, but for now, after relu.
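To connect the hand-written layer to the built-in one, here is a minimal sketch comparing per-channel normalization against nn.BatchNorm2d. It is not the BnLayer from the notes: to match the built-in layer it assumes the biased variance and the default eps=1e-5 and momentum=0.1, whereas BnLayer uses std() with no eps. The input tensor is random, made-up data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5) * 3 + 2   # 4 channels, deliberately not zero-mean / unit-std

# Per-channel statistics over the batch, height and width dimensions.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

bn = nn.BatchNorm2d(4)   # weight (the multiplier) starts at ones, bias (the adder) at zeros
bn.train()
with torch.no_grad():
    builtin = bn(x)      # training mode: normalizes with this batch's statistics
    running_after_train = bn.running_mean.clone()   # moving average, initialized at 0

    bn.eval()
    bn(x)                # evaluation mode: uses the running averages, leaves them untouched
    running_after_eval = bn.running_mean.clone()

print((manual - builtin).abs().max())
print(running_after_train, mean.flatten())
```

After a single training step the running mean has moved one tenth of the way from zero toward the batch mean, which is the exponentially weighted moving average at work, and the evaluation pass does not change it.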

Ablation Study [01:39:41]

An ablation study is where you try turning different pieces of your model on and off to see which bits have which impact. One of the things that wasn’t done in the original batch norm paper was any kind of effective ablation, and therefore one of the things that was missing was the question that was just asked: where to put the batch norm. That oversight caused a lot of problems because it turned out the original paper did not actually put it in the best spot. Other people have since figured that out, and when Jeremy shows people code where it is actually in the better spot, people say his batch norm is in the wrong spot.

Try to always use batch norm on every layer if you can.
Don’t stop normalizing your data, so that people using your data will know how you normalized it. Other libraries might not deal with batch norm for pre-trained models correctly, so when people start re-training, it might cause problems.

class ConvBnNet(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i + 1])
            for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)
        
    def forward(self, x):
        x = self.conv1(x)
        for l in self.layers: x = l(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

The rest of the code is similar, using BnLayer instead of ConvLayer.
A single convolutional layer was added at the start to get closer to modern approaches. It has a bigger kernel size and a stride of 1. The basic idea is that we want the first layer to have a richer input. It does convolution using a 5 by 5 area, which allows it to try to find more interesting, richer features in that 5 by 5 area, then spit out a bigger output (in this case, 10 filters of 5 by 5). Typically it is a 5 by 5 or 7 by 7, or even 11 by 11 convolution with quite a few filters coming out (e.g. 32 filters).
Since padding = (kernel_size - 1) / 2 and stride=1, the input size is the same as the output size, just with more filters.
It is a good way of trying to create a richer starting point.
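The padding rule can be checked directly on a made-up 32 by 32 input. This sketch assumes odd kernel sizes (so that (kernel_size - 1) / 2 is a whole number) and uses 10 output filters as in conv1 above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one fake RGB image at CIFAR-10 resolution

shapes = {}
with torch.no_grad():
    for k in (3, 5, 7, 11):
        # stride 1 with padding (k - 1) // 2 keeps height and width unchanged
        conv = nn.Conv2d(3, 10, kernel_size=k, stride=1, padding=(k - 1) // 2)
        shapes[k] = tuple(conv(x).shape)

print(shapes)
```

Every kernel size gives back a 32 by 32 grid; only the channel count changes from 3 to 10.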

Deep BatchNorm [01:50:52]

Let’s increase the depth of the model. We cannot just add more stride 2 layers since each one halves the size of the image. Instead, after each stride 2 layer, we insert a stride 1 layer.

class ConvBnNet2(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
            for i in range(len(layers) - 1)])
        self.layers2 = nn.ModuleList([BnLayer(layers[i+1], layers[i + 1], 1)
            for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)
        
    def forward(self, x):
        x = self.conv1(x)
        for l,l2 in zip(self.layers, self.layers2):
            x = l(x)
            x = l2(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

learn = ConvLearner.from_model_data(ConvBnNet2([10, 20, 40, 80, 160], 10), data)

%time learn.fit(1e-2, 2)
[ 0.       1.53499  1.43782  0.47588]
[ 1.       1.28867  1.22616  0.55537]                       

CPU times: user 1min 22s, sys: 34.5 s, total: 1min 56s
Wall time: 58.2 s

%time learn.fit(1e-2, 2, cycle_len=1)
[ 0.       1.10933  1.06439  0.61582]
[ 1.       1.04663  0.98608  0.64609]                       

CPU times: user 1min 21s, sys: 32.9 s, total: 1min 54s
Wall time: 57.6 s

The accuracy remained the same as before. This is now 12 layers deep, and it is too deep even for batch norm to handle. It is possible to train a 12-layer-deep conv net, but it starts to get difficult, and the extra depth does not seem to be helping much, if at all.

ResNet [01:52:43]

class ResnetLayer(BnLayer):
    def forward(self, x): return x + super().forward(x)

class Resnet(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
            for i in range(len(layers) - 1)])
        self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
            for i in range(len(layers) - 1)])
        self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
            for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)
        
    def forward(self, x):
        x = self.conv1(x)
        for l,l2,l3 in zip(self.layers, self.layers2, self.layers3):
            x = l3(l2(l(x)))
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

ResnetLayer inherits from BnLayer and overrides forward.
Then add a bunch of layers to make it 3 times deeper, and it still trains beautifully just because of x + super().forward(x).

learn = ConvLearner.from_model_data(Resnet([10, 20, 40, 80, 160], 10), data)
wd=1e-5

%time learn.fit(1e-2, 2, wds=wd)
[ 0.       1.58191  1.40258  0.49131]
[ 1.       1.33134  1.21739  0.55625]                       

CPU times: user 1min 27s, sys: 34.3 s, total: 2min 1s
Wall time: 1min 3s

%time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd)
[ 0.       1.11534  1.05117  0.62549]
[ 1.       1.06272  0.97874  0.65185]                       
[ 2.       0.92913  0.90472  0.68154]                        
[ 3.       0.97932  0.94404  0.67227]                        
[ 4.       0.88057  0.84372  0.70654]                        
[ 5.       0.77817  0.77815  0.73018]                        
[ 6.       0.73235  0.76302  0.73633]                        

CPU times: user 5min 2s, sys: 1min 59s, total: 7min 1s
Wall time: 3min 39s

%time learn.fit(1e-2, 8, cycle_len=4, wds=wd)
[ 0.       0.8307   0.83635  0.7126 ]
[ 1.       0.74295  0.73682  0.74189]                        
[ 2.       0.66492  0.69554  0.75996]                        
[ 3.       0.62392  0.67166  0.7625 ]                        
[ 4.       0.73479  0.80425  0.72861]                        
[ 5.       0.65423  0.68876  0.76318]                        
[ 6.       0.58608  0.64105  0.77783]                        
[ 7.       0.55738  0.62641  0.78721]                        
[ 8.       0.66163  0.74154  0.7501 ]                        
[ 9.       0.59444  0.64253  0.78106]                        
[ 10.        0.53      0.61772   0.79385]                    
[ 11.        0.49747   0.65968   0.77832]                    
[ 12.        0.59463   0.67915   0.77422]                    
[ 13.        0.55023   0.65815   0.78106]                    
[ 14.        0.48959   0.59035   0.80273]                    
[ 15.        0.4459    0.61823   0.79336]                    
[ 16.        0.55848   0.64115   0.78018]                    
[ 17.        0.50268   0.61795   0.79541]                    
[ 18.        0.45084   0.57577   0.80654]                    
[ 19.        0.40726   0.5708    0.80947]                    
[ 20.        0.51177   0.66771   0.78232]                    
[ 21.        0.46516   0.6116    0.79932]                    
[ 22.        0.40966   0.56865   0.81172]                    
[ 23.        0.3852    0.58161   0.80967]                    
[ 24.        0.48268   0.59944   0.79551]                    
[ 25.        0.43282   0.56429   0.81182]                    
[ 26.        0.37634   0.54724   0.81797]                    
[ 27.        0.34953   0.54169   0.82129]                    
[ 28.        0.46053   0.58128   0.80342]                    
[ 29.        0.4041    0.55185   0.82295]                    
[ 30.        0.3599    0.53953   0.82861]                    
[ 31.        0.32937   0.55605   0.82227]                    

CPU times: user 22min 52s, sys: 8min 58s, total: 31min 51s
Wall time: 16min 38s

ResNet block [01:53:18]

return x + super().forward(x)

y = x + f(x)

where x is the prediction from the previous layer and y is the prediction from the current layer. Shuffling the formula around, we get:

f(x) = y − x

The difference y − x is the residual. The residual is the error in terms of what we have calculated so far. What this says is: try to find a set of convolutional weights that attempts to fill in the amount we were off by. In other words, we have an input, and we have a function which tries to predict the error (i.e. how much we are off by). Then we add a prediction of how much we were wrong by to the input, then add another prediction of how much we were wrong by that time, and repeat that layer after layer, zooming in on the correct answer. This is based on a theory called boosting.

The full ResNet does two convolutions before the result gets added back to the original input (we did just one here).
In every block x = l3(l2(l(x))), one of the layers is not a ResnetLayer but a standard convolution with stride=2. This is called a “bottleneck layer”. ResNet does not use just a convolutional layer here but a different form of bottleneck block, which we will cover in Part 2.
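For comparison with the one-convolution ResnetLayer, here is a rough sketch of a block that does two convolutions before adding the input back. TwoConvResBlock is a hypothetical name; the layout (conv, batch norm, relu, conv, batch norm) is an assumption chosen for illustration, and unlike torchvision’s BasicBlock it applies no relu after the addition, so that y = x + f(x) holds exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoConvResBlock(nn.Module):
    """Illustrative block computing y = x + f(x), with f made of two convolutions."""

    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, nf, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(nf)
        self.conv2 = nn.Conv2d(nf, nf, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(nf)

    def residual(self, x):
        # f(x): the part of the block that learns the correction
        out = F.relu(self.bn1(self.conv1(x)))
        return self.bn2(self.conv2(out))

    def forward(self, x):
        return x + self.residual(x)


torch.manual_seed(0)
block = TwoConvResBlock(16).eval()
x = torch.randn(2, 16, 8, 8)

with torch.no_grad():
    y = block(x)
    recovered_f = y - x                   # f(x) = y - x
    f_direct = block.residual(x)

    nn.init.zeros_(block.conv2.weight)    # make f(x) vanish
    y_identity = block(x)                 # the block now passes x through unchanged

print(y.shape, (recovered_f - f_direct).abs().max())
```

The last step hints at why such blocks are easy to train: when the residual branch outputs zero, the block is simply the identity, so extra layers start from "do nothing" rather than from scrambling their input.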

ResNet 2 [01:59:33]

Here, we increased the number of features and added dropout.

class Resnet2(nn.Module):
    def __init__(self, layers, c, p=0.5):
        super().__init__()
        self.conv1 = BnLayer(3, 16, stride=1, kernel_size=7)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i+1])
            for i in range(len(layers) - 1)])
        self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
            for i in range(len(layers) - 1)])
        self.layers3 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1)
            for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)
        self.drop = nn.Dropout(p)
        
    def forward(self, x):
        x = self.conv1(x)
        for l,l2,l3 in zip(self.layers, self.layers2, self.layers3):
            x = l3(l2(l(x)))
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        x = self.drop(x)
        return F.log_softmax(self.out(x), dim=-1)

learn = ConvLearner.from_model_data(Resnet2([16, 32, 64, 128, 256], 10, 0.2), data)
wd=1e-6

%time learn.fit(1e-2, 2, wds=wd)
%time learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2, wds=wd)
%time learn.fit(1e-2, 8, cycle_len=4, wds=wd)

log_preds,y = learn.TTA()
preds = np.mean(np.exp(log_preds),0)

metrics.log_loss(y,preds), accuracy(preds,y)
(0.44507397166057938, 0.84909999999999997)

85% was state of the art back in 2012 or 2013 for CIFAR 10. Nowadays, it is up to 97%, so there is room for improvement, but it is all based on these techniques:

Better approaches to data augmentation
Better approaches to regularization
Some tweaks on ResNet

Question [02:01:07]: Can we apply the “training on the residual” approach to non-image problems? Yes! But it has been ignored everywhere else. In NLP, the “transformer architecture” recently appeared and was shown to be the state of the art for translation, and it has a simple ResNet structure in it. This general approach is called a “skip connection” (i.e. the idea of skipping over a layer) and appears a lot in computer vision, but nobody else much seems to be using it even though there is nothing computer-vision-specific about it. Good opportunity!

Dogs vs. Cats [02:02:03]

Going back to dogs and cats. We will create resnet34 (if you are interested in what the trailing number means, see here; it is just different parameters).

PATH = "data/dogscats/"
sz = 224
arch = resnet34  # <-- Name of the function 
bs = 64

m = arch(pretrained=True) # Get a model w/ pre-trained weight loaded
m

ResNet(
  (conv1): Conv2d (3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1))
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d (64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (downsample): Sequential(
        (0): Conv2d (64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    )
    (2): BasicBlock(
      (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    )
    (3): BasicBlock(
      (conv1): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (relu): ReLU(inplace)
      (conv2): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    )
  )
  ...
  (avgpool): AvgPool2d(kernel_size=7, stride=7, padding=0, ceil_mode=False, count_include_pad=True)
  (fc): Linear(in_features=512, out_features=1000)
)

Our ResNet model had Relu → BatchNorm. TorchVision does BatchNorm → Relu. There are three different versions of ResNet floating around, and the best one is PreAct (https://arxiv.org/pdf/1603.05027.pdf).

Currently, the final layer has a thousand outputs because ImageNet has 1000 classes, so we need to get rid of it.
When you use fast.ai’s ConvLearner, it deletes the last two layers for you. fast.ai replaces AvgPool2d with Adaptive Average Pooling and Adaptive Max Pooling and concatenates the two together.
For this exercise, we will do a simple version.

m = nn.Sequential(*children(m)[:-2], 
                  nn.Conv2d(512, 2, 3, padding=1), 
                  nn.AdaptiveAvgPool2d(1), Flatten(), 
                  nn.LogSoftmax())

Remove the last two layers
Add a convolution which just has 2 outputs.
Do average pooling then softmax
There is no linear layer at the end. This is a different way of producing just two numbers — which allows us to do CAM!
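To see what this head computes without downloading a pre-trained model, the sketch below feeds it a random tensor standing in for the 512 by 7 by 7 output of the resnet34 body on a 224 by 224 image. Using nn.Flatten in place of fast.ai’s Flatten() and passing dim=1 to nn.LogSoftmax are assumptions made so the example runs on plain PyTorch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
body_out = torch.randn(4, 512, 7, 7)   # made-up stand-in for the ResNet body output

conv = nn.Conv2d(512, 2, 3, padding=1)          # one 7x7 map per class
head = nn.Sequential(conv,
                     nn.AdaptiveAvgPool2d(1),   # average each map down to one number
                     nn.Flatten(),
                     nn.LogSoftmax(dim=1))

with torch.no_grad():
    log_preds = head(body_out)                  # shape (4, 2)
    class_maps = conv(body_out)                 # shape (4, 2, 7, 7)
    scores = class_maps.mean(dim=(2, 3))        # per-class average over the 7x7 grid
    expected = torch.log_softmax(scores, dim=1)

print(log_preds.shape, class_maps.shape)
```

Because each class score is just the mean of a 7 by 7 map, every cell of that map says how much its region of the image contributed to the class, which is what makes the heat map possible.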

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)

learn = ConvLearner.from_model_data(m, data)

learn.freeze_to(-4)

learn.fit(0.01, 1)
learn.fit(0.01, 1, cycle_len=1)

ConvLearner.from_model_data is what we learned about earlier: it allows us to create a Learner object with a custom model.
Then freeze the layers except the ones we just added.

Class Activation Maps (CAM) [02:08:55]

We pick a specific image, and use a technique called CAM where we take a model and we ask it which parts of the image turned out to be important.

How did it do this? Let’s work backwards. The way it did it was by producing this matrix:

Big numbers correspond to the cat. So what is this matrix? It is simply the feature matrix feat multiplied by the py vector:

f2=np.dot(np.rollaxis(feat,0,3), py)
f2-=f2.min()
f2/=f2.max()
f2

The py vector is the prediction, which says “I am 100% confident it’s a cat.” feat is the values (2×7×7) coming out of the final convolutional layer (the Conv2d layer we added). If we multiply feat by py, we get all of the first channel and none of the second channel. Therefore, it returns the value of the last convolutional layer for the section which lines up with being a cat. Conversely, if we multiply feat by [0, 1], it will line up with being a dog.

sf = SaveFeatures(m[-4])
py = m(Variable(x.cuda()))
sf.remove()

py = np.exp(to_np(py)[0]); py
array([ 1.,  0.], dtype=float32)

feat = np.maximum(0, sf.features[0])
feat.shape

Put another way, in the model, the only thing that happened after the convolutional layer was an average pooling layer. The average pooling layer took the 7 by 7 grid and averaged out how much each part is “cat-like”. We then took the “cattyness” matrix, resized it to be the same size as the original cat image, and overlaid it on top to get the heat map.
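The shape bookkeeping in the f2 computation is easier to follow on a toy example. The sketch below uses random, made-up activations in place of the real feat, and a py of [1, 0] as in the output shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = np.maximum(0, rng.normal(size=(2, 7, 7)))   # made-up (classes, height, width) activations
py = np.array([1.0, 0.0])                          # "100% cat" prediction

moved = np.rollaxis(feat, 0, 3)   # move the class axis last: (7, 7, 2)
raw = np.dot(moved, py)           # weight each class map by its probability: (7, 7)

# Rescale to the 0..1 range so it can be drawn as a heat map.
f2 = raw - raw.min()
f2 = f2 / f2.max()

print(moved.shape, raw.shape)
print(np.allclose(raw, feat[0]))  # with py = [1, 0] this is just the first class map
```

With a less one-sided prediction, raw would instead be a probability-weighted blend of the two class maps.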

The way you can use this technique at home is

when you have a large image, you can calculate this matrix on a quick small little convolutional net
zoom into the area that has the highest value
re-run it just on that part

We skipped over this quickly as we ran out of time, but we will learn more about these kinds of approaches in Part 2.

A “hook” is the mechanism that lets us ask the model to return the matrix. register_forward_hook asks PyTorch to run the given function every time it calculates a layer, sort of like a callback that happens every time it calculates a layer. In the following case, it saves the value of the particular layer we were interested in:

class SaveFeatures():
    features=None
    def __init__(self, m): 
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output): 
        self.features = to_np(output)
    def remove(self): self.hook.remove()
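Here is a self-contained sketch of the same idea on a tiny made-up model, so it runs without fast.ai. It stores the detached tensor instead of calling fast.ai’s to_np, and the three-layer model is an assumption for illustration only.

```python
import torch
import torch.nn as nn


class SaveFeatures:
    """Keeps the most recent output of the hooked module."""
    features = None

    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)

    def hook_fn(self, module, input, output):
        # Assumption: keep a detached tensor rather than converting with to_np.
        self.features = output.detach()

    def remove(self):
        self.hook.remove()


torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 2, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1),
                      nn.Flatten())

sf = SaveFeatures(model[0])            # watch the convolution's output
with torch.no_grad():
    model(torch.randn(1, 3, 7, 7))     # the hook fires during this forward pass
saved = sf.features.clone()

sf.remove()                            # unregister the hook
with torch.no_grad():
    model(torch.randn(1, 3, 7, 7))     # no hook any more, so nothing new is stored
still_same = torch.equal(saved, sf.features)

print(saved.shape, still_same)
```

Calling remove() matters in a notebook: a hook left registered keeps firing, and keeps holding on to activations, on every later forward pass.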

Questions to Jeremy [02:14:27]: “Your journey into Deep Learning” and “How to keep up with important research for practitioners”

“If you intend to come to Part 2, you are expected to master all the techniques we have learned in Part 1”. Here are some things you can do:

Watch each of the videos at least 3 times.
Make sure you can re-create the notebooks without watching the videos; maybe do so with different datasets to make it more interesting.
Keep an eye on the forum for recent papers, recent advances.
Be tenacious and keep working at it!