Machine Learning 1: Lesson 9

Hiromi Suenaga
Oct 1, 2018

My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Students’ work [0:00]

Welcome back to machine learning! I am really excited to be able to share some amazing stuff that University of San Francisco students have built or written about during the week. Quite a few things I’m going to show you have already spread around the internet quite a bit: lots of tweets and posts and all kinds of stuff happening.

Coloring With Random Forests

by Tyler White

He started out by asking: what if I create a synthetic dataset where the independent variables are the x and the y, and the dependent variable is color? Interestingly, he showed me an earlier version of this where he wasn’t using color. He was just putting the actual numbers in here.

And this thing wasn’t really working at all. As soon as he started using color, it started working really well. So I wanted to mention that one of the things that unfortunately we don’t teach you at USF is the theory of human perception; perhaps we should. Because actually when it comes to visualization, the most important thing to know is what the human eye or brain is good at perceiving. There is a whole area of academic study on this. And one of the things that we’re best at perceiving is differences in color. So that’s why, as soon as we look at this picture of the synthetic data he created, you can immediately see that there are four areas of lighter red color. What he did was, he said okay, what if we tried to create a machine learning model of this synthetic dataset, and specifically he created a tree. And the cool thing is that you can actually draw the tree. So after he created the tree, he did this all in matplotlib. Matplotlib is very flexible. He actually drew the tree boundaries, which is already a pretty neat trick: to be able to actually draw the tree.

Then he did something even cleverer which is he said okay, so what predictions does the tree make? Well, it’s the average of each of these areas and so to do that, we can actually draw the average color. That is actually pretty. Here are the predictions the tree makes. Now here is where it gets really interesting. You can, as you know, randomly generate trees through resampling and so here are four trees generated through resampling. They are all pretty similar but a little bit different.

So now we can actually visualize bagging and to visualize bagging, we literally take the average of the four pictures. That’s what bagging is. And there it is:

So here are the fuzzy decision boundaries of a random forest, and I think this is kind of amazing. Because I wish I had this when I started teaching you all random forests; I could have skipped a couple of classes. It’s just like “okay, that’s what we do”. We create the decision boundaries, we average each area, then we do it a few times and average all of them. So that’s what a random forest does, and I think this is just such a great example of making the complex easy through pictures. So congrats to Tyler for that. It turns out that he actually reinvented something that somebody else has already done. A guy called Christian Innie(?), who went on to be one of the world’s foremost machine learning researchers, included almost exactly this technique in a book he wrote about decision forests. So it’s kind of cool that Tyler ended up reinventing something that one of the world’s foremost authorities on decision forests had already created. So I thought that was neat. It’s nice because when we posted this on Twitter, it got a lot of attention, and finally somebody was able to say “oh, you know what, this actually already exists.” So Tyler has gone away and started reading that book.

Parfit — quick and powerful hyper-parameter optimization with visualizations [4:16]

by Jason Carpenter

Something else which is super cool is that Jason Carpenter created a whole new library called Parfit. Parfit is a parallelized fitting of multiple models for the purpose of selecting hyper-parameters. There’s a lot I really like about this. He’s shown a clear example of how to use it, and the API looks very similar to other grid search based approaches, but it uses the validation techniques that Rachel wrote about and that we learned about a couple weeks ago, i.e. using a good validation set. In the blog post that introduces it, he’s gone right back and said what are hyper-parameters, why do we have to tune them, and he’s explained every step. And then the module itself is very polished. He’s added documentation to it, he’s added a nice README to it. And it’s kind of interesting when you actually look at the code, you realize it’s very simple, which is definitely not a bad thing; it’s a good thing to make things simple. But by writing this little bit of code and then packaging it up so nicely, he’s made it really easy for other people to use this technique, which is great.

How to make SGD Classifier perform as well as Logistic Regression using parfit [5:40]

by Vinay Patlolla

One of the things I’ve been really thrilled to see is that Vinay went along and combined two things from our class: one was to take Parfit, and the other was to take the accelerated SGD approach to classification we learned about in the last lesson, and combine the two to say “okay, let’s now use Parfit to help us find the parameters of an SGD logistic regression.” So I think that’s really a great idea.

Intuitive Interpretation of Random Forest [6:14]

by Prince Grover

Something else which I thought was terrific is that Prince basically went through and summarized pretty much all the stuff we learnt about random forest interpretation, and then went even further as he described each of the different approaches to random forest interpretation. He described how it’s done, so here, for example, is feature importance through variable permutation, a little picture of each one, and then, super cool, here is the code to implement it from scratch. I think this is a really nice post describing something that not many people understand and showing exactly how it works, both with pictures and with code that implements it from scratch. So I think that’s really great. One of the things I really like here is that for the tree interpreter, he actually showed how you can take the tree interpreter output and feed it into the new waterfall chart package that Chris, a USF student, built, to show how you can actually visualize the contributions from the tree interpreter in a waterfall chart. So again, a nice combination of multiple pieces of technology we both learned about and built as a group.

Keras Model for Beginners (0.210 on LB)+EDA+R&D [7:37]

by Davesh Maheshwari

There have been a few interesting kernels shared, and I’ll share more next week. Davesh wrote this really nice kernel for a quite challenging Kaggle competition on detecting icebergs vs. ships. It’s kind of weird two-channel satellite data which is very hard to visualize, and he actually went through and described the formulas for how these radar scattering things actually work, then managed to come up with code that allowed him to recreate the actual 3D icebergs or ships. I have not seen that done before. It’s quite challenging to know how to visualize this data. Then he went on to show how to build a neural net to try to interpret this, so that was pretty fantastic as well.

SGD [9:53]

Notebook

Let’s go back to SGD. So we are going back through this notebook which Rachel put together, basically taking us through SGD from scratch for the purpose of digit recognition. Quite a lot of the stuff we look at today is going to closely follow part of the computational linear algebra course, which you can find as a MOOC on fast.ai, or at USF it’ll be an elective next year. So if you find some of this stuff interesting, and I hope you do, then please consider signing up for the elective or checking out the videos online.

So we are building neural networks. And we are starting with an assumption that we’ve downloaded the MNIST data and normalized it by subtracting the mean and dividing by the standard deviation. The data is slightly unusual in that although they represent images, they were downloaded with each image being a 784-long rank 1 tensor, so it’s been flattened out. For the purpose of drawing pictures of it, we have to resize it to 28 by 28. But the actual data we’ve got is not 28 by 28, it’s 784 long, flattened out.
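For concreteness, here is a minimal sketch of that normalization step, assuming x and x_valid are the flattened MNIST training and validation arrays (shape (n, 784)) loaded earlier in the notebook; the variable names are assumptions, not necessarily the notebook’s exact code:

import numpy as np

# Normalize with the training set's statistics (a sketch; x and x_valid assumed to exist)
mean, std = x.mean(), x.std()
x = (x - mean) / std
x_valid = (x_valid - mean) / std            # use the training mean and std for validation too

x_imgs = np.reshape(x_valid, (-1, 28, 28))  # reshaped only for plotting; the model still sees 784-long vectors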

The basic steps we’re going to take here are to start out with training the world’s simplest neural network, basically a logistic regression [11:43]. So no hidden layers. And we’re going to train it using a library, Fast AI, and we’re going to build the network using a library, PyTorch. Then we are going to gradually get rid of all the libraries. So first of all, we’ll get rid of the nn (neural net) library in PyTorch and write that ourselves. Then we’ll get rid of the Fast AI fit function and write that ourselves. And then we’ll get rid of the PyTorch optimizer and write that ourselves. So by the end of this notebook, we’ll have written all the pieces ourselves. The only things that we’ll end up relying on are the two key things that PyTorch gives us:

  • the ability to write Python code and have it run on the GPU
  • the ability to write Python code and have it automatically differentiated for us.

So they are the two things we are not going to attempt to write ourselves because it’s boring and pointless. But everything else, we’ll try and write ourselves on top of those two things.

Our starting point is not doing anything ourselves. It’s basically having it all done for us. So PyTorch has nn library which is where the neural net stuff lives. You can create a multi-layer neural network by using the Sequential function and then passing in a list of the layers that you want and we asked for a linear layer followed by a softmax layer and that defines our logistic regression.

from fastai.metrics import *
from fastai.model import *
from fastai.dataset import *

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28*28, 10),
    nn.LogSoftmax()
).cuda()

The input to our linear layer is 28 by 28 as we just discussed, the output is 10 because we want a probability for each of the numbers naught through 9 for each of our images. .cuda() sticks it on the GPU and then fit fits a model.

loss=nn.NLLLoss()
metrics=[accuracy]
opt=optim.Adam(net.parameters())
fit(net, md, n_epochs=5, crit=loss, opt=opt, metrics=metrics)

So we start out with a random set of weights, then fit uses gradient descent to make it better. We had to tell the fit function what criterion to use, in other words, what counts as better, and we told it to use negative log likelihood. We’ll learn exactly what that is in the next lesson. We had to tell it what optimizer to use and we said please use optim.Adam; the details of that we won’t cover in this course. We are going to build something simpler called SGD. If you are interested in Adam, we just covered that in the deep learning course. And for what metrics we want to print out, we decided to print out accuracy. So that was that. After we fit it, we get an accuracy of generally somewhere around 91, 92%.

Defining Module [14:47]

What we are going to do from here is, we are going to repeat this exact same thing, so we are going to rebuild this model 4 or 5 times, building it and fitting it with fewer and fewer libraries. So the second thing that we did last time was to try to start to define the module ourselves. So instead of saying the network is a sequential bunch of these layers, let’s not use that library at all and try to define it ourselves from scratch. To do that, we have to use OO because that’s how we build everything in PyTorch. And we have to create a class which inherits from nn.Module. So nn.Module is a PyTorch class that takes our class and turns it into a neural network module, which basically means anything that you inherit from nn.Module like this, you can pretty much insert into a neural network as a layer, or you can treat it as a neural network. It is going to get all the stuff that it needs automatically to work as a part of, or a full, neural network. We’ll talk about exactly what that means today and in the next lesson.

So we need to construct the object so that means we need to define the constructor dunder init. And importantly, this is a Python thing, if you inherit from some other object, then you have to create the thing you inherit from first. So when you say super().__init__() , that says construct the nn.Module piece of that first. If you don’t do that, then the nn.Module stuff never gets a chance to actually get constructed. So this is just like a standard Python OO subclass constructor. If any of that’s unclear to you, then you know this is where you definitely want to just grab a Python intro to OO because this is the standard approach.

def get_weights(*dims):
    return nn.Parameter(torch.randn(dims)/dims[0])

def softmax(x):
    return torch.exp(x)/(torch.exp(x).sum(dim=1)[:,None])

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = (x @ self.l1_w) + self.l1_b   # Linear layer
        x = torch.log(softmax(x))         # Non-linear (LogSoftmax) layer
        return x

So inside our constructor, we want to do the equivalent of nn.Linear [17:06]. What nn.Linear is doing is it’s taking our 28 by 28, so 784-long, vector, and that’s going to be the input to a matrix multiplication. So now we need to create something with 784 rows and 10 columns, because the input to this is going to be a mini batch of size 64 by 784. So we are going to do this matrix product. So when we say nn.Linear in PyTorch, it’s going to construct a 784 by 10 matrix for us. Since we are not using that, we are doing things from scratch, we need to make it ourselves. To make it ourselves, we can say generate normal random numbers with this dimensionality, torch.randn(dims), which we passed in here as 28*28, 10. So that gets us our randomly initialized matrix.

Then we want to add on to this. We don’t just want y = ax, we want y = ax + b. So we need to add on what we call in neural nets a bias vector. So we create here a bias vector of length 10, self.l1_b = get_weights(10), again randomly initialized, and so now we have our two randomly initialized weight tensors.

So that’s our constructor. Now we need to define forward. Why do we need to define forward? This is a PyTorch specific thing. What’s going to happen is when you create a module in PyTorch, the object that you get back behaves as if it’s a function. You can call it with parentheses, which we will do in a moment. And so you need to somehow define what happens when you call it as if it’s a function, and the answer is PyTorch calls a method called forward. That’s the approach PyTorch picked. So when it calls forward, we need to do our actual calculation of the output of this module or layer. So here is the thing that actually gets calculated in a logistic regression. Basically we take our input x which gets passed to forward (that’s how forward works, it gets passed the mini batch), and we matrix multiply it by the layer one weights which we defined in the constructor. Then we add on the layer one bias which we also defined in the constructor. Nowadays we can define this a little bit more elegantly using the Python 3 matrix multiplication operator, which is the @ sign.

When you use that, I think you kind of end up with something that looks closer to what the mathematical notation looked like and so I find that nicer.

Alright, so that’s our linear layer in our logistic regression (i.e. our zero hidden layer neural net). So then the next thing we do to that is softmax. We get the output of this matrix multiply which has the dimension 64 by 10. We get this matrix of outputs and we put it through a softmax. Why do we put it through a softmax? Because in the end, for every image, we want a probability that it is a 0, a 1, a 2, a 3, and so on. So we want a bunch of probabilities that add up to 1, where each of those probabilities is between 0 and 1. A softmax does exactly that for us.

For example, if we weren’t picking out digits from nought to 9, but instead were picking out cat, dog, plane, fish, or building, the output of that matrix multiply for one particular image might look like that (output column) [22:38]. These are just some random numbers. And to turn that into a softmax, I first go e to the power of each of those numbers. I sum up those e to the power of’s. Then I take each of those e to the power of’s and divide it by the sum. And that’s softmax. That’s the definition of softmax.
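Here is a small worked example of that calculation in numpy (the numbers are arbitrary, standing in for one row of raw outputs):

import numpy as np

out = np.array([1.0, -2.0, 3.0, 0.5, 2.0])   # hypothetical raw outputs for 5 classes
exp = np.exp(out)                            # e to the power of each output: always positive
probs = exp / exp.sum()                      # divide by the sum: each between 0 and 1
print(probs, probs.sum())                    # the probabilities, and their sum (1.0)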

Because it was e to the power of, it means that it’s always positive. Because it was divided by the sum, it means that it’s always between 0 and 1, and it also means they always add up to 1. Anytime we have a layer of outputs (which we call activations) and we then apply some nonlinear function that maps one scalar to one scalar, like softmax, we call that an activation function. So the softmax activation function takes our outputs and turns them into something which behaves like a probability. We don’t, strictly speaking, need it. We could still try and train something where the output directly is the probabilities. But by using this function that automatically makes them always behave like probabilities, it means there’s less for the network to learn, so it’s going to learn better. So generally speaking, whenever we design an architecture, we try to design it in a way where it’s as easy as possible for it to create something of the form we want. So that’s why we use softmax.

That’s the basic steps [24:40]. We have our input which is a bunch of images, it gets multiplied by a weight matrix, we also add on a bias to get an output of the linear function. We put it through a nonlinear activation function, in this case softmax, and that gives us our probabilities.

So there that all is. PyTorch also tends to use the log of softmax for reasons that don’t particularly need to bother us now. It’s basically a numerical stability convenience. So to make this the same as our version up here that you saw nn.LogSoftmax(), I’m going to use log here as well. Okay, so we can now instantiate this class (i.e. create an object of this class)[25:33].

Question: I have a question back on the probabilities from before. If we were to have a photo with a cat and a dog together, would that change the way that works? Or does it work in the same basic way [25:43]? That’s a great question. So if you had a photo with a cat and a dog together and you wanted it to spit out both cat and dog, this would be a very poor choice. Softmax is specifically the activation function we use for categorical predictions where we only ever want to predict one of those things. Part of the reason why is that, as you can see, because we are using e to the power of, e to the power of slightly bigger numbers creates much bigger numbers. As a result, we generally have just one or two things that are large and everything else is pretty small. So if I recalculate these random numbers (in the excel sheet), you’ll see it tends to be a bunch of zeros and one or two high numbers. So it’s really designed to make it easy to predict “this one thing is the thing I want.” If you are doing multi-label prediction, i.e. you want to find all the things in this image, rather than using softmax, we would instead use sigmoid. Sigmoid would cause each of these things to be between 0 and 1, but they would no longer add to 1.
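For comparison, here is a minimal sketch of that sigmoid alternative (the numbers are made up; this is not the lesson’s code):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes each value independently into (0, 1)

out = np.array([2.0, 1.5, -3.0])   # hypothetical raw outputs for cat, dog, plane
print(sigmoid(out))                # cat and dog can both be high; values need not sum to 1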

A lot of these details about best practices are things that we cover in the deep learning course, and we won’t cover heaps of them here in the machine learning course. We are more interested in the mechanics. But we try to cover them if they are quick.

So now that we’ve got that, we can instantiate an object of that class [27:30]. And of course we want to copy it over to the GPU so we can do computations over there. Again, we need an optimizer, which we will talk about shortly. But you see here, we’ve called a function on our class called parameters, yet we never defined a method called parameters, and the reason that works is because it was actually defined for us inside nn.Module. So nn.Module automatically goes through the attributes we’ve created and finds anything that we said is a parameter. The way you say something is a parameter is you wrap it in nn.Parameter. This is just the way that you tell PyTorch this is something that I want to optimize. So when we created the weight matrix, we just wrapped it with nn.Parameter; it’s exactly the same as a regular PyTorch variable, which we will learn about shortly. It’s just a little flag to say hey, you should optimize this. So when you call net2.parameters() on the net2 object we created, it goes through everything that we created in the constructor, checks to see if any of them are of type Parameter, and if so, it sets all of those as things that we want to train with the optimizer. We will be implementing the optimizer from scratch later.

net2 = LogReg().cuda()
opt=optim.Adam(net2.parameters())

Having done that, we can fit [28:51]. And we should get basically the same answer as before (i.e. 91 ish). So that looks good.

fit(net2, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)

Data Loader [29:05]

So what have we actually built here? Well, what we’ve actually built, as I said, is something that can behave like a regular function. So I want to show you how we can actually call this as a function. To be able to call it as a function, we need to be able to pass data to it. To be able to pass data to it, I’m going to need to grab a mini batch of MNIST images. For convenience, we used the ImageClassifierData.from_arrays method from Fast AI, and what that does is it creates a PyTorch DataLoader for us. A PyTorch DataLoader is something that grabs a few images, sticks them into a mini batch, and makes them available. And you can basically say give me another mini batch, give me another mini batch, give me another mini batch. In Python, we call these things generators. Generators are things where you can basically say I want another, I want another, I want another. There’s a very close connection between iterators and generators; I’m not going to worry about the difference between them right now. But you’ll see, basically, to get hold of something we can use to generate mini batches, we have to take our data loader, so you can ask for the training data loader from our model data object. You will see there are a bunch of different data loaders you can ask for: test data loader, train data loader, validation data loader, augmented images data loader, and so forth.

dl = iter(md.trn_dl)

So we’re going to grab the training data loader that was created for us. This is a standard PyTorch data loader, well, slightly optimized by us, but same idea. Then, and this (iter) is a standard Python thing, we can turn that into an iterator, i.e. something we can grab another item at a time from. Once you’ve done that, we’ve now got something that we can iterate through. You can use the standard Python next function to grab one more thing from that generator.

xmb,ymb = next(dl)

So that is returning the x’s from our mini-batch and the y’s from our mini-batch. The other way that you can use generators and iterators in Python is with a for loop. I could have also said for x mini-batch, y mini-batch in the data loader, and then do something, as in the sketch below:
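# A sketch of the for-loop form just mentioned; do_something is a hypothetical
# placeholder for whatever you want to do with each mini-batch.
for xmb, ymb in md.trn_dl:
    do_something(xmb, ymb)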

When you do that, behind the scenes it’s basically syntactic sugar for calling next lots of times. So this is all standard Python stuff.

So that returns a tensor of size 64 by 784, as we would expect. The Fast AI library we used defaults to a mini batch size of 64; that’s why there are 64 rows. These are all background pixels, but they are not actually zero. Why aren’t they zero? Because they are normalized: we subtracted the mean and divided by the standard deviation.

xmb
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
           ...            ⋱            ...
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245  ...  -0.4245 -0.4245 -0.4245
[torch.FloatTensor of size 64x784 (GPU 0)]

Now what we want to do is pass that into our logistic regression. So what we might do is we’ll go vxmb (variable x mini-batch): I take my x mini-batch and move it onto the GPU, because remember my net2 object is on the GPU, so our data for it also has to be on the GPU. Then the second thing I do is wrap it in Variable. So what does Variable do? This is how we get automatic differentiation for free. PyTorch can automatically differentiate pretty much any tensor. But to do so takes memory and time, so it’s not going to always keep track. To do automatic differentiation, it has to keep track of exactly how something was calculated. We added these things together, we multiplied it by that, we then took the sine, and so on. It has to know all of the steps because then, to do the automatic differentiation, it has to take the derivative of each step using the chain rule and multiply them all together. So that’s slow and memory intensive. So we have to opt in to saying “okay, this particular thing, we’re going to be taking the derivative of later, so please keep track of all of those operations for us.” The way we opt in is by wrapping a tensor in Variable. That’s how we do it.

And you’ll see that it looks almost exactly like a tensor but it now says “variable containing” this tensor. So in PyTorch a variable has exactly identical API to a tensor, or actually more specifically a superset of the API of a tensor. Anything we can do to a tensor, we can do to a variable. But it’s going to keep track of exactly what we did so we can later on take the derivative.

vxmb = Variable(xmb.cuda())
vxmb
Variable containing:
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
... ⋱ ...
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
-0.4245 -0.4245 -0.4245 ... -0.4245 -0.4245 -0.4245
[torch.cuda.FloatTensor of size 64x784 (GPU 0)]

So we can now pass that into our net2 object [34:37]. Remember I said you can treat this as if it’s a function. So notice we’re not calling .forward() we’re just treating it as a function. And then remember, we took the log so to undo that, I’m taking the .exp() and that will give me my probabilities. So there’s my probabilities and it returns something of size 64 by 10 so for each image in the mini batch, we’ve got 10 probabilities. And you’ll see, most probabilities are pretty close to zero. And a few of them are quite a bit bigger which is exactly what we would have hoped. It’s like okay, it’s not a zero, it’s not a 1, it’s not a 2, it is a 3, it’s not a 4 and so forth.


preds = net2(vxmb).exp(); preds[:3]
Variable containing:

Columns 0 to 5
1.6740e-03 1.0416e-05 2.5454e-05 1.9119e-02 6.5026e-05 9.7470e-01
3.4048e-02 1.8530e-04 6.6637e-01 3.5073e-02 1.5283e-01 6.4995e-05
3.0505e-08 4.3947e-08 1.0115e-05 2.0978e-04 9.9374e-01 6.3731e-05

Columns 6 to 9
2.1126e-06 1.7638e-04 3.9351e-03 2.9154e-04
1.1891e-03 3.2172e-02 1.4597e-02 6.3474e-02
8.9568e-06 9.7507e-06 7.8676e-04 5.1684e-03
[torch.cuda.FloatTensor of size 3x10 (GPU 0)]

We could call net2.forward(vxmb) and it will do exactly the same thing. But that’s not how all of the PyTorch mechanics actually work. They actually call it as if it’s a function. This is actually a really important idea because it means that when we define our own architectures or whatever, anywhere that you would put in a function, you could put in a layer; anywhere you put in a layer, you can put in a neural net; anywhere you put in a neural net, you can put in a function. Because as far as PyTorch is concerned, they are all just things that it’s going to call just like as if they are functions. So they are all interchangeable and this is really important because that’s how we create really good neural nets is by mixing and matching lots of pieces and putting them all together.

Let me give you an example [36:54]. Here is my logistic regression which got 91 and a bit percent accuracy. I’m now going to turn it into a neural network with one hidden layer.

And the way I’m going to do that is I’m going to create one more layer. I’m going to change this so it spits out a hundred rather than 10, which means this one’s input is going to be a hundred rather than 10. Now this, as it is, can’t possibly make things any better at all yet. Why is this definitely not going to be better than what I had before? Because a combination of two linear layers is just the same as one linear layer, but with different parameters.

So we’ve got two linear layers which is just a linear layer. To make things interesting, I’m going to replace all the negatives from the first layer with zeros. Because that’s a nonlinear transformation, and that nonlinear transformation is called a rectified linear unit (ReLU).

So nn.Sequential simply is going to call each of these layers in turn for each mini batch. So do a linear layer, replace all of the negatives with zero, do another linear layer, and do a softmax. This is now a neural network with one hidden layer. So let’s try training that instead. The accuracy has now gone up to 96%.
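Here is a sketch of that one-hidden-layer network; the layer sizes follow the discussion above, though the exact code may differ slightly from the notebook:

net = nn.Sequential(
    nn.Linear(28*28, 100),   # first linear layer now spits out 100 activations
    nn.ReLU(),               # replace all negatives with zero: the nonlinearity
    nn.Linear(100, 10),      # second linear layer maps 100 activations to 10 classes
    nn.LogSoftmax()
).cuda()

opt = optim.Adam(net.parameters())
fit(net, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)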

So the idea is that the basic techniques we are learning in this lesson become powerful at the point where you start stacking them together.

Question: Why did you pick 100 [38:55]? No reason. It was easier to type an extra zero. This question of how many activations you should have in a neural network layer is part of the skill of a deep learning practitioner; we cover it in the deep learning course and not in this course.

Question: When adding the additional layer, does it matter if you would have done two softmax’s or is that something you cannot do? You can absolutely use the softmax there. But it’s probably not going to give you what you want. The reason why is that a softmax tends to push most of its activations to zero. An activation, just to be clear as I’ve had a lot of questions in deep learning course about what is an activation, an activation is the value that is calculated in a layer. So this is an activation:

It’s not a weight. A weight is not an activation. It’s the value that you calculate from a layer. So softmax will tend to make most of its activations pretty close to zero and that’s the opposite of what you want. You generally want your activations to be as rich and diverse and used as possible. So nothing to stop you doing it, but it probably won’t work very well. Basically pretty much all of your layers will be followed by nonlinear activation functions that will nearly always be ReLU except for the last layer.

Question: When doing multiple layers, so let’s say 2 or 3, do you want to switch up these activation layers [40:41]? No. So if I wanted to go deeper, I would just do that.

That’s now a two hidden layer network.
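For reference, a sketch of what that deeper Sequential might look like (sizes are illustrative, not necessarily the notebook’s):

nn.Sequential(
    nn.Linear(28*28, 100),
    nn.ReLU(),
    nn.Linear(100, 100),   # the extra hidden layer
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.LogSoftmax()
).cuda()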

Question: So I think I heard you say that there are a couple of different activation functions like that rectified linear unit. What are some examples and why would you use each [41:09]? Yes, great question. So basically as you add more linear layers, your input comes in and you put it through a linear layer and then a nonlinear layer, linear layer, nonlinear layer, linear layer, and the final nonlinear layer. For the final nonlinear layer, as we’ve discussed, if it’s a multi-category classification where you only ever pick one of them, you would use softmax. If it’s a binary classification or a multi-label classification where you are predicting multiple things, you would use sigmoid. If it’s a regression, you would often have nothing at all, although we learnt in last night’s DL course that sometimes you can use sigmoid there as well. So those are basically the main options for the final layer. For the hidden layers, you pretty much always use ReLU, but there is another one you can pick which is kind of interesting called leaky ReLU. Basically if it’s above zero, it’s y = x, and if it’s below zero, it’s y = 0.1x. So it’s very similar to ReLU, but rather than being equal to 0 below zero, it’s something close to that. So they are the main two: ReLU and Leaky ReLU.

There are various others, but they are kind of like things that just look very close to that. For example, there’s something called ELU which is quite popular, but the details don’t matter too much honestly. Like ELU is something like ReLU but slightly more curvy in the middle. It’s not generally something that you so much pick based on the dataset. It’s more like over time we just find better activation functions. So two or three years ago, everybody used ReLU. A year ago, pretty much everybody used Leaky ReLU. Today, I guess probably most people are starting to move towards ELU. But honestly the choice of activation function doesn’t matter terribly much actually. People have actually showed that you can use pretty arbitrary nonlinear activation functions, like even a sine wave, and it still works.
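To make those shapes concrete, here is a minimal numpy sketch of ReLU and leaky ReLU as element-wise functions (my own illustration, not the lesson’s code):

import numpy as np

def relu(x):
    return np.maximum(x, 0)               # y = x above zero, y = 0 below

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)  # y = x above zero, y = 0.1x below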

So although what we are going to do today is showing how to create this network with no hidden layers, to turn it into that network (below) which is 96% ish accurate will be trivial [44:35]. It’s something you should probably try and do during the week to create this version.

So now that we’ve got something where we can take our network, pass in our variable, and get back some predictions, that’s basically all that happened when we called fit [45:11]. So we are going to see how that approach can be used to create stochastic gradient descent. One thing to note is that to turn the predicted probabilities into a prediction of which digit it is, we need to use argmax. Unfortunately, PyTorch doesn’t call it argmax. Instead, PyTorch just calls it max, and max returns two things: it returns the actual max across the given axis (so max(1) will return the max across the columns), and the second thing it returns is the index of that maximum. So the equivalent of argmax is to call max and then grab the thing at index 1 of what it returns:
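# A sketch of the argmax-via-max idea, assuming vxmb is the wrapped mini-batch
# from above (not necessarily the notebook's exact line):
preds = net2(vxmb)
pred_digits = preds.data.max(1)[1]   # max over dim 1 returns (values, indices); [1] is the indices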

So there’s our predictions. If this was numpy, we would instead use np.argmax().

preds = predict(net2, md.val_dl).argmax(1)
plots(x_imgs[:8], titles=preds[:8])

So here are the predictions from our hand created logistic regression and in this case, looks like we’ve got all but one correct [46:25].

The next thing we are going to try and get rid of in terms of using libraries is we will try to avoid using the matrix multiplication operator. Instead we are going to try and write that by hand.

Broadcasting [46:58]

So this next part, we are going to learn about something which kind of seems like a minor little programming idea. But actually it’s going to turn out that, at least in my opinion, it’s the most important programming concept that we will teach in this course, and it’s possibly the most important programming concept in all the things you need to build machine learning algorithms. It’s the idea of broadcasting, and I will show the idea by example.

If we create an array of 10, 6, -4 and an array of 2, 8, 7 and then add the two together, it adds each of the components of those two arrays in turn — we call that “element-wise”.

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

a + b
array([12, 14,  3])

In other words, we didn’t have to write a loop. Back in the old days, we would have to have looped through each one and added them, and then concatenated them together. We don’t have to do that today. It happens for us automatically. So in numpy, we automatically get element-wise operations. We can do the same thing with PyTorch [48:17]. In Fast AI, we just add a little capital T to turn something into a PyTorch tensor. And if we add those together, exactly the same thing.
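Something like this (a sketch; the exact output formatting may differ):

T(a) + T(b)
#  12
#  14
#   3
# [torch.LongTensor of size 3]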

So element-wise operations are pretty standard in these kind of libraries. It’s interesting not just because we don’t have to write the for loop, but it’s actually much more interesting because of the performance things that are happening here.

Performance [48:49]

The first is that if we were doing a for loop, it would happen in Python. Even when you use PyTorch, it still does the for loop in Python. It has no way of optimizing a for loop. And a for loop in Python is something like 10,000 times slower than in C (I can’t remember if it’s 1,000 or 10,000). So that’s your first problem.

The second problem, then, is that you don’t just want it to be optimized in C; you want C to take advantage of something that all of your CPUs do, called SIMD, Single Instruction Multiple Data. Your CPU is capable of taking 8 things at a time in a vector and adding them to another vector with 8 things in it, in a single CPU instruction. So if you can take advantage of SIMD, you are immediately 8 times faster. It depends on how big the data type is; it might be 4, might be 8.

The other thing you’ve got in your computer is multiple processors (multiple cores). So if the vector addition is happening in one core, you’ve probably got about 4 of those. So if you’re using SIMD, you are 8 times faster; if you can also use multiple cores, then you are 32 times faster. Then if you are doing that in C, you might be something like 32k times faster.

So the nice thing is when we do a + b , it’s taking advantage of all of these things.

Better still, if you do it in PyTorch and your data was created with .cuda() to stick it on the GPU, then your GPU can do about 10,000 things at a time [50:40]. So that’ll be another hundred times faster than C. So this is critical to getting good performance. You have to learn how to write loop-less code by taking advantage of these element-wise operations. And it’s a lot more than just plus (+). I can also use less than (<), and that’s going to return 0, 1, 1.

Or if we go back to numpy, False, True, True.

So you can use this to do all kinds of things without looping. So for example, I could now multiply that by a and here are all of the values of a as long as they are less than b:
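# A sketch of that, with the expected result as a comment:
(a < b) * a
# array([ 0,  6, -4])   (values of a kept only where a < b)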

Or we could take the mean:

(a < b).mean()
0.66666666666666663

This is the percentage of values in a that are less than b. So there’s a lot of stuff you can do with this simple idea.

Taking it further [52:04]

But to take it further, to take it further than just this element-wise operation, we are going to have to go to the next step to something called broadcasting. Let’s start by looking at an example of broadcasting [52:43].

a
array([10,  6, -4])

a is an array with one dimension, also known as a rank 1 tensor, also known as a vector. We can say a greater than zero:

a > 0
array([ True,  True, False], dtype=bool)

So here, we have a rank 1 tensor (a) and a rank 0 tensor (0). A rank 0 tensor is also called a scalar, and rank 1 tensor is also called a vector. And we’ve got an operation between the two:

Now you’ve probably done that a thousand times without even noticing it’s kind of weird. You’ve got these things of different ranks and different sizes. So what is it actually doing? What it’s actually doing is taking that scalar, copying it 3 times (i.e. [0, 0, 0]), and then going element-wise and giving us back the three answers. That’s called broadcasting. Broadcasting means copying one or more axes of my tensor to allow it to be the same shape as the other tensor. It doesn’t really copy it though. What it actually does is store an internal indicator that says pretend this is a vector of three zeros, but rather than moving to the next element, it goes back to where it came from. If you are interested in learning about this specifically, they set the stride on that axis to be zero. That’s a minor advanced concept for those who are curious.
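If you want to see that for yourself, here is a tiny numpy check (my own aside, not from the lesson):

import numpy as np

# Broadcasting a single value out to length 3 sets the stride on that axis to 0,
# so no data is actually duplicated.
np.broadcast_to(np.array([0]), (3,)).strides   # -> (0,)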

So we could do a + 1 [54:52]. It’s going to broadcast the scalar 1 to be [1, 1, 1] and then do element wise addition.

a + 1
array([11,  7, -3])

We could do the same with a matrix. Here is our matrix.

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); m
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

2 times that matrix is going to broadcast 2 to be [[2, 2, 2], [2,2,2], [2,2,2]], and then do element-wise multiplication. So that’s our most simple version of broadcasting.

2*m
array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

Broadcasting a vector to a matrix [55:27]

Here is a slightly more complex version of broadcasting. Here is an array called c. This is a rank 1 tensor.

c = np.array([10, 20, 30]); c
array([10, 20, 30])

And here is our matrix m from before — rank 2 tensor. We can add m + c. So what’s going on here?

m + c
array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

You can see that what it’s done is to add [10, 20, 30] to each row.

So we can figure that it seems to have done the same kind of thing as broadcasting a scalar: it’s made copies of it, and then it treats those as if they’re a rank 2 matrix. And now we can do element-wise addition.

Question: By looking at this example, it copies it down making new rows. How would we want to do it if we wanted to get new columns [56:50]? I’m so glad you asked. So instead, we would do this:

And now treat that as our matrix. To get numpy to do that, we need to not pass in a vector but to pass in a matrix with one column (i.e. a rank 2 tensor). Basically, it turns out that numpy is going to treat a rank 1 tensor, for these purposes, as if it were a rank 2 tensor representing a row; in other words, as if it were 1 by 3. So we want to create a tensor which is 3 by 1. There are a couple of ways to do that. One is to use np.expand_dims(c,1); the second argument says “please insert a length 1 axis here.” So in our case, we want to turn it into a 3 by 1, so if we say expand_dims(c,1) it changes the shape to (3, 1). And if we look at what that looks like, it looks like a column.

np.expand_dims(c,1).shape
(3, 1)

np.expand_dims(c,1)
array([[10],
       [20],
       [30]])

So if we now go that plus m, you can see it’s doing exactly what we hoped it would do which is to add 10, 20, 30 to the column [58:50]:

m + np.expand_dims(c,1)
array([[11, 12, 13],
       [24, 25, 26],
       [37, 38, 39]])

Now because the location of a unit axis turns out to be so important, it’s really helpful to experiment with creating these extra unit axes and know how to do it easily. np.expand_dims isn’t in my opinion the easiest way to do this. The easiest way is to index into the tensor with a special index None. What None does is it creates a new axis in that location of length 1. So this is going to add a new axis at the start of length one.

c[None]
array([[10, 20, 30]])

c[None].shape
(1, 3)

This is going to add a new axis at the end of length one.

c[:,None]
array([[10],
       [20],
       [30]])

c[:,None].shape
(3, 1)

Or why not do both

c[None,:,None].shape
(1, 3, 1)

So if you think about it, a tensor which has 3 things in it could be of any rank you like; you can just add unit axes all over the place. That way, we can decide how we want our broadcasting to work. So there is a pretty convenient thing in numpy called broadcast_to, and what that does is it takes our vector and broadcasts it to that shape and shows us what that would look like.

np.broadcast_to(c, (3,3))
array([[10, 20, 30],
       [10, 20, 30],
       [10, 20, 30]])

So if you are ever unsure of what’s going on in some broadcasting operation, you can say broadcast_to. So for example here, we could say rather than (3,3), we could say m.shape and see exactly what’s going to happen.

np.broadcast_to(c, m.shape)
array([[10, 20, 30],
       [10, 20, 30],
       [10, 20, 30]])

And that’s what’s going to happen before we add it to m. So if we said turn it into a column, that’s what that looks like:

np.broadcast_to(c[:,None], m.shape)
array([[10, 10, 10],
       [20, 20, 20],
       [30, 30, 30]])

So that’s kind of the intuitive definition of broadcasting. And so now hopefully we can go back to that numpy documentation and understand what it means.

From the Numpy Documentation [1:01:37]:

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array (lower rank tensor) is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

“Vectorizing” generally means using SIMD and stuff like that so that multiple things happen at the same time. It doesn’t actually make needless copies of data, it just acts as if it had. So there is our definition.

Now in deep learning, you very often deal with tensors of rank 4 or more, and you very often combine them with tensors of rank 1 or 2, and trying to rely on intuition alone to do that correctly is nearly impossible. So you really need to know the rules.

Here is m.shape and c.shape [1:02:45]. So the rules are that we are going to compare the shapes of our two tensors element-wise. We are going to look at one dimension at a time, and we are going to start at the end and work towards the front. Two dimensions are going to be compatible when one of two things is true. So let’s check if our m and c are compatible. We start at the end (trailing dimensions first) and check “are they compatible?” They are compatible if the dimensions are equal. So these ones are equal, so they are compatible. Let’s go to the next one. Uh-oh, c is missing something. What happens if something is missing is we insert a 1. That’s the rule. So let’s now check: are these compatible? One of them is 1, so yes, they are compatible. So now you can see why it is that numpy treats a one-dimensional array as if it is a rank 2 tensor representing a row. It’s because we are basically inserting a 1 at the front. So that’s the rule.

When operating on two arrays, Numpy/PyTorch compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

  • they are equal, or
  • one of them is 1

Arrays do not need to have the same number of dimensions. For example, if you have a 256*256*3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:

Image  (3d array): 256 x 256 x 3
Scale (1d array): 3
Result (3d array): 256 x 256 x 3

For example, above is something that you very commonly have to do: you start with an image, 256 pixels by 256 pixels by 3 channels [1:04:11], and you want to subtract the mean of each channel. So you’ve got 256 by 256 by 3 and you want to subtract something of length 3. Can you do that? Absolutely. 3 and 3 are compatible because they are the same; 256 and empty is compatible because it’s going to insert a 1; 256 and empty is compatible because it’s going to insert a 1. So the mean of each channel is going to be broadcast over the second-from-the-right axis, then that whole thing will be broadcast over the left-most axis, and so we end up with an effective 256 by 256 by 3 tensor.
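Here is a minimal numpy sketch of that example (the image is just random numbers standing in for a real one):

import numpy as np

image = np.random.rand(256, 256, 3)
channel_means = image.mean(axis=(0, 1))   # shape (3,): one mean per channel
centered = image - channel_means          # (256, 256, 3) minus (3,) broadcasts over both spatial axes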

So interestingly [1:05:15], very few people in the data science or machine learning communities understand broadcasting, and the vast majority of the time, for example, when I see people doing pre-processing for computer vision like subtracting the mean, they write loops over the channels. I think it’s so handy not to have to do that, and it’s often so much faster not to have to do that. So if you get good at broadcasting, you’ll have this super useful skill that very very few people have. And it’s an ancient skill. It goes all the way back to the days of APL. APL was from the late 50’s and stands for A Programming Language. Kenneth Iverson wrote a paper called “Notation as a Tool of Thought” in which he proposed a new math notation. He proposed that if we use this new math notation, it gives us new tools for thought and allows us to think things we couldn’t before. One of his ideas was broadcasting, not as a computer programming tool but as a piece of math notation. So he ended up implementing this notation as a tool for thought as a programming language called APL. And his son has gone on to further develop that into a piece of software called J, which is basically what you get when you put 60 years of very smart people working on this idea. With this programming language, you can express very complex mathematical ideas often with just a line of code or two. So it’s great that we have J, but it’s even greater that these ideas have found their way into the languages we all use, like the numpy and PyTorch libraries in Python. These are not just little niche ideas; they’re fundamental ways to think about math and to do programming.

Let me give an example of this kind of notation as a tool for thought [1:07:33]. Here we’ve got c:

Here we’ve got c[None]:

Notice this is now two square brackets. So this is kind of like a one row vector tensor. Here it is a little column:

So what is this going to do?

So to think of this in terms of those broadcasting rules [1:09:13], we are basically taking this column, which is of dimension (3, 1), and this row, which is of dimension (1, 3). To make these compatible with our broadcasting rules, the column has to be duplicated 3 times because it needs to match 3, and the row has to be duplicated 3 times to match 3. So now I’ve got two matrices to do an element-wise product of.

So as you say, there is our outer product.

Now the interesting thing here is that suddenly now that this is not a special mathematical case but just a specific version of the general idea of broadcasting, we can do like an outer plus:

Or outer greater than:
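Here is a sketch of those three operations, using the same c as above (my own illustration; the outer product result is shown as a comment):

c[:,None] * c[None]   # outer product
# array([[100, 200, 300],
#        [200, 400, 600],
#        [300, 600, 900]])

c[:,None] + c[None]   # "outer plus"
c[:,None] > c[None]   # "outer greater than"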

Or whatever. So suddenly we’ve got this concept that we can use to build new ideas, and then we can start to experiment with those new ideas.

Interestingly, numpy actually uses this sometimes [1:10:52]. For example, if you want to create a grid, this is how numpy does it:

It actually returns 0, 1, 2, 3, 4; one as a column, one as a row. So we could say okay that’s x grid (xg) comma y grid (yg) and now you could do something like that:
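# A sketch of that grid idea; I'm assuming np.ogrid here, which returns exactly
# that open column/row pair (the precise expression in the notebook may differ):
xg, yg = np.ogrid[0:5, 0:5]   # xg is a (5, 1) column 0..4, yg is a (1, 5) row 0..4
xg + yg / 10                  # broadcasting expands this into a 5x5 grid of coordinates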

So suddenly we’ve expanded that out into a grid. So it’s kind of interesting how some of these simple little concepts get built on and built on and built on. If you use something like APL or J, it’s this whole environment of layers and layers of this. We don’t have such a deep environment in numpy, but you can certainly see this idea of broadcasting coming through in simple things like how we create a grid in numpy.

Implementing matrix multiplication [1:12:30]

So that’s broadcasting, and what we can do with this now is use it to implement matrix multiplication ourselves. Now why would we want to do that? Well, obviously we don’t. Matrix multiplication has already been handled perfectly nicely for us by our libraries. But very often you’ll find, in all kinds of areas of machine learning and particularly in deep learning, that there’ll be particular types of linear function that you want to compute that aren’t quite done for you. For example, there are whole areas called tensor regression and tensor decomposition which are being developed a lot at the moment, and they are about how we take higher rank tensors and turn them into combinations of rows, columns, and faces. It turns out that when you do this, you can basically deal with really high dimensional data structures with not much memory and not much computation time. For example, there is a really terrific library called TensorLy which does a whole lot of this kind of stuff for you. So it’s a really, really important area; it covers all of deep learning and lots of modern machine learning in general. So even though you’re not going to need to define matrix multiplication yourself, you are very likely to want to define some other slightly different tensor product. So it’s really useful to understand how to do that.

Let’s go back and look at our 2D array and 1D array, rank 2 tensor and rank 1 tensor [1:14:27].

m, c
(array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]), array([10, 20, 30]))

Remember, we can do a matrix multiplication using the @ sign or the old way, np.matmul. What that’s actually doing when we do that is we are basically saying 1*10 + 2*20 + 3*30 = 140, so we do that for each row and we can go through and do the same thing for the next one, and so on to get our result.

m @ c  # np.matmul(m, c)
array([140, 320, 500])

You could do that in PyTorch as well

T(m) @ T(c)
 140
 320
 500
[torch.LongTensor of size 3]

But this (m * c) is not matrix multiplication. What’s that? Element-wise with broadcasting. But notice, the numbers it has created [10, 40, 90] are the exact three numbers that I needed to calculate when I did that first piece of my matrix multiplication (1*10 + 2*20 + 3*30).

m * c
array([[ 10,  40,  90],
       [ 40, 100, 180],
       [ 70, 160, 270]])

So in other words, if we sum this over the columns which is axis equals 1, we get our matrix vector product:

(m * c).sum(axis=1)
array([140, 320, 500])

So we can do this stuff without special help from our library. So now let’s expand this out to a matrix matrix product.

http://matrixmultiplication.xyz/

So matrix matrix product looks like this. This is a great site called matrixmultiplication.xyz and it shows us this is what happens when we multiply two matrices. That’s what matrix multiplication is, operationally speaking. So in other words, what we just did there was we first of all took the first column with the first row to get this 15:

Then we took the second column with the first row to get 27:

So we are basically doing the thing we just did, the matrix vector product, we are just doing it twice: once with this column (left), once with that column (right), and then we concatenate the two together. So we can now go ahead and do that like so (n here is the second matrix):

(m * n[:,0]).sum(axis=1)
array([140, 320, 500])

(m * n[:,1]).sum(axis=1)
array([ 25, 130, 235])

So there are the two columns of our matrix multiplication.

I didn’t want to make our code too messy so I’m not going to actually use that, but we have it there now if we want to. We don’t need to use torch or numpy matrix multiplication anymore. We’ve got our own that we can use, using nothing but element wise operations, broadcasting, and sum.
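To make that concrete, here is a minimal sketch of a full matrix-matrix product written with nothing but broadcasting and sum; this is my own illustration rather than the notebook’s code, and n is a hypothetical second matrix:

import numpy as np

def matmul(a, b):
    # a: (rows, k), b: (k, cols) -> (rows, cols)
    # a[:, :, None] is (rows, k, 1); b[None] is (1, k, cols); broadcasting multiplies
    # them into (rows, k, cols), and summing over the middle (k) axis gives (rows, cols).
    return (a[:, :, None] * b[None]).sum(axis=1)

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
n = np.array([[10, 40], [20, 0], [30, -5]])   # hypothetical second matrix
assert np.allclose(matmul(m, n), m @ n)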

This is our logistic regression from scratch class again [1:18:37]. I just copied it here.

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = x @ self.l1_w + self.l1_b
        return torch.log(softmax(x))

Here is where we instantiate the object, copy to the GPU. We create an optimizer which we will learn about in a moment. And we call fit.

net2 = LogReg().cuda()
opt=optim.Adam(net2.parameters())

fit(net2, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)
[ 0. 0.31102 0.28004 0.92406]

Writing Our Own Training Loop [1:18:53]

So the goal is to now repeat this without needing to call fit. To do that, we are going to need a loop which grabs a mini batch of data at a time. And with each mini batch of data, we need to pass it to the optimizer and say “please try to come up with a slightly better set of predictions for this mini batch.”

As we learnt, in order to grab a mini batch of the training set at a time, we have to ask the model data object for the training data loader. We have to wrap it in iter to create an iterator or a generator. So that gives us our data loader. PyTorch calls this a data loader. We actually wrote our own Fast AI data loader, but it’s basically the same idea.

dl = iter(md.trn_dl) # Data loader

The next thing we do is grab the x and y tensors, the next mini batch from our data loader. We wrap x in a Variable to say we need to be able to take the derivative of the calculations using it, because if we can't take the derivative, then we can't get the gradients and can't update the weights. And we need to put it on the GPU because our module is on the GPU (net2 = LogReg().cuda()). We can now take that variable and pass it to the object we instantiated (i.e. our logistic regression). Remember, we can use our module as if it's a function because that's how PyTorch works. And that gives us a set of predictions, as we've seen before.

xt, yt = next(dl)
y_pred = net2(Variable(xt).cuda())

So now we can check the loss [1:20:41]. We defined the loss as a negative log likelihood loss object. We will learn about how that's calculated in the next lesson; for now, think of it just like root mean squared error, but for classification problems. We can call it, also, just like a function. So you can see this is a very general idea in PyTorch: treat everything, ideally, like it's a function. In this case, we have a negative log likelihood loss object and we treat it like a function. We pass in our predictions and we pass in our actuals. Again, the actuals need to be turned into a Variable and put on the GPU, because the loss is specifically the thing that we want to take the derivative of. So that gives us our loss, and there it is.

l = loss(y_pred, Variable(yt).cuda())
print(l)
Variable containing:
2.4352
[torch.cuda.FloatTensor of size 1 (GPU 0)]

That’s our loss 2.43. So it’s a variable and because it’s a variable, it knows how it was calculated. It knows it was calculated with this loss function (loss). It knows that the predictions were calculated with this network (net2). It knows that this network consisted of these operations:

So we can get the gradient automatically. To get the gradient, we call l.backward(). Remember, l is the thing that contains our loss. .backward() is something which is added to anything that's a variable; you call it and that says please calculate the gradients. So that calculates the gradients and stores them: for each of the weights/parameters that was used to calculate the loss, the gradient is now stored in .grad (we will see it later). We can then call optimizer.step(), and we are going to do this step manually shortly. That's the bit that says please make the weights a little bit better.
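As a rough illustration, after calling backward you could peek at the stored gradients like this (l1_w and l1_b are the parameter names from our LogReg class above; the exact sizes assume 28×28 inputs and 10 classes):

l.backward()                # fills in .grad for every parameter used to compute l

net2.l1_w.grad.size()       # should match the layer 1 weights: torch.Size([784, 10])
net2.l1_b.grad.size()       # should match the layer 1 bias: torch.Size([10])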

optimizer.step [1:22:49]

So what is optimizer.step() doing? If you had a really simple function like this, the optimizer says: okay, let's pick a starting point, calculate the value of the loss, and take the derivative, which tells us which way is down. So it tells us we need to go in that direction, and we take a small step.

And then we take the derivative again, we take a small step, and repeat until eventually we are taking such small steps that we stop.

So that's what gradient descent does. How big a step is a small step? Well, basically we take the derivative here; let's say the derivative there is 8. We multiply it by a small number, say 0.01, and that tells us what step size to take. That small number is called the learning rate, and it's the most important hyperparameter to set. If you pick too small a learning rate, your steps down are going to be tiny and it's going to take you forever. Too big a learning rate and you jump too far, then you jump too far again, and you will diverge rather than converge.
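Here is a tiny sketch of that idea on the simple function f(x) = x**2, whose derivative is 2x, just to make the "derivative times a small number" step concrete (the starting point and learning rates are made up for illustration):

def f_prime(x):
    return 2 * x                 # derivative of f(x) = x**2

x, lr = 4.0, 0.1                 # starting point and a sensible learning rate
for i in range(5):
    x = x - lr * f_prime(x)      # step a small amount in the downhill direction
    print(x)                     # 3.2, 2.56, 2.048, ... creeping towards the minimum at 0

x, lr = 4.0, 1.1                 # too big a learning rate
for i in range(5):
    x = x - lr * f_prime(x)
    print(x)                     # -4.8, 5.76, -6.912, ... overshooting further each time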

We are not going to talk about how to pick a learning rate in this class, but in the deep learning class, we actually show you a specific technique that very reliably picks a very good learning rate.

So that's basically what's happening. We calculate the derivatives, then we call the optimizer's step, which updates the weights based on the gradients and the learning rate.
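Conceptually, the update that optimizer.step() performs for plain gradient descent would look roughly like this (Adam, which we are actually using, does something smarter with the gradients, so treat this as a sketch of the idea rather than what Adam literally computes):

lr = 1e-2                           # learning rate (illustrative value)
for p in net2.parameters():         # every weight and bias in the model
    p.data -= lr * p.grad.data      # nudge each parameter downhill a little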

We should hopefully find that after doing that we have a better loss than we did before [1:25:03]. So I just re-ran this and got a loss here of 4.16.

After one step, it’s now 4.03.

So it worked the way we hoped it would based on this mini batch, it updated all of the weights in our network to be a little better than they were. As a result of which our loss went down.

Training loop [1:25:28]

So let’s turn that into a training loop. We are going to go through a hundred steps:

  • Grab one more mini batch of data from the data loader
  • Calculate our predictions from our network
  • Calculate our loss from the predictions and the actuals
  • Every 10 iterations, we'll print out the accuracy: just take the mean of whether the predictions equal the actuals or not.
  • One PyTorch-specific thing: you have to zero the gradients. Basically you can have networks with lots of different loss functions where you want to add all of the gradients together, so you have to tell PyTorch when to set the gradients back to zero. This line just says set all the gradients to zero.
  • Calculate the gradients (that's the call to backward)
  • Then take one step of the optimizer, so update the weights using the gradients and the learning rate
for t in range(100):
    xt, yt = next(dl)
    y_pred = net2(Variable(xt).cuda())
    l = loss(y_pred, Variable(yt).cuda())

    if t % 10 == 0:
        accuracy = np.mean(to_np(y_pred).argmax(axis=1) == to_np(yt))
        print("loss: ", l.data[0], "\t accuracy: ", accuracy)

    optimizer.zero_grad()
    l.backward()
    optimizer.step()

loss: 2.2104923725128174 accuracy: 0.234375
loss: 1.3094730377197266 accuracy: 0.625
loss: 1.0296542644500732 accuracy: 0.78125
loss: 0.8841525316238403 accuracy: 0.71875
loss: 0.6643403768539429 accuracy: 0.8125
loss: 0.5525785088539124 accuracy: 0.875
loss: 0.43296846747398376 accuracy: 0.890625
loss: 0.4388267695903778 accuracy: 0.90625
loss: 0.39874207973480225 accuracy: 0.890625
loss: 0.4848807752132416 accuracy: 0.875

Once we run it, you can see the loss goes down and the accuracy goes up. So that's the basic approach. Next lesson, we will see what optimizer.step() does; we'll look at it in detail. We are not going to look inside l.backward(); as I said, we are going to take the calculation of the derivative as given. But basically what's happening there is that in any kind of deep network, you have a function that's like a linear function, then you pass the output of that into another function that might be a ReLU, and you pass the output of that into another function that might be another linear layer, and so forth:

i( h( g( f(x) ) ) )

So these deep networks are just functions of functions of functions, so you could write them mathematically like that. All back prop does is say (let's just simplify this down to the depth-two version), okay:

g( f(x) )

u = f(x)

Therefore, by the chain rule, the derivative of g(f(x)) is:

g'(u) · f'(x)

So you can see, we can do the same thing for functions of functions of functions. When you apply a function to a function of a function, you can take the derivative just by taking the product of the derivatives of each of those layers. In neural networks, we call this back propagation. So when you hear back propagation, it just means: use the chain rule to calculate the derivatives.
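As a quick sanity check of that statement, here is a tiny made-up example with f(x) = x**2 and g(u) = 3u, so g(f(x)) = 3x² and its derivative should be g'(u) · f'(x) = 3 · 2x = 6x:

def f(x): return x ** 2
def g(u): return 3 * u

def f_prime(x): return 2 * x
def g_prime(u): return 3

x = 5.0
eps = 1e-6
numerical = (g(f(x + eps)) - g(f(x))) / eps   # finite difference approximation
chain_rule = g_prime(f(x)) * f_prime(x)       # g'(u) * f'(x) with u = f(x)

print(numerical, chain_rule)                  # both roughly 30.0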

So when you see a neural network defined like here:

If it’s defined sequentially, literally, all this means is apply this function to the input, apply this function to that, apply this function to that, etc. So this is just defining a composition of a function to a function to a function to a function. So although we are not going to bother with calculating the gradients ourselves, you can now see why it can do it as long as it internally knows what’s the derivative of to-the-power-of (^), what’s the derivative of sine, what’s the derivative of plus, and so forth. Then our Python code in here is just combining those things together. So it just needs to know how to compose them together with the chain rule and away it goes.
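For instance, a sequentially defined network is literally just that chain of function applications. A hedged sketch (the layer sizes here are made up; this is not the exact network shown in the lesson) might look like:

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28*28, 100),   # apply a linear function to the input...
    nn.ReLU(),               # ...then a ReLU to that...
    nn.Linear(100, 10),      # ...then another linear function to that...
    nn.LogSoftmax(dim=1)     # ...then log-softmax to get log probabilities
)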

So I think we can leave it there for now and in the next class, we’ll go and see how to write our own optimizer and then we’ll have solved MNIST from scratch ourselves. See you then!

Lessons: 123456789101112
