Machine Learning 1: Lesson 10

Hiromi Suenaga
53 min read · Oct 6, 2018


My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12

Fast AI on pip [0:00]

Welcome back to machine learning! Certainly the most exciting thing this week is that Fast AI is now on pip, so you can pip install fastai:

It’s probably still easiest just to do the conda env update, but a couple of places where it would be handy to pip install fastai instead are if you are working outside of the repo notebooks, since this gives you access to Fast AI everywhere. Also, they submitted a pull request to Kaggle to try to get it added to the Kaggle kernels, so hopefully you’ll be able to use it on Kaggle kernels soon. You can use it at your work or wherever else, so that’s exciting. I’m not gonna say it’s officially released yet. It’s still very early, obviously, and we are still adding (and you’re helping add) documentation and all that kind of stuff. But it’s great that that’s now there.

Kaggle Kernels [1:22]

A couple of cool kernels from USF students this week. I thought I’d highlight two that were both from the text normalization competition, which was about taking text written out as standard English (there’s also one for Russian) and trying to identify things like “first, second, third” and classify them: is this a number, a phone number, or whatever. I did a quick bit of searching and saw that there had been some attempts in academia to use deep learning for this, but they hadn’t managed to make much progress. I noticed Alvira’s kernel here which gets 0.992 on the leaderboard, which I think is top 20. It’s entirely heuristic and it’s a great example of feature engineering; in this case the whole thing is basically feature engineering. It’s basically looking through and using lots of regular expressions to figure out, for each token, what it is. And I think she’s done a great job here laying it all out clearly as to what all the different pieces are and how they all fit together. She mentioned that she’s maybe hoping to turn this into a library, which I think would be great. You could use this to grab a piece of text and pull out all the pieces in it. It’s the kind of thing that the natural language processing community hopes to be able to do without lots of handwritten code like this. But for now, I’ll be interested to see what the winners turn out to have done; I haven’t seen machine learning being used to do this particularly well. Perhaps the best approach is one that combines this kind of feature engineering with some machine learning. But I think this is a great example of effective feature engineering.

This is another USF student who has done much the same thing and got a similar score, but used her own different set of rules. Again, these would get you a good leaderboard position as well. So I thought it was interesting to see examples of some of our students entering a competition and getting top-20-ish results with basically just handwritten heuristics. This is where computer vision, for example, still was six years ago: the best approaches were a whole lot of carefully handwritten heuristics, often combined with some simple machine learning. So I think over time the field is definitely trying to move towards automating much more of this.

Porto Seguro’s Safe Driver Prediction Winner [4:21]

And actually, interestingly, the Safe Driver Prediction competition just finished. One of the Netflix prize winners won this competition, and he invented a new algorithm for dealing with structured data which basically doesn’t require any feature engineering at all. So he came in first place using nothing but five deep learning models and one gradient boosting machine. His basic approach was very similar to what we’ve been learning in this class so far, and what we’ll also be learning tomorrow, which is using fully connected neural networks and one hot encoding, and specifically embeddings, which we will learn about. But he had a very clever technique: there was a lot of data in this competition which was unlabeled. In other words, they didn’t know whether that driver would file a claim or not. When you’ve got some labeled and some unlabeled data, we call that semi-supervised learning. In real life, most learning is semi-supervised learning: normally you have some things that are labeled and some things that are unlabeled. So this is the most practically useful kind of learning. And structured data is the most common kind of data that companies deal with day to day. So the fact that this competition was a semi-supervised, structured data competition made it incredibly practically useful.

So his technique for winning this was to do data augmentation, which those of you doing the deep learning course have learned about; it’s basically the idea that if you had pictures, you would flip them horizontally or rotate them a bit. Data augmentation means creating new data examples which are slightly different versions of ones you already have. The way he did it was, for each row of the data, he would randomly replace 15% of the variables with values from a different row. So each row now would represent a mix of 85% of the original row and 15% randomly selected from a different row. This was a way of randomly changing the data a little bit, and then he used something called an autoencoder, which we probably won’t study until part 2 of the deep learning course. The basic idea of an autoencoder is that your dependent variable is the same as your independent variable. In other words, you try to predict your input, which is obviously trivial if you are allowed to use the identity transform (it trivially predicts the input). The trick with an autoencoder is to have fewer activations in at least one of your layers than your input. So if your input was a hundred-dimensional vector and you put it through a 100 by 10 matrix to create ten activations, and then have to recreate the original 100-long vector from that, you’ve basically had to compress it effectively. It turns out that that kind of neural network is forced to find correlations and features and interesting relationships in the data even when it’s not labeled. So he used that rather than doing any hand engineering; he just used an autoencoder. These are some interesting directions that, if you keep going with your machine learning studies, particularly if you do part two of the deep learning course next year, you’ll learn about. You can see how feature engineering is going away, and this was just an hour ago, so this is very recent indeed. This is one of the most important breakthroughs I’ve seen in a long time.
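To make the autoencoder idea concrete, here is a minimal sketch (the layer sizes are the 100-to-10 example from above, not the winner’s actual architecture):

import torch.nn as nn

# A bottleneck autoencoder sketch: squeeze 100 inputs into 10 activations,
# then try to reconstruct the original 100 values. The target is the input itself.
autoencoder = nn.Sequential(
    nn.Linear(100, 10),   # encoder: forced to compress
    nn.ReLU(),
    nn.Linear(10, 100),   # decoder: rebuild the input from the compressed form
)
# Training would minimize something like nn.MSELoss()(autoencoder(x), x).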

MNIST continued [8:32]

Notebook

This (LogReg) was that little handwritten nn.Module class we created [9:15]. We defined our loss. We defined our learning rate, and we defined our optimizer. And optim.SGD is the thing we are going to try and write by hand in a moment. So nn.NLLLoss and optim.SGD, we are stealing from PyTorch, but we’ve written the module LogReg and the training loop ourselves.

net2 = LogReg().cuda()
loss = nn.NLLLoss()
learning_rate = 1e-2
optimizer = optim.SGD(net2.parameters(), lr=learning_rate)

for epoch in range(1):
    losses = []
    dl = iter(md.trn_dl)
    for t in range(len(dl)):
        # Forward pass: compute predicted y and loss by passing x to the model
        xt, yt = next(dl)
        y_pred = net2(V(xt))
        l = loss(y_pred, V(yt))
        losses.append(l)

        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable
        # weights of the model)
        optimizer.zero_grad()

        # Backward pass: compute gradient of the loss with respect to model parameters
        l.backward()

        # Calling the step function on an Optimizer makes an update to its parameters
        optimizer.step()

    val_dl = iter(md.val_dl)
    val_scores = [score(*next(val_dl)) for i in range(len(val_dl))]
    print(np.mean(val_scores))

So the basic idea was that we are going to go through some number of epochs [9:39], so let’s go through one epoch. We are going to keep track of what the loss was for each mini-batch so that we can report it at the end. We are going to turn our training data loader into an iterator so that we can loop through it, loop through every mini-batch. So now we can go ahead and say for t in the range of the length of the data loader, and then we can call next to grab the next independent and dependent variables from that iterator.

So then remember, we can pass the x tensor (xt) into our model by calling the model as if it were a function. But first of all, we have to turn it into a variable. Last week, we were typing Variable(blah).cuda() to turn it into a variable; a shorthand for that is just the capital V. So capital T for a tensor, capital V for a variable. That’s just a shortcut in Fast AI. So that returns our predictions.

The next thing we needed was to calculate our loss, because we can’t calculate the derivative of the loss if we haven’t calculated the loss [10:43]. So the loss takes the predictions and the actuals. The actuals, again, are the y tensor, and we have to turn that into a variable. A variable keeps track of all of the steps used to compute it. There’s actually a fantastic tutorial on the PyTorch website: in the tutorial section, there’s a tutorial about Autograd. Autograd is the automatic differentiation package that comes with PyTorch. The Variable class is really the key class here, because that’s the thing that turns a tensor into something where we can keep track of its gradients. Basically, they show how to create a variable, do an operation to a variable, and then you can go back and look at the grad function (grad_fn), which is the function it’s keeping track of to calculate the gradient. As we do more and more operations to this variable, and to variables calculated from it, it keeps tracking them. So later on, we can call .backward() and then print .grad and find out the gradient. Notice we never defined the gradient; we just defined it as being (x + 2)² * 3 or whatever, and it can calculate the gradient.
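To make that concrete, here is a minimal sketch along the lines of the PyTorch Autograd tutorial (written with the Variable API this course uses; the specific expression is just an illustration):

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2              # y remembers it was created from an addition
z = y * y * 3          # the whole chain of operations keeps being tracked
out = z.mean()

print(y.grad_fn)       # the function autograd will use for the backward pass
out.backward()         # apply the chain rule backwards through z, y, x
print(x.grad)          # d(out)/dx, computed for us; we never wrote the derivative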

So that’s why we need to turn that into a variable [13:12]. So l is now a variable containing the loss. So it contains a single number for this mini batch which is the loss for this mini batch. But it’s not just a number. It’s a number as a variable, so it’s a number that knows how it was calculated.

We are going to append that loss to our array just so we can get the average of it later. And now we are going to calculate the gradient. So l.backward() is a thing that says calculate the gradient. So remember when we call the network, it’s actually calling our forward function. So that’s like go through it forward. And then backward is like using the chain rule to calculate the gradients backwards.

So optimizer.step() is the thing we are about to write which is update the weights based on the gradients and the learning rate. zero_grad(), we will explain when we write this out by hand.

So then at the end, we can turn our validation data loader into an iterator [14:16]. And we can then go through its length, grabbing each x and y out of it, and asking for the score, which we defined to be: which thing did you predict, which thing was the actual, and are they equal. Then the mean of that is going to be our accuracy:

def score(x, y):
    y_pred = to_np(net2(V(x)))
    return np.sum(y_pred.argmax(axis=1) == to_np(y)) / len(y_pred)

Question: What’s the advantage of converting it into an iterator rather than using a normal Python loop [14:53]? We are using a normal Python loop, so the question really is compared to what. The alternative, perhaps, that you are thinking of could be something like a list with an indexer. The problem there is that each time we grab a new mini batch, we want it to be random; we want a different shuffled thing. With this for t in range(len(dl)) approach, you can keep iterating; you can loop through it as many times as you like. This idea is called different things in different languages, but a lot of languages call it stream processing, and it’s the basic idea that rather than saying I want the third thing or the ninth thing, it’s just I want the next thing. It’s great for network programming (grab the next thing from the network). It’s great for UI programming (grab the next event where somebody clicked a button). It also turns out to be great for this kind of numeric programming: just give me the next batch of data. It means that the data can be arbitrarily long because we are just grabbing one piece at a time. And I guess the short answer is because it’s how PyTorch works: PyTorch’s data loaders are designed to be called in this way. Python has this concept of a generator, which is a way that you can create a function that behaves like an iterator. Python has recognized that this stream processing approach to programming is super handy and helpful, and supports it everywhere. So basically anywhere that you use a for ... in loop, anywhere you use a list comprehension, those things can always be generators or iterators. So by programming this way, we get a lot of flexibility. Does that sound about right, Terrence? You’re the programming language expert.

Terrence: Yeah, I mean the short answer is what you said. You might say something about space, but in this case all that data has to be in memory anyway because we’ve got…

Jeremy: No doesn’t have to be in memory. In fact, most of the time with PyTorch, the mini batch will be read from separate images spread over your disk on demand, so most of the time it’s not in memory.

Terrence: But in general, you want to keep as little in memory as possible at a time. And so the idea of stream processing also is great because you can do compositions. You can pipe the data to a different machine.

Jeremy: Yeah, the composition is great. You can grab the next thing from here, and then send it off to the next stream, which you can then grab it and do something else.

Terrence: which you guys all recognize, of course, in the command-line pipes and I/O redirection.

Jeremy: Thanks, Terrence. It’s a benefit of working with people that actually know what they are talking about.
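As an aside, here is the stream idea in plain Python: a generator that yields one mini-batch at a time instead of building the whole list up front (a toy sketch, not fastai’s code):

def batches(xs, bs):
    # yield successive chunks of size bs; nothing is materialized up front
    for i in range(0, len(xs), bs):
        yield xs[i:i + bs]

for xb in batches(list(range(10)), bs=4):
    print(xb)   # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]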

Implementing Stochastic Gradient Descent [18:24]

So let’s now take that and get rid of the optimizer. The only thing that we are going to be left with is the negative log likelihood loss function which we could also replace actually. We have an implementation of that from scratch that Yannet wrote in the notebooks. It’s only one line of code as we learned earlier. You can do it with a single if statement. So I don’t know why I was so lazy as to include this.

So what we are going to do is, again, grab this module that we’ve written ourselves (the logistic regression module). We’re going to have one epoch again. We are going to loop through each thing in an iterator. We are going to grab our independent and dependent variables for the mini batch, pass them into our network, and calculate the loss. So this is all the same as before, but now we’re going to get rid of optimizer.step() and do it by hand. The basic trick is, as I mentioned, that we are not going to do the calculus by hand. We call l.backward() to calculate the gradients automatically, and that’s going to fill in our gradients. Here is that module we built:

The weight matrix for the linear layer we called l1_w and the biases we called l1_b; they were the attributes we created. So I’ve just put them into things called w and b to save some typing, basically. So w is our weights, b is our biases. Remember, the weights are a variable, and to get the tensor out of the variable we have to use .data. We want to update the actual tensor that’s in this variable, so we say:

w.data -= w.grad.data * lr

  • -= : we want to go in the opposite direction to the gradient. The gradient tells us which way is up; we want to go down.
  • w.grad.data * lr whatever is currently in the gradients times the learning rate.

So that is the formula for gradient descent. As you can see, it’s as easy a thing as you could possibly imagine: update the weights to be equal to whatever they are now, minus the gradients times the learning rate. And do the same thing for the bias.

net2 = LogReg().cuda()
loss_fn = nn.NLLLoss()
lr = 1e-2
w, b = net2.l1_w, net2.l1_b

for epoch in range(1):
    losses = []
    dl = iter(md.trn_dl)
    for t in range(len(dl)):
        xt, yt = next(dl)
        y_pred = net2(V(xt))
        l = loss_fn(y_pred, Variable(yt).cuda())
        losses.append(l)

        # Backward pass: compute gradient of the loss with respect to model parameters
        l.backward()

        # Update the weights and biases by hand: step in the opposite direction
        # of the gradient, scaled by the learning rate
        w.data -= w.grad.data * lr
        b.data -= b.grad.data * lr

        # Reset the gradients to zero before the next mini-batch
        w.grad.data.zero_()
        b.grad.data.zero_()

    val_dl = iter(md.val_dl)
    val_scores = [score(*next(val_dl)) for i in range(len(val_dl))]
    print(np.mean(val_scores))

Question: When we do the next on top, when it’s the end of the loop, how do we grab the next element [21:08]? So this (for t in range(len(dl)):) is going through each index in range of length, so this is going 0, 1, 2… At the end of this loop, it’s going to print out the mean of the validation set, go back to the start of the epoch, at which point, it’s going to create a new iterator. So basically behind the scenes in Python when you call iter(md.trn_dl), it basically tells it to reset its state to create a new iterator. And if you are interested in how that works, the code is all available for you to look at. md.trn_dl is fastai.dataset.ModelDataLoader so we could take a look at the code of that and see exactly how it’s being built. So you can see here, the __next__ function which is keeping track of how many times it’s been through in this self.i, and here is the __iter__ function which is the thing that gets called when you create a new iterator. And you can see it’s passing it off to something else which is of type DataLoader, and then you can check out DataLoader if you’re interested to see how that’s implemented as well.
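For intuition, here is a toy sketch of that iterator protocol (this is not fastai’s ModelDataLoader, just the shape of it):

class ToyLoader:
    def __init__(self, data, bs):
        self.data, self.bs = data, bs
    def __iter__(self):
        # called by iter(...): reset state for a fresh pass over the data
        self.i = 0
        return self
    def __next__(self):
        # called by next(...): hand back one mini-batch, or signal we're done
        if self.i >= len(self.data):
            raise StopIteration
        batch = self.data[self.i:self.i + self.bs]
        self.i += self.bs
        return batch

dl = iter(ToyLoader(list(range(10)), bs=4))
next(dl)   # [0, 1, 2, 3]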

So the DataLoader that we wrote basically uses multi-threading to allow it to have multiple of these going on at the same time. It’s really simple. It’s only about a screen full of code so if you are interested in simple multi-threaded programming, it’s a good thing to look at.

Question: Why have you wrapped this in for epoch in range(1) since that’ll only run once [23:10]? Because in real life, we would normally be running multiple epochs. Like in this case, because it’s a linear model, it actually trains to as good as it’s going to get in one epoch, so if I type 3 here, it actually won’t improve after the first epoch much at all as you can see. But when we go back up to the top, we’re going to look at some slightly deeper and more interesting versions which will take more epochs. So if I was turning this into a function, I’d be going like def train_mdl and one of the things you would pass in is the number of epochs kind of thing.

One thing to remember is that when you are creating these neural network layers, this (LogReg()) is, as far as PyTorch is concerned, just an nn.Module: we could be using it as a layer, we could be using it as a function, we could be using it as a neural net [24:10]. PyTorch doesn’t think of those as different things. So this could be a layer inside some other network. So how do gradients work? If you’ve got a layer, which we can think of as some activations that get computed through some linear or nonlinear activation function, then from that layer it’s very likely we’re putting it through a matrix product to create some new layer. So if we were to grab one of these activations, it’s actually going to be used to calculate every one of those outputs.

So if you want to calculate the derivative, you have to know how this weight matrix impacts each output and you have to add all of those together to find the total impact of the one activation across all of its outputs. So that’s why in PyTorch you have to tell it when to set the gradients to zero. Because the idea is that you could be having lots of different loss functions or lots of different outputs in your next set of activations or whatever, all adding up increasing or decreasing your gradients. So you basically have to say, okay this is a new calculation — reset. So here is where we do that:

Before we do l.backward(), we say reset: take our weights, take the gradients, take the tensor that they point to, and zero_ it. An underscore suffix in PyTorch means “in place”, which sounds like a minor technicality but is super useful to remember. Pretty much every function has a version with an underscore suffix which does it in place. So normally you would get back a new tensor of zeros of a particular size, whereas zero_ means replace the contents of this tensor with a bunch of zeros.
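For example (a quick sketch of the in-place convention):

import torch

t = torch.ones(3)
t.zero_()               # trailing underscore: zero out t in place
print(t)                # t is now all zeros

fresh = torch.zeros(3)  # no underscore: allocate and return a new tensor of zeros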

Alright, so that’s it. That’s SGD from scratch. And if I get rid of my menu bar, we can officially say it fits within a screen. Of course we haven’t got the definition of our logistic regression here; that’s another half a screen. But basically there’s not much to it.

Question: Why do we need multiple epochs [27:39]? The simple way to answer that would be: let’s say our learning rate was tiny; then it’s just not going to get very far. There is nothing that says going through one epoch is enough to get you all the way there. So then you might say okay, let’s increase our learning rate. Sure, we can increase the learning rate, but who is to say that the highest learning rate at which it learns stably is enough to learn this as well as it can be learnt? For most datasets and most architectures, one epoch is very rarely enough to get you to the best result you can get to. Linear models are very nicely behaved, so you can often use higher learning rates and learn more quickly. Also, you can’t generally get as good an accuracy, so there’s not as far to take them either. Doing just one epoch is going to be the rarity.

Going backwards [28:54]

Let’s go backwards. So going backwards, we are basically going to say let’s not write these lines over and over again (on the left). Let’s have somebody do that for us.

So that’s the only difference between these versions. Rather than saying .zero_ or -= gradient * lr ourselves, these are wrapped up for us (on the right).

There is another wrinkle here, which is that the left approach to updating the weights is actually pretty inefficient. It doesn’t take advantage of momentum and curvature. In the DL course, we learn how to do momentum from scratch as well. So if we had just used plain old SGD instead of Adam, they would be doing exactly the same thing; as it is, you’ll see that the left version learns much more slowly.

Let’s do a little bit more stuff automatically [30:25]. Given that every time we train something, we have to loop through epoch, batch, do forward, get the loss, zero the gradient, do backward, do a step of the optimizer, let’s put all that in a function. And that function is called fit:

There it is. So let’s take a look at fit:

Then here is step [31:41]:

Zero out the gradients, calculate the loss (remember, PyTorch tends to call it criterion rather than loss), do backward. And then, there is something else we haven’t learned here, but we do learn in the deep learning course which is “gradient clipping” so you can ignore that. So you can see, all the stuff that we’ve learnt, when you look inside the actual framework, that’s the code you see. So that’s what fit does.

Then the next step would be this idea of having some weights and a bias and doing a matrix product and addition, let’s put that in a function [32:14]. This thing of doing the log softmax, let’s put that in a function. Then the very idea of first doing this and then doing that, the idea of chaining functions together, let’s put that into a function. And that finally gets us to:

So Sequential simply means: do this function, take the result, send it to the next function, and so on. And Linear means create the weight matrix and create the biases. That’s it.

We can then, as we started to talk about, turn this into a deep neural network by saying: rather than sending this straight off into 10 activations, let’s put it into, say, 100 activations. We could pick whatever number we like. Put it through a ReLU to make it nonlinear, put it through another linear layer, another ReLU, and then our final output with our final activation function.
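That stack looks something like this (a sketch; the hidden sizes are the 100s mentioned above, and the exact layers in the notebook may differ slightly):

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # 784 pixel inputs -> 100 activations
    nn.ReLU(),                 # nonlinearity
    nn.Linear(100, 100),       # second hidden layer
    nn.ReLU(),
    nn.Linear(100, 10),        # 10 outputs, one per digit
    nn.LogSoftmax(dim=-1),     # log-probabilities to pair with NLLLoss
).cuda()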

This is now a deep network. So we could fit that. And this time now, because it’s deeper, I’m actually going to run a few more epochs. And you can see the accuracy increasing:

If you try and increase the learning rate from 0.1 further, it actually starts to become unstable.

Learning rate annealing [34:12]

I’ll show you a trick. This is called learning rate annealing, and the trick is this: when you are trying to fit a function, you take a number of steps, and as you get closer to the bottom, your steps probably want to become smaller. Otherwise what tends to happen is that you start going back and forth over the same spots (oscillating).

You can actually see in the accuracies above that it’s starting to flatten out. That could be because it has done as well as it can, or it could be that it’s going backwards and forwards. So it’s a good idea to decrease your learning rate later on in training and take smaller steps. That’s called learning rate annealing. There is a function in Fast AI called set_lrs (set learning rates); you can pass in your optimizer and your new learning rate, and see if that helps. Very often it does. You should reduce by about an order of magnitude. In the deep learning course, we learn a much better technique than this to do learning rate annealing automatically and at a more granular level. But if you are doing it by hand, an order of magnitude at a time is what people generally do.
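With a plain PyTorch optimizer, annealing by hand amounts to something like this (a sketch of the idea; fastai’s set_lrs is a convenience for this kind of change, not necessarily implemented this way):

# Drop the learning rate by an order of magnitude for the later epochs
for param_group in optimizer.param_groups:
    param_group['lr'] = param_group['lr'] / 10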

You’ll see people in papers talk about learning rate schedules, this is like a learning rate schedule. So this schedule has got us to 97%. And I tried going further and we don’t seem to be able to get much better than that. So here we’ve got something where we can get 97% accuracy.

Question: I had a question about the data loading. I know it’s a Fast AI function, but could you go into a little detail about how it’s creating batches, how it’s done, and how it’s making those decisions [36:47]? Sure. There’s a really nice design in PyTorch where they say: let’s create a thing called a dataset. A dataset is basically something that looks like a list. It has a length (e.g. how many images are in the dataset), and it has the ability to index into it like a list. So if you had a Dataset d, you can do:

d = Dataset(...)
len(d)
d[i]

That’s basically all a dataset is as far as PyTorch is concerned. So you start with a dataset, where d[3] gives you the third image, and so on. You take a dataset and you can pass it into a constructor for a data loader, dl = DataLoader(d). That gives you something which is now iterable. So you can now say iter(dl), and that’s something you can call next on (i.e. next(iter(dl))). When you call the data loader’s constructor, you can choose to have shuffle on or shuffle off; shuffle on means give me random mini batches, shuffle off means go through it sequentially. What the data loader does when you call next, assuming you said shuffle=True and the batch size is 64, is grab 64 random integers between 0 and the length, call this (d[i]) 64 times to get 64 different items, and jam them together. Fast AI uses the exact same terminology and the exact same API; we just do some of the details differently. Specifically, particularly with computer vision, you often want to do a lot of data augmentation like flipping, changing the colors a little bit, or rotating, and those turn out to be computationally expensive. Even just reading the JPEGs turns out to be computationally expensive. So PyTorch uses an approach where it fires off multiple processes to do that in parallel, whereas the Fast AI library instead does something called multi-threading, which can be a much faster way of doing it.
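Here is a minimal sketch of that Dataset/DataLoader pattern with plain PyTorch (the data here is random and purely for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    # a dataset is just something with a length and integer indexing
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

d = MyDataset(torch.randn(1000, 28 * 28), torch.randint(0, 10, (1000,)))
dl = DataLoader(d, batch_size=64, shuffle=True)  # shuffle=True: random mini-batches
xb, yb = next(iter(dl))                          # grab one mini-batch of 64 items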

Question: Is an “epoch” a real epoch in the sense that all of the elements get returned once? Is there a shuffle at the beginning of the epoch [39:47]? Yeah, although not all libraries work the same way; some do sampling with replacement, some don’t. The Fast AI library actually hands the shuffling off to the PyTorch version, and I believe the PyTorch version shuffles such that an epoch covers everything exactly once.

Now the thing is, when you start to get these bigger networks, potentially you’re getting quite a few parameters [40:17]. I’m not going to ask you to calculate how many parameters there are, but let’s remember: here we’ve got a 28 by 28 input into 100 outputs, and 100 into 10. Then for each of those, we’ve got weights and biases.

So we can actually do this: net.parameters() returns a list where each element of the list is a tensor of parameters, not just one per layer; if it’s a layer with both weights and biases, that’s two tensors. So basically it returns us a list of all of the tensors containing the parameters. numel() in PyTorch tells you how big each one is.

So if I run this, here is the number of parameters in each layer. I’ve got 784 inputs and the first layer has a hundred outputs, therefore the first weight matrix is of size 78,400, and the first bias vector is of size 100. Then the next one is a hundred by a hundred, plus 100 for its bias. And then the next one is 100 by 10, and 10 is the bias. So those are the numbers of elements in each layer. I add them all up and it’s nearly a hundred thousand. So I’m possibly at risk of overfitting here, so we might want to think about using regularization.
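Concretely, that count looks something like this (assuming the deeper model above is called net, as in the earlier Sequential sketch; the per-layer numbers are the ones just listed):

# one entry per weight or bias tensor: 78,400 + 100 + 10,000 + 100 + 1,000 + 10
[o.numel() for o in net.parameters()]
sum(o.numel() for o in net.parameters())   # nearly a hundred thousand in total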

Regularization [42:05]

A really simple common approach to regularization in all of machine learning is something called L2 regularization. It’s super important, super handy, you can use it with just about anything. The basic idea is this. Normally we’d say our loss is equal to (let’s do RMSE to keep things simple) our predictions minus our actuals squared and we sum them up, take the average, take the square root.

What if we then want to say: if I’ve got lots and lots of parameters, don’t use them unless they are really helping enough. If you’ve got a million parameters and you only really needed 10 parameters to be useful, just use 10. So how could we tell the loss function to do that? Basically, what we want to say is: hey, if a parameter is zero, that’s no problem; it’s as if it doesn’t exist at all. So let’s penalize a parameter for not being zero. What would be a way we could measure that? How can we calculate how un-zero our parameters are? L1 uses the absolute values of the weights (averaged); L2 uses the squares of the weights themselves. Then we want to be able to say: okay, how much do we want to penalize not being zero? Because if we actually don’t have that many parameters, we don’t want to regularize much at all; if we’ve got heaps, we do want to regularize a lot. So then we put in a parameter a:

Except I have a rule in my classes which is never to use Greek letters, so normally people use alpha, I’m going to use a. So this is some number which you often see around 1e-6 to 1e-4 ish. Now we actually don’t care about the loss other than maybe to print it out. What we actually care about is the gradient of the loss. So the gradient of aw² is 2aw. So there are two ways to do this:

  1. We can actually modify our loss function to add in this square penalty.
  2. We can modify that thing where we said weights equals weights minus gradient times learning rate to add 2aw as well.

These are basically equivalent, but they have different names: the first one is called L2 regularization and the second one is called weight decay. The first version was how it was first posed in the neural network literature, whereas the second version is how it was posed in the statistics literature, and they are equivalent.
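In code, the two formulations look something like this (a sketch reusing the names from the training loop above; a is the penalty multiplier and its value is just an example):

a = 1e-4   # penalty multiplier (example value)

# 1. L2 regularization: add the squared-weights penalty into the loss itself
l = loss_fn(y_pred, V(yt)) + a * (w ** 2).sum()

# 2. Weight decay: leave the loss alone and add 2*a*w into the update step instead
w.data -= (w.grad.data + 2 * a * w.data) * lr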

As we talked about in the deep learning class, it turns out they are not exactly equivalent because when you have things like momentum and Adam, it can behave differently. And two weeks ago, a researcher figured out a way to actually do proper weight decay in modern optimizers and one of our Fast AI students implemented that in the Fast AI library, so Fast AI is now the first library to actually support this.

Anyway, for now, let’s use the version which PyTorch calls weight decay but which, it turns out based on that paper from two weeks ago, is actually L2 regularization. It’s not quite correct, but it’s close enough. So here, we can say weight decay is 1e-3.
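In PyTorch that’s just an argument to the optimizer, something like the following (the optimizer and learning rate here are illustrative, not necessarily the exact ones used in the notebook):

optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-3)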

So this is going to set our penalty multiplier a to 1e-3 and it’s going to add that to the loss function. Let’s make a copy of these cells so we can compare how they train. So you might notice something kind of counterintuitive here [48:54]. 0.23547 is our training error. You would expect our training error with regularization to be worse because we are penalizing parameters that specifically can make it better. And yet actually, it started out better not worse (previously it was 0.29756). Why could that be?

The reason that can happen is this: if the surface of your loss function is very bumpy, it potentially takes a really long time to train, whereas if it is nice and smooth, it’s going to train a lot more quickly. There are certain things you can do which can sometimes take a function that’s horrible and make it less horrible, and sometimes weight decay can actually make your function a little more nicely behaved, which is what happened here. So I just mention that to say: don’t let it confuse you. Weight decay really does penalize the training set, and strictly speaking the final number we get to for the training set shouldn’t end up being better, but it can sometimes train more quickly.

Question: I don’t get it. Why does it make it faster? Does the training time matter [50:26]? No, this is after one epoch. The bottom is our training without weight decay, and the top is with weight decay. This is not related to time; this is related to just one epoch. After one epoch, my claim was that you would expect the training set, all other things being equal, to have a worse loss with weight decay because we are penalizing it. And I’m saying “oh, it’s not. That’s weird.”

The reason it’s not is because, within a single epoch, it matters a lot whether you are trying to optimize something that’s very bumpy or something that’s nice and smooth. If you are trying to optimize something that’s really bumpy, imagine in some high dimensional space you end up rolling around through all these different tubes and tunnels and stuff, whereas if it’s just smooth, you just go boom. Imagine a marble rolling down a hill where one of them is Lombard Street in San Francisco: backwards, forwards, backwards, forwards, it takes a long time to drive down the road. Whereas if you took a motorbike and just went straight over the top, it’s much faster. So the shape of the loss function surface defines how easy it is to optimize, and therefore how far it can get in a single epoch. Based on these results, it would appear that weight decay here has made this function easier to optimize.

Question: So just to make sure, penalizing makes the optimizer more likely to reach the global minimum [52:45]? No, I wouldn’t say that. My claim actually is that at the end, it’s probably going to be less good on the training set, and indeed this does look to be the case: at the end, after five epochs, our training set loss is now worse with weight decay. That’s what I would expect. I never use the term global optimum because it’s just not something we have any guarantees about, and not really something we care about. We just care where we get to after a certain number of epochs. We hope that we’ve found somewhere that’s a good solution. By the time we get to a good solution, the training set loss with weight decay is worse because of the penalty, but on the validation set the loss is better, because we penalized the training set in order to try to create something that generalizes better. Parameters that are pointless are now zero and it generalizes better. So all we are saying is that it just got to a good point after one epoch.

Question: Is it always true [54:04]? No. If by “it” you mean does weight decay always make the function surface smoother, then no, it’s not always true. But it’s worth remembering that if you are having trouble training a function, adding a little bit of weight decay may help.

Question: So by regularizing the parameters, what it does is smooth out the loss function surface [54:29]? I mean, that’s not why we do it. The reason why we do it is because we want to penalize things that aren’t zero, to say: don’t make this parameter a high number unless it’s really helping the loss a lot; set it to zero if you can, because setting as many parameters to zero as possible means it’s going to generalize better. It’s like having a smaller network. That’s why we do it. But it can change how it learns as well.

I wanted to check how we actually went here [55:11]. After the second epoch, you can see it really has helped: previously we got to 97% accuracy, and now we are nearly up to 98% accuracy. And you can see that the loss was 0.08 vs. 0.13. So adding regularization has allowed us to find a roughly 50% better solution (3% error versus 2%).

Question: So there are two pieces to this: one is L2 regularization and the other is weight decay [55:42]? No, my claim was that they are the same thing. Weight decay is the version you get if you just take the derivative of L2 regularization. So you can implement it either by changing the loss function to include a squared-weights penalty, or by adding the weights themselves into the gradient.

Question: Can we use regularizations for convolution layer as well [56:19]? Absolutely. A convolution layer is just weights.

Question: Can you explain why you thought you needed weight decay in this particular problem [56:29]? Not easily. I mean, other than to say it’s something that I would always try. Question continued: Overfitting? So if my training loss was higher than my validation loss, then I’m underfitting. So there’s definitely no point regularizing. That would always be a bad thing. That would always mean you need more parameters in your model. In this case, I’m overfitting. That doesn’t necessarily mean regularization will help but it’s certainly worth trying.

Question: How do you choose the optimal number of epoch [57:27]? You do my deep learning course 😆 That’s a long story. We don’t have time to cover best practices in this class. We are going to learn the fundamentals.

The secret to modern machine learning techniques [58:14]

Something that we cover in great detail in the deep learning course, but which is really important to mention here, is that the secret, in my opinion, to modern machine learning techniques is to massively over-parameterize the solution to your problem, as we just did (we’ve got like 100,000 weights when we only had a small number of 28 by 28 images), and then use regularization. It’s the direct opposite of how nearly all statistics and learning was done for decades before, and still most senior lecturers at most universities in most areas have a background where they’ve learned that the correct way to build a model is to have as few parameters as possible.

So hopefully we’ve learnt two things so far. One is that we can build very accurate models even when they have lots and lots of parameters: a random forest has a lot of parameters, and this deep network here has a lot of parameters, and they can be accurate. We can do that by either using bagging or using regularization. And regularization in neural nets means either weight decay (also known, kind of, as L2 regularization) or dropout, which we won’t worry too much about here. It’s a very different way of thinking about building useful models. And I just want to warn you that once you leave this classroom, even possibly when you go to the next faculty member’s talk, there will be people, at USF as well, who are entirely trained in the world of models with small numbers of parameters. Your next boss is likely to have been trained in that world too, along with the idea that such models are somehow more pure or easier or better or more interpretable or whatever. I am convinced that is not true: probably not ever true, certainly very rarely true. Models with lots of parameters can actually be extremely interpretable, as we learnt from our whole lesson on random forest interpretation. You can use most of the same techniques with neural nets, and with neural nets it’s even easier.

Remember how we did feature importance by randomizing a column to see how changes in that column would impact the output? Well, that’s just a kind of dumb way of calculating its gradient: how much does varying this input change the output? With a neural net, we can actually calculate its gradient. So with PyTorch, you can actually ask what’s the gradient of the output with respect to this column. You can do the same kind of thing to do partial dependence plots with a neural net. And I’ll mention, for those of you interested in making a real impact, that nobody has written basically any of these things for neural nets. So that whole area needs libraries to be written and blog posts to be written. Some papers have been written, but only in very narrow domains like computer vision. As far as I know, nobody has written the paper saying here’s how to do interpretation methods for structured data neural networks. So it’s a really exciting, big area.

NLP [1:02:04]

So what we are going to do, though, is start by applying this with a simple linear model. This is mildly terrifying for me because we are going to do NLP and our NLP faculty expert is in the room, so David, just yell at me if I screw this up too badly. NLP refers to any kind of modeling where we are working with natural language text. Interestingly enough, we are going to look at a situation where a linear model is pretty close to the state of the art for solving a particular problem. It’s actually a problem where I surpassed the state of the art using a recurrent neural network a few weeks ago, but this is going to show you something pretty close to the state of the art with a linear model.

IMDb

Notebook

We are going to be working with the IMDb dataset. So this is a dataset of movie reviews. You can download it by following these steps:

To get the dataset, in your terminal run the following commands:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
gunzip aclImdb_v1.tar.gz
tar -xvf aclImdb_v1.tar

And once you download it, you’ll see that you’ve got a train and a test directory and in your train directory, you’ll see there is a negative and a positive directory. And in your positive directory, you’ll see there is a bunch of text files.

PATH='data/aclImdb/'
names = ['neg','pos']
%ls {PATH}

aclImdb_v1.tar.gz  imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/

%ls {PATH}train

aclImdb/  all_val/         neg/  tmp/    unsupBow.feat  urls_pos.txt
all/      labeledBow.feat  pos/  unsup/  urls_neg.txt   urls_unsup.txt
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt
...

And here is an example of a text file:

trn[0]"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

So somehow we’ve managed to pick out a story of a man who has unnatural feelings for a pig as our first choice. That wasn’t intentional but it’ll be fine.

We are going to look at these movie reviews and for each one, we are going to look to see whether they were positive or negative. So they’ve been put into one of these folders. They were downloaded from IMDb (the movie database and review site). The ones that were strongly positive went in /pos and strongly negative went in /neg, and the rest they didn’t label at all (/unsup). So there are only highly polarized reviews.

So in the above example, we have an insane, violent mob which unfortunately is too absurd and too off-putting, and even those from the era should be turned off. The label for this was zero, which is negative, so this is a negative review:

trn_y[0]

0

In the Fast AI library, there are lots of functions and classes to help with most kinds of domains that you do machine learning on. For NLP, one of the simple things we have is texts_labels_from_folders:

trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)

That will go through and find all of the folders in here (the first argument f'{PATH}train') with these names (the second argument names) and create a labeled dataset. Don’t let these things ever stop you from understanding what’s going on behind the scenes. We can grab its source code and as you can see it’s tiny, like 5 lines.

I don’t like to write these things out in full, but hide them behind little functions so you can reuse them. But basically, it’s going to go through each directory, and go through each file in that directory, then stick that into an array of texts, figure out what folder it’s in, and stick that into an array of labels. So that’s how we end up with something where we have an array of the reviews and an array of the labels.

That’s our data. So our job will be to take a movie review and predict the label. The way we are going to do it is to throw away all of the interesting stuff about language, which is the order that the words are in. This is very often not a good idea, but in this particular case it’s going to turn out to work not too badly. Let me show you what I mean by throwing away the order of the words. Normally, the order of the words matters a lot: if you’ve got a “not” before something, then that “not” refers to that thing. But in this case, we are trying to predict whether something is positive or negative, and if you see the word “absurd” or “cryptic” appear a lot, then maybe that’s a sign that this isn’t very good. So the idea is that we are going to turn it into something called a term document matrix, where for each document (i.e. each review), we are just going to create a list of what words are in it, rather than what order they are in.

Term document matrix example:

naivebayes.xlsx

Here are four movie reviews that I made up. I’m going to turn this into a term document matrix. The first thing I need to do is create something called a vocabulary, which is a list of all the unique words that appear. Here is my vocabulary: this, movie, is, good, the, bad. That’s all the words. Now I’m going to take each of my movie reviews and turn it into a vector of which words appear and how often they appear. In this case, none of my words appear twice. This is called a term document matrix:

And this representation, we call a bag of words representation. So this here is a bag of words representation of the review.

It doesn’t contain the order of the text anymore. It’s just a bag of the words (i.e. what words are in it). It contains “bad”, “is”, “movie”, “this”. So the first thing we are going to do is turn the reviews into a bag of words representation. The reason that this is convenient for linear models is that this is a nice rectangular matrix that we can do math on. Specifically, we can do a logistic regression, and that’s what we are going to do. We are going to get to the point where we do a logistic regression. Before we get there, though, we are going to do something else called Naive Bayes. sklearn has something which will create a term document matrix for us, called CountVectorizer, so we’ll just use it.
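Here is a tiny sketch of what that looks like on a made-up corpus like the spreadsheet example (these four reviews are my own stand-ins, not the exact ones from naivebayes.xlsx):

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["this movie is good", "the movie is good",
           "this movie is bad", "the movie is bad"]
veczr = CountVectorizer()
term_doc = veczr.fit_transform(reviews)
print(veczr.get_feature_names())   # vocabulary: ['bad', 'good', 'is', 'movie', 'the', 'this']
print(term_doc.toarray())          # one row per review, one column per vocabulary word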

Tokenization [1:09:01]

Now in NLP, you have to turn your text into a list of words, and that’s called tokenization. That’s actually non-trivial because if this was actually This movie is good. or This “movie” is good., how do you deal with that punctuation? More interestingly, what if this was This "movie" isn’t good. How you turn a piece of text into a list of tokens is called tokenization. A good tokenizer would turn this:

Before: This "movie" isn’t good.

After: This " movie " is n’t good .

So you can see that in this version, if I now split on spaces, every token is either a single piece of punctuation or something like the suffix n’t, which is treated as a word. That’s how we would probably want to tokenize that piece of text, because you wouldn’t want good. to be a token; there is no concept of good., and "movie" shouldn’t be a token either. So tokenization is something we hand off to a tokenizer. Fast AI has a tokenizer in it that we can use, so this is how we create our term document matrix with a tokenizer:

veczr = CountVectorizer(tokenizer=tokenize)

sklearn has a pretty standard API, which is nice; I’m sure you’ve seen it a few times before. Once we’ve built some kind of “model” (we can think of CountVectorizer as model-ish; this line is just defining what it’s going to do), we can call fit_transform to actually do it.

trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

So in this case fit_transform is going to create the vocabulary and create the term document matrix based on the training set. transform is a little bit different. That says use the previously fitted model which in this case means use the previously created vocabulary. We wouldn’t want the validation set and the training set to have the words in different orders in the matrices. Because then they would have different meanings. So this is here saying use the same vocabulary to create a bag of words for the validation set.

Question: What if the validation set has different set of words other than training set [1:11:40]? That’s a great question. So generally, most of these vocab creating approaches will have a special token for unknown. Sometimes you can also say like hey if a word appears less than three times, call it unknown. But otherwise, if you see something you haven’t seen before, call it unknown. So that (i.e. “unknown”) would just become a column in the bag of words.

When we create this term document matrix for the training set, we have 25,000 rows, because there are 25,000 movie reviews, and 75,132 columns, which is the number of unique words.

trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
    with 3749745 stored elements in Compressed Sparse Row format>

Now, most of the documents don’t have most of these 75,132 words, so we don’t want to store that as a normal array in memory, because it would be very wasteful. Instead, we store it as a sparse matrix. What a sparse matrix does is store just the whereabouts of the non-zeros. So it says: okay, in document number 1, word number 4 appears and it appears 4 times; in document number 1, term number 123 appears once; and so forth:

(1, 4) → 4
(1, 123) → 1

That’s basically how it’s stored. There are actually a number of different ways of storing it, and if you do Rachel’s computational linear algebra course, you’ll learn about the different types, why you choose them, how to convert between them, and so forth. But they’re all something like this, and on the whole you don’t really have to worry about the details. The important thing is that it’s efficient.
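For example, here is a quick sketch with scipy (which is what sklearn is returning here under the hood):

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 4, 0, 0],
                  [0, 0, 1, 0]])
sparse = csr_matrix(dense)   # only the non-zero entries get stored
print(sparse)                # prints (row, col) value pairs, e.g. (0, 1) 4 and (1, 2) 1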

So we could grab the first review, and that gives us a sparse matrix with one row and 75,132 columns, with 93 stored elements [1:14:02]. In other words, 93 of those words are actually used in the first document.

trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
    with 93 stored elements in Compressed Sparse Row format>

We can have a look at the vocabulary by calling veczr.get_feature_names, which gives us the vocab. Here is an example of a few of the elements of the feature names:

vocab = veczr.get_feature_names(); vocab[5000:5005]

['aussie', 'aussies', 'austen', 'austeniana', 'austens']

I didn’t intentionally pick the one that had aussie but that’s the important words, obviously 😄 I haven’t used the tokenizer here. I’m just splitting on space, so this isn’t quite the same as what the vectorizer did. But to simplify things, let’s grab a set of all the lowercased words. By making it a set, we make them unique. So this is roughly the list of words that would appear.

w0 = set([o.lower() for o in trn[0].split(' ')]); w0

{'a',
 'absurd',
 'an',
 'and',
 'audience',
 'be',
 'better',
 'briefly.',
 'by',
 'can',
 ...
}

len(w0)

91

And that length is 91, which is pretty similar to 93; the difference will just be that I didn’t use a real tokenizer. So that’s basically all that has been done there: it created this unique list of words and mapped them. We could check by calling veczr.vocabulary_ to find the ID of a particular word. This is the reverse map of veczr.get_feature_names: that maps integer to word, whereas veczr.vocabulary_ maps word to integer.

veczr.vocabulary_['absurd']

1297

So we saw “absurd” appear twice in the first document, so let’s check:

trn_term_doc[0,1297]

2

There it is: 2. On the other hand, unfortunately aussie did not appear in the unnatural-relationship-with-a-pig movie, so this is zero:

trn_term_doc[0,5000]

0

So that’s our term document matrix.

Question: Does it care about the relationship between the words as in the ordering of the words [1:16:02]? No, we’ve thrown away the orderings. That’s why it’s a bag of words. And I’m not claiming that this is necessarily a good idea. What I will say is that the vast majority of NLP work that’s been done over the last few decades generally uses this representation because we didn’t really know much better. Nowadays, increasingly we are using recurrent neural networks instead which we will learn about in our last deep learning lesson of part 1. But sometimes this representation works pretty well, and it’s actually going to work pretty well in this case.

In fact, back when I was at FastMail, my email company, a lot of the spam filtering we did used this next technique, Naive Bayes, which is a bag of words approach [1:16:49]. If you are getting a lot of email containing the word Viagra, and it’s always been spam, and you never get email from your friends talking about Viagra, then it’s very likely that something that says Viagra, regardless of the detail of the language, is from a spammer. So that’s the basic theory of classification using a term document matrix.

Naive Bayes [1:17:26]

Let’s talk about Naive Bayes. Here is the basic idea. We are going to start with our term document matrix. The first two rows are our corpus of positive reviews, and the next two are our corpus of negative reviews. So here is our whole corpus of all reviews:

We tend to call these columns more generically “features” rather than “words”: this is a feature, movie is a feature, and so on. That’s more like machine learning language; a column is a feature. We call those f in Naive Bayes. So we can basically say the probability that you would see the word this, given that the class is 1 (i.e. a positive review), is just the average of how often you see this in the positive reviews. Now we’ve got to be a bit careful, though, because if you never ever see a particular word in a particular class, so if I’ve never received an email from a friend that said “Viagra”, that doesn’t actually mean the probability of a friend sending me an email about Viagra is zero. It’s not really zero. I hope I don’t get an email from Terrence tomorrow saying “Jeremy, you probably could use this advertisement for Viagra”, but it could happen. I’m sure it would be in my best interest 🤣 So what we do is say that what we’ve seen so far is not the full sample of everything that could happen; it’s a sample of what’s happened so far. So let’s assume that the next email you get actually does mention Viagra and every other possible word. Basically, we are going to add a row of 1’s.

That’s like the email that contains every possible word. That way, nothing is ever infinitely unlikely. So I take the average of all of the times that “this” appears in my positive corpus, plus the 1's:

So that’s the probability that the feature “this” appears in a document, given that class = 1 (i.e. p(f|1) for “this”).

Not surprisingly, here is the same thing for the probability that the feature “this” appears given class = 0:

Same calculation, just over the class-0 rows. Obviously these two come out the same, because “this” appears once in the positives and once in the negatives.

So we can do that for every feature, for every class [1:20:40].
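As a concrete sketch of that calculation, here is a tiny term document matrix of my own making, consistent with the counts described in this section but not necessarily the exact spreadsheet, together with the smoothed per-class feature probabilities (the extra row of ones folded in as the “+1”s):

import numpy as np

feats = ['this', 'movie', 'is', 'good', 'bad']
x_toy = np.array([[1, 1, 1, 1, 0],    # "this movie is good"  -> positive
                  [0, 1, 1, 1, 0],    # another positive review without "this"
                  [1, 1, 1, 0, 1],    # "this movie is bad"   -> negative
                  [0, 1, 1, 0, 1]])   # another negative review without "this"
y_toy = np.array([1, 1, 0, 0])

# p(f|c): how often each feature appears in class c, with the imaginary
# document containing every word added in
p_f_given_1 = (x_toy[y_toy==1].sum(0) + 1) / ((y_toy==1).sum() + 1)
p_f_given_0 = (x_toy[y_toy==0].sum(0) + 1) / ((y_toy==0).sum() + 1)

print(dict(zip(feats, p_f_given_1)))   # 'this': 2/3, 'movie': 1.0, 'is': 1.0, 'good': 1.0, ...
print(dict(zip(feats, p_f_given_0)))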

So our trick now is to use Bayes’ rule to fill this in. What we want is: given this particular document (somebody sent me this particular email, or I have this particular IMDb review), what is the probability that its class is positive? So for this particular movie review, what’s the probability that its class is positive? We can say that’s equal to the probability that we got this particular movie review given that its class is positive, multiplied by the probability that any movie review’s class is positive, divided by the probability of getting this particular movie review.

That’s just Bayes’ rule. We can calculate all of those things, but what we really want to know is: is it more likely that this is class 0 or class 1? So what if we took the probability that it’s class 1 and divided it by the probability that it’s class 0? What if we did that?

Okay, so if this number is bigger than 1, then it’s more likely to be class 1; if it’s smaller than 1, it’s more likely to be class 0. So we can just divide the whole expression for class 1 by the same expression for class 0, which is the same as multiplying by its reciprocal. The nice thing is that p(d) now cancels out, and we are left with the probability of getting the data given class 0, and the probability of class 0, in the denominator.

Basically what that means is we want to calculate the probability that we would get this particular document given that the class is 1 times the probability that the class is 1 divided by the probability of getting this particular document given the class is 0 times the probability that the class is 0:

So the probability that the class is 1 is just equal to the average of the labels [1:23:20]. The probability the class is 0 is 1 minus that. So those are the two numbers:

I’ve got an equal number of both, so they are both 0.5.

What is the probability of getting this document given that the class is 1?

Student: Look at all the documents that have class equal to 1, and 1 divided by that would give you… [1:24:02]

Jeremy: So remember, it’s going to be for a particular document. For example, we would be saying: what’s the probability that this review is positive? So you are on the right track, but what we are going to have to do is say: let’s just look at the words it has, and then multiply their probabilities together for class equals 1. So the probability that a class 1 review has “this” is 2/3, the probability it has “movie” is 1, “is” is 1, and “good” is 1. So the probability it has all of them is all of those multiplied together. Kinda. Tyler, why is it not really? So glad you look horrified and skeptical 😄

Tyler: The word choices aren’t independent?

Jeremy: Thank you. Nobody can call Tyler naive, because the reason this is Naive Bayes is that this is what happens if you apply Bayes’ theorem in a naive way, and Tyler is not naive. Anything but. So Naive Bayes says: let’s assume that if you have “this movie is bloody stupid I hate it”, then the probability of “hate” is independent of the probability of “bloody”, which is independent of the probability of “stupid”, which is definitely not true. So Naive Bayes isn’t actually very good, but I’m teaching it to you because it’s going to turn out to be a convenient piece for something we are about to learn later.

Background: And it often works pretty well.

Jeremy: It’s okay. I mean, I would never choose it; I don’t think it’s better than any other technique that’s equally fast and equally easy. But it’s a thing you can do, and it’s certainly going to be a useful foundation.

So here now is the calculation of the probability that we get this particular document, assuming it’s a positive review [1:26:08]:

Here is the probability given that it’s negative:

And here is the ratio. The ratio is above 1, so we are going to say we think this is probably a positive review.
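Continuing the toy sketch from above (reusing x_toy, y_toy, p_f_given_1, and p_f_given_0, which again are my own reconstruction rather than the original spreadsheet), the ratio for the first document works out like this:

doc = x_toy[0]                                    # "this movie is good"
p_1, p_0 = (y_toy==1).mean(), (y_toy==0).mean()   # class priors, both 0.5 here

# multiply the p(f|c) of only the words the document actually contains
lik_1 = np.prod(np.where(doc==1, p_f_given_1, 1))
lik_0 = np.prod(np.where(doc==1, p_f_given_0, 1))

ratio = (lik_1 * p_1) / (lik_0 * p_0)
print(ratio)   # 3.0 in this toy example: above 1, so we call it a positive review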

So that’s the Excel version. You can tell I let Yannet touch this, because it’s got LaTeX in it; we’ve got actual math. So here is the same thing written mathematically: the log-count ratio for each feature f.
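The formula image itself isn’t reproduced in these notes, but judging from the code that follows, it is presumably the log-count ratio in roughly this form (my reconstruction, with the smoothing constant set to 1):

r = \log \frac{p / \lVert p \rVert_1}{q / \lVert q \rVert_1},
\qquad p = 1 + \sum_{i :\, y_i = 1} x_i,
\qquad q = 1 + \sum_{i :\, y_i = 0} x_i,
\qquad b = \log \frac{N_+}{N_-}

where x_i is row i of the term document matrix, and N_+ and N_- are the numbers of positive and negative documents.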

So here it is written out as Python. Our independent variable is our term document matrix, and our dependent variable is just the labels y. Using numpy, x[y==1] grabs the rows where the dependent variable is 1. Then we can sum over the rows to get the total count of each feature across all of those documents, plus 1 (Terrence is totally going to send me something about Viagra today, I can tell). That’s that. Then we do the same thing for the negative reviews. Then of course it’s nicer to take the log, because with logs we can add things together rather than multiply them, and once you multiply enough of these things together, the result gets so close to zero that you run out of floating-point precision. So we take the log of the ratios. Then, as I said, we add on the log of the ratio of the whole class probabilities.

def pr(y_i):
    # smoothed p(f|class): (feature counts in that class + 1) over
    # (number of documents in that class + 1)
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

x = trn_term_doc
y = trn_y

# per-class feature counts, with the row of ones added in
p = x[y==1].sum(0) + 1
q = x[y==0].sum(0) + 1

# log-count ratio for every feature, and the log of the class ratio
r = np.log((p/p.sum()) / (q/q.sum()))
b = np.log((y==1).mean() / (y==0).mean())

So in order to say, for each document, “multiply the Bayes probabilities by the counts”, we can just use a matrix multiply. Then to add on the log of the class ratios, we just use + b. We end up with something that looks a lot like logistic regression. But we are not learning anything, at least not in an SGD sense; we are just calculating it from this theoretical model. As I said, we then check whether it’s bigger or smaller than zero (not one anymore, because we are now in log space), compare the predictions to val_y, and take the mean. That’s ~81% accurate. So Naive Bayes is not nothing; it gave us something.

pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()
0.80691999999999997

This is the version where we are actually looking at how often a word appears, e.g. “absurd” appeared twice. It turns out that, at least for this problem and quite often in general, it doesn’t matter whether “absurd” appeared twice or once [1:29:03]. All that matters is that it appeared at all. So what people tend to try is to take the term document matrix and go .sign(), which replaces anything positive with 1 and anything negative with -1 (we don’t have negative counts, obviously). So this binarizes it: it says I don’t care that you saw “absurd” twice, I just care that you saw it. If we do exactly the same thing with the binarized version, we get a better result.

pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()
0.82623999999999997

Logistic regression [1:30:01]

Now this is the difference between theory and practice. In theory, Naive Bayes sounds okay, but it is naive (unlike Tyler). So what Tyler would probably do instead is say: rather than assuming that I should use these coefficients r, why don’t we learn them? So let’s learn them. We can totally learn them.

So let’s create a logistic regression and fit some coefficients. That’s going to give us something with exactly the same functional form that we had before, but now rather than using a theoretical r and a theoretical b, we calculate those two things with logistic regression. And that’s better.

m = LogisticRegression(C=1e8, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.85504000000000002

So it’s kind of like: why do something based on a theoretical model? Because theoretical models are pretty much never going to be as accurate as a data-driven model, unless you are dealing with some physics problem where you know this really is how the world works (we are working in a vacuum, this is the exact gravity, and so on). For most of the real world, it’s better to learn your coefficients from the data than to calculate them from theory.

Yannet: What’s this dual=True [1:31:30]?

Jeremy: I was hoping you wouldn’t notice, but you saw it. Basically, in this case our term document matrix is much wider than it is tall. There is an almost mathematically equivalent reformulation of logistic regression that happens to be a lot faster when the matrix is wider than it is tall. So the short answer is: any time it’s wider than it is tall, put dual=True and it’ll run fast. This runs in about 2 seconds; if you don’t have it, it takes a few minutes. In math there is a concept of dual versions of problems, which are kind of like equivalent formulations that sometimes work better in certain situations.

Here is the binarized version [1:32:20]. It’s about the same. You can see I’ve fitted it with the sign of the term document matrix and predicted with val_term_doc.sign().

m = LogisticRegression(C=1e8, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()
0.85487999999999997

Now, the thing is that this is going to be a coefficient for every term, and there were about 75,000 terms in our vocabulary. That seems like a lot of coefficients given that we’ve only got 25,000 reviews [1:32:38]. So maybe we should try regularizing this.

We can use the regularization built into sklearn’s LogisticRegression class; C is the parameter it uses. It’s slightly weird in that a smaller parameter means more regularization. That’s why I used 1e8: to basically turn regularization off.

So if I turn on regularization by setting it to 0.1, now it’s 88%:

m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()
0.88275999999999999

Which makes sense: 75,000 parameters for 25,000 documents is likely to overfit, and indeed it did overfit. So this adds L2 regularization to avoid overfitting.

I mentioned earlier that as well as L2 which is looking at the weight squared, there’s also L1 which is looking at just the absolute value of the weights [1:33:37].

I was pretty sloppy in my wording before when I said that L2 tries to make things zero. That’s kind of true, but if you’ve got two things that are highly correlated, L2 regularization will move them both down together; it won’t make one of them zero and the other nonzero. L1 regularization, on the other hand, will try to make as many things zero as possible, whereas L2 tends to try to make everything smaller. We actually don’t care much about that difference in modern machine learning, because we very rarely try to interpret the coefficients directly; we try to understand our models through interrogation, using the kinds of techniques we’ve learned. The reason we would care about L1 versus L2 is simply which one ends up with a better error on the validation set, and you can try both. With sklearn’s LogisticRegression, L2 actually turns out to be a lot faster, because you can’t use dual=True unless you have L2, and L2 is the default. So I didn’t worry too much about the difference here.
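If you did want to try L1 here, a minimal sketch with sklearn might look like the following (this is not from the notebook; note that liblinear’s dual formulation is only available with L2, so dual is left at its default):

# sketch only: L1-penalised logistic regression on the binarized matrix
m_l1 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
m_l1.fit(trn_term_doc.sign(), y)
preds = m_l1.predict(val_term_doc.sign())
(preds==val_y).mean()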

So you can see here that if we use regularization with the binarized version, we actually do pretty well [1:35:04]:

m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()
0.88404000000000005

Question: Earlier we learned about elastic net, combining L1 and L2. Can we do that [1:35:23]? Yeah, you can do that, but with deeper models. And I’ve never seen anybody find that useful.

So the last thing I’ll mention is that when you create your CountVectorizer, you can also ask for n-grams. By default we get unigrams, that is, single words. But if we say ngram_range=(1,3), it will also give us bigrams and trigrams.

veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize,
                        max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

trn_term_doc.shape
(25000, 800000)

vocab = veczr.get_feature_names()
vocab[200000:200005]
['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']

By which I mean: if I now go ahead and run the CountVectorizer and call get_feature_names, my vocabulary includes bigrams like 'by vast' and 'by vengeance', and trigrams like 'by vengeance .' and 'by vera miles'. So this is doing the same thing, but after tokenizing it’s not just grabbing each word and saying that’s part of your vocabulary; it also takes each pair of adjacent words and each triple of adjacent words. This turns out to be super helpful in taking advantage of bag-of-words approaches, because now we can see the difference between “not good” versus “not bad” versus “not terrible”. Or even something like “good” in quotation marks, which is probably going to be sarcastic. So using trigram features actually turns out to make both Naive Bayes and logistic regression quite a lot better. It really takes us quite a lot further and makes them quite useful.

Question: I have a question about tokenizers. You are specifying max_features, so how are these bigrams and trigrams selected [1:37:17]?

Jeremy: Since I’m using a linear model, I didn’t want to create too many features. It actually worked fine even without max_features; I think I had something like 70 million coefficients and it still worked, but there’s no need to have 70 million coefficients. So if you say max_features=800,000, the CountVectorizer will sort the vocabulary by how often everything appears, whether it’s a unigram, bigram, or trigram, and it will cut it off after the 800,000 most common n-grams. N-gram is just the generic word for unigrams, bigrams, and trigrams.

So that’s why trn_term_doc.shape is now 25,000 by 800,000. If you are not sure what the max should be, I just picked something really big and didn’t worry about it too much, and it seemed to be fine. It’s not terribly sensitive.
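To see the effect of the n-gram features, you could simply refit the regularized logistic regression from before on these new matrices. A quick sketch of that (the resulting accuracy isn’t quoted in this part of the notes; the claim above is just that it improves things considerably):

# same regularized model as before, now on the unigram+bigram+trigram matrix
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()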

Okay, we are out of time so what we are going to see tomorrow… By the way, we could have replaced this LogisticRegression with our PyTorch version:
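The PyTorch code itself isn’t included in these notes, but a minimal sketch of the idea (stand-in data and names, not the fastai implementation) would be a single linear layer trained with a binary cross-entropy loss:

import torch
import torch.nn as nn

n_docs, n_feats = 1000, 2000            # stand-in sizes; the real matrix is far wider
x_trn = torch.rand(n_docs, n_feats)     # stand-in for a (densified) term document matrix
y_trn = torch.randint(0, 2, (n_docs,)).float()

model = nn.Linear(n_feats, 1)           # one coefficient per feature plus a bias
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.1)  # weight_decay ~ L2

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(x_trn).squeeze(1), y_trn)
    loss.backward()
    opt.step()

preds = model(x_trn).squeeze(1) > 0     # > 0 in logit space means probability > 0.5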

And tomorrow we’ll actually see something in the Fast AI library that does exactly that. We will also see how to combine logistic regression and Naive Bayes together to get something better than either. Then we’ll learn how to move from there to create a deeper neural network to get a pretty much state-of-the-art result for structured learning. All right. We’ll see you then.
