Machine Learning 1: Lesson 11

Hiromi Suenaga
45 min read · Oct 13, 2018


My personal notes from machine learning class. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12

Review of optimizing multi-layer functions with SGD [0:00]

The idea is that we’ve got some data (x) and then we do something to that data, for example, we multiply it by a weight matrix (f(x)). Then we do something to that, for example, we put it through a softmax or a sigmoid (g(f(x))). Then we do something to that, such as a cross entropy loss or a root mean squared error loss (h(g(f(x)))). That’s going to give us some scalar. This network has no hidden layers: it has a linear layer, a non-linear activation being a softmax, and a loss function being a root mean squared error or a cross entropy. Then we’ve got our input data.

For example [1:16], if the non-linear activation was sigmoid or softmax, and the loss function was cross entropy, then that would be logistic regression. So how do we calculate the derivative of that with respect to our weights?

To do that, basically we do the chain rule:
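Writing the loss as L = h(g(f(x, w))), with the weights w made explicit inside f, the chain rule gives (my notation — this just spells out the composition described above):

$$\frac{\partial L}{\partial w} \;=\; h'\big(g(f(x,w))\big)\cdot g'\big(f(x,w)\big)\cdot \frac{\partial f(x,w)}{\partial w}$$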

In order to take the derivative with respect to the weights, therefore, we just have to calculate the derivative with respect to w using that exact formula [3:29]. Then if we went further and had another linear layer with weight w2, there is no difference now in calculating the derivative with respect to all of the parameters. We can still use the exact same chain rule.

So don’t think of the multi-layer network as things that occur at different times. It’s just a composition of functions. So we just use the chain rule to calculate all the derivatives at once. They are just a set of parameters that happen to appear in different parts of the function, but the calculus is no different. So to calculate this with respect to w1 and w2, you can just call the whole lot w and say w1 and w2 together are all of those weights.

So what you are going to have then is a list of parameters [5:26]. Here is w1; it’s probably some kind of higher rank tensor. If it’s a convolutional layer, it will be a rank 3 tensor, but we can flatten it out. We’ll just make it a list of parameters. Here is w2; it’s just another list of parameters. Here is our loss, which is a single number. Therefore, our derivative is just a vector of that same length: how much does changing each value of w affect the loss? So you can think of it as a function like y = ax1 + bx2 + c and ask, what’s the derivative of that with respect to a, b, and c? You would have three numbers: the derivative with respect to a, b, and c. That’s all this is — the derivative with respect to that weight, and that weight, and so on.

To get there, inside the chain rule, we had to calculate something like a Jacobian. When you take a matrix product, you’ve got a weight matrix and an input vector which is the activations from the previous layer, and you get some new output activations. So now you have to say: for this particular weight, how does changing it change this particular output? And how does it change that particular output? And so forth. So you end up with these higher dimensional tensors showing, for every weight, how it affects every output. Then by the time you get to the loss function, the loss function is going to have a mean or sum, so they are going to get added up in the end.

It drives me a bit crazy to try and calculate it out by hand or even think of it step by step, because you tend to have like … you just have to remember, for every weight for every output, you’re going to have to have a separate gradient.

One good way to get a feel for this is to learn to use PyTorch’s .grad attribute and .backward method manually, and look up the PyTorch tutorials. So you can actually start setting up some calculations with a vector input and vector output, then call .backward, then look at grad. Then do some really small ones with just 2 or 3 items in the input and output vectors, make the operation something like plus 2, see what the shapes are, and make sure they make sense. Because vector-matrix calculus introduces zero new concepts beyond what you learned in high school, strictly speaking. But getting a feel for how these shapes move around takes a lot of practice. The good news is, you almost never have to worry about it.
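As a concrete starting point, here is the kind of tiny experiment being suggested — a 3-element input, a simple operation, and a look at the resulting gradient (a toy sketch, not from the course notebooks):

import torch

# a tiny vector input; requires_grad tells PyTorch to track gradients for it
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# a simple operation: add 2, square, then sum to get a scalar loss
y = (x + 2) ** 2
loss = y.sum()

# backpropagate and inspect the gradient: d(loss)/dx = 2*(x + 2)
loss.backward()
print(x.grad)   # tensor([ 6.,  8., 10.])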

Review of Naive Bayes & Logistic Regression for NLP [9:53]

Notebook / Excel

We were talking about using this kind of logistic regression for NLP. And before we got to that point, we were talking about using Naive Bayes for NLP. The basic idea was that we could take a document (e.g. a movie review) and turn it into a bag of words representation consisting of the number of times each word appears. We call the unique list of words the vocabulary. We used the sklearn CountVectorizer to automatically generate the vocabulary, which in sklearn is called the “features”, and to create the bag of words representations; the whole group of them is called a term document matrix.
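For reference, a minimal sketch of that step (the two “reviews” here are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie is good", "this movie is not good it is bad"]   # two made-up reviews

veczr = CountVectorizer()
term_doc = veczr.fit_transform(docs)   # sparse term document matrix

print(veczr.vocabulary_)               # word -> column index (the "features")
print(term_doc.toarray())              # counts of each word in each document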

We realized that we could calculate the probability that a positive review contains the word “this” by just averaging the number of times “this” appears in the positive reviews; we could do the same for the negatives; then we could take the ratio of them to get something which, if it’s greater than one, is a word that appeared more often in the positive reviews, or if it’s less than one, a word that appeared more often in the negative reviews.

Then we realized, using Bayes’ rule and taking logs, that we could basically end up with something where we could add up the logs of these (highlighted below) plus the log of the ratio of the probabilities that things are in class 1 versus class 0, and end up with something we can compare to zero [11:32]. If it’s greater than zero, we can predict the document is positive; if it’s less than zero, we can predict the document is negative. And that was our Bayes rule.

We kind of did that from math first principles, and I think we agreed that the “naive” in Naive Bayes was a good description because it assumes independence when that’s definitely not true. But it’s an interesting starting point, and I think it was interesting to observe that once we had calculated the ratio of the probabilities and taken the log, then rather than multiply them together, of course, we have to add them up. And when we actually wrote that down, we realized that it is just a standard weight matrix product plus a bias:
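In symbols (using the names from the code later in these notes — x is the binarized bag of words row, r the vector of log-count ratios, b the log of the class ratio), the decision rule is roughly:

$$r \cdot x + b \;\gtrless\; 0, \qquad r_f = \log\frac{p(f\mid 1)}{p(f\mid 0)}, \qquad b = \log\frac{p(\text{class}=1)}{p(\text{class}=0)}$$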

Then we realized: okay, if this is not very good accuracy (80%), why not improve it by saying, hey, we know other ways to calculate a bunch of coefficients and a bunch of biases, which is to learn them in a logistic regression? In other words, this is the formula we use for a logistic regression, so why don’t we just create a logistic regression and fit it? It’s going to give us the same kind of thing, but rather than coefficients and biases which are theoretically correct based on this assumption of independence and based on Bayes’ rule, they’ll be the coefficients and biases that are actually the best for this data. So that was where we got to.

The key insight here is that just about everything in machine learning ends up being either a tree or a bunch of matrix products and nonlinearities [13:54]. Everything seems to come down to the same thing including, as it turns out, Bayes’ rule. And then it turns out that whatever the parameters are in that function, they are better learnt than calculated from theory. Indeed, that’s what happened: when we actually tried learning those coefficients, we got 85%.

Then we noticed that rather than taking the whole term document matrix, we could instead just take the ones and zeros for the presence or absence of a word. Sometimes that was just as good, but then we tried something else, which was adding regularization. With regularization, the binarized approach turned out to be a little better.

So then regularization was where we took the loss function — and again, let’s start with RMSE and then we’ll talk about cross entropy. The loss function was our predictions minus our actuals, squared, summed up and averaged, plus a penalty.

This specifically is the L2 penalty. If, instead, it was the absolute value of w, then it would be the L1 penalty. We also noted that we don’t really care about the loss function per se, we only care about its derivative, since that’s actually the thing that updates the weights. Because this is a sum, we can take the derivative of each part separately, and the derivative of the penalty is just 2aw. So we learnt that even though these are mathematically equivalent, they have different names. This version (2aw) is called weight decay, and that term is used in the neural net literature.
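Written out (this is just the form described above, not a quote from the notebook):

$$L = \frac{1}{n}\sum_i(\hat y_i - y_i)^2 + a\sum_j w_j^2, \qquad \frac{\partial}{\partial w_j}\Big(a\sum_j w_j^2\Big) = 2\,a\,w_j$$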

Cross entropy [16:34]

Excel

Cross entropy, on the other hand, is just another loss function like root mean squared error, but it’s specifically designed for classification. Here is an example of binary cross entropy. Let’s say this is our “is it a cat or a dog?” question, so isCat is 1 or 0. And Preds are our predictions, i.e. the output of the final layer of our neural net, a logistic regression, etc.

Then all we do is say: take the actual times the log of the prediction, add to that 1 minus the actual times the log of 1 minus the prediction, then take the negative of the whole thing.
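In symbols, with y as isCat and ŷ as the prediction, that is:

$$\mathrm{BCE}(y, \hat y) = -\big(y\,\log \hat y + (1-y)\,\log(1-\hat y)\big)$$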

I suggested that you all try to write the if statement version of this, so hopefully you’ve done that by now; otherwise I’m about to spoil it for you. So that was the formula.

How do we write this as an if statement?

from math import log

def binary_cross_entropy(y, ŷ):
    if y == 1: return -log(ŷ)
    else: return -log(1 - ŷ)

So the key insight is that y has two possibilities: 1 or 0. Very often the math can hide the key insight — which I think happens here — until you actually think about which values it can take. So that is all it’s saying: either give me -log(ŷ) or -log(1-ŷ).

Okay, so the multi-category version is just the same thing, but instead of just y == 1, you have y == 0, 1, 2, 3, 4, 5 . . . , for instance [19:26]. That loss function has a particularly simple derivative, and another thing you could play with at home, if you’d like, is thinking about what the derivative looks like when you add a sigmoid or softmax before it. It turns out you end up with very well behaved derivatives.

There are lots of reasons that people use RMSE for regression and cross entropy for classification, but most of it comes back to the statistical idea of a best linear unbiased estimator and the likelihood function; these turn out to have some nice statistical properties. It turns out, however, that in practice — for root mean squared error in particular — the properties are perhaps more theoretical than actual, and nowadays using the absolute deviation rather than the sum of squared deviations can often work better. So in practice, as with everything in machine learning, I normally try both: for a particular dataset, I’ll try both loss functions and see which one works better. Of course, if it’s a Kaggle competition, you’re told how Kaggle is going to judge it, and you should use the same loss function as Kaggle’s evaluation metric.

So this is really the key insight [21:16]. Let’s not use theory but instead learn things from the data. And we hope that we’re going to get better results. Particularly with regularization, we do. Then I think the key regularization insight here is let’s not try to reduce the number of parameters in our model, but instead use lots of parameters and then use regularization to figure out which ones are actually useful.

More features with n-grams [21:41]

Notebook

So then we took that a step further by saying: given we can do that with regularization, let’s create lots more features by adding bigrams and trigrams — bigrams such as “by vast” and “by vengeance”, and trigrams such as “by vengeance .” and “by vera miles”. To keep things running a little bit faster, we limited it to 800,000 features, but even with the full 70 million features, it works just as well and it’s not a heck of a lot slower.

veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize,
                        max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

trn_term_doc.shape
(25000, 800000)

vocab = veczr.get_feature_names()
vocab[200000:200005]
['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']

So we created a term document matrix using the full set of n-grams for the training set and the validation set. Now we can go ahead and say our labels are the training set labels as before, and our independent variables are the binarized term document matrix as before:

y=trn_y
x=trn_term_doc.sign()
val_x = val_term_doc.sign()
p = x[y==1].sum(0)+1
q = x[y==0].sum(0)+1
r = np.log((p/p.sum())/(q/q.sum()))
b = np.log(len(p)/len(q))

And then let’s fit a logistic regression to that and make some predictions, and we get 90% accuracy:

m = LogisticRegression(C=0.1, dual=True)
m.fit(x, y);

preds = m.predict(val_x)
(preds.T==val_y).mean()
0.90500000000000003

So this is looking pretty good.

Back to Naive Bayes [22:54]

Let’s go back to our Naive Bayes. In Naive Bayes, we have this term document matrix and then, for every feature, we are calculating the probability of that feature occurring if it’s class 1, the probability of that feature occurring if it’s class 0, and the ratio of those two.

And in the paper that we are actually basing this off, they call p(f|1) p, and they call p(f|0) q and the ratio r.

So then we say let’s not use these ratios as the coefficients in that matrix multiply. But let’s instead, try and learn some coefficients. So maybe start out with some random numbers, and then try and use stochastic gradient descent to find slightly better ones.

So you’ll notice some important features here. The r vector is rank 1 and its length is equal to the number of features. And of course, our logistic regression coefficient matrix is also rank 1 with length equal to the number of features. They are two ways of calculating the same kind of thing: one based on theory, one based on data. So here are some of the numbers in r:

r.shape, r
((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

Remember it’s using the log, so the numbers which are less than zero represent things which are more likely to be negative, and the ones greater than zero represent things which are more likely to be positive. So here is e to the power of that (e^r). These are the ones we can compare to one rather than zero:

np.exp(r)
matrix([[ 0.94678,  0.85129,  0.78049, ...,  3.     ,  0.5    ,  0.5    ]])

I’m going to do something that hopefully is going to seem weird [25:13]. First of all, I’m going to say what we are going to do and then I’m going to try and describe why it’s weird, and then we’ll talk about why it may not be as weird as we first thought. So here is what we are going to do. We are going to take our term document matrix and we’re going to multiply it by r. So what that means is, I can do it here in Excel, we are going to say let’s grab everything in our term document matrix and multiply it by the equivalent value in the vector of r. So this is like a broadcasted element-wise multiplication, not a matrix multiplication.

So here is the value of the term document matrix times r. In other words, everywhere a zero appears in the term document matrix, a zero appears in the multiplied version; and every time a one appears in the term document matrix, the equivalent value of r appears instead. So we haven’t really changed much; we’ve just changed the ones into something else, i.e. the r for that feature. So what we are now going to do is use this as our independent variables, instead, in our logistic regression.

So here we are [26:56]. x_nb (x Naive Bayes version) is x times r. And now let’s do a logistic regression, fitting using those independent variables. Let’s then do that for the validation set, get the predictions, and lo and behold, we have a better number:

x_nb = x.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y);

val_x_nb = val_x.multiply(r)
preds = m.predict(val_x_nb)
(preds.T==val_y).mean()
0.91768000000000005

Let me explain why this hopefully seems surprising. So that’s our independent variable (highlighted below), and the logistic regression has come up with some set of coefficients (let’s pretend for a moment that these are the coefficients it happened to come up with).

We could now say: well, let’s not use this set (x_nb) of independent variables, but use the original binarized feature matrix instead, and then multiply all of our coefficients by the values in r — and we’re going to get exactly the same result mathematically.

So we’ve got our x Naive Bayes version (x_nb) of the independent variables, and we’ve got some set of weights/coefficients (w1) that it found to be a good set of coefficients for making predictions. But x_nb is simply equal to x times (element-wise) r.

So in other words, x_nb·w1 is equal to (x*r)·w1. So we could just change the weights to be r·w1 and get the same number. This ought to mean that the change we made to the independent variables should not have made any difference, because we can calculate exactly the same thing without making that change. So here’s the question: why did it make a difference? To answer it, you need to think about what things aren’t mathematically the same. Why is it not identical? Come up with some hypotheses about why we might actually have ended up with a better answer. And to figure that out, we need to first of all start with why it’s even a different answer. This is subtle.
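You can check the “mathematically the same” claim directly with a couple of lines (toy random data, nothing from the actual notebook):

import numpy as np

np.random.seed(0)
x = (np.random.rand(5, 4) > 0.5).astype(float)   # toy binarized term document matrix
r = np.random.randn(4)                           # toy log-count ratios
w = np.random.randn(4)                           # some set of coefficients

# predicting from (x * r) with weights w ...
a1 = (x * r) @ w
# ... is identical to predicting from x with weights (r * w)
a2 = x @ (r * w)

print(np.allclose(a1, a2))   # True — the two parameterizations can express the same function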

Discussions [30:33~32:46]

They are getting impacted differently by regularization. Our loss was equal to our cross entropy loss based on the predictions and actuals plus our penalty:
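That is, roughly:

$$L = \mathrm{crossentropy}(x\,w,\; y) + a\sum w^2$$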

So if your weights are large, then the penalty (aw²) gets bigger, and it drowns out the cross entropy piece (the cross entropy of xw and y). But that’s actually the piece we care about — we actually want it to be a good fit. So we want to have as little regularization going on as we can get away with. So we want lesser weights (I kind of use the two words “less” and “lesser” a little interchangeably, which is not quite fair, I agree, but the idea is that weights that are pretty close to zero are kind of not there).

Here is the thing [34:38]. Our values of r — and I’m not a Bayesian weenie, but I’m still going to use the word “prior” — are kind of like a prior: we think the different levels of importance, and whether these different features are positive or negative, might be something like that. We think that “bad” might be more correlated with negative than “good”. So our implicit assumption before was that we have no prior; in other words, when we said squared weights (w²), we were saying a non-zero weight is something we don’t want to have. But actually what I really want to say is that differing from the Naive Bayes expectation is something I don’t want to do: only vary from the Naive Bayes prior if you have good reason to believe otherwise.

So that’s actually what this ends up doing. We end up saying: we think this value is probably 3, so if you’re going to make it a lot bigger or a lot smaller, that’s going to create the kind of variation in weights that causes the squared term to go up. So if you can, just leave all these values similar to where they are now. That’s what the penalty term is now doing: when our input is already multiplied by r, it penalizes things where we’re varying from our Naive Bayes prior.

Question: Why multiply only with r and not r² or something like that, when the variance would be much higher [36:40]? Because our prior comes from an actual theoretical model. I said I don’t like to rely on theory, but if I have some theory, then maybe we should use that as our starting point rather than starting off by assuming everything is equal. So our prior said: hey, we’ve got this model called Naive Bayes, and the Naive Bayes model said that if its assumptions were correct, then r is the correct coefficient in this specific formulation. That’s why we picked that — because our prior is based on that theory.

So this is a really interesting insight which I never really see covered [37:34]. The idea that we can use these traditional machine learning techniques, we can imbue them with this kind of Bayesian sense by starting out incorporating our theoretical expectations into the data that we give our model. And when we do so, that then means we don’t have to regularize as much. And that’s good because we regularized a lot… let’s just try it!

Remember, in the sklearn logistic regression, C is the reciprocal of the amount of regularization penalty. So we will add lots of regularization by making it small (1e-5).
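The cell itself isn’t reproduced in these notes, but the experiment looks roughly like this (a sketch; I’m assuming it’s fit on the plain binarized x from earlier):

from sklearn.linear_model import LogisticRegression  # already imported in the notebook

# heavy regularization: in sklearn's parameterization, a small C means a large penalty
m = LogisticRegression(C=1e-5, dual=True)
m.fit(x, y)

preds = m.predict(val_x)
print((preds.T == val_y).mean())   # noticeably worse than the ~0.905 we got with C=0.1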

That really hurts our accuracy, because now it’s trying really hard to get those weights down; the loss function is overwhelmed by the need to reduce the weights, and the need to make it predictive seems totally unimportant. So instead of saying “push the weights down so far that you end up ignoring the terms”, we say “push them down so that you get rid of differences from our expectation based on the Naive Bayes formulation”. And that ends up giving us a very nice result.

Baselines and Bigrams: Simple, Good Sentiment and Topic Classification [39:44]

Paper

This technique was originally presented in 2012. Chris Manning is a terrific NLP researcher at Stanford and Sida Wang who I don’t know but I assume is awesome because his paper is awesome. They basically came up with this idea. What they did was they compared it to a number of other approaches on a number of other datasets. So one of the things they tried is the IMDB dataset. So here is Naive Bayes SVM on bigrams:

As you can see, this approach outperformed the other linear-based approaches they looked at, and also some restricted Boltzmann machine kind of neural net based approaches they looked at. Nowadays there are better ways to do this, and in fact in the deep learning course we showed a new state-of-the-art result we just developed at fast.ai that gets well over 94%. But still, particularly for a linear technique that’s easy, fast, and intuitive, this is pretty good. And you’ll notice, when they did this, they only used bigrams. I assume that’s because I looked at their code and it was pretty slow and ugly; I figured out a way to optimize it a lot more, as you saw, and so we were able to use trigrams, so we get quite a lot better: 91.8% versus their 91.2%. But other than that, it’s identical. Oh, also they used a support vector machine, which is almost identical to a logistic regression in this case, so there are some minor differences. So I think that’s a pretty cool result.

I will mention that what you get to see here in class is the result of many weeks, and often many months, of research that I do [41:32]. So I don’t want you to think this stuff is obvious — it’s not at all. Reading this paper, there’s no description of why they use this model, how it’s different, or why they thought it works. It took me a week or two to even realize that it’s mathematically equivalent to a normal logistic regression, and then a few more weeks to realize that the difference is actually in the regularization. This is kind of like machine learning, as I’m sure you’ve noticed from the Kaggle competitions you enter. You come up with a thousand good ideas; 999 of them, no matter how confident you are they are going to be great, always turn out to be crap. Then finally, after four weeks, one of them works and gives you the enthusiasm to spend another four weeks of misery and frustration. This is the norm. And for sure, the best practitioners I know in machine learning all share one particular trait in common, which is that they are very, very tenacious — also known as stubborn and bloody-minded, which is definitely a reputation I seem to have, probably fair — along with another thing, which is that they are all very good coders. They are very good at turning their ideas into new code. So this was a really interesting experience for me, working through this a few months ago, trying to figure out how to explain why this, at the time, state-of-the-art result exists.

Even better version: NBSVM++ [43:31]

So once I figured that out, I was actually able to build on top of it and make it quite a bit better, and I’ll show you what I did. This is where it was very handy to have PyTorch at my disposal, because I was able to create something that was customized just the way I wanted it to be, and also very fast, by using the GPU. So here is the kind of fast.ai version of NBSVM. Actually, my friend Stephen Merity, who is a terrific researcher in NLP, has christened this NBSVM++, which I thought was lovely, so here’s that — even though there is no SVM, it’s a logistic regression, but as I said, nearly exactly the same thing.

So let me first of all show you the code. Once I figured out this is the best way I can come up with to do a linear bag of words model, I embedded it into Fast AI so you can just write a couple lines of code.

sl = 2000

# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

So the code basically says: I want to create a data class for text classification, and I want to create it from a bag of words (from_bow). Here is my bag of words (trn_term_doc), here are my labels (trn_y), here are the same things for the validation set, and use up to 2000 unique words per review, which is plenty.

So then from that model data, construct a learner which is kind of the Fast AI generalization of a model which is based on a dot product of Naive Bayes and then fit that model.

learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)
[ 0.       0.0251   0.12003  0.91552]

learner.fit(0.02, 2, wds=1e-6, cycle_len=1)
[ 0.       0.02014  0.11387  0.92012]
[ 1.       0.01275  0.11149  0.92124]

learner.fit(0.02, 2, wds=1e-6, cycle_len=1)
[ 0.       0.01681  0.11089  0.92129]
[ 1.       0.00949  0.10951  0.92223]

And after 5 epochs, we were already up to 92.2%. So this is now getting quite a bit above the linear baseline in the original paper. So let me show you the code for that.

The code is horrifyingly short. This is it. And it will also look, on the whole, extremely familiar. There are a few tweaks here: where it says Embedding, pretend for now that it actually says Linear — I’m going to explain embedding in a moment. So we’ve basically got a linear layer with the number of features as the rows (and remember, sklearn “features” basically means number of words). Then for each word, we’re going to create one weight, which makes sense: it’s a logistic regression, so each word has one weight. And then we are going to be multiplying it by the r value, so for each word we have one r value per class. I actually made this so it can handle not just positive versus negative, but, say, figuring out which author created this work — there could be five or six authors, for example.

And basically we use those linear layers to get the value of the weight and the value of the r, then we take the weight times the r and then sum it up. So that’s just a simple dot product just as we would do for any logistic regression and then do the softmax. The very minor tweak we added to get the better result is this +self.w_adj:
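The module itself isn’t reproduced in these notes, but based on the pieces quoted here (Embedding layers for w and r, the +self.w_adj and /self.r_adj tweaks, a dot product, then a softmax), the forward pass looks roughly like this — treat it as a sketch rather than the exact fastai source, and the constructor details as my guesses:

import torch.nn as nn
import torch.nn.functional as F

class DotProdNB(nn.Module):
    """Sketch of an NBSVM-style model: one weight per feature, one r value per feature per class."""
    def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
        super().__init__()
        self.w_adj, self.r_adj = w_adj, r_adj
        self.w = nn.Embedding(nf + 1, 1, padding_idx=0)   # one learned weight per feature
        self.w.weight.data.uniform_(-0.1, 0.1)
        self.r = nn.Embedding(nf + 1, ny)                 # the Naive Bayes log-ratios, one per class

    def forward(self, feat_idx):
        w = self.w(feat_idx)                              # look up weights for the words present
        r = self.r(feat_idx)                              # look up their r values
        x = ((w + self.w_adj) * r / self.r_adj).sum(1)    # dot product with the shifted weights
        return F.softmax(x, dim=-1)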

The thing I’m adding is a parameter, but I pretty much always use the default value of 0.4. So what does this do? What it’s doing is again changing the prior. If you think about it, even once we use r times the term document matrix as our independent variables, the penalty terms are still pushing w down to zero.

So what does it mean for w to be zero? What would it mean if the coefficients were all 0’s?

When we multiply this matrix by those coefficients, we still get zero. So a weight of zero still ends up saying “I have no opinion on whether this thing is positive or negative.” On the other hand, if they were all 1’s, then it basically says my opinion is that the Naive Bayes coefficients are exactly right. So the idea is that zero is almost certainly not the right prior; we shouldn’t really be saying that no coefficient means “ignore the Naive Bayes coefficient”. And 1 is probably too high, because we actually think that Naive Bayes is only part of the answer. So I played around with a few different datasets where I basically said: take the weights and add some constant to them. So zero would become, in this case, 0.4. In other words, the regularization penalty is pushing the weights not towards zero but towards this value. I found that across a number of datasets 0.4 works pretty well and is pretty resilient. Again, the basic idea is to get the best of both worlds: we are learning from the data using a simple model, but we are incorporating our prior knowledge as best we can. It turns out that when you tell it a weight matrix of zeros actually means “use about half of the r values”, that works better than a prior that the weights should all be zero.

Question: Is w the thing denoting the amount of regularization required [50:31]? w are the weights. So x = ((w+self.w_adj)*r/self.r_adj).sum(1) is calculating our activations. We calculate our activations as being equal to the weights times the r, then sum. So that’s just our normal linear function. The thing which is being penalized is my weight matrix; that’s what gets penalized. So we’re saying: hey, don’t just use w — use w+0.4. That 0.4 (i.e. self.w_adj) is not being penalized; it’s not part of the weight matrix. So effectively, the weight matrix gets 0.4 for free.

Question: By doing this, even after regularization, every feature is getting some form of minimum weight [51:50]? Not necessarily, because it could end up choosing a coefficient of -0.4 for a feature, and that would say: “you know what, even though Naive Bayes says the r should be whatever for this feature, I think you should totally ignore it.”

A couple of questions during the break [52:46]. The first was a bit of a summary as to what’s going on here:

Here we have w plus weight adjustment times r:

So normally, what we are doing in logistic regression is basically x·w (I’m going to ignore the bias). Then we changed it to be (r*x)·w. Then we said: let’s do the x·w bit first. This thing here I actually call w, which is probably a bad name — it’s really w times x:

So instead of r(x·w), I’ve got (x·w plus a constant), times r. The key idea here is that regularization wants the weights to be zero, because it’s trying to reduce Σw². So we want to push the weights towards zero because that’s our default starting-point expectation, and we want to be in a situation where, if the weights are zero, we have a model that makes theoretical or intuitive sense to us. This model (r(x·w)), if the weights are zero, doesn’t make intuitive sense, because it says: hey, multiply everything by zero and you get rid of everything. We are actually saying “no, we actually think our r is useful and we want to keep it.” So instead, let’s take (x·w) and add 0.4 to it. Now, if the regularizer is pushing the weights towards zero, it’s pushing the value of that sum towards 0.4.

Therefore, it’s pushing the whole model towards 0.4 times r. In other words, our default starting point, if you regularized all the weights away altogether, is to say “yeah, you know, let’s use a bit of r — that’s probably a good idea.” So that’s the idea: think about what happens when the weights are zero, and make sure that’s something sensible, because otherwise regularizing the weights to move in that direction wouldn’t be such a good idea.

The second question was about n-grams [56:55]. So the N in n-gram can be uni, bi, tri, whatever. 1, 2, 3, whatever grams. So “This movie is good” has four unigrams: This, movie, is, good. It has three bigrams: This movie, movie is, is good. It has two trigrams: This movie is, movie is good.
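A tiny illustration of that, in plain Python (not how CountVectorizer does it internally):

def ngrams(tokens, n):
    # every run of n consecutive tokens
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = "This movie is good".split()
print(ngrams(tokens, 1))   # ['This', 'movie', 'is', 'good']
print(ngrams(tokens, 2))   # ['This movie', 'movie is', 'is good']
print(ngrams(tokens, 3))   # ['This movie is', 'movie is good']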

Question: Do you mind going back to the w_adj or 0.4 stuff? I was wondering if this adjustment will harm the predictability of the model — think of the extreme case: if it’s not 0.4 but 4,000, then all coefficients will be essentially… [57:45]? Exactly. So our prior needs to make sense. This is why it’s called DotProdNB: the prior is that Naive Bayes is a good prior. So Naive Bayes says that r = p/q is a good prior, and not only do we think it’s a good prior, we think r·x + b is a good model — that’s the Naive Bayes model. So in other words, we expect that a coefficient of 1 is a good coefficient, not 4,000. Specifically, we think zero is probably not a good coefficient. But we also think that maybe the Naive Bayes version is a little overconfident, so maybe 1 is a little high. So we are pretty sure that the right number, assuming the Naive Bayes model is appropriate, is between 0 and 1.

Question continued: But what I was thinking is, as long as it’s not zero, you are pushing the coefficients that are supposed to be zero to something non-zero, and making the high coefficients less distinguishable from the zero coefficients [59:24]? Well, but you see, they are not supposed to be zero — they are supposed to be r. And remember, this is inside our forward function, so it’s part of what we are taking the gradient of. So it’s basically saying: okay, you can still set self.w to anything you like; it’s just that the regularizer wants it to be zero. So all we are saying is: okay, if you want it to be zero, then I’ll try to make zero give a sensible answer.

Nothing says 0.4 is perfect for every dataset. I’ve tried a few different datasets and found various numbers between 0.3 and 0.6 that are optimal. But I’ve never found a dataset where 0.4 is worse than zero, which is not surprising, and I’ve also never found one where 1 is better. So the idea is that this is a reasonable default, but it’s another parameter you can play with, which I kind of like — it’s another thing you could use a grid search or whatever to figure out what’s best for your dataset. Really, the key here is that every model before this one, as far as I know, has implicitly assumed it should be zero, because they don’t have this parameter. And by the way, I’ve actually got a second parameter here (r_adj=10) as well, which is the same kind of thing for r: I divide r by a parameter. I’m not going to worry too much about it now, but it’s another knob you can use to adjust the nature of the regularization. In the end, I’m an empiricist, not a theoretician. I thought this seemed like a good idea. Nearly all of the things that seem like a good idea to me turn out to be stupid. This particular one gave good results on this dataset and a few others as well.

Question: I am still confused about w + w_adj. You mentioned we do w + w_adj so that the coefficients don’t get set to zero — that we place some importance on the prior. But you also said that the effect of learning can be that w gets set to a negative value, which could make w + w_adj zero. So if we are allowing the learning process to set the coefficients to zero anyway, why is that different from just having w [1:01:47]? Because of regularization — because we are penalizing it by Σw². In other words, we are saying: if the best thing to do is to ignore the value of r, that’ll cost you (Σw²); you are going to have to set w to a negative number. So only do that if it’s clearly a good idea — unless it’s clearly a good idea, leave it where it is. That’s the only reason. All of this stuff we’ve done today is basically about maximizing the advantage we get from regularization, and saying that regularization pushes us towards some default assumption. Nearly all of the machine learning literature assumes that default assumption is that everything is zero. I am saying that it makes sense theoretically, and turns out empirically, that you should actually decide what your default assumption is — and that’ll give you better results.

Question continued: So would it be right to say that in a way you are putting an additional hurdle along the way towards getting all coefficients to zero, and it will be able to do that if it is really worth it [1:03:30]? Yes, exactly. Without this, the default hurdle is making a coefficient non-zero. Now I’m saying: no, the hurdle is making the coefficient not be equal to 0.4r.

Question: So this is sum of w² times some constant. If the constant was, say 0.1, then the weight might not go towards zero. Then we might not need weight decay [1:04:03]? If the value of the constant is zero, then there is no regularization. But if this value is higher than zero, then there is some penalty. And presumably, we’ve set it to nonzero because we were overfitting. So we want some penalty. So if there is some penalty, then my assertion is that we should penalize things that are different to our prior, not that we should penalize things that are different to zero. And our prior is that things should be around about equal to r.

Embedding [1:05:17]

I want to talk about Embedding. I said to pretend it’s linear, and indeed we can pretend it’s linear. Let me show you how much we can pretend it’s linear — as in nn.Linear, creating a linear layer.

Here is our data matrix, here are our coefficients r if we are doing the r version. So if we were to put r into a column vector, then we could do a matrix multiply of the data matrix by the coefficients.

So the matrix multiply of this independent variable matrix by this coefficient matrix is going to give us an answer. So the question is: why didn’t Jeremy write nn.Linear? Why did Jeremy write nn.Embedding? The reason is, if you recall, we don’t actually store it like this, because this is actually of width 800,000 and height 25,000. So rather than storing it like that, we store it as this:

The way we store it is: for each bag of words, which word indexes does it contain? This is a sparse way of storing it — it just lists out the indexes in each document. Given that, I want to do the matrix multiply I just showed you to create the same outcome, but from the sparse representation. The original matrix is basically one hot encoded:

It’s kind of like a dummy matrix version. Does it have a word “this”? Does it have a word “movie”? And so forth. So if we took the simple version of does it have a word “this” (i.e. 1, 0, 0, 0, 0, 0) and we multiplied that by r, then it’s just going to return the first item:

So in general, a one hot encoded vector times a matrix is identical to looking up that matrix to find the n-th row in it. So this is just saying find the 0th, first, second, and fifth coefficients:
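Here’s a quick way to convince yourself of that with toy numbers (a hypothetical 6-word vocabulary):

import torch
import torch.nn as nn

emb = nn.Embedding(6, 1)              # 6 "words", one coefficient per word (like r)
idx = torch.tensor([0, 1, 2, 5])      # the word indexes present in a document

# the one hot encoded version: one row per word present
one_hot = torch.zeros(4, 6)
one_hot[torch.arange(4), idx] = 1.0

by_matmul = one_hot @ emb.weight      # matrix multiply by the one hot encoded rows
by_lookup = emb(idx)                  # simple array lookup of rows 0, 1, 2, 5

print(torch.allclose(by_matmul, by_lookup))   # True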

They are exactly the same thing. In this case, I only have one coefficient per feature, but actually the way I did this was to have one coefficient per feature for each class. In this case, the classes are positive and negative. So I actually had r positive (p/q) and r negative (q/p):

In the binary case, obviously it’s redundant to have both. But what if it was like what’s the author of this text? Is it Jeremy, Savannah, or Terrence? Now we’ve got three categories, we want three values of r. So the nice thing is doing this sparse version, you can just look up the 0th, first, second and fifth.

Again, it’s mathematically identical to multiplying by a one hot encoded matrix, but when you have sparse inputs, it’s obviously much, much more efficient. So this computational trick — which is mathematically identical to, not just conceptually analogous to, multiplying by a one hot encoded matrix — is called an embedding. I’m sure most of you have probably heard about embeddings, like word embeddings: Word2Vec, GloVe, etc. People love to make them sound like some amazing new complex neural net thing. They are not. Embedding means making a multiplication by a one hot encoded matrix faster by replacing it with a simple array lookup. That’s why I said you can think of this as if it said self.w = nn.Linear(nf+1, 1):

Because it actually does the same thing. It actually is a matrix with those dimensions — it is a linear layer — but it expects that the input we are going to give it is not actually a one hot encoded matrix but a list of integers: the indexes for each word of each item. So you can see that the forward function in fast.ai automatically gets (for the DotProdNB learner) the feature indexes (feat_idx):

They come from the sparse matrix automatically; numpy makes it very easy to just grab those indexes. So in other words, we’ve got here (feat_idx) a list of each word index, out of the 800,000, that is in this document. Then this (self.w(feat_idx)) says: look up each of those in our embedding matrix, which has 800,000 rows, and return each thing that you find. It’s mathematically identical to multiplying by the one hot encoded matrix. That’s all an embedding is. And what that means is that we can now build any kind of model — whatever kind of neural network — where we have potentially very high cardinality categorical variables as our inputs. We can just turn them into a numeric code between zero and the number of levels, and then learn a linear layer from that as if we had one hot encoded it, without ever actually constructing the one hot encoded version and without ever actually doing that matrix multiply. Instead, we just store the index version and simply do the array lookup. As for the gradients that flow back: in the one hot encoded version, everything that was zero has no gradient, so the gradients flowing back just update the particular rows of the embedding matrix that we used. That’s fundamentally important for NLP. Just like here, I wanted to create a PyTorch model that would implement this ridiculously simple little equation.

To do it without this trick would have meant feeding in a 25,000 by 800,000 element array, which would have been kind of crazy. So this trick allowed me to write this: I just replaced the word Linear with Embedding, replaced the thing that feeds the one hot encodings in with something that just feeds the indexes in, and that was it. It kept working, and this now trains in about a minute per epoch.

What we can now do is we can now take this idea and apply it not just to language but to anything [1:15:30]. For example predicting the sales of items at a grocery store.

Question: We are not actually looking up anything, right? We are just seeing that array with the indices that is the representation [1:15:52]? So we are doing a lookup. The representation that’s being stored for the bag of words is now not 1 1 1 0 0 1 but 0 1 2 5. So then we actually have to do our matrix product. But rather than doing the matrix product, we look up the zero-th thing and the first thing, the second thing, and the fifth thing.

Question continued: So that means we are still retaining the one hot encoded matrix [1:16:31]? No, we didn’t. There is no one hot encoded matrix used here. The one hot encoded matrix is not currently highlighted. We’ve currently highlighted the list of indexes and the list of coefficients from the weight matrix:

So what we are going to do now is go a step further and say: let’s not use a linear model at all, let’s use a multi-layer neural network [1:16:58]. And let’s have the input to that potentially include some categorical variables. Those categorical variables we will just have as numeric indexes, so the first layer for those won’t be a normal linear layer — it will be an embedding layer, which we know behaves exactly like a linear layer mathematically. Then our hope is that we can use this to create a neural network for any kind of data.

Rossmann competition [1:17:40]

Notebook

There was a competition on Kaggle a few years ago called Rossmann — a German grocery chain — where they asked you to predict the sales of items in their stores, and that involved a mixture of categorical and continuous variables. In this paper by Guo/Berkhahn, they described their third place winning entry, which was much simpler than the first place winning entry but nearly as good, because they took advantage of this idea of what they call entity embeddings. In the paper, they thought they had invented this; actually it had been written about earlier by Yoshua Bengio and his co-authors for another Kaggle competition, which was predicting taxi destinations. Although, I will say, I feel like Guo went a lot further in describing how this can be used in many other ways, so we’ll talk about that as well.

The notebook is in the deep learning repo because we talked about some of the deep learning specific aspects in the deep learning course, whereas in this course we are going to be talking mainly about the feature engineering, and we are also going to be talking about this embedding idea.

Let’s start with the data. The data says: store number 1, on the 31st of July 2015, was open; they had a promotion going on; there was a school holiday; it was not a state holiday; and they sold 5,263 items. That’s the key data they provided, and the goal is obviously to predict sales in a test set that has the same information without the sales. They also tell you that each store is of some particular type, it sells some particular assortment of goods, its nearest competitor is some distance away, the competitor opened in September 2008, and there’s some more information about promos whose details I don’t know. As in many Kaggle competitions, they let you download external datasets if you wish, as long as you share them with other competitors. They also told you what state each store is in, so people downloaded the names of the different states of Germany, they downloaded a file of some kind of Google trend data for each state in Germany for each week (I don’t know which specific Google trend they got, but there was that), and for each date they downloaded a bunch of temperature information. And that’s it.

One interesting insight here is that there was probably a mistake in some ways for Rossmann to design this competition as being one where you could use external data [1:21:05]. Because in reality, you don’t actually get to find out next week’s weather or next week’s Google trends. But when you are competing in Kaggle, you don’t care about that. You just want to win so you use whatever you can get.

Data cleaning [1:21:35]

Let’s talk, first of all, about data cleaning. There wasn’t really much feature engineering done in this third place winning entry, particularly by Kaggle standards, where normally every last thing counts. This is a great example of how far you can get with a neural net, and it certainly reminds me of the claims prediction competition we talked about yesterday, where the winner did no feature engineering and relied entirely on deep learning. The laughter in the room, I guess, is from people who did a little bit more than no feature engineering in that competition 😄

I should mention, by the way: that bit where you work hard at a competition, and then it closes and you didn’t win, and the winner comes out and says “this is how I won” — that’s the bit where you learn the most. Sometimes that’s happened to me and it’s been like, oh, I thought of that, I thought I tried that; then I go back and realize I had a bug there, or I didn’t test properly, and I learn that I really need to test this thing in this different way. Sometimes it’s: oh, I thought of that but I assumed it wouldn’t work — I’ve really got to remember to check everything before I make any assumptions. And sometimes it’s just: oh, I did not think of that technique; wow, now I know it’s better than everything I just tried. Because otherwise, if somebody says “hey, here is a really good technique”, you’re like, okay, great. But when you’ve spent months trying to do something and somebody else did it better by using that technique, that’s pretty convincing.

So it’s kind of hard: I’m standing up in front of you saying here is a bunch of techniques I’ve used, I’ve won some Kaggle competitions, and I’ve got some state of the art results — but that’s second-hand information by the time it hits you. So it’s really great to try things out. And it’s been nice to see, particularly in the deep learning course, that quite a few of my students have taken a technique I said works really well, tried it, got into the top ten of a Kaggle competition the next day, and concluded that, okay, that counts as working really well. So Kaggle competitions are helpful for lots and lots of reasons, but one of the best is what happens after they finish. So definitely, for the ones that are now finishing up, make sure you watch the forums, see what people are sharing in terms of their solutions, and if you want to learn more, feel free to ask the winners: hey, would you tell me more about this or that? People are normally good about explaining. Then, ideally, try to replicate it yourself. That can turn into a great blog post or a great kernel: to be able to say, such-and-such said that they used this technique, here is a really short explanation of what that technique is, here is a little bit of code showing how it’s implemented, and here are the results showing you can get the same result. That can be a really interesting write-up as well.

It’s always nice to have your data be as easy to understand as possible [1:24:58]. So in this case the data that came from Kaggle used various integers for the holidays. We can just use a boolean of was it a holiday or not. So just clean that up:

train.StateHoliday = train.StateHoliday!='0'
test.StateHoliday = test.StateHoliday!='0'

We’ve got quite a few different tables, and we need to join them all together. I have a standard way of joining things together with pandas: I just use the pandas merge function, and I always do a left join. A left join is where you retain all the rows in the left table; you have a key column on the left which you match with a key column in the right table, and you bring in the matching rows from the right table (keeping every left row even if there is no match).

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))

The key reason that I always do a left join is that after I do the join, I always then check if there were things in the right-hand side that are now null:

store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])

Because if so, it means that I missed some things. I haven’t shown it here, but I also check the number of rows hasn’t varied before and after. If it has, that means that the right hand side table wasn’t unique. So even when I’m sure something is true, I always also assume that I’ve screwed it up. So I always check.
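That row count check isn’t shown here, but it can be a couple of lines, something like this (a hypothetical sketch reusing join_df from above):

n_before = len(store)
store = join_df(store, store_states, "Store")
assert len(store) == n_before            # the right-hand table's keys were unique
assert store.State.isnull().sum() == 0   # every store found a matching state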

I could go ahead and merge the state names into the weather:

weather = join_df(weather, state_names, "file", "StateName")

If you look at the Google trends table, it’s got this week range which I need to turn into a date in order to join it [1:26:45]:

The nice thing about doing this in Pandas is that Pandas gives us access to all of Python. So for example, inside the series object, there is a .str attribute that gives you access to all the string processing functions. Just like .cat gives you access to the categorical functions, .dt gives you access to the date time functions. So I can now split everything in that column.

googletrend['Date']=googletrend.week.str.split(' - ',expand=True)[0]
googletrend['State']=googletrend.file.str.split('_', expand=True)[2]
googletrend.loc[googletrend.State=='NI', "State"] = 'HB,NI'

And it’s really important to use these pandas functions because they are vectorized and accelerated, often through SIMD or at least through C code, so they run nice and quickly.

And as per usual, let’s add date metadata to our dates [1:27:41]:

add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)

In the end, we are basically denormalizing all these tables — we are going to put them all into one table. In the Google trend table, the trends were mainly by state, but there were also trends for the whole of Germany, so we put the whole-of-Germany ones into a separate data frame so that we can join them separately:

trend_de = googletrend[googletrend.file == 'Rossmann_DE']

So we are going to have Google trend for this state and Google trend for the whole of Germany.

Now we can go ahead and start joining, both for the training set and for the test set [1:28:19], checking each time that we don’t have nulls.

store = join_df(store, store_states, "Store")
len(store[store.State.isnull()])
0

joined = join_df(train, store, "Store")
joined_test = join_df(test, store, "Store")
len(joined[joined.StoreType.isnull()]), len(joined_test[joined_test.StoreType.isnull()])
(0, 0)

joined = join_df(joined, googletrend, ["State", "Year", "Week"])
joined_test = join_df(joined_test, googletrend, ["State", "Year", "Week"])
len(joined[joined.trend.isnull()]), len(joined_test[joined_test.trend.isnull()])
(0, 0)

joined = joined.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
joined_test = joined_test.merge(trend_de, 'left', ["Year", "Week"], suffixes=('', '_DE'))
len(joined[joined.trend_DE.isnull()]), len(joined_test[joined_test.trend_DE.isnull()])
(0, 0)

joined = join_df(joined, weather, ["State", "Date"])
joined_test = join_df(joined_test, weather, ["State", "Date"])
len(joined[joined.Mean_TemperatureC.isnull()]), len(joined_test[joined_test.Mean_TemperatureC.isnull()])
(0, 0)

In my merge function, if there are two columns with the same name, I set the suffix on the left to be nothing at all, so it doesn’t mess with the name, and the suffix on the right-hand side to be _y.

In this case, I didn’t want any of the duplicate ones, so I just went through and deleted them:

for df in (joined, joined_test):
    for c in df.columns:
        if c.endswith('_y'):
            if c in df.columns: df.drop(c, inplace=True, axis=1)

for df in (joined, joined_test):
    df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
    df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
    df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

The main competitor for this store has been open since some date [1:28:54]. We can just use pandas to_datetime, passing in the year, the month, and the day. That’s going to give us an error unless they all have years and months, so we fill in the missing ones with 1900 and 1 (see above). And what we really want to know is how long this store has been open for at the time of this particular record, so we can just do a date subtraction:

for df in (joined, joined_test):
    df["CompetitionOpenSince"] = pd.to_datetime(dict(year=df.CompetitionOpenSinceYear,
                                                     month=df.CompetitionOpenSinceMonth,
                                                     day=15))
    df["CompetitionDaysOpen"] = df.Date.subtract(df.CompetitionOpenSince).dt.days

Now if you think about it, sometimes the competition opened later than this particular row, so sometimes the result is going to be negative, and it probably doesn’t make sense to have negatives (i.e. the competitor is going to open in x days’ time). Having said that, I would never put in something like this without first running a model with it and without it, because our assumptions about the data very often turn out not to be true. In this case, I didn’t invent any of these pre-processing steps — I wrote all the code, but it’s all based on the third place winner’s GitHub repo. So knowing what it takes to get third place in a Kaggle competition, I’m pretty sure they would have checked every one of these pre-processing steps and made sure it actually improved their validation set score.

for df in (joined, joined_test):
    df.loc[df.CompetitionDaysOpen<0, "CompetitionDaysOpen"] = 0
    df.loc[df.CompetitionOpenSinceYear<1990, "CompetitionDaysOpen"] = 0

[1:30:44]

So what we are going to be doing is creating a neural network where some of the inputs are continuous and some are categorical. That means the neural net will have an initial weight matrix and an input feature vector. Some of the inputs are just going to be plain continuous numbers (e.g. the maximum temperature, or the number of kilometers to the nearest competitor), and some of them are, effectively, going to be one hot encoded. But we are not actually going to store them as one hot encoded — we are going to store the index.

So the neural net model needs to know which of these columns it should create an embedding for (i.e. which ones it should treat as if they were one hot encoded) and which ones it should feed directly into the linear layer. We’ll tell the model which is which when we get there, but we need to think ahead of time about which columns we want to treat as categorical and which as continuous. In particular, for things that we are going to treat as categorical, we don’t want to create more categories than we need. Let me show you what I mean.

The third place getters in this competition decided that the number of months the competition had been open was something they were going to use as a categorical variable. So in order to avoid having more categories than they needed, they truncated it at 24 months: anything more than 24 months, truncate to 24. So here are the unique values of CompetitionMonthsOpen, and it’s all the numbers from naught to 24. That means there’s going to be an embedding matrix with an embedding vector for things that aren’t open yet (0), for things that have been open for a month (1), and so forth.

for df in (joined, joined_test):
    df["CompetitionMonthsOpen"] = df["CompetitionDaysOpen"]//30
    df.loc[df.CompetitionMonthsOpen>24, "CompetitionMonthsOpen"] = 24

joined.CompetitionMonthsOpen.unique()
array([24,  3, 19,  9,  0, 16, 17,  7, 15, 22, 11, 13,  2, 23, 12,  4, 10,  1, 14, 20,  8, 18,  6, 21,  5])

Now, they absolutely could have done that as a continuous variable [1:33:14]. They could have just had a single number for how many months it had been open, treated it as continuous, and fed it straight into the initial weight matrix. What I found, though, and obviously what these competitors found, is that where possible, it’s best to treat things as categorical variables. The reason is that when you feed something through an embedding matrix, every level can be treated totally differently. For example, in this case, whether something has been open for zero months or one month is really different. If you fed that in as a continuous variable, it would be difficult for the neural net to find a functional form with that big a difference. It’s possible, because a neural net can do anything, but you are not making it easy for it. Whereas if you use an embedding and treat it as categorical, it will have a totally different vector for zero versus one. So it seems that, particularly as long as you’ve got enough data, treating columns as categorical variables where possible is a better idea. When I say where possible, that basically means where the cardinality is not too high. If it was something like the sales ID number, uniquely different on every row, you can’t treat that as a categorical variable, because it would be a huge embedding matrix and everything only appears once; ditto for kilometers away from the nearest store to two decimal places — you wouldn’t make that a categorical variable.

So that’s the rule of thumb they used in this competition. In fact, if we scroll down to their choices, here is how they did it:

Their continuous variables were things that were genuinely continuous, like the number of kilometers to the competitor, the temperature data, the specific numbers in the Google trend, etc. Whereas everything else, basically, they treated as categorical.
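The actual lists live in the notebook rather than in these notes, but the shape of that choice is roughly this (illustrative only — the column names are partly guesses based on the fields described above):

# which columns get an embedding vs. get fed straight into the linear layer
cat_vars = ['Store', 'StoreType', 'Assortment', 'StateHoliday', 'SchoolHoliday',
            'Promo', 'CompetitionMonthsOpen', 'State', 'Year', 'Week']      # illustrative guesses
contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
               'trend', 'trend_DE']                                         # illustrative guesses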

That’s it for today. So next time, we’ll finish this off. We’ll see how to turn this into a neural network and kind of wrap things up. See you then!

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12
