Deep Learning 2: Part 1 Lesson 5

My personal notes from the course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel, who gave me this opportunity to learn.

Lessons: 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Lesson 5

I. Introduction

There are not many publications on structured deep learning, but it is definitely happening in industry:

You can download images from Google by using this tool and solve your own problems:

An introduction to how to train a neural net (a great piece of technical writing):

Students are competing with Jeremy in the Kaggle seedling classification competition.

II. Collaborative Filtering — using MovieLens dataset

The notebook discussed can be found here (lesson5-movielens.ipynb).

Let’s take a look at the data. We will use userId (categorical), movieId (categorical), and rating (the dependent variable) for modeling.

ratings = pd.read_csv(path+'ratings.csv')

Create subset for Excel

We create a crosstab of the most popular movies and most movie-addicted users which we will copy into Excel for visualization.

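# Assumed definitions (not shown in these notes): the 15 most active users and the 15 most rated movies
topUsers = ratings.groupby('userId')['rating'].count().sort_values(ascending=False)[:15]
topMovies = ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)[:15]
top_r = ratings.join(topUsers, rsuffix='_r', how='inner', on='userId')
top_r = top_r.join(topMovies, rsuffix='_r', how='inner', on='movieId')
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)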

This is the Excel file with the above information. To begin with, we will use matrix factorization/decomposition instead of building a neural net.

  • Blue cells — the actual rating
  • Purple cells — our predictions
  • Red cell — our loss function i.e. Root Mean Squared Error (RMSE)
  • Green cells — movie embeddings (randomly initialized)
  • Orange cells — user embeddings (randomly initialized)

Each prediction is the dot product of a movie embedding vector and a user embedding vector. In linear algebra terms, it is equivalent to a matrix product, since one is a row and the other is a column. If there is no actual rating, we set the prediction to zero (think of this as test data, not training data).

We then use gradient descent to minimize our loss. Microsoft Excel has a “solver” in the add-ins that minimizes a variable by changing selected cells (GRG Nonlinear is the method you want to use).

This can be called “shallow learning” (as opposed to deep learning), as there is no nonlinear layer or second linear layer. So what did we just do, intuitively? The five numbers for each movie are called “embeddings” (latent factors): the first number might represent how much it is sci-fi and fantasy, the second how much a movie relies on special effects, the third how dialog-driven it is, and so on. Similarly, each user also has five numbers representing, for example, how much the user likes sci-fi and fantasy, special effects, and dialog-driven movies. Our prediction is the dot product of these two vectors. Since we do not have a rating from every user for every movie, we are trying to figure out which movies are similar to this movie, and how other users who rated other movies similarly to this user rated this movie (hence the name “collaborative”).
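To make the spreadsheet concrete, here is a minimal pure-Python sketch of the same calculation. The toy ratings, the 2-factor embeddings (the spreadsheet uses 5), and the helper names dot, predict, and rmse are all assumptions made up for illustration; they are not from the lesson notebook.

```python
import math

# Hypothetical toy data: 2 users x 3 movies, 0 means "no rating".
toy_ratings = [
    [4.0, 0.0, 5.0],
    [3.0, 2.0, 0.0],
]
# Hypothetical 2-factor embeddings (one row per user / per movie).
user_emb = [[1.0, 0.5], [0.5, 1.0]]
movie_emb = [[2.0, 1.0], [0.5, 1.5], [3.0, 2.0]]

def dot(u, m):
    """Dot product of two equal-length vectors."""
    return sum(ui * mi for ui, mi in zip(u, m))

def predict(ratings, user_emb, movie_emb):
    """Predict every rated cell; leave unrated cells at zero."""
    return [
        [dot(u, m) if r != 0 else 0.0 for r, m in zip(row, movie_emb)]
        for row, u in zip(ratings, user_emb)
    ]

def rmse(ratings, preds):
    """Root mean squared error over the rated cells only."""
    errors = [
        (p - r) ** 2
        for row_r, row_p in zip(ratings, preds)
        for r, p in zip(row_r, row_p)
        if r != 0
    ]
    return math.sqrt(sum(errors) / len(errors))

preds = predict(toy_ratings, user_emb, movie_emb)
loss = rmse(toy_ratings, preds)
```

Gradient descent (or the Excel solver) would then nudge the numbers in user_emb and movie_emb to make loss smaller.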

What do we do with a new user or a new movie? Do we have to retrain the model? We do not have time to cover this now, but basically you need a new-user model or a new-movie model to use initially, and over time you will need to retrain the model.

Simple Python version [26:03]

This should look familiar by now. We create a validation set by picking a random set of IDs. wd is the weight decay for L2 regularization, and n_factors is how big an embedding matrix we want.

val_idxs = get_cv_idxs(len(ratings)) 
wd = 2e-4
n_factors = 50

We create a model data object from a CSV file:

cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')

We then get a learner that is suitable for the model data, and fit the model:

learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
# lr: the learning rate (set it before calling fit)
learn.fit(lr, 2, wds=wd, cycle_len=1, cycle_mult=2)
Output MSE

Since the output is mean squared error, you can get the RMSE by taking its square root.

The output is about 0.88, which outperforms the benchmark of 0.91.
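As a quick sketch of that conversion: the MSE value below is hypothetical, chosen only so that its square root matches the 0.88 quoted above. Substitute whatever validation loss your own run prints.

```python
import math

# Hypothetical validation MSE; replace with the value printed during training.
val_mse = 0.7744

# RMSE is simply the square root of MSE.
val_rmse = math.sqrt(val_mse)
```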

You can get predictions in the usual way:

preds = learn.predict()

You can also plot them using seaborn (imported as sns, built on top of matplotlib):

y =
sns.jointplot(preds, y, kind='hex', stat_func=None)

Dot product with Python

T creates a tensor in Torch

a = T([[1., 2], [3, 4]])
b = T([[2., 2], [10, 10]])

When we put a mathematical operator between tensors in numpy or PyTorch, it operates element-wise, assuming that both have the same dimensionality. Below is how you would calculate the dot product of two vectors, row by row (e.g. (1, 2)⋅(2, 2) = 6 for the first rows of a and b):

(a*b).sum(1)
  6
 70
[torch.FloatTensor of size 2]

Building our first custom layer (i.e. PyTorch module) [33:55]

We do this by creating a Python class that extends nn.Module and overrides the forward function.

class DotProduct(nn.Module):
    def forward(self, u, m): return (u*m).sum(1)

Now we can call it and get the expected result. Notice that we do not need to write model.forward(a, b) to call the forward function: calling the module itself does that, which is a bit of PyTorch magic [40:14]:

model = DotProduct()
model(a, b)
  6
 70
[torch.FloatTensor of size 2]

Building more complex module [41:31]

This implementation has two additions to the DotProduct class:

  • Two nn.Embedding matrices
  • Look up our users and movies in above embedding matrices

User IDs are quite possibly not contiguous, which makes them hard to use as an index into an embedding matrix. So we start by creating indexes that are contiguous and start from zero, and replace the ratings.userId column with those indexes using pandas’ apply function with an anonymous function (lambda). We then do the same for ratings.movieId.

u_uniq = ratings.userId.unique() 
user2idx = {o:i for i,o in enumerate(u_uniq)}
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])
m_uniq = ratings.movieId.unique() 
movie2idx = {o:i for i,o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])
n_users=int(ratings.userId.nunique())
n_movies=int(ratings.movieId.nunique())

Tip: {o:i for i,o in enumerate(u_uniq)} is a handy line of code to keep in your tool belt!
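Here is the same trick on a plain Python list, as a small sketch you can run without pandas. The raw IDs and the names raw_user_ids, user_idx, and idx2user are made up for illustration.

```python
# Hypothetical raw IDs with gaps, in order of appearance.
raw_user_ids = [15, 3, 15, 42, 3, 7]

# Unique IDs in first-seen order (pandas' unique() keeps this order too).
uniq = list(dict.fromkeys(raw_user_ids))

# Map each raw ID to a contiguous index starting from zero.
user2idx = {o: i for i, o in enumerate(uniq)}

# Replace every raw ID with its index, as the apply(lambda ...) call does.
user_idx = [user2idx[x] for x in raw_user_ids]

# The reverse mapping recovers the raw ID from an index.
idx2user = {i: o for o, i in user2idx.items()}
```

Keeping the reverse mapping around is handy when you later want to turn an embedding row back into an actual movie or user.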

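class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        # initialize both embedding matrices to random numbers between 0 and 0.05
        self.u.weight.data.uniform_(0,0.05)
        self.m.weight.data.uniform_(0,0.05)

    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        u,m = self.u(users),self.m(movies)
        return (u*m).sum(1)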

Note that __init__ is a constructor, which is now needed because our class has to keep track of “state” (how many movies, how many users, how many factors, etc.). We initialized the weights to random numbers between 0 and 0.05. You can find more information about a standard algorithm for weight initialization, “Kaiming initialization”, here (PyTorch has a He initialization utility function, but we are trying to do things from scratch here) [46:58].

The weight of an Embedding is not a tensor but a variable. A variable supports exactly the same operations as a tensor, but it also does automatic differentiation. To pull a tensor out of a variable, use its data attribute. Every tensor function has a variant with a trailing underscore (e.g. uniform_) that does the operation in place.

x = ratings.drop(['rating', 'timestamp'],axis=1)
y = ratings['rating'].astype(np.float32)
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)

We are reusing ColumnarModelData (from the fastai library) from the Rossmann notebook, which is why the forward function in the EmbeddingDot class takes both categorical and continuous variables: def forward(self, cats, conts) [50:20]. Since we do not have any continuous variables in this case, we ignore conts and use the first and second columns of cats as users and movies. Note that these are mini-batches of users and movies. It is important not to loop through a mini-batch manually, because you would not get GPU acceleration; instead, process a whole mini-batch at a time, as in lines 3 and 4 of the forward function above [51:00–52:05].

model = EmbeddingDot(n_users, n_movies).cuda()
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)

optim is what gives us the optimizers in PyTorch. model.parameters() is one of the functions inherited from nn.Module; it gives us all the weights to be updated/learned.

fit(model, data, 3, opt, F.mse_loss)

This function is from the fastai library [54:40] and is closer to the regular PyTorch approach than what we have been using. It does not give you features like “stochastic gradient descent with restarts” or “differential learning rates” out of the box.

Let’s improve our model

Bias — to adjust to generally popular movies or generally enthusiastic users.

min_rating,max_rating = ratings.rating.min(),ratings.rating.max()

def get_emb(ni,nf):
    e = nn.Embedding(ni, nf)
    # small random init; the range is assumed to be symmetric around zero
    e.weight.data.uniform_(-0.01,0.01)
    return e

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1)
        ]]

    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        um = (self.u(users)* self.m(movies)).sum(1)
        res = um + self.ub(users).squeeze() + self.mb(movies).squeeze()
        res = F.sigmoid(res) * (max_rating-min_rating) + min_rating
        return res

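squeeze removes dimensions of size 1, so the bias lookups of shape (batch size, 1) become vectors that can be added to um element by element. Jeremy covers this together with broadcasting [1:04:11]; for more information, see the Machine Learning class or the numpy documentation.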

Can we squish the ratings so that they are between 1 and 5? Yes! Putting the prediction through a sigmoid function results in a number between 0 and 1. So in our case, we can multiply that by 4 and add 1, which results in a number between 1 and 5.
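The squishing step can be sketched in a few lines of plain Python. The helper names sigmoid and scale_to_range are my own, and the 1 to 5 range is taken from the paragraph above (the model code uses min_rating and max_rating from the data instead).

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_to_range(x, lo=1.0, hi=5.0):
    """Squash a raw score into the open interval (lo, hi)."""
    return sigmoid(x) * (hi - lo) + lo

low = scale_to_range(-10.0)   # very negative score: just above 1
mid = scale_to_range(0.0)     # zero score: the middle of the range
high = scale_to_range(10.0)   # very positive score: just below 5
```

However large or small the raw score gets, the result never leaves the rating range, so the model does not waste effort learning that ratings cannot be 7 or -2.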

F is PyTorch’s functional module (torch.nn.functional), which contains the functions that operate on tensors; it is imported as F in most cases.

model = EmbeddingDotBias(cf.n_users, cf.n_items).cuda()
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)
fit(model, data, 3, opt, F.mse_loss)
[ 0. 0.85056 0.83742]
[ 1. 0.79628 0.81775]
[ 2. 0.8012 0.80994]

Let’s take a look at the code [1:13:44] we used in our Simple Python version. In the library source, CollabFilterDataset.get_learner calls the get_model function, which creates an EmbeddingDotBias class identical to the one we created.

Neural Net Version [1:17:21]

We go back to the Excel sheet to understand the intuition. Notice that we create user_idx to look up embeddings, just like we did in the Python code earlier. If we were to one-hot encode user_idx and multiply it by the user embeddings, we would get the applicable row for the user. If it is just matrix multiplication, why do we need embeddings? It is a computational performance optimization: looking up a row is much cheaper than multiplying by a one-hot matrix.
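A tiny sketch shows that the two routes give the same answer. The 3-user, 2-factor matrix and the helpers one_hot and vec_matmul are made up for illustration.

```python
# Hypothetical embedding matrix: 3 users, 2 factors each.
user_emb = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def one_hot(idx, n):
    """Vector of length n with a 1 at position idx and 0 elsewhere."""
    return [1.0 if i == idx else 0.0 for i in range(n)]

def vec_matmul(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_cols)]

user_idx = 1
via_matmul = vec_matmul(one_hot(user_idx, len(user_emb)), user_emb)  # multiply
via_lookup = user_emb[user_idx]                                      # index
```

The multiplication touches every row of the matrix only to throw all but one away, while the lookup goes straight to the row it needs.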

Rather than calculating the dot product of the user embedding vector and the movie embedding vector to get a prediction, we will concatenate the two and feed the result through a neural net.

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, nh=10, p1=0.5, p2=0.5):
        super().__init__()
        (self.u, self.m) = [get_emb(*o) for o in [
            (n_users, n_factors), (n_movies, n_factors)]]
        self.lin1 = nn.Linear(n_factors*2, nh)
        self.lin2 = nn.Linear(nh, 1)
        self.drop1 = nn.Dropout(p1)
        self.drop2 = nn.Dropout(p2)

    def forward(self, cats, conts):
        users,movies = cats[:,0],cats[:,1]
        # concatenate the user and movie embeddings along the feature dimension
        x = self.drop1(torch.cat([self.u(users),self.m(movies)], dim=1))
        x = self.drop2(F.relu(self.lin1(x)))
        return F.sigmoid(self.lin2(x)) * (max_rating-min_rating+1) + min_rating-0.5

Notice that we no longer have separate bias terms, since the Linear layer in PyTorch already has a built-in bias. nh is the number of activations a linear layer creates (Jeremy calls it “num hidden”).

It only has one hidden layer, so maybe not “deep”, but this is definitely a neural network.

model = EmbeddingNet(n_users, n_movies).cuda()
opt = optim.Adam(model.parameters(), 1e-3, weight_decay=wd)
fit(model, data, 3, opt, F.mse_loss)
[ 0. 0.88043 0.82363]
[ 1. 0.8941 0.81264]
[ 2. 0.86179 0.80706]

Notice that the loss functions are also in F (here, it is mean squared error loss).

Now that we have neural net, there are many things we can try:

  • Add dropouts
  • Use different embedding sizes for user embedding and movie embedding
  • Not only user and movie embeddings, but append movie genre embedding and/or timestamp from the original data.
  • Increase/decrease number of hidden layers and activations
  • Increase/decrease regularization

What is happening in the training loop? [1:33:21]

Currently, we are handing the updating of weights over to PyTorch’s optimizer. What does an optimizer do, and what is momentum?

opt = optim.SGD(model.parameters(), 1e-1, weight_decay=wd, momentum=0.9)

We are going to implement gradient descent in an Excel sheet (graddesc.xlsm); read the worksheets right to left. First we create random x’s, and y’s that are linearly related to the x’s (e.g. y = a*x + b). Using these sets of x’s and y’s, we will try to learn a and b.

To calculate the error, we first need a prediction, and square the difference:

To reduce the error, we increase/decrease a and b a little bit and see what makes the error decrease. This is called finding the derivative through finite differencing.

Finite differencing gets complicated in high-dimensional spaces [1:41:46]: it becomes very memory-intensive and takes a long time. So we want to find some way to do this more quickly. It is worthwhile to look up things like the Jacobian and the Hessian (Deep Learning book: section 4.3.1, page 84).
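Here is a minimal sketch of finite differencing for the y = a*x + b example. The data point, the starting values of a and b, and the step size eps are all hypothetical, not the numbers in the spreadsheet.

```python
def squared_error(a, b, x, y):
    """Squared error of the prediction a*x + b against the target y."""
    return (a * x + b - y) ** 2

def finite_diff_grad(a, b, x, y, eps=0.01):
    """Estimate de/da and de/db by nudging each parameter by eps."""
    base = squared_error(a, b, x, y)
    de_da = (squared_error(a + eps, b, x, y) - base) / eps
    de_db = (squared_error(a, b + eps, x, y) - base) / eps
    return de_da, de_db

# Hypothetical data point generated from y = 2*x + 30.
x, y = 14.0, 58.0
est_da, est_db = finite_diff_grad(1.0, 1.0, x, y)
```

Note that every parameter needs its own extra evaluation of the error, which is why this approach becomes so slow once a model has millions of weights.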

Chain Rule and Backpropagation

The faster approach is to do this analytically [1:45:27]. For this, we need the chain rule:

Overview of chain rule

Here is a great article by Chris Olah on Backpropagation as a chain rule.

Now we replace the finite difference with the actual derivative that WolframAlpha gave us (notice that the finite-difference output is fairly close to the actual derivative, which makes it a good quick sanity check if you need to calculate your own derivative):

  • “Online” training — mini-batch with size 1

And this is how you do SGD in an Excel sheet. If you were to replace the prediction value with the output from the CNN spreadsheet, you could train the CNN with SGD.
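The spreadsheet logic can be sketched in Python as follows. This assumes the squared-error loss and the y = a*x + b model described above; the data, learning rate, and number of epochs are hypothetical choices that happen to converge for this toy problem.

```python
def sgd_epoch(a, b, data, lr):
    """One pass of online SGD (mini-batch size 1) for the model y = a*x + b."""
    for x, y in data:
        err = a * x + b - y      # prediction minus target
        de_da = 2 * x * err      # derivative of err**2 with respect to a
        de_db = 2 * err          # derivative of err**2 with respect to b
        a -= lr * de_da
        b -= lr * de_db
    return a, b

def mse(a, b, data):
    """Mean squared error of y = a*x + b over the data."""
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

# Hypothetical noiseless data generated from y = 2*x + 3.
data = [(float(x), 2.0 * x + 3.0) for x in range(1, 6)]

a, b = 1.0, 1.0
loss_before = mse(a, b, data)
for _ in range(500):
    a, b = sgd_epoch(a, b, data, lr=0.01)
loss_after = mse(a, b, data)
```

The slope a settles quickly while the intercept b takes many more epochs, which is the kind of slow crawl that momentum is meant to speed up.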

Momentum [1:53:47]

Come on, take a hint — that’s a good direction. Please keep doing that but more.

With this approach, we use a linear interpolation between the current mini-batch’s derivative and the step (and direction) we took after the last mini-batch (cell K9):

Compared to de/db, whose sign (+/-) is random, the version with momentum keeps going in the same direction a little bit faster, up to a certain point. This reduces the number of epochs required for training.
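A minimal sketch of that interpolation, applied to a one-parameter problem. The function name momentum_step, the value beta=0.9, the learning rate, and the toy objective (w - 3)**2 are assumptions for illustration, not the spreadsheet’s exact setup.

```python
def momentum_step(param, grad, prev_step, lr, beta=0.9):
    """Blend the new gradient with the previous step, then move the parameter.

    step = beta * prev_step + (1 - beta) * grad   (a linear interpolation)
    """
    step = beta * prev_step + (1 - beta) * grad
    return param - lr * step, step

# Minimise f(w) = (w - 3)**2, whose derivative is 2 * (w - 3).
w, step = 0.0, 0.0
for _ in range(300):
    grad = 2 * (w - 3.0)
    w, step = momentum_step(w, grad, step, lr=0.1)
```

While the gradient keeps pointing the same way, step builds up towards the full gradient; when the gradient flips sign from batch to batch, the flips largely cancel inside step.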

Adam [1:59:04]

Adam is much faster, but the issue has been that the final predictions are not as good as they are with SGD with momentum. It seems that this was due to the combined use of Adam and weight decay. The new version that fixes this issue is called AdamW.

  • cell J8 : a linear interpolation of the derivative and the previous direction (identical to what we had in momentum)
  • cell L8 : a linear interpolation of the derivative squared and the derivative squared from the last step (cell L7)
  • The idea is called an “exponentially weighted moving average” (in other words, an average in which previous values are multiplicatively decreased)

The learning rate is much higher than before because we are dividing it by the square root of L8.

If you take a look at the fastai library, you will notice that the fit function does not just calculate the average loss; it calculates the exponentially weighted moving average of the loss.

avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)

Another helpful habit: whenever you see `α(…) + (1-α)(…)`, immediately think linear interpolation.
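To see what the avg_loss line does over several mini-batches, here is a small sketch. The function name ewma, the default avg_mom of 0.98, and the loss values are assumptions for illustration.

```python
def ewma(values, avg_mom=0.98):
    """Exponentially weighted moving average, using the avg_loss update above."""
    avg = 0.0
    history = []
    for v in values:
        avg = avg * avg_mom + v * (1 - avg_mom)   # linear interpolation
        history.append(avg)
    return history

# Hypothetical noisy losses that bounce around 1.0.
losses = [1.2, 0.8, 1.1, 0.9, 1.3, 0.7]
smoothed = ewma(losses, avg_mom=0.5)
```

Because the average starts at zero, the first few values are pulled towards zero; with a high avg_mom that start-up effect lasts for many batches.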

Some intuitions

  • We calculate the exponentially weighted moving average of the gradient squared, take the square root of that, and divide the learning rate by it.
  • Gradient squared is always positive.
  • When there is high variance in the gradients, gradient squared will be large.
  • When the gradients are constant, gradient squared will be small.
  • If the gradients are changing a lot, we want to be careful and divide the learning rate by a big number (slow down).
  • If the gradients are not changing much, we take a bigger step by dividing the learning rate by a small number.
  • Adaptive learning rate: keep track of the average of the squares of the gradients and use that to adjust the learning rate. So there is just one learning rate, but effectively every parameter at every epoch gets a bigger jump if its gradient is constant, and a smaller jump otherwise.
  • There are two momentums: one for the gradient and one for the gradient squared (in PyTorch they are passed as betas, a tuple of two numbers).
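The bullets above can be sketched for a single parameter as follows. This is a simplified version that leaves out the bias correction used in the full Adam algorithm; the names adam_step and last_step_size, and the values of lr, beta1, beta2, and eps, are assumptions for illustration.

```python
import math

def adam_step(grad, m, v, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """Return the parameter change and updated averages for one simplified Adam step."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the gradient squared
    delta = -lr * m / (math.sqrt(v) + eps)      # learning rate divided by sqrt(v)
    return delta, m, v

def last_step_size(grads):
    """Feed a sequence of gradients through Adam; return the size of the final step."""
    m, v, delta = 0.0, 0.0, 0.0
    for g in grads:
        delta, m, v = adam_step(g, m, v)
    return abs(delta)

steady = last_step_size([1.0] * 50)         # the gradient never changes
noisy = last_step_size([1.0, -1.0] * 25)    # the gradient flips sign every step
```

Both sequences have the same gradient squared, so v is identical; the steady one still takes a far bigger step because its gradient average m does not cancel out.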


When there are many more parameters than data points, regularization becomes important. We have seen dropout previously, and weight decay is another type of regularization. Weight decay (L2 regularization) penalizes large weights by adding the squared weights (times a weight decay multiplier) to the loss. Now the loss function wants to keep the weights small, because increasing the weights increases the loss; the weights therefore grow only when doing so improves the loss by more than the penalty.

The problem is that, since we added the squared weights to the loss function, this affects the moving average of the gradients and the moving average of the squared gradients in Adam. This results in decreasing the amount of weight decay when there is high variance in the gradients, and increasing the amount of weight decay when there is little variation. In other words, “penalize large weights unless the gradients vary a lot”, which is not what we intended. AdamW takes the weight decay out of the loss function and applies it directly when updating the weights.
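A rough single-parameter sketch of the difference, again without bias correction. The function names adam_l2_step and adamw_step and all hyperparameter values are assumptions for illustration; the penalty gradient is written as wd * w, which corresponds to a penalty of wd/2 times the squared weight.

```python
import math

def adam_l2_step(w, grad, m, v, lr, wd, beta1=0.9, beta2=0.99, eps=1e-8):
    """Adam with L2 regularization: the penalty gradient wd * w enters both averages."""
    grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return w - lr * m / (math.sqrt(v) + eps), m, v

def adamw_step(w, grad, m, v, lr, wd, beta1=0.9, beta2=0.99, eps=1e-8):
    """Decoupled weight decay: the averages see only the loss gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return w - lr * m / (math.sqrt(v) + eps) - lr * wd * w, m, v

# One step from w = 1.0 with no gradient from the data, so only the decay acts.
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, lr=0.1, wd=0.01)
w_decoupled, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, lr=0.1, wd=0.01)
```

In the L2 version the tiny penalty gradient is divided by its own square root average, so the weight moves by roughly the full learning rate no matter how small wd is. In the decoupled version the shrinkage is exactly lr * wd * w, independent of the gradient statistics.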
