Deep Learning 2: Part 1 Lesson 4

My personal notes from the course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Lesson 4


Dropout [04:59]

learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True)
  • precompute=True : Pre-compute the activations that come out of the last convolutional layer. Remember, activation is a number that is calculated based on some weights/parameter that makes up kernels/filters, and they get applied to the previous layer’s activations or inputs.
(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
(1): Dropout(p=0.5)
(2): Linear(in_features=1024, out_features=512)
(3): ReLU()
(4): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
(5): Dropout(p=0.5)
(6): Linear(in_features=512, out_features=120)
(7): LogSoftmax()

learn — This will display the layers we added at the end. These are the layers we train when precompute=True

(0), (4): BatchNorm will be covered in the last lesson

(1), (5): Dropout

(2): Linear layer simply means a matrix multiply. This is a matrix which has 1024 rows and 512 columns, so it will take in 1024 activations and spit out 512 activations.

(3): ReLU, which just replaces negatives with zero

(6): Linear, the second linear layer, which takes those 512 activations from the previous linear layer, puts them through a new 512 by 120 matrix multiply, and outputs 120 activations

(7): Softmax, the activation function that returns numbers that add up to 1, each of them between 0 and 1.

For minor numerical precision reasons, it turns out to be better to take the log of the softmax than to use softmax directly [15:03]. That is why, when we get predictions out of our models, we have to do np.exp(log_preds).
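The layers described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights are random stand-ins rather than trained values, biases are left out, and the BatchNorm and Dropout layers are omitted. It shows the shapes flowing through the two matrix multiplies and why np.exp(log_preds) recovers probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained weights; the shapes match the printed layers.
x = rng.normal(size=(1, 1024))            # one row of precomputed activations
w1 = rng.normal(size=(1024, 512)) * 0.01
w2 = rng.normal(size=(512, 120)) * 0.01

h = x @ w1                                # (2) Linear: 1024 -> 512
h = np.maximum(h, 0)                      # (3) ReLU: negatives become zero
logits = h @ w2                           # (6) Linear: 512 -> 120

# (7) LogSoftmax, computed stably by subtracting the row maximum first
shifted = logits - logits.max(axis=1, keepdims=True)
log_preds = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

probs = np.exp(log_preds)                 # back to probabilities
print(probs.shape, probs.sum())
```

Each of the 120 values in probs lies between 0 and 1, and together they sum to 1 up to floating-point error.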

What is Dropout and what is p? [08:17]


If we applied dropout with p=0.5 to Conv2 layer, it would look like the above. We go through, pick an activation, and delete it with 50% chance. So p=0.5 is the probability of deleting that cell. Output does not actually change by very much, just a little bit.

Randomly throwing away half of the activations in a layer has an interesting effect. An important thing to note is that for each mini-batch, we throw away a different random half of the activations in that layer. This forces the model not to overfit. In other words, when a particular activation that learned just that exact dog or exact cat gets dropped out, the model has to find a representation that continues to work even as a random half of the activations gets thrown away every time.

This has been absolutely critical in making modern deep learning work and just about solve the problem of generalization. Geoffrey Hinton and his colleagues came up with this idea loosely inspired by the way the brain works.

  • p=0.01 will throw away 1% of the activations. It will not change things up very much at all, and will not prevent overfitting (not generalized).
  • p=0.99 will throw away 99% of the activations. Not going to overfit and great for generalization, but will kill your accuracy.
  • By default, the first layer is 0.25 and the second layer is 0.5 [17:54]. If you find it is overfitting, start bumping it up: try setting all to 0.5; if it is still overfitting, try 0.7, and so on. If you are under-fitting, you can try making it lower, but it is unlikely you would need to make it much lower.
  • ResNet34 has fewer parameters so it does not overfit as much, but for a bigger architecture like ResNet50, you often need to increase dropout.

Have you wondered why the validation losses are better than the training losses, particularly early in training? [12:32] This is because we turn off dropout when we run inference (i.e. make predictions) on the validation set. We want to be using the best model we can.

Question: Do you have to do anything to accommodate for the fact that you are throwing away activations? [13:26] We do not, but PyTorch does two things when you say p=0.5. It throws away half of the activations, and it doubles all the activations that remain so that the average activation does not change.
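A minimal NumPy sketch of that idea follows. It is not PyTorch's implementation, and the dropout function here is written only for illustration: each activation is zeroed with probability p, and the survivors are scaled by 1/(1-p), which is a doubling when p=0.5.

```python
import numpy as np

def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale the survivors."""
    if p == 0:
        return activations
    keep = rng.random(activations.shape) >= p    # True where the cell survives
    return activations * keep / (1.0 - p)

rng = np.random.default_rng(42)
acts = np.ones((1000, 100))                      # toy activations, all equal to 1
dropped = dropout(acts, p=0.5, rng=rng)

print(np.unique(dropped))    # survivors are doubled, the rest are zero
print(dropped.mean())        # stays close to the original mean of 1.0
```

A fresh mask is drawn on every call, which matches the point above that each mini-batch loses a different random half.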

You can pass in ps, which is the p value for all of the added layers. It will not change the dropout in the pre-trained network since it should have been already trained with some appropriate level of dropout:

learn = ConvLearner.pretrained(arch, data, ps=0.5, precompute=True)

You can remove dropout by setting ps=0., but even after a couple of epochs, we start to massively overfit (training loss ≪ validation loss):

[2.      0.3521   0.55247  0.84189]

When ps=0. , dropout layers are not even added to the model:

(0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True)
(1): Linear(in_features=4096, out_features=512)
(2): ReLU()
(3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
(4): Linear(in_features=512, out_features=120)
(5): LogSoftmax()

You may have noticed that it has been adding two Linear layers [16:19]. We do not have to do that. There is an xtra_fc parameter you can set. Note: you do need at least one which takes the output of the convolutional layer (4096 in this example) and turns it into the number of classes (120 dog breeds):

learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, 
xtra_fc=[]); learn
(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
(1): Linear(in_features=1024, out_features=120)
(2): LogSoftmax()
learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True, 
xtra_fc=[700, 300]); learn
(0): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True)
(1): Linear(in_features=1024, out_features=700)
(2): ReLU()
(3): BatchNorm1d(700, eps=1e-05, momentum=0.1, affine=True)
(4): Linear(in_features=700, out_features=300)
(5): ReLU()
(6): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True)
(7): Linear(in_features=300, out_features=120)
(8): LogSoftmax()

Question: Is there a particular way in which you can determine if it is overfitted? [19:53] Yes, you can see that the training loss is much lower than the validation loss. You cannot tell whether it is too overfitted. Zero overfitting is not generally optimal. The only thing you are trying to do is to get the validation loss low, so you need to play around with a few things and see what makes the validation loss low. Over time you will get a feel for what too much overfitting looks like for your particular problem.

Question: Why does average activation matter? [21:15] If we just deleted half of the activations, the next activation that takes them as input would also be halved, and so would everything after that. For example, fluffy ears are fluffy if this is greater than 0.6, and now it is only fluffy if it is greater than 0.3, which changes the meaning. The goal here is to delete activations without changing the meanings.

Question: Can we have different level of dropout by layer? [22:41] Yes, that is why it is called ps:

learn = ConvLearner.pretrained(arch, data, ps=[0., 0.2],
precompute=True, xtra_fc=[512]); learn
(0): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True)
(1): Linear(in_features=4096, out_features=512)
(2): ReLU()
(3): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
(4): Dropout(p=0.2)
(5): Linear(in_features=512, out_features=120)
(6): LogSoftmax()
  • There is no rule of thumb for when earlier or later layers should have different amounts of dropout yet.
  • If in doubt, use the same dropout for every fully connected layer.
  • Often people only put dropout on the very last linear layer.

Question: Why monitor loss and not accuracy? [23:53] Loss is the only thing that we can see for both the validation set and the training set. As we learn later, the loss is the thing that we are actually optimizing so it is easier to monitor and understand what that means.

Question: Do we need to adjust the learning rate after adding dropouts? [24:33] It does not seem to impact the learning rate enough to notice. In theory, it might but not enough to affect us.

Structured and Time Series Data [25:03]

Notebook / Kaggle

There are two types of columns:

  • Categorical: it has a number of “levels”, e.g. StoreType, Assortment
  • Continuous: it is a number where differences or ratios of that number have some kind of meaning, e.g. CompetitionDistance
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day',
'StateHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks',
'StoreType', 'Assortment', 'PromoInterval',
'CompetitionOpenSinceYear', 'Promo2SinceYear', 'State',
'Week', 'Events', 'Promo_fw', 'Promo_bw',
'StateHoliday_fw', 'StateHoliday_bw',
'SchoolHoliday_fw', 'SchoolHoliday_bw']
contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 
'Mean_TemperatureC', 'Min_TemperatureC',
'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h',
'CloudCover', 'trend', 'trend_DE',
'AfterStateHoliday', 'BeforeStateHoliday', 'Promo',
n = len(joined); n
  • Numbers like Year and Month, although we could treat them as continuous, we do not have to. If we decide to make Year a categorical variable, we are telling our neural net that for every different “level” of Year (2000, 2001, 2002), it can treat it totally differently; whereas if we say it is continuous, it has to come up with some kind of smooth function to fit them. So for things that actually are continuous but do not have many distinct levels (e.g. Year, DayOfWeek), it often works better to treat them as categorical.
  • Choosing categorical vs. continuous variable is a modeling decision you get to make. In summary, if it is categorical in the data, it has to be categorical. If it is continuous in the data, you get to pick whether to make it continuous or categorical in the model.
  • Generally, floating point numbers are hard to make categorical as there are many levels (we call number of levels “Cardinality” — e.g. the cardinality of the day of week variable is 7).

Question: Do you ever bin continuous variables? [31:02] Jeremy does not bin variables, but one thing we could do with, say, max temperature is to group it into 0–10, 10–20, 20–30, and call that categorical. Interestingly, a paper just came out last week in which a group of researchers found that sometimes binning can be helpful.
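For reference, binning of this kind can be sketched with pandas. The temperature values below are made up for illustration, and the bin edges simply follow the 0–10, 10–20, 20–30 grouping mentioned above.

```python
import pandas as pd

# Hypothetical temperature readings, for illustration only
max_temp = pd.Series([3.5, 12.0, 18.2, 27.9, 9.9])

# Each value falls into one of three right-closed intervals
binned = pd.cut(max_temp, bins=[0, 10, 20, 30],
                labels=["0-10", "10-20", "20-30"])
print(binned.tolist())
```

The result is a pandas categorical column, so it could be listed alongside the other categorical variables.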

Question: If you are using year as a category, what happens when a model encounters a year it has never seen before? [31:47] We will get there, but the short answer is that it will be treated as an unknown category. Pandas has a special category called unknown and if it sees a category it has not seen before, it gets treated as unknown.

for v in cat_vars:
    joined[v] = joined[v].astype('category').cat.as_ordered()
for v in contin_vars:
    joined[v] = joined[v].astype('float32')
dep = 'Sales'
joined = joined[cat_vars+contin_vars+[dep, 'Date']].copy()
  • Loop through cat_vars and turn applicable data frame columns into categorical columns.
  • Loop through contin_vars and set them as float32 (32 bit floating point) because that is what PyTorch expects.

Start with a small sample [34:29]

idxs = get_cv_idxs(n, val_pct=150000/n) 
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

Here is what our data looks like. Even though we set some of the columns as “category” (e.g. ‘StoreType’, ‘Year’), Pandas still displays them as strings in the notebook.

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)

proc_df (process data frame) is a function that does a few things:

  1. Pulls out the dependent variable, puts it into a separate variable, and deletes it from the original data frame. In other words, df does not have a Sales column, and y only contains the Sales column.
  2. do_scale : Neural nets really like to have the input data all be somewhere around zero with a standard deviation of somewhere around 1. So we take our data, subtract the mean, and divide by the standard deviation to make that happen. It returns a special object (mapper) which keeps track of what mean and standard deviation it used for that normalization so you can do the same to the test set later.
  3. It also handles missing values. For a categorical variable, missing becomes ID 0 and the other categories become 1, 2, 3, and so on. For a continuous variable, it replaces the missing value with the median and creates a new boolean column that says whether it was missing or not.

After processing, year 2014 for example becomes 2 since categorical variables have been replaced with contiguous integers starting at zero. The reason for that is, we are going to put them into a matrix later, and we would not want the matrix to be 2014 rows long when it could just be two rows.
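The three steps above can be sketched on a toy frame with plain pandas. This is not the library's proc_df, only an illustration of the same ideas; the column values are made up, and the scaling step simply keeps the mean and standard deviation in local variables instead of returning a mapper object.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only
df = pd.DataFrame({
    "Year": pd.Categorical(["2013", "2014", None, "2015"],
                           categories=["2013", "2014", "2015"], ordered=True),
    "CompetitionDistance": [100.0, np.nan, 300.0, 500.0],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# 1. Split off the dependent variable
y = df.pop("Sales").values

# 3a. Categorical -> integer codes; pandas uses -1 for missing, so +1 maps it to 0
df["Year"] = df["Year"].cat.codes + 1

# 3b. Continuous -> flag missing values, then fill them with the median
col = "CompetitionDistance"
df[col + "_na"] = df[col].isna()
df[col] = df[col].fillna(df[col].median())

# 2. Scale to zero mean and unit standard deviation, remembering the statistics
mean, std = df[col].mean(), df[col].std()
df[col] = (df[col] - mean) / std

print(df)
```

In this toy frame the missing year gets code 0 and 2014 gets code 2, in line with the example above.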

Now we have a data frame which does not contain the dependent variable and where everything is a number. That is where we need to get to in order to do deep learning. See the Machine Learning course for further details. Another thing that is covered in the Machine Learning course is validation sets. In this case, we need to predict the next two weeks of sales, so we should create a validation set which is the last two weeks of our training set:

val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) &
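The idea of holding out the most recent period can be sketched on toy data. The frame and the dates below are invented for illustration and are not the competition's actual date bounds; the point is only that the validation rows are selected by a date window on the index.

```python
import datetime
import numpy as np
import pandas as pd

# Toy daily index; the dates here are illustrative, not the competition's
dates = pd.date_range("2020-01-01", "2020-03-31", freq="D")
toy = pd.DataFrame({"x": np.arange(len(dates))}, index=dates)

# Validation rows: the final 14 days of the toy training period
start = datetime.datetime(2020, 3, 18)
end = datetime.datetime(2020, 3, 31)
toy_val_idx = np.flatnonzero((toy.index >= start) & (toy.index <= end))

print(len(toy_val_idx), toy.index[toy_val_idx[0]].date(),
      toy.index[toy_val_idx[-1]].date())
```

np.flatnonzero returns integer row positions, which is the form a list of validation row indexes needs to take.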

Let’s get straight to the deep learning action [39:48]

For any Kaggle competitions, it is important that you have a strong understanding of your metric — how you are going to be judged. In this competition, we are going to be judged on Root Mean Square Percentage Error (RMSPE).

def inv_y(a): return np.exp(a)
def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())
max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)
  • When you take the log of the data, the root mean squared error is approximately the root mean square percentage error, because a difference of logs is the log of a ratio.
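A small numerical check of that relationship, using made-up numbers: since log(a) - log(b) = log(a/b), and log(1 + e) is close to e when e is small, the RMSE of the logs is close to the RMSPE as long as predictions are only a few percent off. The approximation degrades for large errors.

```python
import math
import numpy as np

targ = np.array([100.0, 250.0, 4000.0, 12000.0])     # made-up sales figures
pred = targ * np.array([1.05, 0.97, 1.02, 0.94])     # predictions a few percent off

# Root mean square percentage error on the original scale
rmspe = math.sqrt((((targ - pred) / targ) ** 2).mean())

# Plain root mean squared error, but computed on the logs
rmse_of_logs = math.sqrt(((np.log(targ) - np.log(pred)) ** 2).mean())

print(rmspe, rmse_of_logs)
```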
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, 
yl.astype(np.float32), cat_flds=cat_vars, bs=128,
  • As per usual, we will start by creating a model data object which has a validation set, training set, and optional test set built into it. From that, we will get a learner, then optionally call lr_find, then train, and so forth.
  • The difference here is we are not using ImageClassifierData.from_csv or .from_paths, we need a different kind of model data called ColumnarModelData and we call from_data_frame.
  • PATH : Specifies where to store model files etc
  • val_idx : A list of the indexes of the rows that we want to put in the validation set
  • df : data frame that contains independent variable
  • yl : We took the dependent variable y returned by proc_df and took the log of that (i.e. np.log(y))
  • cat_flds : which columns to be treated as categorical. Remember, by this time, everything is a number, so unless we specify, it will treat them all as continuous.

Now we have a standard model data object which we are familiar with and contains train_dl, val_dl , train_ds , val_ds , etc.

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
  • Here, we are asking it to create a learner that is suitable for our model data.
  • 0.04 : how much dropout to use on the embeddings
  • [1000,500] : how many activations to have in each layer
  • [0.001,0.01] : how much dropout to use at later layers

Key New Concept: Embeddings [45:39]

Let’s forget about categorical variables for a moment:

Remember, you never want to put ReLU in the last layer because softmax needs negatives to create low probabilities.

Simple view of fully connected neural net [49:13]:

For regression problems (not classification), you can even skip the softmax layer.

Categorical variables [50:49]

We create a new matrix of 7 rows and as many columns as we choose (4, for example) and fill it with floating-point numbers. To add “Sunday” to our rank 1 tensor with continuous variables, we do a lookup into this matrix, which returns 4 floating-point numbers, and we use them as “Sunday”.

Initially, these numbers are random. But we can put them through a neural net and update them in a way that reduces the loss. In other words, this matrix is just another bunch of weights in our neural net. And matrices of this type are called “embedding matrices”. An embedding matrix is something where we start out with an integer between zero and maximum number of levels of that category. We index into the matrix to find a particular row, and we append it to all of our continuous variables, and everything after that is just the same as before (linear → ReLU → etc).
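A minimal NumPy sketch of the lookup-and-append step described above. The integer code chosen for “Sunday” and the continuous values are arbitrary illustrations, and nothing here is trained; in a real model the matrix would be updated along with the other weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels, emb_dim = 7, 4
emb = rng.normal(size=(n_levels, emb_dim))   # starts random; training would update it

day_idx = 0                                  # integer code standing in for "Sunday" (arbitrary)
continuous = np.array([0.3, -1.2, 0.8])      # made-up, already normalized continuous inputs

day_vec = emb[day_idx]                       # the lookup: pick one row of 4 floats
net_input = np.concatenate([day_vec, continuous])

print(net_input.shape)
```

The concatenated vector is what the first linear layer would receive.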

Question: What do those 4 numbers represent? [55:12] We will learn more about that when we look at collaborative filtering, but for now, they are just parameters that we are learning that happen to end up giving us a good loss. We will discover later that these particular parameters often are human interpretable and quite interesting, but that is a side effect.

Question: Do you have good heuristics for the dimensionality of the embedding matrix? [55:57] I sure do! Let’s take a look.

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) 
for c in cat_vars]
[('Store', 1116),
('DayOfWeek', 8),
('Year', 4),
('Month', 13),
('Day', 32),
('StateHoliday', 3),
('CompetitionMonthsOpen', 26),
('Promo2Weeks', 27),
('StoreType', 5),
('Assortment', 4),
('PromoInterval', 4),
('CompetitionOpenSinceYear', 24),
('Promo2SinceYear', 9),
('State', 13),
('Week', 53),
('Events', 22),
('Promo_fw', 7),
('Promo_bw', 7),
('StateHoliday_fw', 4),
('StateHoliday_bw', 4),
('SchoolHoliday_fw', 9),
('SchoolHoliday_bw', 9)]
  • Here is a list of every categorical variable and its cardinality.
  • Even if there were no missing values in the original data, you should still set aside one for unknown just in case.
  • The rule of thumb for determining the embedding size is the cardinality size divided by 2, but no bigger than 50.
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
[(1116, 50),
(8, 4),
(4, 2),
(13, 7),
(32, 16),
(3, 2),
(26, 13),
(27, 14),
(5, 3),
(4, 2),
(4, 2),
(24, 12),
(9, 5),
(13, 7),
(53, 27),
(22, 11),
(7, 4),
(7, 4),
(4, 2),
(4, 2),
(9, 5),
(9, 5)]

Then pass the embedding size to the learner:

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)

Question: Is there a way to initialize embedding matrices besides random? [58:14] We will probably talk about pre-trained more later in the course, but the basic idea is if somebody else at Rossmann had already trained a neural network to predict cheese sales, you may as well start with their embedding matrix of stores to predict liquor sales. This is what happens, for example, at Pinterest and Instacart. Instacart uses this technique for routing their shoppers, and Pinterest uses it for deciding what to display on a webpage. They have embedding matrices of products/stores that get shared in the organization so people do not have to train new ones.

Question: What is the advantage of using embedding matrices over one-hot-encoding? [59:23] For the day of week example above, instead of the 4 numbers, we could have easily passed 7 numbers (e.g. [0, 1, 0, 0, 0, 0, 0] for Sunday). That also is a list of floats and that would totally work, and that is how, generally speaking, categorical variables have been used in statistics for many years (called “dummy variables”). The problem is, the concept of Sunday could only ever be associated with a single floating-point number. So it gets this kind of linear behavior: it says Sunday is more or less of a single thing. With embeddings, Sunday is a concept in four dimensional space. What we tend to find is that these embedding vectors pick up rich semantic concepts. For example, if it turns out that weekends have a different behavior, you tend to see that Saturday and Sunday will have some particular number higher.

By having higher dimensionality vector rather than just a single number, it gives the deep learning network a chance to learn these rich representations.

The idea of an embedding is what is called a “distributed representation”, the most fundamental concept of neural networks. This is the idea that a concept in a neural network has a high dimensional representation which can be hard to interpret. The numbers in this vector do not even have to have just one meaning. It could mean one thing if this one is low and that one is high, and something else if this one is high and that one is low, because it is going through this rich nonlinear function. It is this rich representation that allows it to learn such interesting relationships.

Question: Are embeddings suitable for certain types of variables? [01:02:45] Embeddings are suitable for any categorical variable. The only thing they cannot work well for would be something with too high a cardinality. If you had 600,000 rows and a variable had 600,000 levels, that is just not a useful categorical variable. But in general, the third-place winner in this competition decided to treat everything that was not too high cardinality as categorical. The good rule of thumb is that if you can make a variable categorical, you may as well, because that way it can learn this rich distributed representation; whereas if you leave it as continuous, the most it can do is to try and find a single functional form that fits it well.

Matrix algebra behind the scene [01:04:47]

Looking up an embedding with an index is identical to doing a matrix product between a one-hot encoded vector and the embedding matrix. But doing so is terribly inefficient, so modern libraries implement this as taking an integer and doing a look up into an array.
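The equivalence is easy to check numerically. The sketch below uses a random 7 by 4 matrix as a stand-in for an embedding matrix and compares the one-hot matrix product with a direct row lookup.

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(7, 4))        # stand-in embedding matrix: 7 levels, 4 dimensions

idx = 3
one_hot = np.zeros(7)
one_hot[idx] = 1.0

via_matmul = one_hot @ emb           # 7 multiplications per column, most of them by zero
via_lookup = emb[idx]                # a single row read

print(np.allclose(via_matmul, via_lookup))
```

Both routes produce the same four numbers; the lookup just skips all the multiplications by zero.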

Question: Could you touch on using dates and times as categorical and how that affects seasonality? [01:06:59] There is a function called add_datepart which takes a data frame and a column name. It optionally removes the column from the data frame and replaces it with lots of columns representing all of the useful information about that date, such as day of week, day of month, month of year, etc. (basically everything Pandas gives us).

add_datepart(weather, "Date", drop=False)
add_datepart(googletrend, "Date", drop=False)
add_datepart(train, "Date", drop=False)
add_datepart(test, "Date", drop=False)
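To give a feel for what such a function produces, here is a rough sketch using only pandas. add_date_columns is a hypothetical stand-in written for this note, not the library's add_datepart, and it covers only a handful of the date parts; the toy rows are made up.

```python
import pandas as pd

def add_date_columns(df, col):
    """Hypothetical stand-in for add_datepart: expand one date column into parts."""
    dates = pd.to_datetime(df[col])
    out = df.copy()
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["Day"] = dates.dt.day
    out["DayOfWeek"] = dates.dt.dayofweek     # Monday=0 ... Sunday=6
    out["DayOfYear"] = dates.dt.dayofyear
    return out

toy = pd.DataFrame({"Date": ["2015-07-31", "2015-08-02"], "Sales": [100, 200]})
result = add_date_columns(toy, "Date")
print(result)
```

Each new column can then be listed as a categorical variable and given its own embedding matrix.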

So for example, day of week now becomes an embedding matrix of eight rows by four columns. Conceptually this allows our model to create some interesting time series models. If there is something that has a seven day period cycle that goes up on Mondays and down on Wednesdays but only for daily and only in Berlin, it can totally do that, since it has all the information it needs. This is a fantastic way to deal with time series. You just need to make sure that the cycle indicator in your time series exists as a column. If you did not have a column called day of week, it would be very difficult for the neural network to learn to do mod seven and look up in an embedding matrix. It is not impossible but really hard. If you are predicting sales of beverages in San Francisco, you probably want a list of when the ball game is on at AT&T park because that is going to impact how many people are drinking beer in SoMa. So you need to make sure that the basic indicators or periodicities are in your data, and as long as they are there, the neural net is going to learn to use them.

Learner [01:10:13]

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3
  • emb_szs : embedding size
  • len(df.columns)-len(cat_vars) : number of continuous variables in the data frame
  • 0.04 : embedding matrix has its own dropout and this is the dropout rate
  • 1 : how many outputs we want to create (output of the last linear layer)
  • [1000, 500] : number of activations in the first linear layer, and the second linear layer
  • [0.001, 0.01] : dropout in the first linear layer, and the second linear layer
  • y_range : we will not worry about that for now

Training for 3 epochs with metrics=[exp_rmspe] gives:
[ 0.       0.02479  0.02205  0.19309]                          
[ 1. 0.02044 0.01751
[ 2. 0.01598 0.01571
  • metrics : this is a custom metric which specifies a function to be called at the end of every epoch and prints out a result

A run of 1 cycle with cycle_len=1 and metrics=[exp_rmspe] gives:
[ 0.       0.00676  0.01041  0.09711]   

By using all of the training data, we achieved an RMSPE of around 0.09711. There is a big difference between the public leaderboard and the private leaderboard, but we are certainly in the top end of this competition.

So this is a technique for dealing with time series and structured data. Interestingly, compared to the group that used this technique (Entity Embeddings of Categorical Variables), the second place winner did way more feature engineering. The winners of this competition were actually subject matter experts in logistics sales forecasting, so they had their own code to create lots and lots of features. Folks at Pinterest who built a very similar model for recommendations also said that when they switched from gradient boosting machines to deep learning, they did way less feature engineering and it was a much simpler model which required less maintenance. So this is one of the big benefits of using this approach to deep learning: you can get state of the art results with a lot less work.

Question: Are we using any time series in any of these? [01:15:01] Indirectly, yes. As we just saw, we have day of week, month of year, etc. in our columns and most of them are being treated as categories, so we are building a distributed representation of January, Sunday, and so on. We are not using any classic time series techniques; all we are doing is true fully connected layers in a neural net. The embedding matrix is able to deal with things like day of week periodicity in a much richer way than any standard time series technique.

Question regarding the difference between image models and this model [01:15:59]: There is a difference in the way we are calling get_learner. For images we just called ConvLearner.pretrained and passed the data:

learn = ConvLearner.pretrained(arch, data, ps=0., precompute=True)

For these kinds of models, in fact for a lot of models, the model we build depends on the data. In this case, we need to know what embedding matrices we have. So in this case, the data object creates the learner (upside down from what we have seen before):

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)

Summary of steps (if you want to use this for your own dataset) [01:17:56]:

Step 1. List categorical variable names, and list continuous variable names, and put them in a Pandas data frame

Step 2. Create a list of which row indexes you want in your validation set

Step 3. Call this exact line of code:

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, 
yl.astype(np.float32), cat_flds=cat_vars, bs=128,

Step 4. Create a list of how big you want each embedding matrix to be

Step 5. Call get_learner — you can use these exact parameters to start with:

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1,
[1000,500], [0.001,0.01], y_range=y_range)

Step 6. Train the learner

Question: How to use data augmentation for this type of data, and how does dropout work? [01:18:59] No idea. Jeremy thinks it has to be domain-specific, but he has never seen any paper or anybody in industry doing data augmentation with structured data and deep learning. He thinks it can be done but has not seen it done. What dropout is doing is exactly the same as before.

Question: What is the downside? Almost no one is using this. Why not? [01:20:41] Basically the answer is, as we discussed before, that almost no one in academia is working on this because it is not something that people publish on. As a result, there have not been really great examples people could look at and say “oh here is a technique that works well so let’s have our company implement it”. But perhaps equally importantly, until now with this library, there has not been any way to do it conveniently. If you wanted to implement one of these models, you had to write all the custom code yourself. There are a lot of big commercial and scientific opportunities to use this and solve problems that previously haven’t been solved very well.

Natural Language Processing [01:23:37]

NLP is the most up-and-coming area of deep learning, and it is two or three years behind computer vision. The state of the software and some of the concepts are much less mature than they are for computer vision. One of the things you find in NLP is that there are particular problems you can solve and they have particular names. There is a particular kind of problem in NLP called “language modeling” and it has a very specific definition: it means building a model that, given a few words of a sentence, predicts what the next word is going to be.

Language Modeling [01:25:48]


Here we have 18 months’ worth of papers from arXiv, and this is an example:

' '.join(md.trn_ds[0].text[:150])
'<cat> csni <summ> the exploitation of mm - wave bands is one of the key - enabler for 5 g mobile \n radio networks . however , the introduction of mm - wave technologies in cellular \n networks is not straightforward due to harsh propagation conditions that limit \n the mm - wave access availability . mm - wave technologies require high - gain antenna \n systems to compensate for high path loss and limited power . as a consequence , \n directional transmissions must be used for cell discovery and synchronization \n processes : this can lead to a non - negligible access delay caused by the \n exploration of the cell area with multiple transmissions along different \n directions . \n    the integration of mm - wave technologies and conventional wireless access \n networks with the objective of speeding up the cell search process requires new \n'
  • <cat> — category of the paper. CSNI is Computer Science and Networking
  • <summ> — abstract of the paper

Here is what the output of a trained language model looks like. We did simple little tests in which you pass some priming text and see what the model thinks should come next:

sample_model(m, "<CAT> csni <SUMM> algorithms that")
...use the same network as a single node are not able to achieve the same performance as the traditional network - based routing algorithms . in this paper , we propose a novel routing scheme for routing protocols in wireless networks . the proposed scheme is based ...

It learned by reading arXiv papers that somebody who is writing about computer networking would talk like this. Remember, it started out not knowing English at all. It started out with an embedding matrix for every word in English that was random. By reading lots of arXiv papers, it learned what kind of words followed others.

Here we tried specifying a category to be computer vision:

sample_model(m, "<CAT> cscv <SUMM> algorithms that")
...use the same data to perform image classification are increasingly being used to improve the performance of image classification algorithms . in this paper , we propose a novel method for image classification using a deep convolutional neural network ( cnn ) . the proposed method is ...

It not only learned how to write English pretty well, but also after you say something like “convolutional neural network” you should then use parenthesis to specify an acronym “(CNN)”.

sample_model(m,"<CAT> cscv <SUMM> algorithms. <TITLE> on ")
...the performance of deep learning for image classification <eos>
sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> on ")
...the performance of wireless networks <eos>
sample_model(m,"<CAT> cscv <SUMM> algorithms. <TITLE> towards ")
...a new approach to image classification <eos>
sample_model(m,"<CAT> csni <SUMM> algorithms. <TITLE> towards ")
...a new approach to the analysis of wireless networks <eos>

A language model can be incredibly deep and subtle, so we are going to try and build that, not because we care about this at all, but because we are trying to create a pre-trained model which is used to do some other tasks. For example, given IMDB movie reviews, we will figure out whether they are positive or negative. It is a lot like cats vs. dogs: a classification problem. So we would really like to use a pre-trained network which at least knows how to read English. So we will train a model that predicts the next word of a sentence (i.e. a language model), and just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative.

IMDB [1:31:11]


What we are going to do is to train a language model, making that the pre-trained model for a classification model. In other words, we are trying to leverage exactly what we learned in our computer vision which is how to do fine-tuning to create powerful classification models.

Question: Why would doing directly what you want to do not work? [01:31:34] Empirically, it just turns out that it doesn’t. There are several reasons. First of all, we know fine-tuning a pre-trained network is really powerful. So if we can get it to learn some related task first, then we can use all that information to help it on the second task. The other is that IMDB movie reviews are up to a thousand words long. So after reading a thousand words, knowing nothing about how English is structured or even the concept of a word or punctuation, all you get is a 1 or a 0 (positive or negative). Trying to learn the entire structure of English and then how it expresses positive and negative sentiment from a single number is just too much to expect.

Question: Is this similar to Char-RNN by Karpathy? [01:33:09] This is somewhat similar to Char-RNN, which predicts the next letter given a number of previous letters. Language models generally work at the word level (but they do not have to), and we will focus on word-level modeling in this course.

Question: To what extent are these generated words/sentences actual copies of what it found in the training set? [01:33:44] The words are definitely ones it has seen before: it is not a character-level model, so it can only give us words it has seen. For sentences, there are rigorous ways of checking, but the easiest is to look at examples like the ones above to get a sense of it. Most importantly, when we train the language model, we will have a validation set, so we are trying to predict the next word of something it has never seen before. There are also tricks for using language models to generate text, such as beam search.

Use cases of text classification:

  • For a hedge fund, identify things in articles or Twitter that caused massive market drops in the past.
  • Identify customer service queries which tend to be associated with people who cancel their contracts in the next month
  • Organize documents into whether they are part of legal discovery or not.
from fastai.learner import *
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling
from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *
import dill as pickle
  • torchtext — PyTorch’s NLP library

Data [01:37:05]

IMDB Large Movie Review Dataset

PATH = 'data/aclImdb/'
TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
# Full paths used by the shell commands below
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
%ls {PATH}
imdbEr.txt  imdb.vocab  models/  README  test/  tmp/  train/

We do not have separate test and validation sets in this case. Just like in vision, the training directory has a bunch of files in it:

trn_files = !ls {TRN}

review = !cat {TRN}{trn_files[6]}
"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out lines. I mean, some of it didn't make sense with the rest of the flick, but who cares when you're laughing so hard! All in all the film wasn't the greatest thing since sliced bread, but I wasn't expecting that. It was a Troma flick so I figured it would totally suck. It's nice when something surprises you but not totally sucking.<br /><br />Rent it if you want to get stoned on a Friday night and laugh with your buddies. Don't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating.<br /><br />P.S. Uwe Boil was a nice touch."

Now we will check how many words are in the dataset:

!find {TRN} -name '*.txt' | xargs cat | wc -w
!find {VAL} -name '*.txt' | xargs cat | wc -w

Before we can do anything with text, we have to turn it into a list of tokens. A token is basically like a word. Eventually we will turn them into a list of numbers, but the first step is to turn the text into a list of words; this is called “tokenization” in NLP. A good tokenizer will do a good job of recognizing the pieces in your sentence. Each piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate. spaCy does a lot of NLP stuff, and it has the best tokenizer Jeremy knows. So the fastai library is designed to work well with the spaCy tokenizer, as is torchtext.
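To make the idea of tokenization concrete, here is a minimal sketch using only Python's re module. It is not spaCy's tokenizer, and the function name simple_tokenize is made up for illustration. It only shows the basic idea of lowercasing and separating punctuation from words; spaCy does considerably more (for example, it splits "wasn't" into "was" and "n't", as seen later in these notes).

```python
import re

def simple_tokenize(text):
    """Crude stand-in for a real tokenizer: lowercase the text, then split it
    into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("I really liked it anyway! The best")
print(tokens)
```

The exclamation mark comes out as its own token, which is the behaviour the paragraph above describes for punctuation.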

Creating a field [01:41:01]

A field is a definition of how to pre-process some text.

TEXT = data.Field(lower=True, tokenize=spacy_tok)
  • lower=True — lowercase the text
  • tokenize=spacy_tok — tokenize with spacy_tok

Now we create the usual model data object:

bs=64; bptt=70
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs,
bptt=bptt, min_freq=10)
  • PATH : as per usual where the data is, where to save models, etc
  • TEXT : torchtext’s Field definition
  • **FILES : dictionary of all of the files we have: training, validation, and test (to keep things simple, we do not have a separate validation and test set, so both point to the validation folder)
  • bs : batch size
  • bptt : Back Prop Through Time. It means how long a sentence we will stick on the GPU at once
  • min_freq=10 : In a moment, we are going to be replacing words with integers (a unique index for every word). If there are any words that occur fewer than 10 times, just call them unknown.

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which unique words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id.

# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']
2
itos is sorted by frequency except for the first two special tokens. Using vocab, torchtext will turn words into integer IDs for us:

TEXT.numericalize([md.trn_ds[0].text[:12]])
Variable containing:
…
[torch.cuda.LongTensor of size 12x1 (GPU 0)]
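As a rough illustration of what a vocabulary like TEXT.vocab holds, here is a minimal pure-Python sketch. It is an assumption about the general idea (special tokens first, then tokens by descending frequency, rare tokens mapped to unknown), not torchtext's actual implementation; build_vocab and numericalize are hypothetical helper names.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tokens)
    # Keep tokens seen at least min_freq times; break frequency ties alphabetically.
    frequent = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + frequent
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocab becomes <unk> (index 0)."""
    return [stoi.get(tok, 0) for tok in tokens]

tokens = "the film was good . the plot was thin . the end".split()
itos, stoi = build_vocab(tokens, min_freq=2)
ids = numericalize("the film was good".split(), stoi)
print(itos)
print(ids)
```

Here "film" and "good" appear only once, so with min_freq=2 they fall out of the vocabulary and are numericalized as index 0, the unknown token.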

Question: Is it common to do any stemming or lemmatizing? [01:45:47] Not really, no. Generally tokenization is what we want. To keep it as general as possible, we want to know what is coming next, so whether it is future tense or past tense, plural or singular, we don’t really know which things are going to be interesting and which are not. It seems generally best to leave the text alone as much as possible.

Question: When dealing with natural language, isn’t context important? Why are we tokenizing and looking at individual words? [01:46:38] No, we are not looking at individual words; they are still in order. Just because we replaced “I” with the number 12, the words are still in that order. There is a different way of dealing with natural language called “bag of words”, which does throw away the order and context. In the Machine Learning course, we will be learning about working with bag of words representations, but my belief is that they are no longer useful or on the verge of becoming no longer useful. We are starting to learn how to use deep learning to use context properly.

Batch size and BPTT [01:47:40]

What happens in a language model is even though we have lots of movie reviews, they all get concatenated together into one big block of text. So we predict the next word in this huge long thing which is all of the IMDB movie reviews concatenated together.

  • We split up the concatenated reviews into batches. In this case, we will split it to 64 sections
  • We then move each section underneath the previous one, and transpose it.
  • We end up with a matrix which is 1 million by 64.
  • We then grab a little chunk at a time, and those chunk lengths are approximately equal to BPTT. Here, we grab a little 70-long section, and that is the first thing we chuck into our GPU (i.e. the batch).
next(iter(md.trn_dl))
(Variable containing:
12 567 3 ... 2118 4 2399
35 7 33 ... 6 148 55
227 103 533 ... 4892 31 10
... ⋱ ...
19 8879 33 ... 41 24 733
552 8250 57 ... 219 57 1777
5 19 2 ... 3099 8 48
[torch.cuda.LongTensor of size 75x64 (GPU 0)], Variable containing:

[torch.cuda.LongTensor of size 4800 (GPU 0)])
  • We grab our first training batch by wrapping the data loader with iter and then calling next.
  • We got back a 75 by 64 tensor (approximately 70 rows but not exactly)
  • A neat trick torchtext does is to randomly change the bptt number every time so each epoch it is getting slightly different bits of text — similar to shuffling images in computer vision. We cannot randomly shuffle the words because they need to be in the right order, so instead, we randomly move their breakpoints a little bit.
  • The target value is also 75 by 64 but for minor technical reasons it is flattened out into a single vector.
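The splitting described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the fastai/torchtext code: batchify and get_batch are hypothetical names, lists stand in for tensors, and the random variation of bptt is left out.

```python
def batchify(ids, bs):
    """Split one long id stream into `bs` equal columns.

    Returns a list of rows; row r holds the r-th token of every column,
    so the result has shape (len(ids) // bs) x bs.
    """
    n = len(ids) // bs  # tokens per column; the leftover tail is dropped
    columns = [ids[c * n:(c + 1) * n] for c in range(bs)]
    return [[col[r] for col in columns] for r in range(n)]

def get_batch(matrix, start, bptt):
    """Take up to `bptt` rows as input; the target is the same text shifted
    by one token, flattened into a single vector."""
    seq_len = min(bptt, len(matrix) - 1 - start)
    x = matrix[start:start + seq_len]
    y_rows = matrix[start + 1:start + 1 + seq_len]
    y = [tok for row in y_rows for tok in row]
    return x, y

stream = list(range(20))          # pretend these are 20 token ids
matrix = batchify(stream, bs=4)   # 5 rows x 4 columns
x, y = get_batch(matrix, start=0, bptt=3)
print(x)
print(y)
```

Reading down any one column of the matrix gives consecutive text, which is why order is preserved. The flattened target has bptt × bs entries, matching the 75 × 64 = 4800 seen in the output above. As a sanity check on sizes, the 20,621,966 tokens reported for this dataset further down, split into 64 columns, give 20,621,966 // 64 = 322,218 tokens per column.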

Question: Why not split by sentence? [01:53:40] There is no real need. Remember, we are using columns. Each of our columns is of length about 1 million, so although it is true that those columns do not always finish exactly on a full stop, they are so darn long we do not care. Each column contains many sentences.

Pertaining to this question, Jeremy found what is in this language model matrix a little mind-bending for quite a while, so do not worry if it takes a while and you have to ask a thousand questions.

Create a model [01:55:46]

Now that we have a model data object that can feed us batches, we can create a model. First, we are going to create an embedding matrix.

Here are the: # batches; # unique tokens in the vocab; length of the dataset; # of words

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)
(4602, 34945, 1, 20621966)

This is what our embedding matrix looks like:

  • It is a high cardinality categorical variable and furthermore, it is the only variable — this is typical in NLP
  • The embedding size is 200 which is much bigger than our previous embedding vectors. Not surprising because a word has a lot more nuance to it than the concept of Sunday. Generally, an embedding size for a word will be somewhere between 50 and 600.
em_sz = 200  # size of each embedding vector
nh = 500 # number of hidden activations per layer
nl = 3 # number of layers
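To make the embedding matrix less abstract, here is a tiny pure-Python sketch: one row of numbers per token id, initialised randomly, with lookup being plain row indexing. The sizes are shrunk for illustration, and make_embedding and embed are hypothetical names; the real model uses a PyTorch embedding layer.

```python
import random

def make_embedding(n_tokens, em_sz, seed=0):
    """Random embedding matrix: one row of `em_sz` floats per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(em_sz)] for _ in range(n_tokens)]

def embed(matrix, ids):
    """Embedding lookup is just row indexing."""
    return [matrix[i] for i in ids]

emb = make_embedding(n_tokens=12, em_sz=4)   # tiny sizes for illustration
vectors = embed(emb, [2, 5, 2])              # the same id always gives the same row

# Number of weights in the matrix described above: vocab size times embedding size.
n_weights = 34945 * 200
print(n_weights)
```

With 34,945 tokens and an embedding size of 200, the embedding matrix alone has 6,989,000 weights to learn.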

Researchers have found that large amounts of momentum (which we’ll learn about later) don’t work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9. Any time you are doing NLP, you should probably include this line:

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

fastai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below; you just have to experiment…

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

learner = md.get_model(opt_fn, em_sz, nh, nl, dropouti=0.05,
                       dropout=0.05, wdrop=0.1, dropoute=0.02, …)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
  • In the last lecture, we will learn what the architecture is and what all these dropouts are. For now, the usual rule applies: if you build an NLP model and it is under-fitting, decrease all these dropouts; if it is overfitting, increase all of them in roughly this ratio. Since this is such a recent paper, there is not a lot of guidance, but these ratios worked well, and they are what Stephen has been using as well.
  • There is another way to avoid overfitting that we will talk about in the last class. For now, learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1) works reliably, so all of your NLP models probably want this particular line.
  • learner.clip=0.3 : when you look at your gradients and multiply them by the learning rate to decide how much to update your weights by, this will not allow them to be more than 0.3. This is a cool little trick to prevent us from taking too big of a step.
  • Details do not matter too much right now, so you can use them as they are.
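One common way to implement a limit like learner.clip=0.3 is to clip the norm of the gradient, which is what PyTorch's clip_grad_norm_ utility does. The sketch below assumes that style of clipping; the notes do not spell out the exact mechanism, and clip_by_norm is a hypothetical name operating on a plain list of numbers.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale `grads` so that their L2 norm is at most `max_norm`.

    Gradients that are already small enough are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=0.3)   # original norm is 5.0
print(clipped)
```

The direction of the gradient is kept and only its length is capped, so an occasional huge gradient cannot throw the weights far away in a single step.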

Question: There are word embeddings out there such as Word2vec or GloVe. How are they different from this? And why not initialize the weights with those? [02:02:29] People have pre-trained these embedding matrices before to do various other tasks. They are not called pre-trained models; they are just pre-trained embedding matrices, and you can download them. There is no reason we could not download them. I found that building a whole pre-trained model in this way did not seem to benefit much, if at all, from using pre-trained word vectors, whereas using a whole pre-trained language model made a much bigger difference. Maybe we can combine both to make them a little better still.

Question: What is the architecture of the model? [02:03:55] We will be learning about the model architecture in the last lesson but for now, it is a recurrent neural network using something called LSTM (Long Short Term Memory).

Fitting [02:04:24]

learner.fit(…, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
learner.save_encoder('adam1_enc')
learner.fit(…, 4, wds=1e-6, cycle_len=10, …)
learner.save_encoder('adam3_10_enc')
learner.fit(…, 1, wds=1e-6, cycle_len=20, …)

In the sentiment analysis section, we'll just need half of the language model - the encoder, so we save that part.


Language modeling accuracy is generally measured using the metric perplexity, which is simply exp() of the loss function we used.
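As a small worked example of that definition, the sketch below computes an average cross-entropy loss from the probabilities a model assigned to the correct next words, then exponentiates it. It assumes the loss uses natural logarithms (as PyTorch's cross-entropy does); the function names are just for illustration.

```python
import math

def cross_entropy(probs_of_correct_word):
    """Average negative log-probability assigned to the correct next words."""
    return -sum(math.log(p) for p in probs_of_correct_word) / len(probs_of_correct_word)

def perplexity(loss):
    """Perplexity is exp() of the average cross-entropy loss."""
    return math.exp(loss)

# A model that gives every correct word probability 1/50 is as uncertain
# as a uniform guess among 50 words, so its perplexity is (about) 50.
loss = cross_entropy([1 / 50] * 10)
ppl = perplexity(loss)
print(loss, ppl)
```

One way to read perplexity, then, is as the number of equally likely words the model is effectively choosing between at each step; lower is better, and a perfect model would score 1.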

pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Testing [02:04:53]

We can play around with our language model a bit to check it seems to be working OK. First, let’s create a short bit of text to ‘prime’ a set of predictions. We’ll use our torchtext field to numericalize it so we can feed it to our language model.

ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
m = learner.model
s = [spacy_tok(ss)]
t = TEXT.numericalize(s)
' '.join(s[0])
". So , it was n't quite was I was expecting , but I really liked it anyway ! The best"

We haven’t yet added methods to make it easy to test a language model, so we’ll need to manually go through the steps.

# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Let’s see what the top 10 predictions were for the next word after our short text:

nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

…and let’s see if our model can generate a bit more text all by itself!

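print(ss, "\n")
for i in range(50):
    # Take the top two predictions; if the best one is <unk> (index 0), use the runner-up
    n = res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))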
. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 
film ever ! <eos> i saw this movie at the toronto international film festival . i was very impressed . i was very impressed with the acting . i was very impressed with the acting . i was surprised to see that the actors were not in the movie . ...
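The generation loop above can be mimicked with a toy stand-in for the model. In this sketch, next_scores is a made-up lookup table of next-token scores given only the current token (the real model also carries a hidden state summarising everything before it), and top2 and generate are hypothetical helper names. It shows the same two ideas: always take the highest-scoring next token, and fall back to the runner-up when the best guess is <unk>.

```python
def top2(scores):
    """Indices of the two highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:2]

def generate(next_scores, itos, start_id, n_words):
    """Greedy generation: repeatedly take the best next token,
    falling back to the runner-up when the best is <unk> (id 0)."""
    out, current = [], start_id
    for _ in range(n_words):
        best, second = top2(next_scores[current])
        current = second if best == 0 else best
        out.append(itos[current])
    return out

itos = ["<unk>", "film", "ever", "!"]
# Hypothetical stand-in for the model: scores for the next token given the current one.
next_scores = {
    1: [0.5, 0.0, 0.4, 0.1],   # after "film": <unk> scores highest, so "ever" is used
    2: [0.1, 0.1, 0.0, 0.8],   # after "ever": "!"
    3: [0.0, 0.9, 0.1, 0.0],   # after "!": "film"
}
words = generate(next_scores, itos, start_id=1, n_words=4)
print(" ".join(words))
```

Because the choice is always the single most likely token, this kind of loop easily falls into cycles, which is also visible in the repeated "i was very impressed" in the real output above.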

Sentiment [02:05:09]

So we had pre-trained a language model and now we want to fine-tune it to do sentiment classification.

To use a pre-trained model, we will need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

sequential=False tells torchtext that a field should not be tokenized (in this case, we just want to store the single 'positive' or 'negative' label).

IMDB_LABEL = data.Field(sequential=False)

This time, we must not treat the whole thing as one big piece of text; every review is kept separate because each one has a different sentiment attached to it.

splits is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at lang_model-arxiv.ipynb to see how to define your own fastai/torchtext datasets.

splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')
t = splits[0].examples[0]
t.label, ' '.join(t.text[:16])
('pos', 'ashanti is a very 70s sort of film ( 1979 , to be precise ) .')

fastai can create a ModelData object directly from torchtext splits.

md2 = TextData.from_splits(PATH, splits, bs)

Now you can go ahead and call get_model that gets us our learner. Then we can load into it the pre-trained language model (load_encoder).

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, 
n_layers=nl, dropout=0.1, dropouti=0.4,
wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)

Because we’re fine-tuning a pretrained model, we’ll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

m3.freeze_to(-1)
m3.fit(…, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(…, 1, metrics=[accuracy], cycle_len=1)
[ 0.       0.45074  0.28424  0.88458]
[ 0.       0.29202  0.19023  0.92768]

We make sure all except the last layer is frozen. Then we train a bit, unfreeze it, and train a bit more. The nice thing is that once you have a pre-trained language model, it actually trains really fast.

m3.fit(…, 7, metrics=[accuracy], cycle_len=2, …)
[ 0.       0.29053  0.18292  0.93241]                        
[ 1. 0.24058 0.18233 0.93313]
[ 2. 0.24244 0.17261 0.93714]
[ 3. 0.21166 0.17143 0.93866]
[ 4. 0.2062 0.17143 0.94042]
[ 5. 0.18951 0.16591 0.94083]
[ 6. 0.20527 0.16631 0.9393 ]
[ 7. 0.17372 0.16162 0.94159]
[ 8. 0.17434 0.17213 0.94063]
[ 9. 0.16285 0.16073 0.94311]
[ 10. 0.16327 0.17851 0.93998]
[ 11. 0.15795 0.16042 0.94267]
[ 12. 0.1602 0.16015 0.94199]
[ 13. 0.15503 0.1624 0.94171]
m3.load_cycle('imdb2', 4)

A recent paper from Bradbury et al, Learned in translation: contextualized word vectors, has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem.

As you see, we just got a new state of the art result in sentiment analysis, decreasing the error from 5.9% to 5.5%! You should be able to get similarly world-class results on other NLP classification problems using the same basic steps.

There are many opportunities to further improve this, although we won’t be able to get to them until part 2 of this course.

  • For example we could start training language models that look at lots of medical journals and then we could make a downloadable medical language model that then anybody could use to fine-tune on a prostate cancer subset of medical literature.
  • We could also combine this with pre-trained word vectors
  • We could have pre-trained a Wikipedia corpus language model and then fine-tuned it into an IMDB language model, and then fine-tune that into an IMDB sentiment analysis model and we would have gotten something better than this.

There is a really fantastic researcher called Sebastian Ruder who is the only NLP researcher who has been really writing a lot about pre-training, fine-tuning, and transfer learning in NLP. Jeremy asked him why this is not happening more, and his view was that there is no software that makes it easy. Hopefully fastai will change that.

Collaborative Filtering Introduction [02:11:38]


Data available from

ratings = pd.read_csv(path+'ratings.csv')

The dataset looks like this:

It contains ratings by users. Our goal is to predict the rating for a user-movie combination we have not seen before.

movies = pd.read_csv(path+'movies.csv')

To make it more interesting, we will also actually download a list of movies so that we can interpret what is actually in these embedding matrices.

top_r = ratings.join(topUsers, rsuffix='_r', how='inner', …)
top_r = top_r.join(topMovies, rsuffix='_r', how='inner', …)
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, …)

This is what we are creating — this kind of cross tab of users by movies.
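To show the shape of that table without pandas, here is a small pure-Python sketch. It assumes, going by the names topUsers and topMovies, that the cross tab is restricted to the most active users and the most-rated movies; crosstab_top and the sample ratings are made up for illustration.

```python
from collections import Counter

def crosstab_top(ratings, n_users, n_movies):
    """ratings: iterable of (user_id, movie_id, rating) tuples.

    Keeps the `n_users` most active users and the `n_movies` most rated movies,
    and returns {user_id: {movie_id: rating}} for just those.
    """
    ratings = list(ratings)
    top_users = {u for u, _ in Counter(u for u, _, _ in ratings).most_common(n_users)}
    top_movies = {m for m, _ in Counter(m for _, m, _ in ratings).most_common(n_movies)}
    table = {}
    for user, movie, rating in ratings:
        if user in top_users and movie in top_movies:
            table.setdefault(user, {})[movie] = rating
    return table

ratings = [
    (1, 10, 4.0), (1, 20, 3.0), (1, 30, 5.0),
    (2, 10, 2.0), (2, 20, 4.5),
    (3, 10, 1.0),
]
table = crosstab_top(ratings, n_users=2, n_movies=2)
print(table)
```

Each outer key is a row (a user) and each inner key a column (a movie). In the real dataset most cells of the full table would be empty, and filling in those empty cells is exactly the prediction problem.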

Feel free to look ahead and you will find that most of the steps are familiar to you already.
