Summary for Practical Tips from fast.ai Machine Learning Course — Part 3

Mei Leng
6 min read · Nov 4, 2018


source: opendatascience.com

This is my high-level summary of the machine learning course by Jeremy Howard. The focus is on practical tricks and tips for machine learning, in particular for random forests and basic neural networks. It is assumed that you already know the basic theory. Special thanks to Hiromi Suenaga for her wonderful, detailed notes on every lesson. Most of this summary draws on her notes (all the figures are from them).

  • part 1 for general knowledge in machine learning and tools
  • part 2 for random forest
  • part 3 for neural network

Neural Network Pre-Processing:

  1. Normalization/standardization is necessary; the training set and the validation set should use the same mean and standard deviation, computed from the training set (see the sketch after this list).
  2. Weight matrix initialization: use normally distributed random numbers divided by the number of rows of the weight matrix, which keeps the numbers at about the right scale and avoids gradient explosion.
  3. How to handle semi-supervised learning with unlabelled data? Perform data augmentation with an autoencoder, a smart trick from a Kaggle competition winner.
  4. For problems involving time series, when there are events in the series, it is better to create two new columns for each event [lesson 12]:

— How long it will be until the next time this event happens.

— How long it has been since the last time this event happened.
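
For items 1 and 2, here is a minimal sketch (the arrays x_train and x_valid and the helper get_weights are illustrative names, not taken from the course code):

import numpy as np
import torch
import torch.nn as nn

# 1. Standardize with statistics computed on the training set only,
#    then apply the SAME mean and std to the validation set
mean, std = x_train.mean(), x_train.std()
x_train = (x_train - mean) / std
x_valid = (x_valid - mean) / std

# 2. Initialize a weight matrix with normally distributed numbers divided
#    by the number of rows, which keeps activations at roughly the right scale
def get_weights(*dims):
    return nn.Parameter(torch.randn(dims) / dims[0])

The code below illustrates item 4, computing the two elapsed-time columns for a DataFrame df with Store, Date and event columns: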

import numpy as np
# df is assumed to be the store/date DataFrame loaded earlier in the notebook

def get_elapsed(fld, pre):
    # for each row, record the elapsed days since (or until) the event `fld`,
    # resetting the running state whenever a new store starts
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()   # NaT until the event is first seen
    last_store = 0
    res = []

    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:
            # a new store: reset everything
            last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        res.append((d - last_date).astype('timedelta64[D]') / day1)
    df[pre + fld] = res

fld = 'SchoolHoliday'
df = df.sort_values(['Store', 'Date'])
get_elapsed(fld, 'After')
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
get_elapsed(fld, 'Before')

The above example inserts elapsed-time columns for event fld, grouped by Store. In this case we basically want to say: grab the first store, the first school holiday, and the first date. For store 1 on January the first, school holiday was true or false. If it is a school holiday, keep track of that fact by recording that the last time a school holiday was seen was that date, and append how long it has been since the last school holiday. If the store ID is different from the last store ID, we have moved on to a whole new store, in which case everything has to be reset.

One thing to note is that iterating with zip over the underlying .values arrays is about 300 times faster than iterating with for row in df.iterrows() .
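
For comparison, a minimal sketch of the two iteration styles (process is a hypothetical placeholder for whatever per-row work is done):

# slow: iterrows builds a pandas Series object for every row
for idx, row in df.iterrows():
    process(row['Store'], row['SchoolHoliday'], row['Date'])

# fast: zip directly over the underlying NumPy arrays of each column
for s, v, d in zip(df.Store.values, df.SchoolHoliday.values, df.Date.values):
    process(s, v, d)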

More functions are available for pandas time-series data, such as rolling windows, Timestamp, date offsets and resampling; see the pandas time series API.
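
A minimal sketch of a few of those pandas tools, on a made-up daily series:

import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', periods=90, freq='D')
sales = pd.Series(np.random.rand(90), index=idx)   # hypothetical daily values

weekly = sales.resample('W').sum()        # resample daily data to weekly totals
smooth = sales.rolling(window=7).mean()   # 7-day rolling mean
shifted = sales.index + pd.DateOffset(days=1)   # shift every date by one day
ts = pd.Timestamp('2015-03-15')           # a single point in time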

The Simplest Neural Network Model using PyTorch and fast.ai:

Explanation: fit this network net to this data md, going over every image once per epoch for n_epochs epochs, using this loss function loss and this optimizer opt, and print out these metrics metrics.

from fastai.metrics import *
from fastai.model import *
from fastai.dataset import *

import torch.nn as nn
import torch.optim as optim

# instantiation: a single linear layer followed by log-softmax
net = nn.Sequential(
    nn.Linear(28*28, 10),
    nn.LogSoftmax()
).cuda()

# get the data loader from the (x, y) arrays
md = ImageClassifierData.from_arrays(path, (x, y), (x_valid, y_valid))

# training
loss = nn.NLLLoss()   # negative log-likelihood loss
metrics = [accuracy]
opt = optim.Adam(net.parameters())
fit(net, md, n_epochs=10, crit=loss, opt=opt, metrics=metrics)

# prediction on the validation set
preds = predict(net, md.val_dl)

This one-layer neural network is essentially a logistic regression classifier.

Basic Structure of A Fully-Connected Neural Network:

for epoch in range(n_epochs):
    dl = iter(md.trn_dl)
    for t in range(len(dl)):
        # Forward pass: compute predicted y and the loss by passing x
        # through the model
        xt, yt = next(dl)
        y_pred = net(V(xt))
        l = loss(y_pred, V(yt))

        # Before the backward pass, use the optimizer object to zero all
        # of the gradients for the variables it will update
        # (the learnable weights of the model)
        optimizer.zero_grad()

        # Backward pass: compute the gradient of the loss with respect
        # to the model parameters
        l.backward()

        # Calling step() on the optimizer updates its parameters
        optimizer.step()

    # evaluate on the validation set after each epoch
    # (score is whatever metric function is being tracked)
    val_dl = iter(md.val_dl)
    val_scores = [score(*next(val_dl)) for i in range(len(val_dl))]
    print(np.mean(val_scores))

PyTorch functions:

  • Parameter(...) : when used in the constructor of a network object, it lets nn.Module know that this tensor is something we want to optimize. When the optimizer is built from the network object, e.g. optim.Adam(net.parameters()), net.parameters() goes through everything created in the constructor, checks whether any of it is of type Parameter, and if so marks it as something to be trained by the optimizer.
  • Variable(...).cuda(), or the fast.ai shortcut V(...): it wraps the input in a Variable so that PyTorch knows to take derivatives with respect to it in the backward computation.
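
A minimal sketch of both ideas in plain PyTorch (LogReg and the shapes are illustrative, not the course code):

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter registers these tensors with the module,
        # so net.parameters() will return them
        self.w = nn.Parameter(torch.randn(28*28, 10) / (28*28))
        self.b = nn.Parameter(torch.zeros(10))

    def forward(self, x):
        return F.log_softmax(x @ self.w + self.b, dim=-1)

net = LogReg().cuda()
opt = optim.Adam(net.parameters())   # finds w and b because they are Parameters

x = Variable(torch.randn(64, 28*28)).cuda()   # the same as V(x) in fast.ai
y_pred = net(x)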

Regularization:

Regularization is useful for avoiding overfitting. For networks with a huge number of parameters, the risk of overfitting can be reduced by penalizing parameters for not being zero.

  • The weight decay in the SGD parameter update is a form of L2 regularization (see the sketch after this list).
  • L1 regularization has the property that it tries to make as many weights as possible exactly zero, whereas L2 regularization tends to make everything smaller. If two things are highly correlated, L2 regularization will shrink them both together; it won't make one of them zero and the other nonzero.
  • Essentially, with all other things being equal, regularization makes the score on the training set worse and the validation score better. Sometimes the first few epochs on the training set show the opposite; this depends on the shape of the underlying function, whether it is bumpy or smooth.
  • Regularization pushes the network weights towards zero, so it is important to figure out what happens when the weights are zero and to make sure that behaviour is sensible.
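
A minimal sketch of two equivalent ways to express L2 regularization with plain SGD, reusing net and loss from the earlier example (wd is the weight-decay coefficient):

import torch.optim as optim

wd = 1e-5

# option 1: let the optimizer apply weight decay inside the update step
opt = optim.SGD(net.parameters(), lr=1e-2, weight_decay=wd)

# option 2: add the L2 penalty to the loss explicitly
def regularized_loss(y_pred, y):
    l2 = sum((p ** 2).sum() for p in net.parameters())
    return loss(y_pred, y) + wd * l2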

Presumably, we've turned on regularization because we were overfitting…. So if there is some penalty, then my assertion is that we should penalize things that are different from our prior (this assumes that the prior is correct and reliable), not things that are different from zero.
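
As an illustration of that point, a hedged sketch that penalizes the distance of each weight from a prior value instead of from zero (priors is a hypothetical list of tensors, one per parameter, and loss is reused from the earlier example):

def prior_penalty_loss(y_pred, y, params, priors, wd=1e-5):
    # penalize squared distance from the prior values rather than from zero
    penalty = sum(((p - p0) ** 2).sum() for p, p0 in zip(params, priors))
    return loss(y_pred, y) + wd * penalty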

Secret to Modern Machine Learning:

The secret, in my opinion, to modern machine learning techniques is to massively over-parameterize the solution to your problem, as we just did (we have about 100,000 weights when we only had a small number of 28-by-28 images), and then use regularization.

Start by being generous with parameters, reduce their number gradually and less than you think you have to, and avoid overfitting through regularization.

Embedding:

An embedding makes multiplication by a one-hot encoded matrix faster by replacing it with a simple array lookup.

Entity embedding: a paper describing a winning solution of the Rossmann grocery competition; a similar idea had been written about earlier by Yoshua Bengio and his co-authors in another Kaggle competition, predicting taxi destinations.

With embeddings, categorical variables give the neural net more flexibility in how it can use them. Essentially, for a categorical variable whose categories are mapped to ordered integers, say [0,1,2,3,4,5,6] for dayofweek, an embedding matrix of dimension (7, 4) can be built, with 7 the cardinality and 4 the chosen embedding dimension; the categorical value then becomes an index into a lookup table, that is, Saturday = 6 means taking the 6-th row of the embedding matrix. The neural network learns this embedding matrix for you, given the correct settings.
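
A minimal sketch in plain PyTorch (not the fast.ai code) showing that the lookup and the one-hot multiplication give the same answer for the dayofweek example:

import torch
import torch.nn as nn

emb = nn.Embedding(7, 4)        # cardinality 7, embedding dimension 4
day = torch.tensor([6])         # Saturday mapped to the integer 6

via_lookup = emb(day)           # take the 6-th row of the embedding matrix

one_hot = torch.zeros(1, 7)     # equivalent, but slower, one-hot multiplication
one_hot[0, 6] = 1.0
via_matmul = one_hot @ emb.weight

assert torch.allclose(via_lookup, via_matmul)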

So what is an appropriate value for the embedding dimension? It is experimental. A typical value for NLP models is 600, while for practical problems Jeremy applies a rule of thumb of min(50, (c+1)//2), with c being the cardinality.

How do we train a neural network with categorical data? — Here is an example using fast.ai:

# instantiate the model data object
md = ColumnarModelData.from_data_frame(PATH, val_idx, df,
                                       yl.astype(np.float32), cat_flds=cat_vars,
                                       bs=128, test_df=df_test)

# set the embedding dimension for each categorical column
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
emb_szs = [(c, min(50, (c+1)//2)) for _, c in cat_sz]

# get the model
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
m.summary()

# train the model
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)

  • emb_szs: how big is each of our embeddings
  • len(df.columns)-len(cat_vars): how many continuous variables we have
  • [1000,500]: how many activations to create for each layer
  • [0.001,0.01]: what dropout to use for each layer
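
One note: lr is not defined in the snippet above. With the fast.ai 0.7 API used here, it is usually picked with the learner's learning-rate finder; a hedged sketch (the value 1e-3 is hypothetical, read off the plot):

m.lr_find()      # run the learning-rate finder
m.sched.plot()   # plot loss vs. learning rate and pick a value before the minimum
lr = 1e-3        # hypothetical value chosen from the plot
m.fit(lr, 3, metrics=[exp_rmspe], cycle_len=1)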

Useful Packages, Tutorials and Articles:
