Deep Learning 2: Part 1 Lesson 6

Jan 10, 2018

My personal notes from fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Lesson 6

Optimization for Deep Learning Highlights in 2017

Table of contents: Deep Learning ultimately is about finding a minimum that generalizes well -- with bonus points for…

ruder.io

Review from last week [2:15]

We took a deep dive into collaborative filtering last week, and we ended up re-creating the EmbeddingDotBias class (column_data.py) from the fast.ai library. Let’s visualize what the embeddings look like [notebook].

Inside a learner learn, you can get the underlying PyTorch model with learn.model. It is defined with @property, so it is written like a regular method but accessed without parentheses.

@property
def model(self): return self.models.model
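
To make the @property behaviour concrete, here is a minimal standalone sketch. TinyLearner and TinyModels are hypothetical stand-ins invented for this example, not fast.ai classes; only the decorator behaviour is the point.

```python
class TinyModels:
    """Hypothetical stand-in for the wrapper object that holds the PyTorch model."""
    def __init__(self, model):
        self.model = model


class TinyLearner:
    """Hypothetical stand-in for a fast.ai learner."""
    def __init__(self, models):
        self.models = models

    @property
    def model(self):
        # Runs like a method, but is read like an attribute.
        return self.models.model


learn = TinyLearner(TinyModels(model="a PyTorch module would go here"))
print(learn.model)  # note: no parentheses
```

Calling learn.model() here would fail, because the property already returned the stored object and that object is not callable.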

learn.models is an instance of CollabFilterModel, a thin wrapper around the PyTorch model that adds “layer groups”. Layer groups are not a PyTorch concept; fast.ai uses them to apply different learning rates to different sets of layers.

PyTorch model prints out the layers nicely including layer name which is what we called them in the code.

m=learn.model; m
EmbeddingDotBias (
  (u): Embedding(671, 50)
  (i): Embedding(9066, 50)
  (ub): Embedding(671, 1)
  (ib): Embedding(9066, 1)
)

m.ib refers to an embedding layer for an item bias — movie bias, in our case. What is nice about PyTorch models and layers is that we can call them as if they are functions. So if you want to get a prediction, you call m(...) and pass in variables.

Layers require variables, not tensors, because they need to keep track of the derivatives; that is why V(...) is used to convert a tensor to a variable. PyTorch 0.4 will get rid of variables and we will be able to use tensors directly.

movie_bias = to_np(m.ib(V(topMovieIdx)))

The to_np function will take a variable or a tensor (regardless of being on the CPU or GPU) and returns a numpy array. Jeremy’s approach [12:03] is to use numpy for everything except when he explicitly needs something to run on the GPU or he needs its derivatives — in which case he uses PyTorch. Numpy has been around longer than PyTorch and works well with other libraries such as OpenCV, Pandas, etc.

A question regarding CPU vs. GPU in production. The suggested approach is to do inference on CPU, as it is more scalable and you do not need to put things in batches. You can move a model onto the CPU by typing m.cpu(), and similarly a variable by typing V(topMovieIdx).cpu() (going from CPU to GPU would be m.cuda()). If your server does not have a GPU, it will run inference on CPU automatically. For loading a saved model that was trained on GPU, take a look at this line of code in torch_imports.py:

def load_model(m, p): m.load_state_dict(torch.load(p, map_location=lambda storage, loc: storage))

Now that we have the movie bias for the top 3000 movies, let’s take a look at the ratings:

movie_ratings = [(b[0], movie_names[i]) for i,b in zip(topMovies,movie_bias)]

zip will allow you to iterate through multiple lists at the same time.

Worst movies

About the sorting key: Python has an itemgetter function (in the operator module), but a plain lambda is just one more character.

sorted(movie_ratings, key=lambda o: o[0])[:15]
[(-0.96070349, 'Battlefield Earth (2000)'),
 (-0.76858485, 'Speed 2: Cruise Control (1997)'),
 (-0.73675376, 'Wild Wild West (1999)'),
 (-0.73655486, 'Anaconda (1997)'),
 ...]

sorted(movie_ratings, key=itemgetter(0))[:15]
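
As a small self-contained illustration of zip and the two sorting keys, the sketch below uses toy data invented for this example (the names and bias values are not from the notebook).

```python
from operator import itemgetter

# Toy data, invented for illustration only (not taken from the notebook).
names = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
top_ids = [10, 20, 30]
biases = [[0.4], [-0.9], [0.1]]   # one-element rows, like the (n, 1) bias array

# zip walks both sequences in lockstep; b[0] unwraps the single bias value.
ratings = [(b[0], names[i]) for i, b in zip(top_ids, biases)]

by_lambda = sorted(ratings, key=lambda o: o[0])
by_getter = sorted(ratings, key=itemgetter(0))
print(by_lambda[0])
print(by_lambda == by_getter)
```

Both keys pick out the first element of each tuple, so the two sorted lists should be identical.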

Best movies

sorted(movie_ratings, key=lambda o: o[0], reverse=True)[:15]
[(1.3070084, 'Shawshank Redemption, The (1994)'),
 (1.1196285, 'Godfather, The (1972)'),
 (1.0844109, 'Usual Suspects, The (1995)'),
 (0.96578616, "Schindler's List (1993)"),
 ...]

Embedding interpretation [18:42]

Each movie has a 50-dimensional embedding, and it is hard to visualize a 50-dimensional space, so we will turn it into a three-dimensional one. We can compress dimensions using several techniques; here we use Principal Component Analysis (PCA), which is almost identical to Singular Value Decomposition (SVD) and is covered in detail in Rachel’s Computational Linear Algebra class.

movie_emb = to_np(m.i(V(topMovieIdx)))
movie_emb.shape
(3000, 50)

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_
movie_pca.shape
(3, 3000)

We will take a look at the first dimension “easy watching vs. serious” (we do not know what it represents but can certainly speculate by looking at them):

fac0 = movie_pca[0]
movie_comp = [(f, movie_names[i]) for f,i in zip(fac0, topMovies)]

sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
[(0.06748189, 'Independence Day (a.k.a. ID4) (1996)'),
 (0.061572548, 'Police Academy 4: Citizens on Patrol (1987)'),
 (0.061050549, 'Waterworld (1995)'),
 (0.057877172, 'Rocky V (1990)'),
 ...
]

sorted(movie_comp, key=itemgetter(0))[:10]
[(-0.078433245, 'Godfather: Part II, The (1974)'),
 (-0.072180331, 'Fargo (1996)'),
 (-0.071351372, 'Pulp Fiction (1994)'),
 (-0.068537779, 'Goodfellas (1990)'),
 ...
]

The second dimension “dialog driven vs. CGI”

fac1 = movie_pca[1]
movie_comp = [(f, movie_names[i]) for f,i in zip(fac1, topMovies)]

sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
[(0.058975246, 'Bonfire of the Vanities (1990)'),
 (0.055992026, '2001: A Space Odyssey (1968)'),
 (0.054682467, 'Tank Girl (1995)'),
 (0.054429606, 'Purple Rose of Cairo, The (1985)'),
 ...]

sorted(movie_comp, key=itemgetter(0))[:10]
[(-0.1064609, 'Lord of the Rings: The Return of the King, The (2003)'),
 (-0.090635143, 'Aladdin (1992)'),
 (-0.089208141, 'Star Wars: Episode V - The Empire Strikes Back (1980)'),
 (-0.088854566, 'Star Wars: Episode IV - A New Hope (1977)'),
 ...]

Plot

idxs = np.random.choice(len(topMovies), 50, replace=False)
X = fac0[idxs]
Y = fac1[idxs]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(topMovies[idxs], X, Y):
    plt.text(x,y,movie_names[i], color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

What actually happens when you say learn.fit ?

Entity Embeddings of Categorical Variables [24:42]

The second paper to talk about categorical embeddings. FIG. 1. caption should sound familiar as they talk about how entity embedding layers are equivalent to one-hot encoding followed by a matrix multiplication.
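
The equivalence between an embedding lookup and “one-hot encoding followed by a matrix multiplication” can be checked numerically. This is a minimal NumPy sketch with toy sizes chosen for illustration; it is not code from the paper or from fast.ai.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, n_factors = 5, 3          # toy sizes, chosen for illustration
emb = rng.normal(size=(n_categories, n_factors))   # the embedding matrix

idx = np.array([2, 0, 4])               # three categorical values as integer codes

# Route 1: one-hot encode, then matrix-multiply.
one_hot = np.eye(n_categories)[idx]     # shape (3, 5)
via_matmul = one_hot @ emb              # shape (3, 3)

# Route 2: index straight into the matrix, which is what an embedding layer does.
via_lookup = emb[idx]

print(np.allclose(via_matmul, via_lookup))
```

The lookup gives the same rows without ever building the one-hot matrix, which is why embedding layers are the cheaper way to express the same computation.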

The interesting thing they did was, they took the entity embeddings trained by a neural network, replaced each categorical variable with the learned entity embeddings, then fed that into Gradient Boosting Machine (GBM), Random Forest (RF), and KNN — which reduced the error to something almost as good as neural network (NN). This is a great way to give the power of neural net within your organization without forcing others to learn deep learning because they can continue to use what they currently use and use the embeddings as input. GBM and RF train much faster than NN.

They also plotted the embeddings of states in Germany which interestingly (“whackingly enough” as Jeremy would call it) resembled an actual map.

They also plotted the distances of stores in physical space and embedding space — which showed a beautiful and clear correlation.

There also seems to be a correlation between days of the week, or months of the year. Visualizing embeddings can be interesting as it shows you what you expected to see, or what you did not.

A question about Skip-Gram to generate embeddings [31:31]

Skip-Gram is specific to NLP. A good way to turn an unlabeled problem into a labeled problem is to “invent” labels. Word2Vec’s approach, as described in the lecture, was to take a sentence of 11 words, delete the middle word, and replace it with a random word. They gave a label of 1 to the original sentence and 0 to the fake one, and built a machine learning model to find the fake sentences. As a result, they now have embeddings they can use for other purposes. If you do this with a single matrix multiplication (a shallow model) rather than a deep neural net, you can train it very quickly. The disadvantage is that it is a less predictive model; the advantages are that you can train on a very large dataset and, more importantly, that the resulting embeddings have linear characteristics which allow us to add, subtract, or plot them nicely. In NLP, we should move past Word2Vec and GloVe (i.e. linear-based methods) because these embeddings are less predictive. State-of-the-art language models use deep RNNs.
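
The “invented label” idea can be sketched in a few lines. The function below is a hypothetical illustration of the fake task as described in these notes (real window = 1, window with a random middle word = 0); it is not the Word2Vec implementation, and it uses 5-word windows instead of 11 to keep the output short.

```python
import random

def make_fake_task(sentences, vocab, seed=0):
    """Return (words, label) pairs: label 1 for real windows, 0 for corrupted ones."""
    rng = random.Random(seed)
    examples = []
    for words in sentences:
        mid = len(words) // 2
        examples.append((list(words), 1))            # original window
        # Pick a replacement that differs from the true middle word.
        candidates = [w for w in vocab if w != words[mid]]
        fake = list(words)
        fake[mid] = rng.choice(candidates)
        examples.append((fake, 0))                   # corrupted window
    return examples

# Toy windows, invented for illustration.
windows = [["the", "cat", "sat", "on", "mats"], ["a", "dog", "ran", "to", "me"]]
vocab = sorted({w for s in windows for w in s})
data = make_fake_task(windows, vocab)
for words, label in data:
    print(label, " ".join(words))
```

A classifier trained to separate the two labels has to learn something about which words fit which contexts, and that knowledge ends up in its embeddings.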

To learn any kind of feature space, you either need labeled data or you need to invent a fake task [35:45]

Is one fake task better than another? Not well studied yet.
Intuitively, we want a task which helps a machine to learn the kinds of relationships that you care about.
In computer vision, a type of fake task people use is to apply unreal and unreasonable data augmentations.
If you can’t come up with great fake tasks, just use crappy one — it is often surprising how little you need.
Autoencoder [38:10]: it recently won an insurance claim competition. Take a single policy, run it through a neural net, and have it reconstruct itself (make sure that the intermediate layers have fewer activations than the input). Basically, it is a task whose input = output, which works surprisingly well as a fake task.

In computer vision, you can train on cats and dogs and use it for CT scans. Maybe it might work for language/NLP! (future research)

Rossmann [41:04]

A way to use test set properly was added to the notebook.
For more detailed explanations, see Machine Learning course.
apply_cats(joined_test, joined) is used to make sure that the test set and the training set have the same categorical codes.
Keep track of mapper which contains the mean and standard deviation of each continuous column, and apply the same mapper to the test set.
Do not rely on Kaggle public board — rely on your own thoughtfully created validation set.
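
The point about reusing the training set’s mean and standard deviation on the test set can be sketched as follows. This is a plain NumPy illustration with made-up data and a hypothetical scale helper; it is not the fast.ai mapper itself.

```python
import numpy as np

rng = np.random.default_rng(0)
train_col = rng.normal(loc=50.0, scale=10.0, size=1000)   # toy training column
test_col = rng.normal(loc=50.0, scale=10.0, size=200)     # toy test column

# "Fit" on the training set only: remember its mean and standard deviation.
stats = {"mean": train_col.mean(), "std": train_col.std()}

def scale(col, stats):
    """Standardize a column with previously saved statistics."""
    return (col - stats["mean"]) / stats["std"]

train_scaled = scale(train_col, stats)
test_scaled = scale(test_col, stats)    # reuses the training statistics

print(round(float(train_scaled.mean()), 6), round(float(train_scaled.std()), 6))
print(round(float(test_scaled.mean()), 6), round(float(test_scaled.std()), 6))
```

The test column will generally not come out with exactly zero mean and unit variance, and that is intended: the model must see test values on the same scale it was trained on.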

Going over a good Kernel for Rossmann

Sunday effect on sales

There is a jump in sales before and after a store closing. The 3rd place winner deleted the closed-store rows before starting any analysis.

Don’t touch your data until you have first analyzed it to check that what you are doing is okay. Make no assumptions.

Vim tricks [49:12]

:tag ColumnarModelData will take you to the class definition
ctrl + ] will take you to a definition of what’s under the cursor
ctrl + t to go back
* to find the usage of what’s under the cursor
You can switch between tabs with :tabn and :tabp. With :tabe <filepath> you can add a new tab, and with a regular :q or :wq you close a tab. If you map :tabn and :tabp to your F7/F8 keys you can easily switch between files.

Inside of ColumnarModelData [51:01]

Slowly but surely, what used to be just “magic” starts to look familiar. As you can see, get_learner returns a Learner, which is the fast.ai concept that wraps the data and the PyTorch model:

Inside MixedInputModel you can see how it creates the Embedding layers, which we now know more about. nn.ModuleList is used to register a list of layers. We will talk about BatchNorm next week, but the rest we have seen before.

Similarly, we now understand what’s going on in the forward function.

call embedding layer with ith categorical variable and concatenate them all together
put that through dropout
go through each one of our linear layers, call it, apply relu and dropout
then final linear layer has a size of 1
if y_range is passed in, apply sigmoid and fit the output within a range (which we learned last week)

Stochastic Gradient Descent — SGD [59:56]

To make sure we are totally comfortable with SGD, we will use it to learn y = ax + b . If we can solve something with 2 parameters, we can use the same technique to solve 100 million parameters.

# Here we generate some fake data
def lin(a,b,x): return a*x+b

def gen_fake_data(n, a, b):
    x = np.random.uniform(0,1,n)
    y = lin(a,b,x) + 0.1 * np.random.normal(0,3,n)
    return x, y

x, y = gen_fake_data(50, 3., 8.)

plt.scatter(x,y, s=8); plt.xlabel("x"); plt.ylabel("y");

To get started, we need a loss function. This is a regression problem since the output is continuous, and the most common loss function for regression is the mean squared error (MSE).

Regression — the target output is a real number or a whole vector of real numbers
Classification — the target output is a class label

def mse(y_hat, y): return ((y_hat - y) ** 2).mean()

def mse_loss(a, b, x, y): return mse(lin(a,b,x), y)

y_hat — predictions

We will generate 10,000 more fake data points and turn them into PyTorch variables, because Jeremy doesn’t like taking derivatives and PyTorch can do that for him:

x, y = gen_fake_data(10000, 3., 8.) 
x,y = V(x),V(y)

Then create random weights for a and b. They are the variables we want to learn, so set requires_grad=True.

a = V(np.random.randn(1), requires_grad=True) 
b = V(np.random.randn(1), requires_grad=True)

Then set the learning rate and do 10,000 epochs of full gradient descent (not SGD, as each epoch looks at all of the data):

learning_rate = 1e-3
for t in range(10000):
    # Forward pass: compute predicted y using operations on Variables
    loss = mse_loss(a,b,x,y)
    if t % 1000 == 0: print(loss.data[0])
    
    # Computes the gradient of loss with respect to all Variables with requires_grad=True.
    # After this call a.grad and b.grad will be Variables holding the gradient
    # of the loss with respect to a and b respectively
    loss.backward()
    
    # Update a and b using gradient descent; a.data and b.data are Tensors,
    # a.grad and b.grad are Variables and a.grad.data and b.grad.data are Tensors
    a.data -= learning_rate * a.grad.data
    b.data -= learning_rate * b.grad.data
    
    # Zero the gradients
    a.grad.data.zero_()
    b.grad.data.zero_()

calculate the loss (remember, a and b are set to random initially)
from time to time (every 1000 epochs), print out the loss
loss.backward() will calculate gradients for all variables with requires_grad=True and fill in .grad property
update a to whatever it was minus LR * grad ( .data accesses a tensor inside of a variable)
when there are multiple loss functions or many output layers contributing to the gradient, PyTorch adds them together. So you need to tell it when to set the gradients back to zero (the _ in zero_() means that the variable is changed in-place).
the last 4 lines of code are what optim.SGD wraps up for you: the two updates correspond to its step function, and the zeroing corresponds to zero_grad

Let’s do this with just Numpy (without PyTorch) [1:07:01]

We actually have to do calculus, but everything else should look similar:

x, y = gen_fake_data(50, 3., 8.)

a_guess,b_guess = -1., 1.
mse_loss(a_guess, b_guess, x, y)

lr=0.01
def upd():
    global a_guess, b_guess
    y_pred = lin(a_guess, b_guess, x)
    # despite the names, these are derivatives of the squared error, not of y
    dydb = 2 * (y_pred - y)
    dyda = x*dydb
    a_guess -= lr*dyda.mean()
    b_guess -= lr*dydb.mean()
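
To see the hand-written update converge, here is a self-contained NumPy sketch that runs the same step in a loop. The random seed, the number of steps, and the renamed gradient variables (dlda, dldb) are choices made for this example; np.polyfit is used only as a reference answer.

```python
import numpy as np

def lin(a, b, x):
    return a * x + b

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 50)
y = lin(3.0, 8.0, x) + 0.1 * rng.normal(0, 3, 50)   # same recipe as gen_fake_data

a_guess, b_guess = -1.0, 1.0
lr = 0.01

for _ in range(20000):
    y_pred = lin(a_guess, b_guess, x)
    dldb = 2 * (y_pred - y)      # d(squared error)/db for each point
    dlda = x * dldb              # chain rule: d(squared error)/da
    a_guess -= lr * dlda.mean()
    b_guess -= lr * dldb.mean()

# Closed-form least-squares fit, used only as a reference answer.
a_ref, b_ref = np.polyfit(x, y, 1)
print(a_guess, b_guess)
print(a_ref, b_ref)
```

Gradient descent should land on the least-squares line for this noisy sample, which is close to, but not exactly, the true a = 3 and b = 8.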

Just for fun, you can use matplotlib.animation.FuncAnimation to animate:

Tip: The fast.ai AMI did not come with ffmpeg, so you may see KeyError: 'ffmpeg'.

Run print(animation.writers.list()) to print the list of available MovieWriters.
If ffmpeg is not among them, install it.

Recurrent Neural Network — RNN [1:09:16]

Let’s learn how to write philosophy like Nietzsche. This is similar to a language model we learned in lesson 4, but this time, we will do it one character at a time. RNN is no different from what we have already learned.

Some examples:

Basic NN with single hidden layer

All shapes are activations (an activation is a number that has been calculated by a relu, matrix product, etc.). An arrow is a layer operation (possibly more than one). Check out Machine Learning course lesson 9–11 for creating this from scratch.

Image CNN with single dense hidden layer

We will cover how to flatten a layer in more detail next week, but the main method is called “adaptive max pooling”, where we take the maximum across the height and the width and turn the result into a vector.

`batch_size` dimension and activation function (e.g. relu, softmax) are not shown here

Predicting char 3 using chars 1 & 2 [1:18:04]

We are going to implement this one for NLP.

Input can be one-hot-encoded character (length of vector = # of unique characters) or a single integer and pretend it is one-hot-encoded by using an embedding layer.
The difference from the CNN one is that the char 2 input then gets added.

layer operations not shown; remember arrows represent layer operations

Let’s implement this without torchtext or fast.ai library so we can see.

set will return all unique characters.

text = open(f'{PATH}nietzsche.txt').read()
print(text[:400])
'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself '

chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)
total chars: 85

Always good to put a null or an empty character for padding.

chars.insert(0, "\0")

Mapping of every character to a unique ID, and a unique ID to a character

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Now we can represent the text with its ID’s:

idx = [char_indices[c] for c in text]
idx[:10]
[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]
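
The same encoding steps can be tried on a tiny string to see the round trip from characters to IDs and back. The toy text below stands in for the Nietzsche corpus.

```python
text = "hello world"                      # toy text standing in for the corpus

chars = sorted(set(text))
chars.insert(0, "\0")                     # index 0 is reserved for padding

char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

idx = [char_indices[c] for c in text]
decoded = "".join(indices_char[i] for i in idx)
print(idx)
print(decoded)
```

Because the padding character sits at position 0, no real character is ever encoded as 0.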

Question: Character based model vs. word based model [1:22:30]

Generally, you want to combine character level model and word level model (e.g. for translation).
Character level model is useful when a vocabulary contains unusual words — which word level model will just treat as “unknown”. When you see a word you have not seen before, you can use a character level model.
There is also something in between that is called Byte Pair Encoding (BPE) which looks at n-gram of characters.

Create inputs [1:23:48]

cs = 3 
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

Note that c1_dat[n+1] == c4_dat[n] since we are skipping by 3 (the third argument of range)
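
The overlap between consecutive rows is easy to verify on a toy sequence. Here idx is a made-up list of integers standing in for the encoded text.

```python
idx = list(range(100, 120))   # toy stand-in for the encoded text
cs = 3

c1_dat = [idx[i]     for i in range(0, len(idx) - cs, cs)]
c2_dat = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
c3_dat = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
c4_dat = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]

# Each row is (input 1, input 2, input 3, target).
rows = list(zip(c1_dat, c2_dat, c3_dat, c4_dat))
print(rows[:3])
```

The target of one row is the first input of the next row, so each character outside the first window is used once as a target and once as a first input.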

x1 = np.stack(c1_dat) 
x2 = np.stack(c2_dat) 
x3 = np.stack(c3_dat) 
y = np.stack(c4_dat)

x’s are our inputs, y is our target value.

Build a model [1:26:08]

n_hidden = 256 
n_fac = 42

n_hidden: “# activations” in the diagram.
n_fac: the number of latent factors, i.e. the width of the embedding matrix.

Here is the updated version of the previous diagram. Notice that now arrows are colored. All the arrows with the same color will use the same weight matrix. The idea here is that a character would not have different meaning (semantically or conceptually) depending on whether it is the first, the second, or the third item in a sequence, so treat them the same.

class Char3Model(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)

    def forward(self, c1, c2, c3):
        in1 = F.relu(self.l_in(self.e(c1)))
        in2 = F.relu(self.l_in(self.e(c2)))
        in3 = F.relu(self.l_in(self.e(c3)))

        h = V(torch.zeros(in1.size()).cuda())
        h = F.tanh(self.l_hidden(h+in1))
        h = F.tanh(self.l_hidden(h+in2))
        h = F.tanh(self.l_hidden(h+in3))

        return F.log_softmax(self.l_out(h))

[1:29:58] It is important that this l_hidden uses a square weight matrix whose size matches the output of l_in. Then h and in2 will be the same shape allowing us to sum them together as you see in self.l_hidden(h+in2)
V(torch.zeros(in1.size()).cuda()) is only there to make the three lines identical to make it easier to put in a for loop later.

md = ColumnarModelData.from_arrays('.', [-1], np.stack([x1,x2,x3], axis=1), y, bs=512)

We will reuse ColumnarModelData [1:32:20]. If we stack x1, x2, and x3, we will get c1, c2, c3 in the forward method. ColumnarModelData.from_arrays comes in handy when you want to train a model in a more raw fashion: what you put in [x1, x2, x3], you get back in def forward(self, c1, c2, c3).

m = Char3Model(vocab_size, n_fac).cuda()

We create a standard PyTorch model (not Learner)
Because it is a standard PyTorch model, don’t forget .cuda

it = iter(md.trn_dl)
*xs,yt = next(it)
t = m(*V(xs))

iter to grab an iterator
next returns a mini-batch
“Variabize” the xs tensor, and put it through the model — which will give us 512x85 tensor containing prediction (batch size * unique character)

opt = optim.Adam(m.parameters(), 1e-2)

Create a standard PyTorch optimizer — for which you need to pass in a list of things to optimize, which is returned by m.parameters()

fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 0.001)
fit(m, md, 1, opt, F.nll_loss)

We do not have a learning rate finder or SGDR because we are not using Learner, so we need to do learning rate annealing manually (set the LR a little bit lower).

Testing a model [1:35:58]

def get_next(inp):
     idxs = T(np.array([char_indices[c] for c in inp]))
     p = m(*VV(idxs))
     i = np.argmax(to_np(p))
     return chars[i]

This function takes three characters and returns what the model predicts as the fourth. Note: np.argmax returns the index of the maximum value.

get_next('y. ')
'T'

get_next('ppl')
'e'

get_next(' th')
'e'

get_next('and')
' '

Let’s create our first RNN [1:37:45]

We can simplify the previous diagram as below:

Let’s implement this. This time, we will use the first 8 characters to predict the 9th. Here is how we create inputs and output just like the last time:

cs = 8
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]
xs = np.stack(c_in_dat, axis=0)
y = np.stack(c_out_dat)

xs[:cs,:cs]
array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

y[:cs]
array([ 1,  1, 43, 45, 40, 40, 39, 43])

Notice that there are overlaps (i.e. 0–7 to predict 8, 1–8 to predict 9).

val_idx = get_cv_idxs(len(idx)-cs-1)
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

Create the model [1:43:03]

class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

Most of the code is the same as before. You will notice that there is one for loop in forward function.

Hyperbolic Tangent (Tanh) [1:43:43]
It is a sigmoid that has been rescaled and shifted so that its output lies between -1 and 1. It is common to use tanh in the hidden-state-to-hidden-state transition because it stops the activations from flying off too high or too low. For other purposes, relu is more common.
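
The relationship between tanh and the sigmoid can be written as tanh(x) = 2·sigmoid(2x) − 1. A quick numerical check using only the standard library (the sigmoid helper is defined here for the example):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0   # stretch to (-1, 1) and recentre at 0
    print(x, math.tanh(x), rescaled)
```

The two columns agree to floating-point precision, which is the sense in which tanh is “a sigmoid that is offset”.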

This is now quite a deep network, as it uses 8 characters instead of 3. And as networks get deeper, they become harder to train.

m = CharLoopModel(vocab_size, n_fac).cuda() 
opt = optim.Adam(m.parameters(), 1e-2)
fit(m, md, 1, opt, F.nll_loss)
set_lrs(opt, 0.001)
fit(m, md, 1, opt, F.nll_loss)

Adding vs. Concatenating

We now will try something else for self.l_hidden(h+inp)[1:46:04]. The reason is that the input state and the hidden state are qualitatively different. Input is the encoding of a character, and h is an encoding of series of characters. So adding them together, we might lose information. Let’s concatenate them instead. Don’t forget to change the input to match the shape (n_fac+n_hidden instead of n_fac).

class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
            h = F.tanh(self.l_hidden(inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

This gives some improvement.

RNN with PyTorch [1:48:47]

PyTorch will write the for loop automatically for us and also the linear input layer.

class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

For reasons that will become apparent later on, self.rnn will return not only the output but also the hidden state.
The minor difference in PyTorch is that self.rnn will append a new hidden state to a tensor instead of replacing (in other words, it will give back all ellipses in the diagram) . We only want the final one so we do outp[-1]

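m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

# t must be the embedded input of shape (8, 512, n_fac), built as in forward();
# xs is a mini-batch of 8 index tensors taken from iter(md.trn_dl), as before
t = m.e(V(torch.stack(xs)))
ht = V(torch.zeros(1, 512,n_hidden))
outp, hn = m.rnn(t, ht)
outp.size(), hn.size()

(torch.Size([8, 512, 256]), torch.Size([1, 512, 256]))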

In the PyTorch version, the hidden state is a rank 3 tensor, h = V(torch.zeros(1, bs, n_hidden)) (in our version, it was a rank 2 tensor) [1:51:58]. We will learn more about this later, but it turns out you can have a second RNN that goes backwards. The idea is that it is going to be better at finding relationships that go backwards; this is called a “bi-directional RNN”. You can also have an RNN that feeds into another RNN, which is called a “multi-layer RNN”. For these RNNs, you need the additional axis in the tensor to keep track of the additional layers of hidden state. For now, we will just have 1 there, and get back 1.
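
To make the shapes concrete, here is a NumPy sketch of a single-layer tanh recurrence that collects every hidden state. It mirrors the shapes described above (all states stacked along the time axis, plus a final state with a leading axis of 1), but it is an illustration with toy sizes and randomly drawn weights, not PyTorch’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, bs, n_in, n_hid = 8, 4, 5, 6          # toy sizes

x = rng.normal(size=(seq_len, bs, n_in))        # (time steps, batch, features)
w_ih = rng.normal(size=(n_hid, n_in)) * 0.1     # input-to-hidden weights
w_hh = rng.normal(size=(n_hid, n_hid)) * 0.1    # hidden-to-hidden weights
b = np.zeros(n_hid)

h = np.zeros((bs, n_hid))                       # initial hidden state
outputs = []
for t in range(seq_len):
    # Same weights at every time step; only h changes.
    h = np.tanh(x[t] @ w_ih.T + h @ w_hh.T + b)
    outputs.append(h)

outp = np.stack(outputs)                        # every hidden state: (8, 4, 6)
h_n = h[None]                                   # final state with a leading axis: (1, 4, 6)
print(outp.shape, h_n.shape)
print(np.allclose(outp[-1], h_n[0]))
```

The last slice of the stacked output is the same as the final hidden state, which is why taking outp[-1] is enough when only the last prediction is needed.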

Test the model

def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

get_next_n('for thos', 40)
'for those the same the same the same the same th'

This time, we loop n times calling get_next each time, and each time we will replace our input by removing the first character and adding the character we just predicted.

For an interesting homework, try writing your own nn.RNN “JeremysRNN” without looking at PyTorch source code.

Multi-output [1:55:31]

From the last diagram, we can simplify even further by treating char 1 the same as char 2 to n-1. You notice the triangle (the output) also moved inside of the loop, in other words, we create a prediction after each character.

Predicting chars 2 to n using chars 1 to n-1

One of the reasons we may want to do this is the redundancies we had seen before:

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

We can make it more efficient by taking non-overlapping sets of characters this time. Because we are doing multi-output, for input chars 0 to 7, the output would be the predictions for chars 1 to 8.

xs[:cs,:cs]
array([[40, 42, 29, 30, 25, 27, 29,  1],
       [ 1,  1, 43, 45, 40, 40, 39, 43],
       [33, 38, 31,  2, 73, 61, 54, 73],
       [ 2, 44, 71, 74, 73, 61,  2, 62],
       [72,  2, 54,  2, 76, 68, 66, 54],
       [67,  9,  9, 76, 61, 54, 73,  2],
       [73, 61, 58, 67, 24,  2, 33, 72],
       [ 2, 73, 61, 58, 71, 58,  2, 67]])

ys[:cs,:cs]
array([[42, 29, 30, 25, 27, 29,  1,  1],
       [ 1, 43, 45, 40, 40, 39, 43, 33],
       [38, 31,  2, 73, 61, 54, 73,  2],
       [44, 71, 74, 73, 61,  2, 62, 72],
       [ 2, 54,  2, 76, 68, 66, 54, 67],
       [ 9,  9, 76, 61, 54, 73,  2, 73],
       [61, 58, 67, 24,  2, 33, 72,  2],
       [73, 61, 58, 71, 58,  2, 67, 68]])

This will not make our model any more accurate, but we can train it more efficiently.
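
The notes do not show how the non-overlapping inputs and shifted targets are built, so the following is only one plausible construction, shown on a made-up integer sequence with a shorter window (cs = 4) to keep the printout small.

```python
idx = list(range(100, 130))   # toy stand-in for the encoded text
cs = 4                        # shorter than 8 to keep the printout small

# Inputs start at 0, cs, 2*cs, ...; targets are the same windows shifted by one.
xs = [[idx[j + i] for i in range(cs)] for j in range(0, len(idx) - cs - 1, cs)]
ys = [[idx[j + i] for i in range(cs)] for j in range(1, len(idx) - cs, cs)]

for x_row, y_row in zip(xs[:3], ys[:3]):
    print(x_row, "->", y_row)
```

Each target row is its input row shifted left by one character, so every position in the window contributes a training signal and no window is repeated.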

class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

Notice that we are no longer doing outp[-1] since we want to keep all of them. But everything else is identical. One complexity [2:00:37] is that we want to use the negative log-likelihood loss function as before, but it expects a rank 2 tensor of predictions (a mini-batch of vectors) and a rank 1 tensor of targets. Here, the predictions are a rank 3 tensor:

8 characters (time steps)
512 items in the mini-batch
85 log-probabilities (one per character in the vocabulary)

Let’s write a custom loss function [2:02:10]:

def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size()
    targ = targ.transpose(0,1).contiguous().view(-1)
    return F.nll_loss(inp.view(-1,nh), targ)

F.nll_loss is the PyTorch loss function.
Flatten our inputs and targets.
Transpose the first two axes because PyTorch expects 1. sequence length (how many time steps), 2. batch size, 3. hidden state itself. yt.size() is 512 by 8, whereas sl, bs is 8 by 512.
PyTorch does not generally shuffle the memory order when you do things like ‘transpose’; instead it keeps some internal metadata so the tensor is treated as if it were transposed. If you ever see an error saying that a tensor is not contiguous, add .contiguous() after it and the error goes away.
.view is same as np.reshape. -1 indicates as long as it needs to be.
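
Why the transpose is needed before flattening can be seen with a small NumPy analogue. The arrays below are toy data built so that the “prediction” at each (time step, batch item) puts its highest score on the correct class; NumPy’s reshape stands in for .view here and, unlike .view, copies when the memory layout requires it.

```python
import numpy as np

sl, bs, nh = 3, 2, 4                       # toy sizes: time steps, batch, classes

# Targets arrive batch-first, shape (bs, sl).
targ = np.array([[0, 1, 2],
                 [3, 0, 1]])

# Build time-first "predictions" (sl, bs, nh) that score the right class highest.
inp = np.zeros((sl, bs, nh))
for t in range(sl):
    for b in range(bs):
        inp[t, b, targ[b, t]] = 1.0

flat_inp = inp.reshape(-1, nh)             # rows ordered t0b0, t0b1, t1b0, ...
wrong = targ.reshape(-1)                   # ordered b0t0, b0t1, ... (misaligned)
right = targ.T.reshape(-1)                 # transpose first, then flatten

print((flat_inp.argmax(1) == right).all(), (flat_inp.argmax(1) == wrong).all())
print(targ.T.flags["C_CONTIGUOUS"])        # the transposed view is not contiguous
```

Only the transposed-then-flattened targets line up row by row with the flattened predictions; flattening the targets directly would pair each prediction with the wrong character.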

fit(m, md, 4, opt, nll_loss_seq)

Remember that fit(...) is the lowest level fast.ai abstraction that implements the training loop. So all the arguments are standard PyTorch things except for md which is our model data object which wraps up the test set, the training set, and the validation set.

Question [2:06:04]: Now that we put a triangle inside of the loop, do we need a bigger sequence size?

If we have a short sequence like 8, the first character has nothing to go on. It starts with an empty hidden state of zeros.
We will learn how to avoid that problem next week.
The basic idea is “why should we reset the hidden state to zeros every time?” (see code below). If we can line up these mini-batches somehow so that the next mini-batch joins up correctly, representing the next letter in Nietzsche’s works, then we can move h = V(torch.zeros(1, bs, n_hidden)) to the constructor.

class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

Gradient Explosion [2:08:21]

self.rnn(inp, h) is a loop applying the same matrix multiply again and again. If that matrix multiply tends to increase the activations each time, we are effectively raising that increase to the power of 8; we call this a gradient explosion. We want to make sure the initial l_hidden will not cause our activations, on average, to increase or decrease.

A nice matrix that does exactly that is the identity matrix:

We can overwrite the randomly initialized hidden-hidden weight with an identity matrix:

m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))
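
The effect of repeating the same matrix multiply eight times can be illustrated with NumPy. This sketch deliberately ignores the nonlinearity and the input at each step, and the three matrices are toy choices made for the example, so it only shows the scaling argument, not the behaviour of a trained RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 16                                    # toy size
h0 = rng.normal(size=n_hidden)

def repeat_matmul(w, h, steps=8):
    """Apply the same matrix `steps` times, as the recurrent loop does."""
    for _ in range(steps):
        h = w @ h
    return h

matrices = {
    "grow": 1.5 * np.eye(n_hidden),      # scales activations up at each step
    "shrink": 0.5 * np.eye(n_hidden),    # scales activations down at each step
    "identity": np.eye(n_hidden),        # leaves activations unchanged
}

ratios = {}
for name, w in matrices.items():
    ratios[name] = np.linalg.norm(repeat_matmul(w, h0)) / np.linalg.norm(h0)
    print(name, ratios[name])
```

A per-step factor of 1.5 becomes roughly 25.6 after eight steps and a factor of 0.5 becomes roughly 0.004, while the identity matrix keeps the scale at exactly 1.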

This was introduced by Geoffrey Hinton et al. in 2015 (A Simple Way to Initialize Recurrent Networks of Rectified Linear Units), after RNNs had been around for decades. It works very well, and you can use a higher learning rate since it is well behaved.