[Notes] Neural Language Models with PyTorch

With Notebook Examples Runnable on Google Colab

Ceshine Lee
Oct 13, 2018


I was reading this paper titled “Character-Level Language Modeling with Deeper Self-Attention” by Al-Rfou et al., which describes some ways to use Transformer self-attention models to solve the language modeling problem. One big problem of Transformer models in this setting is that they cannot pass information from one batch to the next, so they have to make predictions based on limited contexts.

Theoretical Background

We’re not going to cover this in this post. But here are some resources for you if you’re interested:

  1. The Wikipedia page on Language Model
  2. Gentle Introduction to Statistical Language Modeling and Neural Language Models
  3. Language Model: A Survey of the State-of-the-Art Technology

Source Code

I forked the pytorch/examples GitHub repo, made some tiny changes, and added two notebooks. Here’s the link:

Dataset Preparation

This example comes with a copy of the WikiText-2 dataset. The texts have already been tokenized at the word level and split into train, validation, and test sets.


The resulting 4 x 9 tensor
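The batchify step that produces a tensor of this shape can be sketched in plain Python. This is only an illustration under stated assumptions: the 37-token toy corpus is made up here (the original input is not reproduced in this post), and plain lists stand in for torch tensors so the layout is easy to print. Rows follow the figure's layout (batch x time); the upstream example, as far as I can tell, stores the transpose of this.

```python
def batchify(tokens, batch_size):
    """Split a flat token list into `batch_size` contiguous rows of equal length."""
    row_len = len(tokens) // batch_size
    # Drop the trailing tokens that do not fit evenly into the rows.
    usable = tokens[: row_len * batch_size]
    return [usable[r * row_len:(r + 1) * row_len] for r in range(batch_size)]


# Hypothetical toy corpus of 37 token ids (an assumption for illustration).
corpus = list(range(37))
data = batchify(corpus, batch_size=4)

for row in data:
    print(row)
```

With 37 tokens and a batch size of 4, each row holds 9 consecutive tokens and the last token is dropped. Each row is an independent stream of text that the model reads from left to right.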


This example uses very basic GRU/LSTM/RNN models that you can learn about from any decent tutorial on recurrent neural networks. We’re not going to cover them in detail. You can read the source code here:

Training: Iterating Through Batches

The first batch: input tensor (4 x 3)
The first batch: target tensor (4 x 3)
Yellow — The second batch: input tensor (4 x 3). Green — inputs that have been seen by the model.
The second batch: target tensor (4 x 3)
Yellow — The third batch: input tensor (4 x 2). Green — inputs that have been seen by the model.
The third batch: target tensor (4 x 2)
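The slicing behind these three batches can be sketched as follows. It is a plain-Python approximation, not the repo's code: the 4 x 9 toy data and the `get_batch` signature are assumptions for illustration, and a window length (bptt) of 3 is inferred from the batch shapes in the captions above.

```python
def get_batch(data, start, bptt):
    """Return (inputs, targets) for the window beginning at column `start`.

    Targets are the inputs shifted one step to the right, so the last
    column of each row can only ever be used as a target.
    """
    row_len = len(data[0])
    seq_len = min(bptt, row_len - 1 - start)
    inputs = [row[start:start + seq_len] for row in data]
    targets = [row[start + 1:start + 1 + seq_len] for row in data]
    return inputs, targets


# Hypothetical 4 x 9 batchified data (4 rows of 9 consecutive token ids).
data = [list(range(r * 9, (r + 1) * 9)) for r in range(4)]
bptt = 3

shapes = []
for start in range(0, len(data[0]) - 1, bptt):
    inputs, targets = get_batch(data, start, bptt)
    shapes.append((len(inputs), len(inputs[0])))

print(shapes)  # → [(4, 3), (4, 3), (4, 2)]
```

The third batch is shorter because only 8 of the 9 columns have a next token to predict. With a recurrent model, the hidden state can be carried from one batch to the next, which is why the green cells count as context the model has already seen.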


Here’s the part that confused me the most. The batch size used in the evaluation phase clearly affects the evaluation result. Here’s an example to demonstrate this. Suppose the evaluation set consists of 5 series, each with the same length and the same values [1, 2, …, 8]. One of the better scenarios is using a batch size of 5:

Batchify scheme as a 5 x 8 tensor
A less ideal batchify scheme as a 6 x 6 tensor
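The two schemes can be reproduced with a small plain-Python sketch. The `batchify` helper below is an illustrative stand-in written for this example (lists instead of tensors, rows as batch x time), not the repo's function.

```python
def batchify(tokens, batch_size):
    """Split a flat token list into `batch_size` contiguous rows of equal length."""
    row_len = len(tokens) // batch_size
    usable = tokens[: row_len * batch_size]  # trailing tokens are dropped
    return [usable[r * row_len:(r + 1) * row_len] for r in range(batch_size)]


series = list(range(1, 9))   # one series: [1, 2, ..., 8]
eval_tokens = series * 5     # the 5 identical series, concatenated (40 tokens)

rows_bs5 = batchify(eval_tokens, 5)
rows_bs6 = batchify(eval_tokens, 6)
dropped_bs6 = len(eval_tokens) - 6 * len(rows_bs6[0])

for name, rows in (("batch size 5", rows_bs5), ("batch size 6", rows_bs6)):
    print(name)
    for row in rows:
        print("  ", row)
print("tokens dropped with batch size 6:", dropped_bs6)
```

With a batch size of 5, every row is exactly one full series. With a batch size of 6, the rows start at the values 1, 7, 5, 3, 1, 7, so four of the six rows begin in the middle of a series with no preceding context, and the last 4 tokens are never evaluated at all.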

Training and Evaluating Language Models on Google Colaboratory

Thanks to Google Colab, we can run the entire training and evaluation process for free on the Internet. The training time for the small model in the example is about 4 hours.

from google.colab import files


Thanks for reading! Please feel free to leave comments and give this post some claps if you find it useful.


Towards human-centered AI. https://veritable.pw
