Introduction to Language Modelling — Part 1

Hans Han
9 min read · Jan 29, 2023


Purpose

I’m sure you have recently come across ChatGPT from OpenAI, which can hold human-like conversations in both natural language and code. Under the hood, a large component of ChatGPT is based on language modelling: the ability to predict the next character/subword/word/sentence based on the previous ones.

This article is a summary of Andrej Karpathy's YouTube tutorial video introducing language modelling.

The spelled-out intro to language modelling: building makemore

I highly recommend watching the video yourself. Throughout the video, Andrej encourages the audience to learn independently using external resources such as Wikipedia for concept definitions, the PyTorch website for code documentation, and Stack Overflow for debugging errors.

PS: At the time of writing this article, Andrej has just released a more advanced tutorial on building GPT (Generative Pre-trained Transformer, which in its third iteration became GPT-3 and largely powers ChatGPT) here. This can be viewed as the end goal of this series of tutorials on language modelling (bearing in mind the GPT architecture is now widely used across modalities, not just language but also vision and speech).

Series Overview

The topic of this series of videos is how one can generate more of the same type of text using different approaches, hence the name ‘makemore’. I also think of these as sequence/text generation (or Natural Language Generation) models that are trained on observed (training) data. The roadmap for this series of videos on language modelling covers the following approaches, in order (referenced on this GitHub page):

  1. Bigram (either look-up table based or neural network based approach using counts of consecutive characters; focus of this tutorial)
  2. Multi-layer Perceptron (MLP)
  3. Convolutional Neural Network (CNN)
  4. Recurrent Neural Network (RNN)
  5. Long Short Term Memory (LSTM)
  6. Gated Recurrent Units (GRU)
  7. Transformer

On other dimensions, Andrej mentioned that in terms of granularity the series would start off with character-level language models before moving towards word-level ones, whereas in terms of modality it would eventually shift away from pure text towards text-to-image models such as DALL-E and Stable Diffusion.

Video Overview

This first video of the series explains the basics of generating sequences with a Bigram model, either through a look-up table of counts or through a simple neural network. Here, bigrams are pairs of characters that follow one another, so we consider the context of one character and use it to predict the next. The tutorial can be broken down into the following areas:

  1. Read in and explore the dataset based on an input text file
  2. Preprocess data to prepare for models in the upcoming steps
  3. Modelling approach 1: Count-based model
  4. Modelling approach 2: Simple neural network model

1. Reading and Exploring Data

The dataset we’re looking into is a text file with one lowercase person’s name per line, starting with ‘emma’, ‘olivia’ and ‘ava’. This is a dataset containing some of the most common names. The link to the dataset in the GitHub repository is here. There are a total of roughly 32K names, with the shortest and longest names consisting of 2 and 15 letters respectively.

Code:

The code for reading in the names.txt input file and storing the names as a list in the variable words is outlined below.

words = open('names.txt', 'r').read().splitlines()
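
As a quick illustrative check that the file has been read correctly (this snippet is my own addition rather than part of the article's steps), we can print a few simple statistics that match the figures quoted above:

print(len(words))                 # total number of names (roughly 32K)
print(words[:3])                  # ['emma', 'olivia', 'ava']
print(min(len(w) for w in words)) # length of the shortest name (2)
print(max(len(w) for w in words)) # length of the longest name (15)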

2. Preprocessing Data

Counts dictionary

Steps:

Iterate over all names in the data and store every consecutive character pair in a dictionary, with the keys being the tuples of character pairs and the values being the counts of how often each pair occurs. Start and stop tokens are added using the dot symbol ‘.’.

Example illustrating the iteration over the name ‘emma’ and storing the tuples of characters into a dictionary N

The entire process is outlined below: the outer loop iterates over all names (each name represented by w) in the list words, while the inner loop iterates over each consecutive pair of characters ch1 and ch2.

Code:

N = {}

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        N[bigram] = N.get(bigram, 0) + 1
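
Once the dictionary is populated, one way to see which character pairs dominate the data is to sort it by count (an illustrative snippet, not one of the tutorial steps above):

# show the most frequent bigrams and their counts, highest first
most_common = sorted(N.items(), key=lambda kv: kv[1], reverse=True)
print(most_common[:5])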

Tokenisation

Tokenisation is the process of splitting text data into meaningful chunks at an appropriate level of granularity so they can be passed to downstream tasks/models. Common tokenisation levels include word, subword (one popular method being Byte Pair Encoding or BPE) and character. The level chosen for tokenisation affects what the inputs and outputs look like. Here we use character-level tokenisation, as the task at hand is next-character prediction. After tokenising to the appropriate level we then convert the tokens into numerical representations for ease of downstream processing: indexing of the lookup table in the case of the count-based model, and additional encoding for the neural network based model.

Steps:

  1. Create a set of unique characters (called tokens) — 26 letters of the alphabet (lowercase only) plus the dot ‘.’ character (a special token signifying the start and end of a name) for a total of 27 tokens in the vocabulary.
  2. Sort this list of tokens in alphabetical order.
  3. Create a dictionary of the string / character to index mapping (key being the string and value being the index) with the special token character ‘.’ being indexed at 0 followed by ‘a’ at index 1 and so on until the last index ‘z’ at 26. This dictionary is stored in variable stoi.
    - stoi = {‘.’ : 0, ‘a’ : 1, …, ‘z’ : 26}
  4. Reverse the index mapping of the dictionary stoi above to another dictionary called itos.
    - itos = {0 : ‘.’, 1 : ‘a’, …, 26 : ‘z’}

Code:

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
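
A quick illustrative check of the two mappings (the exact indices follow from the alphabetical ordering described above):

print(stoi['a'], stoi['e'], stoi['m']) # 1 5 13
print(itos[0])                         # '.' - the special start/end token
print(len(stoi))                       # 27 tokens in the vocabulary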

3. Count-based Bigram Model

The end product of this approach is a look-up table indexed by the previous character, where each row gives the probability of every possible next character based on past observations; for prediction we pick the next character with the maximum probability. In this step we start to use the PyTorch package. Note how we repeat many bits of section 2 (preprocessing) as we move away from a base Python dictionary towards a PyTorch tensor for storing the counts.

Steps:

  1. Initialise a matrix N of zeros with dimensions 27 by 27.
  2. Convert the dictionary counts into a PyTorch tensor (matrix), with indices mapped through the string-to-index dictionary stoi.
  3. Add one to each entry of this matrix so that character pairs that never occurred in the training data get an arbitrarily small (but non-zero) probability instead of zero. This is called Laplace Smoothing.
  4. Normalise each row of this matrix N so we get the transition probability of each character being followed by another character.
  5. During inference, take your context character, use stoi to get its numeric representation, look up the corresponding row in the table, then take the argmax of the row to get the index and convert it back to a token (through itos), giving the character with the maximum probability.

Note: the matrix N would not actually have the tokens ‘.’, ‘a’, …, etc. as its indices, but rather their numerical representations.

Example (Step 4) illustrating how we get the lookup table of normalised counts (right) for the Bigram Model from the raw counts of occurrences (left). Here the character ‘y’ follows ‘a’ with 20% probability (highlighted).

Code:

import torch

N = torch.zeros((27, 27), dtype=torch.int32)

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

P = (N+1).float()
P /= P.sum(1, keepdims=True)
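
The code above covers steps 1 to 4. A minimal sketch of the inference step (step 5) for a single context character could look like the following, where the choice of ‘a’ as the context is purely an example:

# inference (step 5): look up the row for the context character and take the argmax
context = 'a'
ix = stoi[context]                  # string -> index
row = P[ix]                         # probabilities of each possible next character
next_ix = torch.argmax(row).item()  # index of the most probable next character
print(itos[next_ix])                # convert the index back to a character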

4. Neural Network Model

The neural network model in this section captures the relationship between consecutive characters through the use of embeddings: vectors that encode a continuous range of information for the task at hand. Here we start with further preprocessing in order to massage the data into the right format to be consumed by the neural network model.

Steps:

  1. Create a training set of bigram tuples and store them as inputs and outputs using the mapped numerical representations of the tokens.
  2. Encode the input vector (X) with one-hot encoding, i.e. each character token becomes a vector of length 27 (the vocabulary size) with a value of 1 in the ith position and 0s everywhere else, where i is the corresponding index from the dictionary stoi in section 2 (preprocessing data). This is done through the PyTorch function torch.nn.functional.one_hot.
  3. Multiply this input vector (X) by a randomly initialised weight matrix (W) to get back only the ith row of the weight matrix. This vector is the embedding vector of the character.
  4. Apply the Softmax function to the embedding vector to get the predicted scores (probabilities) for each character, which sum to 1.
  5. During inference, take the argmax of the vector of scores to get the index with the maximum score (then convert this index back to a character using the itos dictionary from section 2).
  6. During training, calculate the error based on the scores and back-propagate it through the network to the weight matrix (W) (more on that in the Training part below).

Example illustration of the inference step: an input character ‘e’ goes through the neural network, ending with ‘d’ as the predicted next character. The important tensor element(s) corresponding to the input are boxed in red.

Code:

# create the dataset (step 1)
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

# initialize the 'network' (part of step 3)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

import torch.nn.functional as F  # needed for F.one_hot below

# forward pass - single run
xenc = F.one_hot(xs, num_classes=27).float() # one-hot encode input (step 2)
logits = xenc @ W # predict log-counts (step 3)
counts = logits.exp() # counts, equivalent to N from count-based approach (step 4)
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character (step 4)
  • Note: the two lines of code starting with counts and probs are together equivalent to applying the softmax function described in step 4 above.
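
As a quick sanity check of this equivalence (an illustrative snippet, not from the tutorial's code), PyTorch's built-in softmax should give the same probabilities:

probs_alt = F.softmax(logits, dim=1)     # softmax applied directly to the logits
print(torch.allclose(probs, probs_alt))  # True - matches the manual exp/normalise version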

In the very last part above, we do things slightly differently depending on whether we’re performing inference (step 5) or training (step 6) after the forward pass of the neural network. For inference we just need to return the character with the maximum probability. For training we need to calculate the error with respect to the ground-truth label and perform back-propagation.
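
For the inference path, a minimal sketch for a single context character (using ‘e’ purely as an example; the variable names below are my own, illustrative choices) could look like this:

# inference (step 5): one-hot encode a single context character, run the forward
# pass, then take the argmax over the predicted probabilities
ix = stoi['e']
xenc_single = F.one_hot(torch.tensor([ix]), num_classes=27).float()  # shape (1, 27)
logits_single = xenc_single @ W                                      # shape (1, 27)
probs_single = F.softmax(logits_single, dim=1)
next_ix = torch.argmax(probs_single, dim=1).item()
print(itos[next_ix])  # predicted next character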

Training

I won’t go into detail on back-propagation and the backward gradient flow here, but if you’d like to learn more on that topic there is another great tutorial video by Andrej and its linked GitHub page. The steps below start from the end point (step 4) of the last section, where we have the predicted scores for each character.

Steps:

  1. Encode the actual label (Y, ground truth) through one-hot encoding.
  2. Calculate the cross-entropy loss using the score (predicted probability) corresponding to the index of the actual label. You will also hear Negative Log Likelihood being used in the video (see the article here for further details on their differences).
  3. Perform a backward pass to calculate the gradient of the loss with respect to each element of the parameters (just the weight matrix W in this simple case) for this training example. Let’s call this gradient dL / dW.
  4. Update the values of the weights in matrix W using the gradient descent algorithm (where alpha is the learning rate):
    W_new = W_old - (alpha * dL / dW)

Example illustrating the backward pass of the training process, starting with the actual label ‘t’ (the order is highlighted with red numbers and arrows)

Code:

# forward pass - single run (continued from the code right before this)
# calculate loss
loss = -probs[torch.arange(num), ys].log().mean()
print(loss.item())

# backward pass - single run
W.grad = None # set to zero the gradient
loss.backward()

# update weights using learning rate of 0.1
W.data += -0.1 * W.grad
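
The code above performs a single forward/backward/update run; in practice these steps are repeated for a number of iterations. A hypothetical sketch of such a loop, reusing the variables defined earlier (the iteration count and learning rate here are illustrative, not prescriptive):

# repeat forward pass, backward pass and weight update for several iterations
for k in range(100):
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss = -probs[torch.arange(num), ys].log().mean()  # forward pass and loss

    W.grad = None    # reset the gradient
    loss.backward()  # backward pass

    W.data += -0.1 * W.grad  # gradient descent update

print(loss.item())  # the loss should decrease as training progresses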

What’s Not Covered in This Article

The tutorial video also goes through many PyTorch API usages that are not covered in this article but are very practical. Again, I highly recommend watching the video yourself for these.

Links

The link to Andrej’s GitHub page for ‘makemore’ is here, and the Jupyter Notebook for part one of the tutorial is here.
