Purpose
I’m sure you have come across ChatGPT from OpenAI recently for holding human-like conversations using both words and code. Under the hood, a large component of ChatGPT is based on language modelling — the ability to predict the next character/subword/word/sentence based on previous ones.
This article is a summary of Andrej Karpathy’s Youtube tutorial video on introduction to language modelling.
The spelled-out intro to language modelling: building makemore
I highly recommend watching the video yourself. Throughout the video, Andrej encourages the audience to undergo self-learning with external resources such Wikipedia for concept definitions, PyTorch website for code documentations and Stack Overflow forum for debugging errors.
PS: At the time of writing this article, Andrej has just released a more advanced tutorial for building GPT (Generative Pretrained Transformers which evolved into GPT3 in its third iteration and largely powers ChatGPT) here. This can be viewed as an end goal of this series of tutorials on language modelling (bearing in mind GPT architecture has now been widely used in all modalities not just for language but also vision and speech).
Series Overview
The topic of this series of videos is on how one can generate more of the same type of text using different approaches — hence the name ‘makemore’. I also think of this as sequence/text generation (or Natural Language Generation) models that are trained based on observed (or training) data. The roadmap for this series of videos on language modelling to go through are of the following order (referenced in this github page):
- Bigram (either look-up table based or neural network based approach using counts of consecutive characters; focus of this tutorial)
- Multi-layer Perceptron (MLP)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Long Short Term Memory (LSTM)
- Gated Recurrent Units (GRU)
- Transformer
On other dimensions, Andrej mentioned this series for granularity would start off with character-level language models then moving towards word-level whereas in terms of modality we eventually shift away from pure text to text-to-image models such as DALL-E and Stable Diffusion.
Video Overview
This first video of the series explains the basics of generating sequence models in terms of either through look-up table counts-based or neural network based Bigram approach. Here, bigrams are pairs of characters that follow one another —so we would be considering the context of one character and using that to predict the next. The tutorial can be broken down into the following areas:
- Read in and explore the dataset based on an input text file
- Preprocess data to prepare for models in the upcoming steps
- Modelling approach 1: Count-based model
- Modelling approach 2: Simple neural network model
1. Reading and Exploring Data
The dataset we’re looking into is a text file with lowercase names of people in each line of the text file, starting with ‘emma’, ‘olivia’ and ‘ava’. This is a dataset containing some of the most common names. The link for the dataset from the github repository is here. There are a total of 32K names with the shortest and longest names consisting of 2 and 15 letters respectively.
Code:
Here the code for reading in the names.txt
input file and storing the names as a list stored in the variable words
is outlined below.
words = open('names.txt', 'r').read().splitlines()
2. Preprocessing Data
Counts dictionary
Steps:
Store in a dictionary with the keys being the tuples of all consecutive character pairs by repeating the process over all names in the data and the values being the counts of these tuples of consecutive character pairs. Start and stop character tokens are added with ‘.’ the dot symbol.
The entire process of iterating with the outer loop over all names (each name represented by w
) in the list words
is outlined below where in the inner loop each consecutive pairs of characters ch1
and ch2
are iterated over.
Code:
N = {}
for w in words:
chs = ['.'] + list(w) + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
bigram = (ch1, ch2)
N[bigram] = N.get(bigram, 0) + 1
Tokenisation
Tokenisation is the process of splitting text data into meaningful chunks with the appropriate level of granularity so they can be passed to downstream tasks/models. Common tokenisation levels include word, subword (one popular method called Byte Pair Encoding or BPE) or character. The level specified in tokenisation would affect how the input and output looks like. Here we have used character level tokenisation as the task at hand is for next character prediction. After tokenising to the appropriate levels we then convert the tokens into numerical representations for ease of downstream processes — indexing of the lookup table in the case of count-based model and additional encoding for the neural network based model.
Steps:
- Create a set of unique characters (called tokens) — 26 letters of the alphabet (lowercase only) plus the dot ‘.’ character (a special token signifying the start and end of a name) for a total of 27 tokens in the vocabulary.
- Sort this list of tokens in alphabetical order.
- Create a dictionary of the string / character to index mapping (key being the string and value being the index) with the special token character ‘.’ being indexed at 0 followed by ‘a’ at index 1 and so on until the last index ‘z’ at 26. This dictionary is stored in variable stoi.
- stoi = {‘.’ : 0, ‘a’ : 1, …, ‘z’ : 26} - Reverse the index mapping of the dictionary stoi above to another dictionary called itos.
- itos = {0 : ‘.’, 1 : ‘a’, …, 26 : ‘z’}
Code:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
3. Count-based Bigram Model
The end product of this approach would ideally be a look-up table with the indices being the previous character and the values being the next character with the maximum probability based on past observations. In this step we would start to use PyTorch package. Note how we are repeating many bits of section 2 (preprocessing) as we are moving away from base Python dictionary toward PyTorch tensors for storing the counts.
Steps:
- Initialise a matrix
N
of zeros with dimension of 27 by 27. - Convert the dictionary counts into PyTorch tensor (matrix) with indices being mapped from the string-to-index mapping dictionary
stoi
. - Add one to each entry of this matrix so as to avoid division by zero error and also to give arbitrarily small probability to sequences that never occurred in the training data. This is called Laplace Smoothing.
- Normalise each row of this matrix
N
so we get the transition probability for each character being followed by another character. - During inference, take your context character, use
stoi
to get the numeric representation, do a lookup in the table to get the corresponding row then take the argmax of the row to get the number then token (throughitos
) which is the character with the maximum probability.
Note: the matrix N would actually not have the tokens ‘.’, ‘a’, …, etc as the indices but numerical representations of them.
Code:
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
chs = ['.'] + list(w) + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
ix1 = stoi[ch1]
ix2 = stoi[ch2]
N[ix1, ix2] += 1
P = (N+1).float()
P /= P.sum(1, keepdims=True)
4. Neural Network Model
The neural network model in this section captures the relationship between consecutive characters through the use of embeddings which are vectors that encodes a continuous range of information for the task at hand. Here we start with further preprocessing in order to massage the data into the right format to be consumed by the neural network model.
Steps:
- Create a training set of Bigram tuples and store as input and output using the mapped numerical representations of tokens.
- Encode input vector (X) with one-hot-encoding i.e. each character token will be a vector of length 27 (the vocabulary size) and have ith position having a value of 1 and every other position having 0s, where i represents the corresponding index from dictionary
stoi
in section 2 (preprocessing data). This is done through PyTorch functiontorch.nn.functional.one_hot
. - Multiply this input vector (X) by a randomly initialised weight matrix (W) to get back only the ith row of this weight matrix. This vector is the embedding vector of the character.
- Perform Softmax function on the embedding vector to get the predicted scores for each character, with a sum of 1.
- During inference, take the argmax of the vector of scores to get back the index of the vector with the maximum score (then we can convert this index back to character using
itos
dictionary from section 2). - During training, calculate the error based on the scores and back-propagate the error through the network until the Weight Matrix (W) (more on that in the next part below).
Code:
# create the dataset (step 1)
xs, ys = [], []
for w in words:
chs = ['.'] + list(w) + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
ix1 = stoi[ch1]
ix2 = stoi[ch2]
xs.append(ix1)
ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)
# initialize the 'network' (part of step 3)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
# forward pass - single run
xenc = F.one_hot(xs, num_classes=27).float() # one-hot encode input (step 2)
logits = xenc @ W # predict log-counts (step 3)
counts = logits.exp() # counts, equivalent to N from count-based approach (step 4)
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character (step 4)
- Note: the 2 lines of code going starting with
counts
andprobs
is equivalent to applying the softmax function in step 4 described above.
In the very last part above, how we do things slightly differently depending on whether we’re performing inference (step 5) or training (step 6) after the forward pass of the neural network. For inference we just need to return the character with the maximum probability. For training we need to calculate the error with respect to the ground truth label and perform back-propagation.
Training
I won’t go in detail on back-propagation and backward gradient flow process here but if you like to learn more on that topic there’s another great tutorial video by Andrej and the linked github page. The steps below would start with the end point (step 4) of the last section where we have the predicted scores for each character.
Steps:
- Encode the actual label (Y, ground truth) through one-hot encoding.
- Calculate cross-entropy loss by comparing the score (predicted probability) corresponding to the index of the actual label. You also hear Negative Log Likelihood being used in the video (see article here for further details on their differences)
- Perform a backward pass to calculate the gradient of each element of the parameters (just the weight matrix W in this simple case) for this training example. Let’s call the gradient dL / dW.
- Update the values of the weights in matrix W using gradient descent algorithm (where alpha is the learning rate):
W_new = W_old — (alpha * dL / dW)
Code:
# forward pass - single run (continued from the code right before this)
# calculate loss
loss = -probs[torch.arange(num), ys].log().mean()
print(loss.item())
# backward pass - single run
W.grad = None # set to zero the gradient
loss.backward()
# update weights using learning rate of 0.1
W.data += -0.1 * W.grad
What’s Not Covered in This Article
The tutorial video also goes through many PyTorch API usages which is not covered in this article but very practical. Again I highly recommend you to watch the video yourself for these.
Links
The link to Andrej’s github page for ‘makemore’ is here and to the Jupyter Notebook of this part one of tutorial is here.