Deep Dive into AI: Building a Bigram Language Model and Practicing Patience!

Ada Choudhry
15 min read · Jan 14, 2024


Welcome or welcome back to the Deep Dive into AI tutorial series where I go deep into the fundamentals of neural networks (specifically large language models) through the bottom-up approach!

Mainly in this series, I will be referencing a lot from Andrej Karpathy’s series on Neural Networks: Zero to Hero, because I have found that it is one of the most practical and in-depth series on building neural networks. And as always, I’ll be breaking everything I learn down into first principles and sharing the useful life lessons I have picked up from programming along the way! I shared the mindsets I used in my first ever tutorial and it got a lot of positive feedback, so I’ll be sharing more of that from now on :)

In this tutorial, we’re starting to build makemore:

Building a Bigram Language Model

Makemore (it just makes more of the input you feed it; in our case, it will make more names) is a character-level language model that treats each character as a token and uses it to predict the next character. It deals with only 2 characters at a time: one that is passed in and one that it has to predict. It cannot use all of the previous characters for prediction, so it has a very limited range of context.

Each name tells us which characters are more likely to come first, which characters are likely to be paired up, and which characters are likely to come last.

Data Processing

words = open('names.txt', 'r').read().splitlines()
words[:10]

for w in words[:1]:
    for ch1, ch2 in zip(w, w[1:]):
        print(ch1, ch2)

We want to pair the adjacent characters together to define the relationship they have with each other. One way to do this in Python is through the zip() function. In this code snippet, I have only passed the first word. zip() pairs up elements from two sequences until one of them runs out of data; that’s why, in this case, I have paired the word with the same word minus its first character, so that we get pairs of adjacent characters.
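As a quick illustration on a standalone word (using 'emma' purely as an example here):

list(zip('emma', 'emma'[1:]))
# [('e', 'm'), ('m', 'm'), ('m', 'a')]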

We’ll also add a special start and end token to each word, so the model explicitly captures which characters are likely to come first and which are likely to come last.

Now, we also want to find out which pairs are more likely to occur. In the Bigram model, the simplest way to do that is by counting through a dictionary.

b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1

b.get(bigram, 0) is a dictionary method that tries to retrieve the value associated with the key bigram in the dictionary b. If the key (bigram) is not already in the dictionary, it returns 0 as the default value.

b.get(bigram, 0) + 1 increments the value associated with the bigram key by 1. If the bigram key doesn't exist in the dictionary, it initializes it with a value of 1.
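Here is the same counting pattern on a few hypothetical pairs, just to see it in isolation:

demo = {}
for bigram in [('e', 'm'), ('m', 'm'), ('e', 'm')]:
    demo[bigram] = demo.get(bigram, 0) + 1

print(demo)  # {('e', 'm'): 2, ('m', 'm'): 1}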

Let’s see which pairs are the most common.

sorted(b.items(), key = lambda kv: -kv[1])

We call b.items(), which returns the key-value pairs, and sorted() orders them. The key function -kv[1] negates each value, so the pairs come out in descending order of their counts.

In this case, the keys are the bigram pairs and the values are their frequencies.
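As a tiny standalone example with made-up counts:

demo_counts = {('a', 'n'): 3, ('n', '.'): 7, ('.', 'a'): 5}
print(sorted(demo_counts.items(), key=lambda kv: -kv[1]))
# [(('n', '.'), 7), (('.', 'a'), 5), (('a', 'n'), 3)]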

Building the Model!

Instead of a dictionary, let’s store the counts in a PyTorch tensor of dimension 28 by 28 (26 letters plus the <S> and <E> tokens).

import torch

N = torch.zeros((28,28), dtype = torch.int32)

Now, we can take the unique characters in the dataset and map them to integers to create an index table stoi, and we can also reverse this dictionary as itos. We look up the indices of the two characters in the stoi table and use them to locate and update the corresponding entry of the tensor.

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['<S>'] = 0
stoi['<E>'] = 27

for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

itos = {i:s for s,i in stoi.items()}

But having separate <S> and <E> tokens adds rows and columns that will always stay zero, for example any pair that starts with <E>. So we can replace both special tokens with a single '.' character.

chs = ['.'] + list(w) + ['.']

We can set stoi['.'] = 0 because we already offset all the letters by 1 in the stoi mapping, and we change the tensor dimensions to 27 by 27.
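Putting those adjustments together, the counting setup would look roughly like this (a sketch that reuses the variable names from the earlier snippets):

chars = sorted(list(set(''.join(words))))       # the 26 lowercase letters
stoi = {s: i + 1 for i, s in enumerate(chars)}  # letters map to 1..26
stoi['.'] = 0                                   # '.' is both the start and end token
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype = torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1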

Let’s visualize this:

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize = (16,16))
plt.imshow(N, cmap = 'Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha = 'center', va = 'bottom', color = 'gray')
        plt.text(j, i, N[i,j].item(), va = 'top', color = 'gray')
plt.axis('off');

Training

Right now, we have the frequencies of the bigram pairs, and we need to normalize them to turn them into probabilities.

p = N[0].float()
p = p/p.sum()
p

For efficiency, we can convert all of the rows into probability distributions at once.

P = N.float()
P /= P.sum(1, keepdim = True)

If keepdim is True, the output tensor is of the same size as input except in the dimension(s) dim where it is of size 1. Otherwise, dim is squeezed (see torch.squeeze()), resulting in the output tensor having 1 (or len(dim)) fewer dimension(s).

The shape of P.sum(1, keepdim=True) is [27, 1] because we summed across dimension 1, which runs horizontally, giving the total count in each row.

But can a tensor of dimension [27, 27] be divided by a tensor of dimension [27, 1]? This is where broadcasting comes into play.

Two tensors are “broadcastable” if the following rules hold:

  • Each tensor has at least one dimension.
  • When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.

In this case, the trailing dimensions are not equal, since 27 != 1, but one of them is 1, so broadcasting is possible. What happens in broadcasting?

In this case, the [27, 1] column of row sums is stretched (copied) across 27 columns to match the [27, 27] tensor, so each row is divided by its own sum. After the division, every row of P sums to 1.
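Here is a small sanity check of that broadcasting behaviour on a made-up 3-by-3 count matrix:

M = torch.tensor([[1., 2., 1.],
                  [0., 3., 1.],
                  [2., 2., 4.]])

row_sums = M.sum(1, keepdim = True)  # shape [3, 1]
P_demo = M / row_sums                # [3, 3] / [3, 1]: the column of sums is stretched across the columns

print(P_demo.sum(1))                 # tensor([1., 1., 1.]) -> each row now sums to 1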

We’re going to be using torch.multinomial, which returns samples of integer indices drawn according to their probabilities. So if index 2 in a tensor holding a probability distribution over a range of integers has a 60% probability, torch.multinomial will return samples in which 2 appears roughly 60% of the time.
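A quick way to see this is to draw many samples from a made-up distribution and check how often a given index shows up:

p_demo = torch.tensor([0.1, 0.6, 0.3])  # index 1 has a 60% probability
samples = torch.multinomial(p_demo, num_samples = 10000, replacement = True)
print((samples == 1).float().mean())    # roughly 0.6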

The other object we will use is torch.Generator(), which manages the state of the random number generator used for sampling. I will explain how they all work together in a while.

g = torch.Generator().manual_seed(2147483647)
ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
itos[ix]

When we set the Generator() to a manual seed, we make sure it returns the same sequence of random numbers no matter who runs it. torch.multinomial accepts the p tensor, which is a normalized tensor of probabilities for which character comes first (it is the first row of our tensor, which maps . to all the characters and holds how often each one follows it), and it samples from the corresponding multinomial distribution, in which each integer (an index into the p tensor) occurs according to its probability. num_samples = 1 means it will generate one sample from the distribution, replacement = True means samples are drawn with replacement, and the generator object determines how we randomly sample from the probability distribution.

In our case, the result of this sampling is m.

Let’s put this together to generate our first name.

g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

The starting index is always 0, which denotes . (our starting character). We always feed in the 0th row first because it holds the distribution over starting characters. If we sample ix as 0 during generation, it denotes our ending character, so we can break out of the loop.

The reason the names are quite bad is that a bigram model only ever sees one character of context.

Loss function

We need to measure the performance of the model so that we can train it through backpropagation. We can do this through Maximum Likelihood Estimation. Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed (source).

We can calculate the likelihood by multiplying the probabilities, which in this case are the model’s predicted probabilities for each bigram. But as the probabilities we calculated are between 0 and 1, multiplying many of them yields a very small number, which is inconvenient to work with. So we can work with the log instead.

Log(x) from 0 to 1

The log graph is monotonic: as x increases, log(x) also increases. As x gets closer to 1, log(x) gets closer and closer to 0, and as x gets closer to 0, log(x) heads toward negative infinity. The log essentially squishes the range into something convenient to work with: small probabilities map to large negative values and probabilities near 1 map to values near 0. The log-likelihood is the sum of the logs of all the individual probabilities, so when all the probabilities are 1, their logs are 0 and the log-likelihood is 0. When the probabilities are less than 1, their logs are negative, making the sum more negative. A loss function is something we want to minimize, while here we want to maximize the likelihood, so to make both point in the same direction we take the negative log of the likelihood, and we normalize it by taking the average.
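Here is a tiny numeric sketch of that reasoning, with three made-up probabilities:

probs = torch.tensor([0.3, 0.05, 0.2])   # hypothetical bigram probabilities

likelihood = probs.prod()                # 0.003 -> shrinks toward 0 very quickly
log_likelihood = probs.log().sum()       # log(0.3) + log(0.05) + log(0.2), a negative number
nll = -log_likelihood                    # positive, so lower is better
print(likelihood.item(), (nll / len(probs)).item())  # likelihood and average negative log-likelihood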

One problem we do have to address is that for bigrams with 0 probability, the log would be negative infinity. To overcome that, we can add 1 to every count before calculating probabilities so that there are no zero probabilities and we don't get any infinite values: P = (N+1).float(). This is called model smoothing!

log_likelihood = 0.0
n = 0
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')

nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n=}')  # average negative log-likelihood

Our loss function is 2.4241… and we want to minimize it to maximize the likelihood of producing comprehensible names.

Let’s convert our bigram language model into a neural network!

Creating a training set of bigrams

import torch.nn.functional as F

xs, ys = [], []

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

xenc = F.one_hot(xs, num_classes = 27).float()
yenc = F.one_hot(ys, num_classes = 27).float()

Here, xs is the training set and contains the input characters from our names, and ys contains the targets: for the character at a given position, the target is the character at the next position.

But we don’t want to feed in raw integers, because in a neuron we are going to multiply weights with the inputs, and it doesn’t make sense for some characters to carry a larger value than others; that would skew the final probabilities. The values we assign to the inputs are a way of categorizing the data rather than ranking it low or high, so we use a method called one-hot encoding, in which we create an array of 0s and 1s where each index is a category: a 1 is placed at the index of the true category and the rest are zeros. For more insights into one-hot encoding and embedding tables, feel free to read this article where I talk about this in-depth (yes, I just plugged myself in my own article):

We can convert input tensors into one-hot encoded vectors very easily through the torch.nn.functional.one_hot function. Because this function doesn’t take a dtype argument, we need to explicitly convert the result to float. num_classes = 27 because we have 27 categories to evaluate. (If you’re ever confused throughout the tutorial about what the values actually look like, go ahead and print them to gain visibility into what we’re working with.) Each character becomes a vector with 27 columns.
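For instance, the one-hot vector for a single index (index 5 here, chosen arbitrarily) looks like this:

print(F.one_hot(torch.tensor(5), num_classes = 27))
# a 27-long vector of zeros with a single 1 at position 5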

Now, we’re going to be implementing a very simple linear neural network with just one layer of neurons. It will have 27 inputs to account for all the characters and then have 27 neurons in its first and only layer.

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator = g, requires_grad = True)
xenc @ W

torch.randn returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution).

If we feed in a batch of 10 inputs, xenc has shape [10, 27], and multiplying it with the [27, 27] weight matrix gives a result of shape [10, 27], which tells us, for each input, how strongly each of the 27 neurons was activated.

When we do a matrix multiplication of a one-hot encoded vector with the weight tensor, we are essentially plucking out the row of the weight matrix whose index matches the position of the 1 in the one-hot vector. That row tells us which character is most likely to come next, because we are modelling the probabilities of bigrams.
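A minimal way to check this, reusing W and F from above with a hypothetical index of 3:

ix = 3  # hypothetical example index
one_hot_row = F.one_hot(torch.tensor([ix]), num_classes = 27).float()
print(torch.allclose((one_hot_row @ W)[0], W[ix]))  # True: the matmul simply selects row ix of W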

But how can we interpret these as probabilities?

These weights also contain negative numbers, as they’re randomly drawn from a standard normal distribution with mean 0. To interpret them as counts, we exponentiate them: e^x maps negative numbers to values between 0 and 1, and positive numbers to values greater than 1, so every entry becomes a positive “count”. We then take the sum across each row and divide by it to normalize, giving the probabilities output by each neuron.

logits = xenc @ W #log-counts
counts = logits.exp()
probs = counts / counts.sum(1, keepdims = True)

The last 2 lines are called the softmax activation function, which takes the outputs of a neural network, exponentiates them, and normalizes by dividing by their sum, ensuring that the 27 probabilities for each input sum to 1. All of the operations we have used are differentiable, so we will be able to calculate the gradients of the weights and backpropagate through the entire neural network.

Training the neural network

As we have assigned the weights randomly, the loss will be high! Passing the inputs through the network and calculating the predictions was the forward pass; now we need to do a backward pass, and we will do it exactly the way we did in micrograd. If you want an article walking through that implementation, check this out:

The only difference we have is that instead of using mean squared error as the loss function, we’re using the negative log-likelihood because this is a classification problem instead of a regression problem.

Here I calculate the negative log of the predicted probabilities to get the negative log-likelihood loss:

loss = -probs[torch.arange(num), ys].log().mean()

torch.arange(n) creates a tensor of integers from 0 to n-1 in increasing order. For example, torch.arange(5) returns tensor([0, 1, 2, 3, 4]). So here we pass in the number of examples and use it, together with the labels in the ys tensor, to pluck out the probability the model assigned to the correct next character for every example. Then we take the negative log and normalize it using mean().
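Here is a small, made-up illustration of that indexing trick with 3 examples and 2 classes:

probs_demo = torch.tensor([[0.1, 0.9],
                           [0.7, 0.3],
                           [0.4, 0.6]])
ys_demo = torch.tensor([1, 0, 1])

# picks probs_demo[0, 1], probs_demo[1, 0], probs_demo[2, 1]
print(probs_demo[torch.arange(3), ys_demo])  # tensor([0.9000, 0.7000, 0.6000])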

Backward Pass

W.grad = None #so that no previous gradients accumulate during backpropagation
loss.backward()
W.data += -0.1 * W.grad

Each value in W.grad tells us how the loss would change if we increased that weight, and nudging the weights in the negative direction of the gradient decreases the loss. This is gradient descent!

Let’s put the whole neural network together:

#create the dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples: ', num)

#initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator = g, requires_grad = True)

We can adjust the learning rate based on observing how much the loss is decreasing and our current loss should be equal to the loss we had when we optimized through manual counting which was around 2.42.

#gradient descent

import torch.nn.functional as F

for k in range(200):

    #forward pass
    xenc = F.one_hot(xs, num_classes = 27).float() #input to the network: one-hot encoding
    logits = xenc @ W #predict the log-counts
    counts = logits.exp() #counts, equivalent to N
    probs = counts/counts.sum(1, keepdims=True) #probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    #backward pass
    W.grad = None #set the gradient to zero
    loss.backward()

    #update
    W.data += -50 * W.grad

After enough iterations, we do get close to 2.42.

Let’s sample words from the model:

g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    out = []
    ix = 0
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes = 27).float()
        logits = xenc @ W #predict log-counts
        counts = logits.exp()
        p = counts / counts.sum(1, keepdims = True) #probabilities for the next character
        # ----------
        ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

The neural network gives the same results we got when we optimized the model through manual counting. But it works!

In the future, we will be adding complexity to the neural networks and scaling them up to take in larger contexts!

Here is my Colab file with all the code:

Mindsets Practiced

  • Patience: Having already spent the last few months on this, I almost didn’t want to spend additional time strengthening my foundations. But to achieve my goal of replicating a transformer from its paper, I wanted to have a deeper understanding of how natural language processing can be automated through neural networks! Having patience and putting in the reps to revisit the fundamentals while learning new concepts has helped me deepen my understanding!
  • Consistency: I tended to go all-in on a project for a few days until I would burn out. It stemmed from a tendency to prove to myself and others that ‘Hey, I can do it!’ But to develop long-term skills, you have to keep putting in small efforts consistently. And it’s okay to explore other things as well! In December, I learned full-stack development, and I have spent my time in January so far doing user research for my Supplier Risk AI platform. Consistency doesn’t mean that you keep doing something repetitively; I find that doing that makes you lose intention. Exploring other interests while compounding the skills in your key priorities has worked really well for me!

That’s it for this tutorial! Hope to share something new next time!

Until then, keep building (and breaking), keep learning!
