Implementing a Character-Level Bigram Language Model Using a Neural Network — Part 1B

Tahir Rauf
6 min read · Nov 30, 2023

This blog post holds my notes on the second half of the first video of Andrej Karpathy's makemore series.

In Part 1 of the makemore series, we learned how to build a bigram language model by counting all bigrams, how to sample characters to generate new words, and how to calculate the loss.
In this part, we'll reframe the problem of character-level bigram language modeling within a neural network framework. We'll end up in a very similar position, but the approach will be different.

+-------------------+         +-------------------+
|      Part 1       |         |   Part 2 (this)   |
|                   |         |                   |
|  Manually learn   |         |  Let the computer |
|  the probability  |         |  learn the        |
|  distributions    |         |  probability      |
|  of next character|         |  distributions.   |
|  by counting      |         |                   |
+---------|---------+         +---------|---------+
          |                             |
   +------|------+               +------|------+
   | The manually|               | The bigram  |
   | created     |               | table learnt|
   | bigram table|               | by NN       |
   +------|------+               +------|------+
          |                             |
          |                             |
   Text Generation!              Text Generation!

Our neural network will still function as a bigram character-level language model. It will receive a single character as input, and then, through a neural network with certain weights/parameters w, it will output a probability distribution over the next character in a sequence. The network will make predictions about what is likely to follow the character that was input to the model.
Furthermore, we will evaluate any setting of the neural network’s parameters using a loss function. We’ll examine its probability distributions and use the labels, which are essentially the identities of the next character. Gradient-based optimization will then be employed to adjust the network’s parameters, ensuring the neural network accurately predicts the next character.

The first step is to create a training set of all the bigrams for the model.

Create a bigram dataset for the neural network

Create a training set of bigrams (x, y). Given the first character of a bigram, we try to predict the next character. ‘xs’ holds the numerical representations of the input characters, while ‘ys’ holds the targets/labels.

# Create training set of bigrams (x, y).
# Given the first character of a bigram, we try to predict the next character.
import torch

# `words` (the list of names) and `stoi` (char-to-index mapping) come from Part 1.
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(words[:2])
print(f'xs: {xs[:12]}')
print(f'ys: {ys[:12]}')
"""
['emma', 'olivia']
tensor([ 0,  5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1])
tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0])
"""

Feeding integers into a neural network? One-hot encoding

How do we feed these examples into the neural network? Integer values (like 1 for ‘A’, 2 for ‘B’, etc.) can imply an ordinal relationship or magnitude that doesn’t actually exist in the context of language. For instance, the model might incorrectly interpret ‘B’ (2) as twice ‘A’ (1), or ‘C’ (3) as the sum of ‘A’ (1) and ‘B’ (2), which is semantically meaningless in language.
One-hot encoding prevents these misleading interpretations of character magnitude and relationships, and it makes the input compatible with the matrix operations a neural network performs during learning.

Let’s encode the input vector (X) using one-hot encoding, where each character token becomes a vector of length 27 (the size of the vocabulary). In this vector, the ith position will have a value of 1, and all other positions will have 0s. Here, ‘i’ represents the corresponding index from the stoi dictionary.

import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc[:2])
"""
tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.]])
"""

‘neural net’: One linear layer of neurons implemented with matrix multiplication

In this section, we'll feed all the bigrams of the name ‘emma’ through the neural network and evaluate the loss. We'll take the steps below:

  1. Get counts
    a. Calculate xenc @ W. Consider these the log counts, a.k.a. logits.
    b. Exponentiate the logits: the result of xenc @ W contains both positive and negative numbers, whereas our bigram table in Part 1 held positive counts (which we then normalized into probabilities). Exponentiation ensures every output is positive: it maps negative numbers to values between 0 and 1, and positive numbers to values greater than 1. Consider these the counts.
  2. Calculate probabilities
  3. Calculate loss

Note: Steps 1b and 2 together are called the softmax activation.
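As a quick sanity check (a minimal sketch, not from the lecture), the exponentiate-and-normalize steps produce exactly what PyTorch's F.softmax computes:

import torch
import torch.nn.functional as F

logits = torch.randn(5, 27)                   # stand-in logits, just for illustration
counts = logits.exp()                         # step 1b: exponentiate
probs = counts / counts.sum(1, keepdim=True)  # step 2: normalize each row
print(torch.allclose(probs, F.softmax(logits, dim=1)))  # True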

Here is the high-level architecture of the neural network.

Number of samples in dataset    = S = 228146
Number of features per sample   = F = 27
Number of probabilities to emit = H = 27

                                       nurn1      nurn2     ...    nurn27
S = [[f1, f2, f3, ..., f27]],   W = [[n1-f1w,    n2-f1w,    ... ,  n27-f1w ],
                                     [n1-f2w,    n2-f2w,    ... ,  n27-f2w ],
                                     [n1-f3w,    n2-f3w,    ... ,  n27-f3w ],
                                      .
                                      .
                                      .
                                     [n1-f27w,   n2-f27w,   ... ,  n27-f27w]]

Number of nodes in the input layer: 27 (one for each input feature)
Number of neurons in the layer: 27 (one for each possible next character)
Weight matrix shape: 27x27
(27 rows for the 27 features, 27 columns for the 27 neurons)
sample = xenc[:5]   # one-hot rows for '.', 'e', 'm', 'm', 'a' (the bigram inputs from 'emma')
generator = torch.Generator().manual_seed(2147483647)
# 27 columns for 27 output neurons. Each column has 27 rows for features.
W = torch.randn((27, 27), generator=generator)
logits = sample @ W
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
loss = -probs[torch.arange(5), ys[:5]].log().mean()

print(f'S={sample}')
print(f'W={W}')
print(f'logits={logits}')
print(f'counts={counts}')
print(f'probs={probs}')
print(f'input_shape: {sample.shape}, weight_shape: {W.shape}, probs_shape:{probs.shape}, loss: {loss.item()}')

Note: (xenc @ W)[3, 13] gives us the firing rate of the 13th neuron when looking at the 3rd input.
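Because each row of xenc is one-hot, xenc @ W simply selects one row of W. A minimal check of this, assuming the xs, xenc, and W defined above:

# The 3rd input is the character with index xs[3], so its logits are just row xs[3] of W,
# and the firing rate of the 13th neuron is the single entry W[xs[3], 13].
print(torch.allclose((xenc @ W)[3, 13], W[xs[3], 13]))  # True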

Optimization of Neural Network — Gradient Descent

In this section, we'll run the training loop. It consists of the steps: a) forward pass, b) backward pass, c) update, and d) repeat.

# initialize the 'network'
num = xs.nelement()
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding

epochs = 100
# gradient descent
for k in range(epochs):

    # forward pass
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print("Current loss:", round(loss.item(), 4))

    # backward pass
    W.grad = None # set to zero the gradient
    loss.backward()

    # update
    learning_rate = 50
    W.data += -learning_rate * W.grad
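As a side note (a sketch, not part of the lecture code), PyTorch's built-in F.cross_entropy fuses the softmax and the negative log-likelihood, and it is more numerically stable because it never materializes the exponentiated counts. The forward pass above could equivalently be written as:

logits = xenc @ W
loss = F.cross_entropy(logits, ys) + 0.01 * (W**2).mean()  # same loss; regularization added separately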

Sampling

For sampling, we'll use the trained weight matrix W the same way we used the manually created ‘N’ matrix. We'll:

  1. Pluck the row of W corresponding to ‘ix’. These are our log counts (logits).
  2. Exponentiate it to get positive ‘counts’.
  3. Normalize to get the probability distribution. (Remember, steps 2 & 3 are the softmax operation.)
  4. Randomly sample the next character from that probability distribution.
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    out = []
    ix = 0
    while True:
        # ----------
        # BEFORE:
        # p = P[ix]
        # ----------
        # NOW:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # pluck the row of W corresponding to ix
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True) # probabilities for next character
        # ----------
        # Sample from the distribution
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
Generated names

Summary

We trained the model in two completely different ways that yielded the same results. The first method involved counting the frequency of all bigrams and normalizing them. In the second method, we used negative log-likelihood loss as a guide to optimize the ‘counts’ matrix, minimizing the loss within a gradient-based framework. We observed that both methods produced the same results, although the second one is much more flexible.
