Notes on Andrej Karpathy’s makemore videos. Part 1.

Maxime Markov
Oct 29, 2022


Below are my notes on Andrej Karpathy’s video tutorial introducing language modeling.

I find Andrej’s educational videos very inspiring and I highly recommend that anyone interested in deep learning watch the full version here.

Problem statement.

Our goal is to create a name generator (called “makemore”) using a data-driven approach. Technically this can be done by predicting the next character in a sequence given the sequence itself. For example, in the name “isabella”, the character “i” is likely to come first, “s” follows “i”, “a” follows “is”, etc. Finally, after the sequence “isabella”, the word is likely to end.

Bigram model

We start with a bigram model that works with only two characters at a time: we look at one character in the sequence and try to predict the next one. We focus on local information and ignore the global context (i.e. the full word).

In order to mark where a name starts and ends, we add a special symbol ‘.’ at the beginning and at the end of each name. Thus, for the name “isabella” the first bigram is ‘.i’ and the last one is ‘a.’.
Our dataset consists of bigrams. Each name contributes multiple bigrams to the dataset. For example, “isabella” is split into ‘.i’, ‘is’, ‘sa’, …
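The bigram extraction above can be sketched in a few lines; `bigrams` is a hypothetical helper name, not from the video:

```python
# Split a name into bigram pairs, using '.' as the start/end marker.
def bigrams(name):
    chars = ['.'] + list(name) + ['.']
    return [(a, b) for a, b in zip(chars, chars[1:])]

pairs = bigrams("isabella")
# first pair is ('.', 'i'), last pair is ('a', '.')
```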

Statistical model

We start with a simple statistical model that counts the frequency of each bigram (pair of characters) in our dataset. We build a count matrix N whose cell (A, B) stores how many bigrams in the dataset start with character A and end with character B. With 26 letters plus the special ‘.’ symbol, this is a 27x27 matrix. We then transform N into a probability matrix P by normalizing each cell by the sum of its row elements. This gives us the statistics of the bigrams. The elements of the probability matrix P are the parameters of our model which help us to generate bigrams.
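A minimal sketch of the counting step, using a three-name stand-in for the real dataset of names (so the matrix here is smaller than 27x27):

```python
import torch

# Toy stand-in for the full names dataset.
names = ["isabella", "emma", "ava"]

# Build the vocabulary: '.' gets index 0, letters get 1..
chars = sorted(set(''.join(names)))
stoi = {c: i + 1 for i, c in enumerate(chars)}
stoi['.'] = 0
V = len(stoi)

# Count matrix N: cell (A, B) counts bigrams starting with A, ending with B.
N = torch.zeros((V, V), dtype=torch.int32)
for name in names:
    s = ['.'] + list(name) + ['.']
    for a, b in zip(s, s[1:]):
        N[stoi[a], stoi[b]] += 1

# Probability matrix P: normalize each row so it sums to one.
P = N.float()
P = P / P.sum(dim=1, keepdim=True)
```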

Now one can sample from a multinomial distribution defined by the rows of P to generate bigrams. We then chain the generated bigrams into names, using the start and end symbol ‘.’ as a separator.
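The sampling loop can be sketched as follows, assuming a row-normalized matrix P and an index-to-character map `itos` like those built in the counting step (here a tiny hand-made example with tokens ‘.’, ‘a’, ‘b’; the seed matches the one used in the video):

```python
import torch

itos = {0: '.', 1: 'a', 2: 'b'}
# Hand-made probability matrix; each row sums to one.
P = torch.tensor([[0.0, 0.7, 0.3],   # after '.'
                  [0.5, 0.1, 0.4],   # after 'a'
                  [0.6, 0.3, 0.1]])  # after 'b'

g = torch.Generator().manual_seed(2147483647)
out, ix = [], 0
while True:
    # Sample the next character from the row of the current one.
    ix = torch.multinomial(P[ix], num_samples=1,
                           replacement=True, generator=g).item()
    if ix == 0:  # reached the end token '.'
        break
    out.append(itos[ix])
name = ''.join(out)
```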

Model quality. Loss function

We will now assess the quality of the model with a single number, called the loss function. We use the likelihood, which is the product of all the bigram probabilities of a word. This joint probability is a small number, since we are multiplying numbers between 0 and 1, so in practice people prefer to work with the log-likelihood. High probabilities have logarithms close to 0, while small probabilities have large negative logarithms. To calculate the total log-likelihood of a word, we sum the log-probabilities of its bigrams.

Since we want a loss function to be minimized rather than maximized, we take the negative log-likelihood. We also normalize it by the total number of bigrams (i.e. take an average instead of a sum) to work with smaller numbers.
This model is already trained simply by counting all bigrams in our dataset. The weights of the model are the elements of the probability matrix P.
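The average negative log-likelihood of a word can be computed as below; P is the same tiny hand-made matrix as in the sampling sketch, not the real 27x27 one:

```python
import torch

stoi = {'.': 0, 'a': 1, 'b': 2}
P = torch.tensor([[0.0, 0.7, 0.3],
                  [0.5, 0.1, 0.4],
                  [0.6, 0.3, 0.1]])

word = "ab"
s = ['.'] + list(word) + ['.']
log_likelihood = 0.0
n = 0
for a, b in zip(s, s[1:]):
    # Sum the log-probabilities of all bigrams in the word.
    log_likelihood += torch.log(P[stoi[a], stoi[b]])
    n += 1

# Negate and average: lower is better, 0 would be a perfect model.
nll = -log_likelihood / n
```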

Model regularization

If we take a bigram that is not present in the dataset, the model assigns it zero probability and ‘-Inf’ log-probability. This is undesirable. To solve this problem, people use a smoothing trick: add a few fake counts to the count matrix before turning it into the probability matrix. As a result, there are no zeros left in the probability matrix P. Note that the more fake counts we add, the more uniform the probability matrix; the fewer we add, the more peaked the model.
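Smoothing is a one-line change at normalization time. A sketch with an assumed toy count matrix and k=1 fake counts:

```python
import torch

# Toy count matrix with some zero cells.
N = torch.tensor([[0, 5, 0],
                  [3, 0, 2],
                  [4, 1, 0]], dtype=torch.float32)

# Add k fake counts to every cell before normalizing.
# Larger k -> more uniform P; smaller k -> more peaked P.
k = 1
P = (N + k) / (N + k).sum(dim=1, keepdim=True)
# Every entry of P is now strictly positive, so log(P) is finite.
```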

Neural network approach

So far we have built a model by specifying a set of deterministic actions, but now we will use a neural network to solve the same problem. The network should take one character and return a probability distribution over the next character. Unfortunately, we cannot feed a character or its integer index N directly into the network. We first convert it with one-hot encoding into a sparse vector of all 0s except for the Nth dimension, which is set to 1. That vector can now be fed into the neural network.
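One-hot encoding is a single call in PyTorch; the index 5 below is just an arbitrary example character index:

```python
import torch
import torch.nn.functional as F

# Character index 5 in a 27-symbol vocabulary becomes a vector
# of 26 zeros and a single 1 at position 5.
ix = torch.tensor([5])
x = F.one_hot(ix, num_classes=27).float()  # float, so it can feed a linear layer
```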

We define a simple model consisting of a single linear layer with 27 neurons, which takes an input vector x and multiplies it by a weight matrix W. As output, the model gives us the firing rates of these 27 neurons for the given input. We interpret these 27 numbers as log-counts. To get counts, we exponentiate them. Probabilities are then found by the same normalization procedure as before; they tell us how likely each character is to come next according to the model. As we adjust the weights W, we get different probabilities for any input character.
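The forward pass of this one-layer model is a matrix multiply followed by exponentiation and normalization (together, a softmax). A sketch with randomly initialized weights; the seed is the one used in the video:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # one column of weights per output neuron

ix = torch.tensor([5])                  # an arbitrary input character index
x = F.one_hot(ix, num_classes=27).float()

logits = x @ W                 # interpret the outputs as log-counts
counts = logits.exp()          # exponentiate to get positive "counts"
probs = counts / counts.sum(dim=1, keepdim=True)  # normalize rows (softmax)
```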

We want to optimize the weight matrix W so that the output probabilities match the data well. The optimal parameters minimize the loss function, i.e. maximize the likelihood of the data with respect to the model parameters. This is achieved by performing several operations at each iteration: a) a forward pass, b) calculating the loss function, c) a backward pass, d) updating the parameters by gradient descent using the gradients from backpropagation.
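The four steps above can be sketched as a training loop. This is a toy version with a 3-symbol vocabulary and the bigrams of a single name; the learning rate of 10, the 100 iterations, and the seed are assumptions for the sketch, not prescriptions:

```python
import torch
import torch.nn.functional as F

# Bigram inputs (xs) and targets (ys) from the name "ava".
stoi = {'.': 0, 'a': 1, 'v': 2}
s = ['.'] + list("ava") + ['.']
xs = torch.tensor([stoi[a] for a in s[:-1]])
ys = torch.tensor([stoi[b] for b in s[1:]])

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((3, 3), generator=g, requires_grad=True)

for _ in range(100):
    # a) forward pass
    xenc = F.one_hot(xs, num_classes=3).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
    # b) loss: average negative log-likelihood of the targets
    loss = -probs[torch.arange(len(ys)), ys].log().mean()
    # c) backward pass
    W.grad = None
    loss.backward()
    # d) gradient descent update
    W.data += -10.0 * W.grad

final_loss = loss.item()
```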

First, we can generate random weights from a distribution (for example, from a Gaussian distribution). As we optimize, our weights get closer to their optimal values.

We can achieve roughly the same result with a simple neural network and gradient-based loss optimization as we did with the statistical approach. This makes sense, since we are not adding any new information to the dataset. But the neural network approach is much more flexible: we can extend it or make the network more complex. For example, we can take multiple previous characters into account and feed them into increasingly complex neural networks. Fundamentally, the output of the neural network will always be logits, which undergo the exact same transformations. The only thing that changes in more complex network architectures is how we perform the forward pass.

Neural network regularization

Previously, we introduced a smoothing trick for the regularization of the statistical approach. It turns out that the gradient-based approach has an equivalent of smoothing. Consider the weight matrix W, which we initialized randomly. If instead we initialize all weights W to zero, all the logits become zero, their exponentiation gives all ones, and the resulting probabilities are exactly uniform. So pushing the W’s to be close to zero is essentially the same as smoothing. This is called regularization. One can augment the loss function with a small component called the regularization loss. This new component tries to pull all the W’s toward zero and thus smooths the probability matrix.
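The augmented loss can be sketched as below; the regularization strength 0.01 is an assumed value, and `data_loss` is a stand-in constant for the negative log-likelihood term computed in the training loop:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

data_loss = torch.tensor(2.5)       # stand-in for the NLL term
reg_loss = 0.01 * (W ** 2).mean()   # pulls W toward zero, smoothing P
loss = data_loss + reg_loss
```

The squared-weights term grows as any weight moves away from zero, so gradient descent trades off fitting the data against keeping the probabilities close to uniform, just like adding fake counts did.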
