Notes on Andrej Karpathy’s makemore videos. Part 2.

Maxime Markov
Nov 3, 2022

Below are my notes on Andrej Karpathy’s video tutorial introducing language modeling. You can watch Andrej’s original presentation on YouTube.

In Part 1, we built a bigram model that takes into account only the immediately preceding character. This approach does not scale: the size of the counting matrix (which also serves as the model’s weights) grows rapidly as we increase the context, i.e. take more characters into account. For example, the first dimension of the counting matrix increases from 27 to 27*27 = 729 when we switch from a bigram to a trigram model.
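
As a rough illustration of that growth (assuming the 27-symbol vocabulary from Part 1, i.e. 26 letters plus a ‘.’ boundary token), the count matrix needs one row per possible context:

    import torch

    vocab_size = 27   # 26 letters plus the '.' boundary token, as in Part 1
    bigram_counts = torch.zeros((vocab_size, vocab_size), dtype=torch.int32)                 # 27 x 27
    trigram_counts = torch.zeros((vocab_size * vocab_size, vocab_size), dtype=torch.int32)   # 729 x 27
    print(bigram_counts.shape, trigram_counts.shape)   # torch.Size([27, 27]) torch.Size([729, 27])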

Instead, we want a model that generalizes well as the context grows.

Model architecture.

Neural Probabilistic Language Model architecture from Bengio et al., Journal of Machine Learning Research 3 (2003) 1137–1155

In this lecture, we implement a Neural Probabilistic Language Model.

The model takes several previous characters (the context) and tries to predict the next one. In the figure above, the context consists of three characters, but it can be longer. Since the model cannot work with characters directly, we convert characters to integers (indexes). The input layer takes the indexes of all context characters and converts them into embedding vectors using a lookup table C. The size of the embedding vector is a network parameter. The lookup table is shared across characters and has a size of (number of unique characters × embedding vector size).
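
A minimal sketch of the lookup, assuming a 27-character vocabulary and 2-dimensional embeddings (the specific numbers are illustrative):

    import torch
    import torch.nn.functional as F

    vocab_size, emb_dim = 27, 2                 # assumed sizes for illustration
    C = torch.randn((vocab_size, emb_dim))      # shared lookup table

    ix = torch.tensor([5, 13, 1])               # integer indexes of a 3-character context
    emb = C[ix]                                 # (3, 2): one embedding vector per character

    # indexing into C is equivalent to multiplying a one-hot encoding by C
    same = F.one_hot(ix, num_classes=vocab_size).float() @ C
    print(torch.allclose(emb, same))            # True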

The hidden layer is fully connected to the input layer and receives its concatenated outputs as a single vector. The tanh function is used as a non-linearity. The output layer is also fully connected and has as many neurons as there are unique characters in the dataset. It produces logits, which are then converted to probabilities using the softmax function. The character with the highest probability is our prediction.
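
Putting the pieces together, a hedged sketch of the forward pass (the context length of 3, embedding size of 2 and hidden size of 100 are assumptions for illustration; the parameters are created with requires_grad=True so we can back-propagate through them in the next snippet):

    import torch
    import torch.nn.functional as F

    vocab_size, block_size, emb_dim, n_hidden = 27, 3, 2, 100   # assumed sizes

    C  = torch.randn((vocab_size, emb_dim),            requires_grad=True)   # lookup table
    W1 = torch.randn((block_size * emb_dim, n_hidden), requires_grad=True)   # hidden layer
    b1 = torch.randn(n_hidden,                         requires_grad=True)
    W2 = torch.randn((n_hidden, vocab_size),           requires_grad=True)   # output layer
    b2 = torch.randn(vocab_size,                       requires_grad=True)
    parameters = [C, W1, b1, W2, b2]

    X = torch.randint(0, vocab_size, (5, block_size))             # a few integer-encoded contexts
    emb = C[X]                                                     # (5, 3, 2)
    h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)   # hidden activations, (5, 100)
    logits = h @ W2 + b2                                           # (5, 27)
    probs = F.softmax(logits, dim=1)                               # probabilities for the next character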

The network parameters are optimized using back-propagation.
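
Continuing the sketch above (it reuses the tensors defined there; the learning rate of 0.1 and the random targets Y are illustrative), one optimization step might look like this:

    Y = torch.randint(0, vocab_size, (5,))    # toy targets: the true next characters

    loss = F.cross_entropy(logits, Y)         # combines softmax and negative log-likelihood

    for p in parameters:
        p.grad = None                         # reset gradients
    loss.backward()                           # back-propagation

    lr = 0.1                                  # assumed learning rate
    for p in parameters:
        p.data += -lr * p.grad                # gradient-descent update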

Embeddings

We associate characters with N-dimensional feature vectors (embed characters in an N-dimensional space). Embedding vectors are randomly initialized and then tuned during training. If the training goes well, some characters end up close to others (by some distance metric) in the embedding space. If we reduce the embedding size to 2, we can see how the characters are grouped in the embedding space using a simple matplotlib scatter plot.

Visualization of characters in 2D embedding space
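
A sketch of how such a plot can be made (C here is a stand-in for a trained 2-dimensional lookup table, and itos maps an index back to its character):

    import torch
    import matplotlib.pyplot as plt

    C = torch.randn((27, 2))                   # stand-in for a trained lookup table
    itos = {i: s for i, s in enumerate('.abcdefghijklmnopqrstuvwxyz')}

    plt.figure(figsize=(6, 6))
    plt.scatter(C[:, 0], C[:, 1], s=200)
    for i in range(C.shape[0]):
        plt.text(C[i, 0].item(), C[i, 1].item(), itos[i], ha='center', va='center', color='white')
    plt.grid(True)
    plt.show()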

The analogy becomes even more intuitive if we replace characters with words. Words with a similar meaning will end up in a similar part of the embedding space, and words with a different meaning will end up somewhere else in the space.

Embeddings help us deal with out-of-distribution data. If a word is not in the dataset, its embedding may be close to that of another word with a similar meaning that can serve as a substitute. For example, our network might know that cats and dogs are both animals and appear in sentences in similar contexts.

Mini-batch

At first, we ran a forward and backward pass over the entire training set at every step. Instead, we can feed the data into the network in small chunks called mini-batches. Each step then processes only a small sample of the data, so training runs much faster. Because we are using mini-batches, the quality of our gradient is lower (i.e. the weight-update direction is less reliable). However, it turns out that it is much better to have an approximate gradient and take more steps than to compute an exact gradient and take fewer steps.
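
A hedged sketch of sampling a mini-batch (Xtr and Ytr are stand-ins for the full training tensors; the batch size of 32 is an illustrative choice):

    import torch

    # stand-ins for the real training data: integer-encoded contexts and next-character targets
    Xtr = torch.randint(0, 27, (200_000, 3))
    Ytr = torch.randint(0, 27, (200_000,))

    batch_size = 32
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))   # pick random example indexes
    Xb, Yb = Xtr[ix], Ytr[ix]                            # mini-batch used for this step
    print(Xb.shape, Yb.shape)                            # torch.Size([32, 3]) torch.Size([32])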

Learning rate optimization.

We don’t know whether our gradient-descent steps are too small or too large. To get a sense of the optimal learning rate, we can apply a simple procedure before starting the actual training. Let’s evenly sample exponents x in some interval (say -7 to 1, using linspace in Python) and convert them to learning rates 10**x. Now run a training loop over these candidate learning rates and plot the loss as a function of the learning rate. Typically, the loss decreases as the learning rate grows from very low values, then plateaus, and finally becomes unstable (fluctuates or rises) at high learning rates. The plateau region is our sweet-spot window.

Learning rate search: loss as a function of the learning rate. Sweet-spot values lie on the plateau around the minimum.
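
A sketch of this procedure on the toy network from the earlier snippets (the data here is random, so the plotted curve is only illustrative; with real training data it should show the decrease / plateau / blow-up pattern described above):

    import torch
    import torch.nn.functional as F
    import matplotlib.pyplot as plt

    # toy data and a tiny model, just to make the sketch self-contained
    vocab_size, block_size, emb_dim, n_hidden = 27, 3, 2, 100
    Xtr = torch.randint(0, vocab_size, (10_000, block_size))
    Ytr = torch.randint(0, vocab_size, (10_000,))
    C  = torch.randn((vocab_size, emb_dim),            requires_grad=True)
    W1 = torch.randn((block_size * emb_dim, n_hidden), requires_grad=True)
    b1 = torch.randn(n_hidden,                         requires_grad=True)
    W2 = torch.randn((n_hidden, vocab_size),           requires_grad=True)
    b2 = torch.randn(vocab_size,                       requires_grad=True)
    parameters = [C, W1, b1, W2, b2]

    steps = 1000
    lre = torch.linspace(-7, 1, steps)   # evenly sampled exponents, as described above
    lrs = 10 ** lre                      # candidate learning rates

    lossi = []
    for i in range(steps):
        ix = torch.randint(0, Xtr.shape[0], (32,))                    # mini-batch
        emb = C[Xtr[ix]]
        h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)
        logits = h @ W2 + b2
        loss = F.cross_entropy(logits, Ytr[ix])
        for p in parameters:
            p.grad = None
        loss.backward()
        for p in parameters:
            p.data += -lrs[i].item() * p.grad                         # use the i-th candidate rate
        lossi.append(loss.item())

    plt.plot(lre, lossi)                 # loss as a function of the learning-rate exponent
    plt.xlabel('log10(learning rate)')
    plt.ylabel('loss')
    plt.show()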

Split into train/val/test datasets

As the capacity of the network grows, it becomes more capable of memorizing the data (this is called overfitting). If we take an overfitted model and sample from it, we will only get back examples from the dataset, not new data. In other words, the model loses its predictive power.

To avoid this, we split our data into 3 independent subsets: training, validation (also called the development set) and test. The split ratios can be approximately 80%, 10% and 10% respectively. The training subset is used to optimize the model parameters. The validation subset is used to tune the hyperparameters that control learning (such as the model size, embedding size, regularization strength, learning rate, etc.). Finally, the test subset is used only at the very end of the training process to independently measure the model’s predictive performance.
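
A minimal sketch of the split, assuming words holds the full list of names from the dataset (the short list here is just a placeholder):

    import random

    # placeholder for the full list of names loaded from the dataset file
    words = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']

    random.seed(42)
    random.shuffle(words)

    n1 = int(0.8 * len(words))           # 80% for training
    n2 = int(0.9 * len(words))           # next 10% for validation

    train_words = words[:n1]             # used to optimize the parameters
    val_words   = words[n1:n2]           # used to tune hyperparameters
    test_words  = words[n2:]             # used only once, at the very end
    print(len(train_words), len(val_words), len(test_words))   # 8 1 1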

When the training and validation losses are about the same, this is a sign of underfitting. Underfitting means that our network is too small, and we can expect performance gains from increasing its size. When the training loss is significantly lower than the validation loss, the model is overfitting. Providing more data, augmenting it, or using a smaller network can help reduce overfitting.

Performance constraints

If the model underfits, the likely bottlenecks are (see the sketch after this list):

  • the hidden layer size (try a larger hidden layer)
  • the embedding size (try larger embedding vectors)
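
In the sketches above these bottlenecks correspond to just two numbers (the values below are illustrative):

    # assumed, larger values for the two bottleneck hyperparameters
    emb_dim  = 10    # instead of 2: larger embedding vectors
    n_hidden = 300   # instead of 100: a larger hidden layer

    # the parameter shapes then follow from these choices:
    #   C  -> (27, emb_dim)
    #   W1 -> (3 * emb_dim, n_hidden)
    #   W2 -> (n_hidden, 27)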
