Introduction to Language Modelling — Part 2 (MLP)

Hans Han
10 min read · Jun 13, 2023


The second instalment of the series explaining how Language Modelling has evolved.


TL;DR

This article continues from the first part here. It differs slightly from the previous article in that I’ll be incorporating more of my own understanding and intuition to complement Andrej Karpathy’s part 2 video, with less focus on code. The main topic of this part is a new architecture, the multi-layer perceptron (MLP), which builds on top of the simple bigram neural-network setup by including more previous tokens and adding hidden layers, towards achieving better performance on the language modelling task (predicting the next token). First, let’s continue where we left off from bigrams with the names dataset.

More on Part 1 — Embeddings

It turns out that we can represent characters as very basic embeddings derived from the bigram model we built previously. An embedding (sometimes also referred to as an encoding) in NLP is simply a numeric, vectorised representation of non-numeric discrete values (e.g. characters, sub-words or words). If you recall from part 1, we have already tokenised our names dataset into characters and then into arbitrary ordinal numbers. With embeddings, we can now represent each character as a set of numeric values — this is also called a vector representation.
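
As a quick, minimal sketch (assuming the 27-character vocabulary from part 1, with ‘.’ at index 0), tokenisation plus an embedding table might look like the following in PyTorch. The embedding values here are random placeholders; in the two approaches below they come from counts or learnt weights.

import torch

chars = ['.'] + [chr(i) for i in range(ord('a'), ord('z') + 1)]
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index, e.g. stoi['a'] == 1
itos = {i: ch for ch, i in stoi.items()}      # index -> character
emb_dim = 4                                   # arbitrary embedding size for illustration
C = torch.randn(len(chars), emb_dim)          # one row (vector) per character
print(C[stoi['a']])                           # the 4-dimensional vector representation of 'a'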

Count-based Approach

In the count-based method, the representation of each character is simply the row of normalised count probabilities that uses that character as the start of the bigram. In the illustrative example below, the vector representation of the character ‘a’ is the row with values [0.5, 0.05, …, 0.2, 0.05], which would contain 27 elements representing the probabilities of transitioning to each of the characters from ‘.’, ‘a’ through to ‘z’.

Illustrative values for the normalised lookup-table vector representation of character ‘a’.

From our names dataset, if we look up the actual count-based embedding for the character ‘b’ in the normalised count table, we get the following vector. We can also plot it for better visualisation.

tensor([0.0430, 0.1205, 0.0146, 0.0007, 0.0247, 0.2455, 0.0004, 0.0004, 0.0157,
0.0816, 0.0007, 0.0004, 0.0389, 0.0004, 0.0019, 0.0397, 0.0004, 0.0004,
0.3155, 0.0034, 0.0011, 0.0172, 0.0004, 0.0004, 0.0004, 0.0314, 0.0004])
Actual probability distribution of the predicted next character after ‘b’ using bigram lookup-table approach

It makes intuitive sense that ‘e’ and ‘r’ carry the most weight (are the most probable next characters) if you recall common names like ‘Ben’, ‘Beatrice’, ‘Bella’, ‘Brittany’ and ‘Bridget’ (this dataset is in fact quite dated).
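
Roughly, that row can be produced as follows. This is a sketch that assumes the names.txt file and 27-character vocabulary from part 1; if smoothing was applied as in part 1, the smallest values would shift slightly.

import torch

words = open('names.txt').read().splitlines()   # the names dataset from part 1
chars = ['.'] + sorted(set(''.join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}

N = torch.zeros((27, 27), dtype=torch.int32)    # bigram counts
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

P = N.float()
P = P / P.sum(dim=1, keepdim=True)   # normalise each row into probabilities
print(P[stoi['b']])                  # the count-based vector representation of 'b'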

Simple Neural-network Approach

It turns out that what the neural network has learnt in its weight matrix actually converges to the same values as the normalised counts in the count-based approach. We can also inspect and visualise the values in the vector representing the character ‘b’ and see that it’s indeed very similar to the normalised count-based approach.

tensor([0.0374, 0.1140, 0.0127, 0.0055, 0.0182, 0.2386, 0.0050, 0.0046, 0.0107,
0.0753, 0.0059, 0.0037, 0.0340, 0.0063, 0.0042, 0.0343, 0.0055, 0.0061,
0.3088, 0.0050, 0.0041, 0.0156, 0.0051, 0.0058, 0.0040, 0.0262, 0.0033])
Actual probability distribution of next character after ‘b’ using bigram neural-network based approach
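
As a sketch of where that vector comes from: the row of the learnt weight matrix for ‘b’, pushed through a softmax, gives the predicted distribution. A random matrix stands in for the trained weights here so the snippet runs on its own.

import torch
import torch.nn.functional as F

W = torch.randn(27, 27)   # stand-in for the trained (27, 27) weight matrix from part 1
b_idx = 2                 # index of 'b' when '.' = 0, 'a' = 1, 'b' = 2, ...

logits = W[b_idx]                 # equivalent to one_hot('b') @ W
probs = F.softmax(logits, dim=0)  # predicted next-character distribution for 'b'
print(probs)                      # with the trained W this approximates the count-based row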

Visualising Character Embeddings

To visualise vectors of 27 dimensions in a way that appeals to our eyes, we need to bring the number of dimensions down to 2. The simplest method to achieve this is to just retrieve and plot the first 2 dimensions of the 27-dimensional vectors.

Plot of the first 2 dimensions of bigram count-based and neural-network models’ character embeddings
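
A rough plotting sketch, reusing P and chars from the count-based snippet above (the neural-network probabilities can be plotted the same way), might look like this:

import matplotlib.pyplot as plt

xs, ys = P[:, 0], P[:, 1]   # keep only the first two of the 27 dimensions
plt.figure(figsize=(6, 6))
plt.scatter(xs, ys)
for i, ch in enumerate(chars):
    plt.annotate(ch, (float(xs[i]), float(ys[i])))  # label each point with its character
plt.xlabel('dimension 0')
plt.ylabel('dimension 1')
plt.show()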

The meaning of these vector embeddings is derived from both the training objective (next-character prediction) and the structure of the dataset. In this case we see that characters serving similar purposes in the construction of names are cast into similar regions of the space, i.e. the vowels ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’ sit close to each other, and consonants that appear in similar positions (or are interchangeable) within common names also occupy similar subspaces.

There’s a whole body of work around other, more advanced dimensionality reduction techniques (part of unsupervised learning), such as Principal Component Analysis (PCA), for condensing high dimensions down, which I won’t go into detail on in this article.

Part 2: Multi-Layer Perceptron

Part 2 of this series implements the Multi-Layer Perceptron (MLP) architecture presented in the 2003 Neural Language Model paper by Yoshua Bengio and others, at the character level (instead of the word-level tokens used in the paper).

Drawbacks of N-gram Models

A critical problem with n-gram models lies in their inability to generalise between sequences, especially as the context window grows. Each distinct sequence is independent of the others, with no learning shared between them. This issue becomes evident when we need to make inferences on previously unseen sequences. It is much more of a problem for word-level models than for character-level models, as the vocabulary size of words easily extends beyond 10k in English. Take for example the sentence below, where we want to predict the next word.

The dog was running in the …

For an n-gram model (in this case a 7-gram, using 6 words to predict the next), we would need to have seen this exact sequence in the training data. Smoothing can help get around this, but even then two problems remain:

  1. Distributing equally small weights to the unseen sequences may not always make sense, as we are assigning the same probability to each of them.
  2. There is a huge number of sequences for which we need to assign these weights, because the number of possible sequences grows exponentially with the context window (the ‘n’ in n-gram models); see the quick calculation below.
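
A quick back-of-the-envelope calculation makes the scale of the problem concrete (the vocabulary sizes here are illustrative):

char_vocab, word_vocab = 27, 10_000   # illustrative vocabulary sizes

print(char_vocab ** 3)    # 19,683 possible 3-character contexts
print(word_vocab ** 6)    # 10**24 possible 6-word contexts for a 7-gram word model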

For a neural language model like the MLP, we don’t have to have seen the exact sequence. Through other sequences containing these words, the model can learn that the words ‘the’ and ‘a’ are interchangeable most of the time, and that ‘dog’ and ‘cat’ denote very similar concepts in the context of household pets. These similarities become obvious when we look at the embeddings for these words. The model may have previously seen:

A cat was running in the … (room)

So with everything it has learnt, the model is highly confident that the next word for The dog was running in the … is also ‘room’. Hence a big advantage of the MLP over n-gram models is that it can learn through the sharing of weights.

How it works

Training Dataset

The data setup is similar to part 1 with the names dataset, where we tokenise each character into one of 27 tokens (the extra character ‘.’ is used for the start and end of a word). Now we also need to specify the context window (set arbitrarily to 3), which represents how many characters we look at in order to predict the next one. Doing so creates something like the below for the first name ‘emma’ in our dataset. Extra dots are used to pad the start before the first letter ‘e’.

An example of how the features and labels would look per instance with a context window of 3

From the tokenised characters, which convert the above to numbers, we store everything on the left-hand side of the arrow as the ‘features’ or ‘inputs’ and everything on the right as the ‘target’ for our training dataset.
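
A minimal sketch of this construction, assuming the words list and stoi mapping from the earlier count-based snippet, might look like this:

import torch

block_size = 3   # context window: how many characters we use to predict the next one
X, Y = [], []
for w in words:
    context = [0] * block_size        # start with '...' (index 0 is '.')
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)             # the 3 previous character indices (features)
        Y.append(ix)                  # the next character index (target)
        context = context[1:] + [ix]  # slide the window one character along
X = torch.tensor(X)                   # shape (num_examples, 3)
Y = torch.tensor(Y)                   # shape (num_examples,)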

Architecture

The key takeaways, with reference to the diagram below and starting from the bottom, are:

  1. The inputs first connect to a shared lookup table (matrix C), where we pluck out the rows indexed by the input tokens and concatenate them into one long vector. The dimensions of this matrix C would be vocabulary size by embedding dimension (the latter is arbitrarily chosen as a hyperparameter).
  2. This flows through to a hidden layer with a specific number of units (chosen as a hyperparameter), followed by a tanh activation function.
  3. Ending at a softmax layer that converts logits to probabilities.
Neural network architecture set-up for MLP Neural Language Model (input starts from the bottom)

For our names dataset, with a context window size of 3, an embedding size of 2 and a hidden layer size of 100, we would have the following details during training (a minimal code sketch follows the list):

  1. Each of the 3 tokenised characters would index into a particular row in the matrix C with size (27, 2) so now we have 3 vectors of size 2. We concatenate these to form one vector of size 6.
  2. We multiply this vector by a (6, 100) weights matrix then pass through a non-linear tanh function in an element-wise fashion to get the hidden layer with 100 units.
  3. We multiply the hidden units vector with size 100 by a (100, 27) weights matrix to arrive at the logits with 27 units. We pass this result through a softmax layer to get the probability distribution of the next token.
  4. For the training process, we evaluate the loss and backpropagate this loss to the weights in upstream layers.
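
Putting the four steps together, a minimal sketch of the forward pass (assuming X and Y from the dataset sketch above, and untrained, randomly initialised weights) could look like this:

import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(42)   # arbitrary seed for reproducibility

C  = torch.randn((27, 2),   generator=g, requires_grad=True)   # embedding lookup table
W1 = torch.randn((6, 100),  generator=g, requires_grad=True)   # 3 tokens * 2 dims = 6 inputs
b1 = torch.randn(100,       generator=g, requires_grad=True)
W2 = torch.randn((100, 27), generator=g, requires_grad=True)   # hidden layer -> 27 logits
b2 = torch.randn(27,        generator=g, requires_grad=True)

emb = C[X]                                  # (num_examples, 3, 2): one vector per input token
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # concatenate to (num_examples, 6), then hidden layer
logits = h @ W2 + b2                        # (num_examples, 27)
loss = F.cross_entropy(logits, Y)           # softmax + negative log likelihood in one call
loss.backward()                             # backpropagate the loss to all the weights above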

Cross-entropy Loss and Perplexity

Now is the right time to reintroduce the concept of cross-entropy loss from part 1, which is a form of intrinsic evaluation of a language model’s performance. It’s intrinsic because it’s evaluated independently of any downstream application: it is optimised during the training process on the training set and tested on a held-out test set. Formally, cross-entropy is a measure from the field of information theory that quantifies the difference between two probability distributions, which in language modelling are:

  1. The ground-truth probability distribution, which has 0s everywhere except at the index corresponding to the actual label, where all the weight sits (a weight of 1).
  2. The predicted probability distribution across the full vocabulary of tokens.

In basic terms, cross-entropy is calculated as the negative log of each predicted class probability weighted by the true class probability, summed over all possible values. For our language modelling task, since the ground truth always has a 1 in the place of the target token, the cross-entropy loss is equivalent to the negative log of the predicted probability at the index of the actual target token (which is the same as the negative log likelihood, if you remember from part 1). A smaller cross-entropy loss means we’re assigning higher probabilities to the ground-truth tokens, i.e. we’re more confident about picking the correct next token.

The example below shows the model’s predicted probability distribution at inference time and the ground-truth label (boxed in red) after seeing the starting character ‘b’ in the name ‘Bridget’. The cross-entropy loss is simply -ln(0.3) ≈ 1.20, where 0.3 is the predicted probability of ‘r’ as the next character.

Illustrative example of how cross-entropy would be calculated when comparing the probability output distribution after the softmax layer with the ground truth label boxed in red.

In practice, when cross-entropy loss is used to evaluate language models, inference is performed over the whole dataset and the cross-entropy is averaged across instances.

Another concept closely related to cross-entropy loss, which you may come across in the NLP space, is perplexity. Perplexity is simply a base raised to the power of the cross-entropy loss, where the base is the logarithm base used to calculate the cross-entropy (usually Euler’s number, e).
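
A minimal sketch of both quantities on a toy example (the logits and three-token vocabulary here are illustrative, not taken from the actual model):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.2]])    # one instance, toy 3-token vocabulary
target = torch.tensor([0])                  # index of the ground-truth next token

probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[0, target[0]])    # negative log of the true-class probability
builtin = F.cross_entropy(logits, target)   # same value; averages over instances in a batch
perplexity = builtin.exp()                  # e raised to the cross-entropy loss

print(manual.item(), builtin.item(), perplexity.item())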

Results

We can compare the cross-entropy loss results from the bigram and MLP approaches.

It’s easy to see that the simple bigram neural network approximates the performance of the count-based approach, with the MLP surpassing both of them significantly. This is expected, as the MLP has a much more expressive setup than the simple bigram neural network: it builds on that architecture with a larger context window (3 instead of 1) and an extra hidden layer. Here expressivity means how complex the model can be in capturing the underlying behaviour of the data.

What’s Next

I hope you’ve enjoyed reading this article and learnt something meaningful. Before jumping into the Transformer architecture and the attention mechanism that power popular applications today, like OpenAI’s ChatGPT and Google’s Bard, there are two more milestones in the history of language models. These are the Recurrent Neural Network (RNN) and its derivatives, which explicitly model each time step, and Convolutional Neural Network (CNN) based models, which, although primarily used for computer vision, can also be used to model sequences.

Bonus Content

If you’ve read this far, congratulations on reaching the end! I’ve been learning about and contributing some of my knowledge to the Large Language Models (LLMs) space over the last few months. PanML is a nice Python package that gets you started exploring the vast ocean of LLMs with a familiar sklearn-like API / coding syntax. In my opinion, its big advantages lie in:

  • The multitude of open-source LLMs from the Hugging Face library that it supports, in addition to the OpenAI API, all behind an easy-to-use interface.
  • Out-of-the-box options to fine-tune models with parameter-efficient fine-tuning techniques such as LoRA.
  • Prompt engineering and chaining techniques that are simpler and easier to call, with base Python lists as input.
