Behind the Large Language Models: Word Embedding
How the foundational layers for ChatGPT and Google Bard work
With the emergence of large language models such as ChatGPT and Google Bard, Natural Language Processing has become quite a hot topic. One of the technologies underlying these models is word embedding.
Word embedding models capture the relationships and semantic similarities between words, which is why they are crucial for large language models.
First, I will explain what word embeddings are. Then I will encode words with one-hot encoding and build a model that predicts each word from the previous word. The internal weights of this model will give us a better way to encode words, one that captures something of their meaning.
Word embeddings are vectors that a word embedding model calculates for each word in a text. By ‘vector,’ we mean an array of numbers specific to a word: an n-dimensional object whose components are numbers. Each word in the document gets its own unique vector, and in practice a word embedding model spits out these vectors for every word it sees.
In word embedding, similar words are represented by similar vectors. For example, the vector for ‘apple’ is more similar to the vector for ‘orange’ than to the vector for ‘violin,’ because ‘apple’ is closer in meaning to ‘orange’ than it is to ‘violin.’
One point to note is that the dimension of these vectors is at the designer’s discretion. For example, the widely used pretrained Word2Vec vectors have 300 dimensions, meaning each word is represented by 300 numbers. Here we are going to create a very simple word embedding from scratch using only two dimensions. Let’s say we have a document containing two sentences.
Peaches are rich in vitamins, minerals, and beneficial plant compounds. Oranges are rich in vitamins, fiber, and potassium.
This document contains 13 unique words. My aim is to create a model that tries to predict each word from the previous word. To achieve this, I will define a specific vector for each word, so that identical words have the same vector and similar words have vectors close to each other.
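To make this concrete, here is a small Python sketch (the variable names are my own) that tokenizes the two sentences and builds the 13-word vocabulary:

```python
import re

text = ("Peaches are rich in vitamins, minerals, and beneficial plant compounds. "
        "Oranges are rich in vitamins, fiber, and potassium.")

# Lowercase and keep only alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# The vocabulary: each unique word, with its own index.
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

print(len(vocab))   # 13
print(vocab)        # ['and', 'are', 'beneficial', ..., 'vitamins']
```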
Phase 1: Nodes
The first step is to create several nodes grouped into a “hidden layer” and connect each input word to each node with a weighted connection. A node consists of two parts. The first part receives the weighted input from each word and sums it up. This sum goes to the second part, the activation function, which turns the sum into the node’s output. In this case, the activation functions are simple identity functions that do not change their input.
The number of nodes defines the dimension of the vectors, as mentioned above. In other words, it defines how many numbers (items in the vector) we want to assign to each input word. In real models, there are tens or hundreds of nodes.
Each word in our dataset goes to each node with a weight. These nodes help the model learn the relationships between words without adding much complexity. In our example, we have 13 words, so each node receives 13 weighted inputs and outputs the sum of those numbers. I used two nodes here to keep the example simple and manageable. (Note: the nodes at the same level form the hidden layer.)
This is the first phase. As you can see above, each word maps to the nodes with weights (w₁, w₂, …, wₙ). These weights are initially assigned at random to start the process; the goal is to fine-tune them over many iterations. Of course, we cannot mathematically multiply a string by a number, so we assign each word a 0 or a 1: the word that comes right before the one we want to predict gets a 1, and every other word gets a 0. In our example, we want to predict “are,” so “Peaches” gets the 1.
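Assuming the vocabulary and the word_to_index mapping from the earlier sketch, a minimal NumPy version of this first phase could look like this (the weight values are just random numbers, exactly as described above):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 13    # unique words in our two sentences
hidden_size = 2    # two nodes, so each word will end up with a two-number vector

# Phase 1 weights: one weight per (word, node) pair, initialized at random.
W1 = rng.normal(size=(vocab_size, hidden_size))

# One-hot input: 1 for the word right before the one we want to predict
# ("Peaches" when predicting "are"), 0 for everything else.
x = np.zeros(vocab_size)
x[word_to_index["peaches"]] = 1.0   # word_to_index comes from the earlier sketch

# Each node sums its weighted inputs; the identity activation leaves that sum unchanged.
hidden = x @ W1    # shape (2,): effectively the "peaches" row of W1
```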
Before we move to the next phase, one might point out that one-hot encoding already gives us a way to vectorize words. I should emphasize that the whole point of creating word embeddings is to find a better, lower-dimensional encoding that somewhat incorporates the meaning (or at least the usage) of each word.
Phase 2: Back to the words
In the second phase, we will distribute the outputs of each node to each word with randomly assigned weights.
So far, we have 26 random weights in the first phase (13 words × two nodes) and another 26 random weights in the second phase. For each output word, we sum up the weighted outputs of the two activation functions.
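Continuing the same sketch (same variable names as in the previous snippet), the second phase is just another randomly initialized weight matrix that goes from the two nodes back to the 13 words:

```python
# Phase 2 weights: 2 x 13 = 26 more random numbers, on top of the 26 from phase 1.
W2 = rng.normal(size=(hidden_size, vocab_size))

# Each of the 13 words sums the weighted outputs of the two activation functions.
scores = hidden @ W2   # shape (13,): one raw score per word, before the softmax layer
```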
Phase 3: Softmax layer
In the third phase, we will add a softmax layer to our model. The softmax layer is technically another activation function, typically used in the final output layer of multi-class classification problems. Its goal is to turn an array of numbers into an array of probabilities. It takes one number per label (in our case, per word) and returns a probability for each label. Unlike the previous phases, the softmax layer does not use random weights to distribute the numbers to the labels; it applies a fixed formula.
Here’s a reminder of the softmax function:
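$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$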
The softmax function takes a vector and applies the formula above to each element of the vector. For example, if the input vector is [3, 5, 7], the output vector is [0.016, 0.117, 0.867]. The sum of the items of the output vector is always 1.
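A short Python version of the softmax function reproduces this example (the max-subtraction line is a common numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    """Turn an array of numbers into an array of probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())   # subtracting the max avoids overflow; the result is unchanged
    return exp_z / exp_z.sum()

print(softmax([3, 5, 7]).round(3))   # [0.016 0.117 0.867]
```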
Applying the softmax layer in our hypothetical example returns the following results.
Phase 4: Optimizing the weights
To predict a word, we assign “1” to the input word and expect a high score for the word that follows it. If we try to predict the word that comes after “Peaches,” we will probably get an inaccurate output, because the weights were selected at random. During training, our model knows that the desired output is “are,” but the random weights do not produce that result. As you can see in the visual above, the word “fiber” has the highest score instead of “are.” This is where the neural network’s “backpropagation” kicks in. Backpropagation is an algorithm that optimizes the weights using the gradient of the loss function; it is an essential element of neural networks. It runs repeatedly, updating the weights until the model returns the desired result. These optimized weights are called “embeddings.”
Because we chose to use two nodes, each word has two first-phase weights feeding the nodes (activation functions). These two numbers compose the vector we talked about at the beginning of this article.
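Putting the phases together, here is a minimal end-to-end training sketch in Python/NumPy under the assumptions used so far (two nodes, identity activations, cross-entropy loss, plain gradient descent); it is an illustration of the idea, not the exact code behind the figures in this article:

```python
import re
import numpy as np

rng = np.random.default_rng(0)

text = ("Peaches are rich in vitamins, minerals, and beneficial plant compounds. "
        "Oranges are rich in vitamins, fiber, and potassium.")
tokens = re.findall(r"[a-z]+", text.lower())
vocab = sorted(set(tokens))
idx = {word: i for i, word in enumerate(vocab)}

# Training pairs: (previous word, word to predict). For simplicity the pairing
# ignores the sentence boundary.
pairs = [(idx[tokens[i]], idx[tokens[i + 1]]) for i in range(len(tokens) - 1)]

V, H = len(vocab), 2                      # 13 words, 2 hidden nodes
W1 = rng.normal(scale=0.1, size=(V, H))   # phase-1 weights (these become the embeddings)
W2 = rng.normal(scale=0.1, size=(H, V))   # phase-2 weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.1
for epoch in range(2000):
    for prev_word, next_word in pairs:
        hidden = W1[prev_word]             # identity activation: just the selected row of W1
        probs = softmax(hidden @ W2)       # predicted probability of each possible next word

        # Backpropagation: gradient of the cross-entropy loss with respect to the scores.
        d_scores = probs.copy()
        d_scores[next_word] -= 1.0
        d_W2 = np.outer(hidden, d_scores)
        d_hidden = W2 @ d_scores

        W2 -= learning_rate * d_W2
        W1[prev_word] -= learning_rate * d_hidden   # only the input word's row changes

# The optimized phase-1 weights are the two-dimensional embeddings.
embeddings = {word: W1[i] for word, i in idx.items()}
print(embeddings["peaches"], embeddings["oranges"])
```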
When the model is run and the weights are optimized, we can plot these weights and see that similar words are close to each other in the figure.
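For example, reusing the embeddings dictionary from the training sketch above, a scatter plot takes only a few lines:

```python
import matplotlib.pyplot as plt

# 'embeddings' is the word -> 2-D vector dictionary from the training sketch above.
for word, (x, y) in embeddings.items():
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.title("Two-dimensional word embeddings")
plt.show()
```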
We were able to plot them because these vectors are two-dimensional. Two- or three-dimensional vectors rarely provide much value, though, so higher-dimensional vectors are preferred in practice.
As you can see, “oranges” and “peaches” are close to each other. But how close are they? Are they closer to each other than ‘vitamins’ and ‘minerals’ are to each other?
Cosine similarity
Because each word has its own vector, we can calculate the cosine similarity between any two of them. Cosine similarity shows how well one word can be described by another; in other words, how similar they are. Mathematically, it is the dot product of the two vectors divided by the product of their lengths, or equivalently, the dot product of their unit vectors.
Let’s say “peaches” is represented with vector A, and “oranges” is represented with vector B.
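$$\text{cosine similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$$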
Cosine similarity is a number between -1 and 1. If two vectors point in exactly the same direction, their cosine similarity is 1; if they point in opposite directions, it is -1; unrelated vectors score near 0. This is an essential detail because cosine similarity also underlies many recommender systems.
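A small Python helper makes the range of values easy to check; the example vectors below are made up purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 2.0], [1.1, 1.9]))     # close to 1: very similar directions
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))   # exactly -1: opposite directions
```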
Using this algorithm, the neural network model can learn that oranges and peaches are similar words because they are used in a similar context.
These word embeddings are then used in more complicated models, such as LaMDA (for Google Bard) or GPT-3 (for ChatGPT). These models use multiple nearby words as input and many more hidden layers. Beyond the model I introduced in this article, there are contextual word embedding models that take the surrounding context into account. For example, in “It is a fine restaurant.” and “Fine, I’ll do it!” the word ‘fine’ carries two different meanings. Contextual word embedding models assign a different vector to each of these ‘fine’s because their meanings differ.
Word embedding is widely used in production settings. One product built on it is AWS’s BlazingText, a SageMaker built-in algorithm that implements Word2Vec to map words to high-quality distributed vectors.
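BlazingText itself runs as a SageMaker built-in algorithm, but the same Word2Vec idea can be tried locally with the open-source gensim library; this is a rough sketch of gensim’s API, not BlazingText’s:

```python
from gensim.models import Word2Vec

# Tokenized sentences from our tiny document; a real use case would feed in a large corpus.
sentences = [
    ["peaches", "are", "rich", "in", "vitamins", "minerals",
     "and", "beneficial", "plant", "compounds"],
    ["oranges", "are", "rich", "in", "vitamins", "fiber", "and", "potassium"],
]

# Train a small skip-gram Word2Vec model (gensim 4.x keyword names).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

print(model.wv["peaches"][:5])                     # first few of the 100 numbers
print(model.wv.similarity("peaches", "oranges"))   # cosine similarity (noisy on a corpus this small)
```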
I hope this article helps you gain some insight into the underlying algorithm of today’s popular NLP models. I want to thank my coworkers at Slalom, Dr. Ryan Croke, and Dr. Jack Bennetto, for helping me better understand this topic and supporting the development of this article.