How do words become numbers in AI language models?

Rafael Guerra
12 min read · Jan 30, 2023

If you have recently used ChatGPT or a similar AI language tool, you may have been impressed by how well it seems to understand written questions, and how quickly and seamlessly it can respond to them. But behind the scenes, there are complex models working hard: models that interpret text, break it down, perform computations, and present us with an answer at the end. None of this would be possible if models didn’t have a clever way of converting words into numbers.

This conversion of words into numbers is a central component of many AI models dealing with language, yet it is not as widely discussed as many other aspects of those models. This may have something to do with some of the calculus and linear algebra it can involve. But in this piece, I hope to help this topic gain more visibility by presenting it with a more intuitive approach (and hopefully, decent illustrations) so that all of us, but especially those who may still be hesitant or ‘creeped out’ by AI, can come to appreciate and be curious about its elegance.

Right off the bat, I should say that multiple methods exist to convert words into numbers for AI model processing. Simpler methods such as one-hot encoding are still interesting and useful, as are earlier embedding techniques such as skip-gram, but the method preferred for current high-performance models is word embeddings learned by large neural networks, and therefore, that’s what we’ll go over in this piece.

Tokenization: the first step — pretty much always

Regardless of which method is ultimately used for converting words into numbers, a common step is taken by all: reducing every unit of text to its smallest form. We call the smallest form a token. Think of a token as a small unit of text that can either be the word itself or part of a word. Here is an example of how an ordinary sentence could be broken down into tokens:

I, unfortunately, don’t like it.

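In this example, the word ‘snowboarding’ is split into three tokens: ‘snow’, ‘##board’, and ‘##ing’. Notice the ‘##’ in front of the last two tokens; it marks a piece that continues a word rather than starting one. The token ‘board’ can also be used to compose the word ‘skateboard’ or ‘bodyboard’, and the suffix ‘ing’ marks a verb form. Because these are common ‘pieces’ that appear in different words, it’s useful to create a token for each of them instead of simply using ‘snowboarding’. This style of tokenization is the one used by BERT, a widely used language model from Google. ChatGPT is built on a different family of models (GPT) with its own sub-word tokenizer, but the underlying idea is the same. We’ll talk more about BERT later in this piece and in other pieces to come as it remains hot in the field of AI language models.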
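To make the splitting step concrete, here is a minimal sketch of a greedy, longest-match-first sub-word tokenizer in the spirit of BERT's WordPiece. The tiny vocabulary and the function names are my own assumptions for illustration; a real tokenizer works from a learned vocabulary of tens of thousands of tokens and handles punctuation and casing more carefully.

```python
# Hypothetical toy vocabulary; a real one is learned from a large text corpus.
TOY_VOCAB = {"i", "like", "snow", "skate", "##board", "##ing"}


def wordpiece_tokenize(word, vocab, unknown="[UNK]"):
    """Split one lowercase word into sub-word tokens, longest match first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no known piece fits, so give up on this word
        tokens.append(piece)
        start = end
    return tokens


def tokenize(sentence, vocab=TOY_VOCAB):
    """Tokenize a whitespace-separated sentence word by word."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


print(tokenize("I like snowboarding"))
# ['i', 'like', 'snow', '##board', '##ing']
```

Because ‘##board’ and ‘##ing’ are separate entries, the same toy vocabulary also covers ‘skateboarding’ without needing a dedicated token for it.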

Encoding meaning into tokens

Once a model has tokenized its text, it has a vocabulary. For reference, a common version of BERT has a vocabulary of 30,522 tokens. At this point, you may think: why don’t we just assign a unique random number to each token and call it a day?

It’s not a bad idea! But sometimes the same token can refer to different things depending on how it’s placed in a sentence. Here’s an example — what does the token ‘it’ refer to in each of the prompts below?

I guess I didn’t realize if it says 7, everyone shows up at 9.

We know from our own knowledge of language that ‘it’ refers to the party in the first example and to ‘the time’ in the second one. But for a model to learn that, it needs a way to numerically encode the position of the word ‘it’ relative to the other words in the sentence.
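Here is a small sketch of why a single number per token falls short. The two sentences and the helper names are made up for illustration. The token ‘it’ gets the same ID in both sentences, so the only thing that tells the two uses apart is where the token sits and what surrounds it.

```python
def build_vocab(sentences):
    """Give every distinct token the next free integer ID."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_with_positions(sentence, vocab):
    """Return (token_id, position) pairs for one sentence."""
    tokens = sentence.lower().split()
    return [(vocab[token], position) for position, token in enumerate(tokens)]


# Hypothetical example sentences, not the ones from the original illustration.
sentences = [
    "the party was fun and it ended late",
    "if it says seven everyone arrives at nine",
]
vocab = build_vocab(sentences)

first = encode_with_positions(sentences[0], vocab)
second = encode_with_positions(sentences[1], vocab)

it_id = vocab["it"]
print([pair for pair in first if pair[0] == it_id])
print([pair for pair in second if pair[0] == it_id])
```

The ID is identical in both outputs and only the position differs, which is the extra information a model has to encode somehow.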

Besides the position in the sentence, the context of the sentence also matters. Take the example below, for instance, and notice how the same word, ‘right’, refers to completely different concepts. It is imperative that the model builds associations so that it knows that when ‘right’ appears in the same sentence as ‘wrong’, it likely means something different than when it appears in the same sentence as ‘left’.

Turn left since you are already going that way.

Assigning a list of numbers to each token

Since there are so many different aspects we need to consider for each token, it doesn’t make sense to simply assign each token a number. Rather, we need to assign each token a list of numbers, each representing some aspect of the token. Let’s take the token ‘snow’ as an example again. If we want to manually come up with three aspects, or ‘properties’, of that token, we could consider the following:

  • Is this token associated with cold climates?
  • Is this token a verb?
  • Is this token the name of a person?

By answering the questions with a value of 0 when the answer is ‘no’ and a value of 1 when the answer is ‘yes’, we could have a representation like this:

It could be a person’s last name if this was an HBO show

One convenient aspect of this approach is that we can compare words just by their ‘1 or 0’ lists and can start to measure geometrical distances between them. We will go into detail about how similarities between words are actually calculated in the next piece, but for now, it’s sufficient to understand what the following numerical representation already allows us to do. Take, for example, the words ‘snow’, ‘skiing’, and ‘Rafael’ and appreciate how we can already infer proximity simply by examining what kinds of lines each token would make if they were plotted in 3-D space where each axis was one of the properties.

As someone from the tropics, I am definitely not associated with cold climates.
Imagine you are at the tip of the arrow. Now imagine travel distances to other points — hopefully, that will help with the visualization. Of course, if you’ve taken linear algebra, just compute the dot product!

The lines for both ‘snow’ and ‘skiing’ are closer and pointing towards the right, so they must be ‘more related’ to one another than they are to ‘Rafael’, whose vector is farther and points upwards. This, in fact, does make sense, as ‘snow’ and ‘skiing’ are far more likely to be present in the same sentence than ‘snow’ and ‘Rafael’ or ‘skiing’ and ‘Rafael’.
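The comparison can be sketched in a few lines. The 0/1 values below are my assumption of what the three-question lists would look like, since the exact values lived in the illustration; the two helper functions are the standard dot product and straight-line (Euclidean) distance.

```python
import math

# Order of properties: cold climate?, verb?, name of a person?
# These 0/1 answers are assumed for illustration.
vectors = {
    "snow": (1, 0, 0),
    "skiing": (1, 1, 0),
    "Rafael": (0, 0, 1),
}


def dot(a, b):
    """Sum of pairwise products; larger means more shared properties."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between the tips of the two arrows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


for left, right in [("snow", "skiing"), ("snow", "Rafael"), ("skiing", "Rafael")]:
    print(
        left,
        right,
        "dot:", dot(vectors[left], vectors[right]),
        "distance:", round(euclidean_distance(vectors[left], vectors[right]), 3),
    )
```

Under these assumed values, ‘snow’ and ‘skiing’ share a property and sit one unit apart, while ‘Rafael’ shares nothing with either and sits farther from both.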

But how do models ‘know’ about the parameters of each word?

In the example above, we used our own knowledge of words to come up with ‘properties’, but in the real world, models have zero starting knowledge. So how do they come up with the parameters themselves? Or, to put it differently, how do they know which questions to ask in order to describe each word?

“The heck is this?”

The answer is: they don’t. Rather than asking questions about each word, models use algorithms to learn about words through random initialization, guessing, and learning by example. In data science, a common family of algorithms used for this is neural networks. With neural networks, we choose a very specific task with an answer we know, but the model doesn’t. We ask the model to guess the answer based on calculations it can make by combining the inputs in different ways.

Word, word, word, word, word…ok, I think the answer is also a word?!

If the model’s answer is wrong, we let it know how wrong it is and it can use that margin of error to re-calculate. This goes on for a while until the model’s ‘numbers’ start to guess correctly more often than not. Here’s an illustrated example.

It will usually take more than three attempts…
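The guess, check, and adjust loop can be sketched with a single ‘neuron’. This is a deliberately tiny stand-in, not how BERT is trained: the three inputs, the target values, and the learning rate are all assumptions I picked so the example stays readable.

```python
import random


def train(examples, n_inputs, epochs=200, learning_rate=0.1, seed=0):
    """Fit one linear neuron by repeatedly guessing and correcting."""
    rng = random.Random(seed)
    # Start from random weights: the model knows nothing yet.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):
        for inputs, target in examples:
            guess = sum(w * x for w, x in zip(weights, inputs))
            error = guess - target  # how wrong the guess was
            # Nudge each weight against the error, in proportion to its input.
            weights = [
                w - learning_rate * error * x for w, x in zip(weights, inputs)
            ]
    return weights


# Inputs flag whether 'snow', 'skiing', or 'piano' is present (hypothetical).
# Target is 1 for a winter-related sentence and 0 otherwise.
examples = [
    ([1, 0, 0], 1.0),
    ([0, 1, 0], 1.0),
    ([0, 0, 1], 0.0),
]
weights = train(examples, n_inputs=3)
print([round(w, 3) for w in weights])
```

The learned weights end up close to 1 for the winter-related inputs and close to 0 for ‘piano’, even though they started out random.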

The model starts by assigning random values to how important each ‘neuron’ is and how that ‘neuron’ relates to other ‘neurons’; every guess is computed from the current values, and every correction nudges them. A ‘neuron’ is just a node connected to some number of inputs, in our case the tokens, signifying whether they are present or not. Ultimately, neural networks used in professional settings will have thousands, maybe millions, of neurons, so when they come up with ‘parameters’ that successfully lead to a correct prediction, those numbers do a good job at prediction but have no readily meaningful interpretation.

Hard to interpret what those features could mean, but hey, it works!

On one hand, it’s a bummer that the dimensions the model assigns to each token are purely based on numerical calculations by trial, error, and improvement. Through this method, there is virtually no way for us to know what a dimension actually represents, unlike in our previous example with ‘snow’ and cold climates. On the other hand, this method is incredibly scalable and there is virtually no limit to how many dimensions we could capture.

I will go over neural networks in far more detail in a different piece. But because I don’t want anyone to stop reading here out of confusion, I will link to this excellent article that explains it well. I particularly love the image they provide, showing how a neural network could classify an image by extracting properties of the pixels, combining their information, and coming up with a guess on whether the image was, for example, that of a leaf. We could, and in fact will, perform similar tasks to understand words.

How many ‘properties’ should we capture in each list?

You might be wondering — if there is virtually no limit to how many parameters we can use in a neural network, how do we choose the right number? An analogy I can give is to imagine the list of parameters as something we could fit inside a bucket.

Which one would you rather carry?

The bigger the bucket, the more parameters we can fit and therefore, the more information we have — which is great. But, the bigger the bucket, the heavier and more inconvenient it is to carry around. In practical terms, the more parameters we work with, the longer it may take for models to run, and the more costly they may get too. There is also going to be a point of diminishing marginal returns where more information gained does not translate to a proportional increase in quality, so there is no clear incentive to increase the number no matter what.

The base model of BERT uses 768 properties for each token. That’s the ‘sweet spot’ the creators of BERT decided on. There is nothing special about that number, and in fact, BERT has a larger version that uses 1,024 properties per token, although it is not as commonly used.
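To put a rough weight on the bucket, here is a back-of-the-envelope calculation for the table of token properties alone. It uses the 30,522-token vocabulary mentioned earlier and assumes each value is stored as a 4-byte number; both the function name and the storage size are my assumptions, and a full model has many more parameters than this one table.

```python
def embedding_table_size(vocab_size, n_dims, bytes_per_value=4):
    """Return (number of values, bytes) for one row per token."""
    n_values = vocab_size * n_dims
    return n_values, n_values * bytes_per_value


# 768 is the base model's size; 1,024 is the size of the larger BERT model.
for n_dims in (768, 1024):
    n_values, n_bytes = embedding_table_size(30522, n_dims)
    print(f"{n_dims} dims: {n_values:,} values, about {n_bytes / 1e6:.1f} MB")
```

Going from 768 to 1,024 properties adds roughly a third more values to store and compute with, which is the cost side of the bucket analogy.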

How do we even represent 768 properties of something?

If we go back to our example of the word ‘snow’, we can represent its 768 properties, which we can also refer to as ‘dimensions’, as a list that is just a lot longer. Something like this instead of the earlier list of three values:

Trust me, you’d need to zoom in a lot if I included a square for each dimension!

Visualizing 3 dimensions is a lot easier than visualizing 768 dimensions. But the beauty of the mathematics of word embeddings is that the formulas to calculate distance still work no matter how many dimensions are involved. And even visualizing those dimensions is possible with sophisticated algorithms that ‘summarize’ what’s going on by picking up statistical patterns across the dimensions and plotting them on a 2-D chart. One such algorithm is called t-Distributed Stochastic Neighbor Embedding (t-SNE), and you can read more about it here, but for the purposes of this piece, you really don’t have to know much about it at all.
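As a quick check that the same formula works in any number of dimensions, here is a sketch of cosine similarity written once and applied to both a 3-value list and a 768-value list. The 768 values are random placeholders, not real embeddings, and the function name is my own.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length lists of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Three dimensions, using assumed 0/1 property lists.
print(round(cosine_similarity((1, 0, 0), (1, 1, 0)), 4))

# 768 dimensions, using random placeholder values.
rng = random.Random(42)
v = [rng.uniform(-1, 1) for _ in range(768)]
opposite = [-x for x in v]
print(round(cosine_similarity(v, v), 4))
print(round(cosine_similarity(v, opposite), 4))
```

Nothing in the function refers to the number of dimensions; it only needs the two lists to be the same length.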

What tests did BERT use to come up with the 768 dimensions?

In order to generate a list of 768 dimensions for each token, BERT used a specific kind of neural network called a transformer. As mentioned above, neural networks are used to generate an ‘educated guess’. But in order for the guess to be generated, a specific test needs to be set up. In the earlier example, the question was whether a leaf was present in a provided image. For BERT, there were two tests used involving the tokens in its vocabulary: the Masked Language Modeling Test and the Next Sentence Prediction Test.

Task 1: The Masked Language Modeling Test

The Masked Language Modeling test is one where we give BERT a list of sentences, but randomly omit a token from each of them, replacing it with the token ‘[MASK]’. BERT then has the task of guessing what that word could be based on the presence of the other non-masked tokens in the sentence. Since we know what the omitted word is, we can let BERT know if it got it right or wrong, and the next time it encounters a sentence, it will adjust its calculations accordingly. Thinking of the leaf example above, this could be a representation of what’s going on in this test:

Piano Snowboarding could be a really cool sport, though
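Setting up one question for this test can be sketched as follows. The function name and the example tokens are assumptions for illustration, and this simplified version hides exactly one token per sentence, whereas the actual BERT training recipe masks a percentage of the tokens.

```python
import random


def mask_one_token(tokens, rng):
    """Hide one randomly chosen token and return the quiz plus its answer."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)  # copy so the original sentence is untouched
    answer = masked[position]
    masked[position] = "[MASK]"
    return masked, position, answer


tokens = ["i", "like", "snow", "##board", "##ing"]
rng = random.Random(1)
masked, position, answer = mask_one_token(tokens, rng)

print(masked)    # what the model sees
print(answer)    # what we check its guess against
```

The model only ever sees the masked list; the answer stays with us so we can tell it how wrong its guess was.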

Task 2: The Next Sentence Prediction Test

At the same time that BERT is being challenged on the Masked Language Modeling test, it is also undergoing a different test entirely: Next Sentence Prediction. The concept is very similar to the previous test, but instead of masking individual tokens and trying to predict them, BERT looks at a pair of sentences and tries to determine if the second sentence should follow the first one. During training, 50% of the sentence pairs are genuinely consecutive in the original text, and in the other 50%, the second sentence is randomly selected. Again, a visual representation of the test could look something like this:

Maybe if this was a stream-of-consciousness novel or something…
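Building the sentence pairs for this test could look roughly like the sketch below. The function name and the sample sentences are hypothetical, and the coin flip gives about half real pairs and half random ones on average rather than an exact split.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Return (first, second, is_next) triples for consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw a random one")
    pairs = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Keep the sentence that really came next.
            pairs.append((first, sentences[i + 1], True))
        else:
            # Swap in a random sentence that is neither this one nor the next.
            candidates = [
                s for j, s in enumerate(sentences) if j not in (i, i + 1)
            ]
            pairs.append((first, rng.choice(candidates), False))
    return pairs


SENTENCES = [
    "I went to the mountains last winter.",
    "The snow was perfect for snowboarding.",
    "My piano lesson starts at seven.",
    "The teacher is never late.",
]
pairs = make_nsp_pairs(SENTENCES, random.Random(0))
for first, second, is_next in pairs:
    print(is_next, "|", first, "->", second)
```

As with the masking test, the label stays with us, and the model only gets the two sentences and has to guess whether they belong together.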

Both images above hopefully help you understand neural networks a little bit. There are differences between ‘transformer’ neural networks and the kinds of neural networks displayed above — but the underlying concepts are exactly the same. In future pieces, we can go over some of those differences, such as the concept of attention in models.

The final word embeddings

Once both tests are finished, the 768 dimensions for each token are finalized. Together, these lists of numbers capture how the tokens relate to one another, as measured by how well they let the model perform in the tests of Masked Language Modeling and Next Sentence Prediction, both representative of the kinds of real-world use of AI language models. The final matrix that is produced from the tests has one row per token and one column per dimension. Therefore, the final size of the matrix is 30,522 by 768, and it is what we would call BERT’s word embeddings.
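The shape of that matrix, and how a token is turned into its list of numbers, can be sketched with a scaled-down version. Everything here is a stand-in: six tokens instead of 30,522, four dimensions instead of 768, and random values instead of trained ones.

```python
import random


def make_embedding_matrix(vocab, n_dims, seed=0):
    """One row per token, one column per dimension, filled with placeholders."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(n_dims)] for _ in vocab]


# Hypothetical miniature vocabulary.
vocab = ["[MASK]", "i", "like", "snow", "##board", "##ing"]
token_to_row = {token: row for row, token in enumerate(vocab)}
matrix = make_embedding_matrix(vocab, n_dims=4)


def embed(tokens):
    """Look up the row of the matrix that belongs to each token."""
    return [matrix[token_to_row[token]] for token in tokens]


print("shape:", len(matrix), "by", len(matrix[0]))
print("snow ->", [round(value, 3) for value in embed(["snow"])[0]])
```

Converting a token into numbers is then nothing more than finding its row, which is why the matrix needs exactly one row per token in the vocabulary.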

At this point, honestly, I’m out of witty captions. You’re almost at the end, though!

Note in the illustration above that the values for each dimension are not always 0 or 1, like in previous examples. Neural networks use weights and biases and change their parameters based on error rates, learning rates, and many other continuous quantities. Ultimately, the 768 values for each token are continuous, so they will, in all likelihood, almost never be exactly 0 or 1.

Summarizing — and looking ahead

AI language models have become sophisticated quickly. This is largely due to the method of word embeddings, which allows for the effective conversion of words into tokens and then into numbers, such that many properties of words can be captured and used for comparisons. Models such as BERT use 768 different attributes for each token, which allows them to infer the context behind language and therefore perform well at a range of language tasks.

In the next piece, I will focus on a related but different question: once models have word embeddings, how do they calculate the similarity between words? We’ll go over the intuition behind cosine similarity, a powerful and versatile measure that can be used in a variety of situations, from advanced language models to simpler data analytics tasks such as cleaning up data. I hope to see you then!

Further Readings

The following articles and videos were excellent sources I used to come up with the right language and explanations for this piece. I highly recommend them.
