Explained: Tokens and Embeddings in LLMs

The numbers that make all the sense

The Research Nest


Image source: https://i.kym-cdn.com/entries/icons/original/000/000/056/itsover1000.jpg

“GPT-4 Turbo has a context length of 128K tokens.”

“Claude 2.1 has a context length of 200K tokens.”

Sounds like some important detail. What is a token, really?

Consider a sentence — “It’s over 9000!”.

We can represent it as ["It's", "over", "9000!"] where each array element can be called a token.

In natural language processing, a token is the smallest unit of analysis that we define. What counts as a token depends on your tokenization method, and plenty of such methods exist. Creating tokens is typically the first step in most NLP tasks.

Tokenization Methods in NLP

Let’s directly jump to a code sample to understand some popular ways to tokenize a string.

# Example string for tokenization
example_string = "It's over 9000!"

# Method 1: White Space Tokenization
# This method splits the text based on white spaces
white_space_tokens = example_string.split()

# Method 2: WordPunct Tokenization
# This method splits the text into words and punctuation
from nltk.tokenize import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(example_string)

# Method 3: Treebank Word Tokenization
# This method uses the standard word tokenization of the Penn Treebank
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(example_string)

white_space_tokens, wordpunct_tokens, treebank_tokens
(["It's", 'over', '9000!'],
['It', "'", 's', 'over', '9000', '!'],
['It', "'s", 'over', '9000', '!'])

Each method has its unique way of breaking down the sentence into tokens. You can create your own method if you want, but the basic crux remains the same.
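To make the "create your own method" idea concrete, here is a minimal sketch of a custom tokenizer built on a regular expression. The function name simple_tokenize and the splitting rule are my own choices for illustration, not a standard method.

```python
import re

def simple_tokenize(text):
    # Match a run of word characters, optionally followed by an apostrophe
    # suffix such as "'s", or else any single non-space symbol.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("It's over 9000!")
print(tokens)  # ["It's", 'over', '9000', '!']
```

Unlike the whitespace split, this rule separates the trailing "!" into its own token but keeps "It's" together. Changing the pattern changes what you call a token.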

Why do we need to tokenize strings?

  1. To break down complex text into manageable units.
  2. To present text in a format that is easier to analyze or perform operations on.
  3. Useful for specific linguistic tasks like part-of-speech tagging, syntactic parsing, and named entity recognition.
  4. Uniformly preprocess text in NLP applications and create structured training data.

Most NLP systems operate on these tokens to carry out a specific task. For example, we can design a system that takes a sequence of tokens and predicts the next one. We can also convert the tokens into their phonetic representation as part of a text-to-speech system. Tokens also underpin many other NLP tasks, such as keyword extraction and translation.

How do we actually use these tokens to build these systems in the first place?

  1. Feature Extraction: Tokens are used to extract features that are fed into machine learning models. Features might include the tokens themselves, their frequency, their part-of-speech tags, their position in a sentence, etc. For instance, in sentiment analysis, the presence of certain tokens might be strongly indicative of positive or negative sentiment.
  2. Vectorization: In many NLP tasks, tokens are converted into numerical vectors using techniques like Bag of Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (like Word2Vec, GloVe). This process turns text data into numbers that machine learning models can understand and work with.
  3. Sequence Modeling: For tasks like language modeling, machine translation, and text generation, tokens are used in sequence models like Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), or Transformers. These models learn to predict sequences of tokens, understanding the context and the likelihood of token occurrence.
  4. Training the Model: In the training phase, models are fed tokenized text and corresponding labels or targets (like categories for classification tasks or next tokens for language models). The models learn patterns and associations between the tokens and the desired output.
  5. Context Understanding: Advanced models like BERT and GPT use tokens to understand context and generate embeddings that capture the meaning of a word in a specific context. This is crucial for tasks where the same word can have different meanings based on its usage.

If you are new to all this, don’t worry about the keywords you just read. In very simple terms, we have text strings that we convert to independent units called tokens. This makes it easier to convert them to “numbers,” later, which the computer understands.
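To see what "converting tokens to numbers" can look like in its simplest form, here is a rough Bag of Words sketch using only the standard library. The helper names and the two sample sentences are made up for illustration, and it uses plain whitespace tokenization.

```python
from collections import Counter

def build_vocabulary(documents):
    # Give each distinct token an index, in order of first appearance.
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def bag_of_words(document, vocab):
    # Count how often each vocabulary token occurs in the document.
    counts = Counter(document.lower().split())
    return [counts.get(token, 0) for token in vocab]

docs = ["the power level is over 9000", "the level is low"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
print(vectors)  # [[1, 1, 1, 1, 1, 1, 0], [1, 0, 1, 1, 0, 0, 1]]
```

Each sentence becomes a fixed-length list of counts. Word order is lost, which is one reason richer representations like embeddings exist.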

ChatGPT and Tokens

What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP.

Broadly speaking, we can call it “subword tokenization,” where we create tokens that need not necessarily be complete words as we see in whitespace tokenization. This is precisely why one word is not equal to one token.

When they say GPT-4 Turbo has a context length of 128K tokens, that does not mean 128K words. In English text, many words are split into more than one token, so the same window holds fewer words than tokens.
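As a back-of-the-envelope sketch, you can estimate the word count from a token budget. The ratio of 0.75 words per token is an assumption here: it is a commonly quoted rule of thumb for English text, and the real ratio depends on the tokenizer and the language.

```python
# Assumed rule of thumb for English text; not an exact property of any model.
WORDS_PER_TOKEN = 0.75

def estimate_words(num_tokens, words_per_token=WORDS_PER_TOKEN):
    # Rough conversion from a token budget to an approximate word count.
    return round(num_tokens * words_per_token)

print(estimate_words(128_000))  # 96000
```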

Why use such different and more complicated tokenization methods?

  1. Subword tokens sit between single characters and whole words, so they can represent word parts such as stems, prefixes, and suffixes.
  2. They help address a large range of vocabulary, including rare and unknown words.
  3. A fixed, moderately sized subword vocabulary is cheaper to work with than a vocabulary that tries to list every possible word.
  4. They help with better contextual understanding.
  5. It’s more adaptable across languages that can be quite different from English.

Tokenization Methods in LLMs

Byte-Pair-Encoding (BPE)

Many open-source models, like Meta’s Llama 2 and the older GPT models, use a version of this method.

In practice, BPE analyzes a large amount of text to find the most frequent pairs of adjacent symbols (characters or bytes) and merges them into new tokens.

Let’s take a simple example with the GPT-2 Tokenizer.

from transformers import GPT2Tokenizer

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "It's over 9000!"

# Tokenize the text
token_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", token_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print("Raw tokens:", raw_tokens)
Token IDs: [1026, 338, 625, 50138, 0]
Raw tokens: ['It', "'s", ' over', ' 9000', '!']


What’s a token ID? Why is it a number?

Let’s break down how this process works.

Building the “Vocabulary” (this is basically part of the BPE method)

  • Starting with Characters: Initially, the vocabulary consists of individual characters (like letters and punctuation).
  • Finding Common Pairs: The training data (a large corpus of text) is scanned to find the most frequently occurring pairs of characters. For example, if ‘th’ appears often, it becomes a candidate to be added to the vocabulary.
  • Merging and Creating New Tokens: These common pairs are then merged to form new tokens. The process continues iteratively, each time identifying and merging the next most frequent pair. The vocabulary grows from individual characters to common pairings and eventually to larger structures like common words or parts of words.
  • Limiting the Vocabulary: There’s a limit to the vocabulary size (e.g., 50,257 tokens in GPT-2). Once this limit is reached, the process stops, resulting in a fixed-size vocabulary that includes a mix of characters, common pairings, and more complex tokens.
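The steps above can be sketched in a few lines of plain Python. This is a toy version under my own assumptions: it works on characters rather than bytes, uses a five-word corpus, and stops after a fixed number of merges instead of at a target vocabulary size. The function name learn_bpe_merges is made up for illustration.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Start with every word split into individual characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["the", "the", "then", "this", "that"]
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)        # [('t', 'h'), ('th', 'e')]
print(dict(corpus))
```

Here "th" is merged first because it is the most frequent pair, and "the" follows because "th" plus "e" is the next most frequent. A real tokenizer repeats this tens of thousands of times on a huge corpus.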

Assigning Token IDs

  • Indexing the Vocabulary: Each unique token in the final vocabulary is assigned a unique numerical index or ID. This is done straightforwardly, much like indexing in a list or array.
  • Token ID Representation: In the context of GPT-2, each piece of text (like a word or part of a word) is represented by the ID of the corresponding token in this vocabulary. If a word is not in the vocabulary, it’s broken down into smaller tokens that are in the vocabulary.
  • Special Tokens: Special tokens (like those representing the start and end of a text or unknown words) are also assigned unique IDs.
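Here is a rough sketch of indexing a vocabulary and falling back to smaller pieces. The toy vocabulary and the "<unk>" token are invented for illustration. Note also that I use greedy longest-match here as a simplification: a real BPE tokenizer applies its learned merges in order instead.

```python
def build_token_ids(tokens):
    # Index the vocabulary like a list: the position becomes the token ID.
    return {token: idx for idx, token in enumerate(tokens)}

def encode(text, token_to_id):
    # Greedy longest-match: at each position, take the longest known token.
    ids, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i = end
                break
        else:
            # Not even a single character matched, so emit the unknown token.
            ids.append(token_to_id["<unk>"])
            i += 1
    return ids

# Hypothetical toy vocabulary, with a special token in position 0.
toy_vocab = ["<unk>", "o", "v", "e", "r", "ov", "er", "over", "9", "0", "90", "00"]
token_to_id = build_token_ids(toy_vocab)

print(encode("over", token_to_id))  # [7]
print(encode("9000", token_to_id))  # [10, 11]
print(encode("rove", token_to_id))  # [4, 5, 3]
```

"over" is in the vocabulary, so it gets a single ID. "9000" and "rove" are not, so they are covered by smaller tokens that are.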

The key point is that the assignment of token IDs is not arbitrary but based on the frequency of occurrence and combination patterns in the language data the model was trained on. This allows GPT-2 and similar models to efficiently process and generate human language using a manageable and representative set of tokens.

Here, the “vocabulary” refers to all the unique tokens that the model can recognize and work with. It’s essentially the tokens created with the help of training data using the given tokenization method.

Phew! That’s a lot of stuff to process.

Most current-generation LLMs use some variation of BPE. For example, the Mistral model uses a BPE tokenizer with byte fallback.

Other methods beyond BPE include Unigram and WordPiece. SentencePiece, often listed alongside them, is a tokenization library that implements BPE and Unigram.

Let’s not worry about all that.

For now, what’s important to know is that creating tokens is one of the first steps when dealing with NLP or LLMs. Different tokenization methods exist to create tokens, which are also assigned some token IDs.

What’s an Embedding?

We already came across this word. Before we get to it, let’s clear up some confusion.

  1. Token IDs are a straightforward numerical representation of tokens. They are, in fact, a basic form of vectorization. They do not capture any deeper relationships or patterns between the tokens.
  2. Standard vectorization techniques (like TF-IDF) create more complex numerical representations based on some fixed logic, such as counting.
  3. Embeddings are advanced vector representations of tokens. They try to capture the nuances, connections, and semantic meanings between tokens. Each embedding is generally a series of real numbers in a vector space, computed by a neural network.

In short, text is converted to tokens. Tokens are assigned token IDs. These token IDs can be used to create embeddings for more nuanced numerical representation in complex models.
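At its simplest, going from token IDs to embeddings is a table lookup. A minimal sketch, assuming a tiny vocabulary of 10 tokens and 4 dimensions, with random values standing in for what a real model would learn (the function names are made up):

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    # One row of `dim` random floats per token ID; a real model learns these.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    # An embedding lookup is just row selection by token ID.
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 7, 3], table)

print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True
```

The same token ID always maps to the same row in a plain lookup like this.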

Why all this?

Because computers understand and operate over numbers.

Embeddings are the “real inputs” of LLMs.

Let’s create an embedding to see what it really looks like.

Token to Embedding Conversion

Just like different tokenization methods, we have several approaches to make the token-embedding conversion. Here are some of the popular ones:

  1. Word2Vec — a neural network model
  2. GloVe (Global Vectors for Word Representation) — an unsupervised learning algorithm
  3. FastText — an extension of Word2Vec
  4. BERT (Bidirectional Encoder Representations from Transformers)
  5. ELMo (Embeddings from Language Models) — a deep bidirectional LSTM model.

Let’s not worry about the internal workings of each method for now. All you need to know is that you can use them to create numerical representations of text that computers can make sense of.

Let me use BERT to create embeddings as an example.

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')

# Text to be tokenized
text = "It's over 9000!"

# Encode text
input_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", input_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in input_ids]
print("Raw tokens:", raw_tokens)

# Convert list of IDs to a tensor
input_ids_tensor = torch.tensor([input_ids])

# Pass the input through the model
with torch.no_grad():
    outputs = model(input_ids_tensor)

# Extract the embeddings
embeddings = outputs.last_hidden_state

# Print the embeddings
print("Embeddings: ", embeddings)
Token IDs: [101, 2009, 1005, 1055, 2058, 7706, 2692, 999, 102]
Raw tokens: ['[CLS]', 'it', "'", 's', 'over', '900', '##0', '!', '[SEP]']
Embeddings: tensor([[[ 0.1116, 0.0722, 0.3173, ..., -0.0635, 0.2166, 0.3236],
[-0.4159, -0.5147, 0.5690, ..., -0.2577, 0.5710, 0.4439],
[-0.4893, -0.8719, 0.7343, ..., -0.3001, 0.6078, 0.3938],
[-0.2746, -0.6479, 0.2702, ..., -0.4827, 0.1755, -0.3939],
[ 0.0846, -0.3420, 0.0216, ..., 0.6648, 0.3375, -0.2893],
[ 0.6566, 0.2011, 0.0142, ..., 0.0786, -0.5767, -0.4356]]])

Carefully observe the code.

  • Like in the previous example with GPT-2, we first tokenize the text. The BERT tokenizer uses the WordPiece method for this. It breaks words into smaller pieces drawn from a fixed vocabulary and marks word-internal pieces with a ## prefix (like '##0' in the output above).
  • We get the token IDs and then print the raw tokens. Notice how it’s different compared to the GPT-2 tokenizer output.
  • We create a tensor from the token IDs and pass it to a pre-trained BERT model as input.
  • We take the final output from the last hidden state.

As you can see, embeddings are basically arrays of numbers.

When you say, “It’s over 9000!” the computer essentially reads a large tensor of real numbers: one vector per token, with 768 values in each vector for bert-base-uncased.

Why are embeddings so large and complex? What do they signify?

  1. Each token’s embedding is a high-dimensional vector. This allows the model to capture a wide range of linguistic features and nuances, like the meaning of a word, its part of speech, and its relationship to other words in the sentence.
  2. Contextual Embeddings: Unlike simpler word embeddings (like Word2Vec), BERT’s embeddings are contextual. This means the same word can have different embeddings based on its context (its surrounding words). The embeddings need to be rich and complex to capture this contextual nuance.
  3. In our example, the sentence “It’s over 9000!” is tokenized into multiple tokens (including special tokens added by BERT for processing). Each token gets its own embedding vector.
  4. In more complex models like BERT, you not only get the final embeddings but also have access to the embeddings from each layer of the neural network. Each layer captures different aspects of the language, adding to the complexity and size of the tensor.
  5. Input for Further Tasks: These embeddings are used as input for various NLP tasks like sentiment analysis, question answering, and language translation. The richness of the embeddings allows the model to perform these tasks with a high degree of sophistication.
  6. Model’s Internal Representation: The complexity of these tensors reflects how the model ‘understands’ language. Each dimension in the embedding can represent some abstract language feature the model learned during its training.

In short, embeddings are the secret sauce that makes the LLMs work so well. If you find ways to create better embeddings, you will likely create a better model.

When these numbers are processed with the architecture of a trained AI model, it computes new values in the same format, representing the answer to the task for which the model was trained. In LLMs, it’s the prediction of the next token.

The result you see on the user interface is basically the text retrieved from the output numbers produced.
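To make the "numbers back to text" step concrete, here is a small sketch of how a vector of output scores can be turned into the next token. The toy vocabulary and the scores are hypothetical, and picking the highest-probability token (greedy decoding) is just one option. Real systems often sample instead.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["It", "'s", " over", " 9000", "!"]  # hypothetical toy vocabulary
logits = [0.1, 0.3, 0.2, 2.5, 1.0]           # hypothetical model scores

probs = softmax(logits)
# Greedy decoding: choose the token ID with the highest probability.
next_id = max(range(len(probs)), key=probs.__getitem__)
next_token = vocab[next_id]

print(next_id, repr(next_token))  # 3 ' 9000'
```

The model produces one score per vocabulary entry, and the winning ID is looked up in the vocabulary to get the text you see.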

When training an LLM, you are essentially trying to optimize all the mathematical computations that happen in the model with the input embeddings to create the desired output.

All such computations include some parameters called model weights. They determine how the model processes input data to produce output.

The input embeddings are, in fact, a subset of the model’s weights. They are the weights of the embedding layer, which is generally the first layer in models like Transformers. Contextual embeddings, like the BERT output above, are then computed from them by the layers that follow.

Model weights, including the embedding weights, can be initialized with random values or taken from pre-trained models. These values are then updated during the training phase.

The goal is to find the right values for the model weights, such that the computations it does, given an input, produce the most accurate output for the given context.

Intuitive Conclusions

  1. Large language models are basically large black boxes doing complex computations with embeddings and model weights.
  2. Text -> Tokens -> Token IDs -> Embeddings. Computers operate over numbers under the hood. Embeddings are the secret sauce that gives LLMs their contextual language understanding.
  3. There are many different techniques to create tokens and embeddings and that can significantly affect how the model works.

And that's a high-level overview of the building blocks in creating/using LLMs. I have documented what I have been exploring, the way I understand it.

Random Fun Fact

We computed a huge tensor array embedding of the simple text, “It’s over 9000!”. How many elements are actually there in that embedding?

PyTorch tensors have a simple method called numel() to calculate this.

# Calculate the number of elements in the embedding tensor
num_elements = embeddings.numel()
print(num_elements)
6912


Hmm. Looks like “It’s over 9000!” isn’t actually over 9000 in this context :P
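Where does that count come from? Here is a quick sketch, assuming the embedding tensor has the shape (1, 9, 768): one sentence in the batch, the 9 token IDs printed earlier, and a hidden size of 768, which is my assumption for bert-base-uncased rather than something visible in the printed output.

```python
from math import prod

# Assumed shape: (batch size, number of tokens, hidden size).
shape = (1, 9, 768)

# The element count of a tensor is the product of its dimensions.
num_elements = prod(shape)
print(num_elements)  # 6912
```

A longer sentence, or a model with a larger hidden size, would push the count past 9000 quickly.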

In case of any errata or follow-up questions, feel free to discuss them in the article responses section. I will update the content accordingly.

Loved the content and want me to write such in-depth articles for your startup website, blog, or documentation? Feel free to hit me up with a proposal at adityavivek.xq@gmail.com.


