Tokenization and Embedding: The Science Behind Large Language Models

Gaurav Gupta
Epsilon Engineering Blog
6 min read · Apr 30, 2024


I bet many of you are already familiar with ChatGPT, right?

Since the launch of the ChatGPT model, terms like OpenAI, LLMs, Transformer, and Neural Networks have been buzzing around everywhere. You’ve probably delved into countless blogs or videos explaining the inner workings of large language models (LLMs), but let me switch gears a bit.

This blog isn’t about the mysteries of LLMs and their transformer architecture. Instead, we’re diving deep into two pivotal components of language models: Tokenizers and Embeddings. Let’s dive into the heart of natural language understanding!

Let’s start with a simple prompt to GPT.

Capital of Karnataka is __________?

Here is the response.

Capital of Karnataka is Bengaluru.

Therefore, from the responses we observe, it seems like language models (especially GPT) are reasoning like humans do. They craft sentences and respond just like we do. But do LLMs truly think like us? If so, how do they grasp the meanings of words the way we do? And if not, how do they manage to churn out such human-like responses? Let’s understand more with another example below.

As Bangalore’s temperatures soar by the day, you find yourself pondering the idea of opening an ice cream parlor. However, despite hours of brainstorming, you’re stumped when it comes to crafting the perfect tagline for your venture. Frustrated yet hopeful, you turn to GPT for assistance, typing in the simple request:

“Tell me about the tagline for an ice cream parlor.”

In response, GPT generates a suggestion: “Scooping Happiness...!!”.

But how does it process your input and conjure up these creative outputs?

Ah, the secret lies in three powerful components: Tokenization, Embedding, and Neural Networks. Exciting, right? Now, let’s decode each of them. But first, what exactly is tokenization, and how does it work?

Tokenization:

Here, you have provided a text input (“Tell me tagline for ice cream parlor”) to GPT. Once submitted, the GPT tokenizer reads the entire input.

The tokenizer then breaks the input text down into smaller units called tokens. Tokens can be words, subwords, or even characters, depending on the specific tokenizer used. In our case we are using the GPT-3 tokenizer.

Further, each token is assigned a unique numerical ID, and the tokenizer converts the input text into a sequence of these numerical token IDs.

Therefore, every input that we provide to GPT is nothing but a token (numerical ID) or a sequence of tokens.
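
To make this concrete, here is a minimal sketch using OpenAI’s open-source tiktoken library. The encoding name “r50k_base” is an assumption for the GPT-3 family; the exact IDs you see will differ between encodings.

```python
# pip install tiktoken
import tiktoken

# Load a GPT-3-family encoding (assumption: "r50k_base"; newer models
# use different encodings such as "cl100k_base").
enc = tiktoken.get_encoding("r50k_base")

prompt = "Tell me tagline for ice cream parlor"
token_ids = enc.encode(prompt)

print(token_ids)       # a sequence of numerical token IDs (values vary by encoding)
print(len(token_ids))  # how many tokens the model actually "sees"
```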

So, I believe we are clear so far: GPT doesn’t understand language the way humans do; it just processes sequences of numerical IDs, which we call tokens.

But how does it find the associations among words (tokens) and provide human-like responses? Here comes the concept of embedding.

Let’s spend some time understanding embeddings. A little math, please.

Embedding:

Word embedding is like creating a map for words in vector space. Each word is represented by a unique location on the map, and words with similar meanings are grouped closer together.

This helps the computer understand relationships between words. For example, the words “King” and “Man” might be close together because they’re both male and human. Similarly, “Woman” and “Queen” are close in vector space.

Interestingly, in our “tagline for an ice cream parlor” example, the response from GPT is “Scooping Happiness” because words like “ice cream” and “scoop” sit close together in the vector space; that’s why it suggests this title.

In a nutshell, word embedding helps the computer understand language better, making it smarter at tasks like translation or answering questions.

Let’s understand a bit more from the famous “King - Man + Woman = Queen” example.

Suppose our vocabulary has only five words: King, Queen, Man, Woman, and Child. We could encode the word ‘Queen’ in binary format as:
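
A minimal sketch of that one-hot (“binary”) encoding for our toy five-word vocabulary (the layout here is just an illustration):

```python
# One-hot ("binary") encoding for a toy five-word vocabulary.
vocab = ["King", "Queen", "Man", "Woman", "Child"]

one_hot = {word: [1 if j == i else 0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}

print(one_hot["Queen"])  # [0, 1, 0, 0, 0] -- a 1 in Queen's position, 0s elsewhere
```

Notice that every pair of one-hot vectors is equally far apart, so this representation alone tells the machine nothing about which words are similar.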

In the real world, the English vocabulary has around 170 thousand words, and a machine has no way to understand the similarity between “ice cream” and “scoop”, or “King” and “Queen”. To make meaningful comparisons among these words, each word must be represented in a vector space with several hundred dimensions, where each word is a distribution of weights across those dimensions, as below.
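
Here is a tiny, hand-made example of such dense vectors. The three dimensions and their values are invented purely for illustration; real GPT embeddings have hundreds or thousands of dimensions learned from data.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values chosen only so the
# famous analogy works; not real model weights).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means pointing in the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King - Man + Woman" should land closest to "Queen" in this toy space.
target = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)  # queen
```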

Now, let’s tackle the second question. Embeddings play a crucial role in enabling machines to grasp the associations and similarities between words. This understanding helps the model predict words within a sequence effectively.

Moreover, these embeddings are then fed into language models (specifically neural networks) as inputs. The neural networks churn out another set of tokens based on probability scores. However, how neural networks turn these embeddings into that next set of tokens is a topic for another discussion. For now, let’s focus only on tokenization and embedding.

Let’s focus back on the results.

So far, GPT has received a bunch of tokens, created their embeddings, processed them through neural networks, and returned another set of tokens, as below.

Here is the response from the model: it contains 20 characters and 6 tokens, which again is a sequence of numerical IDs (tokens).

These tokens (numerical IDs) are then decoded back into words by the GPT tokenizer, as below.

And here is the final response from the model.

Therefore, it’s important to note that tokenization is a reversible process. After the model generates output, the tokenizer converts the sequence of token IDs back into human-readable text.
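
A quick sketch of that reversibility, again assuming the same tiktoken encoding used earlier:

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # assumed GPT-3-family encoding, as before

# Encode the reply into token IDs, then decode them back into text.
ids = enc.encode("Scooping Happiness...!!")
print(ids)              # numerical token IDs
print(enc.decode(ids))  # Scooping Happiness...!!

# Round trip: decoding the encoded text gives the original string back.
assert enc.decode(enc.encode("Scooping Happiness...!!")) == "Scooping Happiness...!!"
```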

So finally, you got a nice tagline for your ice cream parlor and understood how the embedding and the tokenizer work.

So, if you have understood these two concepts of tokenization and embedding, you are 50% of the way to developing an end-to-end understanding of large language models.

Tokenization and Embedding process Flow:

Hope this flow further enriches your understanding of LLMs.

Here is some food for thought for you:

  • Why have tokens been created, and why can’t we just use one token for each letter?
  • How are tokens created?
  • Can I create my own tokenizer?

Bonus:

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
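
If you want to check this rule of thumb yourself, here is a quick sketch (the sample sentence and the encoding are chosen arbitrarily; the ratio moves around with the kind of text):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # assumed GPT-3-family encoding

text = "A helpful rule of thumb is that one token is roughly four characters of English text."
ratio = len(text) / len(enc.encode(text))
print(round(ratio, 2))  # usually lands somewhere around 4 for plain English
```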

GPT Tokenizer: OpenAI Platform

Hope you found this useful. If you are interested in any other LLM-related topics, please drop a comment.
