Demystifying Transformers: Tokenizers

Text-to-Number Converter

Dagang Wei
3 min read · Feb 28, 2024

This article is part of the series Demystifying Transformers.

Introduction

Machine learning models don’t work directly with raw text the way humans do. To make sense of language, these models first need to translate text into a numerical format. This is where tokenizers come into the picture, playing a crucial role in natural language processing (NLP) tasks that utilize powerful Transformer models.

What is a Tokenizer?

In essence, a tokenizer is a tool that breaks down text into smaller pieces called tokens. These tokens become the basic units that a Transformer model can process. There are a few key reasons why tokenization is crucial:

  • Numerical Representation: Computers and machine learning models fundamentally work with numbers. Tokenizers bridge this gap by converting human-readable text into numerical representations that models can understand (see the short sketch after this list).
  • Handling Vocabulary: Natural language is incredibly vast in terms of its vocabulary. Tokenizers help manage this complexity by breaking words into smaller units, reducing the overall size of unique elements a model needs to deal with.
  • Contextual Understanding: While working with text, the order and surrounding context of words are essential to meaning. Tokenizers can preserve this information to some extent by generating sequences of tokens.
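
As a minimal sketch of the first point (the vocabulary below is made up purely for illustration), mapping words to integer IDs is all it takes to turn a sentence into model-ready numbers:

# Toy, hand-written word-level vocabulary; real vocabularies are learned from data.
vocab = {"this": 0, "is": 1, "an": 2, "informative": 3, "blog": 4, "post": 5}

text = "this is an informative blog post"
token_ids = [vocab[word] for word in text.split()]
print(token_ids)  # [0, 1, 2, 3, 4, 5] -- the numeric sequence a model actually consumes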

Why Tokenizers Matter

Tokenizers form a foundational preprocessing step within NLP pipelines. The way a tokenizer works directly impacts the following:

  • Model Input: Transformer models expect numerical input. The tokens the tokenizer produces create this input.
  • Model Performance: The choice of tokenizer influences how a model interprets relationships between words and their context, ultimately affecting the model’s performance on tasks like translation, summarization, or question answering.

Types of Tokenizers

Let’s look at some common types of tokenizers:

  • Word-Based Tokenizers: Perhaps the simplest type, these tokenize text by splitting sentences into individual words, typically on whitespace and punctuation. The trade-off is a very large vocabulary and no way to represent unseen words.
  • Subword Tokenizers: These break down words into smaller units for better handling of rare words and out-of-vocabulary terms. Popular choices include:
      • Byte Pair Encoding (BPE): Builds a vocabulary by starting from individual characters and iteratively merging the most frequent pair of adjacent tokens (a simplified sketch of the merge loop follows this list).
      • WordPiece: Similar to BPE, but chooses merges based on which combination most increases the likelihood of the training data, rather than raw pair frequency.
  • Character-Level Tokenizers: Here, text is split into individual characters. The vocabulary stays tiny and nothing is ever out-of-vocabulary, and the approach also works well with languages that don’t have clear word boundaries.
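
To make the BPE idea concrete, here is a simplified, self-contained sketch of the merge loop (real implementations train on a large corpus, respect word boundaries, and store the learned merges, none of which is shown here):

from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from individual characters
for _ in range(3):                  # a few merge iterations
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
print(tokens)  # frequent pairs like ('l', 'o') and then ('lo', 'w') get merged into 'low'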

How Tokenizers Work: An Example

Let’s illustrate with a sentence: “This is an informative blog post.”

  1. A word-based tokenizer might output: [“This”, “is”, “an”, “informative”, “blog”, “post”]
  2. A BPE tokenizer could produce: [“Th”, “is”, “ an”, “ inform”, “ative”, “ blog”, “ post”]

Notice how the BPE tokenizer splits the rarer word “informative” into smaller subword pieces (“ inform” + “ative”), while more common words are kept largely intact.
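
If you have the Hugging Face transformers library installed, you can inspect a trained subword tokenizer directly. The exact split depends on the learned vocabulary, so it will likely differ from the illustration above:

# Assumes the transformers package is installed: pip install transformers
from transformers import AutoTokenizer

# GPT-2 ships with a byte-level BPE tokenizer; BERT models use WordPiece.
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(bpe_tokenizer.tokenize("This is an informative blog post."))
# Frequent words usually remain whole tokens, while rarer or invented words
# (try a long compound or a typo) get split into subword pieces.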

Python Code: Character-Level Tokenizer

The following code implements a character-level tokenizer (it is also available in a Colab notebook):

import string

class CharacterTokenizer:
    """
    Character-level tokenizer with encoding and decoding functionality.
    """

    def __init__(self):
        """
        Initializes the tokenizer with all printable ASCII characters.
        """
        self.vocab = {char: i for i, char in enumerate(string.printable)}
        self.inverse_vocab = {v: k for k, v in self.vocab.items()}

    def encode(self, text):
        """
        Encodes a text string into a list of integer token IDs.

        Args:
            text: The input text string to encode.

        Returns:
            A list of integer token IDs.
        """
        return [self.vocab[char] for char in text]

    def decode(self, tokens):
        """
        Decodes a list of integer token IDs back into a text string.

        Args:
            tokens: A list of integer token IDs to decode.

        Returns:
            The decoded text string.
        """
        return ''.join([self.inverse_vocab[token] for token in tokens])

# Example Usage
tokenizer = CharacterTokenizer()  # The vocabulary is fixed up front; no training text is needed

encoded_tokens = tokenizer.encode("Hello world!")
print(encoded_tokens)

decoded_text = tokenizer.decode(encoded_tokens)
print(decoded_text)

Output:

[43, 14, 21, 21, 24, 94, 32, 24, 27, 21, 13, 62]
Hello world!
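
One caveat: encode will raise a KeyError for any character outside string.printable (for example, an accented letter or an emoji). A minimal, hypothetical tweak is to reserve an ID for unknown characters:

# Hypothetical extension: map characters outside the vocabulary to a reserved <unk> ID.
UNK_ID = len(string.printable)

def encode_with_unk(tokenizer, text):
    """Encode text, substituting UNK_ID for characters the vocabulary doesn't cover."""
    return [tokenizer.vocab.get(char, UNK_ID) for char in text]

print(encode_with_unk(tokenizer, "Héllo"))  # the accented character becomes UNK_ID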

For a Byte Pair Encoding tokenizer, check out Andrej Karpathy’s implementation at https://github.com/karpathy/minbpe.

Conclusion

The right tokenizer is essential for building effective NLP models with Transformers. Understanding how they work and the different types available will help you make informed decisions when working with language data. Often, the best tokenizer for your needs will depend on the specific task and the nature of the language you’re working with.
