2D Tokenization for Large Language Models
This article covers how text is processed before it is passed to large language models, the problems with that approach, and an alternative solution.
The Problem With (1D) Tokenization
Before text is passed to a Large Language Model (LLM), it is broken down into a sequence of words and sub-words. This sequence of tokens is then mapped to a sequence of integers and passed to the model.
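To make this concrete, here is a rough sketch of that pipeline using the Hugging Face transformers library (the example sentence is arbitrary, and roberta_tok is the tokenizer object used again later in this article):

>>> from transformers import AutoTokenizer
>>> roberta_tok = AutoTokenizer.from_pretrained('roberta-base')
>>> tokens = roberta_tok.tokenize('A dog chased the ball')  # words and sub-words
>>> ids = roberta_tok.convert_tokens_to_ids(tokens)         # the integers the model sees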
LLMs contain an embedding matrix to store a representation for each of these tokens. In the case of the RoBERTa model, there are 768 numbers to represent each of the ~50,000 tokens in its vocabulary.
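You can check this yourself by inspecting the shape of the embedding matrix (a sketch, assuming PyTorch and the transformers AutoModel API); for roberta-base it comes out to roughly 50,000 rows of 768 numbers each:

>>> from transformers import AutoModel
>>> roberta = AutoModel.from_pretrained('roberta-base')
>>> roberta.get_input_embeddings().weight.shape  # vocabulary size x embedding dimension
torch.Size([50265, 768])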
This approach raises several questions, like “how should spaces be represented in the sequence of tokens?” and “should different capitalization be considered a different word?”.
If we look in the embedding matrix of the RoBERTa model, there isn't just one representation of the word dog; there are separate representations for variations in capitalization, pluralization, and whether or not it's preceded by a space. In total, RoBERTa has seven separate slots in its vocabulary for various forms of dog.
>>> roberta_tok.convert_ids_to_tokens([39488, 8563, 16319, 2335, 3678, 18619, 20226])
['Dog', 'ĠDog', 'dog', 'Ġdog', 'Ġdogs', 'ĠDogs', 'dogs']
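One way to find these entries yourself (a sketch, assuming the tokenizer's get_vocab() method and the roberta_tok object from above) is to scan the vocabulary for every token that reduces to "dog" after stripping the leading-space marker, case, and a trailing "s":

>>> vocab = roberta_tok.get_vocab()  # maps token string -> integer id
>>> {tok: idx for tok, idx in vocab.items()
...  if tok.lstrip('Ġ').lower().rstrip('s') == 'dog'}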