Transformers — Unknown Hero. Part 2

Mika.i Chak
Mar 29, 2024

Previously, in Part 1, I used the word “word”, but that was only for convenience. In fact, “token” is a more accurate term than “word”. In this post, we are going to see what tokenization is and how each model uses a different tokenization mechanism.

Tokenization is the process of converting a sequence of text into individual words, subwords, or tokens that the model can understand. LLMs use subword algorithms such as BPE (Byte-Pair Encoding) or WordPiece to split text into smaller units that capture both common and uncommon words. This approach keeps the model’s vocabulary size bounded while preserving its ability to represent any text sequence.
Source: https://datasciencedojo.com/blog/embeddings-and-llm/
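
To make the subword idea concrete, here is a minimal sketch using the Hugging Face transformers library and the pretrained GPT-2 tokenizer (my choice for illustration; the post itself does not name a library). A rare word that is not in the vocabulary as a whole is broken into smaller units that are:

```python
# A minimal sketch of subword (BPE) tokenization, assuming the
# Hugging Face "transformers" library and the GPT-2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Common words map to single tokens; the Ġ symbol marks a leading space.
print(tokenizer.tokenize("The quick brown fox"))
# e.g. ['The', 'Ġquick', 'Ġbrown', 'Ġfox']

# A rarer word is split into subword units, so the model can
# represent any text without an out-of-vocabulary token.
print(tokenizer.tokenize("tokenization"))
# e.g. ['token', 'ization']
```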

Using the simple input “The quick brown fox jumps over a lazy dog named Brown.”, we can observe some differences between LLM models (a sketch for reproducing these observations follows the list).

  • Each model assigns a different token ID to the same token. For the word “The”, Gemma-7b uses 651 while GPT-2 uses 464.
  • Words are split differently. Gemma-7b recognizes the word “jumps” as a single token (178057), while GPT-2’s and Mistral’s vocabularies split it into two tokens, “j” and “umps”.
  • Line breaks are treated differently. Most of the models here have a special token for the line break, but the bert-base-uncased model doesn’t.
  • Capital letters are treated differently. For…
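
The sketch below shows one way to reproduce these observations, again assuming the Hugging Face transformers library; the model names are the ones used on the Hugging Face Hub, and the Gemma and Mistral checkpoints are gated there, so downloading their tokenizers may require accepting a license and logging in:

```python
# A sketch comparing how different models tokenize the same input,
# assuming the Hugging Face "transformers" library. The Gemma and
# Mistral repositories are gated on the Hub and may require auth.
from transformers import AutoTokenizer

text = "The quick brown fox jumps over a lazy dog named Brown.\n"

for name in ["gpt2", "bert-base-uncased",
             "google/gemma-7b", "mistralai/Mistral-7B-v0.1"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(name)
    print(list(zip(tokens, ids)))
```

Running this also makes the last bullet concrete: bert-base-uncased lowercases its input during tokenization, so the distinction between “Brown” and “brown” is lost before the model ever sees the text.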
