Byte Latent Transformer: Changing How We Train LLMs

Vishal Rajput · Published in AIGuys · 8 min read · Jan 5, 2025

We all know that computers can’t read text; what they read are numbers. All text is converted into numbers using different strategies before being fed to a computer. But what about AI? Can’t LLMs read and write text? No, they read and write tokens. Tokens are the fundamental units a language model uses to process and generate text. A token can represent a character, a subword, a word, or even punctuation, depending on how the model’s tokenizer works.
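To make this concrete, here is a minimal, purely illustrative sketch of how a piece of text ends up as the numbers a model actually consumes. The vocabulary and IDs below are invented for the example; a real LLM tokenizer learns a vocabulary of tens of thousands of entries from data.

```python
# Toy example: mapping text to token IDs.
# The vocabulary is made up for illustration, not taken from any real tokenizer.
vocab = {"Artificial": 0, "Intelligence": 1, "is": 2, "fun": 3, "<unk>": 4}

def encode(text: str) -> list[int]:
    # Split on whitespace and look each word up in the vocabulary,
    # falling back to an "unknown" ID for out-of-vocabulary words.
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(encode("Artificial Intelligence is fun"))   # [0, 1, 2, 3]
print(encode("Artificial Intelligence is hard"))  # [0, 1, 2, 4]
```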

However, new work from Meta’s FAIR lab is challenging this well-established token paradigm in the LLM space: it replaces tokens with dynamically sized patches and builds an architecture on top of them called the Byte Latent Transformer (BLT). So, without further ado, let’s go deep into this new paper.

Table of Contents:

  • Tokens and Tokenization
  • Tokenization Algorithms
  • The Problem
  • Dynamic Tokenization

Tokens and Tokenization

A token is a segment of text that the model processes as a single unit.

For example:

  • Word-based tokenization: “Artificial Intelligence” → ["Artificial", "Intelligence"]
  • Subword tokenization: “Artificial Intelligence” → ["Art", "ificial", "Int", "elligence"]
  • Character-based tokenization: “Artificial Intelligence” → ["A", "r", "t", "i", ...]
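As a quick illustration of the difference, the snippet below sketches all three strategies on the same string. The subword splits are hard-coded to mirror the example above (a real subword tokenizer such as BPE learns its merges from data), so treat this as a toy, not an actual tokenizer.

```python
text = "Artificial Intelligence"

# Word-based: split on whitespace.
word_tokens = text.split()        # ['Artificial', 'Intelligence']

# Character-based: every character (including the space) becomes a token.
char_tokens = list(text)          # ['A', 'r', 't', 'i', ...]

# Subword: a learned tokenizer would derive these pieces from data;
# here the splits are hard-coded purely to match the example above.
subword_tokens = ["Art", "ificial", "Int", "elligence"]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```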
