Transformer Model: Simplified

Imagine you’re reading a book and you want your computer to understand it just like you do so you can discuss the story with it. The Transformer is a special kind of tool that helps the computer learn the language, the order of words, and the meaning behind sentences and paragraphs. It’s like having a brilliant friend — an intricately layered individual, if you will — who is well-read, really good at understanding stories, and can explain them to the computer in their common language of mathematics (their “grammar”) using numbers (their “alphabet”).

ilya · 5 min read · Dec 15, 2023

“The transformer is well named, as it transformed everything.” - Ilya Sutskever, Co-founder and Chief Scientist at OpenAI

Having fun with OpenAI’s DALL-E

Input Layer: Understanding Words

When reading a book, we see words on the page, and our eyes encode them into signals that our brains can work with to figure out the content of the text. Computers understand things only as ones and zeros, so they need all data in the form of numbers.

The first layer of the Transformer takes the words and turns them into a code of numbers through a process called tokenization, where each word (or part of a word) gets its own unique token, like a fingerprint, that the computer can understand and reference any time it comes across it in the text. This allows the computer to recognize and work with the words just like our brains do.
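To make this concrete, here is a toy sketch of tokenization in Python. The tiny whitespace-split vocabulary below is a stand-in for what real tokenizers do (they typically break words into sub-word pieces), but the idea of giving each piece its own numeric ID is the same.

```python
# Toy tokenization: give every word its own integer ID ("fingerprint").
# Real tokenizers (e.g. BPE) split text into sub-word pieces, but the idea is the same.
text = "the cat sat on the mat"

vocab = {}        # word -> token ID
token_ids = []
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)   # assign the next unused ID
    token_ids.append(vocab[word])

print(vocab)       # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(token_ids)   # [0, 1, 2, 3, 0, 4]
```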

Just as each human has had a different life and education, and thus may understand various ideas differently, so do language models trained on different content create different "vector embeddings": the coordinates of each word relative to all the other words within the model's matrix of vocabulary.
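As a rough sketch, a vector embedding is just a lookup into a big matrix of learned coordinates, one row per token in the vocabulary. The random numbers below are stand-ins for the values a real model learns during training.

```python
import numpy as np

vocab_size, d_model = 5, 8              # 5 toy words, 8-dimensional embeddings
rng = np.random.default_rng(0)

# In a real model this matrix is learned from data; random values stand in for it here.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [0, 1, 2, 3, 0, 4]          # "the cat sat on the mat" from the sketch above
embeddings = embedding_matrix[token_ids]   # one row of coordinates per token
print(embeddings.shape)                 # (6, 8)
```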

Understanding vector algebra and trigonometry is central to understanding language model architecture, so I suggest brushing up on these areas of math; I recommend the 3Blue1Brown YouTube channel for its great visual explanations.

Positional Encoding: Keeping Track of Order

Imagine if there were no numbers on the pages of a book, no table of contents, no chapters or section separations, no paragraphs, and you just had to guess the order of things. That would be confusing! Humanity has adopted various ways to encode our thoughts into words, with each language having its grammatical rules and standards, to best communicate our message.

We look at the text with our eyes and see where each word is relative to other words. Computers need to use numbers to achieve this. Positional Encoding is like giving each word a special number to show where it is in the sentence, like a house number on a street. It allows the computer to understand the sequence of words in a given text so it can follow the story as a whole.
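One common way to hand out these "house numbers" is the sinusoidal encoding from the original Transformer paper, where each position gets a unique pattern of sine and cosine values that is simply added to the word's embedding. A minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding, as in the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1): 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]          # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions use cosine
    return pe

# Each token's embedding gets the encoding for its position added to it.
pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8)
```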

Attention Mechanism: “Attention is All You Need”

When you’re reading a book, some words are more critical than others. The Attention Mechanism is like a highlighter that marks the important words; it’s like saying, “Hey computer, pay extra attention to these words!” It allows the model to dynamically focus on different parts of the input sequence, giving more weight to the relevant words and less weight to the less relevant ones through an attention score for each word. This ability to selectively attend to different parts of the input is crucial for capturing long-range dependencies between words, facilitating understanding of long-form text. This idea was introduced in the “Attention is All You Need” paper, which is like the instruction manual for making attention work in the Transformer Model.
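Under the hood, the “highlighter” is scaled dot-product attention: every word scores every other word, a softmax turns those scores into weights, and each word’s output is a weighted mix of the others. A minimal sketch with random stand-in values (a real layer would first produce the queries, keys, and values with learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key; softmax turns the scores into attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how relevant is each word to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted mix of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                   # 6 tokens, 8-dimensional vectors (stand-in values)
out = scaled_dot_product_attention(x, x, x)   # self-attention: the text attends to itself
print(out.shape)                              # (6, 8)
```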

Multi-Head Attention: Teamwork for Deeper Understanding

In the multi-head attention mechanism, the approach to understanding the narrative is similar to a group of friends, i.e. “heads”, each paying attention to different facets of the same story. Like individuals who resonate with varying aspects — one with character arcs, another with thematic elements, and another with plot progression — each head in this multi-head system specializes in a distinct “dimension” of the text. This allows for an understanding that is both comprehensive and nuanced, as each segment of the mechanism delves into and interprets separate layers of the narrative, like friends engaging with different aspects of a book and then having a discussion, bringing together their unique insights.

This collaborative synthesis of perspectives enables the computer to develop a multifaceted and profound understanding of the story. It goes beyond mere surface-level comprehension; it taps into the subtleties of implied meanings, emotional undercurrents, and the intricate interplay of narrative elements.
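A rough sketch of how those “friends” could be wired up: split each word’s vector into a few smaller slices, let each head run its own attention over its slice, then concatenate the heads’ answers back together. (A real implementation also applies learned projection matrices per head and a final output projection, omitted here for brevity.)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads):
    """Simplified multi-head attention: each head attends over its own slice of the vectors."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        s = x[:, h * d_head:(h + 1) * d_head]          # this "friend's" slice of the story
        weights = softmax(s @ s.T / np.sqrt(d_head))   # this head's own attention pattern
        head_outputs.append(weights @ s)
    return np.concatenate(head_outputs, axis=-1)       # bring the friends' insights back together

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                            # 6 tokens, 8-dimensional vectors
print(multi_head_attention(x, num_heads=2).shape)      # (6, 8)
```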

Feedforward Neural Network: Figuring Out the Details

Once the computer understands the big picture, it’s time to look at the nitty-gritty details. This stage is like examining each phrase and metaphor under a magnifying glass, delving into the deeper meanings embedded within every word. The network scrutinizes the text with forensic precision, probing the subtleties of language and context to extract nuanced sentiments from the text.

In this process, the Feedforward Neural Network methodically assembles the pieces of the narrative. It evaluates the nuances and inflections in the text, understanding how each element contributes to the overall story arc. This detailed examination allows the network to construct a comprehensive understanding of the text, ensuring the computer appreciates the story’s depth and complexity, like a human enjoying a well-written book.
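In the actual architecture, this detail work is done by a small two-layer network applied to each token’s vector independently: widen the representation, apply a non-linearity, then project back down. A minimal sketch with random stand-in weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feedforward block: widen, apply ReLU, then project back down."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                     # the hidden layer is usually several times wider
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(6, d_model))         # 6 token vectors (stand-in values)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (6, 8)
```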

Layer Normalization: Keeping Things Balanced

In any story, balance is crucial. Layer normalization is like making sure everything is in harmony. It ensures that the computer doesn’t get too fixated on one part of the story and keeps a balanced view, just like you would when reading a book. It’s like having a friend who makes sure you understand the whole story, not just bits and pieces.
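Concretely, layer normalization rescales each token’s vector so its values have zero mean and unit variance, which keeps any single feature from drowning out the rest. A small sketch (real layers also learn a scale and shift, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8)) * 100          # exaggerated values to show the effect
normalized = layer_norm(x)
print(normalized.mean(axis=-1).round(6))   # ~0 for every token
print(normalized.std(axis=-1).round(6))    # ~1 for every token
```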

Output Layer: Putting It All Together

After understanding the words, their order, importance, and details, it’s time to put it all together. The output layer is like the moment when you finish reading a book and understand the whole story.

This layer compiles everything the computer has learned to give a meaningful output, the moment when the computer shares its understanding of the story with you.
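In a language model, this final step is usually a projection back onto the vocabulary followed by a softmax, giving a probability for every possible next token. A minimal sketch with stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
vocab_size, d_model = len(vocab), 8

final_hidden = rng.normal(size=(d_model,))        # the last token's vector (stand-in value)
W_out = rng.normal(size=(d_model, vocab_size))    # learned projection back onto the vocabulary

logits = final_hidden @ W_out                     # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: scores become probabilities

print(vocab[int(np.argmax(probs))])               # the model's guess for the next word
```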

“Predicting the next token well means you understand the underlying reality that led to the creation of that token.” - Ilya Sutskever, Co-founder and Chief Scientist at OpenAI

A great video explaining the Transformer Model in detail.
An awesome lecture by Andrej Karpathy explaining how Large Language Models like ChatGPT work.

