Transformers Course — Lesson 2: Understanding the Architecture of Transformers

Machine Learning in Plain English
3 min read · May 11, 2023

--

Key Components: Encoders, Decoders, and Self-Attention Mechanism

Let’s imagine Transformers as a high-tech factory. The raw materials (data) enter, and a finished product (processed data) comes out. This factory has three main departments: the Encoder, the Decoder, and the Self-Attention Mechanism.

  • Encoder: Picture the Encoder as the initial quality check. It examines each piece of raw material (data point) and converts it into a more usable form. In Transformers, the Encoder processes the input data and translates it into a sequence of vectors — compact, rich representations that capture the essence of each data point.
  • Decoder: The Decoder is like the final assembly line. It takes the processed materials from the Encoder and assembles them into the final product. The Decoder reads the sequence of vectors and generates the output, whether that’s a translation, a summary, a response, or another form of processed data.
  • Self-Attention Mechanism: This is the secret sauce. Imagine a group of expert inspectors who can figure out how important each piece of material is to the final product, even considering how they all relate to each other. That’s what the Self-Attention Mechanism does — it determines the importance of each data point in the context of the others.

Exploring the Encoder-Decoder Structure

The Transformer’s architecture is like an intricate dance between the Encoder and Decoder, perfectly choreographed and in sync.

  • Encoder: Each Encoder is composed of multiple layers (imagine floors in a building). As the data rises from one layer to the next, it gets more and more refined. Each layer has two sub-layers: a Self-Attention Mechanism and a Feed Forward Neural Network. The Self-Attention Mechanism helps the model understand the input sequence’s context, while the Feed Forward Neural Network aids in transforming the input.
  • Decoder: The Decoder also has multiple layers, but with a twist. Each layer has an extra attention sub-layer, often called encoder-decoder (or cross-) attention, which lets the Decoder pay attention to the Encoder’s output while generating its own. The Decoder’s own self-attention is also masked, so each word can only look at the words generated before it. It’s like a quality checker cross-verifying with the initial reports while doing their inspection.
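To make the “two sub-layers per floor” idea concrete, here is a minimal NumPy sketch of a single Encoder layer: a self-attention sub-layer followed by a feed-forward sub-layer, each with a residual connection. The tiny dimensions and random weights are illustrative assumptions, and layer normalization is omitted for simplicity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(X, params):
    # Sub-layer 1: self-attention (every position looks at every other position)
    Q, K, V = X @ params["Wq"], X @ params["Wk"], X @ params["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = X + attn                                              # residual connection
    # Sub-layer 2: position-wise feed-forward network
    hidden = np.maximum(0, X @ params["W1"] + params["b1"])   # ReLU
    X = X + hidden @ params["W2"] + params["b2"]              # residual connection
    return X

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 8, 3          # toy sizes, chosen for illustration
params = {
    "Wq": rng.normal(size=(d_model, d_model)),
    "Wk": rng.normal(size=(d_model, d_model)),
    "Wv": rng.normal(size=(d_model, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
X = rng.normal(size=(seq_len, d_model))   # 3 "words", each a 4-dim vector
for _ in range(2):                        # stacking layers = floors in the building
    X = encoder_layer(X, params)
```

Stacking the layer in a loop is the “data rising from floor to floor” picture: the same two-sub-layer structure is applied repeatedly, refining the representation each time.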

Detailed View of the Self-Attention Mechanism

Picture the Self-Attention Mechanism as a team of forensic investigators, each expertly discerning how clues relate to each other to solve a case. For each word in a sentence (or each data point in a sequence), the Self-Attention Mechanism calculates a score signifying how much focus should be placed on other words when encoding a particular word. It’s like understanding not just each clue individually, but also how it relates to the whole crime scene.
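The “score signifying how much focus” each word gets can be sketched directly. Below is a minimal NumPy version of scaled dot-product self-attention; the query/key/value naming follows the standard Transformer formulation, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Returns attended output and the score matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # scores[i, j] = how much word i focuses on word j when being encoded
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))               # 3 "words", model dimension 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, scores = self_attention(X, Wq, Wk, Wv)
# Each row of `scores` sums to 1: a probability distribution over the words,
# i.e. how the "investigator" for one clue weighs all the other clues.
```

The softmax turns raw dot products into weights that sum to 1 per word, which is exactly the “importance of each data point in the context of the others” described above.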

Positional Encoding and its Significance

One puzzle piece remains: Positional Encoding. Remember, Transformers look at all data points simultaneously, unlike humans who read word by word. But the order of words matters, right? “Dog bites man” is not the same as “man bites dog.”

That’s where Positional Encoding comes in. It’s like a GPS for Transformers, helping them understand the position of data points in a sequence. It adds a sense of order, just like page numbers do for a book. Without it, Transformers would be like a reader jumbling up all the sentences in a story. With Positional Encoding, Transformers can understand sequences as well as, if not better than, their RNN and CNN predecessors.
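The “GPS” can be written down in a few lines. This is a sketch of the sinusoidal positional encoding from the original Transformer paper; the sequence length and model size here are arbitrary illustrative choices.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique vector."""
    pos = np.arange(seq_len)[:, None]     # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]       # dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=5, d_model=8)
# These vectors are added to the word embeddings, so "dog bites man" and
# "man bites dog" produce different inputs even with identical words.
```

Because every position gets a distinct pattern, the model can tell word 1 from word 3 even though it processes them all at once.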

In the next section, we’ll dive deeper into the inner workings of the Transformer model, focusing on how each component fits together to make this incredible machine work.
