Masked multi-head attention explained in a simple way

Assaad MOAWAD · Published in DataThings · Aug 28, 2023

Soft spotlight

With the recent rise of ChatGPT, many companies have started racing to implement their own version of a Large Language Model (LLM), which uses the transformer architecture. At the core of this architecture, we find multi-head attention layers. What do these layers do, and how can we demystify their function?

Introduction

Since we live in the ChatGPT era, the first step is to ask ChatGPT to explain multi-head attention in a simple way. Here is the answer I got:

In technical terms, the attention mechanism uses a kind of “soft spotlight” that assigns weights to different parts of the input. These weights determine how much attention the AI should give to each part when generating the output. By combining the weighted information from different parts of the input, the AI creates a more complete and contextually accurate result.

So, the attention mechanism is like the AI’s way of following the recipe and deciding which ingredients to focus on at each step to create the best possible outcome.

Funny to see ChatGPT explaining its own inner workings.
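To make this "soft spotlight" idea concrete, here is a minimal sketch in NumPy of attention as a weighted combination: softmax turns similarity scores into weights, and the output blends the inputs according to those weights. The function name and toy dimensions are illustrative assumptions, not code from this article or any library.

```python
import numpy as np

def soft_attention(query, keys, values):
    # Similarity score between the query and every part of the input.
    scores = keys @ query / np.sqrt(query.shape[-1])
    # Softmax turns the scores into positive weights that sum to 1:
    # this is the "soft spotlight" over the input parts.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The output is the weighted combination of the input values.
    return weights @ values, weights

# Toy example: 3 input parts, each described by a 4-dimensional vector.
keys = values = np.random.randn(3, 4)
query = np.random.randn(4)
output, weights = soft_attention(query, keys, values)
print(weights)  # e.g. [0.12 0.71 0.17]: how much attention each part receives
```

The parts of the input that score higher against the query receive larger weights, so they contribute more to the output, exactly the behaviour the quoted explanation describes.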

The next step is to search a bit about the attention layer; we find the following diagram:


