Attention Networks: A simple way to understand Multi-Head Attention

Geetansh Kalra
3 min read · Jul 10, 2022


In recent years, NLP has become one of the fastest-evolving areas in deep learning, alongside computer vision. The transformer architecture has made it possible to train new models on large corpora that perform far better than recurrent neural networks such as LSTMs. These models are used for sequence classification, question answering, language modeling, named entity recognition, summarization, and translation.

In this post, we will look at the multi-head attention mechanism. To follow along, you need a good understanding of what attention mechanisms are; I won't introduce them here, so if you need a refresher, check out my previous post.

What is Multi-Head Attention?

Multi-head attention is, in essence, several self-attention mechanisms clubbed together. Let's understand this with an example. Assume you are playing a game and you come across a giant fire-breathing dragon, and fire is the only way the dragon can hurt your character. Your goal is to defeat the dragon. If you handed this problem to a self-attention mechanism, the algorithm would learn to pay attention to the dragon's neck or face, since that tells you when it is about to breathe fire 🔥, and you would defeat the beast easily.
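Before adding more dragons, it helps to see what a single self-attention head computes. The sketch below is a minimal NumPy illustration of scaled dot-product self-attention, not the implementation from any particular library; the weight matrices `Wq`, `Wk`, and `Wv` are placeholders that would normally be learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One self-attention head over a sequence x of shape (seq_len, d_model)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)    # the "attention filter": each row sums to 1
    return weights @ V                    # the filtered value matrix
```

The `weights` matrix is the head's attention filter: row *i* says how much token *i* attends to every other token, and multiplying it by `V` blends the value vectors accordingly.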


Now let's make the problem more complex: what if, instead of one dragon, there were two, and both could breathe fire?


With only a single attention head you would likely be defeated, since it can focus on just one dragon at a time. This is where multi-head attention comes to the rescue.

Source: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention

As mentioned above, multi-head attention runs several self-attention heads in parallel, and each head can learn a different linguistic phenomenon. Each attention head therefore produces its own attention filter, which in turn yields its own filtered value matrix, each zooming in on a different combination of linguistic features. The image below gives a better intuition.

Source: Image made by Author

After choosing the number of attention heads required for the problem (in our case, just two heads), we concatenate their outputs and pass them through a linear layer to get the desired output shape.
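The concatenate-then-project step can be sketched end to end in NumPy. This is an illustrative toy, not a production implementation: real libraries typically split one large projection across heads for efficiency, whereas here each head simply carries its own hypothetical `(Wq, Wk, Wv)` weight triple, and `Wo` stands in for the final linear layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    # one self-attention head: attention filter applied to the value matrix
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(x, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples, one per attention head."""
    outputs = [attention_head(x, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, num_heads * d_v)
    return concat @ Wo                          # final linear layer
```

Each head produces its own filtered value matrix; concatenation stacks them side by side, and the final projection `Wo` mixes the heads back into a single representation of the desired shape.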

Source: Image made by Author

Done! This is how multi-head attention works, and this is how you could defeat however many dragons the game throws at you at once.


In the next article, I will cover cross-attention and masked attention.

Hope this Helps!!

References:

1: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention: https://youtu.be/mMa2PmYJlCo


Geetansh Kalra

Hello People. I work as a Data Scientist at Thoughtworks. I like to write about AI/ML/data science topics and investing.