Attention Networks: A simple way to understand Multi-Head Attention

Geetansh Kalra
3 min read · Jul 10, 2022


In recent years, NLP has become one of the fastest-evolving areas in deep learning, alongside computer vision. The transformer architecture has made it possible to train new models on large corpora that perform far better than recurrent neural networks such as LSTMs. These models are used for sequence classification, question answering, language modeling, named entity recognition, summarization, and translation.

In this post, we will look at the multi-head attention mechanism. To follow along, you need a good understanding of what attention mechanisms are; I won't introduce them here, so if you need a refresher, check out my previous post.

What is Multi-Head Attention?

Multi-head attention is, in essence, several self-attention mechanisms clubbed together. Let's understand this with an example. Assume you are playing a game and you come across a giant fire-breathing dragon, and fire is the only way the dragon can hurt your character. Your goal is to defeat the dragon. If you handed this problem to a self-attention mechanism, the algorithm would learn to pay attention to the dragon's neck or face, since that tells you when it is about to breathe fire 🔥, and you would defeat the beast easily.
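Before adding more dragons, it helps to see what a single self-attention head computes. The sketch below is a minimal NumPy illustration of scaled dot-product self-attention, not the implementation from any particular library; the weight matrices `Wq`, `Wk`, and `Wv` are placeholders that would normally be learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One self-attention head over a sequence x of shape (seq_len, d_model)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)    # the "attention filter": each row sums to 1
    return weights @ V                    # the filtered value matrix
```

The `weights` matrix is the head's attention filter: row *i* says how much token *i* attends to every other token, and multiplying it by `V` blends the value vectors accordingly.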


Now let's make the problem more complex: what if, instead of one dragon, there were two, and both could breathe fire?


With only a single attention head you would likely be defeated, since it can focus on just one dragon at a time. This is where multi-head attention comes to the rescue.

Source: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention

As mentioned above, multi-head attention runs several self-attention heads in parallel, and each head can learn a different linguistic phenomenon. Each attention head therefore produces its own attention filter, which in turn yields its own filtered value matrix, each zooming in on a different combination of linguistic features. The image below gives a better intuition.

Source: Image made by Author

After choosing the number of attention heads required for the problem (in our case, just two heads), we concatenate their outputs and pass them through a linear layer to get the desired output shape.
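The concatenate-then-project step can be sketched end to end in NumPy. This is an illustrative toy, not a production implementation: real libraries typically split one large projection across heads for efficiency, whereas here each head simply carries its own hypothetical `(Wq, Wk, Wv)` weight triple, and `Wo` stands in for the final linear layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    # one self-attention head: attention filter applied to the value matrix
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(x, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples, one per attention head."""
    outputs = [attention_head(x, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, num_heads * d_v)
    return concat @ Wo                          # final linear layer
```

Each head produces its own filtered value matrix; concatenation stacks them side by side, and the final projection `Wo` mixes the heads back into a single representation of the desired shape.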

Source: Image made by Author

Done! This is how multi-head attention works, and this is how you could defeat however many dragons the game throws at you at once.


In the next article, I will cover cross-attention and masked attention.

Hope this Helps!!

References:

1: Visual Guide to Transformer Neural Networks — (Episode 2) Multi-Head & Self-Attention: https://youtu.be/mMa2PmYJlCo


Geetansh Kalra

Hello People. I work as a Data Scientist at Thoughtworks. I like to write about AI/ML/data science topics and investing.