Attention Mechanism in Deep Learning: Simplified

Prakhar Ganesh
5 min read · Feb 29, 2020


Why is attention in deep learning getting so much… umm, attention?

What exactly is the attention mechanism?

Look at the image below and answer me: what is the color of the soccer ball? Also, which Georgetown player (the ones in white) is wearing the captain’s armband?

[Source]

When you were trying to figure out answers to the questions above, did your mind do this weird thing where it focused on only part of the image?

Also, when you were reading the sentence above, did your mind start associating different words together, ignoring certain phrases at times to simplify the meaning?

What happened? Well, it’s easy enough to explain. You were ‘focusing’ on a smaller part of the whole thing because you knew the rest of the image/sentence was not useful to you at that particular moment. So when you were trying to figure out the color of the soccer ball, your mind was showing you the soccer ball in HD but the rest of the image was almost blurred. Similarly, when you were reading the question, once you understood that the guys in white were Georgetown players, you could blur out that part of the sentence to simplify its meaning.

In an attempt to borrow inspiration from how the human mind works, researchers in deep learning have tried to replicate this behavior using what is known as the ‘attention mechanism’. Very simply put, the attention mechanism is just a way of focusing on a smaller part of the complete input while ignoring the rest.

How does it work?

Attention can be represented as a simple three-step mechanism. Since we are talking about attention in general, I will not go into the details of how this adapts to CV or NLP; the adaptation is actually quite straightforward.

  1. Create a probability distribution that rates the importance of the various input elements. These input representations can be words, pixels, vectors, etc. Creating these probability distributions is itself a learnable task.
  2. Scale the original input using this probability distribution so that the values that deserve more attention get enhanced while the others get diluted. Kinda like blurring everything that doesn’t need attention.
  3. Use these newly scaled inputs for further processing to get focused outputs/results (a small sketch of these three steps follows below).
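
To make these three steps concrete, here is a toy NumPy sketch, not the code of any particular model: the importance scores are hard-coded for illustration, whereas in a real network they would come from a learned scoring layer.

```python
import numpy as np

def softmax(scores):
    # Turn raw importance scores into a probability distribution (step 1).
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def attend(inputs, scores):
    # `inputs` is (n, d): n elements (words, pixels, ...) as d-dimensional vectors.
    # `scores` is (n,): raw importance ratings; in practice these are learned.
    weights = softmax(scores)           # step 1: probability distribution
    scaled = inputs * weights[:, None]  # step 2: enhance/dilute each element
    return scaled.sum(axis=0)           # step 3: further processing (here, a weighted sum)

# Example: 4 input vectors of size 3; the third element gets most of the attention.
inputs = np.random.randn(4, 3)
scores = np.array([0.1, 0.2, 2.0, 0.0])
print(attend(inputs, scores))
```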

Attention has completely changed the NLP game

[Source]

The attention mechanism has been around in NLP for a relatively long time now, used alongside recurrent processing models like RNNs and LSTMs. As we noticed earlier, by focusing on only a short subset of words at a time, the attention mechanism can help these models better understand the language. But even after all that, attention was only used as an add-on to the main model, and RNNs were still ruling the world of NLP.

However, things changed around three years ago with the release of a paper named ‘Attention Is All You Need’ [3]. As the name suggests, the model architecture it introduced, commonly known as the Transformer, was able to replace recurrent processing units with purely attention-based networks. Not only did it easily outperform RNNs, but Transformer-based models are still making amazing progress and are the current leaders on various NLP benchmarks and tasks.

A small attention-based Transformer network [Source]
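
For reference, here is a minimal PyTorch sketch of the scaled dot-product attention at the heart of the Transformer [3], softmax(QKᵀ/√dₖ)·V. The random inputs and the single-head, projection-free setup are simplifications for illustration, not the full architecture.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V, as in 'Attention Is All You Need'.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise importance scores
    weights = F.softmax(scores, dim=-1)            # one distribution per query position
    return weights @ V, weights                    # focused outputs + the attention map

# Self-attention over a 5-token sequence with 8-dimensional embeddings.
# In a real Transformer, Q, K and V come from learned linear projections of the same input.
x = torch.randn(5, 8)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```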

Does attention mean explanation?

In the last few years, there has been tremendous hype around what is known as explainable AI, or XAI for short. With AI breaking into fields like medical diagnosis and autonomous driving, people are starting to fear that a black box is making life-and-death decisions. For us to trust the decisions made by AI, new research has been directed toward creating models that can also explain those decisions.

For several years it was believed that the attention mechanism can provide some sort of explanation for the predictions made by the model. I mean, it does make sense to think that the part of the input the model is focusing on should tell us something about the reasoning behind its predictions. However, a deeper probe recently claimed that attention really has no link with explainability and that many different attention distributions can produce similar results [4]. To add to the fun of this discovery, another paper then went against this claim, stating that ‘explainability’ is actually subjective and thus that saying attention does not provide any explanation is incorrect [5].

In my opinion though, at least on some intuitive level, probing the outputs of the attention branches of a network should provide insights into how the model works and thus should have some connection with explainability.
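
As a toy illustration of what such probing looks like, the sketch below ranks the self-attention weights of one token over the rest of a hypothetical, randomly embedded sentence; whether a ranking like this counts as an explanation is exactly what [4] and [5] disagree about.

```python
import torch
import torch.nn.functional as F

# Hypothetical sentence with random embeddings, purely illustrative.
tokens = ["the", "guys", "in", "white", "are", "georgetown", "players"]
x = torch.randn(len(tokens), 8)

# Single-head self-attention weights, as in the sketch above.
scores = x @ x.transpose(0, 1) / x.size(-1) ** 0.5
attn = F.softmax(scores, dim=-1)

# Rank the tokens that "white" attends to most strongly.
query = tokens.index("white")
for j in attn[query].argsort(descending=True):
    print(f"{tokens[j]:>12s}  weight={attn[query, j].item():.3f}")
```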

What’s next?

While attention has long been utilized as a side mechanism for improving the performance of deep learning architectures, the recent success of Transformers in NLP suggests that attention alone is powerful enough to do amazing things that other networks cannot. It will also be interesting to see how the field of explainable AI adopts the attention mechanism.

This blog is part of an effort to create simplified introductions to the field of Machine Learning. Follow the complete series here.

Or simply read the next blog in the series.

References

[1] Ramachandran, Prajit, et al. “Stand-alone self-attention in vision models.” arXiv preprint arXiv:1906.05909 (2019).
[2] Guan, Qingji, et al. “Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification.” arXiv preprint arXiv:1801.09927 (2018).
[3] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
[4] Jain, Sarthak, and Byron C. Wallace. “Attention is not Explanation.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
[5] Wiegreffe, Sarah, and Yuval Pinter. “Attention is not not Explanation.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
