Published in Analytics Vidhya

“All you need is Attention,” Part I: But Why?

The title may look confusing, so we need to do a little digging to make sense of it. Buckle up: it’s a walk back in time.

When Google introduced the Seq2Seq model in 2014, it was like finding the holy grail for NLP. This simple deep-learning encoder-decoder model was at first used mainly for machine translation. In a nutshell, sequence-to-sequence learning is about training models to convert sequences from one domain (e.g. sentences in English) into sequences in another domain (e.g. the same sentences translated into French). Gradually it became the most sought-after architecture for NLP tasks where text generation was the primary objective.

A typical depiction of the encoder-decoder model and its associated bottlenecks

Recurrent Neural Networks (RNNs), which process input step by step so that each step depends on the ones before it, became the go-to technique for seq2seq models. But plain RNNs are hard to train because of the vanishing gradient problem. As elsewhere, LSTM and GRU, the most successful extensions of the RNN, came to the rescue here too. With their ability to retain the most relevant information in a sentence, LSTMs let the model take context into account when generating output sequences. And thus seq2seq became a big hit with the right combination of architecture and algorithm. Or did it?

A seq2seq model processes its input one step at a time: each word passes through the encoder, which updates its hidden state, and the final hidden state is handed to the decoder. But that hidden state is a vector of fixed length. So whenever the input carried more information than this fixed-length memory could hold, the hidden state became a bottleneck on the way to the decoder. On top of this, the model tended to give more weight to the later words in the sequence, so context from earlier words was lost when generating the output text. Longer sequences thus became problematic for vanilla seq2seq. To handle this difficulty, the idea of “one vector per word” instead of one vector per sentence was introduced. This tamed the weakness of the fixed-length representation. But the model now had the added responsibility of processing all these vectors, which put a strain on memory.
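To make the bottleneck concrete, here is a minimal sketch of a toy encoder in plain Python. It is not the original Seq2Seq implementation: the vanilla RNN step, the random weights, and names such as `HIDDEN_SIZE`, `rnn_step` and `encode` are all assumptions made for illustration. It shows that the final hidden state has the same size for a 3-word and a 30-word input, while keeping every hidden state gives one vector per word.

```python
import math
import random

HIDDEN_SIZE = 4  # assumed toy size of the encoder's hidden state
EMBED_SIZE = 3   # assumed toy size of each word vector


def make_matrix(rows, cols, rng):
    """Random weight matrix; a real model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_step(h, x, w_h, w_x):
    """One vanilla RNN step: h_new = tanh(W_h h + W_x x)."""
    return [
        math.tanh(
            sum(w_h[i][j] * h[j] for j in range(len(h)))
            + sum(w_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(len(h))
    ]


def encode(embeddings, w_h, w_x):
    """Run the encoder; return (one state per word, final state)."""
    h = [0.0] * HIDDEN_SIZE
    states = []
    for x in embeddings:
        h = rnn_step(h, x, w_h, w_x)
        states.append(h)
    return states, h


rng = random.Random(0)
w_h = make_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rng)
w_x = make_matrix(HIDDEN_SIZE, EMBED_SIZE, rng)

# Stand-ins for embedded sentences of 3 and 30 words.
short_sentence = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(3)]
long_sentence = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(30)]

short_states, short_summary = encode(short_sentence, w_h, w_x)
long_states, long_summary = encode(long_sentence, w_h, w_x)

# "One vector per sentence": same size no matter how long the input is.
print(len(short_summary), len(long_summary))
# "One vector per word": the amount kept grows with the input.
print(len(short_states), len(long_states))
```

The first print shows the fixed-length summary that the decoder would have to rely on; the second shows why keeping one vector per word relieves that bottleneck but increases what the model has to hold and process.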

That left two major challenges: the memory problem and the difficulty of capturing context in longer sequences. Context also placed a new demand on predictions: the model should consider the alignment between the input and output sentences. We now had a vector for each word in the sequence, so we needed a way to find the context without giving every vector equal treatment at every step. And that is exactly what attention did. With attention, the model can focus on the specific words that matter for predicting the next output. The selected words supply the context, and the memory issue is eased because the model mainly has to deal with those word vectors.
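As a rough preview of the idea, the sketch below scores each word vector against the decoder's current state and turns the scores into weights. The dot-product scoring, the hand-picked vectors, and the names `softmax` and `attend` are assumptions chosen for illustration, not something this article specifies. Note that in this soft form every vector still receives a weight; most of the weight simply lands on the words that matter.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention: weight each state by its match with the query."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # The context vector is the weighted sum of the per-word vectors.
    context = [
        sum(w * state[d] for w, state in zip(weights, states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical per-word encoder vectors and a hypothetical decoder state.
words = ["the", "cat", "sat"]
encoder_states = [
    [0.1, 0.0, 0.2],
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
decoder_state = [1.0, 1.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up numbers the largest weight goes to “cat”, because its vector matches the decoder state best, and the context vector is dominated by that word.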

Now that we know why we needed attention, it’s time to shed some light on how attention works. Stay tuned for Part II.




Sreelakshmi V B

NLP | Data Science Enthusiast | Writer | Singer
