“All You Need Is Attention,” Part I: But Why?

Sreelakshmi V B · Published in Analytics Vidhya · 3 min read · Dec 9, 2021

As confusing as the title looks, we have some digging to do if we are going to make sense of it. Buckle up, it’s a walk back in time.

When Google introduced the Seq2Seq model in 2014, it was like finding the holy grail for NLP. This deep learning-based encoder-decoder model was initially used mainly for machine translation. In a nutshell, sequence-to-sequence learning is about training models to convert sequences from one domain (e.g. sentences in English) into sequences in another domain (e.g. the same sentences translated into French). Gradually it became the most sought-after architecture for NLP tasks where text generation was the primary objective.

A typical depiction of the encoder-decoder model and its associated bottlenecks

The Recurrent Neural Network (RNN), with its characteristic ability to model dependencies across the input, unsurprisingly became the go-to building block for seq2seq models. But training remained difficult because of the inherent problem of vanishing gradients. As usual, LSTMs and GRUs, the most successful extensions of the RNN, turned out to be the messiah here too. With their ability to keep the most relevant information in the sentence, LSTMs proved capable of carrying contextual information into the generation of output text sequences. And thus seq2seq became a big hit with the right combination of architecture and algorithm. Or did it?

A seq2seq model processes the input one step at a time, passing it through the encoder, the hidden states and then the decoder. But the summary the encoder produces is a single vector of fixed size. So whenever the input grew long, everything the sentence contained had to be squeezed through this fixed-length bottleneck on its way to the decoder. On top of that, the model tended to give more weight to the later words in the sequence, so the context carried into the generated output skewed towards the end of the sentence. Longer sequences therefore became problematic for the vanilla seq2seq model to handle. To address this difficulty, the idea of “one vector per word” instead of one vector per sentence was introduced. This tamed the weakness of the fixed-length representation, but the hidden states were now burdened with processing all of these vectors, which strained memory.
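To make the bottleneck concrete, here is a minimal sketch in NumPy (the toy sizes and weight names are my own, purely illustrative): a plain RNN encoder reads the sentence, a vanilla seq2seq decoder would only see the final fixed-size state, whereas the “one vector per word” idea keeps every hidden state instead.

```python
# Minimal sketch (illustrative only): fixed-size summary vs. one vector per word.
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, hid_dim = 6, 8, 4          # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, emb_dim))      # embedded input words
W_xh = rng.normal(size=(emb_dim, hid_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hid_dim, hid_dim))   # hidden-to-hidden weights

def encode(inputs):
    """Run a plain tanh RNN and return every hidden state."""
    h = np.zeros(hid_dim)
    states = []
    for word in inputs:
        h = np.tanh(word @ W_xh + h @ W_hh)  # new state mixes this word with history
        states.append(h)
    return np.stack(states)                  # shape: (seq_len, hid_dim)

all_states = encode(x)
context_vector = all_states[-1]              # vanilla seq2seq: only the last state survives
print(context_vector.shape)                  # (4,)  -- same size no matter how long the sentence
print(all_states.shape)                      # (6, 4) -- "one vector per word"
```

Notice that the fixed-size context vector has the same shape whether the input is six words or six hundred, which is exactly the squeeze described above.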

Now we had two major challenges to beat: the memory problem, and grasping the context of longer sequences. Context placed a new demand on predictions: the model should consider which parts of the input sentence align with each output word. We now had a vector for every word in the sequence, so we needed some way to find the context without having to weigh the entire set of vectors equally. That is exactly what attention did. With attention, we were able to look at the specific words that matter for predicting the next output. The selected words captured the context, and the memory pressure eased because only those word vectors carried real weight at each step.
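For intuition, here is a minimal, hedged sketch of that idea using plain dot-product scoring in NumPy (real attention layers typically learn their scoring function; the names here are hypothetical): the current decoder state is scored against each per-word encoder vector, the scores become weights, and the context is a weighted sum focused on the words that matter.

```python
# Minimal sketch (illustrative only) of attention over per-word encoder states.
import numpy as np

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight each word vector by its relevance."""
    scores = encoder_states @ decoder_state  # how relevant is each word right now?
    weights = softmax(scores)                # turn scores into a distribution
    context = weights @ encoder_states       # weighted sum: focus on the words that matter
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))     # 6 words, one 4-dim vector each (toy sizes)
decoder_state = rng.normal(size=4)           # current decoder hidden state

context, weights = attend(decoder_state, encoder_states)
print(np.round(weights, 3))                  # sums to 1; the largest weights pick out the relevant words
```

The weights form a distribution over the input words, so at every decoding step the model can look back at the whole sentence yet effectively concentrate on only a few positions.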

Now that we know why we needed attention, it’s time to shed some light on how attention works. Stay tuned for Part II.
