**Attention for Neural (connectionist) Machine Translation**

**A quick recap of encoder-decoder (seq2seq) model:** In part 1 of this series https://medium.com/@gautam.karmakar/seq2seq-model-for-language-a099387ce837 I tried to explain the RNN encoder-decoder model first proposed by Cho et al. in 2014. It uses two recurrent neural networks: one encodes the input sequence into a fixed-length vector representation, and the other decodes that representation into another sequence of symbols. The two networks are jointly trained to maximize the conditional probability of the target sequence given the input sequence. Cho et al. also found that using the conditional probabilities of phrase pairs computed by the RNN encoder-decoder as an additional feature in an existing log-linear statistical machine translation model improved empirical performance.

Image courtesy: https://smerity.com/articles/2016/google_nmt_arch.html

In part 2 of the series https://medium.com/@gautam.karmakar/learning-phrase-representation-using-rnn-encoder-decoder-for-machine-translation-9171cd6a6574 we saw how the vanilla encoder-decoder model is used for Neural Machine Translation.

**Problem with encoder-decoder model:** The decoder has to learn to predict the target sequence from nothing but a fixed-length vector created at the end of the encoder steps. Encoding all the relevant information of the input sequence into a single vector is problematic, especially for longer sequences. Cho et al. (2014) showed that this bottleneck gets worse with length: the performance of a basic encoder-decoder deteriorates rapidly as the input sequence gets longer. It also does not match how humans translate. A human translator does not read the source sentence once and then produce everything from memory; they pay attention to parts of the source sentence and go back to it as many times as needed to translate the complete sentence. So intuitively it makes sense for the model not to depend solely on one encoder vector to predict the complete translation.

The attention model simulates this behaviour by looking back at the source while decoding. This gives the decoder a shorter path to the source, removes the information bottleneck, and means we are no longer constrained to a single hidden state to decode from.

Image courtesy: https://smerity.com/articles/2016/google_nmt_arch.html

**Attention Model:** Unlike the basic encoder-decoder model, the attention model attends to a subset of the encoder hidden states for each word it generates, and it adaptively learns to choose this subset at every decoding step. Like the basic encoder-decoder model, it also continues to use its previous predictions. The mechanism was introduced by Bahdanau et al. (2014), building on the encoder-decoder models of Sutskever et al. (2014) and Cho et al. (2014). The paper called it alignment, but the name attention later caught on.

In this model, at each decoding step i an attention score alpha_ij is calculated for every encoder hidden state h_j, such that the scores alpha_ij add up to 1 over j. The model then produces a context vector c_i, which is the sum of the encoder hidden states weighted by these scores.

In the basic encoder-decoder model, the encoder produces a fixed context vector C from the input sequence {x_1, x_2, …, x_Tx}, where the hidden state at time step t is

h_t = f(x_t, h_{t-1})

C = q({h_1, h_2, …, h_Tx})

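Normally an LSTM is used for f, and q simply returns the last hidden state, q({h_1, …, h_Tx}) = h_Tx (Sutskever et al., 2014). The decoder predicts the next word y_t of the output sequence based on the context vector C and the previous predictions {y_1, y_2, …, y_{t-1}}.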
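The two encoder equations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: f is a simple tanh recurrent cell rather than an LSTM, q returns the last hidden state, the weights are random instead of trained, and the names `rnn_step` and `encode` are made up for this example.

```python
import math
import random

def rnn_step(x_t, h_prev, W_x, W_h):
    """One step of h_t = f(x_t, h_{t-1}), with f = tanh(W_x x_t + W_h h_{t-1})."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[k], x_t))
            + sum(w * h for w, h in zip(W_h[k], h_prev))
        )
        for k in range(len(h_prev))
    ]

def encode(xs, W_x, W_h, hidden_size):
    """Run the encoder over the inputs; return all hidden states and C."""
    h = [0.0] * hidden_size      # assumption: h_0 is the zero vector
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h)
        states.append(h)
    C = states[-1]               # q({h_1, ..., h_Tx}) = h_Tx
    return states, C

random.seed(0)
input_size, hidden_size = 3, 4
# Random (untrained) weights, for illustration only.
W_x = [[random.uniform(-0.5, 0.5) for _ in range(input_size)]
       for _ in range(hidden_size)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
       for _ in range(hidden_size)]
# A toy input sequence of length T_x = 5.
xs = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]

states, C = encode(xs, W_x, W_h, hidden_size)
print(len(states), len(C))  # prints: 5 4
```

Whatever the input length, the decoder in the basic model only ever receives `C`, a vector of `hidden_size` numbers. The intermediate `states` are thrown away, and attention is what puts them back to use.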

References: https://arxiv.org/abs/1409.0473

https://nlp.stanford.edu/pubs/emnlp15_attn.pdf

Some papers also build the encoder with a CNN architecture instead of the standard LSTM.

**Learning to align and translate:**

As shown above, in practice the attentional encoder-decoder architecture uses a bidirectional stacked LSTM as the encoder and a similar stacked LSTM as the decoder.
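To make "bidirectional" concrete, here is a heavily simplified sketch. It assumes a single layer, scalar inputs, and a toy tanh cell with scalar weights instead of a stacked LSTM; the helper names `run_rnn` and `bidirectional_annotations` and all the numbers are invented for illustration.

```python
import math

def run_rnn(xs, w_x, w_h):
    """Toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_annotations(xs, fwd_params, bwd_params):
    """Pair a left-to-right state with a right-to-left state at each position."""
    forward = run_rnn(xs, *fwd_params)               # reads x_1 ... x_Tx
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]  # reads x_Tx ... x_1, then re-aligned
    return [[f, b] for f, b in zip(forward, backward)]

xs = [0.5, -1.0, 0.25, 2.0]
annotations = bidirectional_annotations(xs, (0.8, 0.3), (0.6, -0.2))
print(len(annotations), len(annotations[0]))  # prints: 4 2
```

Each annotation combines what came before a position with what comes after it, which is why the encoder states are a useful thing for the decoder to look back at.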

The context vector c_i for decoding step i is calculated as the weighted sum of all encoder hidden states h_j. Note that each h_j encodes the input up to that step (and, with a bidirectional encoder, the input after it as well), so it stores the local context around position j. The learned weight alpha_ij decides how important that hidden state is for the decoder when it predicts the next output word.
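As a small worked example of the weighted sum, the sketch below computes one context vector from three encoder hidden states. The weights are picked by hand here purely for illustration (in the real model they are learned), and `context_vector` is a hypothetical helper name.

```python
def context_vector(weights, hidden_states):
    """c_i = sum over j of alpha_ij * h_j, for a single decoding step i."""
    dim = len(hidden_states[0])
    return [
        sum(a * h[k] for a, h in zip(weights, hidden_states))
        for k in range(dim)
    ]

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1, h_2, h_3
alphas = [0.5, 0.25, 0.25]                            # hand-picked weights, sum to 1

c_i = context_vector(alphas, hidden_states)
print(c_i)  # prints: [0.75, 0.5]
```

Because the weights sum to 1, the context vector is a convex combination of the hidden states: putting all the weight on one position would simply copy that position's hidden state.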

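The weights are none other than a softmax over the alignment scores (energies) e_ij:

alpha_ij = exp(e_ij) / sum_k exp(e_ik), with k running over all encoder positions 1, …, Tx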

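The function that produces the scores, e_ij = a(s_{i-1}, h_j), is called the alignment model in the paper. It scores how well the inputs around position j and the output at position i match. The score is based on the decoder RNN hidden state s_{i-1} (just before emitting y_i) and the j-th encoder hidden state h_j. In this model the alignment is learned by a feedforward network that is trained jointly with the rest of the system, and it is not treated as a latent variable.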
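Here is a minimal sketch of an alignment model followed by the softmax. It assumes the additive form e_ij = v_a · tanh(W_a s_{i-1} + U_a h_j), which is one common way to write a small feedforward scorer. The parameter values, the sizes, and the helper names `matvec`, `alignment_score` and `softmax` are all made up for illustration, and nothing here is trained.

```python
import math

def matvec(M, v):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def alignment_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j), a one-hidden-layer scorer."""
    pre = [p + q for p, q in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * math.tanh(z) for v, z in zip(v_a, pre))

def softmax(scores):
    """Turn energies into weights that are positive and sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

s_prev = [0.1, -0.3]                                   # decoder state s_{i-1}
annotations = [[0.5, 0.2], [-0.4, 0.9], [0.0, -0.7]]   # encoder states h_1..h_3
W_a = [[0.2, -0.1], [0.4, 0.3]]                        # toy parameters
U_a = [[0.5, 0.1], [-0.3, 0.8]]
v_a = [1.0, -1.0]

energies = [alignment_score(s_prev, h_j, W_a, U_a, v_a) for h_j in annotations]
alphas = softmax(energies)
print([round(a, 3) for a in alphas])
```

The encoder position with the highest energy gets the largest weight, but every position keeps a non-zero share, so the whole computation stays differentiable and can be trained with ordinary backpropagation.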

The probability alpha_ij, or its associated energy e_ij, reflects the importance of the encoder hidden state h_j with respect to the previous decoder hidden state s_{i-1} in deciding the next state s_i and generating y_i. Intuitively, this acts as an attention mechanism in the decoder. The decoder learns to pay attention to a subset of the encoder hidden states at each time step, which relieves the encoder from the burden of storing all the information of the input sequence in a fixed-length vector.

Instead, the information can be spread across the sequence of annotations (encoder hidden states), which the decoder can adaptively select or ignore while decoding.