Recurrent Neural Networks have recently been the state-of-the-art methods for various problems where the available data is sequential in nature. Adding attention to these networks allows the model to focus not only on the current hidden state but also to take the previous hidden states into account, based on the decoder’s previous output. There are various ways of implementing attention models. One such way is given in the PyTorch tutorial, which calculates the attention to be given to each input based on the decoder’s hidden state and the embedding of the previously outputted word. This article will introduce these mechanisms briefly and then demonstrate a different way of implementing attention that does not limit the number of input samples taken into consideration when calculating attention.
Long Short Term Memory (LSTM):
Vanilla Recurrent Neural Networks fail to capture long-term dependencies in various applications like language translation. LSTMs were therefore proposed to capture these long-term dependencies. They have a memory cell which stores such long-term dependencies, and the hidden states are updated based on the update gates. The equations that govern the functioning of an LSTM make its working clearer.
Some conventions used:
x<ᵗ>: input at time step ‘t’
a<ᵗ>: hidden activation at time step ‘t’
c<ᵗ>: memory cell at time step ‘t’
‘W’: trainable weights used for each operation
‘b’: trainable biases used for each operation
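With these conventions, the standard LSTM equations (written out here in the same notation, since the original figure is not reproduced) are:

c̃<ᵗ> = tanh(Wc [a<ᵗ⁻¹>, x<ᵗ>] + bc)
Γᵤ = σ(Wᵤ [a<ᵗ⁻¹>, x<ᵗ>] + bᵤ)
Γf = σ(Wf [a<ᵗ⁻¹>, x<ᵗ>] + bf)
Γₒ = σ(Wₒ [a<ᵗ⁻¹>, x<ᵗ>] + bₒ)
c<ᵗ> = Γᵤ * c̃<ᵗ> + Γf * c<ᵗ⁻¹>
a<ᵗ> = Γₒ * tanh(c<ᵗ>)

where σ is the sigmoid function and * denotes element-wise multiplication.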
Here c̃<ᵗ> is the candidate value for updating the memory cell at time step ‘t’. It is calculated from the activation at the previous time step and the input at the current time step. Γf and Γᵤ are two gates that determine whether values from the previous memory cell are to be carried over or taken from the candidate values generated in the first equation. This helps the model update values based on captured long-range dependencies. Note that the activation applied to the gates is a sigmoid, so their values stay very close to 0 or 1. If the value of a gate is 1, the corresponding value is carried forward into the current memory cell; otherwise it is not taken into consideration. In accordance with these gates, the current memory cell is updated by carrying out element-wise multiplication with the respective vectors. Finally, the output is calculated as the element-wise multiplication of Γₒ, the output gate, with the ‘tanh’ of the memory cell.
Generally, sequence-to-sequence modeling tasks are modeled as two simple networks called the encoder and the decoder. The encoder takes the words as input and passes them through a recurrent neural network such as an LSTM or GRU. The decoder uses the resulting hidden state, takes as input a token such as <SOS> (Start Of Sentence), and outputs one token at a time, with each output fed back as input for generating the next token. Such models are used for various tasks like language translation, where the encoder takes in words in one language and the decoder outputs words in the desired language until the <EOS> (End Of Sentence) token is outputted. See Fig  for the diagrammatic representation of these models.
However, this model suffers when we try to use it for very long sequences, because only the hidden state at the final step of the encoder is available for recovering the entire sequence of words at the decoder. To avoid this problem, we use attention mechanisms, which allow us to incorporate the hidden state at each input step into the current hidden state by assigning each an importance, or ‘attention’. This method is quite intuitive: while translating into a particular language, a human translator tends to focus on certain words when predicting the next word, rather than on the entire sentence. Fig  shows the overview of the attention mechanism. Note that in Fig  we use a bidirectional LSTM; when we use bidirectional LSTMs, we concatenate the outputs of the two LSTMs.
The attention is calculated in the following way:
A weight e<ᵗ,ᵗ’> is calculated for each encoder hidden state a<ᵗ’> with respect to the decoder’s hidden state at time instant ‘t-1’, with the help of a small neural network.
These weights are then normalized using a softmax over the values e<ᵗ,ᵗ’> obtained from each of the input hidden states. The resulting attention weights α<ᵗ,ᵗ’> signify how much we need to ‘attend’ to the word at index t’ when predicting the tᵗʰ word. The input to the decoder LSTM is then the concatenation of the weighted sum of all activations based on the attention weights (α<ᵗ,ᵗ’> * a<ᵗ’>) and the embedding of the previously outputted word.
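As a toy illustration of these two steps (the sizes here are arbitrary, and ‘score’ stands in for the small scoring network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, hidden = 3, 4                  # arbitrary toy sizes
a = torch.randn(seq_len, hidden)        # encoder activations a<t'>
s_prev = torch.randn(hidden)            # decoder hidden state at time t-1
score = nn.Linear(2 * hidden, 1)        # the small scoring network

# e<t,t'>: one score per encoder activation, conditioned on s_prev
e = torch.stack([score(torch.cat((s_prev, a[t]))) for t in range(seq_len)])
alpha = F.softmax(e.squeeze(1), dim=0)  # attention weights; they sum to 1
context = (alpha.unsqueeze(1) * a).sum(dim=0)  # weighted sum of activations
```

The ‘context’ vector is what gets concatenated with the previous word’s embedding before entering the decoder LSTM.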
Diving into the code:
Some imports that we require to write the network.
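The import block itself is not reproduced here; for the classes below, the standard PyTorch imports should suffice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
```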
This class is the Encoder for the attention network and is similar to a vanilla encoder. In the ‘__init__’ function we just store the parameters and create an LSTM layer. In the forward function, we pass the input through the LSTM with the provided hidden state. The ‘init_hidden’ function is to be called before passing a sentence through the LSTM, to initialize the hidden state. Note that the hidden state has to be a tuple of two vectors, as LSTMs carry two vectors, the hidden activation and the memory cell, in contrast with the GRUs used in the PyTorch tutorial. The first dimension of the hidden state is 2 for a bidirectional LSTM (a bidirectional LSTM is really two LSTMs, one of which reads the words in forward order while the other takes them in reverse), the second dimension is the batch size, which we take here to be 1, and the last one is the desired output size. Note that I haven’t added any embedding, for simplicity of the code.
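The original listing is not shown in this copy; a minimal sketch matching the description (class and parameter names are illustrative) could look like this:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Plain LSTM encoder; no embedding layer, for simplicity."""
    def __init__(self, input_size, hidden_size, bidirectional=True):
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional
        self.lstm = nn.LSTM(input_size, hidden_size,
                            bidirectional=bidirectional)

    def forward(self, inputs, hidden):
        # inputs: a single time-step vector of size input_size
        output, hidden = self.lstm(inputs.view(1, 1, self.input_size), hidden)
        return output, hidden

    def init_hidden(self):
        # LSTMs need two vectors: the hidden activation and the memory cell
        num_directions = 2 if self.bidirectional else 1
        return (torch.zeros(num_directions, 1, self.hidden_size),
                torch.zeros(num_directions, 1, self.hidden_size))
```

The batch size is fixed to 1 throughout, matching the setup described above.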
This class is the attention-based decoder mentioned earlier. The ‘attn’ layer is used to calculate the value of e<ᵗ,ᵗ’>; this is the small neural network mentioned above. It calculates the importance of each word by using the previous decoder hidden state and the hidden state of the encoder at that particular time step. The ‘lstm’ layer takes as input the concatenation of the vector obtained by the attention-weighted sum and the previously outputted word. The final layer maps the output feature space to the size of the vocabulary, and also adds some non-linearity while outputting the word. The ‘init_hidden’ function is used in the same way as in the encoder.
The forward function of the decoder takes the decoder’s previous hidden state, the encoder outputs and the previously outputted word. The ‘weights’ list is used to store the attention weights. As we need an attention weight for each encoder output, we iterate through them, concatenate each with the decoder’s previous hidden state, pass the result through the ‘attn’ layer, and store it in the ‘weights’ list. Once we have these weights, we normalize them into the range (0,1) by applying a softmax activation. To calculate the weighted sum, we use batch matrix multiplication to multiply the attention vector of size (1, 1, len(encoder_outputs)) with the encoder_outputs of size (1, len(encoder_outputs), hidden_size), obtaining a vector of size hidden_size that is the weighted sum. We pass the concatenation of the obtained vector and the previously outputted word through the decoder LSTM, along with the previous hidden states. The output of this LSTM is passed through the linear layer and mapped to the vocabulary length to output actual words. We take the argmax of this vector to obtain the word (this last step should be done in the main function).
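Again, the original listing is omitted here; the sketch below follows the description above, with names and exact tensor layouts being my assumptions (each encoder output is taken as a (1, hidden_size) tensor, and the previous word as a (1, 1, vocab_size) vector):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, vocab_size):
        super(AttentionDecoder, self).__init__()
        self.hidden_size = hidden_size  # size of each encoder output
        self.output_size = output_size  # size of the decoder LSTM
        # small network scoring one encoder output against the decoder state
        self.attn = nn.Linear(hidden_size + output_size, 1)
        # input: [weighted sum of encoder outputs ; previous word vector]
        self.lstm = nn.LSTM(hidden_size + vocab_size, output_size)
        # map the LSTM output into the vocabulary
        self.final = nn.Linear(output_size, vocab_size)

    def init_hidden(self):
        return (torch.zeros(1, 1, self.output_size),
                torch.zeros(1, 1, self.output_size))

    def forward(self, decoder_hidden, encoder_outputs, input):
        # score every encoder output against the previous decoder state
        weights = []
        for enc_out in encoder_outputs:  # each enc_out: (1, hidden_size)
            weights.append(self.attn(torch.cat((decoder_hidden[0][0],
                                                enc_out), dim=1)))
        # normalize the scores e<t,t'> into attention weights
        normalized_weights = F.softmax(torch.cat(weights, 1), dim=1)
        # weighted sum: (1,1,len) x (1,len,hidden_size) -> (1,1,hidden_size)
        attn_applied = torch.bmm(
            normalized_weights.unsqueeze(1),
            torch.cat(encoder_outputs).view(1, -1, self.hidden_size))
        # concatenate the context vector with the previous word
        input_lstm = torch.cat((attn_applied[0], input[0]), dim=1)
        output, hidden = self.lstm(input_lstm.unsqueeze(0), decoder_hidden)
        output = self.final(output[0])
        return output, hidden, normalized_weights
```

Returning the attention weights alongside the output makes it easy to inspect what the decoder is ‘attending’ to, as the test below does.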
For the sake of testing the code, let’s create an encoder ‘c’ with an input size of 10 and an output size of 20, and make this LSTM bidirectional. We pass a random vector of size 10, along with the initial hidden state, to get the output and the new hidden state as ‘a’ and ‘b’. Note that a.shape gives a tensor of size (1,1,40): as the LSTM is bidirectional, two hidden states are obtained, which PyTorch concatenates into the eventual hidden state, which explains the third dimension being 40 instead of 20. Also, the hidden state ‘b’ is a tuple of two vectors, i.e. the activation and the memory cell. The shape of each of these vectors is (2,1,20) (the first dimension is 2 due to the bidirectional nature of the LSTM).
Now we create an attention-based decoder with hidden size = 40 if the encoder is bidirectional (else 20, since, as we saw, the outputs of a bidirectional LSTM are concatenated), 25 as the LSTM output size and 30 as the vocabulary size. We pass a vector [a, a], i.e. two identical encoder outputs, through the decoder, just for the sake of understanding. Also, assume that the <SOS> token is all zeros. We see that the shapes of the hidden state and the output are (1,1,25), while the weights are (0.5, 0.5), as we pass the same vector ‘a’ through the network twice.
Thus, using this model, we can take text datasets and use them for sequence-to-sequence modeling.
I would like to thank the Intel® Student Ambassador Program for AI, which provided me with the necessary training resources on the Intel® AI DevCloud and the technical support that helped me to use DevCloud.