- Sequence-to-Sequence Models
- Problem with Seq2Seq Models
- Need for Attention
- Custom Keras Attention Layer — Code Example
- Update with TensorFlow 2.0
In this article, we will try to understand the basic intuition behind the attention mechanism and why it came into the picture. We aim to understand how encoder-decoder models work and how attention helps them achieve better results.
We will also see how to build a custom attention layer with Keras, and how to use the default attention layer that ships with TensorFlow 2.0.
2. Sequence-to-Sequence Models
A sequence-to-sequence model aims to map a fixed-length input to a fixed-length output, where the length of the input and the length of the output may differ from each other.
Example: “You are reading this article” in English is “Vous lisez cet article” in French where
Input length = 5 words
Output length = 4 words
Here a plain LSTM/GRU, which produces one output per input step, cannot map each English word to a French word one-to-one, hence we use a sequence-to-sequence model to address problems like this one.
The encoder-decoder architecture for recurrent neural networks (RNNs) is a powerful type of sequence-to-sequence model. These models are used a lot in the field of natural language processing, with use cases like machine translation, image captioning and text summarisation.
A sequence-to-sequence model has two components, an encoder and a decoder. The encoder encodes the source sentence into a concise vector (called the context vector), and the decoder takes in the context vector as input and computes the translation from that encoded representation.
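As a rough illustration, a minimal Keras encoder-decoder could look like the sketch below (the names num_encoder_tokens, num_decoder_tokens and latent_dim are illustrative placeholders, not values from this article):

from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Illustrative sizes; assumptions, not values from the article
num_encoder_tokens, num_decoder_tokens, latent_dim = 1000, 1000, 256

# Encoder: reads the source sequence and keeps only its final states (the "context")
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: starts from the encoder states and produces the target sequence
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs = LSTM(latent_dim, return_sequences=True)(decoder_inputs,
                                                          initial_state=[state_h, state_c])
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Notice that everything the decoder knows about the source sentence has to fit into the final encoder states; this is exactly the bottleneck discussed next.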
3. Problem with Seq2Seq Models
A problem with these models is that performance decays as the length of the input sentence increases. The main reasons are:
- The word to be predicted depends on the context of the input sentence and not on a single word. So, basically, to predict a word in French we might make use of 2–3 words in the English sentence. That is how humans translate one language to another.
- Another limitation is that with longer sentences we have to compress all the information of the input sentence into a fixed-length vector, even though not all words in the sentence are important for predicting the correct word.
Now, as the length of the input sentence increases, our LSTM/GRU gradually loses the context of the long sentence, thereby losing the meaning of the whole sentence and eventually producing poor translations.
4. Need for Attention
To tackle the above-mentioned limitations, the attention mechanism was introduced.
The whole idea of attention is that, instead of relying just on the context vector, the decoder can have access to all the past states of the encoder. At each decoding step, the decoder gets to look at any particular state of the encoder.
The attention mechanism tries to identify which parts of the input sequence are relevant to each word in the output and uses that relevant information to select the appropriate output.
Here, we make use of bi-directional GRU cells, where the input sequence is passed in both the forward and the backward direction. The outputs are then concatenated and passed on to the decoder.
In order to tackle this limitation, we use a weighted sum of a selected number of past encoder states. Two things have to be determined:
1. The number of past states necessary
2. The weights for the selected past states
Since both can be learned with back-prop, we can treat attention as a layer which fits between the encoder and the decoder, as formalised below.
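Concretely (this is the standard additive-attention formulation, stated here for completeness rather than taken from the original article), at decoding step $t$ the context vector is a weighted sum of the encoder states $h_i$, with the weights given by a softmax over learned scores:

$$c_t = \sum_i \alpha_{t,i}\, h_i, \qquad \alpha_{t,i} = \frac{\exp\big(\mathrm{score}(s_{t-1}, h_i)\big)}{\sum_j \exp\big(\mathrm{score}(s_{t-1}, h_j)\big)}$$

where $s_{t-1}$ is the previous decoder state and $\mathrm{score}(\cdot,\cdot)$ is a small feed-forward network learned jointly with the rest of the model.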
5. Custom Keras Attention Layer — Code Example
Below is a code snippet that was used to create an attention layer for one of my problems.
from keras.layers import Layer
import keras.backend as K

The Attention class subclasses Layer and contains:
- the weight matrices used for creating the context vector,
- a call() function which does the computation; the scores it produces are passed through a softmax layer to calculate the attention probabilities and the context vector,
- a compute_output_shape() method, for Keras' internal compatibility checking,
- a get_config() method, which collects the input shape and other information about the layer.
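The body of the class did not survive in the original snippet, so here is a minimal sketch of such a layer, assuming a single learned scoring vector W and bias b over the timesteps (the variable names and the exact scoring function are illustrative, not the article's original code):

class Attention(Layer):
    def __init__(self, step_dim, **kwargs):
        self.step_dim = step_dim      # number of timesteps, e.g. MAX_SEQUENCE_LENGTH
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Weights used for creating the context vector
        self.features_dim = input_shape[-1]
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer='glorot_uniform', name='att_W')
        self.b = self.add_weight(shape=(self.step_dim,),
                                 initializer='zeros', name='att_b')
        super(Attention, self).build(input_shape)

    def call(self, x, mask=None):
        # Score each encoder state, normalise the scores with a softmax,
        # and return the weighted sum of the states (the context vector)
        eij = K.reshape(K.dot(K.reshape(x, (-1, self.features_dim)),
                              K.reshape(self.W, (self.features_dim, 1))),
                        (-1, self.step_dim))
        eij = K.tanh(eij + self.b)
        a = K.exp(eij)
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        weighted_input = x * K.expand_dims(a)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # For Keras' internal compatibility checking
        return input_shape[0], self.features_dim

    def get_config(self):
        # Collects the configuration so the layer/model can be saved and reloaded
        config = {'step_dim': self.step_dim}
        base_config = super(Attention, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

This layer is then used in the model below.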
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU,
                          Conv1D, GlobalAveragePooling1D, GlobalMaxPooling1D,
                          Dense, concatenate)

input_text_bgru = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='float32')
# The embedding size was truncated in the original snippet; EMBEDDING_DIM is assumed
# to be defined elsewhere (along with any pretrained embedding weights).
embedding_layer_bgru = Embedding(len(tokenizer.word_index) + 1, EMBEDDING_DIM)
g = embedding_layer_bgru(input_text_bgru)
g = SpatialDropout1D(0.4)(g)
g = Bidirectional(GRU(64, return_sequences=True))(g)
att = Attention(MAX_SEQUENCE_LENGTH)(g)
g = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="he_uniform")(g)
avg_pool1 = GlobalAveragePooling1D()(g)
max_pool1 = GlobalMaxPooling1D()(g)
g = concatenate([att, avg_pool1, max_pool1])
g = Dense(128, activation='relu')(g)
bgru_output = Dense(2, activation='softmax')(g)
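To actually train this, the tensors above still need to be wrapped into a Model and compiled; a typical way to finish the snippet (this part is assumed, not from the original article) is:

from keras.models import Model

model = Model(inputs=input_text_bgru, outputs=bgru_output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()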
6. Update with TensorFlow 2.0
With TensorFlow 2.0, Attention has been added as a built-in layer and can now be used directly without defining it explicitly:
query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])  # the layer takes a list [query, value]
This makes it easier to implement and becomes less cumbersome for Machine learning developers while designing complex architecture.
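A minimal end-to-end sketch of how this built-in layer can be wired up, loosely following the example in the TensorFlow documentation (the tensor names and sizes below are illustrative assumptions):

import tensorflow as tf

# Illustrative sizes; assumptions, not values from the article
vocab_size, embed_dim, seq_len = 10000, 64, 100

query_input = tf.keras.Input(shape=(seq_len,), dtype='int32')
value_input = tf.keras.Input(shape=(seq_len,), dtype='int32')

embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
query_seq_encoding = tf.keras.layers.GRU(64, return_sequences=True)(embedding(query_input))
value_seq_encoding = tf.keras.layers.GRU(64, return_sequences=True)(embedding(value_input))

# Built-in dot-product (Luong-style) attention over the two encoded sequences
query_value_attention_seq = tf.keras.layers.Attention()(
    [query_seq_encoding, value_seq_encoding])

# Pool the attended sequence into a single vector for a downstream classifier
query_value_attention = tf.keras.layers.GlobalAveragePooling1D()(query_value_attention_seq)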
The attention mechanism doesn't limit itself to machine translation. It is also used in image captioning, where we use visual attention with the help of a CNN to obtain the feature maps.
Attention has uses beyond what we mentioned in this article. Hopefully you now have a general overview of the problem the attention mechanism is trying to solve. In this article we implemented a basic attention mechanism in Seq2Seq models with RNNs. However, Transformer models like Google's BERT and XLNet, which make use of the self-attention mechanism, are major advancements and are currently state-of-the-art in the field of NLP.
*** Thank you all for reading this article. Your suggestions are very much appreciated! ***