Intuitive Understanding of Seq2seq model & Attention Mechanism in Deep Learning

Ajay Jangid · Analytics Vidhya · Sep 12, 2019

In this article, I will give you an explanation of the sequence-to-sequence model, which has recently seen great demand in applications like machine translation, image captioning, video captioning, speech recognition, etc.

Table of Contents:

  1. What is the seq2seq model?
  2. Why do we need the seq2seq model?
  3. How does it work?
  4. Limitations of the seq2seq model
  5. What is the attention mechanism?
  6. Why do we need it & how does it solve the problem?
  7. How does it work?
  8. Implementation with code
  9. Conclusion
  10. References

Seq2seq model:

Sequence to sequence (seq2seq) was first introduced by Google in 2014. So let’s go through our question: what is the seq2seq model? A sequence-to-sequence model maps an input sequence of text to an output sequence of text, where the lengths of the input and output may differ. Variants of recurrent neural networks, like Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), are the methods mostly used here, since they overcome the problem of vanishing gradients.

Let’s take an example to get a clear picture of what it is: a sentence in one language is translated into another language. There are many applications, such as the following:

  1. Speech Recognition
  2. Machine Translation
  3. Named entity/Subject extraction
  4. Relation Classification
  5. Path Query Answering
  6. Speech Generation
  7. Chatbot
  8. Text Summarization
  9. Product Sales Forecasting

We can divide our seq2seq (encoder-decoder) model into two phases:

  1. The training phase, i.e., the process inside the encoder and decoder.
  2. The inference phase, i.e., the process at test time.

So let’s dive into stage 1: the training phase.

Training Phase:

In the training phase, we set up the encoder and decoder models. After setting them up, we train the model to predict at every timestep, reading the input word by word or character by character.

Before training and testing, the data needs to be cleaned, and we need to add tokens that mark the start and end of each sentence, so that the model understands when to stop. Let’s look at the encoder architecture!

Encoder:

Let’s understand it through a diagram so that we get a clear view of the flow of the encoder part.

From the figure of the encoder as an LSTM network, we get the intuition that at each timestep a word is read and processed, and the contextual information from the input sequence passed to the encoder is captured at every timestep.

In the figure, we pass an example like “How are you <end>”, which is processed word by word. One important note: the encoder returns its final hidden state (h3) and cell state (c3) as the initial state for the decoder model.

Decoder:

The decoder is also an LSTM network; it reads the entire target sequence word by word and predicts the same sequence offset by one timestep.

The decoder model is trained to predict the next word in the sequence given the previous word. Let’s go through the figure for a better understanding of the decoder model’s flow:


<Go> and <end> are special tokens that are added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding a test sequence, so we start predicting the target sequence by passing just the first word into the decoder, which is always the <Go> token (or any start token you specify). The <end> token signals the end of the sentence to the model.

Inference Phase:

After training our encoder-decoder (seq2seq) model, it is tested on new, unseen input sequences for which the target sequence is unknown. So, in order to get a prediction for a given sentence, we need to set up the inference architecture to decode a test sequence:

So how does it work?

Here are the steps for decoding the test sequence:

  1. First, encode the entire input sequence and initialize the decoder with the internal states of the encoder.
  2. Pass the <Go> token (or whichever start token you specified) as the first input to the decoder.
  3. Run the decoder for one timestep with the internal states.
  4. The output is a probability distribution over the next word; the word with the maximum probability is selected.
  5. Pass the maximum-probability word as the input to the decoder at the next timestep, and update the internal states.
  6. Repeat steps 3-5 until we generate the <end> token or hit the maximum length of the target sequence (a minimal sketch of this loop follows).
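
As a compact sketch of this loop, with encode() and decode_step() as hypothetical stand-ins for the encoder and decoder models built in the code section later in this article:

# Minimal greedy-decoding sketch. encode() and decode_step() are hypothetical
# placeholders for the encoder and decoder models built later in this article.
def greedy_decode(input_seq, max_len=50):
    states = encode(input_seq)                      # step 1: encoder states
    token, output = '<Go>', []                      # step 2: start token
    for _ in range(max_len):                        # step 6: length cap
        probs, states = decode_step(token, states)  # steps 3 and 5
        token = max(probs, key=probs.get)           # step 4: argmax word
        if token == '<end>':                        # step 6: stop token
            break
        output.append(token)
    return ' '.join(output)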

Let’s go through a figure where, given an input, we can see how the inference process runs and how a sentence is encoded and decoded:

So why does the seq2seq model fail?

We have now understood the architecture and flow of the encoder-decoder model. It is very useful, but it has a limitation. As we saw, the encoder takes the input and converts it into a fixed-size vector, from which the decoder makes predictions and produces the output sequence. This works fine for short sequences, but it fails for long ones, because it becomes difficult for the encoder to memorize the entire sequence in a fixed-size vector and to compress all the contextual information of the sequence into it. As we observed, as the sequence length increases, the model’s performance starts to degrade.

So how to overcome this problem?

Attention:

To overcome this problem, we introduce the attention mechanism. Here, we give importance to specific parts of the sequence, instead of the entire sequence, to predict each word. Basically, with attention we don’t throw away the intermediate encoder states; we utilize all of them to generate the context vector, so that the decoder can produce the output.

Let’s go through an example to understand how the attention mechanism works:

Source sequence: “Which sport do you like the most?”

Target sequence: “I love cricket”

So the idea of attention is to utilize all the contextual information from the input sequence so that we can decode our target sequence. Let’s look at the example we mentioned. The first word ‘I’ in the target sequence is connected to the fourth word ‘you’ in the source sequence, right? Similarly, the second word ‘love’ in the target sequence is associated with the fifth word ‘like’ in the source sequence. We can observe that we are paying more attention to contextual words like ‘sport’, ‘like’, and ‘you’ in the source sequence.

So, instead of going through the entire sequence uniformly, the decoder pays attention to specific words in the sequence and gives its result based on them.


How does it work:

We have understood how attention works in theory; now it’s time to understand it technically. We will stick to our example, “Which sport do you like the most?”.

Step by step:

1. Computing a score for each encoder state:

We train a feed-forward network together with our encoder-decoder; recall that we initialize the decoder’s initial state with the encoder’s final state. By training this network, we generate high scores for the states that deserve attention, and we ignore the states whose scores are low.

Let s1, s2, s3, s4, s5, s6, and s7 be the scores generated for the states h1, h2, h3, h4, h5, h6, and h7 respectively. Since we assumed that we need to pay more attention to the states h2, h4, and h5 and ignore h1, h3, h6, and h7 in order to predict “I”, we expect the above network to generate scores such that s2, s4, and s5 are high while s1, s3, s6, and s7 are relatively low.

2. Computing attention weights:

After generating the scores, we apply a softmax on them to obtain the attention weights w1, w2, w3, w4, w5, w6, and w7. Since we know how softmax works, it yields a probability distribution: every weight lies in the range 0-1 and the weights sum to 1.

For example:

w1 = 0.23, w2 = 0.11, w3 = 0.13, ..., where all the weights sum to 1.

3. Computing context vector:

After computing the attention weights, we calculate the context vector, which will be used by the decoder to predict the next word in the sequence. Using the weights from the previous step:

ContextVector = w1 * h1 + w2 * h2 + w3 * h3 + w4 * h4 + w5 * h5 + w6 * h6 + w7 * h7

Clearly, if the values of w2 and w4 are high and those of w1, w3, w5, w6, and w7 are low, then the context vector will contain more information from the states h2 and h4 and relatively less information from the states h1, h3, h6, and h7.
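
To make steps 1 to 3 concrete, here is a small NumPy illustration with made-up scores (the real scores come from the trained network, not from hand-set values):

import numpy as np

h = np.random.randn(7, 256)                        # encoder states h1..h7
s = np.array([0.1, 2.0, 0.2, 1.8, 1.5, 0.3, 0.1])  # made-up scores s1..s7

w = np.exp(s) / np.exp(s).sum()                    # step 2: softmax -> weights
print(w.sum())                                     # 1.0, a valid distribution

context = (w[:, None] * h).sum(axis=0)             # step 3: weighted sum
print(context.shape)                               # (256,), one context vector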

4. Adding the context vector to the output of the previous timestep:

Here we simply combine the context vector with the <Go> token, since for the first timestep we don’t have a previous timestep’s output.

After this, the decoder generates the output word for that timestep, and in the same way we get a prediction for every word in the sequence. Once the decoder outputs <end>, we stop generating words.

Note:

  1. Unlike the seq2seq model, which uses one fixed-size vector for all decoder timesteps, the attention mechanism generates a context vector at every timestep.
  2. Thanks to this advantage of the attention mechanism, the performance of the model improves and we observe better results.

There are two different ways of doing attention; we will not go into depth here. The two classes depend on how much of the input sequence’s information you need to compress into the context vector:

  1. Global Attention
  2. Local Attention

Let’s dive into these.


Global Attention:

Here, attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector: we focus on all the intermediate states and collect all the contextual information so that our decoder model can predict the next word.

Local Attention:

Here, attention is placed on only a few source positions. Only a few hidden (intermediate) states of the encoder are considered for deriving the attended context vector, since we give importance only to a specific part of the sequence. A toy illustration of the difference follows.
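
Reusing the NumPy states h and weights w from the earlier sketch (the aligned position p and window size D here are made-up values):

# Global attention: every encoder state contributes to the context vector.
global_context = (w[:, None] * h).sum(axis=0)

# Local attention: only a window of states around an aligned position p.
p, D = 3, 1                                    # window covers h[p-D:p+D+1]
w_win = w[p-D:p+D+1] / w[p-D:p+D+1].sum()      # renormalise inside the window
local_context = (w_win[:, None] * h[p-D:p+D+1]).sum(axis=0)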

Explanation with Code:

Import the necessary packages:

#Packages
import pandas as pd
import re
import string
from string import digits
import numpy as np
from sklearn.utils import shuffle
from keras.layers import Input, LSTM, Embedding, Dense, Dropout, TimeDistributed
from keras.models import Model

Reading data:

data = pd.read_csv('./mar.txt', sep='\t', names=['eng','mar'])
data.head()
#Output
eng mar
0 Go. जा.
1 Run! पळ!
2 Run! धाव!
3 Run! पळा!
4 Run! धावा!

Pre-processing data:

# lowercase all the characters
data.eng = data.eng.apply(lambda x: x.lower())
data.mar = data.mar.apply(lambda x: x.lower())
# Remove quotes
data.eng = data.eng.apply(lambda x: re.sub("'", "", x))
data.mar = data.mar.apply(lambda x: re.sub("'", "", x))
# specifying the punctuation to remove
exclude = set(string.punctuation)
# Remove all special characters
data.eng = data.eng.apply(lambda x: "".join(ch for ch in x if ch not in exclude))
data.mar = data.mar.apply(lambda x: "".join(ch for ch in x if ch not in exclude))
# Remove all numbers from text
remove_digits = str.maketrans('', '', digits)
data.eng = data.eng.apply(lambda x: x.translate(remove_digits))
data.mar = data.mar.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))
# Remove extra spaces
data.eng = data.eng.apply(lambda x: x.strip())
data.mar = data.mar.apply(lambda x: x.strip())
data.eng = data.eng.apply(lambda x: re.sub(" +", " ", x))
data.mar = data.mar.apply(lambda x: re.sub(" +", " ", x))
# Add start and end tokens to target sequences
data.mar = data.mar.apply(lambda x: 'START_ ' + x + ' _END')

Building vocabulary:

# Vocabulary of English: store all the words in a set, and the same for the Marathi vocab
all_eng_words = set()
for eng in data.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)
# Vocabulary of Marathi
all_marathi_words = set()
for mar in data.mar:
    for word in mar.split():
        if word not in all_marathi_words:
            all_marathi_words.add(word)
# Max length of the source sequences, to fix the input size
length_list = []
for l in data.eng:
    length_list.append(len(l.split(' ')))
max_source_length = np.max(length_list)
max_source_length
#35
# Max length of the target sequences, to fix the target output size
length_list = []
for l in data.mar:
    length_list.append(len(l.split(' ')))
max_target_size = np.max(length_list)
max_target_size
#37
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_marathi_words))
# storing the vocab sizes for the encoder and decoder
num_of_encoder_tokens = len(all_eng_words)
num_of_decoder_tokens = len(all_marathi_words)
print("Encoder token size is {} and decoder token size is {}".format(num_of_encoder_tokens, num_of_decoder_tokens))
#output
Encoder token size is 8882 and decoder token size is 14689
# reserve index 0 for zero padding (the embeddings below use mask_zero=True)
num_of_decoder_tokens += 1
num_of_encoder_tokens += 1
num_of_decoder_tokens
#output
14690
# dictionary to look up an English word given its index: key is index, value is word
eng_index_to_char_dict = {}
# dictionary to look up the index of an English word: key is word, value is index
eng_char_to_index_dict = {}
# start at 1 so that index 0 stays free for padding
for key, value in enumerate(input_words, 1):
    eng_index_to_char_dict[key] = value
    eng_char_to_index_dict[value] = key
# similarly for the target, i.e. Marathi words
mar_index_to_char_dict = {}
mar_char_to_index_dict = {}
for key, value in enumerate(target_words, 1):
    mar_index_to_char_dict[key] = value
    mar_char_to_index_dict[value] = key

Splitting data into train and test:

# Splitting our data into train and test parts
from sklearn.model_selection import train_test_split

X, y = data.eng, data.mar
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
print("Training data size is {} and testing data size is {}".format(X_train.shape, X_test.shape))
#output
Training data size is (23607,) and testing data size is (10118,)

Training data in batches:

# We will not train on the whole data at a time; instead we train in batches,
# to reduce computation and improve the learning and performance of the model.
def generate_batch(X=X_train, y=y_train, batch_size=128):
    while True:
        for j in range(0, len(X), batch_size):
            # encoder input
            encoder_input_data = np.zeros((batch_size, max_source_length), dtype='float32')
            # decoder input
            decoder_input_data = np.zeros((batch_size, max_target_size), dtype='float32')
            # target
            decoder_target_data = np.zeros((batch_size, max_target_size, num_of_decoder_tokens), dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = eng_char_to_index_dict[word]  # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t < len(target_text.split()) - 1:
                        decoder_input_data[i, t] = mar_char_to_index_dict[word]  # decoder input seq
                    if t > 0:
                        # decoder target sequence (one-hot encoded);
                        # does not include the START_ token and is
                        # offset by one timestep (one step ahead)
                        decoder_target_data[i, t - 1, mar_char_to_index_dict[word]] = 1
            yield ([encoder_input_data, decoder_input_data], decoder_target_data)
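
To sanity-check the generator, we can pull one batch and inspect the shapes (the numbers follow from the default batch_size of 128 and the max lengths computed earlier):

(enc_in, dec_in), dec_target = next(generate_batch())
print(enc_in.shape)      # (128, 35): padded word indices for the encoder
print(dec_in.shape)      # (128, 37): padded word indices for the decoder
print(dec_target.shape)  # (128, 37, 14690): one-hot targets, offset by one step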

Defining Encoder model:

latent_dim = 256
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_of_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

Defining decoder model:

# Set up the decoder, using `encoder_states` as the initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_of_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
# A dense softmax layer applied at every timestep to generate a prob. dist.
# over the target vocabulary
decoder_dense = TimeDistributed(Dense(num_of_decoder_tokens, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
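
The model above is the plain seq2seq encoder-decoder that the rest of this article trains and decodes. As a hedged sketch of how attention could be bolted on top of it (assuming a Keras version that ships keras.layers.Attention; note the encoder must then return its full output sequence):

# A minimal sketch only; the training and inference code below uses the plain
# seq2seq model, not this attention variant.
from keras.layers import Attention, Concatenate

# an encoder that keeps the outputs of all timesteps, not just the final states
encoder_lstm_att = LSTM(latent_dim, return_sequences=True, return_state=True)
enc_seq, h, c = encoder_lstm_att(enc_emb)

dec_seq, _, _ = decoder_lstm(dec_emb, initial_state=[h, c])

# dot-product (Luong-style) attention: for every decoder timestep,
# compute a context vector over all encoder timesteps
context = Attention()([dec_seq, enc_seq])
dec_concat = Concatenate(axis=-1)([dec_seq, context])
attn_outputs = TimeDistributed(Dense(num_of_decoder_tokens, activation='softmax'))(dec_concat)
attn_model = Model([encoder_inputs, decoder_inputs], attn_outputs)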

Compiling model:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 64
epochs = 50

Training our model:

history=model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
steps_per_epoch = train_samples//batch_size,
epochs=epochs,
validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
validation_steps = val_samples//batch_size)

Saving our model:

# save the full model so it can be reloaded later for the unseen-query section,
# and the weights separately
model.save('./model.h5')
model.save_weights('./weights.h5')

Inference model:

# Inference model
# storing the encoder inputs and internal states, so we can hand them to the decoder part
encoder_model = Model(encoder_inputs, encoder_states)
# specifying hidden and cell state inputs for the decoder; at each step the
# predicted word is fed back in and the decoder states are updated
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
dec_emb2 = dec_emb_layer(decoder_inputs)  # Get the embeddings of the decoder sequence
# To predict the next word in the sequence, set the initial states to the
# states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
# A dense softmax layer to generate a prob. dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_outputs2)
# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

Decoding sequence:

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first position of the target sequence with the start token.
    target_seq[0, 0] = mar_char_to_index_dict['START_']
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = mar_index_to_char_dict[sampled_token_index]
        decoded_sentence += ' ' + sampled_char
        # Exit condition: either hit max length or find the stop token.
        if (sampled_char == '_END' or
                len(decoded_sentence) > 50):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        # Update states
        states_value = [h, c]
    return decoded_sentence

Testing on train data:

train_gen = generate_batch(X_train, y_train, batch_size=1)
k = -1
k += 1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence as per data:', X_train[k:k+1].values[0])
print('Actual Marathi Translation as per data:', y_train[k:k+1].values[0][6:-4])
print('Predicted Marathi Translation predicted by model:', decoded_sentence[:-4])
#output
Input English sentence as per data: i want something to drink.
Actual Marathi Translation as per data: मला काहीतरी प्यायला हवंय.
Predicted Marathi Translation predicted by model: मला काहीतरी प्यायला हवंय.

Testing on test data:

val_gen = generate_batch(X_test, y_test, batch_size=1)
k = -1
k += 1
(input_seq, actual_output), _ = next(val_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence as per data:', X_test[k:k+1].values[0])
print('Actual Marathi Translation as per data:', y_test[k:k+1].values[0][6:-4])
print('Predicted Marathi Translation predicted by model:', decoded_sentence[:-4])
#output
Input English sentence as per data: i dont speak ukrainian.
Actual Marathi Translation as per data: मी युक्रेनियन बोलत नाही.
Predicted Marathi Translation predicted by model: मी युक्रेनियन भाषा बोलत नाही.

For an unseen query:

Loading our model and model weights, and compiling it:

from keras.models import load_model
model = load_model('./model.h5')
model.load_weights('./weights.h5')
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# pre-processing step: apply the same cleaning as for the training data,
# then map each word to its index to build the encoder input
from string import digits
def pre_processing(sentence):
    sentence = sentence.lower()
    sentence = re.sub("'", "", sentence).strip()
    sentence = re.sub(" +", " ", sentence)
    remove_digits = str.maketrans('', '', digits)
    sentence = sentence.translate(remove_digits)
    sentence = "".join(ch for ch in sentence if ch not in exclude)
    encoder_input_data = np.zeros((1, max_source_length), dtype='float32')
    for t, word in enumerate(sentence.split()):
        # note: every word must appear in the training vocabulary
        encoder_input_data[0, t] = eng_char_to_index_dict[word]
    return encoder_input_data

Setting up the inference model:

The inference encoder (encoder_model) and decoder (decoder_model) here are exactly the ones we already defined in the "Inference model" section above, so we reuse them and only need the decoding function:
# decoding an unseen query:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = mar_char_to_index_dict['START_']
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = mar_index_to_char_dict[sampled_token_index]
        # stop on the end token or a runaway sentence
        if sampled_char == '_END' or len(decoded_sentence) > 50:
            break
        decoded_sentence += ' ' + sampled_char
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        # Update states
        states_value = [h, c]
    return decoded_sentence
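
With everything in place, translating a new sentence looks like this (note: any word not present in the training vocabulary will raise a KeyError on the index lookup):

print(decode_sequence(pre_processing('How are you?')))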

Conclusion:

  1. We understood the concept behind the seq2seq model, how it works, and its limitations.
  2. We also understood how the attention mechanism solves the seq2seq model's problem: by using attention, the model is capable of finding the mapping between the input sequence and the output sequence.
  3. The performance of the model can be increased by using a larger training dataset and a bidirectional LSTM for a better context vector.
  4. Use the beam-search strategy for decoding the test sequence instead of the greedy approach (argmax).
  5. Evaluate the performance of your model with the BLEU score.
  6. The one drawback of attention is that it is time-consuming. To overcome this problem, Google introduced the Transformer model, which we will see in upcoming blogs.

References:

  1. Applied AI Course: https://www.appliedaicourse.com/
  2. Andrew Ng's explanation of the attention model: https://www.coursera.org/lecture/nlp-sequence-models/attention-model-lSwVa
  3. Sutskever et al., Sequence to Sequence Learning with Neural Networks: https://arxiv.org/abs/1409.3215
  4. Keras sequence-to-sequence example: https://keras.io/examples/lstm_seq2seq/
  5. Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: https://arxiv.org/abs/1406.1078

Code:

Full code will be updated to my Github repo: here
