Neural Machine Translation using Bahdanau Attention Mechanism

Yash Marathe · Published in Analytics Vidhya · Apr 24, 2020 · 11 min read

Table of Contents

  1. Introduction
  2. Neural Machine Translation
  3. NMT using Seq2Seq Model without Attention
  4. Attention Mechanism
  5. Bahdanau Attention Mechanism
  6. Results
  7. References

Introduction

The concept of Machine Translation dates back to the 1930s, when Peter Troyanskii presented the first machine for selecting and printing words while translating from one language to another. Since then, there have been many developments in the field of Machine Translation. From the IBM 701 computer, which automatically translated 60 Russian sentences into English, to Google Translate, which can convert almost any sentence into any language, we have come a long way. Here is a picture of the evolution of Machine Translation, from Rule-Based Machine Translation to Neural Machine Translation, between 1950 and 2015.

Evolution of Machine Translation, 1950-2015 (Source: Page)

Neural Machine Translation (NMT)

Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders, in which an encoder encodes a source sentence into a fixed-length vector from which a decoder generates the translation.

NMT Using Sequence to Sequence Model without Attention

The Encoder-Decoder architecture with recurrent neural networks became an effective approach for Neural machine translation (NMT). The key benefits of the approach are the ability to train a single end-to-end model directly on the source and target sentences and the ability to handle variable-length input and output sequences of text.

Below is an illustration of NMT with an RNN based encoder-decoder architecture.

Encoder-Decoder Architecture (Source: Page)

Encoder

The Encoder reads the input sentence in the source language and encodes that information into vectors called hidden states. In this model, we keep only the hidden state of the last RNN cell and discard the per-timestep encoder outputs.
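As a minimal sketch of this idea (the sizes below are placeholders, not the hyperparameters used later in this post), a without-attention encoder keeps only the final state of its GRU:

import tensorflow as tf

# Minimal sketch of a Seq2Seq encoder without attention (placeholder sizes).
embedding = tf.keras.layers.Embedding(5000, 64)     # placeholder vocab size and embedding dim
gru = tf.keras.layers.GRU(128, return_state=True)   # no return_sequences: per-step outputs are dropped

source_batch = tf.random.uniform((16, 10), maxval=5000, dtype=tf.int32)  # (batch, seq_len)
_, final_state = gru(embedding(source_batch))
# final_state shape: (16, 128); this single vector is all the plain Seq2Seq decoder receives.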

Decoder

The Decoder takes the hidden state of the last Encoder RNN cell as the initial state of its first RNN cell, along with the <start> token as the initial input, and produces the output sequence. We use Teacher Forcing for faster and more efficient training of the decoder. Teacher forcing is a training method for recurrent neural network models in which the ground-truth token from the previous time step, rather than the decoder's own prediction, is fed as the input at the current time step. Because the decoder always sees the correct history during training, the model converges more quickly and stably.
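As a rough illustration (not the actual training loop used later in this post; decoder_step, target_tokens, and start_token are hypothetical placeholders), the only difference between teacher forcing and free-running decoding is which token is fed back at each step:

# Sketch only: decoder_step, target_tokens and start_token are hypothetical placeholders.
def decode(decoder_step, target_tokens, start_token, state, teacher_forcing=True):
    inp = start_token
    predictions = []
    for t in range(len(target_tokens)):
        pred, state = decoder_step(inp, state)
        predictions.append(pred)
        # Teacher forcing feeds the ground-truth token; otherwise the model's own prediction.
        inp = target_tokens[t] if teacher_forcing else pred
    return predictions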

You can read a detailed explanation of this model in this blog. For RNN variants such as LSTM and GRU, I suggest looking into Understanding LSTM Networks and Understanding GRU Networks.

What’s wrong with this model?

Disadvantages of the Seq2Seq Model (Source: Page)

Because the encoder's state is passed only to the first RNN cell of the decoder, the information from the encoder becomes less and less relevant with every time step. The output sequence depends heavily on the context packed into the hidden state of the encoder's final RNN cell, and this fixed-length summary is the apparent flaw of the system: it makes it difficult for the model to remember and deal with long sequences.

Attention Mechanism

The English word ‘Attention’ means notice taken of someone or something; the regarding of someone or something as interesting or important. The Attention Mechanism is based on exactly this concept of directing focus to the important parts of the input while predicting the output in Sequence to Sequence models.

Now, let’s have a look at this post.

Meme (Source: Page)

Most of us (excluding some bright minds) have fallen prey to this meme at some point in our lives. The meme is based on a simple psychological fact: when humans read a sentence, we interpret words in the context of the words around them rather than individually, i.e., we do not read strictly sequentially but focus on two or more words together while reading. (Source)

In Sequence to Sequence models without attention, we process and predict the sentence sequentially. However, in NMT it is highly likely that the correct translation of a word depends on words before or after the corresponding word in the source sentence.

The illustration below shows how the prediction of a word can depend on two or more words in the sentence. In the following gif, thin links contribute less to the prediction of a word, while thick links contribute more. We can observe that most of the predicted words in the target sequence depend on words both before and after the corresponding word in the source sequence.

Attention Mechanism (Source — Page)

So, I guess we have now understood why we need Attention.

Now, let’s understand the Bahdanau Attention mechanism.

We will implement the code alongside each step. You can download the dataset from this link. The dataset contains 336,614 Italian-English sentence pairs.

The data preparation steps can be found on this Page.

Bahdanau Attention Mechanism

Bahdanau Attention Mechanism (Source-Page)

Bahdanau Attention is also known as Additive attention as it performs a linear combination of encoder states and the decoder states. Now, let’s understand the mechanism suggested by Bahdanau.

Pseudocode:

Notations:
FC = Fully connected (dense) layer
EO = Encoder output
H = Hidden state
X = Input to the decoder

* score = FC(tanh(FC(EO) + FC(H)))
* attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by default, but here we want to apply it on the 1st axis: the tensor inside the tanh has shape (batch_size, max_length, hidden_size) and score has shape (batch_size, max_length, 1), where max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on the max_length axis (axis 1).
* context vector = sum(attention weights * EO, axis = 1). Axis 1 is chosen for the same reason as above.
* embedding output = the input to the decoder X passed through an embedding layer.
* merged vector = concat(embedding output, context vector)

This merged vector is then given to the GRU. (Source - Page)
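To make the axis choices concrete, here is a small shape walk-through with dummy tensors; the sizes (batch_size = 16, max_length = 10, hidden_size = 256) and the 256 attention units are arbitrary placeholders:

import tensorflow as tf

batch_size, max_length, hidden_size = 16, 10, 256                # placeholder sizes
EO = tf.random.normal((batch_size, max_length, hidden_size))     # encoder output
H = tf.random.normal((batch_size, hidden_size))                  # decoder hidden state

FC1 = tf.keras.layers.Dense(256)   # the two inner fully connected layers
FC2 = tf.keras.layers.Dense(256)
V = tf.keras.layers.Dense(1)       # the outer layer producing one score per source position

# score = FC(tanh(FC(EO) + FC(H))); H is broadcast over the time axis.
score = V(tf.nn.tanh(FC1(EO) + FC2(tf.expand_dims(H, 1))))       # (16, 10, 1)
attention_weights = tf.nn.softmax(score, axis=1)                 # one weight per input position
context_vector = tf.reduce_sum(attention_weights * EO, axis=1)   # (16, 256)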

Step 1: Generating the Encoder Hidden States

We can use any variant of RNN, such as LSTM or GRU, to encode the input sequence. Each RNN cell produces a hidden state for each input token. Unlike the plain Sequence to Sequence model, we keep all the hidden states produced by all RNN units and pass them to the next step.

The Encoder can be built in TensorFlow using the following code.

import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=True keeps every per-timestep hidden state for attention
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))
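As a quick sanity check of the Encoder's output shapes (the sizes below are placeholders; the real encoder is built with the vocabulary size, embedding dimension, units, and batch size obtained during data preparation):

# Placeholder hyperparameters, used only for this shape check.
sample_encoder = Encoder(vocab_size=8000, embedding_dim=256, enc_units=1024, batch_sz=64)
sample_input = tf.random.uniform((64, 20), maxval=8000, dtype=tf.int32)  # (batch, max_length)
sample_hidden = sample_encoder.initialize_hidden_state()
sample_output, sample_hidden = sample_encoder(sample_input, sample_hidden)
# sample_output shape: (64, 20, 1024) -- one hidden state per source position
# sample_hidden shape: (64, 1024)     -- the final hidden state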

If we look at the diagram of the Bahdanau Attention mechanism above, we can see that all the encoder hidden states, along with the decoder hidden state, are used to generate the context vector.

Step 2: Calculating the Alignment vector

Score function for Bahdanau Attention

Now, we have to calculate the alignment scores. A score is calculated between the previous decoder hidden state and each of the encoder's hidden states. The scores for all encoder hidden states are collected into a single vector, which is then softmax-ed. The resulting alignment vector has the same length as the source sequence, and each of its values is the score (or probability) of the corresponding word in the source sequence. The alignment vector puts weights on the encoder's outputs; with those weights, the Decoder decides what to focus on at each time step.
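Written out in the notation of the Bahdanau et al. (2015) paper, with s_{t-1} denoting the previous decoder hidden state and h_i the i-th encoder hidden state, the alignment score and the attention weights are:

e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})}

where T_x is the length of the source sentence and v_a, W_a, U_a are learned parameters (the three Dense layers in the code below).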

Step 3: Calculating the Context vector

Equations for Bahdanau Attention

The encoder hidden states and their respective alignment scores (the attention weights in the above equation) are multiplied and summed to form the context vector. The context vector is used to compute the final output of the decoder.
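In the same notation, the context vector for decoder time step t is the attention-weighted sum of the encoder hidden states:

c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i

The BahdanauAttention layer below implements both the score computation and this weighted sum.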

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: previous decoder hidden state, shape (batch_size, hidden_size)
        # values: encoder output, shape (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)
        # score shape: (batch_size, max_length, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # attention_weights shape: (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after the sum: (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
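A quick shape check for the attention layer, reusing the placeholder sample_hidden and sample_output tensors from the Encoder sanity check above:

attention_layer = BahdanauAttention(units=10)   # placeholder number of attention units
context_vector, attention_weights = attention_layer(sample_hidden, sample_output)
# context_vector shape:    (64, 1024)  -- attention-weighted sum of encoder states
# attention_weights shape: (64, 20, 1) -- one weight per source position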

Step 4: Decoding the output

The context vector obtained in the previous step is concatenated with the embedded previous decoder output and fed into the Decoder RNN cell to produce a new hidden state. This process then repeats from Step 2 for every decoder time step, until an ‘<end>’ token is produced or the output exceeds the specified maximum length. The final output for a time step is obtained by passing the new hidden state through a Linear (Dense) layer, which acts as a classifier and gives the probability scores of the next predicted word.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # context_vector shape: (batch_size, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after embedding: (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # concatenate the context vector with the embedded input token
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        # logits over the target vocabulary: (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights
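A corresponding shape check for the Decoder, again with placeholder sizes (the real target vocabulary size comes from the tokenizer built during data preparation):

sample_decoder = Decoder(vocab_size=6000, embedding_dim=256, dec_units=1024, batch_sz=64)
sample_dec_input = tf.random.uniform((64, 1), maxval=6000, dtype=tf.int32)  # one target token per example
sample_logits, sample_dec_hidden, _ = sample_decoder(sample_dec_input, sample_hidden, sample_output)
# sample_logits shape: (64, 6000) -- a score for every word in the target vocabulary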

Step 5: Training the Encoder-Decoder Model

First, we will define an optimizer and a loss function.

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
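The mask in loss_function ensures that padded positions (token id 0) contribute nothing to the loss. A tiny illustration with made-up values:

# Tiny illustration of the padding mask (made-up values).
real = tf.constant([4, 2, 0])        # the third position is padding (id 0)
pred = tf.random.normal((3, 6000))   # fake logits over a hypothetical 6000-word vocabulary
print(loss_function(real, pred))     # the loss at the padded position is zeroed out before averaging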

Now, for training, we will implement the following.

Pass the input and the initial hidden state through the Encoder, which returns the Encoder output sequence and the Encoder hidden state. The Encoder hidden state, Encoder output, and Decoder input are passed to the Decoder. At the first time step, the Decoder takes ‘<start>’ as its input. The Decoder returns the predicted word and the Decoder hidden state as output. We use teacher forcing, passing the actual target word to the Decoder at each time step. Finally, we compute the gradients via backpropagation and apply them with the optimizer. (Source: Page)

def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss

Training with multiple epochs.

import time

EPOCHS = 4

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 1000 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

The model was trained on a Tesla K80 GPU provided by Google Colab. It took about 2,400 seconds to train 4 epochs, and the loss at the end of the 4th epoch was 0.0837.

Training of the Encoder-Decoder model

Step 6: Predictions

In this phase, we don’t use Teacher Forcing. Instead, we pass the predicted word from the previous time step as input to the decoder. We also store the attention weights so that they can be used to draw the attention plot.

For evaluation, we first preprocess the sentence and tokenize it with the tokenizer object created during data preparation. After padding the sequence and converting it to a tensor, we initialize the hidden state to zeros and pass it along with the input tensor to the Encoder. The Encoder hidden state and the ‘<start>’ token are then passed to the Decoder. Using the Decoder input, hidden state, and context vector, we find the predicted_id with the maximum probability, and we also store the attention weights. The predicted_id is converted to a word and appended to the result string. This continues until the ‘<end>’ tag is encountered or the maximum target sequence length is reached.

Function to evaluate a sentence.

import numpy as np

def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)
    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)
    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1,))
        attention_plot[t] = attention_weights.numpy()
        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '
        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)
    return result, sentence, attention_plot

Function for plotting the attention weights.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    fontdict = {'fontsize': 14}
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()

Function to Translate a sentence

def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))
    return result
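The translate function can then be called on any sentence whose words appear in the training vocabulary, for example (a hypothetical Italian input, not necessarily the exact test sentence used below):

# Hypothetical example call; every word must have been seen during training.
result = translate(u'se non mi credi , vai a vederlo tu stesso .')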

Results

Prediction from the trained Model:

Predicted Result from the Encoder-Decoder Model

Google Translate Result:

Google Translation from Italian to English of the test sentence.

Most of the words predicted by the model are correct, except that ‘it’ is predicted as ‘him’ and ‘me’ is predicted as ‘you’.

Attention Plot:

Attention Plot

The yellow and green shades indicate higher attention weights on the corresponding source words when predicting a given word of the target sequence.

Bleu Score:

The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence. A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0.

The BLEU score can be calculated as follows.

import nltk.translate.bleu_score as bleu
reference = 'if you don t believe me , go and see it for yourself .'
# sentence_bleu expects tokenized inputs: a list of reference token lists and a hypothesis token list
print('BLEU score: {}'.format(bleu.sentence_bleu([reference.split()], result.split())))

Output

Bleu Score

References

  1. https://blog.floydhub.com/attention-mechanism/
  2. https://www.tensorflow.org/tutorials/text/nmt_with_attention#translate
  3. https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/
  4. https://towardsdatascience.com/implementing-neural-machine-translation-with-attention-using-tensorflow-fc9c6f26155f
  5. https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a

👏 Your 3 claps mean the world to me! If you found value in this article, a simple tap on those clapping hands would make my day.

🚀 Please consider following for more tech-related content.

🌟 If you found this blog helpful and would like to stay updated with more content, feel free to connect with me on LinkedIn.
