# SEQ2SEQ LEARNING

# PART F: Encoder-Decoder with Bahdanau & Luong Attention Mechanism

Welcome to **Part F** of the Seq2Seq Learning Tutorial Series. In this tutorial, we will design an Encoder-Decoder model to handle **longer** input and output sequences by using two **global attention mechanisms: Bahdanau & Luong**.

- First, we will observe that the Basic Encoder-Decoder model will fail to handle long input sequences.
- Then, we will discuss how to relate each output with all the inputs using the **global attention** mechanism.
- We will implement the **Bahdanau** attention mechanism as a custom layer in Keras by using **subclassing**.
- Then, we will integrate the **attention** layer to the Encoder-Decoder model to efficiently process the longer data.
- After observing the effect of the attention layer on performance, we will **depict** the attention between inputs and outputs.
- Lastly, we will code the Luong attention.

During the tutorial, we will be using the Encoder-Decoder model developed in **Part C**.

If you would like to follow all the Seq2Seq tutorials, please **subscribe** to my YouTube Channel or **follow** my blog on Medium. Do not forget to turn on **notifications** so that you will be notified when **new parts are uploaded**.

You can access **the whole code on Colab**.

If you are ready, let’s get started!

# References:

- A ten-minute introduction to sequence-to-sequence learning in Keras by Francois Chollet
- How to Develop an Encoder-Decoder Model with Attention in Keras by Jason Brownlee


# SEQ2SEQ LEARNING SERIES:

- This is **Part F** of the Seq2Seq Learning series. **Please check out the previous parts**, *especially Part C*, to *refresh* the necessary background knowledge to follow this part with ease.

You can access all my **SEQ2SEQ Learning** videos on **Murat Karakaya Akademi Youtube channel** in ENGLISH or TURKISH

You can access my blog on Medium

**Part A: AN INTRODUCTION TO SEQ2SEQ LEARNING AND A SAMPLE SOLUTION WITH MLP NETWORK**

- YouTube Videos in ENGLISH or TURKISH / Medium Post / Colab Notebook

**Part B: SEQ2SEQ LEARNING WITH RECURRENT NEURAL NETWORKS (LSTM)**

- YouTube Video in ENGLISH or TURKISH / Medium Post / Colab Notebook

**Part C: SEQ2SEQ LEARNING WITH A BASIC ENCODER DECODER MODEL**

- YouTube Video in ENGLISH or TURKISH/ Medium Post / Colab Notebook

**Part D: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL WITH TEACHER FORCING**

- YouTube Video in ENGLISH or TURKISH / Medium Post / Colab Notebook

**Part E: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL WITH TEACHER FORCING FOR VARIABLE INPUT AND OUTPUT SIZE: MASKING & PADDING**

- YouTube Video in ENGLISH or TURKISH / Medium Post / Colab Notebook

**Part F: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL + BAHDANAU & LUONG ATTENTION**

- YouTube Video in ENGLISH or TURKISH/ Medium Post / Colab Notebook

You can watch this notebook on Murat Karakaya Akademi channel for details:

# WHY DO WE HAVE SO MANY PARTS?

- We aim to code an Encoder-Decoder Model with Attention.
- However, I would like to develop the solution by showing the shortcomings of other possible approaches.
- Therefore, in the first 2 parts, we will observe that initial models have their own weaknesses.
- We also understand why the Encoder-Decoder paradigm is so successful.

**So, please patiently follow the parts as we develop a better solution :)**

# Sample Problem:

In a real-life scenario:

- We are given 2 parallel data sets X and y such that X[i] and y[i] have some relationship
- For instance: we are given the **same book’s text** in English (X) and Turkish (y).
- Thus the **statement X[i]** in English is translated into Turkish as **y[i] statement**
- We use the parallel data set to train a **seq2seq** model which would learn how to **convert/transform X[i] to y[i]**

Below, we will develop such an encoder-decoder model for fixed-size input and output sequences

The sample problem is to **find the reverse of a given sequence**

**Given sequence** X

X=[3, 2, 9, 1]

**Output sequence** (y) is the **reversed** input sequence (X)

y=[1, 9, 2, 3]

Configure the sample problem:

    #@title Configure problem
    n_timesteps_in = 4   # each input sample has 4 values
    n_features = 10      # each value is one-hot encoded with 10 0/1 values
    train_size = 2000
    test_size = 200

For each input sequence (X), we select 4 random numbers between 1 and 9 (0 is reserved).

A sample X:

    X=[2, 9, 9, 2]

The reversed input sequence (X) is the output sequence (y). This particular sample happens to be a palindrome, so y is identical to X:

    y=[2, 9, 9, 2]

Each input and output sequence is converted to one-hot encoded format in 10 dimensions:

    X=[[0 0 1 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 1]
     [0 0 0 0 0 0 0 0 0 1]
     [0 0 1 0 0 0 0 0 0 0]]
    y=[[0 0 1 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 1]
     [0 0 0 0 0 0 0 0 0 1]
     [0 0 1 0 0 0 0 0 0 0]]

The generated sequence datasets have the shape (batch_size, time_steps, features):

    X_train.shape: (2000, 4, 10) y_train.shape: (2000, 4, 10)
    X_test.shape: (200, 4, 10) y_test.shape: (200, 4, 10)

time: 79.7 ms
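The data preparation described above can be sketched roughly as follows. This is only an illustration under assumptions: `one_hot_encode` and `one_hot_decode` are names used later in this tutorial, but their bodies here are my own guess, and `get_reversed_pairs` is a hypothetical helper that does not come from the notebook.

```python
import numpy as np

n_timesteps_in = 4   # length of each sequence
n_features = 10      # size of the one-hot vocabulary (0 is reserved)

def one_hot_encode(sequence, n_features):
    """Turn a list of integers into a (len(sequence), n_features) 0/1 array."""
    encoded = np.zeros((len(sequence), n_features), dtype=int)
    encoded[np.arange(len(sequence)), sequence] = 1
    return encoded

def one_hot_decode(encoded_seq):
    """Recover the integers by taking the argmax of each row."""
    return [int(np.argmax(vector)) for vector in encoded_seq]

def get_reversed_pairs(n_samples, rng):
    """Hypothetical helper: build X (random sequences) and y (X reversed)."""
    X = np.zeros((n_samples, n_timesteps_in, n_features), dtype=int)
    y = np.zeros_like(X)
    for i in range(n_samples):
        # Upper bound is exclusive, so the values are in 1..9.
        seq = rng.integers(1, n_features, size=n_timesteps_in).tolist()
        X[i] = one_hot_encode(seq, n_features)
        y[i] = one_hot_encode(seq[::-1], n_features)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = get_reversed_pairs(2000, rng)
print(X_train.shape, y_train.shape)
print(one_hot_decode(X_train[0]), one_hot_decode(y_train[0]))
```

Because 0 is never drawn, the first one-hot position stays free and can serve as the start symbol for the decoder.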

# QUICK LSTM REMINDER

- An LSTM layer accepts a series of inputs. Here, the input [1, 2, 3, 4] has four time steps.
- The LSTM layer processes the input step by step:
  - Time step 1: input 1 is processed and 3 outputs are generated: 2 hidden states (with the same values) and 1 cell state.
  - One hidden state is output and can be used for prediction or to feed another LSTM layer.
  - The other hidden state and the cell state are used for the next time step.
- At the end of the time steps (here 4), the LSTM layer outputs:
  - Each time step’s hidden states
  - Last time step’s hidden state
  - Last time step’s cell state
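To make these three outputs concrete, here is a small from-scratch LSTM forward pass in NumPy. It is a sketch with random weights, not the Keras implementation: the gate ordering, the weight names `W`, `U`, `b`, and the function name `lstm_forward` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b):
    """Run a single LSTM layer over `inputs` of shape (time_steps, n_in).

    W: (n_in, 4*units), U: (units, 4*units), b: (4*units,).
    Returns (all hidden states, last hidden state, last cell state).
    """
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    all_h = []
    for x_t in inputs:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:units])               # input gate
        f = sigmoid(z[units:2 * units])      # forget gate
        g = np.tanh(z[2 * units:3 * units])  # candidate cell values
        o = sigmoid(z[3 * units:])           # output gate
        c = f * c + i * g                    # new cell state
        h = o * np.tanh(c)                   # new hidden state
        all_h.append(h)
    return np.stack(all_h), h, c

rng = np.random.default_rng(0)
n_in, units = 1, 3
x = np.array([[1.0], [2.0], [3.0], [4.0]])   # the input [1, 2, 3, 4]
W = rng.normal(size=(n_in, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

all_hidden, last_hidden, last_cell = lstm_forward(x, W, U, b)
print(all_hidden.shape, last_hidden.shape, last_cell.shape)
```

In Keras terms, `all_hidden` corresponds to what an LSTM layer returns with `return_sequences=True`, and the last two values correspond to the extra outputs of `return_state=True`. The last row of `all_hidden` and `last_hidden` are the same vector, which is the "2 hidden states (with the same values)" mentioned above.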

You can think of the above figure as **a simple Encoder**:

- The Encoder processes the input and provides the **context vector** (*last time step’s hidden state + last time step’s cell state*) for the decoder. *Let’s see the details below.*

# A BASIC LSTM-BASED ENCODER & DECODER MODEL

**Encoder** receives `encoder input data` and

- processes it
- outputs its *last hidden state + last cell state* as the **context vector**
- transfers this **context vector** to the decoder

**Decoder**:

- Decoder’s **initial states** (hidden state + cell state) are the context vector sent by the encoder
- Decoder’s **initial input** is a special symbol such as ‘**START**’

Then, Decoder *runs* **in a loop**. At each time step, Decoder:

- consumes the **input** and **states**
- **outputs** its **last hidden state**, **last hidden state (yes again!)**, **last cell state**
- uses last hidden state as **the next input** for itself
- uses last hidden state + last cell state as **the next states** for itself
- uses last hidden state for the **prediction** for the current time step

For details about Encoder-Decoder Model and LSTM, you can check my Youtube Playlists:

- All About LSTM
- Seq2Seq Learning Tutorials
- LSTM hakkında herşey!
- Seq2Seq Öğrenme: Adım Adım Python ve Keras ile Uygulama

## Let’s review the code

You can match the above figure with the below code.

## Here is the complete code:

    # Imports assumed by the code in this tutorial
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import backend as K
    from tensorflow.keras.layers import Input, LSTM, Dense, Lambda
    from tensorflow.keras.models import Model

    #@title latentSpaceDimension is the dimension of each state vector
    latentSpaceDimension = 16

    def create_hard_coded_decoder_input_model(batch_size):
        # The first part is the encoder
        encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
        encoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='encoder_lstm')
        encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

        # The initial context vector is the states of the encoder
        states = [state_h, state_c]

        # Set up the decoder layers
        decoder_inputs = Input(shape=(1, n_features))
        decoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='decoder_lstm')
        decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

        all_outputs = []

        # Prepare decoder input data that just contains the start character 0.
        # Note that we made it a constant one-hot-encoded in the model,
        # that is, [1 0 0 0 0 0 0 0 0 0] is the initial input for each loop.
        decoder_input_data = np.zeros((batch_size, 1, n_features))
        decoder_input_data[:, 0, 0] = 1
        inputs = decoder_input_data

        # The decoder will only process one time step at a time.
        for _ in range(n_timesteps_in):
            # Run the decoder on one time step
            outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)
            outputs = decoder_dense(outputs)
            # Store the current prediction (we will concatenate all predictions later)
            all_outputs.append(outputs)
            # Reinject the outputs as inputs for the next loop iteration
            # as well as update the states
            inputs = outputs
            states = [state_h, state_c]

        # Concatenate all predictions such as [batch_size, timesteps, features]
        decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

        # Define and compile the model
        model = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
        model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
        return model

- Create and compile the model

      batch_size = 10
      model_encoder_decoder = create_hard_coded_decoder_input_model(batch_size=batch_size)
      #model_encoder_decoder.summary()

## Train model

Actually, you can train the model with a simple fit method as below.

    model_encoder_decoder.fit(X_train, y_train,
                              batch_size=batch_size,
                              epochs=30,
                              validation_split=0.2)

However, I will use my train function which implements Early Stopping monitoring Validation Accuracy for comparison reasons.
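The `train_test` function itself is not listed in this post. The stopping rule it relies on (monitor validation accuracy, stop once it has not improved for `patience` epochs) can be sketched in plain Python as below. The function name `early_stopping_epoch` and the accuracy values are made up for illustration, and this is not the Keras `EarlyStopping` code.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where validation accuracy has not
    improved for `patience` consecutive epochs; if that never happens,
    it stops at the last epoch.
    """
    best_acc = float("-inf")
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_accuracies), best_epoch

# Made-up validation accuracies, one per epoch.
history = [0.30, 0.55, 0.80, 0.95, 0.94, 0.95, 0.93, 0.94, 0.95, 0.92]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=5)
print(stop_epoch, best_epoch)
```

Under this rule, a run with `patience=5` that stops at epoch 36 last improved at epoch 31, so the epoch at which training stops and the epoch of the best score are not the same number.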

    train_test(model_encoder_decoder, X_train, y_train, X_test, y_test, batch_size=batch_size, epochs=40, patience=5, verbose=1)

    training for 40 epochs begins with EarlyStopping(monitor= val_accuracy, patience= 5 )....
    Epoch 1/40
    180/180 [==============================] - 10s 14ms/step - loss: 2.2497 - accuracy: 0.2251 - val_loss: 1.9826 - val_accuracy: 0.3275
    ***
    ***
    Epoch 35/40
    180/180 [==============================] - 1s 6ms/step - loss: 0.0177 - accuracy: 1.0000 - val_loss: 0.0333 - val_accuracy: 0.9975
    Epoch 36/40
    180/180 [==============================] - 1s 6ms/step - loss: 0.0142 - accuracy: 1.0000 - val_loss: 0.0294 - val_accuracy: 0.9950
    Epoch 00036: early stopping
    40 epoch training finished...

    PREDICTION ACCURACY (%):
    Train: 99.950, Test: 100.000

    10 examples from test data...
    Input          Expected       Predicted      T/F
    [2, 3, 1, 1]   [1, 1, 3, 2]   [1, 1, 3, 2]   True
    [7, 9, 1, 6]   [6, 1, 9, 7]   [6, 1, 9, 7]   True
    [2, 9, 3, 8]   [8, 3, 9, 2]   [8, 3, 9, 2]   True
    [7, 7, 9, 9]   [9, 9, 7, 7]   [9, 9, 7, 7]   True
    [7, 1, 1, 7]   [7, 1, 1, 7]   [7, 1, 1, 7]   True
    [8, 4, 6, 9]   [9, 6, 4, 8]   [9, 6, 4, 8]   True
    [2, 5, 9, 9]   [9, 9, 5, 2]   [9, 9, 5, 2]   True
    [2, 2, 4, 5]   [5, 4, 2, 2]   [5, 4, 2, 2]   True
    [8, 7, 5, 7]   [7, 5, 7, 8]   [7, 5, 7, 8]   True
    [4, 9, 9, 1]   [1, 9, 9, 4]   [1, 9, 9, 4]   True
    Accuracy: 1.0
    time: 47.6 s

## Observations

- When the sequence size (`n_timesteps_in`) is 4, the Encoder-Decoder model terminates at **Epoch 31 with 99% accuracy**.

# ATTENTION MECHANISM

## Why?

According to the inventors of the attention mechanism, in “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio:

- “One of the **motivations** behind the proposed approach (**attention mechanism**) was the use of a fixed-length context vector in the basic encoder-decoder approach. We conjectured that this limitation may make the basic encoder-decoder approach to **underperform** with *long sentences*.”

We can check the validity of these arguments by increasing the sequence size (`n_timesteps_in`) to **16**.

- Remember that when the sequence size (`n_timesteps_in`) is **4**, the Encoder-Decoder model terminates at **Epoch 31 with 99% accuracy**.
- However, when the sequence size (`n_timesteps_in`) is **16**, the Encoder-Decoder model runs all the 40 epochs and finishes **with only 36% accuracy!**

That is, as argued, the Encoder-Decoder model **underperforms** with long sequences.

## How does it work?

- “The proposed approach provides an intuitive way to inspect the (soft-)alignment **between** the *words in a generated* translation and *those in a source* sentence”.

To understand how the attention mechanism works, **first** **compare** the Encoder-Decoder Model we coded above with an Encoder-Decoder model including the attention mechanism in the figures.

**Note that:**

- In the above figure, the Encoder-Decoder Model we have coded **uses**:
  - **only** the encoder’s **last** hidden and cell states
  - the encoder’s states as an initial context vector, **only once**
- In the below figure, the Encoder-Decoder model with the attention mechanism:
  - uses **not only** the encoder’s last hidden and cell states **but also** the encoder’s hidden states generated at **all the time steps**
  - uses all the encoder’s hidden states at **all consecutive decoding time steps**

**Basically**:

- First, we initialize the Decoder states by using the last states of the Encoder as usual
- Then at **each decoding time step**:
  - We use **Encoder’s all hidden states** and **the previous Decoder’s output** to calculate a **Context Vector** by applying *an Attention Mechanism*
  - Lastly, we **concatenate** the **Context Vector** with **the previous Decoder’s output** to create *the input* to the decoder.

I will provide a more detailed explanation about the model after discussing and implementing Bahdanau attention.

## Attention: How to calculate Context Vector

According to “Effective Approaches to Attention-based Neural Machine Translation” by Minh-Thang Luong, Hieu Pham, Christopher D. Manning, the attention mechanism above is called “**Global Attention**”:

“The idea of a global attentional model is to consider **all the hidden states of the encoder $h_{s}$** when deriving the **context vector $c_{t}$**”

That is, we **attend** to **all the encoder hidden states** for generating each decoder’s output as follows:

## Notation

**h_{s}**: all the hidden states of the encoder

**h_{t}**: previous hidden states of the decoder (previous time step output)

**c_{t}**: context vector

**W**: Weight matrix for parametrizing the calculations

## Calculate a **score** to relate the **Encoder’s all hidden states** and **the previous Decoder’s output**

There are many different scores proposed by researchers. The most important ones are:
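For reference, the score functions defined in the two papers cited in this tutorial can be written as follows, using the notation above ($h_{t}$: the decoder hidden state, $h_{s}$: one encoder hidden state). The parameter symbols $W_{a}$, $W_{1}$, $W_{2}$ and $v_{a}$ stand for learned weights; the exact symbol names are a notational assumption on my part.

$$
\text{score}(h_{t}, h_{s}) =
\begin{cases}
h_{t}^{\top} h_{s} & \text{dot (Luong)} \\
h_{t}^{\top} W_{a} h_{s} & \text{general (Luong)} \\
v_{a}^{\top} \tanh\left(W_{a} [h_{t}; h_{s}]\right) & \text{concat (Luong)} \\
v_{a}^{\top} \tanh\left(W_{1} h_{t} + W_{2} h_{s}\right) & \text{additive (Bahdanau)}
\end{cases}
$$

The additive form is the one implemented in the Bahdanau layer in this tutorial, and the dot form is the one used in the Luong bonus section.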

You can think of these scores as the level of relationship between the **Encoder’s all hidden states** and **the previous Decoder’s output**.

We use $W$ matrices to parametrize the calculations. That is, we will learn the weight values during training via backpropagation. The model will learn how to calculate better scores.

$tanh$ is the activation function of a single hidden layer here.

$v$ is a further learned layer that maps the hidden layer’s output to a single score value (a `Dense(1)` layer in the code below).

As a result, we expect that these layers ($W$, $tanh$, $v$) will learn how to calculate a suitable score during training.
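As a rough NumPy sketch of the score step for a single sample, assuming random (untrained) weights and made-up sizes, the additive and dot-product scores could be computed like this. The names `W1`, `W2` and `v` mirror the symbols above; biases are left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, hidden, units = 4, 16, 8

h_s = rng.normal(size=(max_len, hidden))  # all encoder hidden states
h_t = rng.normal(size=(hidden,))          # previous decoder hidden state

# Learned parameters (random here, only to show the shapes).
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=(units,))

# Bahdanau (additive) score: v^T tanh(W1 h_t + W2 h_s), one score per encoder step.
additive_score = np.tanh(h_t @ W1 + h_s @ W2) @ v

# Luong (dot-product) score: h_t^T h_s, no extra parameters.
dot_score = h_s @ h_t

print(additive_score.shape, dot_score.shape)
```

Either way the result is one number per encoder time step, which is what the normalization step expects.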

## Calculate the **Attention Weights** by normalizing the scores.

These are the weights for each encoder hidden state $h_{s}$.

Simply, we can use `softmax()` to calculate the probability distribution.
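A minimal softmax sketch in NumPy, with made-up scores, shows how raw scores turn into attention weights. The function below is an illustrative stand-in, not the TensorFlow implementation.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # made-up scores for 4 encoder steps
attention_weights = softmax(scores)
print(attention_weights, attention_weights.sum())
```

The weights are all positive and sum to 1, and the encoder step with the highest score receives the largest weight.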

## Calculate the **Context Vector** by applying the **Attention Weights** onto **encoder hidden states $h_{s}$**.

Thus, we will have the weighted sum of the encoder hidden states $h_{s}$ at the end.

After calculating the context vector, we can **concatenate** it with the previous decoder output **to generate the input for the next decoder step**.
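With small made-up numbers, the weighted sum and the concatenation can be traced by hand. This sketch assumes 3 encoder steps, a hidden size of 2 and a 3-dimensional one-hot decoder output, none of which are the sizes used in the model.

```python
import numpy as np

# Made-up values: 3 encoder steps, hidden size 2.
h_s = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
attention_weights = np.array([0.5, 0.25, 0.25])   # must sum to 1

# Weighted sum of the encoder hidden states:
# 0.5*[1, 0] + 0.25*[0, 1] + 0.25*[1, 1] = [0.75, 0.5]
context_vector = (attention_weights[:, None] * h_s).sum(axis=0)

# Concatenate with the previous decoder output to build the decoder input.
previous_output = np.array([0.0, 1.0, 0.0])       # e.g. a one-hot prediction
decoder_input = np.concatenate([context_vector, previous_output])
print(context_vector, decoder_input.shape)
```

The decoder input is therefore wider than the raw one-hot vector: its size is the hidden size plus the number of features.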

## Let’s code Bahdanau Attention Layer

First, I would like to share with you the core code snippet:

I borrowed the code below from the official TensorFlow website and added comments to relate the above formulas to the code.

Please **pay attention** to each tensor’s **dimensions**. That is really important for **understanding how it all works together**!

    class BahdanauAttention(tf.keras.layers.Layer):
        def __init__(self, units, verbose=0):
            super(BahdanauAttention, self).__init__()
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)
            self.verbose = verbose

        def call(self, query, values):
            if self.verbose:
                print('\n******* Bahdanau Attention STARTS******')
                print('query (decoder hidden state): (batch_size, hidden size) ', query.shape)
                print('values (encoder all hidden state): (batch_size, max_len, hidden size) ', values.shape)

            # query hidden state shape == (batch_size, hidden size)
            # query_with_time_axis shape == (batch_size, 1, hidden size)
            # values shape == (batch_size, max_len, hidden size)
            # we are doing this to broadcast addition along the time axis to calculate the score
            query_with_time_axis = tf.expand_dims(query, 1)

            if self.verbose:
                print('query_with_time_axis:(batch_size, 1, hidden size) ', query_with_time_axis.shape)

            # score shape == (batch_size, max_length, 1)
            # we get 1 at the last axis because we are applying score to self.V
            # the shape of the tensor before applying self.V is (batch_size, max_length, units)
            score = self.V(tf.nn.tanh(
                self.W1(query_with_time_axis) + self.W2(values)))
            if self.verbose:
                print('score: (batch_size, max_length, 1) ', score.shape)

            # attention_weights shape == (batch_size, max_length, 1)
            attention_weights = tf.nn.softmax(score, axis=1)
            if self.verbose:
                print('attention_weights: (batch_size, max_length, 1) ', attention_weights.shape)

            # context_vector shape after sum == (batch_size, hidden_size)
            context_vector = attention_weights * values
            if self.verbose:
                print('context_vector before reduce_sum: (batch_size, max_length, hidden_size) ', context_vector.shape)
            context_vector = tf.reduce_sum(context_vector, axis=1)
            if self.verbose:
                print('context_vector after reduce_sum: (batch_size, hidden_size) ', context_vector.shape)
                print('\n******* Bahdanau Attention ENDS******')

            return context_vector, attention_weights

`time: 26.4 ms`
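To follow the tensor dimensions without running TensorFlow, the same computation can be traced in NumPy for a whole batch. This is only a shape walk-through under assumptions: the weights are random, the sizes are made up, and the bias terms that the `Dense` layers add are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_len, hidden, units = 2, 4, 16, 8

query = rng.normal(size=(batch_size, hidden))             # decoder hidden state
values = rng.normal(size=(batch_size, max_len, hidden))   # encoder hidden states
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
V = rng.normal(size=(units, 1))

query_with_time_axis = query[:, None, :]                   # (batch, 1, hidden)
# (batch, 1, units) + (batch, max_len, units) broadcasts over the time axis.
summed = query_with_time_axis @ W1 + values @ W2           # (batch, max_len, units)
score = np.tanh(summed) @ V                                # (batch, max_len, 1)

# Softmax over axis=1, the time axis, so the weights sum to 1 per sample.
exp = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)   # (batch, max_len, 1)

# The trailing 1 broadcasts over the hidden axis; summing removes the time axis.
context_vector = (attention_weights * values).sum(axis=1)  # (batch, hidden)
print(score.shape, attention_weights.shape, context_vector.shape)
```

The key design point is the softmax axis: normalizing over axis 1 makes the weights compete across encoder time steps, not across the batch or the hidden units.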

# Integrate the attention layer into the Encoder-Decoder model

## In an Encoder-Decoder with **Attention** Layer set-up,

Encoder provides:

- the *initial states* by sending its **last hidden state + last cell state**
- the **context vector** by sending its **all hidden states**

The decoder needs **2 inputs** to **generate/predict an output**:

- an input tensor
- a state tensor

The decoder:

- initializes its state by consuming the *initial states*
- uses the **encoder’s last hidden state** as the *initial query* (in the code, `decoder_outputs = encoder_state_h`)
- calculates the *attention weights* using the **query** + **encoder’s all hidden states**
- applies the **attention** to **encoder’s all hidden states** and finds the *context vector*
- concatenates **context vector** + **START** to generate the **decoder input**
- then **runs in a loop:**
  - consumes the input and states
  - outputs its *last hidden state, last hidden state (yes again!), last cell state*
  - uses **last hidden state + last cell state** as the **next state**
  - uses **last hidden state** as the next **query**
  - calculates the *attention weights* using the **query** + **encoder’s all hidden states**
  - applies the attention to **encoder’s all hidden states** and finds the **context vector**
  - concatenates **context vector** + **the previous prediction** to generate the **decoder input**

    verbose = 0  # set to 1 to see all debug messages
    batch_size = 1

    if verbose:
        print('***** Model Hyper Parameters *******')
        print('latentSpaceDimension: ', latentSpaceDimension)
        print('batch_size: ', batch_size)
        print('sequence length: ', n_timesteps_in)
        print('n_features: ', n_features)
        print('\n***** TENSOR DIMENSIONS *******')

    # The first part is the encoder
    encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
    encoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='encoder_lstm')
    encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm(encoder_inputs)

    if verbose:
        print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
        print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_h.shape))
        print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_c.shape))

    # The initial context vector is the states of the encoder
    encoder_states = [encoder_state_h, encoder_state_c]
    if verbose:
        print(encoder_states)

    # Set up the attention layer
    attention = BahdanauAttention(latentSpaceDimension, verbose=verbose)

    # Set up the decoder layers
    decoder_inputs = Input(shape=(1, (n_features + latentSpaceDimension)), name='decoder_inputs')
    decoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='decoder_lstm')
    decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

    all_outputs = []

    # 1. Initial decoder's input data
    # Prepare initial decoder input data that just contains the start character
    # Note that we made it a constant one-hot-encoded in the model
    # that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop
    # one-hot encoded zero (0) is the start symbol
    inputs = np.zeros((batch_size, 1, n_features))
    inputs[:, 0, 0] = 1

    # 2. Initial decoder's state
    # encoder's last hidden state + last cell state
    decoder_outputs = encoder_state_h
    states = encoder_states
    if verbose:
        print('initial decoder inputs: ', inputs.shape)

    # The decoder will only process one time step at a time.
    for _ in range(n_timesteps_in):

        # 3. Pay attention
        # create the context vector by applying attention to
        # decoder_outputs (last hidden state) + encoder_outputs (all hidden states)
        context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)
        if verbose:
            print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
            print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
            print('decoder_outputs: (batch_size, latentSpaceDimension) ', decoder_outputs.shape)

        context_vector = tf.expand_dims(context_vector, 1)
        if verbose:
            print('Reshaped context_vector: ', context_vector.shape)

        # 4. Concatenate the input + context vector to find the next decoder's input
        inputs = tf.concat([context_vector, inputs], axis=-1)
        if verbose:
            print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

        # 5. Pass the concatenated vector to the LSTM
        # Run the decoder on one time step with attended input and previous states
        decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)
        #decoder_outputs = tf.reshape(decoder_outputs, (-1, decoder_outputs.shape[2]))
        outputs = decoder_dense(decoder_outputs)

        # 6. Use the last hidden state for predicting the output
        # save the current prediction
        # we will concatenate all predictions later
        outputs = tf.expand_dims(outputs, 1)
        all_outputs.append(outputs)

        # 7. Reinject the output (prediction) as inputs for the next loop iteration
        # as well as update the states
        inputs = outputs
        states = [state_h, state_c]

    # 8. After running the Decoder for max time steps
    # we have created a prediction list for the output sequence
    # convert the list to an output array by concatenating all predictions
    # such as [batch_size, timesteps, features]
    decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

    # 9. Define and compile the model
    model_encoder_decoder_Bahdanau_Attention = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
    model_encoder_decoder_Bahdanau_Attention.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

## TRAIN THE MODEL WITH ATTENTION

    train_test(model_encoder_decoder_Bahdanau_Attention, X_train, y_train, X_test, y_test, batch_size=batch_size, epochs=40, patience=3, verbose=1)

    training for 40 epochs begins with EarlyStopping(monitor= val_accuracy, patience= 3 )....
    Epoch 1/40
    1800/1800 [==============================] - 19s 7ms/step - loss: 2.0012 - accuracy: 0.2725 - val_loss: 1.4980 - val_accuracy: 0.4025
    ***
    Epoch 7/40
    1800/1800 [==============================] - 11s 6ms/step - loss: 5.6494e-04 - accuracy: 0.9999 - val_loss: 1.1552e-06 - val_accuracy: 1.0000
    Epoch 8/40
    1800/1800 [==============================] - 11s 6ms/step - loss: 7.0385e-05 - accuracy: 1.0000 - val_loss: 1.7178e-07 - val_accuracy: 1.0000
    Epoch 00008: early stopping
    40 epoch training finished...

    PREDICTION ACCURACY (%):
    Train: 100.000, Test: 100.000

    10 examples from test data...
    Input          Expected       Predicted      T/F
    [2, 3, 1, 1]   [1, 1, 3, 2]   [1, 1, 3, 2]   True
    [7, 9, 1, 6]   [6, 1, 9, 7]   [6, 1, 9, 7]   True
    [2, 9, 3, 8]   [8, 3, 9, 2]   [8, 3, 9, 2]   True
    [7, 7, 9, 9]   [9, 9, 7, 7]   [9, 9, 7, 7]   True
    [7, 1, 1, 7]   [7, 1, 1, 7]   [7, 1, 1, 7]   True
    [8, 4, 6, 9]   [9, 6, 4, 8]   [9, 6, 4, 8]   True
    [2, 5, 9, 9]   [9, 9, 5, 2]   [9, 9, 5, 2]   True
    [2, 2, 4, 5]   [5, 4, 2, 2]   [5, 4, 2, 2]   True
    [8, 7, 5, 7]   [7, 5, 7, 8]   [7, 5, 7, 8]   True
    [4, 9, 9, 1]   [1, 9, 9, 4]   [1, 9, 9, 4]   True
    Accuracy: 1.0
    time: 1min 38s

## Observations

When the sequence size (`n_timesteps_in`) is **4**:

- The Encoder-Decoder model terminates at **Epoch 31 with 99% accuracy**.
- The Encoder-Decoder Model with Attention terminates at **Epoch 9 with 100%**.

**However**, when the sequence size (`n_timesteps_in`) is **16**:

- The Encoder-Decoder model runs all the **40 epochs** and finishes with only **36%** accuracy!
- The Encoder-Decoder Model with Attention terminates at **Epoch 16 with 99%**.

We can conclude that the Encoder-Decoder model with Attention is much more scalable in terms of sequence length.

## PREDICT WITH THE TRAINED MODEL

    pred = model_encoder_decoder_Bahdanau_Attention.predict(X_test[0].reshape(1, n_timesteps_in, n_features), batch_size=1)
    print('input', one_hot_decode(X_test[0]))
    print('expected', one_hot_decode(y_test[0]))
    print('predicted', one_hot_decode(pred[0]))

    input [2, 3, 1, 1]
    expected [1, 1, 3, 2]
    predicted [1, 1, 3, 2]
    time: 46.1 ms

# BONUS 1: DEPICT THE ATTENTION

- For a sample input, we will record the attention values for each time step
- Then, we will use the color codes to depict the relation between inputs and outputs
- The lighter colors indicate higher values (attention)
- That is, the model pays more attention to some inputs when creating each output (prediction)
- The model learns where to look for to create the output!

    # Imports assumed by the plotting and evaluation code below
    from numpy import array
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker

    def evaluate(seq_in):
        attention_plot = np.zeros((n_timesteps_in, n_timesteps_in))
        print('attention_plot shape: (n_timesteps_in, n_timesteps_in) {}'.format(attention_plot.shape))

        #sequence = [7, 9, 8, 5]
        sequence = one_hot_encode(seq_in, n_features)
        encoder_inputs = array(sequence).reshape(1, n_timesteps_in, n_features)
        encoder_inputs = tf.convert_to_tensor(encoder_inputs, dtype=tf.float32)
        print('Encoder input shape: (batch size, sequence length, n_features) {}'.format(encoder_inputs.shape))

        # Reuse the trained encoder, attention and decoder layers
        encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
        print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
        print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(state_h.shape))
        print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(state_c.shape))

        # The initial context vector is the states of the encoder
        states = [state_h, state_c]

        # Set up the attention layer
        #attention= BahdanauAttention(latentSpaceDimension)
        # Set up the decoder layers
        #decoder_inputs = Input(shape=(1, (n_features+latentSpaceDimension)))
        #decoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='decoder_lstm')
        #decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

        all_outputs = []

        # INIT DECODER
        # Prepare decoder input data that just contains the start character 0
        # Note that we made it a constant one-hot-encoded in the model
        # that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop
        decoder_input_data = np.zeros((1, 1, n_features))
        decoder_input_data[:, 0, 0] = 1
        inputs = decoder_input_data

        # Initial hidden state
        decoder_outputs = state_h
        print('initial decoder inputs: ', inputs.shape)

        # The decoder will only process one time step at a time.
        for t in range(n_timesteps_in):
            # Pay attention
            context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)
            print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
            print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

            # Store the attention weights to plot later on
            attention_weights = tf.reshape(attention_weights, (-1, ))
            attention_plot[t] = attention_weights.numpy()

            decoder_outputs = tf.expand_dims(decoder_outputs, 1)
            print('decoder_outputs: (batch_size, 1, latentSpaceDimension) ', decoder_outputs.shape)

            context_vector = tf.expand_dims(context_vector, 1)
            print('Reshaped context_vector: ', context_vector.shape)
            inputs = tf.concat([context_vector, inputs], axis=-1)
            print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

            # Pass the concatenated vector to the LSTM
            # Run the decoder on one time step
            decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)
            #decoder_outputs = tf.reshape(decoder_outputs, (-1, decoder_outputs.shape[2]))
            outputs = decoder_dense(decoder_outputs)

            # Store the current prediction (we will concatenate all predictions later)
            outputs = tf.expand_dims(outputs, 1)
            all_outputs.append(outputs)

            # Reinject the outputs as inputs for the next loop iteration
            # as well as update the states
            inputs = outputs
            states = [state_h, state_c]

        # Concatenate all predictions such as [batch_size, timesteps, features]
        decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)
        seq_out = one_hot_decode(decoder_outputs[0])
        return seq_in, seq_out, attention_plot

    # Function for plotting the attention weights
    def plot_attention(attention, sequence, predicted_sequence):
        fig = plt.figure(figsize=(8, 8))
        ax = fig.add_subplot(1, 1, 1)
        ax.matshow(attention, cmap='viridis')

        fontdict = {'fontsize': 14}

        ax.set_xticklabels([''] + sequence, fontdict=fontdict, rotation=90)
        ax.set_yticklabels([''] + predicted_sequence, fontdict=fontdict)

        ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
        ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

        plt.show()

    def translate(seq_in):
        seq_in, seq_out, attention_plot = evaluate(seq_in)

        print('Input: %s' % (seq_in))
        print('Predicted translation: {}'.format(seq_out))

        attention_plot = attention_plot[:len(seq_out), :len(seq_in)]
        plot_attention(attention_plot, seq_in, seq_out)

    #translate([1, 2, 3 ,4,5,6,7,8,9,1,2,3,4,5,7,6])
    translate([1, 2, 3, 4])

    Input: [1, 2, 3, 4]
    Predicted translation: [4, 3, 2, 1]

# BONUS 2: LUONG ATTENTION

As we discussed, in this implementation the only difference between Bahdanau and Luong Attention is the calculation of the score.

Here, we use the dot product for finding the score, as suggested by Luong et al.:

`score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed), perm=[0, 2, 1])`
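The transpose and `matmul` combination can look opaque, so here is a NumPy check, with random made-up tensors, that it yields exactly one dot product per encoder time step. NumPy stands in for TensorFlow here only to keep the sketch lightweight.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_len, hidden = 2, 4, 16

query = rng.normal(size=(batch_size, hidden))             # decoder hidden state
values = rng.normal(size=(batch_size, max_len, hidden))   # encoder hidden states

query_with_time_axis = query[:, None, :]                  # (batch, 1, hidden)
values_transposed = np.transpose(values, (0, 2, 1))       # (batch, hidden, max_len)

# (batch, 1, hidden) @ (batch, hidden, max_len) -> (batch, 1, max_len)
score = np.matmul(query_with_time_axis, values_transposed)
score = np.transpose(score, (0, 2, 1))                    # (batch, max_len, 1)

# The same numbers, computed one dot product at a time.
expected = np.array([[np.dot(query[b], values[b, s]) for s in range(max_len)]
                     for b in range(batch_size)])
print(score.shape, np.allclose(score[..., 0], expected))
```

Note that the dot product only works when the decoder and encoder hidden sizes are equal, which holds in this tutorial because both LSTMs use `latentSpaceDimension` units.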

    class LuongAttention(tf.keras.layers.Layer):
        def __init__(self, units):
            super(LuongAttention, self).__init__()
            # These Dense layers are kept from the Bahdanau version;
            # the dot-product score below does not use them.
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)

        def call(self, query, values):
            print('\n******* Luong Attention STARTS******')
            print('query (decoder hidden state): (batch_size, hidden size) ', query.shape)
            print('values (encoder all hidden state): (batch_size, max_len, hidden size) ', values.shape)

            # query hidden state shape == (batch_size, hidden size)
            # query_with_time_axis shape == (batch_size, 1, hidden size)
            # values shape == (batch_size, max_len, hidden size)
            # we add a time axis to the query so that it can be matrix-multiplied with the values
            query_with_time_axis = tf.expand_dims(query, 1)
            print('query_with_time_axis:(batch_size, 1, hidden size) ', query_with_time_axis.shape)

            values_transposed = tf.transpose(values, perm=[0, 2, 1])
            print('values_transposed:(batch_size, hidden size, max_len) ', values_transposed.shape)

            # score shape == (batch_size, max_length, 1)
            # the matmul result has shape (batch_size, 1, max_length);
            # the transpose moves the time axis back to position 1

            # BAHDANAU ADDITIVE:
            #score = self.V(tf.nn.tanh(
            #    self.W1(query_with_time_axis) + self.W2(values)))

            # LUONG Dot-product
            score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed), perm=[0, 2, 1])
            print('score: (batch_size, max_length, 1) ', score.shape)

            # attention_weights shape == (batch_size, max_length, 1)
            attention_weights = tf.nn.softmax(score, axis=1)
            print('attention_weights: (batch_size, max_length, 1) ', attention_weights.shape)

            # context_vector shape after sum == (batch_size, hidden_size)
            context_vector = attention_weights * values
            print('context_vector before reduce_sum: (batch_size, max_length, hidden_size) ', context_vector.shape)
            context_vector = tf.reduce_sum(context_vector, axis=1)
            print('context_vector after reduce_sum: (batch_size, hidden_size) ', context_vector.shape)

            print('\n******* Luong Attention ENDS******')
            return context_vector, attention_weights

# WRAP UP

- We have discussed the Encoder-Decoder Model and observed that when the input sequence gets **longer**, the Encoder-Decoder model **cannot** relate the output to the input **successfully**
- We have seen that we can relate each output with each input by using an attention mechanism.
- In the experiments, the positive effect of the attention mechanism on performance is clear.
- We implemented **Bahdanau style (Additive)** attention, which is a *global attention* mechanism.
- Another global attention mechanism is **Luong Attention (Multiplicative)**, in which only the calculation of the score values differs.
- If only the dot product is used in **Luong Attention (Multiplicative)**, it is more efficient in terms of score calculations
- We can summarize the methods to find score values as below:

# DO IT YOURSELF

- Add Teacher Forcing
- Add Beam Search
- Convert Bahdanau Attention to Luong Attention and re-run the experiments
- Use Function Subclassing API for creating Encoder and Decoder