PART F: Encoder-Decoder with Bahdanau & Luong Attention Mechanism

References:

SEQ2SEQ LEARNING SERIES:

WHY DO WE HAVE SO MANY PARTS?

Sample Problem:

Configure the sample problem

#@title Configure problem

n_timesteps_in = 4  # each input sample has 4 values
n_features = 10     # each value is one-hot encoded across 10 dimensions (0/1)

train_size = 2000
test_size = 200

For each input sequence (X), we select 4 random numbers between 1 and 9 (0 is reserved for the start symbol).

A sample X:
X=[2, 9, 9, 2]

The reversed input sequence (X) is the output sequence (y). (This example happens to be a palindrome, so its reverse is identical.)
y=[2, 9, 9, 2]

Each input and output sequence is converted to one-hot encoded format in 10 dimensions:
X=[[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1]
[0 0 1 0 0 0 0 0 0 0]]
y=[[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1]
[0 0 1 0 0 0 0 0 0 0]]

The generated sequence datasets have the shape (batch_size, time_steps, features):
X_train.shape: (2000, 4, 10) y_train.shape: (2000, 4, 10)
X_test.shape: (200, 4, 10) y_test.shape: (200, 4, 10)
time: 79.7 ms
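The generation and encoding helpers are not shown in this part. A minimal sketch of how they might look, consistent with how `one_hot_encode` and `one_hot_decode` are called later in the post (the name `generate_dataset` is an assumption):

```python
import numpy as np

def one_hot_encode(sequence, n_features):
    # encode each value in the sequence as a one-hot vector of length n_features
    encoding = np.zeros((len(sequence), n_features), dtype=int)
    for i, value in enumerate(sequence):
        encoding[i, value] = 1
    return encoding

def one_hot_decode(encoded_seq):
    # recover the integer sequence by taking the argmax of each vector
    return [int(np.argmax(vector)) for vector in encoded_seq]

def generate_dataset(n_samples, n_timesteps, n_features):
    # X: random sequences of values 1..n_features-1 (0 is reserved as the start symbol)
    # y: the reversed sequences; both are one-hot encoded
    X, y = [], []
    for _ in range(n_samples):
        seq_in = list(np.random.randint(1, n_features, n_timesteps))
        seq_out = seq_in[::-1]
        X.append(one_hot_encode(seq_in, n_features))
        y.append(one_hot_encode(seq_out, n_features))
    return np.array(X), np.array(y)

X_train, y_train = generate_dataset(2000, 4, 10)
print(X_train.shape, y_train.shape)  # (2000, 4, 10) (2000, 4, 10)
```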

QUICK LSTM REMINDER
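As a one-screen refresher, an LSTM cell maintains a hidden state $h_t$ and a cell state $C_t$, updated at each time step by the forget, input, and output gates:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) &&\text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) &&\text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) &&\text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{(new cell state)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) &&\text{(new hidden state)}
\end{aligned}
```

In Keras, `return_state=True` returns the final $h_t$ and $C_t$, and `return_sequences=True` returns $h_t$ for every time step; we will use both options below.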

A BASIC LSTM-BASED ENCODER & DECODER MODEL

Let’s review the code

Here is the complete code:


#@title latentSpaceDimension is the dimension of each state vector
latentSpaceDimension = 16

def create_hard_coded_decoder_input_model(batch_size):
    # The first part is the encoder
    encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
    encoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

    # the initial context vector is the encoder's final states
    states = [state_h, state_c]

    # Set up the decoder layers
    decoder_inputs = Input(shape=(1, n_features))
    decoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='decoder_lstm')
    decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

    all_outputs = []
    # Prepare decoder input data that just contains the start character 0.
    # Note that we made it a constant one-hot-encoded in the model;
    # that is, [1 0 0 0 0 0 0 0 0 0] is the initial input for each loop
    decoder_input_data = np.zeros((batch_size, 1, n_features))
    decoder_input_data[:, 0, 0] = 1

    inputs = decoder_input_data
    # the decoder will only process one timestep at a time
    for _ in range(n_timesteps_in):
        # Run the decoder on one timestep
        outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)
        outputs = decoder_dense(outputs)
        # Store the current prediction (we will concatenate all predictions later)
        all_outputs.append(outputs)
        # Reinject the outputs as inputs for the next loop iteration
        # and update the states
        inputs = outputs
        states = [state_h, state_c]

    # Concatenate all predictions into [batch_size, timesteps, features]
    decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

    # Define and compile the model
    model = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

batch_size = 10
model_encoder_decoder = create_hard_coded_decoder_input_model(batch_size=batch_size)
#model_encoder_decoder.summary()
time: 1.17 s

Train model

model_encoder_decoder.fit(X_train, y_train,
                          batch_size=batch_size,
                          epochs=30,
                          validation_split=0.2)

train_test(model_encoder_decoder, X_train, y_train, X_test, y_test,
           batch_size=batch_size, epochs=40, patience=5, verbose=1)

training for  40  epochs begins with EarlyStopping(monitor= val_accuracy, patience= 5 )....
Epoch 1/40
180/180 [==============================] - 10s 14ms/step - loss: 2.2497 - accuracy: 0.2251 - val_loss: 1.9826 - val_accuracy: 0.3275
***
***
Epoch 35/40
180/180 [==============================] - 1s 6ms/step - loss: 0.0177 - accuracy: 1.0000 - val_loss: 0.0333 - val_accuracy: 0.9975
Epoch 36/40
180/180 [==============================] - 1s 6ms/step - loss: 0.0142 - accuracy: 1.0000 - val_loss: 0.0294 - val_accuracy: 0.9950
Epoch 00036: early stopping
40 epoch training finished...

PREDICTION ACCURACY (%):
Train: 99.950, Test: 100.000
[Plots: training/validation loss and accuracy over epochs]
10 examples from test data...
Input Expected Predicted T/F
[2, 3, 1, 1] [1, 1, 3, 2] [1, 1, 3, 2] True
[7, 9, 1, 6] [6, 1, 9, 7] [6, 1, 9, 7] True
[2, 9, 3, 8] [8, 3, 9, 2] [8, 3, 9, 2] True
[7, 7, 9, 9] [9, 9, 7, 7] [9, 9, 7, 7] True
[7, 1, 1, 7] [7, 1, 1, 7] [7, 1, 1, 7] True
[8, 4, 6, 9] [9, 6, 4, 8] [9, 6, 4, 8] True
[2, 5, 9, 9] [9, 9, 5, 2] [9, 9, 5, 2] True
[2, 2, 4, 5] [5, 4, 2, 2] [5, 4, 2, 2] True
[8, 7, 5, 7] [7, 5, 7, 8] [7, 5, 7, 8] True
[4, 9, 9, 1] [1, 9, 9, 4] [1, 9, 9, 4] True
Accuracy: 1.0
time: 47.6 s

Observations

ATTENTION MECHANISM

Why?

How does it work?

Attention: How to calculate Context Vector

Notation

Calculate a score relating all of the Encoder's hidden states to the previous Decoder's output.

Calculate the Attention Weights by normalizing the scores.

Calculate the Context Vector by applying the Attention Weights onto the Encoder's hidden states $h_{s}$.
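In symbols, with $s_{t-1}$ the previous Decoder state and $h_{s}$ the Encoder hidden states, the three steps above are:

```latex
\begin{aligned}
e_{ts} &= v_a^{\top}\tanh\!\left(W_1 s_{t-1} + W_2 h_s\right) &&\text{(Bahdanau additive score)}\\
\alpha_{ts} &= \frac{\exp(e_{ts})}{\sum_{s'} \exp(e_{ts'})} &&\text{(attention weights)}\\
c_t &= \sum_{s} \alpha_{ts}\, h_s &&\text{(context vector)}
\end{aligned}
```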

Let’s code Bahdanau Attention Layer
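The code cell for this layer did not survive extraction. Below is a sketch consistent with how the layer is used later in this post (`BahdanauAttention(latentSpaceDimension, verbose=verbose)`, returning `(context_vector, attention_weights)`), following the standard additive-attention formulation; the `verbose` prints are an assumption:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units, verbose=0):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
        self.verbose = verbose

    def call(self, query, values):
        # query (decoder's last hidden state): (batch_size, hidden size)
        # values (all encoder hidden states): (batch_size, max_len, hidden size)
        # add a time axis so the addition broadcasts along the time dimension
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape: (batch_size, max_len, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # normalize the scores over the time axis
        attention_weights = tf.nn.softmax(score, axis=1)

        # weighted sum of encoder hidden states: (batch_size, hidden size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        if self.verbose:
            print('context_vector shape:', context_vector.shape)
            print('attention_weights shape:', attention_weights.shape)

        return context_vector, attention_weights
```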

time: 26.4 ms

Integrate the attention layer into the Encoder-Decoder model

In an Encoder-Decoder with Attention Layer set-up,

verbose = 0  # set to 1 to see all debug messages

batch_size = 1
if verbose:
    print('***** Model Hyper Parameters *******')
    print('latentSpaceDimension: ', latentSpaceDimension)
    print('batch_size: ', batch_size)
    print('sequence length: ', n_timesteps_in)
    print('n_features: ', n_features)
    print('\n***** TENSOR DIMENSIONS *******')

# The first part is the encoder
encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
encoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='encoder_lstm')
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm(encoder_inputs)

if verbose:
    print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
    print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_h.shape))
    print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_c.shape))

# the initial context vector is the states of the encoder
encoder_states = [encoder_state_h, encoder_state_c]
if verbose:
    print(encoder_states)

# Set up the attention layer
attention = BahdanauAttention(latentSpaceDimension, verbose=verbose)

# Set up the decoder layers
decoder_inputs = Input(shape=(1, (n_features + latentSpaceDimension)), name='decoder_inputs')
decoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='decoder_lstm')
decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

all_outputs = []

# 1. initial decoder input data
# Prepare initial decoder input data that just contains the start character.
# Note that we made it a constant one-hot-encoded in the model;
# that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop:
# the one-hot encoded zero (0) is the start symbol
inputs = np.zeros((batch_size, 1, n_features))
inputs[:, 0, 0] = 1

# 2. initial decoder state:
# the encoder's last hidden state + last cell state
decoder_outputs = encoder_state_h
states = encoder_states
if verbose:
    print('initial decoder inputs: ', inputs.shape)

# the decoder will only process one time step at a time
for _ in range(n_timesteps_in):

    # 3. pay attention:
    # create the context vector by applying attention to
    # decoder_outputs (last hidden state) + encoder_outputs (all hidden states)
    context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)
    if verbose:
        print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
        print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
        print('decoder_outputs: (batch_size, latentSpaceDimension) ', decoder_outputs.shape)

    context_vector = tf.expand_dims(context_vector, 1)
    if verbose:
        print('Reshaped context_vector: ', context_vector.shape)

    # 4. concatenate the input + context vector to form the next decoder input
    inputs = tf.concat([context_vector, inputs], axis=-1)
    if verbose:
        print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

    # 5. pass the concatenated vector to the LSTM:
    # run the decoder on one timestep with the attended input and previous states
    decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)

    outputs = decoder_dense(decoder_outputs)
    # 6. Use the last hidden state to predict the output;
    # save the current prediction (we will concatenate all predictions later)
    outputs = tf.expand_dims(outputs, 1)
    all_outputs.append(outputs)
    # 7. Reinject the output (prediction) as input for the next loop iteration
    # and update the states
    inputs = outputs
    states = [state_h, state_c]

# 8. After running the decoder for max time steps,
# we have a list of predictions for the output sequence;
# convert the list to an output array by concatenating all predictions,
# shaped [batch_size, timesteps, features]
decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

# 9. Define and compile the model
model_encoder_decoder_Bahdanau_Attention = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
model_encoder_decoder_Bahdanau_Attention.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

TRAIN THE MODEL WITH ATTENTION

train_test(model_encoder_decoder_Bahdanau_Attention, X_train, y_train, X_test, y_test,
           batch_size=batch_size, epochs=40, patience=3, verbose=1)
training for 40 epochs begins with EarlyStopping(monitor= val_accuracy, patience= 3 )....
Epoch 1/40
1800/1800 [==============================] - 19s 7ms/step - loss: 2.0012 - accuracy: 0.2725 - val_loss: 1.4980 - val_accuracy: 0.4025
***
Epoch 7/40
1800/1800 [==============================] - 11s 6ms/step - loss: 5.6494e-04 - accuracy: 0.9999 - val_loss: 1.1552e-06 - val_accuracy: 1.0000
Epoch 8/40
1800/1800 [==============================] - 11s 6ms/step - loss: 7.0385e-05 - accuracy: 1.0000 - val_loss: 1.7178e-07 - val_accuracy: 1.0000
Epoch 00008: early stopping
40 epoch training finished...

PREDICTION ACCURACY (%):
Train: 100.000, Test: 100.000
[Plots: training/validation loss and accuracy over epochs]
10 examples from test data...
Input Expected Predicted T/F
[2, 3, 1, 1] [1, 1, 3, 2] [1, 1, 3, 2] True
[7, 9, 1, 6] [6, 1, 9, 7] [6, 1, 9, 7] True
[2, 9, 3, 8] [8, 3, 9, 2] [8, 3, 9, 2] True
[7, 7, 9, 9] [9, 9, 7, 7] [9, 9, 7, 7] True
[7, 1, 1, 7] [7, 1, 1, 7] [7, 1, 1, 7] True
[8, 4, 6, 9] [9, 6, 4, 8] [9, 6, 4, 8] True
[2, 5, 9, 9] [9, 9, 5, 2] [9, 9, 5, 2] True
[2, 2, 4, 5] [5, 4, 2, 2] [5, 4, 2, 2] True
[8, 7, 5, 7] [7, 5, 7, 8] [7, 5, 7, 8] True
[4, 9, 9, 1] [1, 9, 9, 4] [1, 9, 9, 4] True
Accuracy: 1.0
time: 1min 38s

Observations

PREDICT WITH THE TRAINED MODEL

pred = model_encoder_decoder_Bahdanau_Attention.predict(
    X_test[0].reshape(1, n_timesteps_in, n_features), batch_size=1)
print('input', one_hot_decode(X_test[0]))
print('expected', one_hot_decode(y_test[0]))
print('predicted', one_hot_decode(pred[0]))
input [2, 3, 1, 1]
expected [1, 1, 3, 2]
predicted [1, 1, 3, 2]
time: 46.1 ms

BONUS 1: DEPICT THE ATTENTION

def evaluate(seq_in):
    attention_plot = np.zeros((n_timesteps_in, n_timesteps_in))
    print('attention_plot shape: (n_timesteps_in, n_timesteps_in) {}'.format(attention_plot.shape))

    sequence = one_hot_encode(seq_in, n_features)
    encoder_inputs = array(sequence).reshape(1, n_timesteps_in, n_features)
    encoder_inputs = tf.convert_to_tensor(encoder_inputs, dtype=tf.float32)
    print('Encoder input shape: (batch size, sequence length, n_features) {}'.format(encoder_inputs.shape))

    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
    print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(state_h.shape))
    print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(state_c.shape))

    # the initial context vector is the states of the encoder
    states = [state_h, state_c]

    # Note: we reuse the trained attention, decoder_lstm, and decoder_dense
    # layers defined when building the model above

    all_outputs = []

    # INIT DECODER
    # Prepare decoder input data that just contains the start character 0;
    # that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop
    decoder_input_data = np.zeros((1, 1, n_features))
    decoder_input_data[:, 0, 0] = 1
    inputs = decoder_input_data

    # initial hidden state
    decoder_outputs = state_h
    print('initial decoder inputs: ', inputs.shape)

    # the decoder will only process one timestep at a time
    for t in range(n_timesteps_in):

        # pay attention
        context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)
        print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
        print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

        # store the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1,))
        attention_plot[t] = attention_weights.numpy()

        context_vector = tf.expand_dims(context_vector, 1)
        print('Reshaped context_vector: ', context_vector.shape)
        inputs = tf.concat([context_vector, inputs], axis=-1)
        print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

        # pass the concatenated vector to the LSTM;
        # run the decoder on one timestep
        decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)

        outputs = decoder_dense(decoder_outputs)
        # Store the current prediction (we will concatenate all predictions later)
        outputs = tf.expand_dims(outputs, 1)
        all_outputs.append(outputs)
        # Reinject the outputs as inputs for the next loop iteration
        # and update the states
        inputs = outputs
        states = [state_h, state_c]

    # Concatenate all predictions into [batch_size, timesteps, features]
    decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)
    seq_out = one_hot_decode(decoder_outputs[0])

    return seq_in, seq_out, attention_plot
time: 63.8 ms

# function for plotting the attention weights
def plot_attention(attention, sequence, predicted_sequence):
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')

    fontdict = {'fontsize': 14}

    ax.set_xticklabels([''] + sequence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sequence, fontdict=fontdict)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
time: 10.3 ms

def translate(seq_in):
    seq_in, seq_out, attention_plot = evaluate(seq_in)

    print('Input: %s' % (seq_in))
    print('Predicted translation: {}'.format(seq_out))

    attention_plot = attention_plot[:len(seq_out), :len(seq_in)]
    plot_attention(attention_plot, seq_in, seq_out)

time: 3.59 ms

#translate([1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 7, 6])
translate([1, 2, 3, 4])
Input: [1, 2, 3, 4]
Predicted translation: [4, 3, 2, 1]
[Plot: attention weight heatmap for input [1, 2, 3, 4]]
time: 176 ms

BONUS 2: LUONG ATTENTION

class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        # W1, W2, and V are kept from the Bahdanau version;
        # the dot-product score below does not use them
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        print('\n******* Luong Attention STARTS******')
        print('query (decoder hidden state): (batch_size, hidden size) ', query.shape)
        print('values (encoder all hidden states): (batch_size, max_len, hidden size) ', values.shape)

        # query hidden state shape == (batch_size, hidden size)
        # query_with_time_axis shape == (batch_size, 1, hidden size)
        # values shape == (batch_size, max_len, hidden size)
        query_with_time_axis = tf.expand_dims(query, 1)
        print('query_with_time_axis: (batch_size, 1, hidden size) ', query_with_time_axis.shape)

        values_transposed = tf.transpose(values, perm=[0, 2, 1])
        print('values_transposed: (batch_size, hidden size, max_len) ', values_transposed.shape)

        # score shape == (batch_size, max_length, 1)
        # BAHDANAU ADDITIVE:
        # score = self.V(tf.nn.tanh(
        #     self.W1(query_with_time_axis) + self.W2(values)))

        # LUONG Dot-product:
        score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed), perm=[0, 2, 1])
        print('score: (batch_size, max_length, 1) ', score.shape)

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        print('attention_weights: (batch_size, max_length, 1) ', attention_weights.shape)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        print('context_vector before reduce_sum: (batch_size, max_length, hidden_size) ', context_vector.shape)
        context_vector = tf.reduce_sum(context_vector, axis=1)
        print('context_vector after reduce_sum: (batch_size, hidden_size) ', context_vector.shape)

        print('\n******* Luong Attention ENDS******')
        return context_vector, attention_weights
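The dot-product score can be sanity-checked with plain tensors. This minimal sketch (random inputs, assumed shapes) reproduces the layer's arithmetic without the class wrapper or the unused Dense sublayers:

```python
import tensorflow as tf

batch, max_len, hidden = 2, 4, 16
query = tf.random.normal((batch, hidden))            # decoder's last hidden state
values = tf.random.normal((batch, max_len, hidden))  # all encoder hidden states

query_with_time_axis = tf.expand_dims(query, 1)                     # (batch, 1, hidden)
values_transposed = tf.transpose(values, perm=[0, 2, 1])            # (batch, hidden, max_len)
score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed),
                     perm=[0, 2, 1])                                # (batch, max_len, 1)
attention_weights = tf.nn.softmax(score, axis=1)                    # sums to 1 over max_len
context_vector = tf.reduce_sum(attention_weights * values, axis=1)  # (batch, hidden)
```

To use this variant in the model above, simply construct `LuongAttention(latentSpaceDimension)` in place of the Bahdanau layer; the call signature and returned shapes are identical.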

WRAP UP

DO IT YOURSELF

You can follow Murat Karakaya Akademi on these social networks:

Assoc. Prof. Computer Engineering
