How to implement Seq2Seq LSTM Model in Keras #ShortcutNLP

If you got stuck with a dimension problem, this is for you

Akira Takezawa
Coldstart.ml
10 min read · Apr 7, 2019


Keras: Deep Learning for Python

Why do you need to read this?

If you got stuck implementing seq2seq with Keras, I’m here to help you.

When I wanted to implement seq2seq for a chatbot task, I got stuck many times, especially around the dimensions of the input data and the input layers of the neural network architecture.

So here I will give a complete guide to seq2seq in Keras. Let’s get started!

Menu

  1. What is Seq2Seq Text Generation Model?
  2. Task Definition and Seq2Seq Modeling
  3. Dimensions of Each Layer in Seq2Seq
  4. Entire Preprocessing for Seq2Seq (in the Chatbot Case)
  5. The simplest preprocessing code, which you can use today!

1. What is Seq2Seq Text Generation Model?

Fig A — Encoder-Decoder training architecture for NMT — image copyright@Ravindra Kompella

Seq2Seq is a type of Encoder-Decoder model using RNN. It can be used as a model for machine interaction and machine translation.

By learning from a large number of sequence pairs, this model generates one sequence from the other. Put more simply, the definition of Seq2Seq is:

  • Input: Text Data
  • Output: Text Data as well

And here we have examples of business applications of seq2seq:

  • Chatbot (you can find an example in my GitHub)
  • Machine Translation (you can find an example in my GitHub)
  • Question Answering
  • Abstractive Text Summarization (you can find an example in my GitHub)
  • Text Generation (you can find an example in my GitHub)

If you want more information about Seq2Seq, here I have a recommendation from Machine Learning at Microsoft on YouTube:

So let’s take a look at the whole process!

— — — — —

2. Task Definition and Seq2Seq Modeling

https://www.oreilly.com/library/view/deep-learning-essentials/9781785880360/b71e37fb-5fd9-4094-98c8-04130d5f0771.xhtml

For training our seq2seq model, we will use the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.

Here is one of the conversations from the dataset:

Mike: 
"Drink up, Charley. We're ahead of you."
Charley:
"I'm not thirsty."

Then we will feed these conversation pairs into the Encoder and the Decoder. That means our neural network model has two input layers, as you can see below.

This is our Seq2Seq Neural Network Architecture for this time:

copyright Akira Takezawa

Let’s visualize our Seq2Seq by using LSTM:

copyright Akira Takezawa

3. Dimensions of Each Layer in Seq2Seq

https://bracketsmackdown.com/word-vector.html

I think the black box for an “NLP newbie” is this:

How does each layer process the data and change its dimensions?

To make this clear, I will explain how it works in detail. The layers can be broken down into 4 parts:

  1. Input Layer (Encoder and Decoder)
  2. Embedding Layer (Encoder and Decoder)
  3. LSTM Layer (Encoder and Decoder)
  4. Decoder Output Layer

Let’s get started!

1. Input Layer of Encoder and Decoder (2D->2D)

  • Input Layer Dimension: 2D (None, sequence_length), where None is the batch size
# 2D
from keras.layers import Input

encoder_input_layer = Input(shape=(sequence_length, ))
decoder_input_layer = Input(shape=(sequence_length, ))

NOTE: sequence_length is MAX_LEN unified by padding in preprocessing

  • Input Data: 2D (sample_num, max_sequence_length)
# Input_Data.shape = (150000, 15)
array([[ 1, 32, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 123, 56, 3, 34, 43, 345, 0, 0, 0, 0, 0, 0, 0],
[ 3, 22, 1, 6543, 58, 6, 435, 0, 0, 0, 0, 0, 0],
[ 198, 27, 2, 94, 67, 98, 0, 0, 0, 0, 0, 0, 0],
[ 45, 78, 2, 23, 43, 6, 45, 0, 0, 0, 0, 0, 0]
], dtype=int32)

NOTE: sample_num is the number of training samples (here 150,000)

  • Output Data: 2D

NOTE: Input() is used only for Keras tensor instantiations

— — — — —

2. Embedding layer of Encoder and Decoder (2D->3D)

  • Embedding Layer Dimension: 2D (sequence_length, vocab_size)
from keras.layers import Embedding

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dimension,
                            input_length=sequence_length)

NOTE: vocab_size is the number of unique words

  • Input Data: 2D (sequence_length, vocab_size)
# Input_Data.shape = (15, 10000)
array([[ 1, 1, 0, 0, 1, 0, ...... 0, 0, 1, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, 1, ...... 0, 0, 0, 0, 0, 0, 1],
[ 0, 1, 0, 0, 0, 0, ...... 0, 0, 1, 0, 0, 0, 0],
[ 0, 1, 0, 0, 0, 1, ...... 0, 0, 0, 1, 0, 1, 0],
[ 0, 0, 1, 0, 1, 0, ...... 0, 0, 1, 0, 1, 0, 0]
], dtype=int32)

NOTE: Conceptually the data is a group of one-hot vectors; in practice Keras’s Embedding layer consumes the integer word IDs produced by the Input layer directly

  • Output Data: 3D (num_samples, sequence_length, embedding_dims)
# Output_Data.shape = (150000, 15, 50)
array([[[ 1, 1, 0, 0, ...... 0, 1, 0, 0],
[ 0, 0, 1, 0, ...... 0, 0, 0, 1],
...,
...,
[ 0, 1, 0, 0, ...... 1, 0, 1, 0],
[ 0, 0, 1, 0, ...... 0, 1, 0, 0]],
[[ 1, 1, 0, 0, ...... 0, 1, 0, 0],
[ 0, 0, 1, 0, ...... 0, 0, 0, 1],
...,
...,
[ 0, 1, 0, 0, ...... 1, 0, 1, 0],
[ 0, 0, 1, 0, ...... 0, 1, 0, 0]],
....,] * 150000 , dtype=int32)

NOTE: Each word is now embedded as a 50-dimensional vector (the 0/1 values above are placeholders; real embedding values are floats)

— — — — —

3. LSTM layer of Encoder and Decoder (3D->3D)

The tricky arguments of the LSTM layer are these two:

1. return_state:

Whether to return the last state along with the output

2. return_sequences:

Whether to return the complete output sequence or only the last output

You can find a good explanation from Understand the Difference Between Return Sequences and Return States for LSTMs in Keras by Jason Brownlee.

  • Layer Dimension: 3D (hidden_units, sequence_length, embedding_dims)
# HIDDEN_DIM = 20
from keras.layers import LSTM

encoder_LSTM = LSTM(HIDDEN_DIM, return_state=True)
encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)

decoder_LSTM = LSTM(HIDDEN_DIM, return_state=True, return_sequences=True)
decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])
  • Input Data: 3D (num_samples, sequence_length, embedding_dims)
# Input_Data.shape = (150000, 15, 50)
array([[[ 1, 1, 0, 0, ...... 0, 1, 0, 0],
[ 0, 0, 1, 0, ...... 0, 0, 0, 1],
...,
...,
[ 0, 1, 0, 0, ...... 1, 0, 1, 0],
[ 0, 0, 1, 0, ...... 0, 1, 0, 0]],
[[ 1, 1, 0, 0, ...... 0, 1, 0, 0],
[ 0, 0, 1, 0, ...... 0, 0, 0, 1],
...,
...,
[ 0, 1, 0, 0, ...... 1, 0, 1, 0],
[ 0, 0, 1, 0, ...... 0, 1, 0, 0]],
....,] * 150000 , dtype=int32)

NOTE: Each word is embedded as a 50-dimensional vector (the 0/1 values above are placeholders; real embedding values are floats)

  • Output Data: 3D (num_samples, sequence_length, hidden_units)
# HIDDEN_DIM = 20
# Output_Data.shape = (150000, 15, 20)
array([[[ 0.0032, 0.0041, 0.0021, .... 0.0020, 0.0231, 0.0010],
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
...,
...,
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
[ 0.0021, 0.0065, 0.0008, .... 0.0089, 0.0043, 0.0024]],
[ 0.0032, 0.0041, 0.0021, .... 0.0020, 0.0231, 0.0010],
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
...,
...,
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
[ 0.0021, 0.0065, 0.0008, .... 0.0089, 0.0043, 0.0024]],
....,] * 150000 , dtype=float32)

NOTE: The LSTM has mapped each timestep into a 20-dimensional hidden representation

Additional Information:

If return_state = False and return_sequences = False :

  • Output Data: 2D (num_sample, hidden_units)
# HIDDEN_DIM = 20
# Output_Data.shape = (150000, 20)
array([[ 0.0032, 0.0041, 0.0021, .... 0.0020, 0.0231, 0.0010],
[ 0.0076, 0.0767, 0.0761, .... 0.0098, 0.0065, 0.0076],
...,
...,
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
[ 0.0021, 0.0065, 0.0008, .... 0.0089, 0.0043, 0.0024]]
, dtype=float32)

— — — — —

4. Decoder Output Layer (3D->2D)

  • Output Layer Dimension: 2D (sequence_length, vocab_size)
from keras.layers import Dense, TimeDistributed

outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)

NOTE: The TimeDistributed wrapper allows us to apply the Dense layer to every temporal slice of the input

  • Input Data: 3D (num_samples, sequence_length, hidden_units)
# HIDDEN_DIM = 20
# Input_Data.shape = (150000, 15, 20)
array([[[ 0.0032, 0.0041, 0.0021, .... 0.0020, 0.0231, 0.0010],
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
...,
...,
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
[ 0.0021, 0.0065, 0.0008, .... 0.0089, 0.0043, 0.0024]],
[ 0.0032, 0.0041, 0.0021, .... 0.0020, 0.0231, 0.0010],
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
...,
...,
[ 0.0099, 0.0007, 0.0098, .... 0.0038, 0.0035, 0.0026],
[ 0.0021, 0.0065, 0.0008, .... 0.0089, 0.0043, 0.0024]],
....,] * 150000 , dtype=float32)

NOTE: The LSTM has mapped each timestep into a 20-dimensional hidden representation

  • Output Data: 2D (sequence_length, vocab_size)
# Output_Data.shape = (15, 10000)
array([[ 1, 1, 0, 0, 1, 0, ...... 0, 0, 1, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, 1, ...... 0, 0, 0, 0, 0, 0, 1],
[ 0, 1, 0, 0, 0, 0, ...... 0, 0, 1, 0, 0, 0, 0],
[ 0, 1, 0, 0, 0, 1, ...... 0, 0, 0, 1, 0, 1, 0],
[ 0, 0, 1, 0, 1, 0, ...... 0, 0, 1, 0, 1, 0, 0]
], dtype=int32)

After the data has passed through this fully connected layer, we use the reversed vocabulary (idx2word), which I will explain later, to convert the one-hot vectors back into a word sequence.
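
Putting the four parts together, here is a minimal sketch of how these layers can be wired into a trainable Model. This is not the author’s exact notebook; the constant values are illustrative and simply match the shapes used in the examples above.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed

MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM = 15, 10000, 50, 20  # illustrative

# Encoder: word IDs -> embeddings -> final LSTM states
encoder_inputs = Input(shape=(MAX_LEN,))
encoder_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(encoder_inputs)
_, state_h, state_c = LSTM(HIDDEN_DIM, return_state=True)(encoder_embedding)

# Decoder: initialized with the encoder states, returns one vector per timestep
decoder_inputs = Input(shape=(MAX_LEN,))
decoder_embedding = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(decoder_inputs)
decoder_outputs, _, _ = LSTM(HIDDEN_DIM, return_state=True, return_sequences=True)(
    decoder_embedding, initial_state=[state_h, state_c])

# Output layer: a softmax over the vocabulary at every timestep
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

Training then amounts to model.fit([encoder_input_data, decoder_input_data], decoder_output_data, ...), using the arrays built in the preprocessing section below.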

— — — — —

4. Entire Preprocessing for Seq2Seq (in the Chatbot Case)

Creating A Language Translation Model Using Sequence To Sequence Learning Approach

Before jumping into the preprocessing for Seq2Seq, I want to mention this:

We need some variables to define the shape of our Seq2Seq neural network, and we fix them during data preprocessing (illustrative values follow the list below):

  1. MAX_LEN: to unify the length of the input sentences
  2. VOCAB_SIZE: to decide the dimension of the one-hot word vectors
  3. EMBEDDING_DIM: to decide the dimension of the word embeddings (Word2Vec/GloVe)
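
For concreteness, these constants could be set as follows; the values are illustrative (they match the preprocessing examples below), not prescribed:

MAX_LEN = 10         # every sentence is padded/truncated to 10 tokens
VOCAB_SIZE = 15000   # keep only the 15,000 most frequent words
EMBEDDING_DIM = 50   # dimension of the pretrained GloVe vectors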

— — — — —

Preprocessing for Seq2Seq

OK, keep this information in mind; let’s start to talk about preprocessing. The whole process can be broken down into 8 steps:

  1. Text Cleaning
  2. Put <BOS> tag and <EOS> tag for decoder input
  3. Make Vocabulary (VOCAB_SIZE)
  4. Tokenize Bag of words to Bag of IDs
  5. Padding (MAX_LEN)
  6. Word Embedding (EMBEDDING_DIM)
  7. Reshape the data to match the neural network shape
  8. Split the data for training, validation, and testing

— — — — —

1. Text Cleaning

  • Function

I always use my own function to clean text for Seq2Seq (a sketch follows the input/output examples below):

  • Input
# encoder input text data
["Drink up, Charley. We're ahead of you.",
'Did you change your hair?',
'I believe I have found a faster way.']
  • Output
# encoder input text data
['drink up charley we are ahead of you',
'did you change your hair',
'i believe i have found a faster way']
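
The cleaning gist itself is not embedded here, so the following is a minimal sketch of such a function (lower-casing, expanding a few contractions, stripping punctuation); the exact rules in the original may differ, and raw_encoder_text is a placeholder name:

import re

# A few common contractions; a real mapping would be longer
CONTRACTIONS = {"we're": "we are", "i'm": "i am", "it's": "it is",
                "don't": "do not", "can't": "cannot"}

def clean_text(text):
    """Lower-case, expand common contractions, and strip punctuation."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9 ]", " ", text)    # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces

encoder_text = [clean_text(t) for t in raw_encoder_text]  # raw_encoder_text: raw lines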

— — — — —

2. Put <BOS> tag and <EOS> tag for decoder input

  • Function

<BOS> means “Beginning of Sequence” and <EOS> means “End of Sequence” (a one-line sketch follows the examples below).

  • Input
# decoder input text data
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
  • Output
# decoder input text data
[['<BOS> with the teeth of your zipper <EOS>',
'<BOS> so they tell me <EOS>',
'<BOS> so which dakota you from <EOS>'],,,,]
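
Again the gist is missing; the step amounts to one line (decoder_text is assumed to hold the already-cleaned decoder sentences):

# Wrap every decoder sentence with the begin-/end-of-sequence tags
decoder_text = ['<BOS> ' + line + ' <EOS>' for line in decoder_text]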

— — — — —

3. Make Vocabulary (VOCAB_SIZE)

  • Function (a sketch follows the output example below)
  • Input
# Cleaned texts
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
  • Output
>>> word2idx
{'genetically': 14816,
'ecentus': 64088,
'houston': 4172,
'cufflinks': 30399,
"annabelle's": 23767,
.....} # 14999 words

>>> idx2word
{1: 'bos',
2: 'eos',
3: 'you',
4: 'i',
5: 'the',
.....} # 14999 indices
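
A minimal sketch of how word2idx and idx2word can be built by keeping the VOCAB_SIZE most frequent words (the original function is not shown; encoder_text and decoder_text are the cleaned/tagged lists from the previous steps):

from collections import Counter

def build_vocab(texts, vocab_size):
    """Map the vocab_size most frequent words to integer IDs (0 is reserved for padding)."""
    counts = Counter(word for line in texts for word in line.split())
    most_common = [w for w, _ in counts.most_common(vocab_size - 1)]
    word2idx = {w: i + 1 for i, w in enumerate(most_common)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

word2idx, idx2word = build_vocab(encoder_text + decoder_text, VOCAB_SIZE)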

— — — — —

4. Tokenize Bag of words to Bag of IDs

  • Function (a sketch follows the output example below)
  • Input
# Cleaned texts
[['with the teeth of your zipper',
'so they tell me',
'so which dakota you from'],,,,]
  • Output
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
[3, 5, 186, 168],
[662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
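
A sketch of the tokenization step; out-of-vocabulary words are simply dropped here, although mapping them to an <UNK> ID is an equally common choice:

def texts_to_ids(texts, word2idx):
    """Convert cleaned sentences into lists of word IDs."""
    return [[word2idx[w] for w in line.split() if w in word2idx] for line in texts]

encoder_ids = texts_to_ids(encoder_text, word2idx)
decoder_ids = texts_to_ids(decoder_text, word2idx)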

— — — — —

5. Padding (MAX_LEN)

  • Function (a sketch follows the output example below)
  • Input
# decoder input text data
[[10, 27, 8, 4, 27, 1107, 802],
[3, 5, 186, 168],
[662, 4, 22, 346, 6, 130, 3, 5, 2407],,,,,]
  • Output
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
[3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
[662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
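
Keras ships a helper for this step; a sketch using pad_sequences with post-padding, so shorter sentences are filled with zeros up to MAX_LEN:

from keras.preprocessing.sequence import pad_sequences

encoder_input_data = pad_sequences(encoder_ids, maxlen=MAX_LEN, padding='post')
decoder_input_data = pad_sequences(decoder_ids, maxlen=MAX_LEN, padding='post')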

— — — — —

6. Word Embedding (EMBEDDING_DIM)

  • Function

We use pretrained GloVe word vectors as the embedding weights. We can create the embedding layer from GloVe in 3 steps (a combined sketch follows the list below):

  1. Load the GloVe file from XX
  2. Create an embedding matrix from our vocabulary
  3. Create the embedding layer

Let’s take a look!

  • Load the GloVe file from XX
  • Create an embedding matrix from our vocabulary
  • Create the embedding layer
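
Since the three gists are not embedded, here is a compressed sketch of all three steps, reusing word2idx and the constants from the earlier sketches. The file name glove.6B.50d.txt is the standard 50-dimensional GloVe download and is an assumption here:

import numpy as np
from keras.layers import Embedding

# 1. Load the pretrained GloVe vectors into a dict: word -> 50-d vector
glove = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

# 2. Build an embedding matrix whose rows line up with our own word IDs
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for word, idx in word2idx.items():
    vector = glove.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector   # words missing from GloVe stay all-zero

# 3. Create the (frozen) embedding layer shared by encoder and decoder
embedding_layer = Embedding(input_dim=VOCAB_SIZE,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,
                            trainable=False)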

— — — — —

7. Reshape the data to match the neural network shape

  • Function (a sketch follows the output example below)
  • Input
# MAX_LEN = 10
# decoder input text data
array([[10, 27, 8, 4, 27, 1107, 802, 0, 0, 0],
[3, 5, 186, 168, 0, 0, 0, 0, 0, 0],
[662, 4, 22, 346, 6, 130, 3, 5, 2407, 0],,,,,]
  • Output
# output.shape (num_samples, MAX_LEN, VOCAB_SIZE)
# decoder_output_data.shape (15000, 10, 15000)
array([[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.]],
..., ], dtype=float32)
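
A sketch of how the decoder target can be one-hot encoded into the (num_samples, MAX_LEN, VOCAB_SIZE) shape shown above. The one-step shift (teacher forcing, predicting the next word) is a standard choice and an assumption here, since the original function is not shown:

import numpy as np

num_samples = len(decoder_input_data)
decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype='float32')

for i, sequence in enumerate(decoder_input_data):
    for t, word_id in enumerate(sequence):
        if t > 0:
            # the target at step t-1 is the word the decoder should emit next
            decoder_output_data[i, t - 1, word_id] = 1.0

Note that this dense one-hot tensor grows quickly with VOCAB_SIZE; using sparse_categorical_crossentropy with integer targets is a common way to avoid materializing it.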

— — — — —

8. Split the data for training, validation, and testing

  • Function
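
The gist is missing; here is a sketch using scikit-learn’s train_test_split twice, first carving off 20% and then halving it into validation and test sets (the 80/10/10 ratio is an assumption):

from sklearn.model_selection import train_test_split

# 80% train, 20% held out
(enc_train, enc_tmp,
 dec_in_train, dec_in_tmp,
 dec_out_train, dec_out_tmp) = train_test_split(
    encoder_input_data, decoder_input_data, decoder_output_data,
    test_size=0.2, random_state=42)

# split the held-out 20% in half: 10% validation, 10% test
(enc_val, enc_test,
 dec_in_val, dec_in_test,
 dec_out_val, dec_out_test) = train_test_split(
    enc_tmp, dec_in_tmp, dec_out_tmp, test_size=0.5, random_state=42)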

— — — — —

