Neural Machine Translation using word level seq2seq model

Devesh Maheshwari
7 min read · Feb 28, 2018


In this post, I would like to talk about my adventure with sequence models. I have always liked the way Google Translate handles languages and wanted to try building something similar myself.

While there are many character-level encoder-decoder models available online, they are not really meant for translation tasks. In the words of the Keras team:

“Note that it is fairly unusual to do character-level machine translation, as word-level models are more common in this domain”

So I decided to give it a try.

Before moving to the finer details, I would like to talk about the architecture. The most popular architecture for NMT models is the encoder-decoder architecture. To understand it, consider how a human translates a sentence: they first hear the whole sentence, build up the context, and then translate the individual words given that context. This can be written mathematically as well.

From this paper by Google, https://arxiv.org/pdf/1609.08144.pdf, which explains the mathematical formulation of encoder-decoder models, the conditional probability for a translation is

P(Y | X) = P(Y | x_1, x_2, …, x_M) = ∏_{i=1}^{N} P(y_i | y_0, y_1, …, y_{i−1}; x_1, x_2, …, x_M)

Here Y is the translated (French) sentence, X is the given (English) sentence, and x_1, …, x_M are the encoder's hidden-state outputs for the input words. The equation says that the probability of each translated word y_i is conditioned on the previously translated words (y_0 to y_{i−1}) and on the encoder's hidden-state outputs.

Here is a diagram of an encoder-decoder network:

Image source: https://towardsdatascience.com/sequence-to-sequence-using-encoder-decoder-15e579c10a94

In the figure above, we can see the basic structure of an encoder-decoder model. The encoder reads the words of the input sequence one by one, and when the last word has been read, it passes its internal hidden state to the decoder, which then starts generating the output sequence. Much of this is similar to a language model; however, in a language model we sample the next word from the predicted probability distribution, whereas here we use greedy search and select the next word as the one with the highest probability in the softmax layer.

There are alternatives to greedy search, such as beam search, which keeps several candidate words alive at each step instead of a single one, thereby building multiple candidate sentences, and finally chooses the sentence with the highest overall probability.
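
To make the difference concrete, here is a toy illustration of one decoding step (this is not part of the original notebook, and the probabilities are made up):

import numpy as np

# Toy softmax output over a 5-word vocabulary at one decoding step
probs = np.array([0.05, 0.40, 0.30, 0.20, 0.05])

# Greedy search: commit to the single most likely word
greedy_choice = np.argmax(probs)                          # index 1

# Beam search with beam width 3: keep the top-3 candidates alive,
# extend each of them at the next step, and finally pick the full
# sequence with the highest overall probability
beam_width = 3
beam_candidates = np.argsort(probs)[-beam_width:][::-1]   # indices [1, 2, 3]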

Now that we have some idea of how the encoder-decoder system works, let's implement it. Implementing it in TensorFlow would be a better approach and is the next step in this tutorial, since in TensorFlow we don't have to feed the whole dataset to the model at once. Keras, however, is easy to work with, and you can build models quickly.

Here is the link to my GitHub repository containing the full notebook; I will explain a few of the important parts of the code here. The English-French translation dataset is provided here.

Let’s look at the data:

Eng to French dataset

After a bit of cleaning, we process the data as follows. We wrap each French sentence with 'START_' and '_END' tokens. The reason for this is to let the model distinguish where a sentence starts and where it ends. We also create vocabularies of all unique French and English words.

# Wrap every French sentence with start and end tokens
lines.fr = lines.fr.apply(lambda x: 'START_ ' + x + ' _END')

# Create vocabularies of unique words
all_eng_words = set()
for eng in lines.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

all_french_words = set()
for fr in lines.fr:
    for word in fr.split():
        if word not in all_french_words:
            all_french_words.add(word)

input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_french_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_french_words)
# del all_eng_words, all_french_words
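
The snippets below also use word-to-index lookups and maximum sentence lengths that are built in the full notebook but not shown above. Here is a minimal sketch, assuming the names match the notebook (input_token_index, target_token_index, max_lenght_eng, max_lenght_fr):

# Assumed from the full notebook: word -> index lookups and max sentence lengths
input_token_index = dict((word, i) for i, word in enumerate(input_words))
target_token_index = dict((word, i) for i, word in enumerate(target_words))
max_lenght_eng = max(len(sentence.split()) for sentence in lines.eng)
max_lenght_fr = max(len(sentence.split()) for sentence in lines.fr)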

Now comes the hard part: we initialize the placeholders (not TensorFlow placeholders, just NumPy arrays) for our input and output data. The encoder input data consists of the English sentences; since we will use embeddings, each sentence is actually stored as the indices of its words in the English vocabulary. The decoder input data is likewise the indices of the French words in each sentence.

Since the softmax layer predicts the next French word (out of the entire French vocabulary) given the previous words, the target data consists of ones and zeros: for each input sentence it is a matrix of shape max_French_sentence_length × French_vocabulary_size.

import numpy as np

encoder_input_data = np.zeros(
    (len(lines.eng), max_lenght_eng), dtype='float32')
decoder_input_data = np.zeros(
    (len(lines.fr), max_lenght_fr), dtype='float32')
decoder_target_data = np.zeros(
    (len(lines.fr), max_lenght_fr, num_decoder_tokens), dtype='float32')

# Generate the data
for i, (input_text, target_text) in enumerate(zip(lines.eng, lines.fr)):
    for t, word in enumerate(input_text.split()):
        encoder_input_data[i, t] = input_token_index[word]
    for t, word in enumerate(target_text.split()):
        decoder_input_data[i, t] = target_token_index[word]
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            # and does not include the START_ token.
            decoder_target_data[i, t - 1, target_token_index[word]] = 1.

We shift the target data by one timestep because we prepended 'START_' to the French sentences: given 'START_' as decoder input and the hidden state from the encoder, the model should predict the first real word of the French sentence. We compare its predictions against decoder_target_data, compute the loss, and calculate the gradients.
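
To make the shift concrete, here is a small hypothetical example (the sentence and the vocabulary indices are made up purely for illustration):

# Hypothetical French sentence after preprocessing:
#   'START_ je suis etudiant _END'
# with made-up vocabulary indices:
#   START_ -> 1, je -> 7, suis -> 12, etudiant -> 4, _END -> 2
#
# decoder_input_data[i]  = [1, 7, 12, 4, 2]               # begins with START_
# decoder_target_data[i] = one-hot rows for [7, 12, 4, 2] # shifted left by one,
#                                                         # no START_, ends with _END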

At test time the decoder input is not available, so we initialize the decoder with 'START_' and the encoder's hidden state to predict the first translated word; each predicted word is then fed back into the LSTM to generate the next French word, until we predict '_END', which means the sentence is complete. Note that we only start translating once the encoder has read the whole English sentence and its hidden-state outputs are available.

Let's build the model:

from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model
from keras.utils import plot_model

embedding_size = 50

# English word embeddings
encoder_inputs = Input(shape=(None,))
en_x = Embedding(num_encoder_tokens, embedding_size)(encoder_inputs)

# Encoder LSTM
encoder = LSTM(50, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))

# French word embeddings
dex = Embedding(num_decoder_tokens, embedding_size)
final_dex = dex(decoder_inputs)

# Decoder LSTM
decoder_lstm = LSTM(50, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# While training, the model takes English and French words and outputs
# the next translated French word
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# rmsprop is preferred for NLP tasks
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
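
Since plot_model is imported above but not used in the snippet, you can optionally visualize the architecture. A minimal sketch (the filename is arbitrary, and pydot plus graphviz need to be installed):

# Optional: save a diagram of the training model (requires pydot and graphviz)
plot_model(model, to_file='seq2seq_train.png', show_shapes=True)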

We will fit the model for 100 epochs. I know that is a lot, but sequence models take a while to train. Leave it running overnight on a p2.xlarge.

model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=128,
          epochs=100,
          validation_split=0.20)

Once the model is trained, let's make predictions:

# Define the encoder model
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

# Redefine the decoder model; at prediction time the decoder receives
# the states below from the encoder
decoder_state_input_h = Input(shape=(50,))
decoder_state_input_c = Input(shape=(50,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

final_dex2 = dex(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2,
                                                    initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

# The sampling model takes the decoder input (the seed initially) and the
# encoder states, and outputs the predicted French word index along with
# the updated decoder states
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

# Reverse-lookup token indices to decode sequences back to readable words
reverse_input_char_index = dict(
    (i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, word) for word, i in target_token_index.items())

Finally, the prediction function, which initializes the decoder with the seed token and then lets it predict the next word again and again.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1 and populate it
    # with the start token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token (greedy search: take the most likely word)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' ' + sampled_word

        # Exit condition: either hit max length or find the stop token.
        if (sampled_word == '_END' or
                len(decoded_sentence) > 52):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
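
As a quick check, you can run the function on a few training sentences. A minimal usage sketch, assuming the arrays built earlier are still in memory and lines has a default integer index:

# Try the model on a few sentences from the training set
for seq_index in range(5):
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('Input sentence:', lines.eng[seq_index])
    print('Decoded sentence:', decoded_sentence)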

OMG, that was long, but now our hard work pays off. Here are a few sampled outputs:

Translated sentences

Do they make sense? Yes, they do; you can check them on Google Translate, and they are actually good translations.

So that's it: you now have your own machine translation system. Try it with any other language pair.

For the full code, go to my GitHub page. Good luck, and stay tuned for the next post.

Further steps:

  1. Build it in TensorFlow with more data and a bigger vocabulary.
  2. Use hierarchical softmax to support a large vocabulary, and use attention for better translations.
  3. Go deeper.

Readings:

  1. https://arxiv.org/pdf/1609.08144.pdf
  2. https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
  3. https://arxiv.org/pdf/1708.01771.pdf
  4. https://arxiv.org/pdf/1409.3215.pdf

