A Bouquet of Sequence to Sequence Architectures for Implementing Machine Translation
In this article, we will discuss various possible sequence to sequence architectures for implementing Machine Translation. Even though the primary problem we will tackle is Machine Translation, the same architectures, with slight modifications, also apply to other Machine Learning use-cases such as, but not limited to:
· Text Summarisation — a model to produce a summary of the input text
· Question Answering — a model to produce an answer to an input question
· Dialog — a model to generate the next dialogue/utterance in the sequence
· Document Classification — a model to classify the input document as sports, politics, finance, etc.
Thus, mastering the art of designing sequence to sequence architectures for Machine Translation would additionally arm you to tackle any of the above use-cases effortlessly.
Introduction
Machine Translation, as the name suggests, is the task of building a machine learning model that converts text from one language to another. In this article, we will look at translating English sentences to French. I would grade this article as fairly advanced, so I expect you to have foundational knowledge of Recurrent Neural Networks, including Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells, as we will use these as building blocks for constructing the Sequence to Sequence Architectures. A prior understanding of foundational Natural Language Processing concepts viz. Word Embedding, Tokenization, Vocabulary, Corpus, etc. will also help you navigate confidently through this article. Without further ado, let's dive into the interesting world of Sequence to Sequence Architectures.
The following Sequence to Sequence Architectures will be covered in the article.
· Classic Many to Many Architecture
· Many to Many Architecture with Embedding & Bidirectional Layers
· Encoder — Decoder Architecture
I will be supplementing the text with self-doodled architecture diagrams and code snippets to ensure that I do fair justice to explaining these concepts.
Classic Many to Many Architecture
The Classic Many to Many Architecture is the simplest seq2seq architecture and the easiest to comprehend. It produces the same number of outputs as inputs. This is a limitation, because a translated sentence rarely has the same number of words as the input sentence, so the classic architecture, fraught with this limitation, is not expected to perform as well for Machine Translation as the other architectures. You can create multiple variations of the classic architecture by using different kinds of recurrent cells (viz. RNN, LSTM or GRU), by increasing the depth of the hidden layers, or by varying the number of dimensions in the RNN layer.
The below doodle represents a simple classic seq2seq architecture.
Doodle: Classic Many to Many Architecture
Code Snippet: Classic Many to Many Architecture
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, TimeDistributed

def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 0.01
    # One recurrent layer followed by a time-distributed softmax over the French vocabulary
    model = Sequential([
        SimpleRNN(256, input_shape=input_shape[1:], return_sequences=True),
        TimeDistributed(Dense(french_vocab_size, activation='softmax'))
    ])
    model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adam(learning_rate),
                  metrics=['accuracy'])
    return model

# Pad the English sentences to the French sequence length so outputs match inputs one-to-one
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

simple_rnn_model = simple_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
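The snippet above (and the ones that follow) relies on a pad helper and on preprocessed arrays such as preproc_english_sentences, preproc_french_sentences, the vocabulary sizes and the maximum sequence lengths, none of which are defined in this article. Below is a minimal, hedged sketch of what such helpers could look like using the Keras preprocessing utilities; the raw english_sentences / french_sentences lists and everything else in the sketch are assumptions, not the exact code behind the snippets.

# A minimal sketch of the assumed preprocessing helpers; this is just one plausible
# version built with the Keras preprocessing utilities, not the article's exact code.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenize(sentences):
    # Fit a word-level tokenizer and convert each sentence into a list of token ids
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer.texts_to_sequences(sentences), tokenizer

def pad(sequences, length=None):
    # Pad (or truncate) every sequence to a common length for fixed-size tensors
    return pad_sequences(sequences, maxlen=length, padding='post')

# english_sentences / french_sentences are assumed to be lists of raw strings
preproc_english_sentences, english_tokenizer = tokenize(english_sentences)
preproc_french_sentences, french_tokenizer = tokenize(french_sentences)

max_english_sequence_length = max(len(s) for s in preproc_english_sentences)
preproc_french_sentences = pad(preproc_french_sentences)
max_french_sequence_length = preproc_french_sentences.shape[1]
# Add a trailing dimension so the labels have shape (samples, length, 1)
preproc_french_sentences = preproc_french_sentences.reshape(*preproc_french_sentences.shape, 1)

english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)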
Many to Many Architecture with Embedding & Bidirectional Layers
The Classic Many to Many Architecture can be further bolstered by including an embedding layer and a bidirectional RNN. The embedding layer converts the input sentence tokens into multidimensional embeddings. An embedding can be thought of as a vectorized word representation in which words with similar semantic meanings lie closer to each other in the vector space. Models with embedding layers are known to perform better on Natural Language Processing tasks.
A bidirectional layer is an extension of a simple RNN in which the cell states flow in both directions (forwards as well as backwards), as represented in the accompanying doodle. It is based on the notion that, besides the current input word, both the past and the future words in the sequence have a bearing on the current translation (or on any other Natural Language Processing task).
Both these enhancements, the embedding layer as well as the bidirectional RNN, are expected to have a positive impact on the output of the model, as illustrated in the short sketch below.
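For a quick feel of what these two layers do, here is a minimal sketch showing how they transform the tensor shapes; the vocabulary size, sequence length and dimensions used below are arbitrary.

# Minimal sketch: how the Embedding and Bidirectional layers change tensor shapes.
# The vocabulary size, sequence length and dimensions below are arbitrary.
from tensorflow.keras.layers import Input, Embedding, Bidirectional, GRU

tokens = Input(shape=(21,), dtype='int32')             # a padded sentence of 21 token ids
embedded = Embedding(200, 256)(tokens)                 # -> (None, 21, 256): one 256-d vector per token
contextual = Bidirectional(GRU(256, return_sequences=True))(embedded)
print(contextual.shape)                                # (None, 21, 512): forward & backward outputs concatenated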
The below doodle represents a simple classic seq2seq architecture with embedding & bidirectional RNN layers.
Doodle: Many to Many Architecture with Embedding & Bidirectional Layers
Code Snippet: Many to Many Architecture with Embedding & Bidirectional Layers
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, TimeDistributed, Embedding, Bidirectional

embedding_dim = 256

def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 0.005
    # The embedding layer turns token ids into dense vectors; the bidirectional GRU reads
    # the sequence in both directions before the time-distributed softmax
    model = Sequential([
        Embedding(english_vocab_size + 1, embedding_dim, input_length=input_shape[1], input_shape=input_shape[1:]),
        Bidirectional(GRU(256, return_sequences=True)),
        TimeDistributed(Dense(french_vocab_size, activation='softmax'))
    ])
    model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tmp_x = pad(preproc_english_sentences, max_french_sequence_length)

embed_rnn_model = embed_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size)
embed_rnn_model.summary()
embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
Encoder Decoder Architecture
The Encoder Decoder Architecture addresses the limitations of the Classic Many to Many Architecture. It is able to generate French sentences of a length different from that of the input English sentences, thus overcoming the limitation of the input & output sentences having the same length. Another limitation of the classic architecture is that it translates one word of the sentence at a time, which sometimes leads to imprecise translations. A more natural way to translate is the big bang approach: read the whole sentence at once and then translate it all together. The encoder decoder architecture uses the latter approach; it first encodes the input sentence into an encoded vector, which in turn is used to generate the output sentence. Caveat: the encoder decoder architecture works efficiently only for shorter sentences, as retaining all the information of a lengthy sentence in a single encoder vector might not be feasible; some loss of information is inevitable. This problem is resolved by the Attention mechanism, which I plan to cover in a future article.
The below doodle represents an Encoder Decoder Architecture.
Doodle: Encoder Decoder Architecture
Code Snippet: Encoder Decoder Architecture
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed, Embedding

embedding_dim = 256

def encdec1_model(encoder_input_shape, decoder_input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 0.005

    # Encoder: embed the English sentence and keep only the final LSTM states
    encoder_inputs = Input(shape=encoder_input_shape[1:])
    en_x = Embedding(english_vocab_size + 1, embedding_dim)(encoder_inputs)
    _, state_h, state_c = LSTM(french_vocab_size, return_state=True)(en_x)
    encoder_states = [state_h, state_c]

    # Decoder: embed the French sentence and initialise the LSTM with the encoder states
    decoder_inputs = Input(shape=decoder_input_shape[1:])
    dec_x = Embedding(french_vocab_size + 1, embedding_dim)(decoder_inputs)
    decoder_outputs = LSTM(french_vocab_size, return_sequences=True)(dec_x, initial_state=encoder_states)
    decoder_outputs = TimeDistributed(Dense(french_vocab_size, activation='softmax'))(decoder_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tmp_x = pad(preproc_english_sentences, max_english_sequence_length)
tmp_y = pad(preproc_french_sentences, max_french_sequence_length)
tmp_y = tmp_y.reshape((-1, preproc_french_sentences.shape[-2]))

encdec1_rnn_model = encdec1_model(tmp_x.shape, tmp_y.shape, max_french_sequence_length, english_vocab_size, french_vocab_size)
encdec1_rnn_model.summary()
encdec1_rnn_model.fit([tmp_x, tmp_y], preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
As you would have noted from the doodle, the Decoder part (the RNN on the right side) is similar to the simple RNN architecture, while the Encoder part (the RNN on the left side) is something extra that we have not seen before. Rather than using English sentences as input, the Decoder takes in French sentences, whereas the Encoder takes in the English sentences. The main tenet of the Encoder Decoder Architecture is that the current prediction depends not only on the input English sentence, but also on the French sentence predicted so far. But there is a catch: during inference, we do not have the French translations available to us. Thus, during inference, instead of inputting actual French sentences, we feed each Decoder unit with the output of the previous Decoder unit. The first Decoder unit is fed <bos> or <go> (a tag indicating the beginning of the sentence) along with the cell states from the last Encoder unit (both the long term & short term memory states if we used LSTM as the RNN unit). Output words continue to be generated until the algorithm predicts <eos> or the maximum French sentence length is reached. The maximum French sentence length is configurable, but typically we set it to the maximum length of the French sentences used during training.
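The training snippet above feeds the actual French sentences into the Decoder, so it does not show this inference loop. Below is a minimal, hedged sketch of greedy decoding; it assumes that an encoder_model and a decoder_model have been split out of the trained network and that the French tokenizer keeps <bos> and <eos> as tokens, none of which appears in the snippet above.

# A minimal sketch of greedy decoding at inference time. It assumes that
# encoder_model (English sequence -> final LSTM states) and decoder_model
# (previous token + states -> next-word probabilities + new states) have been
# split out of the trained network, and that <bos> / <eos> survive tokenization.
import numpy as np

def translate(english_sequence, encoder_model, decoder_model, french_tokenizer, max_french_len):
    # Encode the whole English sentence into the final encoder states
    states = encoder_model.predict(english_sequence)

    # Start decoding with the <bos> token
    target_token = np.array([[french_tokenizer.word_index['<bos>']]])
    translated_words = []

    for _ in range(max_french_len):
        probabilities, state_h, state_c = decoder_model.predict([target_token] + states)
        predicted_id = int(np.argmax(probabilities[0, -1, :]))
        word = french_tokenizer.index_word.get(predicted_id, '')
        if word == '<eos>':                           # stop once the end-of-sentence tag is predicted
            break
        translated_words.append(word)
        target_token = np.array([[predicted_id]])     # feed the prediction back as the next input
        states = [state_h, state_c]                   # carry the decoder states forward

    return ' '.join(translated_words)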
You must have noted that we used Sparse Categorical Cross Entropy as the loss function during training, and that at inference time we greedily pick the most probable word at each step. Decoding techniques like Beam Search are known to produce better translations than greedy decoding but are not very straightforward to implement. A discussion of Beam Search is outside the scope of the current article.
Evaluation Metrics
Even though I have not presented any evaluation metrics so far, one technique that is quite popular for Machine Translation is the BLEU score, which is based on how many matching unigrams, bigrams, trigrams, etc. can be found between the actual human translation and the translation predicted by the model.
BLEU scores for evaluating the results of the various architectures can be generated using the NLTK library, as shown in the below code snippet:
from nltk.translate.bleu_score import sentence_bleu
import statistics as stats

# Compute a sentence-level BLEU score for each of the first 100 predicted translations,
# using the corresponding human French translations as references
bleu_score_list = [sentence_bleu([modified_french_sentences[100000 + x]],
                                 logits_to_text(final_pred[x][0], french_tokenizer))
                   for x in range(100)]
print(stats.mean(bleu_score_list))
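To build some intuition for what BLEU rewards, here is a toy example with made-up token lists: the candidate that shares more n-grams with the human reference scores higher.

# Toy example (made-up sentences): more shared n-grams with the reference -> higher BLEU
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['il', 'fait', 'froid', 'en', 'hiver']]
smooth = SmoothingFunction().method1   # avoids zero scores when a higher n-gram order has no match

print(sentence_bleu(reference, ['il', 'fait', 'froid', 'en', 'hiver'], smoothing_function=smooth))  # close to 1.0
print(sentence_bleu(reference, ['il', 'fait', 'chaud', 'en', 'ete'], smoothing_function=smooth))    # much lower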
Postscript
Please note that I have deliberately not explained the preprocessing of the input text in this article, as it is a vast subject in itself. However, please keep in mind that you will almost always be required to implement text preprocessing steps like tokenizing sentences, padding them, converting all letters to lower case, removing punctuation, lemmatisation, stemming, etc. before you embark on Machine Translation. You might want to grab any beginner's guide to Natural Language Processing to grasp these concepts.
I will strive to upload the accompanying curated code to GitHub in the near future so that you can download it on your computer & experiment with the various seq2seq architectures.
I do hope that you enjoyed this article — I’ll be really grateful if you are able to leave a rating / feedback below. Please do let me know if you come across any other interesting seq2seq architectures not already covered in this article; I would be more than happy to include them in a future publication.
Credits:
- Co-authored by Anuj Kumar (https://www.linkedin.com/in/anujchauhan/)
- Aurelien Geron — Hands-On Machine Learning with Scikit-Learn & TensorFlow
- V Kishore Ayyadevara — Neural Networks with Keras Cookbook
- Udacity — Natural Language Processing Nanodegree, Deep Learning Nanodegree
- Coursera — TensorFlow in Practice & Deep Learning Specialization
- Jason Brownlee — https://machinelearningmastery.com/calculate-bleu-score-for-text-python/