Neural Machine Translation Using Sequence to Sequence Model

Word level English to Marathi language translation using encoder decoder LSTM Model.

Aditya Shirsath
Geek Culture
6 min read · Jun 7, 2021


1) Encoder-Decoder (Animation Source: Author)


Machine translation is one of the earliest and most challenging tasks for computers because of the fluidity of human language. It is simply the automatic translation of text from one language to another.

In this article we will briefly discuss the encoder-decoder architecture and then walk through the code for neural machine translation. This is going to be a fun ride.


Before we go ahead, you should know about the following:

  • Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM).
  • Sequence-to-sequence architecture (encoder-decoder).

Encoder Decoder in short:-

2) Working of encoder-decoder (Animation source: Author) (Note:- SOS → Start Of String, EOS → End Of String)

  • In a sequence-to-sequence model the main parts are the encoder and the decoder.


The encoder can be any network such as an RNN, LSTM, GRU or convolutional neural net, but since we are using this seq2seq model for language translation, both the encoder and the decoder should be models that can handle sequential input. We will use LSTM.

The encoder takes the input and learns patterns in it. We keep only what it has learned, namely its final states [h, c] (hidden state and cell state, the brown rectangle in the animation above), and pass them as the initial state of the decoder. We don't use the encoder's output at each time step.


The decoder, with the encoder's states as its initial state, predicts one word at a time. The important part to understand is that, unlike the encoder, the decoder works differently during training and testing.

During Training:- We use a technique called teacher forcing, which makes training faster and more efficient. Now, what is teacher forcing?

Teacher forcing:- It is a strategy for training a neural network in which the ground-truth word from the previous time step is fed as the input to the current decoder step, instead of the word the decoder itself predicted.

To visualize teacher forcing, watch animation (2) above carefully at the decoder's end. At each time step we pass the actual output as input. The predicted and actual words look the same in the animation only because our model is so good😆; even if the model predicts a wrong word, we still pass the correct word as input.

During Testing:- The input to each time step is the decoder's predicted output from the previous time step.

The decoder needs special tokens to know where a sentence starts and ends. So we add SOS (Start Of String) and EOS (End Of String) tokens at the start and end of every sentence in the target language (Marathi in our case). Because the decoder keeps predicting until it emits EOS, it can produce a sentence of a different length than the encoder's input.
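To make the role of the two tokens concrete, here is a minimal sketch of how one target sentence could be split into decoder input and decoder target for teacher forcing. The token spellings `<sos>` and `<eos>` and the helper name `teacher_forcing_pair` are assumptions made for illustration, not necessarily what the original code uses.

```python
SOS, EOS = "<sos>", "<eos>"  # assumed token spellings


def teacher_forcing_pair(target_sentence):
    """Return (decoder_input, decoder_target) word lists for one sentence."""
    tokens = [SOS] + target_sentence.split() + [EOS]
    decoder_input = tokens[:-1]   # what the decoder is fed at each step
    decoder_target = tokens[1:]   # what it should predict at each step
    return decoder_input, decoder_target


dec_in, dec_out = teacher_forcing_pair("मी ठीक आहे")
for fed, expected in zip(dec_in, dec_out):
    print(f"input: {fed} -> expected output: {expected}")
```

The target is simply the input shifted one step to the left, which is why the decoder always sees the correct previous word during training.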


We have 41,028 English sentences and their respective Marathi translations. The source I got this data from also offers translation datasets for many more languages.


First we need to understand what we need for translation.

  • We need sentences in one language and the respective sentences in the other language.
  • There might be contractions in the English text that confuse our model: it would treat words like “can’t” and “can not” as different, so we will expand all contractions.
  • Characters like commas and full stops are not useful for translation, so we’ll drop all of them.
  • Digits are also not needed.

Now, in the following code, we’ll do the whole cleaning process and save the data.

For the contraction function to work you need a contractions-expansions dictionary, which you can download from here. This file contains 125+ contractions.
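As a rough sketch of the cleaning steps listed above (expand contractions, drop punctuation, drop digits), the English side could be handled as follows. The four-entry `CONTRACTIONS` dictionary is a hypothetical stand-in for the full 125+ entry file, and the function names are assumptions.

```python
import re
import string

# Tiny hypothetical stand-in for the 125+ entry contractions dictionary.
CONTRACTIONS = {"can't": "can not", "i'm": "i am", "don't": "do not", "it's": "it is"}


def expand_contractions(text, mapping=CONTRACTIONS):
    """Replace each known contraction with its expansion, word by word."""
    return " ".join(mapping.get(word, word) for word in text.split())


def clean_english(text):
    text = text.lower().replace("’", "'")                 # normalise curly apostrophes
    no_apostrophe = string.punctuation.replace("'", "")
    text = text.translate(str.maketrans("", "", no_apostrophe))  # drop , . ! ? etc.
    text = expand_contractions(text)                       # needs apostrophes intact
    text = text.replace("'", "")                           # drop leftover apostrophes
    text = re.sub(r"\d+", "", text)                        # drop digits
    return " ".join(text.split())                          # collapse repeated spaces


print(clean_english("I can't swim 100 metres."))
print(clean_english("Don't worry, it's fine!"))
```

Punctuation other than the apostrophe is removed first so that a contraction followed by a comma or full stop still matches its dictionary key.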

Preparing data for encoder decoder:-

Here is how we are going to prepare the data for the encoder-decoder model:

  • Add SOS and EOS tokens to the Marathi sentences.
  • Get all unique English and Marathi words, create the respective vocabularies and sort them.
  • Using the sorted word lists, give each word a number and build a dictionary, which is very helpful for converting words into numbers. We have to convert words into numbers because neural nets don’t accept text as input.
  • Lastly, split the data into train and test sets.
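The four steps above can be sketched on a handful of toy sentence pairs. The token spellings, the 75/25 split ratio and all variable names are assumptions for illustration; index 0 is left unused so it can serve as the padding value.

```python
import random

SOS, EOS = "<sos>", "<eos>"  # assumed token spellings

# Toy cleaned pairs standing in for the real 41,028 sentence pairs.
pairs = [
    ("i am fine", "मी ठीक आहे"),
    ("come here", "इकडे ये"),
    ("i am here", "मी इथे आहे"),
    ("go home", "घरी जा"),
]

# 1) Wrap every Marathi sentence in SOS/EOS.
pairs = [(eng, f"{SOS} {mar} {EOS}") for eng, mar in pairs]

# 2) Sorted unique words per language.
eng_words = sorted({w for eng, _ in pairs for w in eng.split()})
mar_words = sorted({w for _, mar in pairs for w in mar.split()})

# 3) Word <-> index dictionaries; numbering starts at 1, 0 is kept for padding.
eng_word2idx = {w: i + 1 for i, w in enumerate(eng_words)}
mar_word2idx = {w: i + 1 for i, w in enumerate(mar_words)}
mar_idx2word = {i: w for w, i in mar_word2idx.items()}

# 4) Shuffle and split into train and test (ratio is an assumption).
random.Random(42).shuffle(pairs)
split = int(0.75 * len(pairs))
train_pairs, test_pairs = pairs[:split], pairs[split:]

print(eng_word2idx)
print(len(train_pairs), "train /", len(test_pairs), "test")
```

The reverse dictionary `mar_idx2word` is what later turns predicted numbers back into Marathi words.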

Data Generator:-

This is very important. Why do we need a data generator?

As given in the Keras tutorial, we have to convert the data into a 3D tensor of shape [batch size, timesteps, features]. In our case batch size, timesteps and features are the number of sentences, the maximum sentence length and the number of unique words respectively.

With this, the Marathi tensor for the whole dataset has shape [41028, 37, 13720], which will consume a lot of memory.
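A quick back-of-the-envelope calculation shows how large that tensor is. The only assumption here is that each value is stored as a 4-byte float32.

```python
num_sentences, max_len, vocab_size = 41028, 37, 13720  # shape quoted above
bytes_per_value = 4                                    # assuming float32

elements = num_sentences * max_len * vocab_size
gigabytes = elements * bytes_per_value / 1e9

print(f"{elements:,} values")
print(f"{gigabytes:.1f} GB as float32")
```

That is roughly 20.8 billion values, or about 83 GB, which is far more than the RAM of a typical machine.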

To work more efficiently we’ll send the data in batches, so we will create a data batch generator. These data generators were developed by the Keras team.
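A minimal sketch of such a batch generator, written with NumPy, is shown here. The function name `generate_batch`, its parameters and the toy dictionaries are assumptions, not the original code. Only one batch at a time is one-hot encoded, so memory use depends on the batch size and not on the dataset size.

```python
import numpy as np


def generate_batch(X, y, src_word2idx, tgt_word2idx,
                   max_src_len, max_tgt_len, num_tgt_tokens, batch_size=128):
    """Yield ((encoder_input, decoder_input), decoder_target) batches forever."""
    while True:
        for start in range(0, len(X), batch_size):
            xs, ys = X[start:start + batch_size], y[start:start + batch_size]
            encoder_input = np.zeros((len(xs), max_src_len), dtype="int32")
            decoder_input = np.zeros((len(xs), max_tgt_len), dtype="int32")
            decoder_target = np.zeros((len(xs), max_tgt_len, num_tgt_tokens),
                                      dtype="float32")
            for i, (src, tgt) in enumerate(zip(xs, ys)):
                for t, word in enumerate(src.split()):
                    encoder_input[i, t] = src_word2idx[word]
                tgt_words = tgt.split()
                for t, word in enumerate(tgt_words):
                    if t < len(tgt_words) - 1:
                        decoder_input[i, t] = tgt_word2idx[word]  # everything but EOS
                    if t > 0:
                        # target is one step ahead of the input and skips SOS
                        decoder_target[i, t - 1, tgt_word2idx[word]] = 1.0
            yield (encoder_input, decoder_input), decoder_target


# Toy usage: one sentence pair, index 0 reserved for padding.
src_idx = {"am": 1, "fine": 2, "i": 3}
tgt_idx = {"<sos>": 1, "<eos>": 2, "मी": 3, "ठीक": 4, "आहे": 5}
batches = generate_batch(["i am fine"], ["<sos> मी ठीक आहे <eos>"],
                         src_idx, tgt_idx, max_src_len=4, max_tgt_len=6,
                         num_tgt_tokens=6, batch_size=1)
(enc, dec), target = next(batches)
print(enc, dec, target.shape)
```

Note that `num_tgt_tokens` has to be the vocabulary size plus one, because index 0 is taken by padding.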

Build Encoder-Decoder LSTM :-

Before we get to the coding part, we have to understand a few things-


We are going to pass the inputs of both the encoder and the decoder through an embedding layer first and then through an LSTM layer.

The embedding layer takes numbers (word indices) as input and converts each one into a vector with a given number of dimensions. But why do we need to do so? → The answer is that this preserves semantic information about words, meaning similar words will be closer to each other.

Example: the word ‘man’ will be closer to ‘woman’, and ‘dog’ will be closer to ‘cat’. It simply means the vector for ‘man’ will have numbers more similar to the vector for ‘woman’ than to the vectors for ‘dog’ and ‘cat’.

To learn more in detail check this out.

I experimented with different values for the embedding size and the number of LSTM units. For me the following worked best, but you can try different values; machine learning is all about experimenting.


  • In the encoder, pass encoder-input-data and take the states of the encoder’s last time step as the context vector [h, c].
  • Why set mask_zero=True in the embedding? When we generated the input arrays in the generator, we padded them with zeros to bring them to the maximum length. mask_zero tells the model to mask out (ignore) those zeros.


  • Now decoder-input-data is passed into the decoder embedding.
  • The LSTM layer’s initial states are the encoder’s final states.
  • Teacher forcing:- Here the input at each time step is the actual (ground-truth) word for the previous time step, not the decoder’s own prediction.
  • Get the output by applying softmax, which converts numbers into probabilities.
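Putting the encoder and decoder points together, a training model along these lines could be sketched in Keras as follows. All sizes (`num_encoder_tokens`, `num_decoder_tokens`, `embedding_dim`, `latent_dim`) are small placeholder values chosen for illustration, not the values used in the original experiments, and the optimizer and loss are assumptions that fit one-hot targets.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Placeholder sizes (assumptions); use vocabulary size + 1 for the padding index 0.
num_encoder_tokens = 50
num_decoder_tokens = 60
embedding_dim = 16
latent_dim = 32

# Encoder: embed the word indices, keep only the final states [h, c].
encoder_inputs = Input(shape=(None,), name="encoder_inputs")
enc_emb = Embedding(num_encoder_tokens, embedding_dim, mask_zero=True)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: starts from the encoder states and returns an output at every step.
decoder_inputs = Input(shape=(None,), name="decoder_inputs")
dec_emb = Embedding(num_decoder_tokens, embedding_dim, mask_zero=True)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

# Softmax over the target vocabulary turns scores into word probabilities.
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Teacher forcing does not appear as a layer here. It comes from the data: the decoder input is the ground-truth sentence, and the target is the same sentence shifted by one step.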


Train the model with some callbacks for 30 epochs. And don’t forget to save the model’s weights.

Inference Model:-

We use this model to predict output sequences using the weights of the trained model.

Here we cannot just call model.predict() as with other ML and DL models, because in our case the encoder learns features from the input sentence, and the decoder takes the encoder’s states and predicts word by word from its own inputs. So for prediction we have to follow the same process.


One last thing to discuss: the model predicts a vector of numbers, one word at a time, so we have to write a function that builds sentences from the predicted numbers.
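The word-by-word prediction process can be sketched as a greedy decoding loop. To keep the example runnable without a trained model, `encode` and `decode_step` are hypothetical callables standing in for the inference encoder and decoder, and the toy `script` dictionary fakes their behaviour; all of these names are assumptions.

```python
def decode_sequence(input_seq, encode, decode_step, idx2word,
                    sos_idx, eos_word, max_len=50):
    """Greedy decoding: feed each predicted word back in as the next input.

    encode(input_seq)            -> states
    decode_step(word_idx, states) -> (list of word probabilities, new states)
    """
    states = encode(input_seq)
    word_idx = sos_idx                 # decoding always starts from SOS
    words = []
    for _ in range(max_len):           # hard stop in case EOS never appears
        probs, states = decode_step(word_idx, states)
        word_idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = idx2word[word_idx]
        if word == eos_word:
            break
        words.append(word)
    return " ".join(words)


# Toy stand-ins so the loop can run without a trained model.
idx2word = {1: "<sos>", 2: "<eos>", 3: "मी", 4: "ठीक", 5: "आहे"}
script = {1: 3, 3: 4, 4: 5, 5: 2}      # previous word index -> next word index


def fake_encode(seq):
    return None


def fake_decode_step(word_idx, states):
    probs = [0.0] * 6
    probs[script[word_idx]] = 1.0
    return probs, states


result = decode_sequence([3, 1, 2], fake_encode, fake_decode_step,
                         idx2word, sos_idx=1, eos_word="<eos>")
print(result)
```

With real models, `decode_step` would run the decoder for a single time step and return the updated [h, c] states along with the softmax output.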


Now we can get results with simple Python code:-

Here are some of my results:-

Hurray!! We got some amazing results.

End Note:-

The encoder-decoder LSTM’s accuracy decreases as sentence length increases (see the last prediction in my results), because we are using only the state of the encoder’s last LSTM cell (the context vector). It’s like memorizing a whole book and then translating it, so lower accuracy is to be expected.

So in the next article we will try an attention model.



