Neural Machine Translation Using seq2seq model with Attention.

Word level English to Marathi language translation using Bidirectional-LSTM with attention mechanism.

Aditya Shirsath
Geek Culture
6 min read · Jun 16, 2021


(Animation source: Author)


In this article we are going to discuss a very interesting topic in natural language processing (NLP): Neural Machine Translation (NMT) using an attention model. Machine translation is simply the automatic translation of text from one language to another.

Here we will learn how to use the sequence-to-sequence (seq2seq) architecture with Bahdanau’s attention mechanism for NMT.


This article assumes that you understand the following:-

Before going through the code, we will briefly discuss the Bidirectional LSTM and the attention mechanism.


If you understand LSTMs, then the bidirectional variant is quite simple. A bidirectional network can be built from a simple RNN (Recurrent Neural Network), a GRU (Gated Recurrent Unit) or an LSTM (Long Short-Term Memory). I am going to use LSTM in this article.

  • The forward layer is our regular LSTM layer, while the backward LSTM is a layer whose flow runs in the backward direction (it reads the sentence from the last word to the first).
  • At each time step the input is passed to both the forward and the backward layer.
  • The output at each time step is the combination of both cells’ outputs (forward and backward layer). Therefore, for prediction the model will have knowledge of the next words too.
Bidirectional LSTM (Image source: Author)
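To make the three points above concrete, here is a minimal NumPy sketch. It is only an illustration: it uses a plain tanh RNN cell instead of an LSTM to stay short, and the sizes, random weights and function names (`rnn_pass`, `bidirectional_pass`) are my own assumptions, not code from the model built later.

```python
import numpy as np

def rnn_pass(inputs, W_x, W_h):
    """Run a plain tanh RNN over inputs of shape (time, features).

    Returns the hidden state at every time step, shape (time, units).
    A simple RNN cell is used here as a stand-in for an LSTM cell.
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_pass(inputs, fwd, bwd):
    """Concatenate forward and backward hidden states at every time step."""
    h_fwd = rnn_pass(inputs, *fwd)
    # The backward layer reads the sequence reversed; its outputs are flipped
    # back so that index t refers to the same word in both arrays.
    h_bwd = rnn_pass(inputs[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=-1)

def random_weights(rng, units, features):
    """Small random weights (hypothetical values, for illustration only)."""
    return (0.5 * rng.normal(size=(units, features)),
            0.5 * rng.normal(size=(units, units)))

rng = np.random.default_rng(0)
time_steps, features, units = 5, 4, 3
fwd_weights = random_weights(rng, units, features)
bwd_weights = random_weights(rng, units, features)
sentence = rng.normal(size=(time_steps, features))  # 5 "word vectors"

outputs = bidirectional_pass(sentence, fwd_weights, bwd_weights)
print(outputs.shape)  # (5, 6): 3 forward units + 3 backward units per word
```

The backward half of `outputs[0]` has already "seen" every later word, which is exactly the extra context a forward-only layer lacks.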

Why do we need Bidirectional?

  • In any language, the next words in a sentence can affect the meaning of previous words. Example: (1) “Harry likes apple because he works there.” and (2) “Harry likes apple and it is healthy.”
  • In the first sentence ‘apple’ means the company and in the second ‘apple’ means the fruit. We can say this because we know the next words: in the first, ‘apple’ depends on ‘works’, and in the second it depends on ‘healthy’. A regular RNN/LSTM has only a forward layer, so it has no information about the words that come later in the sequence, and without the full context of the sentence the model might not predict the right words.
  • A Bidirectional network has both a forward and a backward layer, so the model has information about both previous and next words, and with the full context of the sentence it will predict better.

Bahdanau’s Attention:-

Here I assume you already know about attention. If not, go through this paper, and if you prefer to watch a video I would recommend Krish Naik’s video on YouTube.

Video from Krish Naik’s YouTube channel.

Now why use Attention?

To predict a word, some of the previous or next words are important, as we saw in the Bidirectional discussion. But which words are more important? To find out, we use the attention model to assign an importance (weight) to each word. The model can then concentrate more on the words that have more importance.

Attention Model (Image source: Author) (Content source: Bahdanau’s Attention)

Here is how attention works:-

  • After getting the output Z (in the attention image), which is the concatenation of the forward and backward hidden states [h, h`], the first step is to calculate the attention weights (α).
  • To calculate α we need the score (e), which is calculated using the formula in the image above. The score is based on the decoder’s (LSTM) hidden state (before predicting y) and the output of the encoder Z. Applying softmax to the scores gives the weights α, which sum to 1.
  • Then the context vector (Ct) is computed as the sum of the encoder outputs (Z) weighted by the attention weights (α).
  • The output y is generated by the decoder using the context vector (Ct) as input.
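The steps above can be sketched in a few lines of NumPy. This is only a sketch under assumptions: it uses the additive score from Bahdanau’s paper, e = vᵀ·tanh(W_s·s + W_z·z), and the weight names (`W_s`, `W_z`, `v`), the sizes and the random values are all made up for illustration. In the real model these weights are learned inside the attention layer.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bahdanau_attention(s_prev, Z, W_s, W_z, v):
    """Additive (Bahdanau-style) attention for one decoder step.

    s_prev: (dec_units,)    decoder hidden state before predicting y
    Z:      (T, enc_units)  encoder outputs, one row [h, h`] per source word
    Returns the context vector Ct (enc_units,) and the weights alpha (T,).
    """
    # Score e_j for every source position j.
    scores = np.tanh(s_prev @ W_s + Z @ W_z) @ v
    # Attention weights: softmax of the scores, so they sum to 1.
    alpha = softmax(scores)
    # Context vector: encoder outputs weighted by alpha.
    context = alpha @ Z
    return context, alpha

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(1)
T, enc_units, dec_units, attn_units = 4, 6, 6, 5
Z = rng.normal(size=(T, enc_units))
s_prev = rng.normal(size=dec_units)
W_s = rng.normal(size=(dec_units, attn_units))
W_z = rng.normal(size=(enc_units, attn_units))
v = rng.normal(size=attn_units)

context, alpha = bahdanau_attention(s_prev, Z, W_s, W_z, v)
print(alpha)          # one weight per source word
print(context.shape)  # (6,)
```

A source word with a larger α contributes more to Ct, which is what "concentrating" on a word means in practice.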


Download dataset for translation from

I am using an English-to-Marathi translation dataset, but you can download any other language pair; you will just have to make some changes in the code.


First we need to understand what we need for translation.

  • We need sentences in one language and the respective sentences in the other language.
  • There might be some contractions in the English text which can confuse our model: it will see “can’t” and “can not” as different words, so we will expand all contractions.
  • Characters like commas and dots, as well as digits, are not useful in translation, so we’ll drop all of them.

Now in the following code we’ll do the whole cleaning process and save the data.

Note:- For the contraction function to work you need a contractions-expansions dictionary, which you can download from here. This file contains 125+ contractions.
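As a rough idea of what the cleaning step looks like, here is a small sketch. The `CONTRACTIONS` dictionary below is a tiny hypothetical stand-in for the downloaded 125+ entry file, the function names are my own, and lowercasing the text is an extra assumption I make so that dictionary lookups match.

```python
import re
import string

# Tiny hypothetical stand-in for the real contractions-expansions dictionary.
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}

def expand_contractions(text, mapping=CONTRACTIONS):
    """Replace every whitespace-separated word found in the mapping."""
    return " ".join(mapping.get(word, word) for word in text.split())

def clean_sentence(text):
    """Lowercase, expand contractions, then drop digits and ASCII punctuation."""
    text = text.lower().replace("’", "'")  # normalise curly apostrophes
    text = expand_contractions(text)
    text = re.sub(r"\d+", "", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse leftover spaces

print(clean_sentence("I can't come at 5, sorry."))  # i can not come at sorry
```

Note that `string.punctuation` only covers ASCII characters, so the target-language side may need a few extra characters added to the removal list.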

Prepare Data For Model:-

  • Add SOS (Start Of String) and EOS (End Of String) tokens to the sentences of the target language. Thanks to these tokens the target sentences can have a different length from the input sentences, and they also help the decoder to start and stop predicting.
  • Note:- All inputs (i.e. English sentences) need to have the same length and all targets (i.e. Marathi sentences) need to have the same length, but the two lengths can differ from each other.
  • Tokenize:- Neural networks do not accept text as input, so we’ll have to convert it into numbers. To do so we will use TensorFlow’s Tokenizer.
  • This tokenizer is very helpful: from it we can get the frequency of words and dictionaries of word to index & index to word, which will be used to convert words into numbers (for training) and numbers into words (for prediction).
  • Padding:- Neural networks also need inputs (i.e. sentences) of the same length, so we’ll pad the English and Marathi sentences with ‘0’ up to the length of the longest sentence of the respective language.
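To show what tokenizing and padding actually produce, here is a plain-Python sketch of the idea. It is not the TensorFlow Tokenizer itself: the helper names are hypothetical, ties between equally frequent words are broken alphabetically (my own rule), and the zeros are added at the end of each sentence, which is an assumption since padding can also go at the front.

```python
from collections import Counter

def add_sos_eos(sentence):
    """Wrap a target sentence in start and end tokens."""
    return f"sos {sentence} eos"

def build_word_index(sentences):
    """Map each word to an integer; frequent words get smaller indices.

    Index 0 is left unused so that it can serve as the padding value.
    """
    counts = Counter(word for s in sentences for word in s.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_padded_sequences(sentences, word_index):
    """Convert sentences to integer lists, zero-padded to the longest one."""
    seqs = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]

english = ["go", "i am happy", "i am here now"]
word_index = build_word_index(english)
index_word = {i: w for w, i in word_index.items()}  # needed at prediction time

print(word_index)
# {'am': 1, 'i': 2, 'go': 3, 'happy': 4, 'here': 5, 'now': 6}
print(to_padded_sequences(english, word_index))
# [[3, 0, 0, 0], [2, 1, 4, 0], [2, 1, 5, 6]]
print(add_sos_eos("i am happy"))  # sos i am happy eos
```

The same three functions would be applied separately to the target sentences (after `add_sos_eos`), so each language gets its own dictionary and its own maximum length.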

Build Model:-

We will first build the encoder, then the decoder with the attention layer.

Note:- I experimented with many different parameters and the following worked best for me; you can try other parameters. Try changing the embedding output dimension or the number of LSTM units, or even add more LSTM layers. After all, machine learning is all about experimenting.


  • As we discussed previously, we will use a Bidirectional LSTM in the encoder,
  • which will learn patterns in the input language (i.e. English).
  • We will use both the encoder’s outputs and its states (context vector [h, c]). Taking the states is a little different in a Bidirectional layer because it has forward and backward states, so we have to consider both (i.e. we will concatenate them).
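Here is a minimal Keras sketch of an encoder along these lines. Treat it as an illustration rather than the exact model of this article: the sizes (`vocab_size`, `emb_dim`, `units`, `max_len`) are small hypothetical values, and it assumes TensorFlow 2.x is installed.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes, chosen small so the sketch runs quickly.
vocab_size, emb_dim, units, max_len = 50, 8, 16, 6

enc_inputs = layers.Input(shape=(max_len,))
enc_emb = layers.Embedding(vocab_size, emb_dim)(enc_inputs)

# With return_state=True a Bidirectional LSTM returns five tensors:
# the per-step outputs, then (h, c) of the forward layer and (h, c)
# of the backward layer.
enc_out, fwd_h, fwd_c, bwd_h, bwd_c = layers.Bidirectional(
    layers.LSTM(units, return_sequences=True, return_state=True)
)(enc_emb)

# Concatenate forward and backward states so both directions are kept.
state_h = layers.Concatenate()([fwd_h, bwd_h])
state_c = layers.Concatenate()([fwd_c, bwd_c])

encoder = tf.keras.Model(enc_inputs, [enc_out, state_h, state_c])

# Run a dummy batch of 2 padded sentences through the encoder.
dummy = np.zeros((2, max_len), dtype="int32")
out, h, c = encoder(dummy)
print(out.shape, h.shape, c.shape)
```

Because the concatenated states have `2 * units` values, a decoder LSTM that is initialised with them needs `2 * units` units as well.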


  • In the decoder we use only an LSTM (not a Bidirectional one).
  • Attention layer:- I have borrowed the code for attention from [here]. You can put it in a file or simply download the file from here. We are using Bahdanau’s attention as the attention layer.

Here is our model plot-


Now, let’s get to training our attention model.

Here I got 95.04% accuracy on the validation set with a loss of 0.32.

Inference Model:-

We use this model to predict output sequences using the weights of the pre-trained model.

Here we cannot just call model.predict() as with other ML and DL models, because in our case the encoder model learns features of the input sentence and the decoder simply takes the encoder’s states and predicts word by word using the decoder inputs. So for prediction we have to follow the same process.
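The word-by-word process can be sketched as a simple loop. This is a sketch under assumptions: `encode_fn` and `step_fn` are hypothetical placeholders for the trained encoder and for one step of the decoder (which would normally return the most probable next word), and the toy versions below only exist so that the loop can run on its own.

```python
def greedy_decode(encode_fn, step_fn, source_ids, sos_id, eos_id, max_len):
    """Translate one sentence by predicting one token at a time.

    encode_fn(source_ids) -> (encoder_outputs, initial_state)
    step_fn(prev_token, state, encoder_outputs) -> (next_token, new_state)
    """
    enc_outputs, state = encode_fn(source_ids)
    token = sos_id  # the decoder always starts from the SOS token
    result = []
    for _ in range(max_len):
        token, state = step_fn(token, state, enc_outputs)
        if token == eos_id:  # the decoder signals that it is done
            break
        result.append(token)
    return result

# Toy stand-ins for the trained models (illustration only).
SOS, EOS = 1, 2

def toy_encode(source_ids):
    return list(source_ids), 0  # "state" is just a position counter here

def toy_step(prev_token, state, enc_outputs):
    if state >= len(enc_outputs):
        return EOS, state
    return enc_outputs[state] + 10, state + 1

print(greedy_decode(toy_encode, toy_step, [3, 4, 5], SOS, EOS, max_len=10))
# [13, 14, 15]
```

The `max_len` limit matters in practice: it stops the loop if the decoder never produces EOS.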


Our model will predict numbers, one word at a time, so we need a function to convert them into a sentence in the target language.
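A conversion function of this kind can be as small as the sketch below. The function name and the toy vocabulary are hypothetical; in practice the `index_to_word` dictionary would be the index-to-word dictionary of the target-language tokenizer, and the words would be Marathi.

```python
def ids_to_sentence(ids, index_to_word, skip=("sos", "eos")):
    """Turn predicted indices back into a sentence.

    Padding (index 0), unknown indices and the SOS/EOS tokens are dropped.
    """
    words = []
    for i in ids:
        if i == 0:  # padding value
            continue
        word = index_to_word.get(i)
        if word is None or word in skip:
            continue
        words.append(word)
    return " ".join(words)

# Toy vocabulary with placeholder words, for illustration only.
toy_index_to_word = {1: "sos", 2: "eos", 3: "hello", 4: "world"}
print(ids_to_sentence([1, 3, 4, 2, 0, 0], toy_index_to_word))  # hello world
```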


Finally, let’s translate our sentences from English to Marathi.

Here are some of my results:-

WOW!!!!! We have got some AMAZING results.

End Note:-

If you haven’t already used an encoder-decoder model without attention, you can check out my article.


Bahdanau’s Attention research paper here.


