Neural Machine Translation

Pawit Dev
9 min read · Apr 15, 2022


Machine translation is the task of translating a sentence from one language (the source language) into an equivalent sentence in another language (the target language).

https://www.intechopen.com/media/chapter/68953/media/F1.png

Development of Machine Translation

  1. Rule-based Machine Translation
    The first design of MT, based on the hypothesis that all languages use their own symbols to represent the same meaning. In this method, the translation process can be treated as word replacement in the source sentence.
    Weak points
    - it requires a great deal of linguistic knowledge.
    - it is impossible to write rules that cover an entire language.
  2. Statistical Machine Translation
    Unlike rule-based machine translation, SMT approaches the translation task from a statistical perspective. An SMT model finds the words (or phrases) that share the same meaning by gathering statistics over a bilingual corpus.
    The most prevalent version of SMT is phrase-based SMT (PBSMT), which in general includes pre-processing, sentence alignment, word alignment, phrase extraction, phrase feature preparation, and language model training. The key component of a PBSMT model is a phrase-based lexicon, which pairs phrases in the source language with phrases in the target language. The lexicon is built from the training data set, which is a bilingual corpus.
  3. Neural Machine Translation (NMT)
    Performs machine translation using neural networks. NMT models consist of two parts, called the encoder and the decoder. The encoder is responsible for generating a real-valued vector representation of the sentence, called the summary vector or context vector, which captures its important features. The decoder processes the context vector to generate the target-language sentence word by word.

Neural Machine Translation(NMT)

A traditional MLP predicts the next word from only one previous word. So… how can it predict the next word with correct grammar?

  1. RNN

Notation

  • H = hidden layer
  • Yt = output from the RNN at time step t
  • Xt = input at time step t (the current word)
  • Ht = hidden state at time step t
  • L = loss

An RNN consumes two inputs, the current word (Xt) and the hidden vector from the previous word (H(t-1)), feeds them through the neural network to get the current hidden vector (Ht), and uses Ht to predict the next word (Yt).
The current hidden vector (Ht) is then treated as an input to the next time step (t+1).

So, the information from all previous words is stored in the hidden vector. That means the next word (Yt) is derived from the information of all previous words, and the hidden state is updated as each word is processed.
The key point of the RNN is that it can capture long-range dependencies.
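As a rough illustration of this loop (a toy sketch, not code from the article; the sizes and weights below are made-up placeholders), a single vanilla-RNN step can be written in a few lines of numpy:

import numpy as np

# Toy dimensions and randomly initialized weights (illustrative placeholders only)
vocab_size, hidden_size = 10, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output
b_h, b_y = np.zeros(hidden_size), np.zeros(vocab_size)

def rnn_step(x_t, h_prev):
    # Combine the current word x_t with the previous hidden state H(t-1)
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)    # current hidden vector Ht
    logits = W_hy @ h_t + b_y
    y_t = np.exp(logits) / np.exp(logits).sum()        # softmax over next-word probabilities Yt
    return h_t, y_t

# Run over a toy one-hot encoded "sentence"; Ht is carried to the next time step
h = np.zeros(hidden_size)
for word_id in [1, 4, 7]:
    h, y = rnn_step(np.eye(vocab_size)[word_id], h)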

BUT the RNN suffers from the vanishing gradient problem. Imagine we have a 100-word prediction (N = 100) and we back-propagate through every step. The gradient reaching an early time step is a product of roughly N factors of the form ∂h(t)/∂h(t-1), and when these factors are smaller than 1 the product approaches zero. If the gradient has already vanished by the 20th step back (N = 20), every earlier step from 19 down to the first learns nothing more, because the gradient of the loss there is effectively 0 (as if no error had occurred).
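A back-of-the-envelope check of that claim (the 0.9 factor below is an arbitrary illustrative number, not taken from any model): if each backward step scales the gradient by a factor below 1, the product decays exponentially with the number of steps.

# Illustrative only: repeated multiplication by a factor < 1 drives the gradient toward zero
factor = 0.9
for n in (10, 20, 100):
    print(n, factor ** n)   # 10 -> ~0.35, 20 -> ~0.12, 100 -> ~2.7e-05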

2. LSTM
To solve the vanishing gradient problem of the RNN, the use of Long Short-Term Memory (LSTM) was proposed. The LSTM is a kind of RNN.

https://medium.com/@sinart.t/long-short-term-memory-lstm-e6cb23b494c6

The LSTM does not store every piece of information about the sentence in its state; rather, it learns to selectively forget and remember, using the concept of a gating mechanism.

Layers of the LSTM
- Forget Gate : decides what information from the previous memory cell to forget.
- Input Gate : decides which information the cell should remember.
- Candidate Memory : records the information about the current input.
- Output Gate : controls what to output in the hidden state at this point.
- Cell Memory : the cell memory updated after adding necessary information and removing unnecessary information.
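The following numpy sketch (toy sizes, randomly initialized weights, no biases, no training; an assumption-laden illustration, not the article's code) shows how these five pieces combine in one time step:

import numpy as np

hidden_size, input_size = 8, 10    # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def W():   # one random weight matrix acting on the concatenation [h_prev; x_t]
    return rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))

W_f, W_i, W_c, W_o = W(), W(), W(), W()

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)              # forget gate: what to drop from the old cell memory
    i = sigmoid(W_i @ z)              # input gate: what new information to write
    c_tilde = np.tanh(W_c @ z)        # candidate memory: content of the current input
    c_t = f * c_prev + i * c_tilde    # cell memory: keep the necessary, remove the unnecessary
    o = sigmoid(W_o @ z)              # output gate: what to expose in the hidden state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(np.zeros(input_size), h, c)   # one time step on a dummy input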

Preventing the error gradients from vanishing

Notice that the gradient contains the forget gate’s vector of activations, which allows the network to better control the gradients values, at each time step, using suitable parameter updates of the forget gate. The presence of the forget gate’s activations allows the LSTM to decide, at each time step, that certain information should not be forgotten and to update the model’s parameters accordingly.

3. Encoder-Decoder with Attention Mechanism
The attention mechanism helps decide which input positions to focus on, and how much, for the output prediction at the corresponding time step.

Attention weights : give a soft alignment, as they represent the probabilistic alignment of how words in the source language and the target language line up while the target words are generated.

Attention Is All You Need

This paper presents the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, it achieves a new state of the art.

Transformer

Sequence-to-Sequence model
- Produces a target sequence from a source sequence.
- Pros : capable of learning multiword expressions, moderate-distance dependencies, moderate reordering, and conceptualization.
- Cons : very large model (~100M parameters); needs a lot of data and high computing resources.

Encoder : Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network, with a residual connection around each of the two sub-layers, followed by layer normalization.

Decoder : Inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack, and modifies the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.

Attention : Maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values.

Scaled Dot-Product Attention : works like a kind of search engine. It measures the similarity of Q (Query) and K (Key) with a dot product to get weights in terms of probabilities, then multiplies those weights with V (Value) and sums the scaled values to get the attention output. With this attention we can capture the collocations of words.
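In code this is only a handful of matrix operations; here is a minimal numpy sketch of softmax(QKᵀ/√d_k)·V with arbitrary toy shapes (an illustration, not the article's implementation):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of each query with each key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns the scores into probabilities (the attention weights)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values
    return weights @ V

# Toy example: 3 queries, 4 key/value pairs, dimension 5 (arbitrary sizes)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5)), rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 5)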

Alignment attention : finds the collocations between the Query (target language) and the Key (source language).

Self-attention : the same computation as scaled dot-product attention, but the Query, Key, and Value are all derived from the same sentence, so the model finds the collocations of a sentence with itself.

Multi-head attention : each attention head learns a different collocation, because one sentence may contain more than one collocation while a single attention head can learn only one.
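A minimal sketch of that idea (toy shapes, random projections, and the final output projection omitted for brevity; an illustration of the concept, not the paper's exact implementation): each head projects the same input into its own Q/K/V subspace, so each head can attend to a different collocation.

import numpy as np

def attention(Q, K, V):
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, num_heads, rng):
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
        # Self-attention: Q, K, and V all come from the same input X
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    return np.concatenate(heads, axis=-1)    # concatenate the heads back to d_model

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                 # 6 tokens, d_model = 16 (toy sizes)
print(multi_head_self_attention(X, num_heads=4, rng=rng).shape)   # (6, 16)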

NMT th → en with pre-trained model (Helsinki-NLP/opus-mt-th-en)

The Transformer model is the state of the art for translation tasks. It has the strengths of all the earlier models, the capability to learn multiword expressions, moderate-distance dependencies, moderate reordering, and conceptualization, and its attention helps the model find collocations of words. It is also faster to train than an LSTM or RNN.

That’s why we are going to use a Transformer model in this task, with the pre-trained model (Helsinki-NLP/opus-mt-th-en).

Our pipeline :
- Cleaning the corpus with string methods and regex.
- Fine-tuning the pre-trained model (Helsinki-NLP/opus-mt-th-en).
- Using the Multilingual Universal Sentence Encoder (USE) to select the best translation.

In the class lecture: there is a technique called back translation.
When you lack data, this technique gains you a lot of extra data while causing only a small loss in model performance, based on this paper.

As you can see, noise in the source language causes a smaller drop in model performance than noise in the target language.

Back translation, also called reverse translation, is the process of re-translating content from the target language back to its source language in literal terms.
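For data augmentation, one rough way to set this up with Hugging Face transformers is sketched below. Note that the reverse-direction checkpoint name is a hypothetical placeholder (not verified to exist), and this is an assumed recipe, not the exact procedure from the lecture or the paper.

from transformers import MarianMTModel, MarianTokenizer

# Hypothetical en -> th checkpoint used to turn monolingual English text into
# synthetic Thai sources; the name below is a placeholder assumption.
reverse_name = "Helsinki-NLP/opus-mt-en-th"
tokenizer = MarianTokenizer.from_pretrained(reverse_name)
model = MarianMTModel.from_pretrained(reverse_name)

monolingual_en = ["The elephants are standing in the wind."]
batch = tokenizer(monolingual_en, return_tensors="pt", padding=True)
synthetic_th = tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

# Each (synthetic_th[i], monolingual_en[i]) pair can then be added to the th -> en training data.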

Cleaning the corpus with string methods and regex

First, import the necessary libraries and visualize our data.

import pandas as pd
import os
import numpy as np

# Read the parallel corpus: one English and one Thai sentence per line
en = []
with open("/content/lang.en", "r") as f:
    for x in f:
        en.append(x)

th = []
with open("/content/lang.th", "r") as f:
    for x in f:
        th.append(x)

# Pair the sentences up into a dataframe
df = pd.DataFrame(list(zip(en, th)), columns=['en', 'th'])
df

Clean ‘\n’

df['en'] = df['en'].str.replace("\n","")
df['th'] = df['th'].str.replace("\n","")

Change ‘&apos;’ → ‘ (the HTML entity for an apostrophe)

df['en'] = df['en'].str.replace("&apos;", "'")

and other…
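For example, a few further clean-up steps might look like the sketch below; these particular replacements are my own assumptions about what "other" covers, not an exhaustive list from the project.

import html

for col in ['en', 'th']:
    df[col] = df[col].apply(html.unescape)                   # decode remaining HTML entities (&quot;, &amp;, ...)
    df[col] = df[col].str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
    df[col] = df[col].str.strip()                            # trim leading/trailing spaces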

Fine-tuning the pre-trained model (Helsinki-NLP/opus-mt-th-en)

Import and set up our model

from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

model_args = Seq2SeqArgs()
model_args.num_train_epochs = 1
model_args.best_model_dir = '/content/'
model_args.n_gpu = 1
model_args.learning_rate = 1e-4
model_args.eval_batch_size = 64
model_args.train_batch_size = 64
model_args.num_beams = 5
model_args.max_length = 60
model_args.num_return_sequences = 3
model_args.use_multiprocessing = False

# Initialize a Seq2SeqModel for Thai-to-English translation
model = Seq2SeqModel(
    base_marian_model_name="Helsinki-NLP/opus-mt-th-en",
    encoder_decoder_type="marian",
    encoder_decoder_name="Helsinki-NLP/opus-mt-th-en",
    use_cuda=True,
    args=model_args,
    src_lang='tha_THA',
    trg_lang='en_EN'
)

Our data be like…

Training

model.train_model(data)

Testing

to_predict = [
"ช้าง ตากลม ยืนตาก ลม","ฝ้าย ชอบ ไป เที่ยว"
]

predictions = model.predict(to_predict)
predictions
>>>[['The elephants are standing in the wind.',
'The elephants were standing in the wind.',
'The elephants are standing at wind.'],
['Cottons like to go on trips.',
'Cotton likes to travel.',
'Cottons like to go on a trip.']]

As you can see, this model can predict and return multiple sequences.
HOW DO WE SELECT THE BEST TRANSLATION!?

Using the Multilingual Universal Sentence Encoder (USE) to select the best translation.

The Multilingual Universal Sentence Encoder (USE) is a language model from Google for sentence embedding (converting a sentence into a vector).

This model is trained on 16 languages, including Thai; in other words, sentences from all 16 languages are mapped into a single shared vector space.

How do we apply it to find the best translation?

  1. Do sentence embedding on the source sentence and each candidate translation
  2. Find the distance using cosine similarity
  3. Find a threshold on the cosine-similarity score to select a pair of translations
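A minimal sketch of these three steps, assuming the multilingual USE module published on TF Hub (the module URL, the example sentences, and the 0.7 threshold below are assumptions for illustration, not the exact values used in this project):

import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # registers the ops that the multilingual USE model needs

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

source_th = "ฝ้าย ชอบ ไป เที่ยว"
candidates = ["Cottons like to go on trips.", "Cotton likes to travel.", "Cottons like to go on a trip."]

# 1. Sentence embedding: source and candidates all land in the same vector space
vectors = embed([source_th] + candidates).numpy()
# 2. Cosine similarity between the Thai source and each English candidate
scores = [cosine(vectors[0], v) for v in vectors[1:]]
# 3. Keep the best candidate only if it passes a chosen threshold (e.g. 0.7)
best = candidates[int(np.argmax(scores))] if max(scores) > 0.7 else None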

Example of Good translation

Example of Bad translation

Conclusion

We have seen the development of NMT along with the pros and cons of each model, with a focus on the Transformer: its advantages over the RNN and LSTM, and its architecture.

We cleaned the corpus and used a Transformer model pre-trained on Hugging Face (Helsinki-NLP/opus-mt-th-en), then used the Multilingual Universal Sentence Encoder (USE) to select the best translation from the cosine-similarity score.

Pawit Dev

A Super AI Engineer student fascinated by AI technology such as NLP and computer vision, and especially by the fields of TTS and chatbots.