In this blog post, you are going to see
- a brief overview of Machine Translation
- the evolution of machine translation, from rule-based systems in the 1950s to neural network-based systems in the 2020s

The other parts of this series are:
Part 2: Pretrained Language Models for Neural Machine Translation
Part 3: Neural Machine Translation using EasyNMT Library
Overview of Machine Translation
Machine Translation is an NLP task (specifically, an NLG task) that involves translating a text sequence in the source language into a text sequence in the target language.
As shown in the figure, the model translates a sentence from English to Telugu. Here, the source language is English and the target language is Telugu. Telugu is the fourth most spoken language in India.
The best-known example of an MT system is Google Translate. As shown in the figure, we can use Google Translate to translate text, a complete document, or even a web page. Google Translate supports more than 120 languages. On the top left, first select the option (Text, Documents, or Websites) depending on whether you want to translate a piece of text, a document, or a web page. On the left side, choose the source language, and on the right side, choose the target language. Once you enter the text, upload the document, or specify the web page URL, you will get the translated version in the desired target language.
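As a quick preview of Part 3, the sketch below shows how the same kind of translation can be done programmatically with the EasyNMT library. The model name, language codes, and example sentence are illustrative choices, and whether a given language pair is available depends on the underlying model.

```python
# A minimal sketch of programmatic translation with the EasyNMT library.
# 'm2m_100_418M' is one of the multilingual models EasyNMT exposes; the
# language pair and example sentence here are illustrative.
from easynmt import EasyNMT

model = EasyNMT('m2m_100_418M')

english_text = "Machine translation converts text from one language to another."
# Translate English ('en') into Telugu ('te'), assuming the model supports the pair
telugu_text = model.translate(english_text, source_lang='en', target_lang='te')
print(telugu_text)
```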
Evolution of Machine Translation (from Rule-based to Neural Network-based)
The evolution of MT systems started with rule-based systems in the 1950s. Rule-based systems are difficult to develop and maintain because they involve framing a large number of rules and exceptions. In the 1990s, these rule-based systems were replaced by Statistical Machine Translation (SMT) systems. Although SMT systems are better than rule-based systems, they are not end-to-end; they are based on statistical models whose parameters are estimated from the analysis of bilingual text corpora.
With the success of deep learning models in other NLP tasks, Neural Machine Translation (NMT) systems based on seq2seq models started to replace SMT systems. Unlike SMT systems, NMT systems are end-to-end, i.e., a single model receives the text sequence to translate and generates the translated text sequence. By the end of 2016, SMT systems had been replaced with NMT systems in companies like Google. The figure shows the evolution of NMT systems from 2014 to the present.
A Seq2Seq model consists of an encoder and a decoder. The encoder is a sequential deep learning model such as an RNN, LSTM, or GRU, and the decoder is likewise an RNN, LSTM, or GRU. First, the encoder processes the input tokens sequentially. The hidden vector from the last input time step is treated as the aggregate representation of the input sequence.
With this aggregate vector as input, the decoder sequentially generates the translated text sequence. The main drawback of this architecture is the use of a fixed vector from the last time step as the aggregate representation: [a] a single fixed vector cannot represent the entire input sequence information, and [b] at each time step in the decoder, the output token depends only on certain input tokens. To overcome this information bottleneck, the attention mechanism was introduced in Seq2Seq models.
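Here is a minimal sketch of such an encoder-decoder setup in PyTorch with GRUs and greedy decoding. The vocabulary sizes, dimensions, and dummy inputs are illustrative assumptions, not part of any particular NMT system.

```python
# A minimal sketch of a GRU-based seq2seq model over token ids.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.gru(self.embedding(src))
        # 'hidden' is the vector from the last input time step: the fixed
        # aggregate representation that causes the information bottleneck.
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):       # prev_token: (batch, 1)
        output, hidden = self.gru(self.embedding(prev_token), hidden)
        return self.out(output), hidden          # logits: (batch, 1, vocab_size)

# Greedy decoding: feed the encoder's final hidden state into the decoder
# and generate target tokens one step at a time.
SRC_VOCAB, TGT_VOCAB, SOS_ID, MAX_LEN = 1000, 1200, 1, 10
encoder, decoder = Encoder(SRC_VOCAB), Decoder(TGT_VOCAB)

src = torch.randint(0, SRC_VOCAB, (1, 7))        # a dummy source sentence
_, hidden = encoder(src)
token = torch.tensor([[SOS_ID]])
translation = []
for _ in range(MAX_LEN):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)                # pick the most likely next token
    translation.append(token.item())
print(translation)
```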
The attention layer helps the decoder focus on selected input tokens at each time step. Although the attention mechanism improved results, sequential deep learning models (RNN, LSTM, and GRU) still have inherent drawbacks, such as the vanishing gradient problem and the inability to take full advantage of the parallel processing power of advanced hardware like GPUs and TPUs. To overcome these drawbacks, the transformer, based on the self-attention mechanism, was proposed.
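The sketch below shows one way such an attention step can be computed, using simple dot-product scores over the encoder outputs; the shapes and the scoring function are illustrative assumptions rather than the only possible choice.

```python
# A minimal sketch of dot-product attention over encoder outputs, assuming
# the encoder and decoder hidden sizes match.
import torch
import torch.nn.functional as F

def attention(decoder_hidden, encoder_outputs):
    # decoder_hidden:  (batch, hidden_dim)           current decoder state
    # encoder_outputs: (batch, src_len, hidden_dim)  one vector per source token
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)               # focus on selective source tokens
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights                          # context: (batch, hidden_dim)

# Example: 1 sentence, 7 source tokens, hidden size 256
enc_out = torch.randn(1, 7, 256)
dec_hidden = torch.randn(1, 256)
context, weights = attention(dec_hidden, enc_out)
print(weights)   # one weight per source token, summing to 1
```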
In the transformer model, the source text sequence is encoded using a stack of encoder layers. The output vectors from the final encoder layer represent the source tokens enriched with rich contextual information through the self-attention mechanism. At each time step, the decoder receives these output vectors from the encoder and generates the target tokens autoregressively.
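The sketch below illustrates this encode-once, decode-autoregressively pattern with PyTorch's built-in nn.Transformer. The vocabulary sizes, special token ids, and the omission of positional encodings are simplifications for illustration only.

```python
# A minimal sketch of transformer-based translation with greedy autoregressive
# decoding. Positional encodings and training are omitted for brevity.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL, SOS_ID, EOS_ID, MAX_LEN = 1000, 1200, 512, 1, 2, 20

src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
transformer = nn.Transformer(d_model=D_MODEL, num_encoder_layers=6,
                             num_decoder_layers=6, batch_first=True)
generator = nn.Linear(D_MODEL, TGT_VOCAB)

src = torch.randint(0, SRC_VOCAB, (1, 8))            # dummy source token ids

# Encode once: the stack of encoder layers enriches every source token
# with contextual information via self-attention.
memory = transformer.encoder(src_emb(src))

# Decode autoregressively: at each step the decoder attends to the encoder
# memory and to the target tokens generated so far.
tgt = torch.tensor([[SOS_ID]])
for _ in range(MAX_LEN):
    tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))
    out = transformer.decoder(tgt_emb(tgt), memory, tgt_mask=tgt_mask)
    next_token = generator(out[:, -1]).argmax(dim=-1, keepdim=True)
    tgt = torch.cat([tgt, next_token], dim=1)
    if next_token.item() == EOS_ID:
        break
print(tgt)                                            # generated target token ids
```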
This blog post was originally published on my personal website. Feel free to connect with me through Twitter or LinkedIn.
I’m Katikapalli Subramanyam Kalyan (Kalyan KS for short), an NLP researcher with 5+ years of academic research experience. Apart from research papers in top-tier medical informatics journals and in EMNLP and AACL-IJCNLP workshops, I have written two survey papers on transformer-based pretrained language models, which have received 35+ citations, including citations from top-tier institutes like the University of Oxford, the University of Texas, Michigan State University, NTU Singapore, and IIT Madras.