Transformer vs RNN and CNN for Translation Task

Yacine BENAFFANE · Published in Analytics Vidhya · Aug 13, 2019 · 9 min read
Illustration: http://www.ebisss.com/translation-interpretation-services-kuala-lumpur-kl.html

Prerequisites

For automatic translation with deep learning, one uses the sequence-to-sequence (Seq2Seq) model, built with architectures such as RNNs and CNNs and extended with an attention mechanism. The references at the end of this article can help you follow along.

Introduction

Google Brain and their collaborators have published a paper introducing a new architecture, the Transformer, based solely on attention mechanisms (see reference [1]). It surpasses every NMT model seen before, such as Google Neural Machine Translation (GNMT), better known as Google Translate.

The Transformer has reached a new state of the art in translation. Beyond major improvements in translation quality, it also enables many other natural language processing (NLP) tasks.

Current architectures

RNNs (and their LSTM variants) have long been established as the leading approach to sequence modeling and transduction problems such as language modeling and machine translation.

Currently, complex RNN and CNN architectures based on an encoder-decoder scheme dominate transduction models. Recurrent models cannot be parallelized across time steps during training because of their sequential nature, which makes learning long-term dependencies difficult. The network performs better with more memory, but memory constraints end up limiting batching when learning from long sequences, so parallelization cannot help. Reducing this fundamental constraint of sequential computation has been the goal of several published works, such as ByteNet and ConvS2S, which rely on convolution.

Figure: Sequence to Sequence model (Seq2Seq)

The Transformer's new approach is to eliminate recurrence and convolution entirely and replace them with self-attention to establish the dependencies between inputs and outputs. It is the first architecture to rely solely on attention to compute representations of its inputs and outputs. In addition, the Transformer leaves far more room for parallelization.

Attention mechanism (recall)

Intuitively, this mechanism allows the decoder to “look back” at the entire source sentence and selectively extract the information it needs during decoding.

Specifically, the attention mechanism creates connections between the hidden states (output vectors) of the encoder and the decoder, so that each target word is predicted from a combination of vectors rather than from the decoder's hidden state alone. In other words, it gives the decoder access to all of the encoder's hidden states.

Because the decoder must make a single prediction for the next word, the complete sequence of encoder states cannot be handed over as is; some kind of summary vector has to be transmitted instead. The key point is that the mechanism lets the decoder choose which hidden states to use and which to ignore by weighting them: the decoder receives a weighted sum of the encoder's hidden states and uses it to predict the next word. Only a small part of the source sentence is relevant for translating a given target word (see reference [6]).

The following figure shows a visualization of the attention mechanism. The transparency of each blue link represents how much attention the decoder gives to an encoded word.

Figure: Attention mechanism (https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/)
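To make the weighted-sum idea concrete, here is a minimal NumPy sketch of dot-product attention over encoder states. The shapes are toy values and the plain dot product as scoring function is my own simplification (real systems often learn an additive or multiplicative scoring function):

import numpy as np

def attention(decoder_state, encoder_states):
    """Weighted sum of encoder hidden states (dot-product attention)."""
    # Score each encoder state against the current decoder state.
    scores = encoder_states @ decoder_state            # shape: (src_len,)
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: the weighted sum the decoder receives.
    context = weights @ encoder_states                 # shape: (hidden_dim,)
    return context, weights

# Toy example: 5 source positions, hidden size 8 (numbers are made up).
encoder_states = np.random.randn(5, 8)
decoder_state = np.random.randn(8)
context, weights = attention(decoder_state, encoder_states)
print(weights.round(2))   # which source positions the decoder attends to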

Processing with RNNs

With an RNN, you have to move word by word to reach the cell of the last word. If the network is trained on long sequences, it may take many steps for information to be remembered, and each hidden state (the output vector at a word) depends on the previous hidden state. This becomes a major problem for GPUs: the sequentiality is an obstacle to parallelizing the computation. In addition, when sequences are too long, the model tends to gradually forget the contents of distant positions or to mix them up with the contents of later positions.

Figure: LSTM Architecture ( https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)
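The sequential bottleneck is easy to see in code. Below is a minimal vanilla-RNN forward pass (dimensions and weights are illustrative, not from any real model); the loop over time steps cannot be split across GPU cores, because each hidden state needs the previous one:

import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Vanilla RNN forward pass: each hidden state depends on the
    previous one, so the time loop cannot be parallelized."""
    hidden = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                     # strictly sequential over time
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

# Toy dimensions: 10 time steps, input size 4, hidden size 6.
T, d_in, d_h = 10, 4, 6
xs = np.random.randn(T, d_in)
hs = rnn_forward(xs, np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h))
print(hs.shape)   # (10, 6)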

Whenever long-term dependencies are involved, RNNs (LSTM/GRU) are known to suffer from the vanishing gradient problem.

The following visualization shows the progression of GNMT (Google Neural Machine Translation, i.e. Google Translate) when translating a Chinese sentence into English. The network encodes the Chinese words as a list of vectors, each vector representing the meaning of all the words read so far (the encoder). Once the whole sentence has been read, the decoder generates the English sentence word by word (the decoding step).

To generate each translated word, the decoder computes a weighted distribution over the encoded Chinese vectors and focuses on the most relevant ones. The transparency of each blue link represents how much attention the decoder gives to an encoded word (see reference [3]).

Figure: Processing with GNMT (RNN), see reference [3]

Processing with CNNs

Autoregressive models (which predict future values from past values) involve sequential computation, and this demands a lot of processing power. To reduce this cost, models such as ByteNet and ConvS2S use CNNs, which are easy to parallelize on graphics processors, something that is not possible with RNNs even though recurrence may seem better suited to sequence modeling.

Early efforts tried to solve the dependency problem by replacing the RNN with convolutional Seq2Seq models: a long sequence is taken and convolutions are applied over it. The drawback is that CNN approaches require many layers to capture long-term dependencies in sequential data, and either never quite succeed or make the network so large that it becomes impractical.
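As a rough sketch of why so many layers are needed, here is a toy causal 1-D convolution in NumPy (the kernel values, sequence length and layer count are made up for illustration): each layer widens the receptive field by only k-1 positions, so connecting distant positions requires a deep stack, even though every single layer is fully parallel.

import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: each output position is computed
    independently, so all positions can be processed in parallel."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])    # left-pad: no look-ahead
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

# With contiguous kernels the receptive field grows by (k - 1) per layer,
# so covering a sequence of length n needs on the order of n / k layers.
n, k = 32, 3
x = np.random.randn(n)
num_layers = int(np.ceil((n - 1) / (k - 1)))         # 16 layers for n=32, k=3
y = x
for _ in range(num_layers):
    y = causal_conv1d(y, np.ones(k) / k)             # illustrative fixed kernel
print(num_layers, y.shape)                           # 16 (32,)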

A model's ability to learn dependencies between positions decreases rapidly with distance, which makes it critical to find another approach that can handle such sequential data in parallel, and that is where the Transformer comes in.

Figure: Processing with a CNN (ConvS2S), see reference [4]

Decoders are generally trained to predict each word from all the words preceding it (the same holds for other architectures such as the Transformer), so only the encoder can be fully parallelized.

See the ConvS2S research paper for more information: https://arxiv.org/abs/1705.03122

Processing with the Transformer

The Transformer takes a new approach: it encodes each position and applies attention to connect any two words, however distant, in a computation that can be parallelized, thus accelerating training. The mechanism it applies is self-attention.

To compute attention, the Transformer compares each word with every other word in the sentence. The result of these comparisons is an attention score for each word of the sentence, and these scores determine how much each of the other words contributes to the next representation of the current word.
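Here is a stripped-down NumPy sketch of that comparison step. The real Transformer first projects the inputs into learned query, key and value matrices (W_Q, W_K, W_V) and uses several heads; this toy version omits the projections:

import numpy as np

def self_attention(X):
    """Single-head self-attention without learned projections:
    every word is compared with every other word, all in parallel."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # pairwise comparison scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ X                                 # new representation per word

# Toy sentence: 6 tokens, embedding size 16.
X = np.random.randn(6, 16)
print(self_attention(X).shape)   # (6, 16)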

Figure: Processing with the Transformer, see reference [2]

The decoder predicts each word based on all the words preceding it.

The Transformer also enables coreference resolution.

The idea of the Transformer is to extend the traditional attention mechanism: instead of computing attention once, it is computed several times in parallel (multi-head attention), which helps solve coreference as well as other problems. In the following figure, the Transformer resolves the reference correctly in both sentences, while the second sentence is not translated correctly by Google Translate.

Figure: Coreference resolution with Transformer, see reference [2]
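A toy illustration of the multi-head idea follows. The real model uses separate learned projections per head plus a final output projection; this sketch replaces them with simple slicing of the representation:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=4):
    """Compute attention several times on different slices of the
    representation, then concatenate the heads back together."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * d_head:(h + 1) * d_head]         # slice for this head
        scores = Xh @ Xh.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ Xh)
    return np.concatenate(heads, axis=-1)              # (seq_len, d_model)

X = np.random.randn(6, 16)
print(multi_head_attention(X).shape)   # (6, 16)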

For more information, consult the references at the end of this article.

Long-term dependencies

Learning long-range dependencies is a major challenge in many sequence transduction tasks. A key factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals must traverse in the network. The shorter these paths are between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

As sequence length increases, an RNN can still establish long-term dependencies, but its loss of information becomes severe. Attention over the RNN sequence does collect a lot of information, but with a great deal of overlap, and content at different positions of the sequence affects the final encoder output unequally. In a CNN, by contrast, information at different positions of the input sequence has the same effect on the encoder output (the Transformer is similar). To compensate for this, ConvS2S and the Transformer inject positional information directly into the model input.
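For the Transformer, this positional information is the sinusoidal encoding described in reference [1]. Here is a small NumPy version (the toy embedding values are random placeholders):

import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from reference [1]:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the token embeddings before the first layer.
embeddings = np.random.randn(6, 16)     # toy values
model_input = embeddings + positional_encoding(6, 16)
print(model_input.shape)   # (6, 16)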

In the Transformer, since self-attention is applied between every pair of words in the sequence regardless of their distance, the longest possible path between two positions is constant, which allows the system to capture distant dependency relationships.

There are basically three types of dependencies (see reference [6]), namely the dependencies between:

  • The input and output tokens.
  • The input tokens themselves.
  • The output tokens themselves.

The difficulties encountered with the Seq2Seq model

  • The total computational complexity per layer.
  • The amount of computation that can be parallelized, measured by the minimum number of sequential operations required.
  • The path length between long-range dependencies in the network.

Complexity comparison

Figure: Table of complexities, see reference [1]
  • n: the length of the sequence.
  • d: the dimension of the representation (512 or 1024 in general).
  • k: the kernel size of the convolutions.
  • r: the size of the neighborhood in restricted self-attention.

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case for the sentence representations used by recent machine translation models. To improve computational performance on tasks with very long sequences, self-attention can be restricted to a neighborhood of size r in the input sequence centered around the respective output position. This increases the maximum path length to O(n/r).
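Plugging in some illustrative numbers makes the n < d argument concrete (the values of n, d, k and r below are assumptions for the sake of example, not taken from the paper):

# Illustrative per-layer operation counts, using the complexity formulas
# from the table above and assumed typical values.
n, d, k, r = 50, 512, 3, 10   # sequence length, model dim, kernel size, neighborhood

self_attention  = n * n * d        #  1,280,000
recurrent       = n * d * d        # 13,107,200
convolutional   = k * n * d * d    # 39,321,600
restricted_attn = r * n * d        #    256,000

print(self_attention < recurrent)  # True: self-attention wins when n < d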

A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.

Convolutional layers are generally more expensive than recurrent layers by a factor of k, i.e. their complexity is O(k·n·d²).

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions, which increases the length of the longest path between two positions in the network.

Quality comparison

Figure: Translation results of different systems on English-to-French and English-to-German (WMT 14, newstest2014), see reference [1]

Experiments on two translation tasks (a large number of sentence pairs is translated, 36 million for English-French, and a translation quality score is computed) showed that these models produce better-quality translations while computing in parallel and requiring far less training time than the other models. Quality is estimated with the BLEU score (for more information about BLEU: https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213). On the WMT 2014 English-to-French translation task, the Transformer sets a new state-of-the-art BLEU score of 41.8. It took 3.5 days of training on 8 GPUs, whereas GNMT takes about a week.
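As a quick illustration of how a BLEU score can be computed in practice, here is a toy example using NLTK (the library choice and the sentences are my own, not from the article):

# Toy BLEU computation with NLTK (assumes `pip install nltk`); sentences are made up.
from nltk.translate.bleu_score import sentence_bleu

reference  = [["the", "transformer", "is", "a", "powerful", "model"]]
hypothesis =  ["the", "transformer", "is", "a", "powerful", "architecture"]

print(sentence_bleu(reference, hypothesis))   # value between 0 and 1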

Conclusion

We saw how powerful the Transformer is compared to RNNs and CNNs for translation tasks. It has defined a new state of the art and provides a solid foundation for many future deep learning architectures that use the self-attention mechanism: GPT-2 and XLNet are two examples of models built on it.

Github

For more information, you can see my repo: https://github.com/Styleoshin/Transformer

Reference

[1] Paper “Attention Is All You Need”: https://arxiv.org/abs/1706.03762

[2] https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

[3] https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

[4] https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/#.XUG73eiiG73

[5] https://androidkt.com/attention-base-transformer-for-nlp/

[6] http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/

Reviewers

-Okba LeftHanded: https://medium.com/@OkbaLeftHanded_18875

-Amine Horse-man : https://medium.com/@AmineHorseman
