Why Do Transformers Yield Superior Sequence-to-Sequence (Seq2Seq) Results?

Ankit Singh
Published in Saarthi.ai
May 28, 2019

We are all familiar with the classic Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM), which were long regarded as the go-to architectures for translation. Not even two years ago, LSTM-based architectures formed the base of nearly every NLP (Natural Language Processing) challenge.

But on 12 Jun 2017, Google rocked the research community with their catchy, groundbreaking paper “Attention Is All You Need”, introducing the Transformer, a neural network with zero recurrence that relies solely on the attention mechanism. It has since become a seminal paper in the deep learning literature. Originally introduced for machine translation, the architecture has spread into many other application areas.

In this article, we’re going to dive deep into the mighty Transformer, dissecting its architecture and comparing it with traditional LSTM-based approaches to see how it outperforms them.

Let’s get started!

“Attention is all you need” — Understanding the mechanism

We are going to look into why we need this so-called “attention” in the first place.

Traditional seq2seq model

Before we move on to understanding attention, let’s first understand why we need it. Let’s begin with the concept of Machine Translation (MT), which is simply the task of mapping one sentence to another. Since a sentence is made up of words, this amounts to mapping one sequence to another sequence, which is what’s known as sequence-to-sequence (Seq2Seq) modelling.

Previously, translation was done with seq2seq models consisting of an encoder and a decoder, both built from Recurrent Neural Networks. The architecture had to memorize the entire sentence in the source language and then regurgitate it in the target language.

This process works fine for shorter sentences, but as the length of the sentence increases, performance slumps: it is difficult for a recurrent neural network to memorize a long sentence. The attention mechanism helps us overcome this difficulty. Let’s see how.

What is Attention?

Now, let us think of how a human being who understands multiple languages translates between them. They take a sentence, break it into parts, and then translate the parts, rather than memorizing the whole sentence and then translating it.

Attention models, just like humans, translate a part of the sentence at a time, which makes them much more efficient than any RNN-based seq2seq model.

Attention is one of the most influential ideas in deep learning. The main idea behind the technique is that it allows the decoder to “look back” at the complete input and extract the information that is most useful for decoding.

Then what‘s wrong with RNNs?

We learned previously that attention helps solve the problems incurred when using RNNs for seq2seq modelling. RNNs have a few flaws which the Google Transformer addresses and solves. The shortcomings are as follows:

1. The first flaw of RNNs is their sequential nature: each hidden state depends on the output of the previous hidden state. This is a huge problem for GPUs, which have enormous computational power and end up sitting idle while they wait for the previous step’s output to become available. This makes RNNs a poor fit for GPU acceleration, even with optimized implementations such as cuDNN.

2. The second is long-range dependencies. We know that, theoretically, LSTMs can possess long-term memory, yet memorizing information over a long span remains a challenge in practice.

There is another problem that I will explain with an example. Let’s take a look at the following sentences –

“I am fatter than him.”

“I have other work than to write this.”

Both sentences above use “than” in two different ways and in different contexts. Conventional attention models can handle dependencies between input and output tokens by giving the decoder access to the entire input sequence. But what happens when the dependencies are between the input tokens themselves, or between the output tokens themselves?

The long-range dependency of RNN.

Self-Attention

Now, let’s discuss the idea of self-attention. Self-attention, also known as intra-attention, is an attention operation over a single sequence, used to compute a representation of that same sequence. The concept has proved very useful in NLP tasks such as text summarization, machine translation and image description generation.

Image source: Cheng et al., 2016

In the above example, we see that self-attention helps the model learn the dependency between the current word and the preceding part of the same sentence.

Dissecting the Mighty Transformer

Photo by Arseny Togulev on Unsplash

Let’s roll back a bit to understand the basic concept of Transformers.

We are now familiar with the RNN’s shortcoming: it is not well versed in handling dependencies between the input or output tokens themselves.

To handle this flaw, the Transformer just allows the encoder and decoder to see the entire input sequence all at once, directly modelling these dependencies using self-attention.

The path length is now independent of the length of the source and target sentences.

This fundamental idea of the Transformer is implemented by its vital component, the Multi-Head Attention block.

Key, Value and Query

Before understanding the major component of the transformer let’s first understand the basic unit of data flow.

The Transformer views the encoded representation of the input as a set of key-value pairs (K, V), both of dimension n (the input sequence length); in the context of Machine Translation, the encoder hidden states serve as both the keys and the values.

In the decoder, the previous output is compressed into a query (Q, of dimension m), and the next output is produced by mapping this query against the set of keys and values.
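To make the dimensions concrete, here is a minimal NumPy sketch of these three objects; the variable names and sizes are illustrative, not taken from the paper’s code:

```python
import numpy as np

n, m, d_model = 7, 5, 512            # source length, target length, model dimension (illustrative)
K = np.random.randn(n, d_model)      # keys: one vector per source position (from the encoder)
V = np.random.randn(n, d_model)      # values: one vector per source position (from the encoder)
Q = np.random.randn(m, d_model)      # queries: one vector per target position (from the decoder)
```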

Scaled Dot-Product Attention

The Transformer uses scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The basic attention mechanism is simply a dot product between the query and the key. The size of the dot product tends to grow with the dimensionality of the query and key vectors, so the Transformer re-scales it by √d_k to prevent it from exploding into huge values.
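As a rough illustration, here is a minimal NumPy sketch of scaled dot-product attention; it is a simplified stand-in for the paper’s formulation, not the authors’ code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output: (m, d_v)."""
    d_k = Q.shape[-1]
    # Score every query against every key, re-scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                               # shape (m, n)
    # Softmax over the keys turns the scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is the attention-weighted sum of the values
    return weights @ V
```

With the Q, K and V arrays sketched earlier, scaled_dot_product_attention(Q, K, V) returns one context vector per query, each a weighted mixture of the values.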

Multi-Head Attention

In the Transformer, the attention mechanism works by retrieving a relevant set of information, known as the values, based on a set of keys and queries. This helps the model focus on the information most pertinent to what it is currently processing.

The encoder uses the embeddings of the source sentence for its keys, values and queries, whereas the decoder uses the output of the encoder for its keys and values, and the embeddings of the target sentence for its queries.

Let’s again take the sentence “I like cats more than dogs”. We might want to capture the comparison between the two entities in the sentence while also retaining the entities themselves. With a single attention-weighted sum of the values, it would be difficult to capture these different aspects of the input.

For instance, when decoding into Japanese, the query is the word being decoded (“犬”, which means dog) and both the keys and values come from the source sentence. The attention score represents relevance, and in this case it is large for the word “dogs” and small for the others.

Now, this is the issue which the Transformer solves using the Multi-Head Attention block. This block computes multiple attention-weighted sums instead of a single attention pass over the values, hence the name “Multi-Head” Attention.

To learn from a variety of representations, Multi-Head Attention applies different learned linear transformations to the keys, values and queries for each “head” of attention, as shown in the figure below.

Image source: Vaswani, et al., 2017
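Continuing the NumPy sketch from above, a hedged illustration of the multi-head idea might look as follows; in a real model the projection matrices are learned, whereas here they are random purely to show the shapes:

```python
def multi_head_attention(Q, K, V, num_heads=8, d_model=512, seed=0):
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Each head applies its own linear transformation to queries, keys and values
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) * 0.02 for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    # Concatenate the per-head outputs and apply a final output projection
    W_o = rng.standard_normal((d_model, d_model)) * 0.02
    return np.concatenate(heads, axis=-1) @ W_o
```

Each head can thus attend to a different aspect of the input (the comparison, the entities, and so on), and the final projection merges them back into a single representation.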

The Encoder

The encoder generates an attention-based representation with the capability to locate a specific piece of information from a potentially infinitely-large context.

Its architecture mainly consists of:

· A stack of N = 6 identical layers.

· Each layer has two sub-layers: a multi-head self-attention layer and a simple, position-wise fully connected feed-forward network.

· Each sub-layer adopts a residual connection and layer normalization. All sub-layers output data of the same dimension, d_model = 512.

Transformer architecture: Encoder (Image source: Vaswani, et al., 2017)

A residual connection simply takes the input and adds it to the output of the sub-network, and is a way of making deep networks easier to train. Layer normalization is a normalization method in deep learning that is similar to batch normalization.

The above can be expressed in an equation:

LayerNorm(x + Sublayer(x))

where Sublayer(x) is the function implemented by the sub-layer itself, i.e. the multi-head attention or the feed-forward network.

The encoder block essentially just performs a bunch of matrix multiplications followed by element-wise transformations. This makes the Transformer super-fast, as everything reduces to parallelizable matrix multiplication. By piling these transformations on top of each other, we create a very powerful network.
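As a rough sketch of how those matrix multiplications stack up, here is an illustrative (not faithful) encoder layer built on the attention sketch above; layer_norm and feed_forward are simplified stand-ins with random, untrained weights:

```python
def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, d_ff=2048, seed=1):
    # Position-wise feed-forward network: two linear maps with a ReLU in between
    rng = np.random.default_rng(seed)
    d_model = x.shape[-1]
    W1 = rng.standard_normal((d_model, d_ff)) * 0.02
    W2 = rng.standard_normal((d_ff, d_model)) * 0.02
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x):
    # Sub-layer 1: multi-head self-attention, then residual connection + layer norm
    x = layer_norm(x + multi_head_attention(x, x, x))
    # Sub-layer 2: position-wise feed-forward, then residual connection + layer norm
    x = layer_norm(x + feed_forward(x))
    return x
```

Stacking encoder_layer six times, each with its own weights, gives the full encoder.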

The Decoder

The decoder retrieves the relevant information from the encoder’s representations.

Its architecture mainly consists of:

· A stack of N = 6 identical layers

· Each layer has one sub-layer of a fully-connected feed-forward network and two sub-layers of multi-head attention mechanisms.

· Each sub-layer adopts a residual connection and a layer normalization.

· The first multi-head attention sub-layer is modified into “masked multi-head attention”, which prevents positions from attending to subsequent positions, since we don’t want to look into the future of the target sequence when predicting the current position.

Decoder (Image source: Vaswani, et al., 2017)

The network now attends over the preceding decoder states, playing a role similar to the decoder hidden state in conventional Machine Translation architectures.

We also see a masked multi-head attention block, which gets its name because we need to hide the future time-steps of the target sequence from the decoder.

An important point to keep in mind is that the decoder predicts each word based on all the words before it. This can be demonstrated with the example above: “I like cats more than dogs” has to be mapped to “私は犬よりも猫が好き” by the network. Here we train our model to predict that “犬” comes after “私は” when we feed in the source sentence “I like cats more than dogs”.

Legit way to train a transformer

But if we pass the entire target sentence to the decoder at once, the model simply learns to repeat the sentence and learns nothing. This is prevented by the masked multi-head attention block, which masks the future tokens while we decode a given word.

The wrong way to train a transformer
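A minimal sketch of how that masking can be implemented, continuing the NumPy examples above (my own helper, not code from the paper): scores for future positions are set to negative infinity before the softmax, so their attention weights become zero.

```python
def causal_mask(size):
    # mask[i, j] is 0 where position i may attend to position j (j <= i), -inf otherwise
    return np.triu(np.full((size, size), -np.inf), k=1)

def masked_attention_scores(Q, K):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Adding the mask before the softmax hides the future target tokens from the decoder
    return scores + causal_mask(scores.shape[0])
```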

The Full Architecture

Finally, here is the complete view of the transformer’s architecture:

· Both the source and target sequences first go through embedding layers to produce vectors of the same dimension, d_model = 512.

· To preserve the position information, a sinusoid-wave-based positional encoding is applied and summed with the embedding output.

· A linear layer followed by a softmax is added to the final decoder output.

The full model architecture of the transformer. (Image source: Fig 1 & 2 in Vaswani, et al., 2017.)

We can see that the Transformer still uses the same encoder-decoder structure as the classic seq2seq models used in NMT (Neural Machine Translation).

On the left is the encoder, and on the right is the decoder. The inputs to the encoder are the embeddings of the input sequence, and the inputs to the decoder are the embeddings of the outputs produced so far.

Here we observe an additional encoding, called positional encoding, which is added to both the input and output embeddings. Let’s discuss it right away.

Positional Encodings

Unlike RNNs, the multi-head attention network cannot automatically make use of a word’s position. Without positional encoding, the output of the multi-head attention block for “I like cats more than dogs” and “I like dogs more than cats” would be more or less the same.

Positional encoding explicitly encodes the position of each input word as a vector and adds it to the corresponding input embedding.

In the paper, the positional encoding is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position and i is the dimension. Each dimension of the positional encoding is a sinusoid with a different frequency, so the model can easily learn to attend by relative position.
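For reference, here is a compact NumPy version of the sinusoidal encoding above (a sketch of the formula, not the authors’ implementation):

```python
def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                   # positions: shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # dimension pairs: shape (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)     # one frequency per pair of dimensions
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions use cosine
    return pe
```

The resulting matrix is simply added to the input embeddings before they enter the encoder and decoder stacks.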

Results

The Transformer has shown tremendous results, outperforming its recurrent and LSTM-based counterparts despite abandoning the traditional recurrent architecture altogether.

On the WMT English-to-German and English-to-French translation tasks, the attention-based Transformer achieves state-of-the-art BLEU scores (41.8 on EN-FR and 28.4 on EN-DE).

Image source: https://arxiv.org/pdf/1706.03762.pdf

Not only that, but because of its highly parallelizable nature, the Transformer is able to do this at a significantly-reduced number of FLOPs (floating-point operations) in training.

Hence, the Transformer is better and faster!

End Notes

The Transformer is a real rebel in the natural language deep learning scene because of how it eschews conventional network constructs while still outperforming existing systems. It has challenged a lot of folk wisdom about the need for recurrence in natural language models.

The Transformer has since been extended in brand new models, most recently Bidirectional Encoder Representations from Transformers, or simply BERT.

But periodically the question will arise: “Is attention really all you need?”

Future work will answer that question!

Thanks for reading!

Leave us a comment and I’ll be happy to answer any questions. If you enjoyed this article, give it a clap.

Additional reading:

A Google Research blog post on this architecture

The original paper
