The Transformers I

Topic: Attention is All You Need (I)

Tanli Hsu
7 min read · Apr 14, 2019

I could not stop smiling for five minutes after seeing this tweet:

https://twitter.com/gneubig/status/876620278880550916

JK. Let’s get to the real stuff.

Mind Mapping

This is the first time I have introduced mind mapping in this blog. Mind mapping is a tool I have been using for years without explicitly knowing I was using it. It helps me draw out my flow of thinking as I read an academic research paper. Typically, I have some general questions in mind before I get into a paper, and the flow of thinking should help me answer them:

  1. What is the engineering problem they are trying to address?
  2. Can this problem be broken down into sub-problems?
  3. What are the main approaches before this paper is published?
  4. How far did those approaches go?
  5. What are the remaining, to-be-resolved parts?
  6. What approach do they propose?
  7. How does their approach differ from others'?
  8. What are the assumptions and limitations?
  9. What use case did they have in their experiment?
  10. How about the performance? How did they define “performance”?

So, what’s the problem?

This paper attempts to address the sequence transduction problem in Natural Language Processing. The problem is not exclusive to NLP; it appears in other fields of study as well. As long as your system takes sequential inputs, produces sequential outputs, and you have to build a specific type of mapping between the two, you may find yourself dealing with a sequence transduction problem.

Though the sequence transduction problem is not specific to NLP or its sub-disciplines, in this paper the authors address it in the context of machine translation, a common application in NLP.

Before the Transformer paper was published, the mainstream approaches to this problem were typically RNN/LSTM-based, combined with encoder-decoder architectures and an attention mechanism. One major issue with these traditional approaches, however, is sequential computation. Typical RNN-based models generate a sequence of hidden states in which each hidden state is a function of the previous one. This "sequential" nature inherited from RNNs quickly becomes a major obstacle in the pursuit of computational efficiency and parallelism.

Proposed solution?

In this paper, the authors propose a new architecture built purely on the attention mechanism within an encoder-decoder structure, as shown in the now-famous figure below:

The left part is the encoder and the right part is the decoder. The full Transformer architecture simply stacks this structure 6 times (N = 6 identical layers in both the encoder and the decoder).

This model consists of four major components:

  • Embeddings and Softmax
  • Positional Encoding
  • Multi-Head Attention
  • Position-wise Feed-Forward Networks

Embeddings and Softmax

Learned embedding layers convert the input and output tokens into vectors of dimension d_model, and a softmax turns the decoder output into predicted next-token probabilities. The two embedding layers and the pre-softmax linear transformation share the same weight matrix, following the design in this paper; inside the embedding layers, the weights are multiplied by √d_model.
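
To make the weight sharing concrete, here is a minimal numpy sketch of the idea (my own illustration, not the authors' code; vocab_size and d_model are placeholder values):

```python
import numpy as np

vocab_size, d_model = 10000, 512                       # placeholder sizes
W_emb = np.random.randn(vocab_size, d_model) * 0.01    # shared weight matrix

def embed(token_ids):
    # embedding lookup, scaled by sqrt(d_model) as in the paper
    return W_emb[token_ids] * np.sqrt(d_model)

def output_logits(decoder_hidden):
    # the pre-softmax linear transformation reuses the same matrix (transposed)
    return decoder_hidden @ W_emb.T
```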

Positional Encoding

A more detailed discussion can be found in this paper. In the Transformer, the authors use a pair of simple sinusoidal encoding functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension.
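
A minimal numpy sketch of these functions, just to make the indexing explicit (my own illustration; max_len is a placeholder):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(50, 512)              # one d_model-sized vector per position
```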

Multi-Head Attention

A famous figure again:

To understand the basic attention mechanism, I found [3] to be a very helpful resource. If you happen to read Simplified Chinese, [5] is a fairly complete article that provides all you need to know about the attention mechanism. If you prefer reading in English, I found [6] to be an easy-to-understand and well-written introduction. Another resource I found online is [9], which is written in Japanese. If you don't speak Japanese but do speak Traditional Chinese, [8] has translated it for you (a human translation, I suppose).

Back to multi-head attention. From my understanding, multi-head attention is a parallel arrangement of several scaled dot-product attention heads. After each head completes its computation, the outputs are concatenated and projected once more.

Compare simple scaled dot-product attention with multi-head attention:

Scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Multi-head attention: MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O, where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

where the W matrices are the learned projection matrices applied to Q, K, and V (and W^O projects the concatenated heads back to d_model).
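
A minimal numpy sketch of the two formulas (my own illustration, not the code from [4]; heads is assumed to be a list of per-head projection matrices (Wᵢ^Q, Wᵢ^K, Wᵢ^V) and W_O the output projection):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, heads, W_O):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
    outputs = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
               for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ W_O
```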

Position-wise Feed-Forward Networks

A relatively simple, fully-connected, feed-forward network:

FFN(x) = max(0, xW₁+b₁)W₂+b₂

The authors also provide an alternative view of this layer:

Another way of describing this is as two convolutions with kernel size 1.
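
A minimal Keras sketch of the two equivalent views (my own illustration, not the code from [4]; d_model = 512 and d_ff = 2048 are the base-model sizes from the paper):

```python
from tensorflow.keras import layers

d_model, d_ff = 512, 2048   # base-model dimensions from the paper

# FFN(x) = max(0, xW1 + b1)W2 + b2, applied identically at every position
ffn = [layers.Dense(d_ff, activation="relu"), layers.Dense(d_model)]

# the "two convolutions with kernel size 1" view of the same layer
ffn_conv = [layers.Conv1D(d_ff, kernel_size=1, activation="relu"),
            layers.Conv1D(d_model, kernel_size=1)]
```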

Implementations?

OK, enough modeling from the previous section. Now let's have a look at the details of their implementation.

Training

  • Training Data and Batching:
    - Standard WMT 2014 English-German dataset
    - Sentences were encoded using byte-pair encoding (from this paper)
  • Hardware
    - 8 NVIDIA P100 GPUs
    - Base model: 12 hours
    - Big model (described in Table 3): 3.5 days
  • Some training tricks:
    - Adam Optimizer with β₁ = 0.9 and β₂ = 0.98 and ϵ = 10⁻⁹
    - Learning rate schedule: the learning rate increases linearly over the first warmup_steps = 4,000 steps, then decays proportionally to the inverse square root of the step number (a small sketch follows after this list)
  • Regularization:
    - Residual Dropout (from this paper) with dropout rate P_drop = 0.1 for the base model
    - Label Smoothing (from this paper) with ϵₗₛ = 0.1
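
The learning-rate schedule from the paper, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), is easy to sketch in a few lines of Python (my own illustration):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # linear warm-up for the first warmup_steps steps, then ~ 1/sqrt(step) decay
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```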

Results and Performance Evaluations?

In machine translation tasks, a common way to evaluate how a model performs is the BLEU score. In this paper, the experimental BLEU scores are shown in Table 3.
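
For readers unfamiliar with BLEU, here is a toy illustration of what a BLEU computation looks like using NLTK (this is not the evaluation pipeline used in the paper, just a sketch):

```python
from nltk.translate.bleu_score import corpus_bleu

# each hypothesis is scored against a list of tokenized reference translations
references = [[["the", "cat", "sits", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "the", "mat"]]

print(corpus_bleu(references, hypotheses))   # 1.0 would mean a perfect match
```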

Comments and Criticisms

This paper has attracted an enormous amount of attention from both academia and industry, and is undoubtedly a milestone in the NLP world. Inevitably, however, some criticisms arose, especially around the title "Attention is All You Need."

While searching for related material on the Transformer paper, I found [4] to be a very comprehensive review of the attention mechanism and the Transformer paper itself. In that article, the author makes some criticisms of this well-known Google paper. I will take some time to translate them here based on my understanding:

  1. Given the title "Attention is all you need", it seems the authors intentionally avoided using CNN and RNN terminology. However, as the author of [4] suggests, the Position-wise Feed-Forward Network can actually be described as a 1-D convolution layer with kernel size 1.
  2. Although the attention mechanism is not directly related to CNNs, some of the ideas (at least the ones used in this paper) are "borrowed" from CNNs: multi-head attention can be treated as a concatenation of single attention heads, which is similar to the multi-kernel idea in CNNs; the residual connections used in training can also be found when training CNNs.
  3. Pure attention cannot properly model positional information. Positional encoding alleviates the problem to a certain degree, but does not fully resolve it.
  4. Not every task requires an understanding of global dependencies. For tasks that rely only on local dependencies, a pure attention model will not work as well.
    To address this potential issue, Google also proposed a "restricted" version of self-attention: assuming the current word only has dependencies on r neighboring words, attention then only takes place within this 2r+1 window (a minimal sketch follows below). But this idea sounds exactly like the kernel-window concept in CNNs.
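
For concreteness, here is a minimal numpy sketch of such a restricted (windowed) self-attention (my own illustration of the idea, not Google's implementation):

```python
import numpy as np

def restricted_self_attention(Q, K, V, r):
    # position j may only attend to positions k with |j - k| <= r,
    # i.e. attention is confined to a 2r+1 window around each word
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    idx = np.arange(len(Q))
    mask = np.abs(idx[:, None] - idx[None, :]) <= r
    scores = np.where(mask, scores, -1e9)          # mask out positions outside the window
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```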

Code Study

Github Link

This piece of Keras code is adapted from [4], with the necessary changes for testing in the Google Colab environment.

Though the title of [4] implies that it includes an implementation of the Transformer paper, it turns out there is only an implementation of the multi-head attention layer and the positional encoding layer. At the time this blog post was written, these were also the only parts I had tested.

To fully replicate Transformer paper, the following works still remain:

  • Forming encoder-decoder structures with attention, FFN, and position encoding.
  • Stacking such structures 6 times into a full-fledged Transformer
  • Applying such Transformer model to Machine Translation tasks
  • Evaluating the performance using BLEU scores.

These pending action items shall be resolved by the next "week".

Reference

  1. Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
  2. Kaiser, Lukasz, et al. “One model to learn them all.” arXiv preprint arXiv:1706.05137 (2017).
  3. CMU Neural Nets for NLP 2019 (7): Attention
  4. 一文读懂「Attention is All You Need」| 附代码实现 [Understanding "Attention is All You Need" in One Article | With Code Implementation]
  5. 深度学习中的注意力模型(2017版) [Attention Models in Deep Learning (2017 Edition)]
  6. The Illustrated Transformer
  7. How to code The Transformer in Pytorch
  8. Good Article Translation and Sharing — Attention Model
  9. 論文解説 Attention Is All You Need (Transformer) [Paper Explanation: Attention Is All You Need (Transformer)]
