Disclaimer: First of all, I need to say that all these trends and advances are only my point of view; other people may have different perspectives.
In 2017 A.D. I could spot two main trends:
- Faster, More Parallel — much of the community's attention was pointed at doing things faster and more in parallel as a way to achieve speedup.
- Unsupervised — unsupervised approaches are very common in computer vision, but comparatively rare in natural language processing (among major ideas of recent years, only word2vec comes to mind). This year was marked by increased usage of unsupervised training.
This article is mostly about the first trend; for the second one, see the follow-up article. That article also features speedup, which seems to be inevitable nowadays anyway.
Attention Is All You Need
This already famous paper marked the second coming of feed-forward networks to NLP. The work is from Google and features well-known researchers such as Jakob Uszkoreit and Łukasz Kaiser. The idea behind the Transformer architecture presented in the paper is simple and brilliant: let's forget about recurrence and all that stuff and just try to use attention to do the job. And this idea has actually worked!
But first, let's remember that all current state-of-the-art neural machine translation architectures are built on recurrent networks. Intuitively, these networks really suit natural language processing tasks like machine translation, since they keep an explicit memory during inference. This feature has an obvious bright side, but it also has an accompanying dark side: since we keep a memory of what has been processed before, we must process the data in that particular consecutive order. As a result, processing the whole input is slow (e.g., in comparison to CNNs), and this is exactly the issue the authors address.
Hence the Transformer architecture is feed-forward, without any recurrence. What they use instead to do the whole job is attention.
Let’s first refresh the standard Bahdanau’s approach to attention:
The idea of attention is that we need to focus on some relevant part of the encoder's input to do better decoding. In the simplest case, relevance is defined as the similarity of a specific input to the current output. The attended context, in its turn, is a weighted sum of the inputs, where the weights sum up to 1 and the biggest weight corresponds to the most relevant input.
In the figure, we can see the already classic approach of Dzmitry Bahdanau: we have one input — the hidden states of the encoder (h's), and some coefficients (a's) to sum these hidden states with. These coefficients are not preset; they are generated from some other input, different from the encoder hidden states.
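As a toy illustration (not any paper's actual code), the core mechanic — a softmax producing weights that sum to 1, used to blend the encoder hidden states into a context vector — can be sketched in plain Python:

```python
import math

def softmax(scores):
    """Normalize raw relevance scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, scores):
    """Weighted sum of encoder hidden states (h's) using attention
    weights (a's) derived from relevance scores."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return context, weights
```

In the real model the scores themselves come from a small learned network comparing the decoder state with each hidden state; here they are just given as numbers.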
In contrast, the authors of the paper in question suggest so-called self-attention on the input data. The "self" in the name refers to the idea that attention is applied to the same data on which it is computed, in contrast to the standard approach, where one uses some additional input to produce attention over the given input.
Furthermore, this self-attention is called Multi-Head because it performs the same operation multiple times in parallel. If you're looking for an analogy, this feature could be compared to convolutional filters: each head focuses on different places in the input. The other main feature of the attention from the paper is the usage of three inputs (instead of two in the standard approach). As you can see in the figure, we first compute a "sub-attention" over Q (query) and K (key) and then combine it with V (value) from the input. This refers us to the notion of memory, which is what attention actually is.
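A minimal single-head sketch of the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python for clarity — a real implementation would use matrix libraries plus separate learned projections of Q, K, V for each head:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """For each query vector: score it against all keys, turn the
    scores into softmax weights, and return the weighted sum of values.
    Q, K, V are lists of equal-length vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # dot products with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

For self-attention, Q, K, and V are all derived from the same input sequence.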
Aside from this main feature of the architecture, there are two secondary, yet significant, features:
- positional encoding,
- masked attention for decoder.
Positional encoding — as we remember, the whole architecture of the model is feed-forward, so there is no notion of sequence inside the network. To inject knowledge of word order into the network, positional encoding was proposed. To me, the usage of trigonometric functions (sines and cosines) to encode the position of a word in the document isn't that obvious, but it works: this embedding, combined with the actual word embedding (e.g., the above-mentioned word2vec), brings both the meaning of a word and its relative position to our network.
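The sinusoidal scheme from the paper — PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) — can be sketched as:

```python
import math

def positional_encoding(seq_len, d_model):
    """Build the sinusoidal position table: even dimensions use sine,
    odd dimensions use cosine, with wavelengths growing geometrically
    across the embedding dimensions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # pair index: dimensions 2i and 2i+1 share the same wavelength
            angle = pos / (10000 ** ((j // 2 * 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

Each row is then simply added to the word embedding at that position, so the same word carries a slightly different vector depending on where it stands.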
Masked attention — a simple yet important feature: again, since there is no notion of sequence inside the network, we need to somehow prevent the network from attending to future words, which are actually unavailable when we do the decoding. So, as you may have spotted in the picture of attention, there is a place for a mask, which, figuratively speaking, crosses out the words whose position is in the future relative to the current one.
All these features allowed this architecture to not only work, but even improve the state of the art in machine translation.
Parallel Decoder for Neural Machine Translation
This last feature was unsatisfying for the authors of this paper, written by Richard Socher's group at Salesforce Research: masked attention keeps the decoder sequential, so it cannot enjoy the speedup gained by the parallel encoder. So they decided to take the next step: "Why can't we make a parallel decoder, if we already have a parallel encoder?" That is only my speculation, but I bet the authors had a similar question in their minds. And they found a way to solve the issue.
They called it Non-Autoregressive Decoding, and the whole architecture the Non-Autoregressive Transformer, meaning that no output word now depends on another one. This is an exaggeration, but not that big a one, after all. The idea here is that the encoder in this architecture produces a so-called fertility rate for each word it sees. This fertility rate is then used to generate the actual translation of each word, based only on the word itself. You can think of it in terms of a standard alignment matrix for machine translation:
As you can see, some of the words can correspond to more than one target word, and some seem not to correspond to any word at all. Thus the fertility rate just slices this matrix into pieces, where each piece belongs to a specific word in the source language.
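The fertility step can be illustrated with a toy helper (my own sketch for illustration, not the authors' code): each source token is copied fertility-many times to form the decoder's input, after which all positions can be translated in parallel.

```python
def apply_fertilities(source_tokens, fertilities):
    """Build the decoder input by copying each source token according
    to its predicted fertility: fertility 0 drops the token, 1 keeps
    it once, 2 repeats it, and so on."""
    decoder_input = []
    for tok, f in zip(source_tokens, fertilities):
        decoder_input.extend([tok] * f)
    return decoder_input

# e.g. apply_fertilities(["we", "totally", "accept"], [1, 0, 2])
# yields ["we", "accept", "accept"]
```

Since the decoder input's length and contents are fully determined up front, no output position needs to wait for another.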
So we have the fertility, but it alone is not enough for wholly parallel decoding. As you can see, we need some more attention layers — positional attention (which refers us again to positional encoding) and inter-attention, which replaces the masked attention from the original Transformer.
Unfortunately, while giving such a boost in speed (up to 8x in some cases), the Non-Autoregressive Decoder takes a few BLEU points in return. So there is room for improvement!
In the next part we'll discuss other important works, focusing on unsupervised approaches in the first place.