Advances in NLP in 2017 (part II)

Jan 19, 2018

The first part was devoted to the main fresh idea in our field: the Transformer. If you are interested in how Feed-Forward networks are disrupting the RNNs’ monopoly, have a look at the first part.

Unsupervised Machine Translation

The next part of this article is devoted to a task that seemed impossible even a few years back: unsupervised MT. The papers discussed here are:

The last paper does not seem to belong here. But as they say, do not judge a book by its cover. All three papers share the same main idea. In a nutshell it could be expressed like this: we have two auto-encoders for different text sources (i.e. different languages, or texts of different styles) and we just swap their decoder parts. How does it work? Let’s try to sort it out.

Auto-Encoder (left); Encoder-Decoder (right)

An auto-encoder is an encoder-decoder whose decoder decodes back to the original space. This means that the input and the output belong to the same language (or style). So if we have some text, we train an encoder to produce a vector containing all the information the decoder needs to reconstruct the original sentence. In the ideal case the reconstructed sentence will be exactly the same, but in many cases it is not. We need to somehow measure the similarity between the source and the reconstructed sentences, and the machine translation field has an answer to this challenge: a standard metric called BLEU.

BLEU: the BLEU score measures how many words and n-grams (n consecutive words) overlap between a given translation and a reference translation. The most commonly used BLEU version is BLEU-4, which considers words, bigrams, trigrams and 4-grams. It also applies a penalty to translations that are too short.
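To make the definition concrete, here is a minimal sketch of a sentence-level BLEU-4 in plain Python. It is a simplification under stated assumptions: a single reference, no smoothing, and whitespace tokenization. The function names `ngram_counts` and `sentence_bleu` are my own and are not taken from any library.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference (a sketch)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
perfect = sentence_bleu(reference, reference)
too_short = sentence_bleu("the cat sat on".split(), reference)
```

A perfect reconstruction gets the maximum score, while the truncated candidate keeps perfect n-gram precision and is pulled down only by the brevity penalty. The counting and the `min` clipping are discrete operations, which is why this score cannot serve directly as a training loss.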

As you may guess, it is not differentiable, so some other way to train our translator is needed. For an auto-encoder this could be the standard cross-entropy loss, but that is not enough for translation. Let’s skip this for now and go on.

OK, now we have a tool to build our auto-encoder. The next thing we need to do is to train two of them: one for the source language (style), and another one for the target language. We also need them crossed over: the decoder for the source language should restore encoded target strings, and vice versa.

Here is the tricky part: in the middle of an auto-encoder (or any encoder-decoder) we have a so-called hidden representation, a vector in some high-dimensional space. If we want our two auto-encoders to be compatible, we need their hidden representations to live in the same space. How can this be achieved? Through an additional loss for the two auto-encoders. This loss comes from a discriminator, which in turn brings us to GANs.

GAN: the Generative Adversarial Network. The idea of a GAN could be expressed as “a network playing against itself and trying to deceive itself”. There are three main components in a GAN: the Generator, which produces representations of some input that resemble the ground truth as closely as possible; the Golden Source, which produces the ground truth; and the Discriminator, which has to tell where its input comes from, the Generator or the Golden Source. We “punish” the Generator if the Discriminator can tell, and vice versa, we “punish” the Discriminator if it can’t, so both the Generator and the Discriminator get better during training.

In our case the discriminator (L_adv in the figure) has to tell where its input comes from: the source or the target language (for machine translation), or the source or the target style (for the style transfer task). In the figure above we can see two auto-encoders represented as separate blocks, encoders and decoders. They are linked in the middle, where the discriminator is placed. Training the two auto-encoders in parallel with this additional loss pushes the model to make the hidden representations of both encoders similar (upper part of the figure). The rest is clear: just replace the original decoder with its counterpart from the neighbouring auto-encoder (lower part of the figure) and enjoy your model translating from one language to another.
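The tug-of-war between the encoders and the discriminator can be sketched with two loss functions. This is only an illustration under my own assumptions: the discriminator here is a hypothetical linear classifier over the hidden vector, and all function names are made up for the example; the papers use their own architectures and training details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(hidden, weights, bias):
    """Hypothetical linear discriminator: P(hidden vector came from the source encoder)."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)


def discriminator_loss(p_source, is_source):
    """Cross-entropy the discriminator minimizes: guess the true origin of the vector."""
    return -math.log(p_source if is_source else 1.0 - p_source)


def encoder_adversarial_loss(p_source, is_source):
    """Loss the encoders minimize: make the discriminator bet on the wrong origin."""
    return -math.log(1.0 - p_source if is_source else p_source)


# A vector from the source encoder that the discriminator recognizes with 90% confidence:
confident = 0.9
d_loss = discriminator_loss(confident, is_source=True)        # small: discriminator is right
e_loss = encoder_adversarial_loss(confident, is_source=True)  # large: encoder gets "punished"
```

When the discriminator can no longer do better than a coin flip, both losses are equal, which is the situation where the two hidden spaces have become indistinguishable.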

All three papers have this idea behind the scenes, with specific details in each case. The explanation above comes mostly from Unsupervised Machine Translation Using Monolingual Corpora Only, so I should also mention the previous work of the same authors, since its results are used in the work in question:

Word Translation Without Parallel Data

The idea of this work is simple and brilliant:

Two Vector Spaces Are Pulled on Each Other

Let’s say that we have word embeddings for two different languages. (Suppose that we work with texts from the same domain, say news or fiction books.) We can assume that the vocabularies of news corpora in a pair of languages are close: for the majority of words in the source corpus we can find their translations in the target news corpus. Words like president, taxation or ecology will definitely be present in the news in both languages. So why can’t we just match these words up and pull one vector space onto the other? We actually can. Even more, we can find a function that transforms the whole space (and the points in it, i.e. the word vectors) into another space, where these points land on the points of the other language. In this work the authors show that this can be done in an unsupervised manner, which means we have no need for an explicit dictionary.
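To show what “pulling one space onto another” means, here is a toy sketch in two dimensions. Note the loud assumption: it fits the mapping from two known word pairs, which is the supervised shortcut and not the unsupervised procedure from the paper. The embeddings, the Italian words and the function names are all invented for the illustration, and the mapping is restricted to a plain rotation.

```python
import math


def fit_rotation(source_vecs, target_vecs):
    """Angle of the 2-D rotation that best maps source_vecs onto target_vecs
    in the least-squares sense (closed form for the 2-D case)."""
    pairs = list(zip(source_vecs, target_vecs))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in pairs)
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in pairs)
    return math.atan2(cross, dot)


def rotate(vec, angle):
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def nearest_word(vec, embeddings):
    """Word whose embedding is closest to vec (squared Euclidean distance)."""
    return min(
        embeddings,
        key=lambda w: (embeddings[w][0] - vec[0]) ** 2 + (embeddings[w][1] - vec[1]) ** 2,
    )


# Toy data: the "Italian" space is the "English" space rotated by an unknown angle.
TRUE_ANGLE = 0.7
english = {"president": (1.0, 0.0), "taxation": (0.0, 1.0), "ecology": (-1.0, 0.5)}
italian = {
    "presidente": rotate(english["president"], TRUE_ANGLE),
    "tassazione": rotate(english["taxation"], TRUE_ANGLE),
    "ecologia": rotate(english["ecology"], TRUE_ANGLE),
}

# Fit on two known pairs, then translate a word that was not among them.
angle = fit_rotation(
    [english["president"], english["taxation"]],
    [italian["presidente"], italian["tassazione"]],
)
translation = nearest_word(rotate(english["ecology"], angle), italian)
```

Once the transformation is known, translating a word is just a nearest-neighbour lookup in the other space. The contribution of the paper is getting such a transformation without the seed pairs used here.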

The Style-Transfer from Non-Parallel Text by Cross-Alignment paper is placed in this section since languages can be treated just like different styles of text, and the authors mention this in the paper. This work is also interesting because its code is published.

Controllable Text Generation

This section is really close to the previous one, but still a little bit different. The works in question are:

In the first paper we have a different approach to style transfer, one that is closer to controllable generation, so it is placed here instead of in the previous section, where its counterpart belongs. The idea of controllable generation could be illustrated by this figure:

Here we again see an auto-encoder on text, but with additional information: the hidden representation of the input (here it carries the sense) is enriched by an additional feature vector. This feature vector encodes some specific properties of the text, e.g. sentiment or grammatical tense.

As you can see in the picture, we also have a discriminator in addition to the auto-encoder. There could even be more than one discriminator if we have multiple properties encoded at once. So in the end we have a composite loss function: the reconstruction loss from the auto-encoder plus an additional loss for the specified text features. The reconstruction loss therefore has a somewhat different meaning here: it covers only the sense of a sentence, not the features we force to be explicit.
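The two ingredients described above, an enriched hidden representation and a composite loss, can be sketched in a few lines. Everything here is a hypothetical illustration: the vector sizes, the one-hot attribute codes, the per-attribute weights and the function names are my assumptions, not details taken from the paper.

```python
def decoder_input(sense_vec, *attribute_codes):
    """Concatenate the sense vector with one explicit code per controlled attribute."""
    enriched = list(sense_vec)
    for code in attribute_codes:
        enriched.extend(code)
    return enriched


def composite_loss(reconstruction_loss, attribute_losses, weights):
    """Reconstruction loss plus one weighted discriminator loss per attribute."""
    if len(attribute_losses) != len(weights):
        raise ValueError("need exactly one weight per attribute loss")
    return reconstruction_loss + sum(w * l for w, l in zip(weights, attribute_losses))


sense = [0.2, -1.3, 0.7]      # what the sentence says
sentiment = [1.0, 0.0]        # hypothetical one-hot code: positive
tense = [0.0, 1.0, 0.0]       # hypothetical one-hot code: present
enriched = decoder_input(sense, sentiment, tense)

# One discriminator per attribute, each contributing its own loss term.
total = composite_loss(2.0, attribute_losses=[0.5, 1.0], weights=[0.1, 0.1])
```

To generate the same sentence with a different sentiment, one would keep `sense` fixed and swap only the `sentiment` code before decoding.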

Simple Recurrent Unit

And now the last but not least section, which is again devoted to speed of computation. Despite the fact that in the first part we discussed ground-breaking newcomers from Fully-Connected nets, the whole field to this day runs on recurrent networks. Another well-known fact is that RNNs are much slower than CNNs. Or aren’t they? To answer this question, let’s explore this paper:

Training RNNs as Fast as CNNs

I think the authors of this work tried to answer the question: why are RNNs so slow? What makes them so? And they found the key: RNNs are sequential, it is in their nature. But what if we could keep the sequential part as small as possible? Let’s say that (almost) everything is independent of the previous state. Then we could process all the inputs in parallel. So our task is to throw out all unneeded dependencies on previous states. And that is what it comes to:

As you can see, only two equations depend on the previous state. And these equations work with vectors, not matrices. All the heavy mathematics can be done independently, in parallel. At the end we just add a few multiplications to handle the sequential nature of the data. This setup proves to be great, check it out yourself:
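Here is a minimal plain-Python sketch of one SRU layer, written to make the split between the parallel and the sequential part visible. It follows the formulation as I recall it from the paper (a forget gate, a reset gate and a highway connection, with tanh as the activation), so treat the exact equations, the parameter names `W`, `W_f`, `b_f`, `W_r`, `b_r` and the helper functions as assumptions. The highway connection requires the input and hidden sizes to be equal.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def sru_layer(inputs, W, W_f, b_f, W_r, b_r):
    """Run one SRU layer over a sequence of input vectors; returns the hidden states."""
    # Step 1: the heavy part. Nothing here looks at a previous time step,
    # so all time steps could be computed in parallel (or as one big matrix product).
    projected = [matvec(W, x) for x in inputs]
    forget = [[sigmoid(a + b) for a, b in zip(matvec(W_f, x), b_f)] for x in inputs]
    reset = [[sigmoid(a + b) for a, b in zip(matvec(W_r, x), b_r)] for x in inputs]

    # Step 2: the cheap sequential part, element-wise operations on vectors only.
    c = [0.0] * len(W)  # internal state, one value per hidden unit
    hidden = []
    for x, x_tilde, f, r in zip(inputs, projected, forget, reset):
        # New state: mix of the previous state and the projected input.
        c = [fi * ci + (1.0 - fi) * xi for fi, ci, xi in zip(f, c, x_tilde)]
        # Output: mix of the activated state and the raw input (highway connection).
        h = [ri * math.tanh(ci) + (1.0 - ri) * xi for ri, ci, xi in zip(r, c, x)]
        hidden.append(h)
    return hidden


zeros = [[0.0, 0.0], [0.0, 0.0]]
states = sru_layer([[2.0, 4.0], [1.0, -1.0]], zeros, zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

All three matrix multiplications sit in step 1, outside the loop over time. The loop itself touches only vectors, which is the whole trick.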

The speed of the Simple Recurrent Unit (SRU) is approaching that of a CNN!

Conclusion

We can see that in 2017 the field had its own disruptors, like the Transformer, breakthroughs, like Unsupervised Machine Translation, and also, for the common good, fast RNNs (which are faster than FasterRNNs, if you know what I mean). I’m looking forward to 2018 and hoping for new breakthroughs and advances in ways I still cannot imagine.

Written by Valentin Malykh