An Adversarial Review of “Adversarial Generation of Natural Language”

Or, for fuck’s sake, DL people, leave language alone and stop saying you solve it.

[edit: some people commented that they don’t like the us-vs-them tone and that “deep learning people” can — and some indeed do — do good NLP work. To be clear: I fully agree. #NotAllDeepLearners ]

[update: I added some clarifications based on responses to this piece. I suggest reading them after reading this one. ]

[update: Yann LeCun responded on facebook, followed by my response to Yann’s]

I’ve been vocal on Twitter about a deep-learning-for-language-generation paper titled “Adversarial Generation of Natural Language” from the MILA group at the University of Montreal (I didn’t like it), and was asked to explain why.

Some suggested that I write a blog post. So here it is. It is written in somewhat of a hurry (after all, I do have some real work to do), and is not academic in the sense that it does not have references and so on. It may contain tons of typos. But I fully stand behind all of its content. We can discuss in the comments (Medium has comments, right? I never actually used it).

While it may seem that I am picking on a specific paper (and in a way I am),
the broader message is that I am going against a trend in deep-learning-for-language papers, in particular papers that come from the “deep learning” community rather than the “natural language” community.

There are many papers that share very similar flaws. I “chose” this one because it was getting some positive attention, and also because the authors are from a strong DL group and can (I hope) stand the heat. Also, because I find it really bad in pretty much every aspect, as I explain below.

This post is also an ideological action w.r.t. arxiv publishing: while I agree that short publication cycles on arxiv can be better than the lengthy peer-review process we now have, there is also a rising trend of people using arxiv for flag-planting, and to circumvent the peer-review process. This is especially true for work coming from “strong” groups. Currently, there is practically no downside to posting your (often very preliminary, often incomplete) work to arxiv, only potential benefits.

I believe this should change, and that there should also be a risk associated with posting to arxiv before or in conjunction with peer review.
Critical posts like this one represent this risk. I would like to see more of these.

Why do I care that some paper got on arxiv? Because many people take these papers seriously, especially when they come from a reputable lab like MILA. And now every work on either natural language generation or adversarial learning for text will have to cite “Rajeswar et al 2017”. And they will accumulate citations. And reputation. Despite being a really, really poor work when it comes to language generation. And people will also replicate their setup (for comparability! for science!!). And it is a terrible setup. And other people, likely serious NLP researchers, will come up with a good, realistic setup on a real task, or with something more nuanced, and then be asked to compare against Rajeswar et al 2017. Which is not their task, and not their setup, and irrelevant, and shouldn’t exist. But the flag was already planted.

So, let’s start with dissecting this paper.

The Attitude

As I said on twitter, I dislike pretty much everything about this work. From the technical solution they propose down to the evaluation. But what bothers me most is the attitude and the hubris.

I’ve been working on language understanding for over a decade now, and if I learned something since I started, it’s that human language is magnificent, and complex, and challenging. It has tons of nuances, and corners, and oddities, and surprises. While natural language processing researchers, and natural language generation researchers (and linguists, who do a lot of the heavy lifting!) have made some impressive advances towards our understanding of language and how to process it, we are still just barely scratching the surface on this.

I have a lot of respect for language. Deep-learning people seem not to. Otherwise, how could you explain a paper title such as “Adversarial Generation of Natural Language”?

The title suggests the task is nearly solved. That we can now generate natural language (using adversarial training).
Sounds exciting! But. If you look at the actual paper, and scroll to the end, you’ll see tables 3, 4 and 5, containing some examples of the generated sentences from the model. They include such impressive natural language sentences as:

* what everything they take everything away from 
* how is the antoher headache
* will you have two moment ? 
* This is undergoing operation a year .

These are not even grammatical!

Now, I realize that adversarial training is hot right now, and that adversarial training for sequences of discrete symbols is hard.
Maybe the technical solution proposed by this paper (we’ll get to it below) indeed improves the results tremendously over the previous, even less functioning trick.
If that’s the case, then this is probably worth noting. Other people will see it, and improve even further. Science! Progress! Ok, I am all for that.

But please, don’t call your paper “Adversarial Generation of Natural Language”. Call it what it really is:
“A Slightly Better Trick for Adversarial Training of Short Discrete Sequences with Small Vocabularies That Somewhat Works”.
What, you say? This sounds boring? No one will read this? Maybe, but this is what the paper is doing.
I am sure the authors can come up with a more appealing title. But it should reflect the actual content of the paper, and not pretend to “solve natural language”.

[BTW, if, like in this paper, we don’t care for controlling the generated text in a meaningful way, we already have good methods for generating passable text: sampling from an RNN language model, or from a variational auto-encoder. These were shown in numerous papers to produce surprisingly grammatical texts, and even scale to large vocabularies. If you were living under a rock, start with this classic post from Andrej Karpathy. For some reason, these are not even mentioned in the paper, let alone compared against.]

This paper is not alone in flag-planting and extreme over-selling. Another related recent example is Controllable Text Generation, from Hu et al. Controllable, you say? How nice!
In effect, they demonstrated that they can control two factors of the generated text: its sentiment (positive or negative) and its tense (past, present and future). And generate sentences of up to 15 words.

This is, again, both over-selling and grossly disrespecting language.
For some context, the average sentence length in a Wikipedia corpus I have around here is 19 tokens. Many are way longer than that. Sentiment is much more nuanced than positive or negative. And English has a somewhat more elaborate time system than past, present, and future.

So ok, Hu et al created an actor-critic-VAE framework with some minimal control options, and made it work with some short text fragments. Is “Controllable Text Generation” really the most descriptive title here? (Although, to be fair, they did not say Natural Language in the title, only in the abstract, so that’s something I guess).

[edit: Zhiting Hu commented on his paper in the responses to this post. I agree with all his points. By re-reading what I wrote, I see that it came across as too harsh on their work. I want to clarify: Hu et al is much much better than the Adversarial Generation paper. It does have flaws: I still think the title is too broad, and that not discussing the reason for the short sentences or acknowledging it as a limitation is a problem. I also have some serious issues with the evaluation. But this is not nearly as bad as the Adversarial Generation paper I discuss here at length.]

Another example is the bAbI corpus from Facebook. It was created as a toy set, it is super artificial and limited when it comes to language, yet many recent works
evaluate on it and claim to “do natural language inference” or something along those lines. But harping on bAbI is beyond the scope of this post.

The Method

The previous section discussed the “Natural Language” aspect of the title (and we’ll see more of that in the next section).
Now let’s consider the “Adversarial” part, which relates to the innovation of this paper.

Recall that in GAN training we have a generator network and a discriminator network that are trained jointly. The generator tries to generate realistic outputs, and the discriminator tries to separate the generated outputs from real examples. By training the models together, the generator learns to deceive the discriminator and hence to produce realistic outputs. This works amazingly well for images.

To summarize the technical contribution of the paper (and the authors are welcome to correct me in the comments if I missed something), adversarial training for discrete sequences (like RNN generators) is hard, for the following technical reason: the output of each RNN time step is a multinomial distribution over the vocabulary (a softmax), but when we want to actually generate the sequence of symbols, we have to pick a single item from this distribution (convert to a one-hot vector). And this selection is hard to back-prop the gradients through, because it’s non-differentiable. The proposal of this paper is to overcome this difficulty by feeding the discriminator the softmaxes (which are differentiable) instead of the one-hot vectors. That’s pretty much it.
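The difference between the two kinds of output can be sketched in a few lines of plain Python. This is only an illustration of the general point, not code from the paper: the 4-word vocabulary and the logit values are made up, and `softmax` and `one_hot_argmax` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_argmax(probs):
    """Pick the most likely symbol and encode it as a one-hot vector."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

# Made-up logits for one RNN time step, and a slightly nudged copy.
logits = [2.0, 1.0, 0.5, -1.0]
nudged = [2.0 + 1e-3, 1.0, 0.5, -1.0]

soft, soft_nudged = softmax(logits), softmax(nudged)
hard, hard_nudged = one_hot_argmax(soft), one_hot_argmax(soft_nudged)

# A small change in the logits changes the softmax a little: there is a
# useful gradient to follow.
soft_changed = soft != soft_nudged
# The same small change leaves the one-hot vector untouched: the selection
# step is flat almost everywhere, so it passes back no gradient signal.
hard_changed = hard != hard_nudged

print("softmax changed:", soft_changed)
print("one-hot changed:", hard_changed)
```

The one-hot vector only changes when the nudge is large enough to flip the winner, and at that point it jumps. That is the sense in which the selection is non-differentiable.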

Think about it for a moment. The discriminator’s role here is to learn to separate the training sequences (sequences of one-hot vectors) from sequences of softmax vectors produced by the RNN. It needs to separate one-hot vectors from non-one-hot vectors. This is… kind of a weak adversary. And has nothing to do with natural languageness.

Let’s also think for a moment about the effect of this discriminator on the generator: the generator needs to fool the discriminator, and the discriminator attempts to distinguish the generator’s softmax outputs from one-hot vectors. The effect of this would be to make the generator produce near-one-hot vectors, that is, very concentrated distributions. I am not sure if this is really what we would like to guide our natural language generation models towards. Think about it for a while and try to see how you feel about it. But if we do think that very sharp distributions are something that should be encouraged, there are easier ways of achieving that (temperature, priors). Do we know that the proposed model is doing more than introducing this kind of preference for spiky predictions? No, because this is never evaluated in the paper. It is not even discussed.
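As a rough illustration of the “temperature” route to spiky distributions, here is a minimal sketch. It assumes the usual temperature-scaled softmax, with made-up logits; `softmax_with_temperature` and `entropy` are hypothetical helper names, and nothing here comes from the paper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a temperature; lower means spikier."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; 0 for a perfectly one-hot distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-word vocabulary
normal = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.1)

print("T=1.0  max prob %.3f  entropy %.3f" % (max(normal), entropy(normal)))
print("T=0.1  max prob %.3f  entropy %.3f" % (max(sharp), entropy(sharp)))
```

Lowering a single scalar pushes the output towards a one-hot vector, with no adversary involved. Whether the adversarial setup does anything beyond this is exactly what would need to be measured.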

[late addition: Dzmitry Bahdanau, in the comments, points out that the adversary may be more effective and less naive than what I am saying, because of its being a Wasserstein GAN. This may well be, I’d trust Dzmitry’s opinion on this more than my own, I am not an expert in this. But I still would like to see the point about spikiness being mentioned and evaluated explicitly. It is possible that the W-GAN is doing more than it seems, but show me the experiments to support that!]

So, the natural language is not really natural, and the adversary is not really adversarial. Now to the evaluation.

The Evaluation

The model is not evaluated. Like, at all. Definitely not on natural language. And it is clear that the authors have no idea what they are doing.

To quote the authors (section 4): 
“We propose a simple evaluation strategy for evaluating adversarial methods of generating natural language by constructing a data generating distribution from a CFG or P−CFG”

Well, guess what. Natural language is not generated by a CFG or a PCFG.

They then describe the two grammars they used: a toy one (which they at least admit is toy!) having 248 production rules (!!), with a vocabulary of 45 tokens (!!!), from which they generate sentences of 11 words (!!!). Yeah. Aha. Impressive.

But wait, let’s actually look at the grammar file (from a homework assignment by Jason Eisner at Hopkins, described as “a very, very, very simple grammar that you can extend”). Out of its 248 production rules, only 7 are actually production rules. Yes. 7. The remaining rules are lexical rules, i.e. mapping pre-terminal symbols to vocabulary items. But wait, the vocabulary size is 45. 45+7=52. Where are the other 196 rules? Well, at least 182 of them are rules mapping the `Misc` symbol to some word. The `Misc` symbol does not participate in the grammar, and is meant for the students who do the homework assignment to extend. So the authors in effect used a grammar with 52 production rules (not 248 as claimed), of which only 7 are real rules. They didn’t even bother to look at the grammar or describe it correctly. AND it’s an extreme toy.
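The rule counting above, spelled out as arithmetic. All numbers are the ones quoted in this post; the variable names are made up, and counting one lexical rule per vocabulary item is the same simplification the text uses.

```python
claimed_rules = 248      # rule count reported in the paper
structural_rules = 7     # rules that rewrite non-terminals
vocabulary_size = 45     # assumes one lexical rule per vocabulary item
misc_rules = 182         # lower bound on the unused `Misc` rules

rules_in_use = structural_rules + vocabulary_size
unexplained = claimed_rules - rules_in_use
still_unexplained = unexplained - misc_rules

print("rules actually in use:", rules_in_use)
print("rules not accounted for by the grammar:", unexplained)
print("left over after the Misc rules (at most):", still_unexplained)
```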

Now, for the second grammar. This is derived from the Penn Treebank corpus. The details are not clear from the paper, but they do say that they restrict the generation to the 2,000 most common words in the corpus. Here is a typical sentence from this corpus, when words that are not in the top 2,000 are replaced with an underscore:

“_ _ _ Inc. said it expects its U.S. sales to remain _ at about _ _ in 1990 .”

That’s 20 words, by the way.

Here’s another sentence:

“_ _ , president and chief executive officer , said he _ growth for the _ _ maker in Britain and Europe , and in _ _ markets .”
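The underscore replacement used for the two sentences above can be sketched as follows. This is an assumption about the mechanics only: the paper’s exact preprocessing is not clear, the three-sentence corpus is made up, and `mask_rare_words` is a hypothetical helper.

```python
from collections import Counter

def mask_rare_words(sentences, top_k, mask="_"):
    """Replace every word outside the top_k most frequent words with mask."""
    counts = Counter(word for sent in sentences for word in sent)
    keep = {word for word, _ in counts.most_common(top_k)}
    return [[w if w in keep else mask for w in sent] for sent in sentences]

# Tiny made-up corpus; a real run would use the treebank and top_k=2000.
corpus = [
    "the company said sales rose".split(),
    "the company said profits fell".split(),
    "the maker said sales fell".split(),
]

masked = mask_rare_words(corpus, top_k=5)
for sent in masked:
    print(" ".join(sent))
```

Even on this toy corpus, the words that carry most of the content are the first to disappear, which is what the two treebank sentences above show at scale.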

For this more complex grammar, they evaluate the model by looking at the likelihood assigned by the grammar to their generated sample.
I’m not sure what’s the purpose of this evaluation and what it tries to show — it clearly does not measure the quality of the generation — but they say that:

“While such a measure mostly captures the grammaticality of a sentence, it is still a reasonable proxy of sample quality.”

Well, no. Corpus-derived PCFGs like that do not capture the grammaticality of sentences at all, and this is not a reasonable proxy of sample quality if you care about generating realistic natural language.
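One way to see why PCFG likelihood can miss grammaticality: in a plain PCFG each rule is applied independently of its context, so the verb phrase never sees whether the subject was singular or plural. The sketch below uses a made-up toy grammar with Penn-Treebank-style tags (it is not the grammar from the paper, and `RULES`, `log_prob`, `noun_phrase` and `sentence` are hypothetical names) and scores two derivations.

```python
import math

# Made-up toy PCFG: rule probabilities keyed by (left-hand side, right-hand side).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.5,    # singular noun phrase
    ("NP", ("DT", "NNS")): 0.5,   # plural noun phrase
    ("VP", ("VBZ", "NP")): 0.5,   # singular verb
    ("VP", ("VBP", "NP")): 0.5,   # plural verb
    ("DT", ("the",)): 1.0,
    ("NN", ("dog",)): 1.0,
    ("NNS", ("dogs",)): 1.0,
    ("VBZ", ("chases",)): 1.0,
    ("VBP", ("chase",)): 1.0,
}

def log_prob(tree):
    """Log-probability of a derivation: the sum of its log rule probabilities."""
    if isinstance(tree, str):  # a bare word has no rule of its own
        return 0.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return math.log(RULES[(label, rhs)]) + sum(log_prob(c) for c in children)

def noun_phrase(tag, noun):
    return ("NP", ("DT", "the"), (tag, noun))

def sentence(subject, verb, obj):
    return ("S", subject, ("VP", verb, obj))

# "the dog chases the dogs": subject and verb agree.
agrees = sentence(noun_phrase("NN", "dog"), ("VBZ", "chases"),
                  noun_phrase("NNS", "dogs"))
# "the dogs chases the dog": subject and verb clash.
clashes = sentence(noun_phrase("NNS", "dogs"), ("VBZ", "chases"),
                   noun_phrase("NN", "dog"))

print("agreeing sentence:", log_prob(agrees))
print("clashing sentence:", log_prob(clashes))
```

Both derivations use the same multiset of rules, so this grammar gives them the same score, agreement error and all. Richer annotations can reduce the problem, but the sketch shows why a high likelihood under such a grammar is weak evidence of a well-formed sentence.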

These guys should really have consulted with someone who worked with natural language before.

They also use a Chinese Poetry corpus. Well, that’s natural language, right? Yes, aside from the fact that they don’t look at complete poems, but only at separate lines from these poems, where each line is treated in isolation. AND they only use lines of length 5 and 7. AND they don’t even look at the generated lines, but evaluate them using BLEU-2 and BLEU-3. For those of you who do not know BLEU, BLEU-2 roughly means counting the number of bigrams (two-word sub-sequences) that they generate that also appear in the reference text, and BLEU-3 means counting the number of three-word sub-sequences. They also have a weird remark about evaluating each generated sentence against all the sentences in the training set as a reference. I didn’t fully get that part, but it’s funky, and very much not how BLEU should be used.
They say this is the same setup that the previous GAN-for-language paper they evaluate against uses for this corpus.
But of course.
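For readers who want the rough idea of the n-gram counting in concrete form, here is a minimal sketch of clipped n-gram precision against a single reference. It is deliberately not full BLEU (no brevity penalty, no combination across n-gram orders, no multiple references), the example sentences are made up, and `ngrams` and `ngram_precision` are hypothetical helpers.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sub-sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped, so a repeated n-gram is credited at most as many
    times as it appears in the reference.
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()

bigram_precision = ngram_precision(candidate, reference, 2)
trigram_precision = ngram_precision(candidate, reference, 3)
print("bigram precision:", bigram_precision)
print("trigram precision:", trigram_precision)
```

A score like this only says how many short word windows were seen before. With the whole training set as the reference, as the remark quoted above suggests, it mostly rewards copying frequent fragments.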

On the simple grammar (52 production rules, vocab of 45 words), their model was able to fit 5 word sentences (wow), and their more complex models almost, but not quite, managed to fit the 11 word ones.

The Penn Treebank sentences were not really evaluated, but by comparing the sample likelihood over epochs we can see that it is going down, and that one of their models achieves better scores than some GAN baseline called MLE which they don’t fully describe, but which appeared in previous crappy GAN-for-language work. Oh, and they generate sentences of length 7.
I already said before that the likelihood under a PCFG is pretty much meaningless for evaluating the quality of the generated sentences. But even if you care about this metric for some reason, I bet a non-GAN baseline like an untuned Elman RNN will fare way, way, way better on this metric.

The Chinese Poetry generation test again compares results only against the previous GAN work, and not against a proper baseline, and reports maximal BLEU numbers of 0.87. BLEU scores are usually > 10, so I’m not sure what’s going on here, but in any case their BLEU setup is weird and meaningless to begin with.

And then we get tables 3, 4, and 5 in which they show actual generated sentences from their model. I hope they are not cherry-picked, but they are all really bad.
I quoted some samples above, but here are a few more:

* I’m at the missouri burning the indexing manufacturing and through .
* Everyone shares that Miller seems converted President as Democrat .
* can you show show if any fish left inside .
* cruise pay the next in my replacement .
* Independence Unit have any will MRI in these Lights

Somewhere when referring to the tables containing these sentences, the paper says:

“The CNN model with a WGAN-GP objective appears to be able to maintain context over longer time spans”.

“Adversarial generation of natural language”, indeed.

A Plea

I do not think this paper could get into a good NLP venue. At least I would like to hope so. But, sadly, I do believe it could easily get into a machine learning venue such as ICLR, NIPS or ICML, despite (or maybe because of) the gross overselling, and despite having no meaningful evaluation, and no meaningful language generation results.
After all, “Controllable Text Generation” by Hu et al got accepted into ICML, with many similar flaws. To be fair to Hu et al, I think their work is much better than the one discussed here. But the current one is about GANs and adversaries, which are sexier, so it probably has about the same chance.

So here is my plea to reviewers, especially ML reviewers, if you got this far in reading this. When evaluating a paper about natural language, try keeping in mind that language is complex, and that people have been working on it for a while. Don’t allow yourself to be bullshitted by large claims pretending to solve it, while actually doing tiny, insignificant, toy problems. Don’t be swayed by sexy models if they don’t solve a real problem, or when a much simpler, much more scalable baseline exists, but is not mentioned. And don’t be impressed by the affiliations of authors on arxiv papers, or their reputations, especially when they attempt to work
in a problem domain that they (and you...) don’t really understand. Try to learn the subject and think things through (or ask for someone else to review it). Look at their actual evaluations, not their claims.
And please, oh please, don’t request NLP researchers who do work in a realistic setup and a nuanced task to compare themselves to setups and evaluations that were established by “pioneering” poor quality work just because it exists on arxiv or at some ML conference.

To be clear, I’m OK with working on simplified toy setups for really hard problems, this is a good way to get progress. But such work needs to
be clear about what it’s doing, and not claim to do something it is clearly not.

And for authors: respect language. Get to know it. Get to appreciate the challenges. Understand what the numbers you report are measuring, and if they really fit what you are trying to show. Look at the datasets and resources you are using, ffs, and understand what you are doing. If you attempt to actually work with language, do so in a realistic setup (at least full length sentences, and a reasonable vocabulary size), and consult with someone who knows this area. If you do not care about language or don’t attempt to solve a real language task, and really only care about the ML part, that’s fine too. But be honest about it, to your audience and to yourself, and don’t pretend that you do. And, in either case, position your work in the context of other works. Note the obvious baselines. And most importantly, acknowledge the limitations of your work, in the paper. That’s a strength, not a weakness. And that’s how science progresses.