Some Highlights from NAACL 2018
I attended NAACL 2018, which ended last week. Here are some trends I observed:
I think progress in summarization is primarily bottlenecked by a lack of good summarization datasets and evaluation metrics, rather than a lack of adequate models and training objectives, so it was nice to see a number of works addressing the former.
On the data side, I was particularly impressed by Newsroom (Grusky et al., 2018), a new news dataset of 1.3 million article-summary pairs, several times larger than the commonly used CNN/Daily Mail dataset. The authors also propose a number of metrics to quantify how extractive and compressive a summary is. These metrics aren’t designed to replace BLEU or ROUGE, but they seem useful: a common problem I’ve heard about anecdotally is that many summarization models, particularly those with some sort of copy mechanism, primarily copy text from the source.
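To make "how extractive and compressive" concrete, here is a minimal sketch of metrics in this spirit. The formulas (coverage as the fraction of summary tokens inside copied fragments, density as the average squared fragment length, compression as the article-to-summary length ratio) reflect my understanding of the paper, and the greedy fragment matcher is a simplification. The function names are hypothetical, and this is not the authors' released code.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest article spans copied at each summary position.

    Both arguments are lists of tokens. This is a simplified matcher,
    assumed for illustration, not the authors' implementation.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Try every article position as a possible start of a shared span.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched span, or past one unmatched token.
        i += max(best, 1)
    return fragments


def summary_stats(article, summary):
    """Return (coverage, density, compression) for one article-summary pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    fragments = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    compression = len(article) / n
    return coverage, density, compression
```

A summary copied wholesale from the source gets coverage 1.0 and a density equal to its length, while a fully abstractive one gets 0 on both, which is what makes these numbers a handy check on copy-heavy models.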
On the evaluation side, there were a number of works proposing alternatives to metrics based on n-gram overlap. Zopf (2018) proposes human ranking between pairs of summaries, while Peyrard and Gurevych (2018) propose training a model to directly predict human scores of summarization quality. I haven’t yet had time to read either of these in depth, but I’m glad people are working on this issue.
Finally, on the modeling side, there was a good deal of work tackling the problem of summaries that are fluent but repetitive, incorrect, or otherwise semantically poor. Work in this vein primarily focused on designing objective functions to encourage better semantic quality.
- Pasunuru and Bansal (2018) train a model to predict the saliency (importance) of words in the source and another model to predict if generated summaries are entailed by the reference summary.
- Celikyilmaz et al. (2018) use a semantic cohesion loss between sentences of the generated summary, formulated as the cosine similarity between the hidden states of the generator at the ends of sentences.
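As a rough illustration of the second item, the sketch below computes cosine similarity between hidden states taken at the ends of consecutive sentences. Pairing consecutive sentences and averaging over pairs are my assumptions, the function names are hypothetical, and plain Python lists stand in for real decoder states. How the score enters the training loss is left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention for zero vectors, assumed here.
    return dot / (norm_u * norm_v)


def cohesion_score(sentence_end_states):
    """Mean cosine similarity between end-of-sentence states of adjacent sentences."""
    pairs = zip(sentence_end_states, sentence_end_states[1:])
    sims = [cosine_similarity(u, v) for u, v in pairs]
    return sum(sims) / len(sims) if sims else 0.0
```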
A closely related trend is greater reliance on reinforcement learning to train models, partly because some of these newly designed objective functions are non-differentiable. On the other hand, it seems standard now to fine-tune your model on ROUGE, despite the fact that Paulus et al. (2017) showed that the small gains in ROUGE come at the cost of summary quality as evaluated by humans.
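To see why reinforcement learning helps with non-differentiable objectives, consider a REINFORCE-style loss with a baseline, sketched below under simplifying assumptions. The reward (ROUGE, say, or a learned scorer) only appears as a scalar weight on the log-probability of a sampled summary, so it never needs a gradient of its own. The function name is hypothetical, and in a real system the log-probabilities would be autograd tensors rather than floats.

```python
import math


def policy_gradient_loss(sample_log_probs, sample_reward, baseline_reward):
    """Surrogate loss: -(r_sample - r_baseline) * sum of token log-probabilities.

    Minimizing it raises the probability of samples that beat the baseline
    and lowers the probability of samples that fall short of it.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)


# Toy usage: a two-token sample that scores above the baseline.
log_probs = [math.log(0.5), math.log(0.25)]
loss = policy_gradient_loss(log_probs, sample_reward=0.6, baseline_reward=0.4)
```

Using the reward of the model's own greedy output as the baseline gives the self-critical variant that is common in summarization work.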
The term “adversariness” conflates two separate but somewhat related threads of research, both of which were carried over from vision:
- Adversarial training, popularized by GANs, uses a discriminator to distinguish between the outputs of a generator model and real data, forcing the generator to output examples similar to real data. Examples of this direction at NAACL include applications to machine translation, question answering, knowledge base completion, and domain adaptation (almost always for text classification).
- Work on adversarial examples, popularized by Szegedy et al. (2013) and in NLP by Jia and Liang (2017), broadly seeks to develop (usually programmatically) examples that break machine systems despite being easy for humans. Examples of this direction at NAACL include applications to POS tagging and reading comprehension.
Both of these types of adversariness were well-represented at NAACL, though NLP research tends to focus more on the former than the latter. This disparity is partly due to the difficulty of directly transferring techniques in vision for generating adversarial examples to language. Also, it is not immediately clear why adversarial text examples would be more dangerous (e.g. getting past a spam filter versus making a self-driving car fail to stop) than are adversarial vision examples, beyond testing and breaking model robustness.
Of papers utilizing adversarial training at NAACL, I particularly liked Discourse-Aware Neural Rewards for Coherent Text Generation (Bosselut et al., 2018), though the authors would probably say this setting is cooperative and not adversarial. An open problem in text generation is that generated text is often locally fluent but globally incoherent. It is difficult to write down a mathematical definition of coherence or design an objective function to encourage it, so instead the authors propose to train a discriminator to score coherence. They use some clever heuristics so that they do not need human annotations, and train the generator using reinforcement learning. I think training a model to learn an objective function of human judgments is a promising direction, though Percy Liang at the GenDeep workshop had an interesting thought experiment about it: if you could do human-in-the-loop training for coherence or discourse infinitely long, would that solve text generation?
Of papers utilizing adversarial examples at NAACL, I particularly enjoyed Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (Iyyer et al., 2018). In this work, the authors generate adversarial examples for sentiment analysis by training a network to generate syntax-varying paraphrases of an input sentence. They train this network by:
- taking a Czech-English bitext
- backtranslating (a trendy thing to say/do) the Czech sentences into English to create an English paraphrase corpus (originally done by Wieting and Gimpel, 2017)
- labeling the translations using a parser to create an English paraphrase corpus with constituency parse annotations
- training the paraphrase network to generate a paraphrase with the target syntax using the parse annotations.
The result of this process is a network for generating paraphrases that can be controlled by also inputting the desired syntax. They verify that the examples generated by their paraphrase network are largely paraphrastic and have the desired syntax. In experiments, their method out-fools baselines.
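The data-construction steps can be sketched as a small pipeline. Everything here is hypothetical scaffolding: `translate_cs_to_en` and `parse_en` stand in for a real translation system and a constituency parser, and treating the backtranslation as the input side with the original English sentence as the target is my assumption about the direction, not a detail confirmed by the paper.

```python
def build_paraphrase_training_data(bitext, translate_cs_to_en, parse_en):
    """Turn (czech, english) pairs into syntax-annotated paraphrase examples.

    `translate_cs_to_en` and `parse_en` are caller-supplied callables
    (hypothetical stand-ins for an MT system and a constituency parser).
    """
    examples = []
    for czech, english in bitext:
        # Backtranslating the Czech side yields an English paraphrase of `english`.
        backtranslation = translate_cs_to_en(czech)
        examples.append({
            "source": backtranslation,
            "source_parse": parse_en(backtranslation),
            "target": english,
            # The paraphrase network is conditioned on the target's syntax.
            "target_parse": parse_en(english),
        })
    return examples


# Toy usage with stub components in place of real models.
toy_bitext = [("kocka spi", "the cat sleeps")]
toy_data = build_paraphrase_training_data(
    toy_bitext,
    translate_cs_to_en=lambda sentence: "a cat is sleeping",
    parse_en=lambda sentence: "(S " + sentence + ")",
)
```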
This approach is neat for a number of reasons. First, it’s fun and satisfying to see a work draw on so many different subfields of NLP (using a parser to annotate translation data to train a controllable paraphrase network to break a sentiment model). Second, a lot of researchers recently have pointed out the brittleness of many NLP systems; controllable generation seems like a useful way to cheaply and quickly generate data with a wider range of linguistic variation (although this becomes a chicken-and-egg problem).
Bonus: comments from Sam Bowman
Sam is one of my advisors at NYU and always has insightful things to say. Here is a selection of things he said about NAACL:
- ELMo works really well! (From the ELMo paper and many workshop papers/results mentioned in person)
- fastText basically does everything we could want for word encoding in most languages. The subword/character workshop saw plenty of methods that didn’t beat fastText encoding by any real margin.
- People are thinking a lot more about bias in models. (See: The two coref papers with identical titles.)
- People are a lot more skeptical about improvements in test set results (when presented without more context). See: GenDeep.
NYU at NAACL 2018
Finally, I didn’t present any work, but friends and colleagues at NYU had some great work presented at NAACL:
- Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages (Kann et al., 2018)
- A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (Williams et al., 2018)
- ListOps: A Diagnostic Dataset for Latent Tree Learning (Nangia and Bowman, 2018)
- Do latent tree learning models identify meaningful structure in sentences? (Williams et al., 2018)
- Training a Ranking Function for Open-Domain Question Answering (Htut et al., 2018)