EMNLP’18: Cookie Monsters, Blackboxes, and Document-Level NMT

This year, a total of 10 Unbabel AI researchers descended upon Brussels to attend the Conference on Empirical Methods in Natural Language Processing (EMNLP) and to present 5 papers between us, spanning not just EMNLP but also the co-located workshops WMT and BlackboxNLP.

Three things stood out for us:

Contextual word representations are now everywhere.

This was already pretty clear after the ELMo paper at NAACL this year (which materialized as a gift in the conference bag from AI2), and became even clearer with the recent BERT models released by Google AI and made multilingual just a few days after EMNLP. With 64 TPU chips x 4 days of training time (equivalent to around 100 days on an affordable machine with 4 GPUs), these models are growing to become Cookie Monsters.

There’s a lot of interest in interpreting neural networks.

The BlackboxNLP workshop was a huge success (600+ participants). One of the EMNLP best paper awards went to LISA, a paper that injects linguistic supervision into the self-attention layers of an end-to-end model.

Machine translation is finally moving into document-level.

Several papers are now addressing evaluation, pointing out flaws in sentence-level human parity claims, and looking into document structure and translation of multilingual conversations.

Our friend ELMo. Sadly, by the time he got into the conference bag, he had already been surpassed by his friend BERT.

Neural Machine Translation: What’s Next?

The first two days of EMNLP were devoted to tutorials and workshops, among which WMT, the Conference on Machine Translation.

“Which NMT framework should I use?”

NMT frameworks have proliferated in the last couple of years (OpenNMT, T2T, Sockeye, Fairseq, Marian, …), and with the fast pace at which the field is evolving, an MT practitioner faces a difficult choice. One thing we observed from this year’s shared tasks is that most of the winning systems picked Marian as the underlying NMT system — also our choice at Unbabel for use in production. While the community supporting Marian is (currently) much smaller than those of other frameworks, it’s gaining a lot of traction in the research community since it’s fast, accurate, and always up-to-date with state-of-the-art architectures.

Quality estimation and automatic post-editing help other tasks.

This year, we co-organized the Quality Estimation Shared Task, where Alibaba achieved the highest scores in the word-level and sentence-level tracks, with an improved version of the predictor-estimator model. One drawback — shared by the quality estimation (QE) and automatic post-editing (APE) tracks — is that most of the positive results hold only when the underlying MT system is phrase-based. When we move to NMT (a much more realistic scenario) all scores drop considerably, barely outperforming the baselines (e.g., in APE there was a gain of 6 TER points over the baseline for PBMT versus 0.5 TER for NMT). This is likely where future research in this area will be headed.
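For reference, TER (Translation Edit Rate) is roughly the number of edits needed to turn the MT output into the reference, divided by the reference length, so lower is better and a 6-point gain is substantial. A simplified sketch (real TER also counts block shifts as single edits, which this version ignores):

```python
def ter(hyp, ref):
    """Approximate TER: token-level edit distance / reference length.

    Simplification: full TER also allows phrase shifts at unit cost;
    here we only count insertions, deletions, and substitutions.
    """
    h, r = hyp.split(), ref.split()
    # Standard Levenshtein dynamic program over tokens.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(h)][len(r)] / len(r)
```

With this definition, one wrong token out of a four-token reference costs 0.25 TER.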

A second note is that ideas and models originally proposed for QE and APE are making their way into other tasks. A paper in the main conference, Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement, effectively combines a glass-box form of APE (“iterative refinement”) with non-autoregressive MT, resulting in a cascade of fast decoders that tweak each other’s translations. In another paper, Chollampatt et al. adapt the aforementioned predictor-estimator model to Grammatical Error Correction, setting a new state of the art on the CoNLL-2014 Shared Task (restricted) dataset. Related to this, the authors of Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection devise an encoder-decoder model for generating grammatically wrong sentences to augment the training data for a grammatical error detector.

Data filtering and better use of monolingual data are key.

This year, there was a new Parallel Corpus Filtering task in WMT, whose goal is to filter good quality sentence pairs from a massive crawled parallel corpus. This is extremely important in practice: as many industry practitioners painfully know, one of the main causes of critical translation mistakes (overlooked in BLEU scores) is noisy training data. Current research is only scratching the surface, but we suspect Quality Estimation will play an important role here in the future.
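As a rough illustration of what a first filtering pass looks like, here is a sketch of common cheap heuristics (length limits, length ratio, and untranslated copies). This is illustrative only; real submissions to the task combine such heuristics with much stronger model-based scores.

```python
def keep_pair(src, tgt, max_ratio=2.0, min_len=1, max_len=100):
    """Cheap first-pass heuristics for filtering a noisy parallel corpus."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Drop empty, truncated, or absurdly long segments.
    if not (min_len <= len(src_toks) <= max_len):
        return False
    if not (min_len <= len(tgt_toks) <= max_len):
        return False
    # Drop pairs whose lengths are wildly mismatched (likely misaligned).
    ratio = max(len(src_toks), len(tgt_toks)) / min(len(src_toks), len(tgt_toks))
    if ratio > max_ratio:
        return False
    # Untranslated copies are a very common noise source in web crawls.
    if src.strip() == tgt.strip():
        return False
    return True
```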

In FAIR’s Understanding Back-Translation at Scale, the authors compare beam search against random sampling for generating back-translated examples, and find that random sampling works best. This supports the view of back-translation as an instance of sequence-level knowledge distillation, where one model teaches another (Phrase-Based & Neural Unsupervised Machine Translation, which won one of the best paper awards, showed that this can be applied iteratively with great success).
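The difference between the two generation strategies is easy to sketch. The tiny weighted dictionary below is a hypothetical stand-in for a trained reverse translation model: greedy/beam decoding always emits the mode of the distribution, while sampling also surfaces the lower-probability alternatives, producing more diverse synthetic source sentences.

```python
import random

# Hypothetical "reverse model": per target word, a distribution over source words.
REV_MODEL = {
    "house": [("casa", 0.9), ("lar", 0.1)],
    "big":   [("grande", 0.8), ("enorme", 0.2)],
}

def backtranslate_greedy(sentence):
    """Search-style back-translation: always take the most likely word."""
    return [max(REV_MODEL[t], key=lambda wp: wp[1])[0] for t in sentence]

def backtranslate_sample(sentence, rng):
    """Sampling-based back-translation: draw from the full distribution."""
    return [rng.choices([w for w, _ in REV_MODEL[t]],
                        weights=[p for _, p in REV_MODEL[t]])[0]
            for t in sentence]
```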

Document-level is the future.

Besides the shared tasks, WMT also has a research track. We had a paper on data selection for domain-adapted NMT (a collaboration between Unbabel and DCU), and another one where we propose a new model to translate bilingual conversations — a collaboration with Sameen Maruf and Gholamreza Haffari from Monash University.

This addresses a very realistic problem. Imagine a customer service chat between a Portuguese customer and an English-speaking agent. In chat, messages are usually short and full of cross-references, and often the best translation for an English word has already been uttered by the other speaker, who speaks Portuguese. Current sentence-level MT systems don’t exploit this, and Sameen’s paper takes a first step in that direction.

In fact, going beyond sentences towards document-level MT is an emergent research trend, with many interesting papers in WMT and EMNLP. One big open problem that needs to be solved to accelerate progress in this area is figuring out document-level evaluation. Current automatic metrics such as BLEU overlook phenomena such as pronoun translation and lexical consistency.
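As a toy illustration of why lexical consistency is inherently document-level, consider measuring how consistently each repeated source term is translated across one document. The sketch below assumes the (source, target) pairs come from some word aligner; it is a hypothetical diagnostic, not a metric from any of the cited papers.

```python
from collections import Counter, defaultdict

def lexical_consistency(alignments):
    """alignments: list of (source_term, target_term) pairs from one document.

    For each source term that occurs more than once, return the fraction of
    its occurrences translated with its majority target term (1.0 = fully
    consistent). Sentence-level metrics like BLEU never see this signal.
    """
    by_src = defaultdict(list)
    for src, tgt in alignments:
        by_src[src].append(tgt)
    scores = {}
    for src, tgts in by_src.items():
        if len(tgts) > 1:  # consistency is only defined for repeated terms
            scores[src] = Counter(tgts).most_common(1)[0][1] / len(tgts)
    return scores
```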

Mathias Müller presented an interesting paper on evaluating context-aware pronoun translation, which provides a dataset of contrastive examples. Following the same trend, albeit focusing mostly on intra-sentential phenomena, Šoštaric et al. provide a dataset with multiple pronoun coreference examples while manually evaluating SMT and NMT systems; contrastive examples have also been considered in Targeted Syntactic Evaluation of Language Models. On the other hand, Guillou et al., in the main conference, present some negative results concerning automatic evaluation at the document level.

The good news is that WMT is advocating for document-level human evaluation in future shared tasks. Two related papers, Läubli et al. at the EMNLP main conference and Toral et al. at WMT, question the recent claim that “machine translation has achieved human parity,” showing that those claims no longer hold if we consider the scope of a document. In fact, the acclaimed “human parity” only holds under very specific conditions.

Toral et al. explore these conditions in depth to provide a broader understanding of current MT systems’ limitations, showing that the evaluation in Hassan et al. was biased towards MT output: not only did the dataset contain a vast amount of “translationese”, which introduced most of the bias, but the expertise of the human evaluators was also not properly considered. As a result, Toral et al. propose using professional translators as human evaluators whenever feasible, and urge the community to push for document-level evaluation.

Understanding and Interpreting Neural Networks

One of the big surprises (in a good sense) was the BlackboxNLP workshop, which was totally packed (600+ attendees!). The workshop organizers (Tal Linzen, Afra Alishahi, and Grzegorz Chrupała) did an amazing job putting it together. It’s encouraging to see that so many NLP researchers are interested in the analysis and interpretability of neural networks.

We had a talk in this workshop, presented by Vlad Niculae and co-authored by Ben Peters, on Interpretable Structure Induction via Sparse Attention. It shows how sparse attention mechanisms, such as those obtained from the sparsemax, fusedmax, and SparseMAP transformations, can identify latent structure and accommodate prior knowledge as structural bias, leading to better interpretability. This line of work is also related to another paper Vlad presented in the main conference, Towards Dynamic Computation Graphs via Sparse Latent Structure, which uses SparseMAP as a hidden layer to select a sparse set of latent structures and uses them to create a dynamic computation graph.
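For intuition, sparsemax has a simple closed form: it is the Euclidean projection of the score vector onto the probability simplex. A minimal NumPy sketch of the sorting-based algorithm from the sparsemax paper:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: project scores onto the probability simplex.

    Unlike softmax, which is always strictly positive everywhere,
    sparsemax can assign exactly zero probability, which is what makes
    the resulting attention distributions easy to inspect.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum      # coordinates that stay nonzero
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_max  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)
```

For example, `sparsemax([1.5, 0.2, 0.1])` puts all its mass on the first coordinate, while softmax would still spread some probability over the other two.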

Yoav Goldberg gave a very nice invited talk on Trying to Understand Recurrent Neural Networks for Language Processing, where he enumerated a few fundamental questions that can lead to a better understanding of LSTMs’ capabilities and limitations: What kinds of linguistic structures can be captured by an RNN? How is the model architecture capturing the phenomena that lead to final decisions? When do models fail, and what can’t they do? (E.g., LSTMs can count, while GRUs can’t.) What is the representational power of different architectures? Related to this, in the main conference, Rational Recurrences by Peng et al. proves an equivalence between weighted finite-state automata and several recently proposed constrained RNN variants, shedding some light on recurrent models.

In his invited talk Learning with Latent Linguistic Structure, Graham Neubig described three recent research directions. The first one (“multi-space variational encoder-decoders”) is suitable for learning with latent structured representations; applied to morphological reinflection, it yields meaningful latent continuous variables. “Tree-structured Latent Variable Models” (or StructVAE) applies structured variational auto-encoders to semi-supervised semantic parsing. Finally, “Unsupervised Learning of Syntactic Structure with Invertible Neural Projections” uses structured priors and invertible transformations for part-of-speech and dependency grammar induction.

Several papers in this workshop tried to obtain some interpretability from neural machine translation models, for example using Operation Sequence Models, as in Stahlberg et al., which provide hard alignments and edit information, or studying what is captured by self-attention models. A related paper in EMNLP was Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures, which compares self-attention, RNNs, and CNNs on contrastive translations to see what each captures. The (somewhat surprising) conclusion is that Transformers are actually no better than RNNs at capturing long-range dependencies (e.g., subject-verb agreement), but excel at word sense disambiguation.
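The contrastive-evaluation protocol itself is simple: a model passes a test pair when it scores the correct translation higher than a minimally perturbed, incorrect one. A toy sketch, where the word-set scorer is a hypothetical stand-in for a real model's log-probability:

```python
def contrastive_accuracy(score, pairs):
    """pairs: list of (correct_sentence, contrastive_sentence).

    The model under test passes a pair when it assigns a strictly higher
    score to the correct variant; accuracy is the fraction of passes.
    """
    wins = sum(1 for good, bad in pairs if score(good) > score(bad))
    return wins / len(pairs)

# Hypothetical toy scorer: rewards sentences containing agreement-correct
# word forms. A real evaluation would use an NMT model's log-probability.
GOOD_FORMS = {"are", "sind"}
def toy_score(sentence):
    return sum(word in GOOD_FORMS for word in sentence.split())
```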

In Linguistically-Informed Self-Attention for Semantic Role Labeling (also known as LISA), which won one of the best paper awards, the goal is to provide linguistic supervision within the self-attention layers. By supervising one of the attention heads to attend to each token’s syntactic parent, the authors improve on the state of the art in this task. A nice property of this framework is that it allows injecting high-quality parses at test time to get further performance improvements.

A similar idea was employed in Deriving Machine Attention from Human Rationales: they supervise the attention mechanism with the spans given as rationale for the label by annotators. Interestingly, the authors claim that asking annotators to provide the rationale annotation gets us to a specified accuracy level at a lower cost than simply requesting more labeled data. Our take on the success of these two models is that in low-resource settings, it is hard to learn a good attention mapping. It can then help to introduce domain knowledge to supervise the attention, via either rationales as in this paper, or dependency parses as in the LISA paper.

The Cookie Monster: Cross-Lingual Transfer and Contextual Representations

Although BERT has not yet been presented at any conference (it was only published on arXiv a few days before EMNLP), it was a common conversation topic in the coffee breaks — in particular, the trend of engineering increasingly powerful contextual word representations transferable across tasks, and the consequences this will have for various NLP tasks. Related to this is the ability to obtain shared representations across multiple languages.

XNLI: Evaluating Cross-Lingual Sentence Representations introduces a multilingual version of the Multi-Genre Natural Language Inference Corpus (MultiNLI). The training data is in English, but the test data is provided in 15 different languages, including low-resource languages such as Swahili and Urdu. This may become a benchmark dataset for cross-lingual transfer learning. In fact, a few days after the conference, Google AI released a multilingual version of BERT evaluated on XNLI.

Semi-Supervised Sequence Modeling with Cross-View Training uses labeled and unlabeled data in a co-training fashion. The unlabeled data is used to train auxiliary models that only see part of the input and try to match the predictions of the main supervised model. This highly resembles the masking idea used in BERT, but with a method that the authors claim to be around 100x faster to train. Though there is no direct comparison yet in terms of accuracy, it is a clever take on using unlabeled data.

Unsupervised Cross-lingual Transfer of Word Embedding Spaces is yet another method for aligning word embeddings in different languages. It learns an invertible transformation that minimizes the Sinkhorn distance between the two distributions (an optimal transport distance).


We could not end this post without praising the increasing quality of the tutorials in NLP conferences. Two of the tutorials the Unbabel team enjoyed the most were the Harvard NLP one on Deep Latent Variable Models for Natural Language and the AI2 one on Writing Code for NLP Research.

Deep Latent Variable Models

The Deep Latent Variable (DLV) Tutorial delivered by Harvard’s NLP group really stood out for the clarity of its exposition. It highlighted both the promises and challenges of using DLV models for NLP, and is highly recommended for beginners and seasoned latent variable modelers alike.

“It will be my go-to resource whenever I need to look up the derivation of the famous Evidence Lower Bound for Variational Inference yet again. :)” says Sony Trénous, Junior Research Scientist at Unbabel.

A major challenge in practice is the phenomenon of posterior collapse: the ELBO objective is composed of two terms that pull in opposite directions:

  • a KL term that ties the posteriors to the prior,
  • a reconstruction term that potentially benefits from an informative posterior.

If the KL term is zero, the posterior contains no information and you have a plain non-latent generative model. In theory, there exists an equilibrium where some information is encoded in the posterior to help with reconstruction. But if (as is usually the case) a strong generative model is employed and the latent structure is non-trivial to learn, the optimization procedure can get stuck in an uninteresting local optimum where the posterior encodes no information and the generative model does all the work.
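In symbols (with encoder $q_\phi$ and decoder $p_\theta$), the objective being maximized is the evidence lower bound, whose two competing terms are exactly the ones above:

```latex
\log p_\theta(x) \;\geq\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction term}}
\;-\;
\underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x) \,\big\|\, p(z)\right)}_{\text{pull towards the prior}}
```

Posterior collapse is precisely the regime where the KL term is driven all the way to zero.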

Two approaches to mitigate this were mentioned:

  • Annealing on the KL term to allow the model to learn an informative mapping in the latent space before introducing the pull back to the prior.
  • Using a posterior distribution that allows you to fix the KL divergence as a hyperparameter, trading in this degree of freedom for more stable training (the von Mises-Fisher distribution was given as an example). Since, even with KL annealing, the posterior often transitions from informative to non-informative within a few optimization steps, this is an interesting remedy.

A useful suggestion for researchers is to always report both terms of the ELBO separately with your results. This makes it possible to distinguish improvements due to a better generative model from improvements due to efficient modeling of the latent space.

Writing Code for NLP Research

We were greatly inspired by the Writing Code for NLP tutorial, hearing about the AllenNLP developers’ experience in building high level frameworks. Having been there ourselves, it was encouraging to hear how others produce well-engineered code that is both suitable for production and further research.

The main take-aways were:

  • Write tests, but for new experimental code, just test that it loads data, runs forward and backward, saves, and loads, not that it outputs exact numbers (we expect those to keep changing).
  • Ablation: everything you experiment with in your model should be a configurable input argument to your system (as opposed to a hard-coded value). This will make it a lot easier to understand where gains come from, without added effort.
  • Always write readable code: proper variable and method naming, encapsulation, tensor shapes in comments, etc.
  • What are the right abstractions for NLP?
    • things that you use a lot (training a model, mapping words to indices)
    • things that require a significant amount of code (the training loop)
    • things that have many variations (ways to embed words, seq2seq, etc.)
    • things that reflect high-level thinking (text, tags, spans, etc.)
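The first tip can be made concrete as a shape-only smoke test. The tiny model below is a hypothetical stand-in for whatever experimental model you are iterating on; note that the assertions check shapes and finiteness, never exact numbers:

```python
import numpy as np

class TinyModel:
    """Hypothetical stand-in for a new experimental model."""
    def __init__(self, d_in=4, d_out=2):
        rng = np.random.default_rng(0)
        self.W = rng.normal(size=(d_in, d_out))

    def forward(self, x):          # x: (batch, d_in) -> (batch, d_out)
        return x @ self.W

    def backward(self, grad_out):  # gradient w.r.t. the input
        return grad_out @ self.W.T

def test_smoke():
    """Check the model runs end to end; don't pin exact output values."""
    model = TinyModel()
    x = np.ones((3, 4))
    out = model.forward(x)
    assert out.shape == (3, 2)     # shapes only, not numbers
    assert np.isfinite(out).all()
    grad = model.backward(np.ones_like(out))
    assert grad.shape == x.shape
```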

The presenters announced that they will soon release a public beta of an interesting tool called Beaker for managing datasets and models, which will support Docker (among other things). Looking forward to it!

All in all we had a great time at EMNLP 2018. Here are some other blog posts we enjoyed about the event:

Note: this post has been written collectively by the AI tribe members who attended EMNLP.