Unbabel AI: Highlights from ACL 2019

André Martins
10 min read · Aug 23, 2019
Unbabelers at ACL.

The Annual Meeting of the Association for Computational Linguistics (ACL) was held in Florence, Italy, from July 28th to August 2nd. ACL is the top conference in Computational Linguistics and Natural Language Processing, and this year it was co-located with the Conference on Machine Translation (WMT), which made it even more interesting. Unbabel had a strong presence at both ACL and WMT.

We had a great time and thought we’d share some of our impressions from the conference.

The Unbabel AI event at ACL.

Latent Structures, Compression, and Interpretability

The first day was devoted to tutorials. Together with our collaborators from DeepSPIN (Vlad Niculae, Tsvetomila Mihaylova, and our visitor from NYU Nikita Nangia), we presented a tutorial on Latent Structure Models for NLP.

Three reasons why we like latent variable models are:

  • Language is structured. Latent variables give us an opportunity to inject prior knowledge into the model as a structured inductive bias.
  • By inspecting the latent variables, we can see inside the “black box” and get some interpretability — for example, we may understand better what triggers a wrong decision and use this information to design a better model. Or, if we have a system where humans and machines work in tandem (as we do at Unbabel), we can use these explanations in a symbiotic manner.
  • Latent variable models provide a way of “compressing” information through a bottleneck. This may lead to smaller models that are less expensive to train and therefore consume less energy. This will likely be an important concern in the years to come, and an exciting research direction. Hopefully, bringing latent structure inside neural network models may get us closer to something like “Transformer-XS” as opposed to “Transformer-XL.”

The tutorial covered several strategies for dealing with discrete latent variables, including methods based on stochastic variables and reinforcement learning, gradient surrogates such as straight-through estimators, and continuous relaxation methods such as structured attention networks and SparseMAP. We provided a unifying perspective over these methods, and the building blocks they require.
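As one concrete example of these building blocks, here is a minimal PyTorch sketch of a straight-through estimator for a discrete argmax choice: the forward pass is discrete, while the backward pass reuses the softmax gradient. This is a generic illustration, not code from the tutorial.

```python
import torch
import torch.nn.functional as F

def straight_through_argmax(logits: torch.Tensor) -> torch.Tensor:
    """Forward: a hard one-hot argmax "decision"; backward: the gradient of the underlying softmax."""
    probs = F.softmax(logits, dim=-1)
    index = probs.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    # The hard one-hot is used in the forward pass, but its gradient is taken to be that of `probs`.
    return hard + probs - probs.detach()

logits = torch.randn(2, 5, requires_grad=True)
z = straight_through_argmax(logits)        # discrete latent choice per example
loss = (z * torch.randn(2, 5)).sum()       # any downstream loss that consumes z
loss.backward()                            # logits.grad is populated despite the argmax
```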

Overview of Latent Structure Models for NLP.

Related to one of the points above, Energy and Policy Considerations for Deep Learning in NLP quantifies the financial and environmental costs of training state-of-the-art NLP models. The numbers are quite scary: training a big Transformer with neural architecture search emits 6 times as much carbon as a car does over its average lifetime, fuel included. Training BERT on GPUs has roughly the same footprint as a trans-continental (US coast-to-coast) flight. More efficient strategies for hyperparameter tuning (Bayesian or random search) can alleviate this, but will it be enough? I would like to see more research on energy-efficient models.

As at EMNLP last year, interpretability and explainability were hot topics. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned does a great job of examining what the different attention heads do in Transformer models. It turns out many are redundant and can be pruned without harming performance. Similarly, in What Does BERT Look At?, presented at the BlackboxNLP workshop, the authors look at the 144 attention heads of BERT (12 layers times 12 heads per layer) to understand what BERT is doing when transferred to a downstream task. Some heads attend to positional offsets (the previous or next word), heads in higher layers tend to attend heavily to the separator tokens, and others seem to capture some syntactic dependency relations remarkably well.
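To illustrate the pruning idea in a much-simplified form (the paper itself uses stochastic gates with an L0 penalty), one can attach a scalar gate to each attention head and encourage the gates to collapse to zero; heads whose gates vanish can then be removed. The module below is a generic sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Scalar gate per attention head; heads whose gate is driven to zero are effectively pruned.
    A much-simplified sketch: the ACL paper uses stochastic (hard concrete) gates with an L0 penalty."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        return head_outputs * self.gates.view(1, -1, 1, 1)

    def sparsity_penalty(self) -> torch.Tensor:
        # Encourages gates (and hence heads) to switch off; add this term to the training loss.
        return self.gates.abs().sum()
```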

Our paper, Sparse Sequence-to-Sequence Models, uses sparsity both in the attention mechanisms and in the output layer for machine translation and morphological inflection tasks (we had a follow-up paper at SIGMORPHON, awarded the Interpretability Prize). We propose a new family of transformations called “entmax,” parametrized by a scalar alpha: when alpha = 1, we recover softmax; when alpha = 2, we recover sparsemax; for any alpha > 1, the transformation is “sparse,” meaning that it can assign exactly zero probability to some outputs. The best results were obtained with alpha between 1 and 2. (Note: we have a follow-up paper appearing at EMNLP where we learn alpha to obtain adaptively sparse attention heads in Transformers; stay tuned.)

Translation hypotheses with non-zero probability produced at each time step by a sparse sequence-to-sequence model, for the German source sentence, “Dies ist ein weiterer Blick auf den Baum des Lebens.” When consecutive predictions consist of a single word, we combine their borders to showcase auto-completion potential. The selected gold targets are in boldface (taken from Sparse Sequence-to-Sequence Models).
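To make the alpha = 2 case concrete: sparsemax is a Euclidean projection onto the probability simplex, and it has a simple closed-form solution. Below is a minimal NumPy sketch for a single score vector; it is illustrative only, not the entmax implementation used in the paper.

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Projection of z onto the probability simplex (the alpha = 2 member of the entmax family).
    Returns a probability distribution that is typically sparse."""
    z_sorted = np.sort(z)[::-1]                  # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1          # coordinates kept in the support
    k_z = support.sum()                          # support size
    tau = (cumsum[k_z - 1] - 1) / k_z            # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([1.2, 0.9, 0.1, -0.4])))   # [0.65 0.35 0.   0.  ]: sparse, sums to 1
```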

Is Attention Interpretable? is part of a series of papers questioning the ability of attention mechanisms to lead to better interpretability. As an alternative to gradient analysis, this paper ablates features one at a time to measure their “informativeness,” starting with the features that receive the largest attention weights. The results are a bit mixed: they suggest that attention weights somewhat predict the importance of input features to the model decision, but that they are by no means a fail-safe indicator. Other interesting papers related to interpretability include Saliency-driven Word Alignment Interpretation, which derives word alignments from “saliency,” defined as the gradient of the generated word with respect to each source word; Towards Explainable NLP: A Generative Explanation Framework for Text Classification, where the model learns jointly to predict and to explain its prediction; and Interpretable Neural Predictions with Differentiable Binary Variables, which proposes a latent model that mixes discrete and continuous behavior via the so-called “Hard Kuma” distribution.
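The ablation procedure can be sketched roughly as follows; `predict` is a hypothetical placeholder for the trained classifier, and this is only a simplified rendering of the paper's protocol.

```python
import numpy as np

def ablate_by_attention(predict, inputs, attention):
    """Zero out input features one at a time, in decreasing order of attention weight,
    and report how many ablations it takes to flip the model's decision.
    `predict` is a hypothetical stand-in mapping a feature vector to a class label."""
    x = np.asarray(inputs, dtype=float).copy()
    original = predict(x)
    for n_removed, i in enumerate(np.argsort(-np.asarray(attention, dtype=float)), start=1):
        x[i] = 0.0                     # ablate the next most-attended feature
        if predict(x) != original:     # the decision flipped after n_removed ablations
            return n_removed
    return None                        # the decision never flipped
```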

ACL banquet at the Fortezza da Basso.

Simultaneous Translation

Liang Huang, from Baidu Research, gave a nice invited talk on the recent advances and remaining challenges of simultaneous translation: the goal is to generate a translation in real time, maintaining both short latency (like human simultaneous interpretation) and good quality (like human written translation). This is a very challenging task: even professional interpreters can only recover, on average, 60% of the source. A big challenge is the word order difference between source and target languages: SOV is common in Japanese and German, SVO in English, and a mix of the two in Chinese. In 2018, Baidu achieved a practical breakthrough on this task by using a simple “prefix-to-prefix” approach (with a fixed latency of k words), which allows controllable latency at test time and implicit anticipation on the target side. Fixed latency, though, has some limitations: it can be either too aggressive (small k) or too conservative (large k, high latency). There were two papers at ACL proposing adaptive latencies, one from Google and another from Baidu. The task is far from solved, though: current challenges include coping with ASR noise (homophones), code-switching, direct speech-to-speech translation (with no intermediate text), and creating better datasets for training.
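For concreteness, the fixed-latency (“wait-k”) read/write schedule behind the prefix-to-prefix approach can be sketched as follows; `translate_prefix` is a hypothetical placeholder for the underlying NMT model, not Baidu's implementation.

```python
def wait_k_decode(source_tokens, k, translate_prefix, max_len=100):
    """Fixed-latency ("wait-k") simultaneous decoding schedule.
    `translate_prefix(src_prefix, tgt_prefix)` is a hypothetical stand-in for the NMT model:
    given the source read so far and the target emitted so far, it returns the next target token."""
    target = []
    for t in range(max_len):
        num_read = min(t + k, len(source_tokens))           # source prefix available at step t
        next_token = translate_prefix(source_tokens[:num_read], target)
        if next_token == "</s>":                            # end of sentence: stop writing
            break
        target.append(next_token)
    return target
```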

Longer Contexts, Deeper, Bigger Transformers

An important open problem with Transformers is how to expand their attention spans without blowing up the memory footprint. Adaptive Attention Span in Transformers proposes a novel self-attention mechanism that learns the optimal attention span, significantly extending the maximum context size used in Transformers while maintaining control over memory footprint and computation time. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context significantly expands the context size used in Transformers by caching previously encoded hidden states (made possible by using relative positional attention). This overcomes a limitation of vanilla Transformers (no information flow across segments, i.e., limited memory), and the increased context leads to state-of-the-art results in language modeling. There is a significant amount of follow-up work on Transformer-XL, including the XLNet model. The system (from Microsoft) that achieved the best results in the English-German document-level news translation task at WMT, Towards Large-Scale Document-Level Neural Machine Translation, also encodes large contexts of up to 1,000 subwords, significantly outperforming a sentence-level system in terms of both automatic and human evaluation. Finally, Learning Deep Transformer Models for Machine Translation outperforms a Transformer-Big model (which is wide but not very deep) through a careful use of layer normalization and a novel way of passing the combination of previous layers to the next one (“dynamic linear combination of layers”).
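The segment-level recurrence at the core of Transformer-XL can be sketched like this: cached states from the previous segment are prepended to the current keys and values, with gradients stopped at the cache. This is a deliberate simplification; the actual model caches layer hidden states and relies on relative positional encodings and causal masking, all omitted here.

```python
import torch

def attend_with_memory(q, k, v, mem_k=None, mem_v=None):
    """Single-head attention over the current segment plus cached states from the previous segment.
    q: (batch, q_len, dim); k, v: (batch, seg_len, dim); mem_k, mem_v: cached keys/values.
    Gradients are stopped at the memory; relative positions and causal masking are omitted."""
    if mem_k is not None:
        k = torch.cat([mem_k.detach(), k], dim=1)   # prepend cached keys to extend the context
        v = torch.cat([mem_v.detach(), v], dim=1)   # prepend cached values
    scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# After processing a segment, cache its keys/values as memory for the next segment:
# mem_k, mem_v = k.detach(), v.detach()
```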

The Duomo as seen from our rooftop.

Avoiding Exposure Bias and Min-Risk Training

The standard way of training neural machine translation models is to use perplexity (negative log-likelihood) as the loss function. This leads to exposure bias: during training, models are given the reference translation as context rather than their own predictions (this is called “teacher forcing”). This is different from what happens at test time, when each decision is conditioned on the history of past decisions made by the model itself. This mismatch between training and test-time conditions can lead to error propagation and suboptimal performance, making it hard to recover from early mistakes. Bridging the Gap between Training and Inference for Neural Machine Translation (awarded the best paper) puts together some previously known ideas with some additional tricks. At training time, it generates word-level oracles (with Gumbel-max sampling) and sentence-level oracles (by reranking the candidate translations in the beam with a sentence-level metric). Then, it randomly decides whether to pick the context from the ground truth or from the oracle by sampling with decay, as in scheduled sampling. We used a similar technique in the Student Research Workshop paper Scheduled Sampling for Transformers.
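For reference, the scheduled-sampling ingredient boils down to a coin flip per decoding step, with the probability of feeding the gold token decaying over training; the inverse-sigmoid decay and the helper names below are illustrative choices, not the papers' exact setup.

```python
import math
import random

def teacher_forcing_prob(step: int, k: float = 3000.0) -> float:
    """Inverse-sigmoid decay: close to 1 early in training (mostly gold context), decaying towards 0."""
    return k / (k + math.exp(step / k))

def choose_context_token(gold_token, predicted_token, step: int):
    """Scheduled sampling: feed the gold token with decaying probability, else the model's own prediction."""
    return gold_token if random.random() < teacher_forcing_prob(step) else predicted_token
```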

In Beyond BLEU: Training Neural Machine Translation with Semantic Similarity, the authors propose “SIMILE,” a semantic similarity metric for NMT, and use minimum risk training to optimize NMT models directly for this metric. Unlike BLEU, this metric encourages diversity and accounts for semantic textual similarity, leading to better performance. In Self-Regulated Interactive Sequence-to-Sequence Learning, a self-regulation training strategy is proposed that allows different types of feedback (corrections, error markups, and self-supervision) to have different costs and effects on learning. The problem is cast as a learning-to-learn problem, leading to improved cost-aware sequence-to-sequence learning. Learning Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss addresses the case where supervision by gold labels is not available and consequently neural models cannot be trained directly by maximizing likelihood. The paper presents several objectives for two separate weakly supervised tasks, machine translation and semantic parsing, showing that objectives should actively discourage negative outputs in addition to promoting a surrogate gold structure; this “bipolarity” is naturally present in ramp loss objectives and motivates a novel token-level ramp loss objective.
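The minimum risk training objective behind this kind of metric optimization can be sketched in a few lines: sample a set of hypotheses, renormalize their model scores into a distribution Q over that set, and minimize the expected cost under Q. This is the generic formulation, not the paper's exact implementation.

```python
import torch

def minimum_risk_loss(log_probs: torch.Tensor, costs: torch.Tensor, alpha: float = 0.005) -> torch.Tensor:
    """Expected cost of a set of sampled hypotheses under the renormalized model distribution Q.
    log_probs: (num_hyps,) model log-probabilities of the sampled translations (with gradients).
    costs:     (num_hyps,) task costs, e.g. 1 - SIMILE or 1 - sentence-level BLEU vs. the reference.
    alpha:     sharpness of Q over the sampled subset."""
    q = torch.softmax(alpha * log_probs, dim=-1)   # Q(y|x) restricted to the sampled hypotheses
    return (q * costs).sum()                       # differentiable expected risk to minimize
```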

Translation Quality Estimation

As argued in a previous article, quality estimation is the missing piece in machine translation. The ability to provide a confidence/quality measure for translated text can have a key impact when combining human and machine translation, when filtering noisy parallel data, and for quality control.
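Conceptually, sentence-level quality estimation is a supervised regression problem: encode a (source, machine translation) pair and predict a quality score such as HTER, with no reference translation available. The sketch below is a generic illustration with a placeholder encoder; it is not OpenKiwi's API.

```python
import torch
import torch.nn as nn

class SentenceQE(nn.Module):
    """Generic sentence-level quality estimator: encode a (source, MT) pair and regress a quality score.
    `encoder` is a placeholder for any module mapping the pair of token-id tensors to a pooled vector."""
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.regressor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, source_ids: torch.Tensor, mt_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.encoder(source_ids, mt_ids)      # (batch, hidden_dim)
        return self.regressor(pooled).squeeze(-1)      # predicted quality score, e.g. HTER

# Trained with a regression loss (e.g. nn.MSELoss()) against post-edit-derived labels such as HTER.
```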

At ACL, we presented a system demonstration paper, OpenKiwi: An Open Source Framework for Quality Estimation, describing our general-purpose, PyTorch-based open-source toolkit for quality estimation, which we featured in a previous blog post. We’re excited to have received the best demo paper award for this work!

At WMT, we participated in and won the Quality Estimation shared task with a system based on OpenKiwi, which we adapted to incorporate BERT and XLM pre-trained models and for which we developed new ensemble strategies.

Our best demo paper award for OpenKiwi.

Automatic Post-Editing

Automatic Post-Editing (APE) is the task of automatically correcting (fine-tuning) the output of a machine translation system. This is particularly useful when the system was trained on out-of-domain data.
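In terms of data, APE systems learn from (source, machine translation, post-edit) triples; one common setup, sketched below with an illustrative separator token, encodes the source and the MT output jointly and trains a decoder to produce the human post-edit. This is a generic sketch, not our system's exact input format.

```python
def make_ape_example(source: str, mt_output: str, post_edit: str, sep: str = " <SEP> "):
    """One training example for an encoder-decoder APE model: the input jointly encodes the
    source and the MT output, and the target is the human post-edit."""
    return {
        "input": source + sep + mt_output,   # e.g. "Das ist ein Test. <SEP> This is an test."
        "target": post_edit,                 # e.g. "This is a test."
    }
```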

In our ACL paper A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning, we proposed a fast and accurate approach to automatic post-editing that leverages BERT pre-trained models, leading to a new state of the art on this task. Inspired by the approach described in that paper, we participated in and won the Automatic Post-Editing shared task at WMT, achieving the best results according to both automatic metrics and human assessments.

Another interesting paper on this topic at WMT was APE at Scale and Its Implications on MT Evaluation Biases, which demonstrates that a large-scale APE model trained on synthetic data generated with round-trip translations is effective at improving the translation quality of state-of-the-art NMT models, even improving on the best systems from the WMT translation shared task.

Low-Resource NMT and Domain/Dynamic Adaptation

While previous work claimed that NMT models do not outperform their phrase-based SMT counterparts in low-resource settings, Revisiting Low-Resource Neural Machine Translation: A Case Study shows that well-tuned NMT models in such settings can do so. In Domain Adaptive Inference for Neural Machine Translation, the authors investigate adaptive ensemble weighting for NMT with several domain-specific NMT models trained in advance. This improves performance on data from a new and potentially unknown domain without sacrificing performance on the original domain (i.e., avoiding catastrophic forgetting). Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation performs dynamic adaptation using translation memories, via a simple source-concatenation method. Other interesting papers in this space were Domain Adaptation of Neural Machine Translation by Lexical Induction and Training Neural Machine Translation to Apply Terminology Constraints.
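The source-concatenation idea in Neural Fuzzy Repair is simple enough to sketch: retrieve a similar source segment (a fuzzy match) from a translation memory and append its target side to the input, so that a standard NMT model can copy from it. The retrieval function and separator token below are hypothetical placeholders.

```python
def augment_with_fuzzy_match(source: str, translation_memory, retrieve_match, sep: str = " <TM> "):
    """Append the target side of the closest translation-memory match to the source sentence,
    so a standard NMT encoder-decoder can copy from it.
    `retrieve_match(source, translation_memory)` is a hypothetical stand-in (e.g. edit-distance or
    embedding-based retrieval) returning a (matched_source, matched_target, score) triple, or None."""
    match = retrieve_match(source, translation_memory)
    if match is None:
        return source                       # no sufficiently similar segment found; plain NMT input
    _, matched_target, _ = match
    return source + sep + matched_target    # NMT input augmented with the fuzzy-match translation
```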

Sunset in Florence.

To be continued: MT Summit and MT Marathon

All in all, we had a great time at ACL 2019!

And the story is not over yet. This was only the first journey in a busy summer for the Unbabel AI teams: the next stop is the MT Summit in Dublin, and a few days later the MT Marathon in Edinburgh. Stay tuned!


André Martins

VP of AI Research at Unbabel and Invited Professor at the University of Lisbon.