ICML+ACL’18: Structure Back in Play, Translation Wants More Context

André Martins
Aug 17, 2018 · 16 min read

​​A few weeks ago I attended the (ICML 2018) in Stockholm and, right after, the (ACL 2018) in the opposite side of the world: Melbourne. Interestingly, the combination of temporal proximity and geographical distance between these two conferences is becoming a tradition — last year it was ICML in Australia, and ACL in Canada.​​

​​This year, we presented and , and were in the . This post shares some of my thoughts about both conferences.​​

ICML’18: Structured, Deep, Generative​​

Like , ICML is growing really fast, with 10 parallel tracks (it used to be 4 when I first attended ICML 10 years ago). Not surprisingly, a large fraction of the presented papers had to do with neural networks — their architecture, their training, their learned representations. A few notes follow, with a focus on structured prediction, deep generative models, and representation learning.

View of central Stockholm from Södermalm, close to the ICML venue.

Structured Prediction

​​In the Structured Prediction track, presented our paper (a joint work withand ) — a new technique for structured inference that outputs a sparse set of structures, as opposed to a single one (as in MAP inference) or a dense distribution over all structures (as in marginal inference).

For example: the model may return only a handful plausible parse trees for a sentence, assigning zero probability to all the other, implausible, ones. SparseMAP is differentiable (we can plug it as a hidden layer in a neural network and run the usual gradient backpropagation) and efficient (thanks to an active set algorithm that evaluates SparseMAP by solving a sequence of MAP problems). We can regard it as the structured variant of . Pytorch is provided along with the paper.​​

​​Before Vlad, Nataly Brukhim gave a nice talk on , which plugs cardinality constraints for multi-label classification, handled through Dykstra’s projection algorithm. Interestingly, I suspect this model can be regarded as a special case of our SparseMAP framework — the constraint corresponds to a “budget” factor, solvable in linear time and with closed-form gradient (likely more efficient than their proposed approach of unrolling Dykstra’s). Another was presented the day after by : a way of using sparsemax in the update equations of dynamic programming algorithms, arriving at differentiable variants in-between sum-product and max-product.​​

SparseMAP as a structured version of sparsemax (figure extracted from ). Left: in the unstructured case (e.g. multi-class classification), returns a sparse distribution over classes, represented as a point in the boundary of the simplex. Right: in the structured case, returns a sparse combination of structures (e.g. dependency parse trees), illustrated as a boundary point of the marginal polytope.
Example of two parse trees returned by SparseMAP for an ambiguous sentence. All other trees get zero probability (figure extracted from ).

​​Deep Generative Models

​​Deep generative models, most prominently variational auto-encoders (VAEs) and generative adversarial networks (GANs), are in great hype these days. gave a beautiful invited talk () motivating deep generative models through the lens of information theory, energy minimization, and the minimum description length principle, all rooted in the equation F (free energy) = E (energy) — H (entropy).​​

​​Several conference papers presented new insights in VAEs and GANs. VAEs are usually trained by maximizing the evidence lower bound (ELBO), a tractable variational approximation to the likelihood. It has been observed, however, that this procedure often gives poor latent representations. As an alternative, proposes variational bounds of the mutual information between the input and latent variables, which are more general than ELBO. The result is a full rate-distortion curve that trades off compression and reconstruction accuracy, from which we can seek the Pareto optimal solutions.

Two other papers propose techniques to reduce the amortization gap of VAEs (this gap quantifies the suboptimality of the variational parameters found by inference networks, i.e., the encoder part of the VAE): proposes an iterative strategy akin to learning-to-learn, which augments inference networks with an extra loop to update the variational parameters with stochastic gradients, while proposes an hybrid approach between stochastic and amortized variational inference, leveraging differentiable optimization.​​

​​As for GANs, the central idea is, rather than approximating maximum likelihood training (equivalently KL divergence minimization), to replace it by a different objective: training the generator to create samples that fool a supervised discriminator. This boils down to a Jensen-Shannon divergence minimization problem (the original GAN paper) or the earth mover’s distance (as in Wasserstein GANs).​​

​​An advantage of GANs over other generative models is that they tend to generate sharp outputs (useful when multi-modal distributions are desired); however, they also suffer from mode collapse (meaning that they tend to generate only from a few modes). Their training dynamics, which boil down to finding a Nash equilibrium in a minimax problem, is not yet fully understood. gave a nice presentation shedding some light on this, trying to answer the question , providing an analysis of when GANs converge by analyzing the spectrum of the Jacobian of the gradient field of the GAN objective.​​

​​Another challenge (very relevant to us NLP researchers) is to make GANs generate discrete data like text (most research on GANs has focused on continuous outputs like images). This is much more challenging, as the GAN training set-up, based on alternating gradient updates, requires the output of the generator to be differentiable, hence continuous. Previously proposed solutions include policy gradient methods like REINFORCE or the Gumbel-softmax reparametrization trick. proposes an alternate solution, by combining a discrete auto-encoder with GAN-regularized continuous latent representations, with interesting results in textual style transfer tasks. It feels, however, that there is still a lot to do in this space.​​

​​There was also that I unfortunately could not attend, as I was flying to ACL.​​

​​Sequence to Sequence

​​Sequence-to-sequence learning has been dominated by autoregressive models (e.g. in a recurrent decoder, each emitted output symbol is fed back as input to the LSTM in the next time step, creating a dependency on previous output symbols). Non-autoregressive sequence-to-sequence models are an active topic of research — if they worked, decoders could be parallelized and made much faster. Existing work accomplishes this, but typically with or the need for .

​​A Google AI paper, , proposes a less autoregressive approach, where a small number of latent variables are generated autoregressivelly (on top of a Transformer network), and then the target words are decoded in parallel, in a non-autoregressive manner, conditioned on the latent variables and the source words (there seems to be more recent work which combines this approach with distillation, with improved results). Apparently it is crucial that the latent variables are discrete, although I don’t fully understand why.​​

​​Another machine translation paper, , from FAIR, improves understanding of sequence-to-sequence models by assessing how uncertainty caused by noisy training data is captured by the model distribution and how it affects search, proposing tools to assess model calibration.​​

​​Representation Learning and Test of Time

​​Another interesting paper was (and two other related papers: and ), from which I got to learn about this recent framework of embeddings in hyperbolic space. The common theme in this (mathematically beautiful) line of research is that embeddings on hyperbolic manifolds (as opposed to the Euclidean space) are suitable for representing hierarchical relations among objects (for example, trees can be embedded in the Poincaré disk, the 2-dimensional hyperbolic space, with arbitrarily low distortion). As many objects of interest, e.g., social networks, WordNet-like knowledge bases, entailment relations, etc. have a hierarchical structure, this geometry may be an appealing alternative to the usual flat Euclidean space. In this paper, they manage to represent the WordNet taxonomy with state-of-the-art precision with very compact two-dimensional representations.

​​ and received the Test-of-Time Award for their influential ICML 2008 paper “.“ I remember this paper very well (ICML 2008 was the first machine learning conference I attended), and the “cold” reception it got from most of the NLP audience, at a time where neural networks were everything but popular. I admit I was skeptical too — how could a single model, without any feature engineering, trained for weeks on the entire Wikipedia, achieve state-of-the-art scores in several NLP tasks?​​

​​There were some problems in the original paper with the evaluation in the semantic role labeling task (fixed in a later JMLR paper) and a great deal of extrapolation from making progress in this shallow task to being close to solve “all of semantics” (This may be what annoyed the NLP community the most).​​

​​While the last problem persists in the community to this day, time has definitely proved that this was a really valuable contribution and the award very well deserved. Ten years later, we’re all using continuous word representations trained on large datasets and training end-to-end models “from scratch.” Except maybe for some industry niches, feature engineering is long gone and progressively replaced by representation learning and other forms of engineering: architecture search, hyperparameter tuning, transfer learning. Perhaps another take-home message (and I may lose some friends here) is that the NLP community should be more open to contributions from other fields, even when they seem ignorant about language — a closed community is condemned to overfitting their own techniques.

ACL’18: Fertility, Context-Aware Translation, Linguistic Structural Bias

​​After a few long flights and one extra day and a half on the calendar, I arrived in Melbourne for ACL. The conference was held in the , a really nice venue (also one of the largest congress centers I’ve been to, rivaling with Sydney’s International Convention Centre Sydney, which hosted ICML 2017). The organization was spectacular. I hope more of these conferences come to Australia so that I have an excuse to go back.

Machine Translation and Generation

presented our paper based on his internship work at (joint work with Pedro Ferreira). The idea is to avoid some of the common mistake patterns in neural MT (under-translation and over-translation) by replacing the conventional softmax-based attention mechanisms by constrained attention: every time we generate a target word, we first upper bound how much attention each source word can receive (therefore avoiding repeatingly attending to the same words).

To this end, we keep track of how much cumulative attention each source word has received and use a fertility-based approach (akin to IBM models) to define the upper bound above. In addition, we encourage attention to be sparse, i.e. to give zero probability to unrelated words — this is accomplished via a novel constrained sparsemax transformation, which is cheap to compute and differentiable. We also propose to detect repetitions and dropped source words.

Chaitanya Malaviya presenting our ACL paper.

In , a new architecture is proposed that, instead of a single encoder for both languages, uses separate encoders and a shared latent space, along with two GANs (a local and a global one) to enhance the cross-language translation.

Unsupervised NMT, though, appears to have some limitations, as stated in , presented by : in particular, performance degrades considerably for morphologically rich languages (or simply when the source and target language are very dissimilar) and when embeddings are trained on different domains or using different algorithms.

This does not surprise me: I always found puzzling how bilingual embeddings can be induced from separate monolingual data by applying orthogonal (or more generally, linear) transformations. I understand the “local” structure of the embedding space is similar for both languages, but why should the “global” structure be similar? An orthogonal matrix has O(D²) degrees of freedom (D being the embedding size) but for practical problems the vocabulary size may well exceed D², so how can we have enough degrees of freedom in these rotations to superimpose the embeddings of related words? Maybe this works because the monolingual corpora are a bit parallel after all (which seems to be the case in Wikipedia), at least to the extent that the monolingual embeddings have similar structure to begin with. But this will break if their domain is different or if embeddings are learned with different algorithms.

A Google AI’s paper, , carries out extensive experiments to disentangle several of the sources of Transformer Networks’ improvement. They show that a “classic” recurrent NMT model on steroids, RNMT+ (which adopts some side innovations introduced in the Transformers paper, such as multi-head attention) is competitive to the non-recurrent architecture of the Transformer. Training may be tricky, though — Orhan Firat mentioned in the QA that things like layer normalization seemed crucial for stabilizing training both in Transformers and RNMT+.

addresses the fact that translation is a multi-modal, ambiguous process — there are typically many valid translations for a given sentence. They propose a VAE-like deep generative model of machine translation which allows sampling one of many-possible translations, by incorporating a chain of latent variables to account for lexical and syntactic variation in parallel corpora.

A really exciting research direction is in context-aware NMT — how to enable NMT models to work beyond the scope of a single sentence? This was the topic of in the (more on Rico’s talk below). There were several interesting talks in this topic, including Elena Voita’s talk , which allows the Transformer’s encoder to condition on adjacent sentences (with a very interesting empirical analysis about where it learns to attend), and Sameen Maruf’s talk on , which uses memory networks to store several sentences in a document and runs an iterative decoding algorithm based on block coordinate descent.

A few interesting papers concerning text generation include , from CMU, which proposes an adversarial latent variable model to rephrase text to contain specific stylistic properties (sentiment, gender and political slant), , from FAIR (awarded with an honourable mention), which learns a creative system to generate coherent and fluent stories given a short prompt (including a new dataset collected from Reddit), , from Stanford, which generates a sentence by sampling a prototype from the training set and refine it (a “semi-parametric” hybrid that mixes traditional left-to-right sentence generators and memory-based approaches), and , from UCL, which looks at the ability of language models to accurately predict numerals — a current weakness of language models they propose to fix. Very interesting!

Parsing and Argmax Differentiation

from MILA proposes a new minimalistic constituent parser which predicts solely a vector of real-valued scalars (“syntactic distances”), for each split position in the input sentence, specifying a ranking order in which the split points will be selected — then a binary tree can be induced from this ranking, recursively, in a top-down fashion.

from CMU is an alternative approach to the traditional left-to-right transition-based parsers, using stack pointers for top-down parsing — at each point, the pointer picks the next left or right child to pick as modifier (like in a head automaton) and proceeds recursively. It’s interesting to contrast this approach with transition-based parsers (which parse left to right) and easy-first parsers (which parse bottom up). Finally, the poster from Berkeley gain a few extra points by replacing a BILSTM encoder with a self-attentive architecture (akin to Transformer networks), achieving some impressive 93.55 F1 in the PTB with a single model and no external data, and 95.13 F1 with the full model with ELMO embeddings.

from UW presented , a nice paper addressing the problem of how to backpropagate through the non-differentiable argmax operation, which was awarded an honorable mention. It proposes a variant of the straight-through estimator called SPIGOT — it involves computing the argmax in the forward pass, then computing an update in the backward that involves an Euclidean projection onto the marginal polytope.

Turns out this projection is exactly what computes too (see the ICML paragraph above), which makes these two approaches very related: while SparseMAP avoids argmax by computing this sparse projection on the forward step and then backpropagating with the exact gradient, SPIGOT computes the argmax in the forward step and then backpropagates a surrogate gradient. The building blocks are the same, though, so SPIGOT can also benefit from SparseMAP’s active set algorithm to compute the Euclidean projection (making it usable for any problem where argmax is efficient). Overall, I prefer SparseMAP though, as its training is likely more stable since it’s based on the exact gradients (plus, it has a sparse structured prediction interpretation).

Another thought: it would be interesting to see how “marginal-SPIGOT” would perform (i.e. if we replace the Euclidean projection by a “KL projection,” which often boils down to sum-product dynamic programming): “marginal-SPIGOT” would be to as SPIGOT is to SparseMAP.

Awards and Workshops

The last session was devoted to best paper awards and the ACL lifetime achievement award, given to for his outstanding contributions to the areas of syntax and semantics, most noticeably the development of .

Earlier, in , winner of the best short paper award, a new version of the SQuAD dataset (a reference benchmark for reading comprehension) was introduced, combining the original questions with 50,000 unanswerable questions written adversarially by crowd-workers to look similar to answerable ones. This dataset seems much more challenging than the previous version, with a current gap of 20 F1 points between human and machine performance.

Queen Victoria Market in Melbourne.

In the workshop days, I attended ’s talk on the about “.” This was a beautiful talk inspired by ’s “duality of patterning,” which contrasts two levels of structure in language, namely how words are formed by combining distinctive and meaningless characters/phonemes, and how sentences are formed by composing significant and meaningful words. I’ll focus on the composition process.

Why should we bother with modeling linguistic structure at all? Where do RNNs actually fail? Chris’ talk showed two benefits of incorporating structural inductive bias: it makes our models more accurate (in particular, if we don’t have a lot of data), and it helps us understanding the biases and limitations of non-structured models.

One way of spotting those limitations is to seek what linguistic information RNNs can encode (e.g. by trying to expose their representations into something human-readable like an FST — the topic of ’s talk later in the same workshop). Chris’ take is a bit different: to understand what RNNs can and can’t do, compare the biases they learn with those of other structured models in terms of their “predictive preferences.” A really compelling example are subject-verb agreement tasks, which require predicting the verb is/are in sentences like “.” As the number of “attractors” increases (i.e. nouns before the verb which are not the correct subject), models have an increasingly hard time figuring out what is the correct subject to pick the right verb form.

This experiment reveals the sequential recency bias of RNNs, which require too many hops (as many as surface words) to reach the subject from the verb, ending up learning an over-simplistic first-noun heuristics — picking He as the subject. By contrast, syntactically informed models such as , equipped with a stack data structure, develop a bias for syntactic recency which reduces the number of hops to get the subject right.

Finally, I attended the , where I gave an invited talk: Here, I described various alternatives to softmax that can be used both in attention mechanisms and in output layers, including , , and the most recent (see above), along with a unifying perspective of these transformations as induced by .

gave an interesting talk advocating , with a nice historical perspective. A word of caution was given alluding to the recent — in work in progress, a new document-level evaluation is being carried out which shows that, if the context is taken into account (as opposed to ranking translations in a sentence-by-sentence basis), then human parity is still far. Before Rico, Jacob Devlin described the several engineering tricks and architecture decisions to fit a NMT model in a cellphone. Related to this, there was a where the won in all settings (more information about Marian inpresented in the ACL demonstration session). The workshop closed with a talk from on how to get .

Document-level evaluation of MT: human parity not there yet (from Rico Sennrich’s talk).

Take-Home Messages:

  • Sparsemax/argmax differentiation are opening up new research fronts in structured prediction and NLP.
  • Deep generative models are extremely promising, but still broken in many ways, and not yet able to handle discrete data convincingly.
  • Representation learning won the test of time in ML and NLP.
  • Linguistic structure (fertility, syntactic phrases) are making its way as induction bias in end-to-end models.
  • Machine translation is not solved, and the next big thing is how to take context into account, beyond sentence boundaries.

And that’s all folks!

See you soon in Brussels for EMNLP.

Note: I’d like to thank everyone at the and the team members for giving feedback on this blog post and reading pre-prints of some of these papers in reading meetings. Other notes have been published on these conferences: check out and , with a focus on reinforcement learning.

Unbabel R&D

A collection of articles from the Unbabel Research & Development Team.

André Martins

Written by

VP of AI Research at Unbabel and Invited Professor at the University of Lisbon.

Unbabel R&D

A collection of articles from the Unbabel Research & Development Team.