Neuromation researchers are currently attending ACL 2019, the Annual Meeting of the Association for Computing Linguistics, which is the world’s leading conference in the field of natural language processing. While we do have a paper here, “Large-Scale Transfer Learning for Natural Language Generation”, in this series we will be concentrating on some of the most interesting works we are seeing from others.
Since I am now attending this multi-day conference in person, I’d like to switch gears in my reviews a little. This time, let’s follow the conference as we experience it. Research conferences are usually organized into tracks: you go to a specific room where you listen to several talks united by some common topic. Thus, I will organize this “ACL in Review” NeuroNugget series into ACL tracks that I personally visited. I typed in the rough notes for this piece as I was sitting in on the talks at ACL and later just edited them lightly for readability. For every track, I will choose one paper to highlight in detail and provide very brief summaries of the rest.
So, without further ado, let’s get started! Following is part 1 of my personal experience of the ACL 2019 Conference.
The first track I visited was called “Machine Learning 1”, which was chaired by Chris Dyer, a researcher at DeepMind who is famous for his work in deep learning for NLP. The track collected papers that have ideas of interest for natural language processing and even beyond to machine learning in general. I will begin with the highlight paper and then proceed to describe the rest.
In line with other top NLP conferences, ACL has the laudable tradition of publishing all accepted papers online in the ACL Anthology. So, I will provide ACL Anthology links for the papers we discuss here, rather than arXiv or some other repository. All pictures are taken from the corresponding papers unless specified otherwise.
Augmenting Neural Networks with First Order Logic
As we know, neural networks are great but still suffer from a number of issues. First of all, they are basically end-to-end non-differentiable black boxes, and although they have a lot of hidden variables, these variables are hard to interpret and match with something that we could add external knowledge to. In this work, Tao Li and Vivek Srikumar (ACL Anthology) present a framework for adding domain knowledge to neural networks.
Consider this reading comprehension question answering example from the paper:
In this example, an attention-based model will try to learn an alignment between words in the question and in the paragraph. If you have a huge dataset and a large model, then, of course, mapping “author” to “writing” can occur naturally. If not, however, it is much easier if the network has some way to, e.g., align words that are labelled as similar in some knowledge base. Such knowledge bases, of course, exist and are readily available for most languages.
This is an example where we have domain knowledge about the problem in the form of rules: if the two words are considered related (a binary predicate holds) then align them. There are a lot of things you can encode with such rules if you allow the rules to be expressed in first-order logic, i.e., in the form of logical formulas that can contain predefined predicates. This way to define the rules is very expressible, easy for experts to state and for non-experts to understand, so it is a very natural idea to try to add them to neural networks.
First idea: let’s have named neurons! That is, suppose that some neurons in the networks are associated with an externally-defined meaning. That is, you have a neuron, say a, and another neuron, say b, and you want the result on the next layer, a neuron c, to be a logical conjunction of these two: c=a&b. Sounds like a very easy function to add to the computational graph… but it’s definitely not a differentiable function! Backpropagation will break on a logical “AND” as much as it would break on, say, the argmax function. So how can we insert the predicates and first-order rules into a neural network, where differentiability is paramount?
We need to soften the logical operations, replacing them with differentiable functions in such a way that the result reflects how much we believe the predicate to be true. Li and Srikumar propose to do this with distance functions: given a neuron y=g(Wx) for some activation function g, inputs x, and weights W, and assuming we want to express some conditional statement Z→Y, where Z is some formula over variables z and Y is the variable associated with y, we define a constrained neural layer as y=g(Wx+⍴d(z)), where d(z) is the distance function corresponding to the statement. The authors introduce such functions based on the Lukasiewicz t-norm; say, (NOT a) corresponds to (1-a) and (a&b) corresponds to max(0,a+b-1). Then, provided that your logical formulas do not introduce cycles in the computational graph, you can map them inside the neural network.
The authors augmented several models in this way: a decomposable attention model (Parikh et al., 2016) for natural language inference, BiDAF (Seo et al., 2016) for machine comprehension, and others. For example, in BiDAF the model contains encoded representations of paragraph and query, attention vectors, and the outputs. The logical rules go like this: if two words in the paragraph and query, p and q, are related in the ConceptNet knowledge base, Related(p, q), then align them in the attention layer. As for natural language inference, here the rule is that if Related(p, h) and the unconstrained network strongly believes that the words should align, then align them.
The authors show that the results improve across all experiments, but especially significant improvements result when you have relatively little data: the authors experimented by taking only 1%, 5%, 10% and so on of the same dataset for training. When using 100% of the dataset, there is even a slight deterioration in the results in the case of natural language inference, which probably means that the rules are a little noisy, and the network is already better off learning the rules from data.
Thus, the conclusion is simple: if you have a huge dataset, just believe the data. If not, you may be better off adding external domain knowledge, and this paper gives you one way to add it to our beloved neural black boxes. I hope to see more of this connection between domain knowledge and neural networks in the future: it has always sounded absurd to me to just throw away all the knowledge we have accumulated (although sometimes, like in chess, this is exactly the way to go… oh well, nobody said it would be easy).
Self-Regulated Interactive Sequence-to-Sequence Learning
This work by Julia Kreutzer and Stefan Riezler (ACL Anthology) presents an interesting way to combine various different supervision types. As you know, machine learning techniques are distinguished by supervision types: supervised, semi-supervised, unsupervised, reinforcement. This work tries to find the right balance between different supervision types and to teach a model to integrate multiple different types of supervision. For example, for sequence-to-sequence learning in machine translation, suppose you have a human “teacher” who can mark and/or correct wrong parts of a translated sentence. The solution is a self-regulation approach based on reinforcement learning: there is a regulator model that chooses the supervision type, which serves as the action for the RL agent. The authors successfully evaluated this approach on the personalization task (online domain adaptation from news to TED talks and from English to German) and on machine translation, where the supervision was simulated with reference translations. This looks a lot like active learning, so they also compared it (favorably) with traditional uncertainty-based active learning techniques.
Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss
In this work, Laura Jehl et al. (ACL Anthology) consider the often-encountered case when golden ground truth supervision is not available but there is still some weak supervision. For example, in semantic parsing for question answering, it’s much easier to collect question-answer pairs than question-parse pairs. Usually, people solve this problem by using metric-augmented objectives where the external metric can be computed with the “golden” target and can help guide the learning process, but where the metric is removed from the actual structure produced by the model and may be unreliable. The intuition is to use this metric to both reward “good” structures and punish “bad” structures according to this weak feedback.
The authors consider several different objectives: in minimum risk training (MRT), the external metric assigns rewards to model outputs. In the ramp loss objective, the function encourages answers that have high probability and a high feedback score (hope outputs) and discourages answers that have a low feedback score but still a high probability of appearing (fear outputs). In the new objective presented in this work, token-level ramp loss, the ramp loss idea is pushed down to individual tokens, encouraging tokens that occur only in the positive example, discouraging tokens from the negative example, and leaving untouched the tokens that appear in both.
The authors apply these objectives to an encoder-decoder model with attention pretrained with maximum likelihood estimation and report improvements for semantic parsing and weakly supervised machine translation for ramp loss over MRT and for token-level ramp loss over the regular ramp loss.
You Only Need Attention to Traverse Trees
Models based on self-attention have completely transformed the field of natural language processing over the last year. Transformer, BERT, OpenAI GPT and their successors have been instrumental in most of the recent famous NLP advances. In this work (ACL Anthology), Mahtab Ahmed et al. present an extension of the self-attention framework that worked so well for sequences to trees, proposing a Tree Transformer model that can handle phrase-level syntax from constituency trees. Basically, they use the attention module as the composition function in a recursive tree-based structure. I won’t go into much detail but here is their main illustration:
They evaluated their results on the Stanford Sentiment Treebank for sentiment analysis, the Sentences Involving Compositional Knowledge (SICK) dataset for semantic relatedness, and other tasks, getting results close to state of the art tree-structured models and significantly better than the regular Transformer.
The Referential Reader: A Recurrent Entity Network for Anaphora Resolution
In this work from Facebook Research, Fei Liu et al. (ACL Anthology) tackle the hard and interesting problem of anaphora resolution, that is, which concept or previously mentioned word/phrase does a given pronoun refer to? It (please resolve my anaphora here) is an interesting problem, and actually even the very best models are still far from perfect at this task — getting anaphora resolution right in corner cases would require some deep knowledge about the world. The authors present the so-called Referential Reader model: it identifies entity references and stores them in a fixed-length memory, with the update and overwrite operations for the memory available for the model. The memory is controlled with a GRU-based language model, with special saliency weights showing which memory cells are still important and which are safe to overwrite. So, in the example below, we first store the “Ismael” in the first cell and increase its saliency, then overwrite the second cell (with low saliency) with “Captain Ahab” (first “captain”, then updated with “Ahab”), and then resolve the “he” reference by choosing the first memory cell contents:
All of this is controlled with recurrent neural nets (I won’t go into the details of the control mechanism, check the paper for that), and the results significantly improve over current state of the art. Moreover, this approach can benefit from unsupervised pre-training via better language models, and the best results are achieved with state of the art BERT language models.
Adaptive Attention Span in Transformers
And now back to Transformers. The original Transformer is based on a multi-head attention system where each head receives query-key-value triples produced from the input vectors and then combines them into attention values. Sainbayar Sukhbaatar et al. (ACL Anthology) attempt to make self-attention more efficient. The problem they are trying to fix is that networks based on self-attention take into account a very long context (up to 3800 tokens for Transformer-XL), which leads to very high memory requirements for the models since you need to compute self-attention with every context word in every attention head. The authors note that some heads do use the long context (it has been shown that long context in general provides a lot of benefits here) but some do not, using much shorter “attention spans” of only a few tokens. Therefore, the authors introduce adaptive attention spans, adding a special masking function to limit the context length and learning its parameters, thus forcing the model to use shorter spans with special regularization.
The authors compare their models with regular Transformer-based networks and show that they can achieve the same or even slightly better results on held-out test sets with significantly smaller models. Another interesting observation is that in the resulting models, the early layers almost universally use very short attention spans, and only the later layers (say, layers 8–12 in a 12-layer model) can make good use of long contexts. Here are two examples: on the left, we see head B in a regular Transformer caring about much shorter context than head A; on the right, we see the actual attention spans across various layers:
With this, the first track was over; all of these papers were presented as 15–20 min talks, with less than two hours for all of the above, including questions and technical breaks for changing the speakers. So yes, if you try to actively understand everything that’s going on in your session of a major conference, it is a rather taxing exercise — but a rewarding one too! See you next time, when we continue this exercise with other, no less interesting tracks from ACL 2019.
Chief Research Officer, Neuromation