Today, we are still working our way through the ACL 2019 conference. In the fourth part of our ACL in Review series (see the first, second, and third parts), I’m highlighting the “Machine Learning 4” section from the third and final day of the main conference (our final installment will be devoted to the workshops). Again, I will provide ACL Anthology links for the papers, and all images in this post are taken from the corresponding papers unless specified otherwise.
In previous parts, I took one paper from the corresponding section and highlighted it in detail. Today, I think that there is no clear leader among the papers discussed below, so I will devote approximately equal attention to each; I will also do my best to draw some kind of a conclusion for every paper.
Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data
When interacting with a user, a conversational agent has to recall information provided by the user. At the very least, it should remember your name and vital info about you, but actually there is much more context that we reuse through our conversations. How do we make the agent remember?
The simplest solution would be to store all information in memory, but all the context simply would not fit. Moonsu Han et al. (ACL Anthology) propose a model that learns what to remember from streaming data. The model does not have any prior knowledge of what questions will come, so it fills up memory with supporting facts and data points, and when the memory is full, it then has to decide what to delete. That is, it needs to learn the general importance of an instance of data and which data is important at a given time.
To do that, they propose using the Episodic Memory Reader, a question answering (QA) architecture that can sequentially read input contexts into an external memory, and when the memory is full, it decides what to overwrite there:
To learn what to remember, the model uses reinforcement learning: the RL agent decides what to erase/overwrite, and if the agent has taken a good action (discarded irrelevant information and preserved important data), the QA model should reward it and reinforce this behaviour. In total, the EMR has three main parts:
- data encoder that encodes input data to the memory vector representation,
- memory encoder that computes the replacement probability by considering the importance of memory entries (there are three different variations that the authors consider here), and
- value network that estimates the value of the memory as a whole (the authors compare A3C and REINFORCE here as RL components).
Here is the general scheme:
To train itself, the model answers the next question with its QA architecture and treats the result (quality metrics for each answer) as a reward for the RL part. I will not go into much formal detail as it is relatively standard stuff here. The authors compared EMR-biGRU and EMR-Transformer (the names are quite self-explanatory: the second part marks the memory encoder architecture) on several question answering datasets, including even TVQA, a video QA dataset. The model provided significant improvements over the previous state of the art.
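To make the overwrite step a bit more tangible, here is a minimal sketch of a fixed-size memory with a stochastic replacement policy. It is not the authors' implementation: `EpisodicMemory`, `score_fn`, and `reinforce_loss` are hypothetical names, `score_fn` is a placeholder for the learned memory encoder, and the loss is the textbook REINFORCE objective with a baseline.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class EpisodicMemory:
    """Fixed-size memory that overwrites one slot once it is full."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = []
        self.rng = random.Random(seed)

    def write(self, item, score_fn):
        """Store `item`; return (evicted index or None, log-prob of the action)."""
        if len(self.slots) < self.capacity:
            self.slots.append(item)
            return None, 0.0
        # score_fn is a hypothetical stand-in for the learned memory encoder:
        # it returns one replacement score per occupied slot.
        probs = softmax(score_fn(self.slots, item))
        index = self.rng.choices(range(len(self.slots)), weights=probs)[0]
        self.slots[index] = item
        return index, math.log(probs[index])


def reinforce_loss(log_prob, reward, baseline=0.0):
    """Standard REINFORCE loss for one action; the reward would come from QA quality."""
    return -(reward - baseline) * log_prob
```

In a real system the scores would come from a biGRU or Transformer over the memory entries, and the reward would be the quality of the answer to the next question; here both are left to the caller.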
When I first heard the motivation for this work, I was surprised: how is it ever going to be a problem to fit a dialogue with a human user into memory? Why do we need to discard anything at all? But then the authors talked about video QA, and the problem became clear: if a model is supposed to watch a whole movie and then discuss it, then sure, memory problems can and will arise. In general, this is another interesting study that unites RNNs (and/or self-attention architectures) with reinforcement learning into a single end-to-end framework that can do very cool things.
Selection Bias Explorations and Debiasing Methods for Natural Language Sentence Matching Datasets
The second paper presented in the section, by Guanhua Zhang et al. (ACL Anthology), deals with natural language sentence matching (NLSM): predicting the semantic relationship between a pair of sentences, e.g., are they paraphrases of each other and so on.
The genesis of this work was a Quora Question Pairs competition on Kaggle where the problem was to identify duplicate questions on Quora. To find out whether two sentences are paraphrases of each other is a very involved and difficult text understanding problem. However, Kagglers noticed that there are “magic features” that have nothing to do with NLP but can really help you answer whether sentence1 and sentence2 are duplicates. Here are three main such features:
- S1_freq is the number of occurrences of sentence1 in the dataset;
- S2_freq is the number of occurrences of sentence2 in the dataset;
- S1S2_inter is the number of sentences that are compared with both sentence1 and sentence2 in the dataset.
When both S1_freq and S2_freq are large, the sentence pair tends to be labeled as a duplicate in the dataset. Why is that? We will discuss that below, but first let’s see how Zhang et al. generalize these leakage features.
The authors consider an NLSM dataset as a graph, where nodes correspond to sentences and edges correspond to comparing relations. Leakage features in this case can be seen as features in the graph; e.g., S1_freq is the degree of the sentence1 node, S2_freq is the degree of the sentence2 node, and S1S2_inter is the number of paths of length 2 between these nodes.
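Here is a minimal sketch of how the three basic leakage features could be computed from a list of sentence pairs. The function name `leakage_features` and the toy data are mine; I also assume that no pair compares a sentence with itself, so that counting common neighbours matches counting paths of length 2.

```python
from collections import defaultdict


def leakage_features(pairs):
    """Compute S1_freq, S2_freq and S1S2_inter for every (sentence1, sentence2) pair."""
    freq = defaultdict(int)       # how many pairs each sentence occurs in
    neighbors = defaultdict(set)  # sentences each sentence is compared with
    for s1, s2 in pairs:
        freq[s1] += 1
        freq[s2] += 1
        neighbors[s1].add(s2)
        neighbors[s2].add(s1)

    features = []
    for s1, s2 in pairs:
        features.append({
            "S1_freq": freq[s1],
            "S2_freq": freq[s2],
            # Common neighbours, i.e. sentences compared with both s1 and s2.
            "S1S2_inter": len(neighbors[s1] & neighbors[s2]),
        })
    return features


pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]
for pair, feats in zip(pairs, leakage_features(pairs)):
    print(pair, feats)
```

Note that nothing here looks at the text of the sentences: the features depend only on how the dataset was assembled.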
By introducing some more advanced graph-based features that are also kind of “leakage features” and testing classical NLSM datasets for this, the authors obtain amazing results. It turns out that only a few leakage features are sufficient to approach state-of-the-art results in NLSM obtained with recurrent neural models that actually read the sentences!
The authors identify reasons why these features are so important: they are the result of selection bias in dataset preparation. For example, in QuoraQP the original dataset was imbalanced, so the organizers supplemented it with negative examples, and one source of negative examples was pairs of “related questions” that are assumed to be non-equivalent. But a “related question” is unlikely to occur anywhere else in the dataset. Hence the leakage features: if two sentences both appear many times in the dataset they are likely to be duplicates, and if one of them only appears a few times they are likely not to be duplicates.
To test what the actual state-of-the-art models do, Zhang et al. even created a synthetic dataset where every label is “duplicate”, because each pair consists of a sentence and its exact copy. They show that all neural network models are also biased: they give lower duplication scores to pairs with low values of leakage features, even though in every case the two sentences are identical.
So now that we have proven that everything is bleak and worthless, what do we do? Zhang et al. propose a debiasing procedure based on a so-called leakage-neutral distribution where, first, the sampling strategy is independent from the labels, and second, where the sampling strategy features completely control the strategy, i.e., given the features the sampling is independent of the sentences and their labels. In the experimental study, they show that this procedure actually helps to remove bias.
I believe that this work has an important takeaway for all of us, regardless of the field. The takeaway is that we should always be on the lookout for biases in our datasets. It may happen (and actually often does) that researchers compete and compare on a dataset where, it turns out, most of the result has nothing to do with the actual problem the researchers had in mind. Since neural networks are still very much black boxes, this may be hard to detect, so, again, beware. This, by the way, is a field where ML researchers could learn a lot from competitive Kagglers: when your only objective is to win by any means possible, you learn to exploit the leaks and biases in your datasets. A collaboration here would benefit both sides.
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
In this work, Minjoon Seo et al. (ACL Anthology) tackle the problem of open-domain question answering (QA). Suppose you want to read a large source of knowledge such as Wikipedia and get a model that can answer general-purpose questions. Real-world models usually begin with an information retrieval model that finds the 5–10 most relevant documents and then use a reader model that processes the retrieved documents. But this propagates errors from retrieval and is relatively slow. How can we “read” the entire Wikipedia corpus (5 million documents instead of 5–10 documents) and do it quickly (from 30s to under 1s)?
The answer Seo et al. propose is phrase indexing. They build an index of phrase encodings in some vector space (offline, once), and then do nearest neighbor search in this vector space for the query. Here is a comparison, with a standard pipelined QA system on the left and the proposed system on the right:
What do we do for phrase and question representations? Dense representations are good because they can utilize neural networks and capture semantics. But they are not so great at disambiguating similar entities (say, distinguishing “Michelangelo” from “Raphael”), where a sparse one-hot representation would be better. The answer of Seo et al. is to use a dense-sparse representation: the dense part comes from a BERT encoding of the phrase, and the sparse part consists of TF-IDF vectors over document-level and paragraph-level unigrams and bigrams.
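A toy sketch of why the sparse part helps: if the score of a phrase is the inner product over the concatenated dense and sparse vectors, two phrases with identical dense encodings can still be told apart by their sparse terms. The names `Encoded`, `score`, and `search` and all the numbers are made up for illustration, and the search is plain brute force rather than the authors' actual index.

```python
from typing import Dict, List, NamedTuple, Tuple


class Encoded(NamedTuple):
    dense: List[float]        # e.g. a neural encoding of the phrase or question
    sparse: Dict[str, float]  # e.g. TF-IDF weights keyed by n-gram


def dense_dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sparse_dot(u: Dict[str, float], v: Dict[str, float]) -> float:
    if len(u) > len(v):  # iterate over the smaller dictionary
        u, v = v, u
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def score(query: Encoded, phrase: Encoded) -> float:
    """Inner product over the concatenated dense and sparse parts."""
    return dense_dot(query.dense, phrase.dense) + sparse_dot(query.sparse, phrase.sparse)


def search(query: Encoded, index: List[Tuple[str, Encoded]]) -> str:
    """Brute-force maximum inner product search over the phrase index."""
    return max(index, key=lambda entry: score(query, entry[1]))[0]


index = [
    ("Michelangelo", Encoded([1.0, 0.0], {"michelangelo": 1.0})),
    ("Raphael", Encoded([1.0, 0.0], {"raphael": 1.0})),
]
query = Encoded([1.0, 0.0], {"raphael": 1.0})
print(search(query, index))
```

The index is built once, offline, and is independent of the question; only the query encoding has to be computed at answer time.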
There are also computational problems. Wikipedia contains about 60 billion phrases: how do you do softmax on 60 billion phrases? Dense representations of 60B phrases would take 240TB of storage, for example. How do you even search in this huge set of dense+sparse representations? So, the authors opted to train on a closed-domain QA dataset, use several tricks to reduce storage space, and use a dense-first approach to search. The resulting model fits into a constrained environment (4 P40 GPUs, 128GB RAM, 2TB storage).
The experimental results are quite convincing. Here are a couple of examples where the previous state-of-the-art DrQA gets the answer wrong, and the proposed DenSPI is right. In the first example, DrQA concentrates on the wrong article:
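The storage figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes 1024-dimensional float32 vectors; that dimensionality and precision are my assumptions for illustration, chosen because they land close to the quoted 240TB.

```python
def dense_index_size_tb(num_phrases, dim, bytes_per_value):
    """Raw size of a dense phrase index in terabytes (1 TB = 10**12 bytes)."""
    return num_phrases * dim * bytes_per_value / 1e12


# Assumed: 1024 dimensions, 4 bytes per float32 value.
print(dense_index_size_tb(60_000_000_000, dim=1024, bytes_per_value=4))  # 245.76
```

Under these assumptions every phrase costs 4KB, so getting down to a 2TB disk requires storing far fewer vectors, far smaller vectors, or both.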
In the second, very characteristic example, DrQA retrieves several answers from a retrieved relevant article and is unable to distinguish between them, while DenSPI can find relevant phrases in other articles as well and thus make sure of the correct answer:
In general, I believe that this approach of query-agnostic indexable phrase representations is very promising and can be applied to other NLP tasks, and maybe even beyond NLP, say in image and video retrieval.
Language Modeling with Shared Grammar
Here we had a hiccup (generally speaking, by the way, ACL 2019 was organized wonderfully: a big thanks goes out to the organizers!): neither of the authors of this paper, Yuyu Zhang and Le Song (ACL Anthology), could deliver a talk at the conference, so we had to watch a video of the talk with slides instead. It was still very interesting.
As we all know, sequential recurrent neural networks are great at generating text. But they overlook grammar, which is very important for natural languages and can significantly improve language model performance. How do we learn it? There are several approaches:
- ground truth syntactic annotations can help train involved grammar models but they are very hard to label for new corpora and/or new languages, and they won’t help a language model;
- we could train a language model on an available dataset, say Penn Treebank (PTB), and test on a different corpus; but this will significantly reduce the quality of the results;
- we could train a language model from scratch on every new corpus, which is very computationally expensive and does not capture the fact that grammar is actually shared between all corpora in a given language.
Therefore, Zhang and Song propose a framework for language modeling with shared grammar. Their approach is called the neural variational language model (NVLM), and it consists of two main parts: a constituency parser that produces a parse tree and a joint generative model that generates a sentence from this parse tree.
To make it work, the authors linearize the parse tree with pre-order traversal and parameterize the parser as an encoder-decoder architecture. This set-up allows for several different possible approaches to training:
- use a supervised dataset such as PTB to train just the parser part;
- distant-supervised learning, where a pre-trained parser is combined with a new corpus without parsing annotations, and we train the joint generative model on the new corpus from generated parse trees (either from scratch or with warm-up on the supervised part);
- semi-supervised learning, where after distant-supervised learning the parser and generative models are fine-tuned on the new corpus together, with the variational EM algorithm.
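The pre-order linearization mentioned above can be sketched in a few lines. The tree format (nested tuples with the label first and plain strings as words) and the bracket tokens are my own choices for illustration; the paper may use a different token inventory.

```python
def linearize(tree):
    """Flatten a constituency tree into a token sequence by pre-order traversal.

    A tree is either a word (str) or a tuple (label, child_1, ..., child_n).
    """
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]  # the node is visited before its children
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens


tree = ("S", ("NP", "He"), ("VP", "tries"))
print(" ".join(linearize(tree)))  # (S (NP He ) (VP tries ) )
```

Once the tree is a flat sequence, predicting it becomes an ordinary sequence-to-sequence problem, which is what makes an encoder-decoder parser possible.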
The resulting language model significantly improves perplexity compared with sequential RNN-based language models. I am not sure, however, how this compares with modern state-of-the-art models based on self-attention: can you add grammar to a BERT or GPT-2 model as well? I suppose this is an interesting question for further research.
Densely Connected Graph Convolutional Networks for Graph-to-Sequence Learning
Zhijiang Guo et al. (this is a Transactions of the ACL journal article rather than an ACL conference paper, so it’s not on the ACL Anthology; here is the Singapore University link) consider the problem of graph-to-sequence learning, which in NLP can be represented by, for example, generating text from Abstract Meaning Representation (AMR) graphs; here is a sample AMR graph for the sentence “He tries to affect a British accent”:
The key problem here is how to encode the graphs. The authors propose to use graph convolutional networks (GCN) that have been successfully used for a number of problems with graph representations. They used deep GCNs to encode AMR graphs, where subsequent layers can capture longer dependencies within the graph, like this:
The main part of the paper deals with a novel variation of the graph-to-sequence model that has GCN blocks in the encoder part, recurrent layers in the decoder part, and an attention mechanism controlled by the encoder’s result. Like this:
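To illustrate how stacking graph convolutions widens the receptive field, here is a small pure-Python sketch of a densely connected block: each layer receives the concatenation of the input features and the outputs of all previous layers. The row normalization with self-loops, the function names, and the toy weights are my assumptions; the paper's exact layer definition may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and row-normalize the adjacency matrix."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[value / sum(row) for value in row] for row in looped]


def gcn_layer(a_hat, features, weight):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, features), weight)]


def dense_gcn_block(a_hat, x, weights):
    """Each layer sees the input features plus the outputs of all earlier layers."""
    features = x
    for weight in weights:
        out = gcn_layer(a_hat, features, weight)
        features = [f + o for f, o in zip(features, out)]  # per-node concatenation
    return features


# Path graph 0 - 1 - 2 where only node 0 carries a signal.
a_hat = normalize_adjacency([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = [[1.0], [0.0], [0.0]]
weights = [[[1.0]], [[1.0], [1.0]]]  # layer 2 takes 2 input features per node
for row in dense_gcn_block(a_hat, x, weights):
    print(row)
```

In the toy run, node 2 still has a zero feature after the first layer and only picks up the signal from node 0 after the second one, which is the "longer dependencies" effect in miniature.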
The authors improve state-of-the-art results on text generation from AMR across a number of different datasets. To be honest, there is not much I can say about this work because I am really, really far from being an expert on AMR graphs. But that’s okay, you can’t expect to understand everything at a large conference like ACL.
This concludes today’s installment on ACL. That’s almost all, folks; next time I will talk about the interesting stuff that I heard during my last day at the conference at the ACL workshops.
Chief Research Officer, Neuromation