ACL 2019: Highlights and Trends

Maria Khvalchik
Semantic Tech Hotspot
Aug 12, 2019

I was fortunate to attend the ACL 2019 Conference in Florence last week. It was held at ‘Fortezza da Basso’, a 16th-century fortress built for the Medici family.

#wow #fortress #sofancy

But seriously, what other places in Florence could fit over 3,000 attendees? Only a fortress would save us, the registered attendees, from the unregistered ones. Just kidding. Though we did have security at the entrance, you know, just in case.

In this post, I’d like to outline current Natural Language Processing trends.

Top 10 areas for ACL submissions.

Some ACL Statistics

Okay, so this year there were 2,906 submissions (a 75% increase over ACL 2018 😳) and 660 accepted papers (447 long and 213 short).

It was a race with six parallel tracks and three rotating poster sessions a day!

Food for Thought à la carte

Here’s an insightful post by Mihail Eric where he describes the ongoing hype:

Pretrain then Finetune: A New Paradigm for NLP

Nowadays, it’s fair to say there is a new sheriff in town. With the advent of powerful pretrained representations, trained using some flavor of a language modelling objective such as ELMo, OpenAI GPT, and BERT, the de facto technique for NLP has become to take some sort of off-the-shelf model pretrained on gargantuan amounts of data and fine-tune it to your task with some smaller in-domain corpus. Indeed, this strategy has successfully achieved tremendous SOTA results on existing NLP benchmarks.

For real, BERT was all over ACL! “BERTology” is now the term for papers that study its properties; use it as you please 😉
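To make the recipe concrete, here is a minimal sketch of pretrain-then-finetune using the HuggingFace transformers library; the toy dataset and hyperparameters are my own illustration, not taken from any specific paper.

```python
# A minimal sketch of the pretrain-then-finetune recipe: load a model
# pretrained on huge amounts of text, then fine-tune it on a small
# in-domain classification dataset. Hyperparameters are illustrative.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy in-domain corpus: (sentence, label) pairs.
train_data = [("the fortress was packed with NLP researchers", 1),
              ("the pasta was undercooked", 0)]

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        outputs = model(**batch, labels=torch.tensor([label]))
        outputs.loss.backward()       # gradients flow into the pretrained weights
        optimizer.step()
        optimizer.zero_grad()
```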

Going Beyond the Pretrain-Finetune Paradigm

In general my feeling is that the bulk of models today are still solving datasets rather than the tasks. We are building models that have become surprisingly effective at picking up and exploiting dataset-specific biases. In the process, our evaluation metrics paint fairly misleading pictures. This reminds me of Goodhart’s law: When a measure becomes a target, it ceases to be a good measure. So how do we move forward?

→ We should change the existing benchmarks and create new ones!

Following this train of thought, bound for the city of Scientific Skepticism, researchers at the University of Maryland created 1,213 questions in collaboration with computers to expose flaws in machine-learning language models, as described in the paper Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering.

Jordan Boyd-Graber, a CS associate professor and senior author of the paper, said:

“Most question-answering computer systems don’t explain why they answer the way they do, but our work helps us see what computers actually understand.”

Best papers

The best papers are well described in this post by Synced. As Machine Learning and Machine Translation were the main giants at ACL, the best papers followed suit. However, it was a pleasant surprise to see plenty of Question Answering and Semantics work, and those are the sessions I attended most.

Open-domain Question Answering papers

  1. After attending SIGIR 2019, it was exciting to see work by Google Research arguing against relying on a black-box IR system to retrieve evidence candidates for open-domain QA. They claim that QA is fundamentally different from IR and show that it is possible to jointly learn the retriever and reader from question-answer string pairs without any IR system, so that retrieval can be performed over any text in an open corpus.
A subset of all possible answers given a question q. Unlike previous work using IR systems, the system learns to retrieve from all of Wikipedia directly.
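As a rough illustration of what replacing a black-box IR system with a learned retriever looks like, here is a hedged sketch (my own, not the authors' code) of dense retrieval: two encoders produce question and passage vectors, and passages are ranked by inner product so the retriever can be trained end-to-end together with the reader.

```python
# Sketch: a learned retriever scores every passage in an open corpus by
# the inner product between dense question and passage encodings, so the
# retriever can be trained jointly with the reader from QA pairs alone.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
q_encoder = AutoModel.from_pretrained("bert-base-uncased")   # question tower
p_encoder = AutoModel.from_pretrained("bert-base-uncased")   # passage tower

def encode(encoder, texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return encoder(**batch).last_hidden_state[:, 0]          # [CLS] vectors

corpus = ["Florence is the capital of Tuscany.",
          "BERT was introduced by Devlin et al. in 2018."]
question = ["Who introduced BERT?"]

scores = encode(q_encoder, question) @ encode(p_encoder, corpus).T   # retrieval scores
top_passage = corpus[scores.argmax().item()]
# In training, these scores are softmax-normalized and the retriever is
# updated end-to-end depending on whether the reader finds the answer string.
```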

2. Authors from CMU (Haitian Sun et al.) propose open-domain QA over a combination of a KB and entity-linked text, which is appropriate when only an incomplete KB is available together with a large text corpus. They propose a novel model, GRAFT-Net, based on graph representation learning. The idea is to extract answers from a question-specific subgraph containing text as well as KB entities and relations. Their achievement: performance competitive with SOTA methods in both text-only and KB-only settings, and better than baseline models when text is combined with an incomplete KB.

Left: GRAFT-Net considers a heterogeneous graph constructed from text and KB facts, and thus leverages the rich relational structure between the two sources. Right: Embeddings are propagated in the graph, and the final node representations are used to classify answers.
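To illustrate the notion of a question-specific heterogeneous subgraph, here is a toy sketch (my own simplification, not GRAFT-Net itself) that seeds the graph with the question's entities, pulls in neighbouring KB facts, and attaches linked sentences as text nodes.

```python
# Sketch of the "question-specific subgraph" idea: seed with entities
# mentioned in the question, pull in their KB neighbours, and attach
# retrieved sentences that mention those entities as extra text nodes.
# Entity names and facts below are toy examples, not the paper's data.
kb_facts = [("Cam_Newton", "plays_for", "Carolina_Panthers"),
            ("Cam_Newton", "profession", "Football_player")]
linked_text = {"Cam_Newton": ["Cam Newton signed a contract extension in 2015."]}

def build_subgraph(question_entities, hops=1):
    nodes, edges = set(question_entities), []
    for _ in range(hops):
        for s, r, o in kb_facts:
            if s in nodes or o in nodes:
                edges.append((s, r, o))
                nodes.update([s, o])
    # attach text nodes for every KB entity we kept
    text_nodes = {e: linked_text.get(e, []) for e in nodes}
    return nodes, edges, text_nodes

print(build_subgraph({"Cam_Newton"}))
```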

3. More research on incomplete KBs can be found in the Xiong et al. paper, where a new end-to-end question answering model is proposed. The model learns to aggregate answer evidence from an incomplete KB and a set of retrieved text snippets. The WebQSP KBQA dataset is used as a benchmark.

An example from WebQSP. Here the answer cannot be directly found in the KB. But the knowledge provided by the KB, i.e., Cam Newton is a football player, indicates he signed with the team he plays for.

Under the assumption that the structured KB is easier to query and that the acquired knowledge can help the understanding of unstructured text, their model first accumulates knowledge about entities from a question-related KB subgraph; it then reformulates the question in the latent space and reads the texts with the accumulated entity knowledge at hand. The evidence from the KB and the texts is finally aggregated to predict answers.

The SubGraph Reader utilizes graph attention networks to collect information for each entity in the question-related subgraph. The learned knowledge of each entity is then passed to the Text Reader to reformulate the question representation and encode the passage in a knowledge-aware manner. Finally, the information from the text and the KB subgraph is aggregated for answer entity prediction.

Code: https://github.com/xwhan/Knowledge-Aware-Reader
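The repository above contains the real implementation; below is only a toy sketch of the graph-attention step described above, in which each subgraph entity attends over its neighbours conditioned on the question.

```python
# Toy sketch of the graph-attention step: each subgraph entity attends
# over its (relation, neighbour) pairs, conditioned on the question, and
# aggregates their embeddings into a knowledge-aware entity vector.
import torch
import torch.nn.functional as F

dim = 8
question_vec = torch.randn(dim)
neighbour_vecs = torch.randn(4, dim)      # embeddings of (relation, neighbour) pairs

# attention scores: how relevant each neighbour is to the question
scores = neighbour_vecs @ question_vec
weights = F.softmax(scores, dim=0)
entity_update = weights @ neighbour_vecs  # weighted sum of neighbour embeddings
# This knowledge-aware entity vector is then handed to the text reader
# to reformulate the question and encode the passages.
```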

4. Yair Feldman and Ran El-Yaniv researched the task of multi-hop open-domain Question Answering. Their method retrieves multiple supporting paragraphs, nested amidst a large knowledge base, that contain the necessary evidence to answer a given question.

Green: 1st reasoning hop. Purple: 2nd reasoning hop. Blue bold italics: the entity connecting the contexts.

It iteratively retrieves supporting paragraphs by forming a joint vector representation of both a question and a paragraph. The retrieval is performed by considering contextualized sentence-level representations of the paragraphs in the knowledge source. It achieves state-of-the-art performance over two well-known datasets, SQuAD and HotpotQA.

In the HotpotQA example shown in the image above, we infer in the first reasoning hop that the manager is Alex Ferguson. Without this knowledge, Context 2 cannot be retrieved with confidence, as the question could refer to any of the club’s managers. Therefore, iterative retrieval is needed.
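Here is a toy sketch of that iterative retrieval loop (my own illustration with a stand-in encoder, not the authors' model): after each hop, the query representation is fused with the retrieved paragraph so the next hop can find evidence the original question never mentions.

```python
# Sketch of iterative (multi-hop) retrieval: after each hop, the question
# representation is fused with the retrieved paragraph so the next hop can
# find evidence (e.g. "Alex Ferguson") that the question alone never mentions.
import numpy as np

def encode(text):                      # stand-in for a sentence encoder
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def retrieve(query_vec, paragraphs, exclude=()):
    scored = [(float(query_vec @ encode(p)), p) for p in paragraphs if p not in exclude]
    return max(scored)[1]

paragraphs = ["Context 1: ... the club's manager is Alex Ferguson ...",
              "Context 2: Alex Ferguson was born in Glasgow ..."]
query_vec, retrieved = encode("Where was the manager of the club born?"), []
for hop in range(2):
    paragraph = retrieve(query_vec, paragraphs, exclude=retrieved)
    retrieved.append(paragraph)
    query_vec = query_vec + encode(paragraph)   # naive fusion; the paper learns this jointly
```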

5. Liu et al. investigate cross-lingual OpenQA. They construct a novel dataset, XQA, for cross-lingual OpenQA research, publicly available at https://github.com/thunlp/XQA. It consists of a training set in English as well as development and test sets in eight other languages. The authors provide several baseline systems for cross-lingual OpenQA, including two machine translation-based methods and one zero-shot cross-lingual method (multilingual BERT). Experimental results show that the multilingual BERT model achieves the best results in almost all target languages, while the performance of cross-lingual OpenQA is still much lower than that of English.
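For readers unfamiliar with the zero-shot setup, here is an illustrative sketch: a multilingual BERT QA model, fine-tuned only on English data, is applied unchanged to another language. The checkpoint name below is a placeholder; in the paper the model is first fine-tuned on the English training set.

```python
# Sketch of the zero-shot cross-lingual baseline: a multilingual BERT QA
# model fine-tuned only on English data is applied, unchanged, to questions
# and documents in another language. The checkpoint name is illustrative
# (a raw checkpoint would still need English QA fine-tuning first).
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-base-multilingual-cased")

result = qa(question="Wo fand die ACL 2019 statt?",
            context="Die ACL 2019 fand in Florenz in der Fortezza da Basso statt.")
print(result["answer"], result["score"])
```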

Knowledge Graphs in NLP Architectures

Yes, pretrained language models are robust. However, what they learn is relatively unconstrained. What about going beyond this by infusing information from grounded knowledge sources? Many papers at ACL tried to tackle this question.

  1. Bosselut et al., in joint work between Microsoft Research and the Allen Institute, present COMET, a framework for the automatic construction of commonsense knowledge bases.
COMET learns from an existing knowledge base (solid lines) to be able to generate new nodes and edges (dashed lines).

This framework learns from language models to add new nodes and edges to the knowledge graph. To train the model, they frame the task in terms of triples of the form {s, r, o}, where s is the phrase subject of the triple, r is the relation, and o is the phrase object. The problem is then to generate o given s and r.
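As a hedged sketch of this generation objective (not the authors' code; the model name and the example triple are illustrative), one can concatenate s and r into a prompt and ask a pretrained LM to produce o:

```python
# Sketch of the training signal: concatenate the subject phrase and the
# relation into a prompt and ask a pretrained LM to generate the object
# phrase. Model name and the example triple are illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

s, r = "PersonX goes to the mall", "xIntent"        # triple {s, r, o}: generate o
prompt = f"{s} {r}"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
# Training would maximize the likelihood of the gold object phrase o
# given this (s, r) prompt, starting from the pretrained LM weights.
```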

They tested the system on the ATOMIC and ConceptNet semantic networks. The experiments showed that the generated output is of high quality: human judges found 77.5% of the generated ATOMIC relations and 91.7% of the generated ConceptNet relations to be correct.

Salesforce Research explored commonsense reasoning as well. In this work, they collected human explanations for commonsense reasoning in the form of natural language sequences and highlighted annotations in a new dataset called Common Sense Explanations (CoS-E). Then they used CoS-E to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework.

CoS-E dataset: https://github.com/salesforce/cos-e

2. Logan et al. (with the memorably titled paper Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling) address language models’ inability to correctly predict words that rarely appear in the training data. For instance, the training data may contain a particular sentence featuring some low-frequency word; the trained model will then prefer a word that appears more often in the training data, even though this is not factually correct.

Linked WikiText-2 Example. The graph is built by iteratively linking each detected entity to Wikidata, then adding any relations to previously mentioned entities.

To solve this, they introduce the knowledge graph language model, a generative architecture that selectively copies facts from a knowledge graph relevant to the underlying context, outperforming strong baseline language models. The authors also introduce a new dataset, Linked WikiText-2, whose training portion contains more than 41K entities and 1.5K relations annotated against Wikidata.
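A toy illustration of the copy-versus-generate decision at the heart of this idea (entirely made up, just to convey the mechanism):

```python
# Toy illustration of the idea: at each step the model decides whether
# to generate an ordinary vocabulary word or to copy an entity/fact from the
# local knowledge graph built from previously mentioned entities.
# The probability and the tiny graph below are made up for illustration.
import random

local_graph = {"Barack Obama": {"spouse": "Michelle Obama"}}

def next_token(context, p_copy=0.5):
    if "wife" in context and random.random() < p_copy:
        # render a fact from the local graph instead of a vocabulary word
        return local_graph["Barack Obama"]["spouse"]
    return "the"  # stand-in for ordinary vocabulary generation

print(next_token("Barack Obama's wife is"))
```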

3. Zhang et al.’s paper was one of several at the conference that applied knowledge facts to better align language models with reality. In this case, they used knowledge facts to create a new framework for language representation models.

Solid lines: the existing knowledge facts. Red dotted lines: the facts extracted from the sentence in red. Green dot-dash lines: the facts extracted from the sentence in green.

The two main challenges solved by the authors were: encoding of structured knowledge (i.e., how to represent the knowledge facts for language models) and heterogeneous information fusion (i.e., how to fuse training processes for language and knowledge representations).
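Here is a minimal sketch of what such fusion might look like (my own simplification; the dimensions and the fusion layer are illustrative, not the paper's exact architecture): a token's contextual embedding is combined with the KB embedding of its linked entity.

```python
# Sketch of the "heterogeneous information fusion" idea: tokens that are
# linked to a KB entity get that entity's (e.g. TransE-style) embedding
# fused into their contextual representation. Dimensions are illustrative.
import torch
import torch.nn as nn

d_token, d_entity = 768, 100
fuse = nn.Linear(d_token + d_entity, d_token)         # simple fusion layer

token_vec = torch.randn(d_token)                      # contextual embedding of "Dylan"
entity_vec = torch.randn(d_entity)                    # KB embedding of Bob_Dylan
fused = torch.tanh(fuse(torch.cat([token_vec, entity_vec])))
# Tokens without a linked entity would keep their original representation;
# the fused vectors feed the next layer of the language model.
```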

The trained language model was applied to several tasks, such as entity typing and relation classification. In all cases, the results were either very close to or better than those of other systems, notably including BERT.

4. The main topic of Wang et al. looks very similar to a recent SIGIR paper: reasoning over knowledge graphs to explain recommendations.

A KG-aware recommendation in the music domain. Dashed lines between entities: the corresponding relations. Solid lines: user-item interactions.

In comparison to that work, this one uses deep learning methods as part of the algorithm. In particular, they propose a knowledge-aware path recurrent network which is able to generate a representation for each path by composing the semantics of its entities and relations. This is interesting because neural methods are typically difficult to interpret, given their complexity; here, however, the RNN itself is aware of the path.
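A minimal sketch of knowledge-aware path encoding (my own illustration, with toy IDs and dimensions): each path is a sequence of entity and relation embeddings fed through an RNN whose final state scores that path.

```python
# Sketch of knowledge-aware path encoding: a path such as
# user -> Interact -> song1 -> SungBy -> artist -> SungBy^-1 -> song2
# is turned into a sequence of (entity, relation) embeddings and encoded
# with an RNN; the final state scores the path. IDs and sizes are toy values.
import torch
import torch.nn as nn

num_symbols, dim = 10, 16                  # entities and relations share one toy vocabulary
embed = nn.Embedding(num_symbols, dim)
rnn = nn.LSTM(dim, dim, batch_first=True)
score = nn.Linear(dim, 1)

path = torch.tensor([[0, 1, 2, 3, 4, 5, 6]])   # alternating entity/relation ids
_, (h, _) = rnn(embed(path))
path_score = score(h[-1])                  # one interpretable score per path
# Scores of all paths between a user and an item are pooled to predict
# whether the user will like the item, keeping each path inspectable.
```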

The performance of the system is extensively evaluated on the MovieLens and IMDb datasets, with results demonstrating the model’s effectiveness.

Want to read more on Knowledge Graphs presented at ACL? Take a look at this post by Michael Galkin.

Summary

  • Lots of papers studying the properties of BERT or building on top of it.
  • Open-domain QA goes multi-hop and multilingual.
  • Knowledge graphs are regaining popularity in the NLP domain.

My personal opinion

The variety and quality of presentations were perfect, and the community is diverse and inviting, so overall my experience was quite enjoyable :)

ACL 2020 will take place in Seattle. The website is already up and running: https://acl2020.org/

