Going Beyond SQuAD (Part 2)

SQuAD is all but solved, but QA is not

Branden Chan
deepset-ai
8 min read · Jun 29, 2020


Helping your models see text a little clearer (Image by Dariusz Sankowski from Pixabay)

As mentioned in part 1, SQuAD’s success has garnered it a lot of attention, and it has become the de facto extractive QA dataset. That said, SQuAD is only one flavour of extractive QA, and various papers have pointed out weaknesses in how the dataset was created. As a result, there is now a new generation of datasets designed to avoid the artefacts found in SQuAD, and they present new, harder challenges. Often they introduce new annotation schemes, scale up the extractive QA task, require different answer outputs or test a model’s ability to synthesise separate pieces of information. Let’s take a look at some of these datasets and see how they might form the backbone of more robust and complex QA systems.

A summary of a few popular extractive QA datasets

Annotators See Passage

One well documented weakness of SQuAD stems from the fact that question writers who are exposed to the context passage are primed to generate questions of a certain style. More specifically, questions created in this way have a high degree of lexical overlap with the document text, and so models trained on this data may rely too heavily on word matching. Various data generation procedures have been used to counteract this. For example, the creators of TyDi QA ask annotators for questions that cannot be answered by a given Wikipedia article. Natural Questions and DuReader use real search engine queries, which reflect real-world information needs while also ensuring that the phrasing of the question is not influenced by the wording of the answering passage. Other examples include TriviaQA, where question-answer pairs are collected before the context documents, and NewsQA, where question writers see only a news story’s headline and highlights but not its full text. While each dataset takes very different steps to avoid the shortcomings of SQuAD, all of them demand a much higher level of lexical and syntactic robustness from the QA systems trained on them.

Examples from SQuAD where there is a lot of lexical overlap between question and document text

Longer Documents

In SQuAD, each sample presents a Wikipedia paragraph as a context passage that may or may not contain the answer to a given question. While this serves as a good test of a trained model’s natural language understanding capabilities, it does not match the real-life information demands of users very well. The real value of QA lies in the ability to extract an answer from large amounts of text which would be too time consuming for a human to read. The first step towards developing these kinds of models is feeding in longer examples. As can be seen in the table above, the passages in datasets like TriviaQA, NewsQA, HotpotQA and Natural Questions are significantly longer than those in SQuAD and thus constitute a tougher challenge.

Longer documents not only present a more challenging task to NLP models but also require an extra degree of engineering. Transformer-based language models can only process sequences up to a fixed length in one pass, since the computational complexity of self-attention scales quadratically with input length. The standard solution at the moment is to use a sliding window that divides a document into smaller, overlapping passages (see this blog for details). But there is clearly a growing wave of research into more efficient forms of attention: the Sparse Transformer, Reformer and Adaptive Attention papers all came out within the last year and a half. So it’s a pretty safe bet that in the coming years we’ll be able to fit longer sequences and more samples into a batch, and this can only be good news for researchers and industry practitioners alike.
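To make the sliding window idea concrete, here is a minimal sketch using the Hugging Face tokenizer’s built-in overflow handling. The model name, window size and stride are illustrative values rather than recommendations.

```python
# Split a long question/context pair into overlapping windows that fit the
# model's maximum sequence length. Each window is scored separately by the
# reader, and the best span across all windows is returned as the answer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

question = "Who jumps over the lazy dog?"
long_context = " ".join(["The quick brown fox jumps over the lazy dog."] * 200)

encoded = tokenizer(
    question,
    long_context,
    max_length=384,               # tokens per window (the model's limit is 512)
    stride=128,                   # overlap between consecutive windows
    truncation="only_second",     # only the context is split, never the question
    return_overflowing_tokens=True,
    return_offsets_mapping=True,  # maps token spans back to character positions
    padding="max_length",
)

print(f"Document split into {len(encoded['input_ids'])} windows")
```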

Open-Domain QA

A new frontier of QA is in developing systems that extract a single answer not just from one document but from a whole collection of documents. While there is no dataset that explicitly offers extractive QA in an open-domain setting, every extractive QA dataset can be used in an open-domain fashion with varying degrees of effectiveness. To do this, you simply gather all of a dataset’s documents into a store and look for the question’s answer in this corpus rather than in a single document. This is in fact the setup of the NeurIPS Efficient Open-Domain Question Answering challenge, which uses an open variant of the Natural Questions dataset, and TriviaQA is also available in an unfiltered format to facilitate this.

While it is tempting to use SQuAD for this purpose, it turns out not to be the best choice in this setting, for a few reasons. Firstly, SQuAD annotators are tasked with writing questions for specific sections of text. This means that many of the questions in the dataset cannot be answered without reference to a particular paragraph, rendering them useless in an open-domain setting. Also, SQuAD contains only around 500 unique Wikipedia articles, from which its 150k question-answer pairs are created. Without a big document corpus, there is little challenge in picking the right document. The high degree of lexical overlap between question and answer in SQuAD makes this even easier, as is reflected in the already very strong performance of baselines like BM25 and TF-IDF. Open-domain QA on SQuAD is thus reduced to something little different from closed-domain QA.

Examples from SQuAD which are unanswerable without reference to a specific paragraph context

I want to point out here that the distinction between document selection and answer span extraction is a very useful one: it is the core concept behind the commonly used Retriever-Reader pipeline. We at deepset have found this to be the best approach to scaling QA to the kinds of large document stores that are common in industry, and we have distilled our code and learnings into the open-source framework Haystack. By combining lightweight Retriever models with transformer-based QA models, we have been able to maintain both speed and accuracy. If you’re interested in trying out QA-based search on your own collection of documents, Haystack has you covered! We have a range of tutorials that will help you start building a QA system on your own data before eventually implementing cutting-edge components like distilled Reader models and Dense Passage Retrieval. A bare-bones version of the pipeline is sketched below the diagram.

The Retriever-Reader pipeline, where the Retriever acts as a lightweight document filter that passes a subset of documents on to the more thorough Reader component, which then extracts the final answer
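Here is a toy Retriever-Reader pipeline built from a BM25 retriever (via the rank_bm25 package) and a transformers question-answering pipeline as the Reader. It is only a sketch of the pattern that Haystack wraps in proper abstractions; the documents and question are invented for illustration.

```python
# A stripped-down Retriever-Reader pipeline: BM25 narrows the corpus down to a
# handful of candidate documents, then a transformer reader extracts a span.
from rank_bm25 import BM25Okapi
from transformers import pipeline

documents = [
    "The Amazon rainforest covers much of the Amazon basin of South America.",
    "BERT was introduced by researchers at Google in 2018.",
    "Normandy is a region in northern France settled by Norse raiders.",
]

# --- Retriever: cheap lexical scoring over the whole corpus ---
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
question = "When was BERT introduced?"
top_docs = bm25.get_top_n(question.lower().split(), documents, n=2)

# --- Reader: expensive transformer model, run only on the retrieved subset ---
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
answers = [reader(question=question, context=doc) for doc in top_docs]

best = max(answers, key=lambda a: a["score"])
print(best["answer"], best["score"])
```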

Different Answer Output

At its core, an extractive QA system must identify the start and end of the text span that answers a given question. The standard implementation is a form of token classification where the model estimates, for each input token, the probability that it is the start or the end of such a span. With the release of SQuAD 2.0, the Stanford team added another 50K adversarial questions that cannot be answered by the supplied passage and expect a NO_ANSWER prediction. Most systems encode this kind of answer by selecting the token at index 0 as both start and end of the answer span. For most models this is fine, since there is a special token at the beginning of the sequence (e.g. the [CLS] token in BERT).
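The following sketch shows what this looks like in practice with a SQuAD 2.0-style model from the Hugging Face hub: the model emits start and end logits per token, and a prediction of index 0 for both is read as NO_ANSWER. The simple argmax decoding here ignores refinements like score thresholds and span-validity checks.

```python
# Span prediction under the hood: one start logit and one end logit per token.
# If both argmaxes point at position 0 ([CLS] for BERT-style models), the model
# is predicting NO_ANSWER.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "deepset/bert-base-cased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Who founded the Mongol Empire?"
context = "Genghis Khan founded the Mongol Empire in 1206."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())

if start_idx == 0 and end_idx == 0:
    print("NO_ANSWER")
else:
    answer_ids = inputs["input_ids"][0][start_idx : end_idx + 1]
    print(tokenizer.decode(answer_ids))
```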

Natural Questions is a step forward from SQuAD in that it includes even more answer formats. For example, a single question may be answered by multiple disjoint Short Answer spans in the passage, a format that is particularly suited to cases where the answer is a list of entities occurring in different positions in the text. This multi-span style of QA cannot, however, be easily handled by the token classification method described above, and so many implementations (e.g. this) train the model to select the shortest span that encapsulates all the answer entities. On top of this, Natural Questions requires systems to answer binary questions with a YES or a NO answer (not to be confused with NO_ANSWER). This can be achieved by adding a text classification head to an existing QA system so that extracted spans are also accompanied by a SPAN, YES, NO or NO_ANSWER prediction; a sketch of such a multi-head setup follows the example below. If you’d like to try out a system that is capable of returning this multitude of answer types, please check out our FARM repo, which contains an example implementation of a Natural Questions QA system!

Examples from Natural Questions where there are multiple answer entities that have intervening text
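As a rough illustration of the multi-head approach described above, the hypothetical module below puts a standard span head and a four-way answer-type classifier on top of a shared encoder; FARM’s actual Natural Questions implementation differs in its details. At prediction time, the extracted span is returned only when the type head predicts SPAN; otherwise YES, NO or NO_ANSWER is emitted directly.

```python
# Hypothetical sketch: a shared encoder with two heads, one for start/end span
# logits and one for the answer type (SPAN, YES, NO, NO_ANSWER).
import torch.nn as nn
from transformers import AutoModel

ANSWER_TYPES = ["span", "yes", "no", "no_answer"]

class QAWithAnswerType(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.span_head = nn.Linear(hidden, 2)                   # start/end logits
        self.type_head = nn.Linear(hidden, len(ANSWER_TYPES))   # answer-type logits

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                      # (batch, seq, hidden)

        start_logits, end_logits = self.span_head(hidden_states).split(1, dim=-1)
        answer_type_logits = self.type_head(hidden_states[:, 0]) # [CLS] representation

        return start_logits.squeeze(-1), end_logits.squeeze(-1), answer_type_logits
```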

Reasoning and Synthesis

Question Answering in a real-world setting is often a complex procedure requiring reasoning, world knowledge and the synthesis of different information sources. While SQuAD does contain some questions that test these capabilities, Min et al. (2018) state that over 90% of the samples in SQuAD v1.1 can be answered using just a single sentence. The creators of HotpotQA tackle this very problem by ensuring that their annotators generate questions that can only be answered by reasoning over two separate paragraphs. NewsQA is also reportedly composed of a large proportion of questions that cannot be solved without reasoning. For a long time now, critics have questioned whether neural networks can be considered truly intelligent if they cannot exhibit logical deduction. The leaderboards of datasets such as these will be one measure of how advanced AI currently is.

Conclusion

There has been a lot of renewed interest in Question Answering as models have been getting better and engineers are starting to recognise its potential in open-domain settings. SQuAD has been a great starting point for the field, but to build truly industrial-grade search systems we need a combination of new models, new datasets and new infrastructure. The datasets mentioned above are all pushing the field forwards in different ways. If you are interested in staying abreast of developments in extractive QA at scale, be on the lookout for my next set of blog posts, which will focus on Haystack, our open-source QA and search framework. We will start with an intro to the core abstractions in the framework before diving into the cutting-edge research that Haystack implements in order to deliver the next generation of semantic search.
