Going Beyond SQuAD (Part 1)

Question Answering in Different Languages

Branden Chan
deepset-ai
7 min read · Mar 5, 2020


The Rosetta Stone, a multilingual stone inscription that was essential to the decipherment of Egyptian Hieroglyphics (source)

If you’re interested at all in the task of Question Answering, you have probably heard about the Stanford Question Answering Dataset, better known as SQuAD. It has become the archetypal QA dataset and it tells the story of the latest boom in NLP Language Modelling technology.

Exact Match (EM) performance on SQuAD 2.0. Notice the big jump in Nov ’18 thanks to BERT (source)

As iterations and improvements have been made on BERT, the breakthrough model that electrified the field, the state of the art on SQuAD has steadily risen, to the point where the best-performing models now surpass the human benchmark. Though SQuAD is now essentially solved, Question Answering remains a challenging and active area of research. Whether you’re looking to extend your QA model to another language or to a new QA style such as yes/no answers, this two-part series will introduce you to the latest datasets and explain how they can help you build your own cutting-edge QA model.

What does SQuAD offer?

SQuAD belongs to a subdivision of QA known as extractive question answering, often also referred to as reading comprehension. Its data is formed from triples of question, passage and answer. When an extractive QA system is presented with a question and a passage, it is tasked with returning the span of text from the passage that answers the question (see diagram). SQuAD 2.0 adds adversarial examples, where the provided passage does not contain the answer.

Samples from SQuAD are composed of a Question, Passage and Answer
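
To make the format concrete, here is a minimal sketch of a single SQuAD-style training sample. The question, passage and answer are invented for illustration, but the fields mirror the essentials of the format: a question, a context passage, the answer text, and the character offset that locates the answer span in the passage.

```python
# A minimal, invented SQuAD-style sample: the answer is a span of the passage,
# located by its character offset so a model can learn to predict start/end positions.
sample = {
    "question": "Which artefact was essential to deciphering Egyptian hieroglyphics?",
    "context": (
        "The Rosetta Stone is a multilingual stone inscription that was "
        "essential to the decipherment of Egyptian hieroglyphics."
    ),
    "answers": {
        "text": ["The Rosetta Stone"],
        "answer_start": [0],  # character index of the answer span in the context
    },
    "is_impossible": False,  # SQuAD 2.0 flag for adversarial, unanswerable questions
}

# Sanity check: the stored offset really points at the answer text.
start = sample["answers"]["answer_start"][0]
text = sample["answers"]["text"][0]
assert sample["context"][start:start + len(text)] == text
```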

We at deepset have been getting our hands dirty with SQuAD-trained QA models; you can try one out here or even train your own using FARM, our transfer learning framework.
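
If you just want to see a SQuAD-trained model in action, a few lines suffice. The sketch below uses the Hugging Face transformers question-answering pipeline rather than FARM itself, and assumes one of deepset’s publicly released SQuAD 2.0 checkpoints; swap in whichever extractive QA model you prefer.

```python
from transformers import pipeline

# Checkpoint name is an assumption: any extractive QA model fine-tuned on
# SQuAD-format data works the same way with this pipeline.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What was the Rosetta Stone essential for?",
    context=(
        "The Rosetta Stone is a multilingual stone inscription that was "
        "essential to the decipherment of Egyptian hieroglyphics."
    ),
)
print(result["answer"], result["score"])  # extracted span plus model confidence
```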

Spurred on by the success of the SQuAD task and the utility of its trained models, many of the leading NLP teams have released different open source datasets building upon the research of the Stanford team. In this article we will focus on the SQuAD equivalents that have been created for a range of languages other than English and talk in the next part about the ways in which you can extend your QA system to cover a broader range of questions and answer types.

There’s more than one way to phrase a question…

There is currently a very visible trend of non-English SQuAD replicas being released by teams around the world. Their contribution to NLP is twofold: not only does research into the dataset replication process facilitate the creation of more non-English datasets, but the accompanying analysis often gives us a deeper glimpse into the workings of QA systems. This kind of work is also a good reminder that English is neither synonymous with nor representative of natural language. Part of the democratisation of NLP must involve the creation of resources and tools in languages other than English, and the field has much to gain by recognising the linguistic idiosyncrasies of English, such as its strict word order and limited inflectional morphology, which are not always reflected in the structure of other languages.

A summary of SQuAD style QA datasets in different languages

Human Annotated Data

It remains the case in NLP today that the best data is human-generated data. SQuAD is impressive in both its scale and the accuracy of its annotations, and many teams have tried to replicate its procedure. For example, SberQuAD (Russian) and FQuAD (French) are crowd-sourced QA datasets that have proven to be good starting points for building non-English QA systems. KorQuAD (Korean) also replicates the original SQuAD crowd-sourcing procedure and provides some very interesting insight into how trained QA systems fare in comparison to humans on different types of questions. The authors of FQuAD find that with CamemBERT (a BERT model pre-trained on French) and a dataset a quarter the size of the original SQuAD, they are still able to reach approximately 95% of human F1 performance. The labour-intensive nature of native crowd-sourced data collection, however, limits the scale of such datasets and has motivated many teams to investigate ways to automatically translate SQuAD.

Comparison of Exact Match (EM) performance on the KorQuAD dataset by type of question-answer pair (Lim et al., 2019)

Machine Translated Data

Machine-translated SQuAD datasets exist for Korean (K-QuAD), Italian, and Spanish. These are almost always more cost- and time-efficient to build, especially considering the premium on crowd-sourcing non-English native speakers on platforms such as Mechanical Turk. We at deepset have also experimented with machine translation of SQuAD and have faced the same quality assurance issues that confronted the creators of the aforementioned datasets.

Chief amongst these is the issue of alignment. Though translating the question and passage is straightforward, it is not always possible to automatically infer the answer span in the translated text, since the character indices will almost certainly have shifted. One technique to remedy this is to insert start and end markers that wrap the answer span, in the hope that they are preserved by the translation. It is also worth noting that the encoder-decoder attention components in modern machine translation models can function as a form of alignment: in cases where the dataset translation is done with full access to a trained model, the attention weights can be interpreted as a form of free alignment (c.f. this method).
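
To make the marker trick concrete, here is a rough sketch. The translate argument is a stand-in for whatever MT system you use (an API call or a local model); the function wraps the answer in rarely occurring delimiter characters before translation and searches for them afterwards, discarding the sample if they do not survive.

```python
from typing import Callable, Optional, Tuple

def transfer_answer_span(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],  # stand-in for any machine translation system
    start_marker: str = "«",
    end_marker: str = "»",
) -> Optional[Tuple[str, int, str]]:
    """Wrap the answer span in markers, translate, then recover the span.

    Returns (translated_context, new_answer_start, new_answer_text),
    or None if the markers did not survive translation.
    """
    answer_end = answer_start + len(answer_text)
    marked = (
        context[:answer_start]
        + start_marker + answer_text + end_marker
        + context[answer_end:]
    )
    translated = translate(marked)

    start = translated.find(start_marker)
    end = translated.find(end_marker)
    if start == -1 or end == -1 or end <= start:
        return None  # markers lost or reordered by the MT system: drop this sample

    new_answer = translated[start + len(start_marker):end].strip()
    # Remove the markers and relocate the answer in the cleaned context.
    clean_context = translated.replace(start_marker, "").replace(end_marker, "")
    new_start = clean_context.find(new_answer)
    if new_start == -1:
        return None
    return clean_context, new_start, new_answer
```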

Finding the Right Mix

Given the trade-off between data quality and scale when choosing between human-created and machine-translated datasets, how can we ensure the best performance in our trained models? In the research literature, a few different teams leverage both kinds of data in different ways.

The creators of FQuAD, for example, have data in both styles and train three models: one using just the machine-translated data, one using just the human-annotated data, and one using both. Even though the machine-translated data adds another 46,000 samples on top of the 25,000 human-annotated ones, they find that a model trained on both performs slightly worse than one trained on the human-annotated data alone.

K-QuAD is also composed of a mix of machine- and human-created samples, and the researchers behind it experiment with combinations of the data. Ultimately, they find that a mixture of the human data and de-noised machine-translated data gives the best performance. Finally, the creators of the Arabic Question Answering dataset also experiment with a mixture of human- and machine-created samples, and for them the best performance comes from a full mixture of both.

From these data points, it seems fair to say that a dataset of around 25,000 human-annotated SQuAD-style samples is enough to train a model to at least 90% of human performance. If you only have around 5,000 such samples, augmenting them with machine-translated data may be worthwhile.
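
Mechanically, combining the two sources is little more than concatenating SQuAD-format files before fine-tuning. A minimal sketch, assuming both files follow the standard SQuAD JSON layout (the file names are placeholders):

```python
import json

def load_squad(path: str) -> list:
    """Load the list of articles from a SQuAD-format JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["data"]

# Placeholder file names for your own human-annotated and
# machine-translated SQuAD-format datasets.
human = load_squad("human_annotated.json")
translated = load_squad("machine_translated.json")

merged = {"version": "v2.0", "data": human + translated}
with open("merged_train.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False)
```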

Multilingual Datasets

Summary of multilingual QA datasets

Parallel to the emergence of multilingual language models such as multilingual BERT and XLM-RoBERTa (which we evaluated here for German), we are also starting to see multilingual QA datasets containing samples in more than one language. As recently as February this year, Google released the TyDi QA dataset, which features 11 different languages chosen specifically for the typological breadth they cover. For example, Japanese is included because it does not place spaces between words, and Arabic because its nouns exhibit a dual form alongside the singular and plural.

XQuAD is a subset of SQuAD translated into 10 different languages by professional translators, and MLQA leverages parallel sentences in Wikipedia pages in different languages to create a dataset covering 7 languages. The smaller scale of these datasets means that they are not suited to training QA models. Rather, they are designed as evaluation sets for QA models created in zero-shot fashion, that is, where QA capability has been transferred from another language. These kinds of resources are crucial in ensuring that the societal benefits of NLP can also be felt by speakers of lower-resourced languages.
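
Zero-shot transfer here simply means fine-tuning a multilingual model on English SQuAD data and then asking it questions in another language. A minimal sketch with the transformers pipeline; the checkpoint name is an assumption, so substitute whichever multilingual QA model you have available.

```python
from transformers import pipeline

# Checkpoint name is an assumption: any multilingual model fine-tuned on
# English SQuAD-format data can be queried zero-shot in other languages.
qa = pipeline("question-answering", model="deepset/xlm-roberta-large-squad2")

# German question and passage ("What was the Rosetta Stone essential for?"),
# even though fine-tuning only used English QA pairs: the multilingual
# pre-training carries the capability across languages.
result = qa(
    question="Wofür war der Stein von Rosette entscheidend?",
    context=(
        "Der Stein von Rosette ist eine mehrsprachige Steininschrift, die für "
        "die Entzifferung der ägyptischen Hieroglyphen entscheidend war."
    ),
)
print(result["answer"])
```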

Conclusion

Interest in Question Answering is certainly growing. NLP practitioners from all over the world are actively exploring what is possible with the latest language model architectures, and many are creating non-English QA datasets in the process. As systems get better and better at the SQuAD style of QA, the field will need to find new ways to challenge these models to become more flexible and robust. Already there is a new generation of open-source datasets that require models to return answers of different lengths, give yes/no responses and synthesise separate pieces of information.

In the next part of this series, we will look more closely at these capabilities, walking through how we might implement them and the datasets that will help us get there.
