FQuAD: French Question Answering Dataset

To stimulate French AI: the first native French QA Dataset

Wacim Belblidia
Illuin
4 min read · Mar 4, 2020


Today, we introduce FQuAD, the first native French Question Answering Dataset. We fine-tuned the CamemBERT language model on the QA task with our dataset and obtained 88% F1. To see it in action, check out our trained QA model!

Table of Contents

  1. Introduction
  2. Dataset Collection
  3. Dataset Analysis
  4. Experiments
  5. Test it yourself!

Introduction

Question Answering

As the name indicates, the task of Question Answering (QA) consists of finding an answer in a particular text given a specific question.

Recent advances in NLP have improved the state of the art on this task, with models trained and evaluated on datasets such as SQuAD. However, because these models are heavily language dependent and such datasets are almost exclusively available in English, French QA has not seen the same growth as its English counterpart.

Extension to French (and other languages)

English QA datasets are big (100,000+ questions). Current low-resource approaches to creating datasets in other languages include translating the English datasets into the target language, translating the text and question into English at prediction time, or using a multilingual language model. However, none of these options achieves quite the same results as a native, crowd-sourced, language-specific dataset.

Dataset Collection

Paragraph Collection

In this work, we took the native route and created a crowd-sourced French QA dataset of more than 25,000 questions. Following SQuAD’s approach, we randomly sampled 145 articles from Wikipedia’s French quality articles and split them into paragraphs.

Question & Answer Collection

With our own annotation platform, crowd-workers were asked to create 3–5 question-answer pairs per paragraph.

ILLUIN Technology’s QA Annotation Platform
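
Concretely, each annotated paragraph becomes a context paired with a list of question-answer spans. The snippet below sketches one such record as a Python dict in the SQuAD 1.1-style layout that FQuAD follows; the context sentence and character offset are illustrative (built around the Napoléon example used later in this article), not copied from the dataset.

```python
# Illustrative record in the SQuAD 1.1-style layout (context + QA pairs).
# The context sentence and answer_start offset are made up for this sketch.
record = {
    "context": "Napoléon fut couronné empereur en mai 1804.",
    "qas": [
        {
            "id": "example-0001",
            "question": "Quand fut couronné Napoléon ?",
            "answers": [
                # answer_start is the character offset of the span in the context
                {"text": "en mai 1804", "answer_start": 31},
            ],
        },
    ],
}
```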

This is where the dataset currently stands (it is still being extended).

Additional answers collection

Human annotators may not all select the exact same text span when choosing the answer.

Quand fut couronné Napoléon ? (When was Napoleon crowned?)

Possible selected answers in the text: mai 1804, en mai 1804, 1804

To reduce this annotation uncertainty, we gave the crowd-workers already created questions for the test set (the dev set is underway) and asked them to find the corresponding answer. This way, every test set sample has 3 answers from different annotators.
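
This matters for evaluation too: the SQuAD-style metrics reported later (Exact Match and token-level F1) take the best score over all available reference answers, so legitimate span variations like the ones above are not penalized. Below is a simplified sketch; the official evaluation scripts also normalize answers (lowercasing, stripping punctuation and articles), which is omitted here.

```python
# Simplified sketch of SQuAD-style metrics, taking the max over references.
# The official scripts add answer normalization not shown here.
from collections import Counter

def f1_score(prediction, reference):
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, reference):
    return float(prediction == reference)

def score(prediction, references):
    # Best score over all annotated answers (3 per test question).
    return (max(exact_match(prediction, r) for r in references),
            max(f1_score(prediction, r) for r in references))

print(score("en mai 1804", ["mai 1804", "en mai 1804", "1804"]))  # (1.0, 1.0)
```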

Some examples from the dataset

Dataset Analysis

We gathered a diverse set of questions and answers, which makes for more robust QA models.

Question/Answer type frequency in FQuAD

The crowd-workers were encouraged to ask difficult questions, and it shows in the dataset. Difficult here means there are linguistic variations between the question and the passage containing its answer: synonyms, different syntax, or the need for world knowledge or for information spread over multiple sentences.

Question-answer relationships in 108 randomly selected samples from the FQuAD development set. In bold the elements needed for the corresponding reasoning, in italics the selected answer.

Experiments

With the dataset in place, the goal was to fine-tune a language model on the QA task. Thanks to the recently released French language models, CamemBERT and FlauBERT, and to the HuggingFace library, the process is straightforward.
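
To make this concrete, here is a minimal sketch of what that fine-tuning looks like with the transformers library. The file name, hyperparameters, and the one-example-at-a-time loop are illustrative only; this is not the exact recipe behind CamemBERTQA.

```python
# Hedged sketch: fine-tuning CamemBERT for extractive QA with HuggingFace
# transformers. File names, hyperparameters and the single-example loop are
# illustrative; the actual CamemBERTQA training setup may differ.
import json
import torch
from transformers import CamembertTokenizerFast, CamembertForQuestionAnswering

tokenizer = CamembertTokenizerFast.from_pretrained("camembert-base")
model = CamembertForQuestionAnswering.from_pretrained("camembert-base")

def load_squad_style(path):
    """Flatten a SQuAD-format JSON file into (context, question, text, start) tuples."""
    with open(path) as f:
        data = json.load(f)["data"]
    examples = []
    for article in data:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                ans = qa["answers"][0]
                examples.append((para["context"], qa["question"],
                                 ans["text"], ans["answer_start"]))
    return examples

def encode(context, question, answer_text, answer_start):
    """Tokenize and map the character-level answer span to token positions."""
    enc = tokenizer(question, context, truncation="only_second", max_length=384,
                    padding="max_length", return_offsets_mapping=True,
                    return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    seq_ids = enc.sequence_ids(0)
    start_char, end_char = answer_start, answer_start + len(answer_text)
    start_tok = end_tok = 0
    for i, (s, e) in enumerate(offsets):
        if seq_ids[i] != 1:          # only look at context tokens
            continue
        if s <= start_char < e:
            start_tok = i
        if s < end_char <= e:
            end_tok = i
    enc["start_positions"] = torch.tensor([start_tok])
    enc["end_positions"] = torch.tensor([end_tok])
    return enc

# Minimal training loop (batch size 1 for clarity; "fquad_train.json" is a
# placeholder for the FQuAD training file in SQuAD format).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for context, question, text, start in load_squad_style("fquad_train.json"):
    batch = encode(context, question, text, start)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```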

Results

We fine-tuned the language models on FQuAD, on a translated version of SQuAD, and on a combination of the two.

Test scores of different models trained and evaluated on FQuAD (ours) and SQuAD translated

Our best model, CamemBERTQA, reaches 88.0% F1 and 77.9% EM on the FQuAD test dataset. Interestingly, there is a clear gap between translated and native French data: a model’s scores on each test set depend strongly on which training set was used.

Performance Analysis

If we look at how the model performs on the different question and answer types, it works best on structured data like numbers or proper nouns, but still produces adequate results on more complex types.

Performance on question/answer types. F1h and EMh are the human scores

Learning Curve

CamemBERTQA’s learning curve with subsets of FQuAD

To better understand how the number of samples affects the model’s performance, we carried out multiple runs with training subsets of different sizes and evaluated each on the same test set. Performance keeps improving as more samples are added.
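
In pseudo-procedure form, reusing the helpers from the fine-tuning sketch above (the subset sizes here are illustrative):

```python
# Hedged sketch of the learning-curve procedure: train on random subsets of
# increasing size and evaluate every run on the same fixed test set.
import random

examples = load_squad_style("fquad_train.json")   # from the earlier sketch
subset_sizes = (1000, 5000, 10000, 15000, len(examples))
for n in subset_sizes:
    subset = random.sample(examples, min(n, len(examples)))
    # Fine-tune a fresh CamemBERT on `subset` (see the loop above), then
    # compute F1 / EM on the full FQuAD test set for this training size.
    print(f"training-set size: {len(subset)}")
```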

The dataset is still growing as I write this article, and we will soon publish an updated baseline with state-of-the-art multilingual models for comparison. Stay tuned for updates!

Test it yourself!

You can download the dataset and check out our trained QA model demo :)
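
If you just want to query a model, the transformers question-answering pipeline is the quickest route. The checkpoint name below is an assumption; point it at whichever FQuAD fine-tuned CamemBERT checkpoint you use. The context simply reuses the Napoléon example from earlier in the article.

```python
# Hedged sketch: asking a question with the transformers question-answering
# pipeline. The model name is an assumption; substitute your own
# FQuAD-finetuned checkpoint if it differs.
from transformers import pipeline

qa = pipeline("question-answering",
              model="illuin/camembert-base-fquad",
              tokenizer="illuin/camembert-base-fquad")

result = qa(question="Quand fut couronné Napoléon ?",
            context="Napoléon fut couronné empereur en mai 1804.")
print(result["answer"], result["score"])
```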

