Building a simple Open Domain Question Answering pipeline in French

Manuel Faysse
Illuin
Mar 27, 2020 · 19 min read
Image: a talking robot, a metaphor for Natural Language Processing | Siarhei Tolak, Shutterstock

Natural Language Processing (NLP) is one of the fastest growing fields of artificial intelligence. It is centered around the comprehension of human language, represented as textual data, by computers. Question Answering (QA) is an NLP subfield in which an algorithm is given a short text (a context) and is trained to answer natural-language questions about that text by highlighting the most relevant passage within the context. This approach can be generalized to Open-Domain QA (or large Closed-Domain QA), with the added difficulty that the context is not given and instead has to be retrieved by the algorithm from a much larger set of documents.

While the NLP research field is extremely active, most efforts are concentrated around the English language. However, it would be false to state that no efforts are made in other languages. Recently, state-of-the-art NLP architectures were adapted to the French language (CamemBERT and FlauBERT) and opened the way for research on a variety of tasks that were not previously possible. Specifically, Illuin Technology trained the first French Question Answering model, which yields performance on par with the best English QA models.

This blog post will focus on building a simple end-to-end Open Domain Question Answering pipeline leveraging this brand new technology. The goal is not to dive into the complexities of the models that are presented, nor to build a highly optimized OpenQA model, but rather to give an overview of what exists in French NLP today and to build a simple yet useful, first of its kind, feature in French Question Answering.

The pipeline consists of two main steps: a document retrieval step, in which a paragraph-long text is retrieved from a much larger corpus, and a question answering step, in which the answer to the question is highlighted within the retrieved document. To illustrate the process, we will use as our corpus a 220-page governmental report on the environment in France, and we will also generalize the pipeline to answer questions over the entirety of French Wikipedia.

Table Of Contents

Introduction

Document Retrieval

Bag Of Words

TF-IDF

Use Case: Document Retrieval

Question Answering

Word2Vec

FastText

ELMo

The BERT tidal wave

FQuAD

Use case: Open Domain QA

Document Retrieval

While more complex methods exist, the simplest, fastest, and most widespread information retrieval techniques are frequency based. In 2015, over 83% of text-based recommender systems relied on frequency techniques; a typical example is Wikipedia's search engine, which leveraged tf-idf weighting. All these methods are built upon the famous Bag-Of-Words model.

Bag Of Words

One of the simplest NLP algorithms is the Bag-Of-Words model (BOW). It consists of counting the occurrences of words in a text without taking their order into account. The word frequencies can then be used as features to compare two documents, or to train downstream machine learning models such as classifiers. This model is language independent and can be applied to any type of text without any prior training.

Bag Of Words Representation
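
To make this concrete, a minimal bag-of-words representation can be built in a few lines of Python (a toy sketch; the example sentences are arbitrary):

from collections import Counter

documents = [
    "le chat mange la souris",
    "le chien regarde le chat",
]

# build the shared vocabulary, then one frequency vector per document (word order is lost)
vocabulary = sorted({word for doc in documents for word in doc.split()})
bow_vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in documents]

print(vocabulary)
print(bow_vectors)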

Frequency counts can be made more relevant through the use of techniques such as tokenization or stemming.

Basic tokenization is the process of splitting textual data into individual words. Once each word in a text corresponds to a unique token, we can map it to a unique token id and feed it into machine learning algorithms. More advanced methods don't necessarily pair each token with a single word, but may rather create tokens built from a subword or, on the contrary, from more than one word.

Stemming is the process of reducing a word to its root form. The French words "pêcher", "pêchons" and "pêcheur" all have the same stemmed form, "pêch". Lemmatization is a similar process that is a bit more advanced in the sense that it requires an understanding of the word and its context. The goal of lemmatization is to yield the dictionary form of a word. For example, the French verb "suis" should be lemmatized as "être", its infinitive form. Lemmatization is a non-trivial task and no perfect algorithm exists to achieve it.

The NLTK snowball module offers stemmers based on suffix-stripping algorithms in several languages, including French, and the spaCy library offers lemmatizers based on language-specific lookup tables that are available for download.

In both processes, the goal is to reduce the size of the vocabulary and cluster words derived from the same roots in order to help with frequency methods, such as BOW models.
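
As a rough sketch, both operations can be tried in a few lines, assuming NLTK and the spaCy fr_core_news_md model are installed:

from nltk.stem.snowball import SnowballStemmer
import spacy

stemmer = SnowballStemmer("french")
nlp = spacy.load("fr_core_news_md")

# stemming: strip suffixes to reach a common root form
print([stemmer.stem(w) for w in ["pêcher", "pêchons", "pêcheur"]])

# lemmatization: map each token to its dictionary form (e.g. "suis" should map to "être")
print([(token.text, token.lemma_) for token in nlp("Je suis parti pêcher en mer")])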

TF-IDF

In most texts, BOW models can be dominated by very frequent words ("de", "le", "un" in French, for example) that are not document specific and thus do not convey relevant information for later retrieval or classification tasks. TF-IDF is a weighting scheme that gives more importance to terms that are specific to a particular document. Word frequencies in a document are balanced against the ratio of documents in which the words appear, thus helping identify the document's specific terms and yielding a better frequency-based representation of a text.

Use Case: Document Retrieval

Now that the basic principles of information retrieval are understood, let's apply them to our use case. The first step is to extract textual data from the PDF report. We use Tesseract to convert the PDF report to an HTML version, then the BeautifulSoup library to split the report page by page and remove the titles, figures and captions from our corpus.

import json
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('data/env_fra_synthese2019.html'), 'html.parser')

page_json = {}
for i, page in enumerate(re.split(r"Page \d+", str(soup))):
    tmp = []
    tmp_soup = BeautifulSoup(page, 'html.parser')

    # keep only the body text spans, skipping titles, figures and captions
    for s in tmp_soup.find_all('span', {'style': re.compile(r'LiberationSans; font-size:9px')}):
        t = s.get_text()
        t = t.replace('\n', ' ')
        if len(t) > 2:
            tmp.append(t)

    page_json[i] = " ".join(tmp)

with open('data/page.json', 'w') as outfile:
    json.dump(page_json, outfile)
JSON file with the extracted PDF information

If we consider all pages of the original report PDF as individual documents, we can apply the TF-IDF weighting scheme to determine the most relevant lemmatized words in our document.

We use the augmented term frequency formula to prevent bias towards longer documents:

tf(t, d) = 0.5 + 0.5 * f(t, d) / max_{t' ∈ d} f(t', d)

Here f(t, d) is the number of occurrences of term t in document d. This tf scheme prevents bias towards longer documents by normalizing each term's count by the count of the most frequent word in the document.

Inverse document frequency is calculated as:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

N is the total number of documents in corpus D, and the fraction's denominator is the number of corpus documents d containing term t.

The product of these two metrics yields the tf-idf weight of each word in each document of a given corpus:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

It is important to note that this indexing process can be lengthy, but it only needs to be done once and the results can be stored for later use. Observing the highest weighted terms in each document gives a pretty good idea of the themes it discusses.

As an example, page 43 in the report discusses biodiversity and endangered species and the 10 highest weighted terms are ‘ours’ , ‘lynx’, ‘cantonner’, ‘pyrénéens’, ‘noyau’, ‘herbivore’, ‘prédateur’, ‘loup’, ‘comté’ and ‘intergouvernemental’. Page 94 discusses green economy and jobs, and the highest weighted terms are ‘recrutement’, ‘cdd’, ‘employeur’, ‘intention’, ‘animation’, ‘déposer’, ‘initiale’, ‘spécialement’, ‘attractif’.

Sidenote: Considering the entire report as a single document and using a huge corpus composed of random Wikipedia articles yields tf-idf weights for the document that can be represented as a word cloud.

In order to retrieve the most relevant document from the corpus, we select the document d that maximizes the sum of the tf-idf weights of the query terms:

d* = argmax_{d ∈ D} Σ_{t ∈ Q} tfidf(t, d, D)

D is the set of documents in the corpus, Q the set of terms of the query.
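
Here is a minimal sketch of this indexing and retrieval logic, assuming the report pages have already been tokenized and lemmatized into lists of terms (the helper functions below are illustrative, not the original implementation):

import math
from collections import Counter

def build_tfidf_index(documents):
    """documents: dict {doc_id: list of lemmatized tokens}."""
    n_docs = len(documents)
    # document frequency: in how many documents each term appears
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in df}

    index = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        max_count = max(counts.values())
        # augmented term frequency times inverse document frequency
        index[doc_id] = {t: (0.5 + 0.5 * c / max_count) * idf[t] for t, c in counts.items()}
    return index

def retrieve(query_terms, index):
    """Return the doc_id maximizing the sum of the tf-idf weights of the query terms."""
    scores = {doc_id: sum(weights.get(t, 0.0) for t in query_terms)
              for doc_id, weights in index.items()}
    return max(scores, key=scores.get)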

Let's test our simple system out:

question = "Quelle part de communes francaises ont subit une catastrophe naturelle récemment ?"retrieved context ='Plus de quatre cinquièmes des communes françaises ont été reconnues en état de  catastrophe  naturelle   au  moins  une  fois  entre  1982  et  2015.  Cela  représente  un  cumul  de  108 900  reconnaissances de communes en état de catastrophe naturelle, dont un peu moins des trois  quarts au titre des inondations et un cinquième au titre de la sécheresse.   Le coût des différents périls couverts par le régime d’indemnisation de ces catastrophes s’élève  à environ 33 Md€ entre 1990 et 2017, soit en moyenne 983 M€ par an. Les inondations (55 %) et la sécheresse (33 %) en représentent près des neuf dixièmes.   Les  risques  technologiques  recouvrent  en  particulier  les  risques  sur  la  santé,  la  sécurité  et  l’environnement, ainsi que les risques industriels, nucléaires, chimiques, induits par les activités  humaines.   Les  sources  de  risques  technologiques  sont  notamment  :  les  installations  industrielles,  les  installations  nucléaires,  le  transport  de  matières  dangereuses,  les  sites  miniers  (on  parle  de  « l’après-mine »), les grands barrages.   Fin 2018, 18 000 communes exposées aux risques technologiques sont recensées en France.  Parmi  ces  communes,  545  (soit  3 % des  communes  à  risque)  sont  concernées  par  au  moins  trois risques technologiques, 3 434 par deux types de risques technologiques.   La moitié de ces communes sont situées dans les départements de l’Isère, du Pas-de-Calais, de  la Loire, de l’Aube, du Gard, du Rhône et des Bouches-du Rhône.'

Other more advanced options exist, such as the Okapi BM25 ranking algorithm based on the same concepts.
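
For instance, the rank_bm25 Python package offers a ready-to-use BM25 implementation that could replace the tf-idf scorer above; a quick sketch, assuming the pages have been pre-tokenized into a list of token lists:

from rank_bm25 import BM25Okapi

# tokenized_pages: list of token lists, one per report page (assumed to exist)
bm25 = BM25Okapi(tokenized_pages)

query_tokens = question.lower().split()
scores = bm25.get_scores(query_tokens)  # one relevance score per page
best_page = int(scores.argmax())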

This information retrieval step is here demonstrated on a PDF report corpus, but can easily be adapted to be used with the entirety of Wikipedia by leveraging their built-in article retrieval search engine, based partly on tf-idf weighting. This is as easy as using the Python wrapper for the MediaWiki API.

import wikipedia
wikipedia.set_lang("fr")
question = "Où est né Napoléon?"
relevant_title = wikipedia.search(question, results=10)
['Jean-Christophe Napoléon',
'Napoléon III',
'Louis-Napoléon Bonaparte (1856-1879)',
'Napoléon Ier',
'Napoléon-Jérôme Bonaparte',
'Napoléon-Louis Bonaparte (1804-1831)',
'Victor Napoléon',
'Villa Cyrnos',
'Napoléon II',
'Louis Napoléon']

It is interesting to note that "Jean-Christophe Napoléon" is probably not the Napoléon the user was thinking of when asking for a place of birth, and the API is very sensitive to the different words present in the question. The system becomes much more robust if we first extract the named entities from the question (with spaCy) and query the API with only those entities.

import spacy

nlp = spacy.load('fr_core_news_md')

question = "Où est né Napoléon?"
query = " ".join([str(x) for x in nlp(question).ents])
query = query if len(query) > 0 else question
relevant_title = wikipedia.search(query, results=10)

['Napoléon Ier',
'Napoléon III',
'Napoléon II',
'Louis-Napoléon Bonaparte (1856-1879)',
'Louis Napoléon',
'Victor Napoléon',
'Jean-Christophe Napoléon',
'Saint-Napoléon',
'Maison Bonaparte',
'Le Sacre de Napoléon']

Question Answering

The real beauty of this pipeline lies in the Question Answering step. QA is an ongoing research effort that has been revolutionized by the rise of embeddings and, more recently, Transformer networks. Understanding what led to today's state-of-the-art QA models is essential to gain insight into how they work. To skip the theory, jump straight to the use case section below.

Embeddings like Word2Vec map words to an N-dimensional continuous latent space

Word Embeddings

Many of the modern NLP methods stem from Mikolov's groundbreaking research in 2013. He popularized the representation of words using continuous vectors (Word2Vec), an idea Bengio had introduced in 2001. At the time, words in NLP were usually represented by a unique index and had to be one-hot encoded to be fed into neural networks. While relying on such techniques to encode words in a given vocabulary had shown encouraging results on a variety of tasks, they led to huge and sparse representations of the input sequences, and were unable to capture similarities between words: synonyms were as distant in their representations as any other two words. Representing words as vectors in a meaningful latent space of fixed dimensionality allowed similar words to be mapped into neighboring regions of the latent space, e.g. vec("dog") is close to vec("puppy"). Most surprisingly, it also captured more complex relationships between concepts. It was shown, for example, that vec("France") - vec("Paris") + vec("Rome") ≈ vec("Italy"). The proposed mapping is known as word2vec.

Equivalent relationships between words in the latent space | https://arxiv.org/pdf/1301.3781.pdf

Representing words as dense vectors is usually not a goal in itself, but improves the performance of downstream tasks such as document classification, sentiment analysis, semantic understanding…

In order to construct a vectorized representation of a vocabulary, Mikolov introduced two novel strategies through the Continuous Bag-Of-Words (CBOW) model, and the Skip-Gram model.

https://arxiv.org/pdf/1301.3781.pdf

In the CBOW model, a simple neural network model is trained on predicting a word knowing the rest of the words in a context window. In the Skip-Gram model, words from the context window are predicted given the central word. In both cases, the vector representations of words from the input vocabulary are initially random, and are progressively refined during training using gradient descent methods.

These techniques do not explicitly take into account the ordering of the context words. To remedy this in an efficient manner, it is possible to create a vocabulary with tokens constructed from more than one word, known as n-grams. The idea is that n-grams partially capture word ordering and allow the model to leverage more information from the sequence. Using the CBOW or Skip-Gram models with n-grams is shown to yield results equivalent to methods in which word positions are explicitly modeled, while being much more computationally efficient. This method is used by the FastText algorithm, with n-grams often chosen with n=2 (bigrams). FastText, unlike word2vec, gains the ability to form good sentence representations by averaging word features, and at the time rivaled deeper neural methods while being more efficient.

Both methods can be used in any language, given they are trained with sufficient data (web crawls, Wikipedia dumps, etc.)
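
As an illustration, such embeddings can be trained with the gensim library; a minimal sketch assuming gensim 4.x and a toy corpus (a real model would be trained on millions of sentences):

from gensim.models import FastText

# toy corpus: a list of tokenized French sentences
sentences = [
    ["le", "chat", "mange", "la", "souris"],
    ["le", "chien", "regarde", "le", "chat"],
    ["la", "souris", "mange", "du", "fromage"],
]

# Skip-Gram (sg=1) FastText model with subword n-grams
model = FastText(sentences=sentences, vector_size=100, window=3, min_count=1, sg=1)

# nearest neighbours in the latent space
print(model.wv.most_similar("chat", topn=3))
# with a large corpus, analogies emerge, e.g.
# model.wv.most_similar(positive=["roi", "femme"], negative=["homme"])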

Contextualized representations

In NLP, context matters!

ELMo

Up to now, we have only talked about pre-trained word representations that do not depend on the context of the word or n-gram within the sentence. The word "bank", for example, would have the same vector representation whether it was used in the sentence "I put money in the bank" or "we hiked along the river bank". The Allen Institute for Artificial Intelligence published a paper, "Deep contextualized word representations", in 2018 aiming to change this. In their proposed representation, each word token is assigned a representation that depends on the entire surrounding sentence.

Although the idea had already been explored in previous works, their proposed architecture, ELMo, which leverages biLMs (bidirectional deep LSTM models trained with language modeling objectives, i.e. next word prediction), achieved state-of-the-art results on a variety of downstream NLP tasks. It was shown that different language features were captured depending on the depth of the internal LSTM states: higher-level states understood a word in its context, while lower-level states modeled syntax. To combine all these features, ELMo's final vector representation of a word in its context is a linear combination of the representations from the top layers of the biLMs as well as the intermediate representations coming from the hidden internal layers. ELMo vectors are meant to be used as additional features that can be added to existing models to improve their performance.

A French version of the ELMo model, trained on Wikipedia dumps and Common Crawl data, can be found at https://github.com/HIT-SCIR/ELMoForManyLangs.
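
A rough sketch of how these pre-trained French ELMo vectors could be loaded with the ELMoForManyLangs package, assuming the French model files have been downloaded locally (the path below is a placeholder):

from elmoformanylangs import Embedder

# directory containing the downloaded French ELMo model (placeholder path)
embedder = Embedder("models/french_elmo")

sentences = [
    ["L'", "avocat", "plaide", "au", "tribunal"],
    ["Je", "mange", "un", "avocat", "bien", "mûr"],
]

# one contextual vector per token: the two occurrences of "avocat"
# (lawyer vs. avocado) get different representations
vectors = embedder.sents2elmo(sentences)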

Going Deeper: Transformers

In the ELMo architecture, LSTMs are the backbone used to learn contextualized representations of words. In 2017, the paper "Attention is all you need" introduced the concept of transformer networks. Transformer networks are encoder-decoder networks that rely on self-attention mechanisms to help the model focus on the most relevant words in the context. For every word in the input sequence, self-attention determines the most relevant context words, which are then given more importance when encoding the contextualized word vectors. This allows models to learn long-term dependencies in the input sequences, something recurrent neural networks like LSTMs or Gated Recurrent Units struggled to do. The authors show that the attention mechanism is sufficient in itself and that there is no need to add recurrent connections or convolutions to the models.

The word “it” is linked with some words more than others | http://jalammar.github.io/illustrated-transformer/

In transformer networks, a sequence of words is thus translated into a context-dependent sequence of vectors in the encoder. These vectors are later used by the decoder to generate an output sequence word by word. The generated words depend on the encoder output, as well as on the previously generated words, through attention mechanisms. Transformers were an immediate breakthrough in sequence-to-sequence tasks such as machine translation.

The BERT tidal wave

The BERT (Bidirectional Encoder Representations from Transformers) model, released by Jacob Devlin's team in 2018, is based on transformer encoders. BERT is a framework meant to be used in two steps: first pre-training the transformer encoder backbone on unlabeled textual data, then modifying the model by adding an additional output layer and fine-tuning it on task-specific data. The costly pre-training process is a one-time effort, and the model can then be easily trained on a variety of downstream tasks, thus generalizing the transfer learning process, common in computer vision, to NLP tasks.

In the backbone, input tokens are split using an algorithm called WordPiece, and the WordPiece embeddings are learned during the training process. WordPiece reduces the vocabulary size by splitting infrequent words into subwords or even individual characters, while frequent words keep their integrity. These embeddings are then transformed through feed-forward and attention layers into contextualized vector representations. The specificity that BERT introduces mostly lies in the pre-training process. BERT, as the name suggests, trains a language model that generates its predictions by leveraging information from tokens both before and after the token the model aims to predict. This is achieved by randomly masking WordPiece tokens and training the model to predict the identity of the masked tokens (Masked Language Model). The authors show that using bidirectional inputs is an improvement over similar models that relied only on unidirectional language models (OpenAI GPT). An additional training objective of the BERT model is to predict whether a pair of sentences follow each other in the original text or not (Next Sentence Prediction).

Once pre-trained, the model can then be slightly modified by adding an additional output layer and fine-tuned on a specific task. This simple step allows BERT to achieve state-of-the-art results in a variety of tasks such as question-answering (QA), paragraph continuation (SWAG), Natural Language Inference (NLI), sentiment analysis, semantic equivalence (MRPC, QQP).

Fine-tuning and Pre-training procedures for BERT model. For QA fine-tuning, the only difference is that a Start and End vector are introduced to predict the span of the answer in the original paragraph. | https://arxiv.org/pdf/1810.04805.pdf

Upgrading BERT

The RoBERTa model improves upon BERT by tweaking the training hyper-parameters: it uses larger mini-batches, relies on dynamic masks that are generated each time a sequence is fed to the model, removes the Next Sentence Prediction task (which the authors show to be detrimental) and trains on full sentences. It relies on Byte-Pair Encoding (BPE) instead of the original WordPiece tokenizer.

The DistilBERT model mainly aims to reduce the model size to improve inference speed and allow the model to be used on devices with limited computing capacity. It is trained using knowledge distillation, a technique in which a smaller "student" network learns to reproduce the behavior of a bigger "teacher" network. The authors show they are able to retain 97% of BERT's language understanding performance with a much smaller model.

The authors of the ALBERT model observed that bigger models often lead to better results. In order to scale up the BERT model without an explosion in the number of parameters, they improve the efficiency of the BERT architecture by sharing weights between layers and by factorizing the embedding matrix to reduce its size. Reducing the number of parameters in this way (an 18x decrease) allows ALBERT to match much deeper BERT models with a similar parameter count. Finally, the Next Sentence Prediction task is replaced by a sentence-order prediction (SOP) task, in which the goal is to determine the original ordering of two consecutive sentences in the original document.

CamemBERT

CamemBERT is a French version of the RoBERTa model. It differs slightly in its use of whole-word masking (as opposed to subword token masking in the original model) and of a SentencePiece tokenizer, an extension of the WordPiece concept. It is trained on web data from the OSCAR dataset, a filtered version of the French Common Crawl corpus, which amounts to over 138 GB of textual data. CamemBERT is the first effort to adapt the BERT model family to French, and provides baselines for a variety of NLP tasks. The model and loading code are available at https://camembert-model.fr/, in TorchHub, as well as in the HuggingFace transformers library https://github.com/huggingface/transformers.
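
For instance, with the Hugging Face transformers library, the pre-trained model can be tried on a masked word completion task in a few lines (a minimal sketch using the fill-mask pipeline):

from transformers import pipeline

# masked language modeling head on top of the pre-trained camembert-base checkpoint
camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")

# CamemBERT uses "<mask>" as its mask token
for prediction in camembert_fill_mask("Le camembert est <mask> !"):
    print(prediction["token_str"], prediction["score"])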

FlauBERT

FlauBERT is another parallel, government-subsidized effort to adapt BERT to the French language. It is similar in all respects to the RoBERTa model, apart from a French-specific pre-processing step that tokenizes full words before applying BPE encoding. It is trained on a variety of French sub-corpora ranging from web crawls to written books. It is interesting to note that the training data amounts to 71 GB, about half of what was used for the CamemBERT pre-training. The model and code are available at https://github.com/getalp/Flaubert and a HuggingFace implementation of the model exists as well.

FQuAD

Two barriers prevented the development of efficient Question Answering models in French until recently. The first was the lack of state-of-the-art transformer networks pre-trained specifically on a French corpus. The second was the lack of a quality, annotated French dataset of Context-Question-Answer triplets needed to fine-tune a pretrained model. The first barrier fell in late 2019 with the release of CamemBERT and FlauBERT, while the second fell just a few months later, in February 2020, with the open release of FQuAD, the French equivalent of SQuAD: a dataset with thousands of human-annotated question-answer pairs pertaining to Wikipedia contexts.

Training a model became possible, and by fine-tuning the CamemBERT model, the authors of the FQuAD paper set a new baseline for Question Answering models in the French language. A demonstration platform is available at https://fquad-demo.illuin.tech/.

Use case: Question Answering on the retrieved context

Using our document retrieval system, we can feed the question and the retrieved context to the Question Answering model finetuned on FQuAD.

First, we install the dependencies. The FQuAD pretrained model architecture is built on top of the CamemBERT model in the Hugging Face transformers library, itself built on top of the RoBERTa Hugging Face implementation.

import torch
from transformers.modeling_camembert import CamembertForQuestionAnswering
from transformers import CamembertTokenizer

We can now instantiate the QA model using the CamembertForQuestionAnswering architecture available in the Hugging Face transformers library. Tokenizers are language specific, so we download the CamemBERT tokenizer and the model weights from Illuin Technology's QA model on the Hugging Face hub.

model_path = 'illuin/camembert-base-fquad'
tokenizer = CamembertTokenizer.from_pretrained(model_path)
model = CamembertForQuestionAnswering.from_pretrained(model_path)

The only thing left to do is to run the model with a question and context and retrieve the answer text from the tokenized input using model predictions on the most probable start and end logits.

def evaluate(question, text):
    # encode the (question, context) pair into a single sequence of token ids
    input_ids = tokenizer.encode(question, text)
    start_scores, end_scores = model(torch.tensor([input_ids]))
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # most probable start and end positions of the answer span
    start = torch.argmax(start_scores)
    end = torch.argmax(end_scores) + 1
    # convert the SentencePiece tokens back to a readable string
    return ''.join(all_tokens[start:end]).replace('▁', ' ').strip()

The presented version of the code doesn't scale to larger input sequences (512 tokens maximum for the question and the text combined) and only retrieves the single most probable answer. To implement a robust version of the pipeline, it is recommended to use the evaluate() function provided in the SQuAD examples of the Hugging Face repository (https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
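
As a rough workaround for the length limit, one can slide an overlapping window over the context and keep the best-scoring answer. The helper below is a sketch of this idea, not part of the original code, and the window and stride sizes are arbitrary:

def evaluate_long(question, text, window=250, stride=125):
    # split the context into overlapping chunks of `window` words
    words = text.split()
    best_answer, best_score = "", float("-inf")
    for i in range(0, max(1, len(words) - window + stride), stride):
        chunk = " ".join(words[i:i + window])
        input_ids = tokenizer.encode(question, chunk)
        if len(input_ids) > 512:
            continue  # chunk still too long after subword tokenization, skip it
        start_scores, end_scores = model(torch.tensor([input_ids]))
        # keep the answer with the highest combined start/end score
        score = (start_scores.max() + end_scores.max()).item()
        if score > best_score:
            best_score = score
            tokens = tokenizer.convert_ids_to_tokens(input_ids)
            start = torch.argmax(start_scores)
            end = torch.argmax(end_scores) + 1
            best_answer = "".join(tokens[start:end]).replace("▁", " ").strip()
    return best_answer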

Let’s try the full pipeline with a question that the report contains the answer to.

question = "Quels facteurs influent sur l'espérance de vie à la naissance ?"

First, a context is retrieved using the BOW technique with TF-IDF weighting. The model correctly identifies page 172 as the page containing the answer.

Ainsi, 12 % des adultes vivent dans un foyer en situation d’insécurité alimentaire pour raisons  financières. Ces personnes sont plus jeunes que les autres et en majorité des femmes. Malgré  un revenu supérieur en moyenne au seuil de pauvreté, elles semblent devoir plus souvent faire  face seules à des dépenses élevées, notamment pour le logement, mais aussi à des contraintes  importantes en termes d’accès aux soins et à l’alimentation .   L’état de santé des populations marqué par la situation économique   L’espérance de vie à la naissance représente la durée de vie moyenne d’une génération fictive  soumise  aux  conditions  de  mortalité  de  l’année.  Elle  intègre  les  conséquences  de  divers  facteurs : mortalité prématurée, qualité de l’offre de soins (accès aux soins, densité médicale),   comportements à risque, etc. La structure des emplois constitue également un déterminant.   La situation économique des ménages est un facteur susceptible de limiter, voire d’empêcher,  l’accès  aux  soins.  Le  taux  de  renoncement  aux  soins  pour  des  raisons  financières  permet  d’estimer la part de la population concernée par ces barrières. En 2014, 2,3 % de la population  (soit  environ  1,5 million  de  personnes)  déclare  avoir  renoncé  à  des  soins  pour  des  raisons  financières  au  cours  des  douze  derniers  mois  (Insee,  SRCV-Silc).  Depuis  2010,  cette  part  augmente peu à peu (+ 0,6 point sur la période observée).   Selon l’enquête SRCV (Insee), en 2016, 1,4 % des ménages pauvres ne disposent pas d’une  douche ou  d’une  baignoire  (contre  0,6 % dans  le  reste  de  la  population).  1,1 % des  ménages  pauvres  n’ont  pas  d’eau  chaude  (0,4 %  chez  les  autres).  8,8 %  des  ménages  pauvres  ne  possèdent pas de système de chauffage central ou électrique (3,7 % chez les autres). De plus,  la surface moyenne du logement est 62 % plus importante chez les propriétaires que chez les  locataires (Insee, enquête Logement 2013).   L’enquête  SRCV  2016  révèle  aussi  que  920 000  ménages  vivent  dans  des  logements  composés  d’une  seule  pièce.  38 %  de  ces  ménages  sont  en  situation  de  pauvreté.  Plus  de  60 000 enfants vivent avec leur famille dans une seule pièce.

The QA model can then run given the question and the context and retrieve the correct answer.

Answer: "mortalité prématurée, qualité de l’offre de soins (accès aux soins, densité médicale), comportements à risque"

This method's bottleneck lies in the information retrieval step: queries that are syntactically too different from the text contained in the corpus documents will not yield great results.

As previously stated, the same technique can be generalized to information retrieval over the whole Wikipedia platform. The same bottleneck exists, as the information retrieval step is highly dependent on the Wikipedia search engine, which is partly based on weighted frequency methods.

Open-domain Question Answering search engine built with the previously detailed pipeline

Conclusion

While the pipeline is relatively simple, the model manages to perform fairly well on a diverse range of open questions. Improvements in the retrieval step could go a long way towards creating a more robust model, but steps could also be taken in the question answering stage to provide more realistic confidence scores for the answers. Running the pipeline over the n best retrieved documents and re-ranking the answers based on the combined retrieval score and the confidence score given by the question answering model could also help improve the accuracy of the pipeline. Using a question answering model trained with questions that have no answer in the given context (SQuAD 2.0) could also help detect that the retrieved document was incorrect and improve robustness. Work is underway in this direction.

It is easy to imagine the industrial applications of such a pipeline and a more complex version could easily be used to precisely query large information databases, user manuals, websites…

Open-Domain QA pipeline could be integrated into chatbots for diverse uses

Manuel Faysse
Data science and Robotics MSc degree from EPFL, ex-intern at Illuin Technology and Imperial College London.