Open-domain question answering with DeepPavlov

Vasily Konovalov
DeepPavlov
Feb 4, 2019

The ability to answer factoid questions is a key feature of any dialogue system. Formally speaking, answering questions based on a collection of documents covering a wide range of topics is called open-domain question answering (ODQA). The ODQA task combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer span within those articles). An ODQA system can be used in many applications: chatbots apply ODQA to answer user requests, while business-oriented Natural Language Processing (NLP) solutions leverage ODQA to answer questions based on internal corporate documentation. The picture below shows a typical dialogue with an ODQA system.

Figure 1. A typical dialogue with an ODQA system

There are several approaches to the architecture of an ODQA system. A modular ODQA system consists of two components: the first one (the ranker) finds the relevant articles in a database (e.g., Wikipedia), whereas the second one (the reader) extracts an answer from a single article or a small collection of articles retrieved by the ranker. In addition to these strictly two-component ODQA systems, there are hybrid systems based on several rankers, where the last ranker in the pipeline is combined with an answer extraction module, usually via reinforcement learning.

Next, I would like to show how you can use the DeepPavlov Wikipedia-pretrained ODQA system. In addition, I will describe how you can train the ODQA system on your own data. The code used in this article can be accessed on Colaboratory via the link. Moreover, you can deploy the ODQA models (and many others) on Amazon Web Services using an EC2 virtual machine by following the steps from the documentation. Furthermore, you can check out our demo.

Model description

The architecture of the DeepPavlov ODQA skill is modular and consists of two components: a ranker and a reader. In order to answer any question, the ranker first retrieves a few relevant articles from the article collection, and then the reader scans them carefully to identify the answer. The ranker is based on DrQA [1] proposed by Facebook Research. Specifically, the DrQA approach uses unigram-bigram hashing and TF-IDF matching designed to efficiently return a subset of relevant articles based on a question. The reader is based on R-NET [2] proposed by Microsoft Research Asia and its implementation by Wenxuan Zhou. The R-NET architecture is an end-to-end neural network model that aims to answer questions based on a given article. R-NET first matches the question and the article via gated attention-based recurrent networks to obtain a question-aware article representation. Then, the self-matching attention mechanism refines the representation by matching the article against itself, which effectively encodes information from the whole article. Finally, the pointer networks locate the positions of the answer in the article. The scheme below shows the DeepPavlov ODQA system architecture.

Figure 2. The DeepPavlov-based ODQA system architecture

DeepPavlov’s ODQA system has two pretrained Wikipedia-based models. The first one is based on the English Wikipedia dump from 2018–02–11 (5,180,368 articles) and the second one is based on the Russian Wikipedia dump from 2018–04–01 (1,463,888 articles). The models treat Wikipedia as a collection of articles and do not rely on its internal graph structure. As a result, our approach is generic and can be applied to other collections of documents, books, or even daily newspapers.

The ODQA models are described in separate configuration files under the deeppavlov/configs/odqa folder. In DeepPavlov, configuration files define the data processing pipeline and consist of four main sections: dataset_reader, dataset_iterator, chainer, and train. More about the configuration files can be found in Simple intent recognition and question answering with DeepPavlov.

The input of the en_odqa_infer_wiki configuration is the user request (question_raw), and the output is the model’s best_answer. The model’s output depends on the ranker, defined as the first component of the ODQA pipe, and on the reader, defined as the last component of the ODQA pipe.
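For illustration, here is an abridged skeleton of such a configuration written out as a Python dict; the section names and the in/out values come from the description above, while everything else is left as a placeholder:

# Abridged skeleton of a DeepPavlov configuration (placeholder values only).
config_skeleton = {
    "dataset_reader": {},    # where the source documents come from
    "dataset_iterator": {},  # how the data is iterated over during training
    "chainer": {             # the data processing pipeline itself
        "in": ["question_raw"],   # user request
        "pipe": [],               # ranker comes first, reader comes last
        "out": ["best_answer"]    # the model's answer
    },
    "train": {}              # training parameters
}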

Ranker configuration

The ranker configuration files are located in the deeppavlov/configs/doc_retrieval folder.

The dataset_reader section of the ranker’s configuration defines the source of the articles. The source can be of the following dataset_format:

wiki — the Wikipedia dump extracted by wikiextractor,

txt — a path to a folder with separate text files (one document per file),

json — JSON files, which should be formatted as a list of dicts containing the title and doc keys (see the sketch below).
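As a sketch of the json format, here is what such a file could contain, shown as the equivalent Python structure and written to disk; the documents themselves are made up:

import json

# A list of dicts, each holding the "title" and "doc" keys (made-up documents).
documents = [
    {"title": "Cerebellum",
     "doc": "The cerebellum is a major feature of the hindbrain of all vertebrates ..."},
    {"title": "Tuberculosis",
     "doc": "Tuberculosis is an infectious disease usually caused by Mycobacterium tuberculosis ..."}
]

with open("my_dataset.json", "w") as f:
    json.dump(documents, f)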

The first component of the ranker’s pipeline defines the document hashing vectorizer with the tokenization parameters:

lemmas = True — enable lemmatization,

ngram_range = [1,2] — tokenize the documents into unigrams and bigrams.

As an output, the ranker returns a list of the top_n documents.
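As a quick sanity check, you can build and query the pretrained English ranker on its own. A minimal sketch, assuming the en_ranker_tfidf_wiki configuration from the doc_retrieval folder (the query is made up):

from deeppavlov import build_model, configs

# Build the pretrained English TF-IDF ranker; download=True fetches the
# Wikipedia database and the TF-IDF matrix on the first run.
ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True)

# The ranker maps a batch of queries to the texts of the most relevant articles.
docs = ranker(["Who invented the telephone?"])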

The detailed description of the ranker configuration parameters can be found in the DeepPavlov documentation.

Reader configuration

The reader configuration files are located in the deeppavlov/configs/squad folder. The reader component aims to find an answer to a question in the given articles retrieved by the ranker component. The reader implementation is based on the R-NET reading comprehension system developed by Microsoft Research. The detailed description of the reader configuration parameters can be found in the DeepPavlov documentation.
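The reader can also be used on its own as a regular SQuAD-style question answering model. A minimal sketch, assuming the squad configuration bundled with DeepPavlov (the context and question are made up):

from deeppavlov import build_model, configs

# Build the pretrained SQuAD reader; download=True fetches its weights.
reader = build_model(configs.squad.squad, download=True)

# The reader takes a batch of contexts and a batch of questions
# and returns the extracted answer spans.
contexts = ["DeepPavlov is an open-source library for building dialogue systems."]
questions = ["What is DeepPavlov?"]
print(reader(contexts, questions))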

Model requirements

In order to interact with the model, first install the model’s requirements:

python -m deeppavlov install deeppavlov/configs/odqa/en_odqa_infer_wiki.json

The DeepPavlov ODQA system has two Wikipedia-based models. The English Wikipedia model requires 35 GB of local storage, whereas the Russian version takes up about 20 GB. The Wikipedia dumps can be rebuilt by following the steps described in the documentation. Both models require about 24 GB of RAM. It is possible to run them on a 16 GB machine, but then the swap size should be at least 8 GB.

Interaction with the model

You can interact with the models via the command line:

python -m deeppavlov interact deeppavlov/configs/odqa/en_odqa_infer_wiki.json -d

where -d denotes downloading all files required by the models.

Alternatively, you can run the model from Python code:
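A minimal sketch (the example questions are made up; download=True pulls the Wikipedia database and the pretrained reader on the first run):

from deeppavlov import build_model, configs

# Build the full English ODQA pipeline: the TF-IDF ranker plus the R-NET reader.
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)

# Ask a batch of factoid questions; the model returns the best answer for each.
answers = odqa(["What is the capital of Russia?",
                "Who invented the telephone?"])
print(answers)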

Training the model

Both components of the ODQA system are trained separately. However, if you want to run ODQA on your own data, you only need to fit the ranker component. The reader component is pretrained on the Stanford Question Answering Dataset (SQuAD). SQuAD is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable [3]. Nevertheless, if you do want to retrain the reader on your own data, the DeepPavlov documentation fully describes how to do so.

As a training corpus, I will use the PLoS Computational Biology corpus [4]. It consists of 300 computational biology articles, each of them stored in a separate txt file. For simplicity, we will use the same configuration file that is used for the Wikipedia-based ODQA system; however, we strongly encourage you to create custom configuration files for your own models.

First, download the dataset and extract it into a folder.
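Then fit the ranker on the extracted files. A minimal sketch, assuming the articles were unpacked into a hypothetical plos_corpus folder; the config tweaks mirror the dataset_reader options described above:

from deeppavlov import configs, train_model
from deeppavlov.core.common.file import read_json

# Start from the Wikipedia ranker config and point it at the extracted txt files.
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = "plos_corpus"   # hypothetical folder with the articles
model_config["dataset_reader"]["dataset_format"] = "txt"

# Fit the TF-IDF ranker on the corpus, then retrieve articles for a query.
doc_retrieval = train_model(model_config)
docs = doc_retrieval(["cerebellum"])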

The resulting docs list contains the top 30 relevant articles for the query cerebellum. Then, let’s build the full ODQA model and run a query:
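A sketch of building the ODQA pipeline on top of the freshly fitted ranker; the question is made up, and the use of download=False here is an assumption so that the locally trained ranker is not overwritten by the Wikipedia one:

from deeppavlov import build_model, configs

# download=False keeps the locally fitted ranker; the reader weights still have
# to be fetched once, e.g. by running the interact command with -d as shown earlier.
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)

# The answer is extracted from the PLoS articles returned by the new ranker.
answers = odqa(["What is the cerebellum responsible for?"])
print(answers)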

Conclusion

In this article, I’ve described the ODQA model of the DeepPavlov framework. This model is based on a two-component approach. While designing an ODQA architecture, you should always seek a balance between the model’s performance and its resource requirements. By using our system, you can achieve good performance with reasonable resources. Nowadays, all the best-performing models are based on the BERT architecture (Bidirectional Encoder Representations from Transformers) [5]. However, BERT-based systems require huge computational resources.

References

[1] Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

[2] Natural Language Computing Group, Microsoft Research Asia (2017). R-NET: Machine reading comprehension with self-matching networks.

[3] Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

[4] Teufel, S., & Moens, M. (2002). Summarizing scientific articles: experiments with relevance and rhetorical status. Computational linguistics, 28(4), 409–445.

[5] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
