Developing QA Systems for any Language with DeepPavlov

Vasily Konovalov
Jul 8, 2020 · 7 min read

DeepPavlov is a conversational AI framework that contains all the components required for building chatbots. It’s free and easy to use. This article describes how to use DeepPavlov’s Question Answering models for English and other languages.

But first, let’s talk a bit about the use case. In this article, we assume that you are a software engineer who is responsible for building Conversational AI experiences for your company. Today, these experiences are mainly encapsulated in the form of a chatbot that might be available in a widget on a website, or through the user’s favorite instant messaging application. A typical chatbot is designed to address a variety of customer needs, including automation tasks (e.g., ordering pizza or reserving a table in a restaurant), answering questions, providing support by switching to human operators, etc.

While automation tasks are typically covered by using goal-oriented dialogs with intents and slot filling, Question Answering (QA) can be achieved by using the Reading Comprehension approach, which searches for the answer in a given text.

Intro to Reading Comprehension

The Natural Language Processing (NLP) community has been working on this task for quite a while. Question answering on the SQuAD dataset is the task of finding the answer to a question in a given context (e.g., a paragraph from Wikipedia), where the answer to each question is a segment of the context:

A demo from

There are several datasets designed using the SQuAD format, including but not limited to the English SQuAD dataset (Stanford), the Russian SberQuAD, and the Chinese DRCD. The SQuAD paper introduced a logistic regression-based model that achieved an F1 score of 51.0% (a significant improvement over a simple baseline of 20%). However, at the time, it was far behind the human performance of 86.8%. Nowadays, the top-performing model already outperforms humans and achieves an F1 of 93.011%.
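To make the "answer is a segment of the context" idea concrete, here is a minimal sketch of a single record in the SQuAD layout. The text, the `toy-0` id, and the helper name `answer_span` are invented for illustration; real datasets wrap many such records in additional top-level fields.

```python
# A toy record in the SQuAD layout; the content is invented for illustration.
record = {
    "context": "DeepPavlov is an open-source library for conversational AI.",
    "qas": [
        {
            "id": "toy-0",
            "question": "What is DeepPavlov?",
            "answers": [
                {
                    "text": "an open-source library for conversational AI",
                    "answer_start": 14,  # character offset into the context
                }
            ],
        }
    ],
}


def answer_span(context, answer):
    """Return the (start, end) character offsets of the answer in the context."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    # The answer must be a literal segment of the context.
    if context[start:end] != answer["text"]:
        raise ValueError("answer is not a span of the context")
    return start, end


start, end = answer_span(record["context"], record["qas"][0]["answers"][0])
print(record["context"][start:end])
```

A reading comprehension model trained on this format predicts the start and end of the span rather than generating free text.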

This huge jump in performance became possible due to the development of BERT. It is undoubtedly a breakthrough in Natural Language Processing. BERT is a transformer-based technique for pretraining contextual word representations that enables state-of-the-art results across a wide array of natural language processing tasks. The BERT paper was acknowledged as the best long paper 👏 of the year by the North American Chapter of the Association for Computational Linguistics. Google Research released several pre-trained BERT models, including the multilingual, Chinese, and English-language BERT. A pre-trained BERT can be used for a wide variety of language tasks by adding only a small layer on top of the core model.

Using DeepPavlov’s QA Models

At DeepPavlov, we integrated BERT into solutions for three popular NLP tasks: text classification, tagging, and question answering. For QA over your data, you need to provide the model with a batch of contexts and a batch of questions.

Thankfully, to try out our QA model in real life, all you need is to install the framework, use the code below to satisfy the model’s requirements, and provide your context and question:

pip install deeppavlov
python -m deeppavlov install squad_bert
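Once the requirements are installed, querying the model from Python could look like the sketch below. It assumes the `build_model` and `configs.squad.squad_bert` entry points of DeepPavlov's Python API as documented around the time of writing; check the documentation for your installed version. The helper names `make_batch` and `ask` are hypothetical and only serve the illustration.

```python
def make_batch(pairs):
    """Split (context, question) pairs into the two parallel lists the model takes."""
    contexts = [context for context, _ in pairs]
    questions = [question for _, question in pairs]
    return contexts, questions


def ask(pairs):
    """Run the English QA model on a list of (context, question) pairs."""
    # Imported lazily so this file can be loaded without DeepPavlov installed.
    from deeppavlov import build_model, configs

    # download=True fetches the pre-trained model files on first use.
    model = build_model(configs.squad.squad_bert, download=True)
    contexts, questions = make_batch(pairs)
    # The output layout is defined by the config, so inspect it
    # rather than relying on this sketch.
    return model(contexts, questions)


if __name__ == "__main__":
    pairs = [
        ("DeepPavlov is a library for NLP and dialog systems.",
         "What is DeepPavlov?"),
    ]
    print(ask(pairs))
```

The i-th question is answered against the i-th context, so both lists must have the same length.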

Using QA Models with Other Languages

After learning about DeepPavlov’s QA model for the English language, let’s take a look at a slightly different use case. Consider a situation where the company you work for wants to enter a new market or expand its business to another country. As a software engineer responsible for making your company’s conversational AI available to the new customers, you have to find a way to quickly provide a QA experience to them.

Fortunately, BERT-based QA models can be adapted to other languages. Let’s take a look at what should be done:

  1. Prepare a language-specific BERT. Pre-training BERT is well known to be a fairly expensive process (four days on 4 to 16 Cloud TPUs) that not every established business can afford. Fortunately, pre-trained English, Chinese, and Multilingual BERT models are already available.
  2. Collect a language-specific QA dataset. This is a tough process that usually cannot be done without crowdsourcing platforms like Amazon Mechanical Turk.

Do We Really Need to Train a Language-Specific BERT?

Let’s figure out whether we really need to train a language-specific BERT, or whether the already pre-trained Multilingual BERT (M-BERT), which was trained on Wikipedia in 104 languages, can provide decent performance. To answer this question, we test two BERT-based models on three lexically non-overlapping languages (English, Russian, and Chinese).

We train the BERT-based models with the following settings:

validation_patience = 10

We compare the two models on two metrics: exact match (EM), which measures whether the prediction matches the ground truth exactly, and F1, which measures the lexical overlap between the prediction and the ground truth.
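For a single prediction, the two metrics can be sketched as follows. This follows the general idea of the English SQuAD evaluation (lowercasing, stripping punctuation and articles, then comparing tokens); it is an illustrative assumption, not DeepPavlov's own metric code, and whitespace tokenization would need to be replaced for Chinese.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

The dataset-level scores are the averages of these per-question values, which is why F1 is always at least as high as EM.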

First, we need to train two models on the English SQuAD dataset. To do so, install all requirements with the install command, then run the train command.

python -m deeppavlov train squad_bert

In many cases you don’t need to train the model from scratch; you can use an already pre-trained model and evaluate it on data by running the evaluate command.

python -m deeppavlov evaluate squad_bert -d

The models’ performance (F1/EM) on SQuAD test set (English)

In English, we see almost comparable performance, which is not surprising because the training set for M-BERT is dominated by English.

Then we repeat the same experiment but for the Russian SberQuAD dataset. In addition, we trained a Multilingual BERT model on English training data and tested it on the Russian test set.

The models’ performance (F1/EM) on SberQuAD test set (Russian)
python -m deeppavlov evaluate squad_ru_bert -d
python -m deeppavlov evaluate squad_ru_rubert -d

The RuBERT-based model outperforms the M-BERT based model by a small margin when both are trained on the SberQuAD training set.

Lastly, we perform the same experiment but for the Chinese DRCD dataset.

The models’ performance (F1/EM) on DRCD test set (Chinese)
python -m deeppavlov evaluate squad_zh_bert_mult -d
python -m deeppavlov evaluate squad_zh_bert_zh -d

The DRCD dataset is based on Chinese Wikipedia and contains Latin symbols that were not properly handled by the character-based tokenization used for pre-training Chinese BERT. This leads to almost comparable performance of the M-BERT based model and the language-specific Chinese BERT-based model.

Based on these results, we can conclude that M-BERT based models perform comparably with language-specific BERT when trained on the same language-specific training set.

Does Using The Language-Specific Training Data Help?

If there is no need to train a language-specific BERT, do we still need a large language-specific training set? Or does M-BERT trained on the already available English SQuAD plus some target-language data yield comparable performance?

We build learning curves to measure how much the language-specific monolingual dataset contributes to the performance of the M-BERT based model. The learning curves for Russian are depicted in Figure 1. First, we define two bounds. The upper bound is the performance of the M-BERT based model trained solely on the entire SberQuAD dataset (red dashed line). The lower bound is the performance of the M-BERT based model trained only on the English SQuAD dataset (green dashed line).

Figure 1. Learning curves on SberQuAD (Russian)

The -SQuAD curve denotes the model trained on the part of the SberQuAD dataset and the +SQuAD curve denotes the model trained jointly on the same part of the SberQuAD dataset and on the entire English SQuAD dataset.
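The two training regimes behind each point of the curves can be sketched as below. How the subsets were actually drawn is not stated here, so the random sampling with a fixed seed, the toy sample tuples, and the helper names `take_subset` and `build_training_sets` are all assumptions made for illustration.

```python
import random


def take_subset(samples, size, seed=0):
    """Pick `size` samples reproducibly; the seed value is an arbitrary choice."""
    rng = random.Random(seed)
    return rng.sample(samples, min(size, len(samples)))


def build_training_sets(target_samples, english_samples, size, seed=0):
    """Return the '-SQuAD' and '+SQuAD' training sets for one curve point."""
    subset = take_subset(target_samples, size, seed)
    minus_squad = subset                         # target-language part only
    plus_squad = subset + list(english_samples)  # same part plus all English data
    return minus_squad, plus_squad


# Toy stand-ins for (context, question, answer) samples.
russian = [("ru_ctx_%d" % i, "ru_q_%d" % i, "ru_a_%d" % i) for i in range(100)]
english = [("en_ctx_%d" % i, "en_q_%d" % i, "en_a_%d" % i) for i in range(500)]

for size in (10, 50, 100):
    minus_squad, plus_squad = build_training_sets(russian, english, size)
    print(size, len(minus_squad), len(plus_squad))
```

Training one model per set and per size, then evaluating each on the same target-language test set, yields one point on each curve.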

Adding the entire English SQuAD training set significantly improves performance in comparison to the model trained only on the part of the SberQuAD set. Moreover, the model trained on the joint dataset with 5,000 to 10,000 language-specific monolingual training samples is only a few points behind the model trained on the entire language-specific training set.

Similar findings hold for Chinese as depicted in Figure 2.

Figure 2. Learning curves on DRCD (Chinese)


We’ve learned a few important things today. First, the BERT-based approach to question answering can save a great deal of time when building QA experiences for your customers, and with DeepPavlov, getting the system up and running is a matter of a few lines of code. Second, the BERT-based approach works across a variety of languages, and M-BERT based models perform comparably with language-specific BERT when trained on the same training set. We also showed that M-BERT based models trained jointly on widely available English training data and some language-specific instances come within a few points of models trained on the entire language-specific training set.

You can find out more about the technical details in our paper. Also, do not forget to check out DeepPavlov’s other models here. Feel free to test our BERT-based models using our demo. DeepPavlov also has a dedicated forum, where any questions concerning the framework and the models are welcome.


