Introducing DeepPavlov Library 1.0.0 — an Open-Source Natural Language Processing Library

Vasily Konovalov
Published in DeepPavlov
Nov 30, 2022

DeepPavlov Library is an open-source conversational library for Natural Language Processing (NLP) and Multiskill AI Assistant development. This article describes our first major release, 1.0.0. The release is based on PyTorch and leverages the Transformers and Datasets packages from Hugging Face to train various transformer-based models on hundreds of datasets. This article also describes how to use our new Transformer-based models for text classification, sequence classification, and question answering.

Install DeepPavlov Library

DeepPavlov Library is an open-source NLP framework. It contains all essential state-of-the-art NLP models that can be used alone or as a part of DeepPavlov Dream, an open-source Multi-Skill AI Assistant Platform. The library contains various text classification models for topic classification, insult detection, and intent recognition. DeepPavlov's sequence classification models allow you to recognize named entities and tag parts of speech. Our question-answering models can provide you with an answer based on a given textual context, an integrated knowledge base, or Wikipedia.

But first, you should install the DeepPavlov Library by running:

pip install deeppavlov==1.0.1

The DeepPavlov Library supports Python 3.6–3.9.

How to use DeepPavlov Library

The DeepPavlov models are organized in separate configuration files under the configuration folder. A config file consists of five main sections: dataset_reader, dataset_iterator, chainer, train, and metadata. The dataset_reader defines the dataset’s location and format. After loading, the data is split between the train, validation, and test sets according to the dataset_iterator settings.

The chainer section of the configuration file defines the model pipeline:

  • the in and out fields define the input and output of the chainer,
  • the pipe section defines the pipeline of the components required to interact with the models.

The top-level metadata section describes the model requirements along with the model variables; a simplified skeleton of a whole config is sketched below.
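For orientation, here is a minimal, illustrative skeleton of a classification config. The field names follow common DeepPavlov conventions, but the exact contents vary between configs, so treat the values below as placeholders rather than a ready-to-run configuration:

{
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "label",
    "data_path": "{DOWNLOADS_PATH}/my_dataset"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": ["... the Preprocessor, the Classifier, and other components go here ..."],
    "out": ["y_pred_labels"]
  },
  "train": {
    "epochs": 5,
    "batch_size": 64,
    "metrics": ["accuracy"]
  },
  "metadata": {
    "variables": {
      "TRANSFORMER": "bert-base-uncased",
      "MODEL_PATH": "{MODELS_PATH}/classifiers/my_model"
    }
  }
}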

The transformer-based models consist of at least two components: the Preprocessor that encodes the input, and the Classifier itself.

The parameters of the Preprocessor are shown below:

{
  "class_name": "torch_transformers_preprocessor",
  "vocab_file": "{TRANSFORMER}",
  "do_lower_case": true,
  "max_seq_length": 64,
  "in": ["x"],
  "out": ["bert_features"]
}

Here vocab_file contains a variable that is defined in the metadata section of the configuration file. The TRANSFORMER variable defines the name of the transformer-based model from the Hugging Face models repository. For example, bert-base-uncased refers to the original BERT model introduced in the paper. Besides the original BERT model, you can use DistilBERT if you have limited computational resources. Moreover, you can use any of the BART or ALBERT models.
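For instance, the corresponding fragment of the metadata section might look like this (the value here is illustrative; every config ships with its own default):

"metadata": {
  "variables": {
    "TRANSFORMER": "distilbert-base-uncased"
  }
}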

The torch_transformers_classifier parameters are shown below:

{
  "class_name": "torch_transformers_classifier",
  "n_classes": "#classes_vocab.len",
  "return_probas": true,
  "pretrained_bert": "{TRANSFORMER}",
  "save_path": "{MODEL_PATH}/model",
  "load_path": "{MODEL_PATH}/model",
  "optimizer": "AdamW",
  "optimizer_parameters": {"lr": 1e-05},
  "learning_rate_drop_patience": 5,
  "learning_rate_drop_div": 2.0,
  "in": ["bert_features"],
  "in_y": ["y_ids"],
  "out": ["y_pred_probas"]
}

Here:

  • bert_features is the input to the component: the input strings encoded by the Preprocessor,
  • the pretrained_bert parameter is the transformer-based architecture, the same one that was defined in the Preprocessor,
  • the save_path and load_path parameters define where to save the model and where to load it from during training and inference, respectively,
  • the learning_rate_drop_patience parameter defines how many validation rounds with no improvement to wait before training is stopped,
  • the learning_rate_drop_div parameter defines the divisor applied to the learning rate when learning_rate_drop_patience is reached.

You can interact with the models defined in the configuration files via the command-line interface (CLI).

python -m deeppavlov interact topics_distilbert_base_uncased [-d] [-i]

Here, the -d flag downloads the required data, such as pretrained model files and embeddings, and the -i flag installs all of the model's requirements.

You can train a model by running it with the train command. The model will be trained on the dataset defined in the dataset_reader section of the configuration file:

python -m deeppavlov train topics_distilbert_base_uncased [-d] [-i]

A more detailed description of these and other commands can be found in our docs.
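For example, besides interact and train, the same pattern works for evaluating a model or serving it as a REST API with the riseapi command (see the docs for the full list of commands and options):

python -m deeppavlov evaluate topics_distilbert_base_uncased [-d] [-i]
python -m deeppavlov riseapi topics_distilbert_base_uncased [-d] [-i]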

DeepPavlov Library for Text Classification

Let’s demonstrate the DeepPavlov BERT-based text classification models by using the topic classification model. The model is trained on the DeepPavlov Topic dataset that covers 33 topics including but not limited to Animals&Pets, Art&Hobbies, Artificial Intelligence, Beauty, Books&Literature, and many others. More information about DeepPavlov Topic can be found on the dataset page.

To interact with the model, first you need to call build_model. The download=True parameter indicates that we want to use an already pretrained model, downloading its files if necessary, while install=True installs the model's requirements:

from deeppavlov import build_model, configs, evaluate_model
model = build_model('topics_distilbert_base_uncased', download=True, install=True)
model(["What do you think about the Arrival movie?", "Do you like listening to Sting!"])
# [['Movies&Tv'], ['Music']]

You can evaluate the model by running evaluate_model. The performance of the topic classification model is measured with three metrics: ROC-AUC, Accuracy, and F1-macro:

from deeppavlov import evaluate_model
scores = evaluate_model('topics_distilbert_base_uncased')

Let’s check how the text classification model performance depends on the transformer architecture.

You can always use a different version of the transformer by specifying the TRANSFORMER variable in the metadata section, for example, albert-base-v2, distilbert-base-uncased, or bert-base-uncased. Then you can retrain the model and check the results:

from deeppavlov import train_model, evaluate_model
model = train_model('topics_distilbert_base_uncased')
evaluate_model('topics_distilbert_base_uncased')
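If you prefer to switch the transformer from Python instead of editing the config file by hand, one possible sketch is to load the config as a dictionary, override the variable, and retrain. This assumes the config lives under configs.classifiers; adjust the path to where the config actually resides:

from deeppavlov import configs, train_model, evaluate_model
from deeppavlov.core.common.file import read_json

# load the raw config as a dict (variables are resolved later, at build/train time)
model_config = read_json(configs.classifiers.topics_distilbert_base_uncased)
# override the transformer backbone defined in the metadata section
model_config['metadata']['variables']['TRANSFORMER'] = 'albert-base-v2'

model = train_model(model_config)
evaluate_model(model_config)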

If you want to learn more about our DeepPavlov Topic dataset, please check the paper.

DeepPavlov Library for Named Entity Recognition

DeepPavlov Library contains a bunch of sequence classification models that can be used for sequence classification tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.

For example, suppose we want to extract persons' and organizations' names from a text, as in this example from the DeepPavlov demo page:

NER output from the DeepPavlov demo page

The annotation for this task is usually done through the BIO encoding scheme, in which the B-* tag is assigned to the first token of the entity, the I-* tag marks the following tokens of the entity and the O tag is used for non-entity tokens. Note that the set of entity types may vary depending on the dataset or a specific task.
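As a small illustration, here is how the sentence "Elon Musk founded Tesla" is annotated under the BIO scheme with the entity types used later in this section:

tokens = ['Elon', 'Musk', 'founded', 'Tesla']
tags   = ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']  # B-* opens an entity, I-* continues it, O marks non-entity tokens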

The ner_ontonotes_bert configuration file defines a model that is fine-tuned on the OntoNotes NER dataset and supports 18 entity types, including the standard ones: PERSON, ORGANIZATION, LOCATION. The ner_ontonotes_bert configuration is based on bert-base-cased, the original English cased BERT released by Google. Casing is an important feature for detecting named entities, because named entities usually start with an uppercase letter.

Interacting with the model via Python

from deeppavlov import build_model, configs
ner = build_model('ner_ontonotes_bert', download=True, install=True)
ner(["Elon Musk founded Tesla in 2003", "Pichai was selected to become the next CEO of Google on August 10, 2015"])
# [[['Elon', 'Musk', 'founded', 'Tesla', 'in', '2003'],['Pichai', 'was', 'selected', 'to', 'become','the', 'next', 'CEO', 'of', 'Google', 'on', 'August', '10', ',', '2015']],[['B-PERSON', 'I-PERSON', 'O', 'B-ORG', 'O', 'B-DATE'],['B-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'I-DATE']]]

You can interact with the model via the CLI too:

python -m deeppavlov interact ner_ontonotes_bert [-d] [-i]

In addition to the English model, we've fine-tuned a multilingual one on the OntoNotes dataset; it supports 18 entity types and 103 languages. The multilingual model is based on multilingual BERT, which is able to transfer knowledge between languages: for example, you can fine-tune a model on one language and evaluate it on another.

from deeppavlov import build_model, configs
ner_mult = build_model('ner_ontonotes_bert_mult', download=True, install=True)
ner_mult(["Curling World Championship will be held in Antananarivo", "Чемпионат мира по кёрлингу пройдёт в Антананариву"])
# [[['Curling', 'World', 'Championship', 'will', 'be', 'held', 'in','Antananarivo'], ['Чемпионат', 'мира', 'по', 'кёрлингу', 'пройдёт', 'в', 'Антананариву']],[['B-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'O', 'B-GPE'], ['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'B-GPE']]]

As you can see, multilingual BERT fine-tuned on the original English OntoNotes dataset is capable of detecting named entities in other languages. You can find more about multilingual named entity recognition models in our article here.

As I mentioned, the DeepPavlov Library contains all the essential components for building AI dialogue assistants. But dialogue assistants use a variety of communication channels, including voice and chat. In many cases, input from a voice recognition module comes without proper casing or completely in lowercase. However, truecasing (that is, capitalization where needed) is a crucial feature for detecting named entities. There are two ways to cope with this: either restore the true casing or adapt the NER model to handle improper casing. Our ner_case_agnostic_mdistilbert configuration file defines a model that is able to detect named entities in text with improper casing. For example,

from deeppavlov import build_model, configs
ner_caseagnostic = build_model('ner_case_agnostic_mdistilbert', download=True, install=True)
ner_caseagnostic(["elon musk founded tesla in 2003"])
# [[['elon', 'musk', 'founded', 'tesla', 'in', '2003']],[['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'O']]]

In addition, ner_case_agnostic_mdistilbert is bilingual and supports Russian and English. The model is successfully used in our DREAM AI Assistant. If you want to learn more about our case-agnostic NER model, check our paper Multilingual Case-Insensitive Named Entity Recognition.

DeepPavlov Library for Question Answering

One can use DeepPavlov for extractive Question Answering (QA). Question Answering can be achieved with the Reading Comprehension approach, which seeks an answer in a given text. The Natural Language Processing (NLP) community has been working on this task for quite a while. Question Answering on the SQuAD dataset is the task of finding the answer to a question in a given context (e.g., a paragraph from Wikipedia), where the answer to each question is a segment of the context:

A demo from demo.deeppavlov.ai

There are several datasets designed using the SQuAD format, including but not limited to the English SQuAD dataset (Stanford), Russian SberQuAD, and Chinese DRCD.

from deeppavlov import build_model, configs
model = build_model('squad_bert', download=True, install=True)
model(["In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called showers."],["Where do water droplets collide with ice crystals to form precipitation?"])
# [['within a cloud'], [305], [1.0]]

The model returns the answer, its position in characters, and the confidence score; since the outputs are parallel lists, you can unpack them directly (see the sketch below). The BERT-based approach to question answering can be a huge time-saver for building QA experiences for your customers, and with DeepPavlov, getting the system up and running is a matter of several lines of code. If you want to learn more about DeepPavlov QA models, check our article.
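A minimal sketch of consuming that output, assuming context and question hold the strings from the example above:

answers, positions, scores = model([context], [question])
print(answers[0], positions[0], scores[0])
# within a cloud 305 1.0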

Conclusion

We hope this was helpful and that you'll be eager to use DeepPavlov for your own Natural Language Understanding use cases. You can read more about us on our official blog. Also, feel free to test our BERT-based models by using our demo. Please star ⭐️ us on the GitHub page. And don't forget that DeepPavlov has a dedicated forum, where any questions concerning the framework and the models are welcome. We really appreciate your feedback; please let us know what you ❤️ and what you 💔 about DeepPavlov.

Acknowledgement

This release became possible due to the efforts of our fabulous team: Fedor Ignatov, Dmitrij Euseew, Anastasia Chizhikova, Dilyara Zharikova, Daniel Kornev.

Announcing DeepPavlov Contributor Program

The DeepPavlov Library will be turning 5 years old this coming February, counting from its very first release, and today, by shipping the long-awaited v1.0.0, we are happy to see it being easily used across the globe, both on-premises and in the cloud. Hence, we are thrilled to announce our DeepPavlov Contributor Program. It is a fantastic opportunity to join us in our incredible adventure towards the big dream of building AI assistants that can understand us, teach us, learn from us, and help us to become better.

We already have some inspiring stories of contributors to our DeepPavlov Library, and we welcome you to learn more about the program here.
