Introducing DeepPavlov — an Open-Source NLP Library

Vasily Konovalov
Published in DeepPavlov · 6 min read · Aug 12, 2021

DeepPavlov is an open-source conversational library for Natural Language Processing (NLP). It is designed for the development of production-ready, complex conversational systems and for research in the area of NLP. Starting from version 0.16 ✅, DeepPavlov is based on PyTorch, though some backward compatibility with the older TensorFlow-based models still exists. This article describes how to use our new Transformer-based PyTorch models, including text classification, sequence classification, and question answering. The models are based on the Transformers library from Hugging Face. The library enables developers to use a wide variety of transformer-based models; moreover, we support Datasets from Hugging Face, with hundreds of datasets to train your model. The full list of newly created DeepPavlov configuration files and other features can be found in the DeepPavlov 0.16 release notes. The code from this article can be run in our Colab notebook.

Install DeepPavlov Library

DeepPavlov Library is an open-source framework for NLP. It contains all the essential state-of-the-art models for developing chatbots, including but not limited to text classification, sequence classification, and question answering. But first, you should install DeepPavlov by running:

pip install deeppavlov

We support Linux and Windows platforms, Python 3.6 and Python 3.7 🐍

Intro to DeepPavlov

The DeepPavlov models are organized in separate configuration files under the config folder. A config file consists of five main sections: dataset_reader, dataset_iterator, chainer, train, and metadata. The dataset_reader defines the dataset’s location and format. After loading, the data is split between the train, validation, and test sets according to the dataset_iterator settings.

The chainer section of the configuration file consists of three subsections:

  • the in and out fields define the input and output of the chainer,
  • the pipe section defines the pipeline of components required to interact with the model.

The metadata section describes the model requirements along with the model variables.

The transformer-based models consist of at least two major components:

  • a Preprocessor that encodes the input,
  • the Transformer-based model itself.

The parameters of the Preprocessor are shown below:

{"class_name": "torch_transformers_preprocessor","vocab_file": "{TRANSFORMER}","do_lower_case": true,"max_seq_length": 64,"in": [ "x" ],"out": [ "bert_features" ]}

Here vocab_file contains a variable that is defined in the metadata section of the configuration file. The TRANSFORMER variable defines the name of the transformer-based model from the Hugging Face models repository. For example, bert-base-uncased refers to the original BERT model that was introduced in the paper. Besides the original BERT model, you can use the DistilBERT model if you have limited computational resources. Moreover, you can use any of the BART or ALBERT models.

The second component is based on the Transformer models but differs depending on the task. Below is the torch_transformers_classifier that is used for text classification, for example, insult detection:

{"class_name": "torch_transformers_classifier","n_classes": "#classes_vocab.len","return_probas": true,"pretrained_bert": "{TRANSFORMER}","save_path": "{MODEL_PATH}/model","load_path": "{MODEL_PATH}/model","optimizer": "AdamW","optimizer_parameters": { "lr": 1e-05 },"learning_rate_drop_patience": 5,"learning_rate_drop_div": 2.0,"in": [ "bert_features" ],"in_y": [ "y_ids" ],"out": [ "y_pred_probas" ]}

Here:

  • bert_features is the input to the component; it represents the input strings encoded by the Preprocessor,
  • the pretrained_bert parameter specifies the transformer-based architecture, the same one that was defined in the Preprocessor,
  • the save_path and load_path parameters define where to save the model and where to load it from during training and inference, respectively,
  • the learning_rate_drop_patience parameter defines how many validation rounds with no improvement to wait before the learning rate is dropped,
  • the learning_rate_drop_div parameter defines the divisor of the learning rate when learning_rate_drop_patience is reached.

You can interact with the models defined in the configuration files via the command-line interface (CLI). However, before using any of the built-in models, you should install all of their requirements by running the install command. The model's dependencies are defined in the requirements section of the configuration file:

python -m deeppavlov install insults_kaggle_bert_torch

Here insults_kaggle_bert_torch is the name of the model’s config file.

To get predictions from a model interactively through CLI, run

python -m deeppavlov interact insults_kaggle_bert_torch [-d]

Here -d downloads the required data, such as pretrained model files and embeddings.

You can train a model by running it with the train parameter. The model will be trained on the dataset defined in the dataset_reader section of the configuration file:

python -m deeppavlov train insults_kaggle_bert_torch

The detailed description of these and other commands can be found in our docs.

DeepPavlov for Text Classification

Let’s demonstrate the DeepPavlov BERT-based text classification models using the insult detection problem. This problem involves predicting whether a comment posted during a public discussion is considered insulting to one of the participants. This is a binary classification problem with only two classes: Insult and Not Insult.

To interact with the model, you first need to build it with build_model. The download=True parameter indicates that we want to use the already pretrained model:
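The snippet below is a minimal sketch of this step (the sample comments are illustrative):

from deeppavlov import build_model, configs

# Build the model from its config; download=True fetches the pretrained files
model = build_model(configs.classifiers.insults_kaggle_bert_torch, download=True)

# Run the classifier on a couple of sample comments
print(model(['You are kind of stupid', 'You are a wonderful person!']))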

You can evaluate the model by running evaluate_model. The performance of the text classification model is measured with three metrics: ROC-AUC, Accuracy, and F1-macro:
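A sketch of the evaluation call (the metrics are computed on the dataset splits defined in the config):

from deeppavlov import evaluate_model, configs

# Evaluate the pretrained model; download=True fetches the model files and the dataset
metrics = evaluate_model(configs.classifiers.insults_kaggle_bert_torch, download=True)
print(metrics)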

You can always use a different transformer by specifying the TRANSFORMER variable in the metadata section, for example, albert-base-v2, distilbert-base-uncased, or bert-base-uncased. Then you can retrain the model and check the results.
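One way to do this from Python is sketched below: read the config, override the variable, and retrain. This assumes the TRANSFORMER variable is defined under metadata.variables in this config, as described above:

from deeppavlov import train_model, configs
from deeppavlov.core.common.file import read_json

# Load the config as a dictionary and swap the transformer backbone
model_config = read_json(configs.classifiers.insults_kaggle_bert_torch)
model_config['metadata']['variables']['TRANSFORMER'] = 'albert-base-v2'

# Retrain the classifier with the new backbone
model = train_model(model_config, download=True)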

You can try out our text classification models on the Demo page.

DeepPavlov for Named Entity Recognition

DeepPavlov Transformers-based models can be used for sequence classification tasks such as Named Entity Recognition (NER) and Part of Speech (POS) tagging. For example, suppose we want to extract persons' and organizations' names from text. Then, for the input text:

Yan Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

B-PER I-PER O O B-ORG I-ORG

Here the B- and I- prefixes stand for the beginning and inside of the entity, while O stands for out of tag, or no tag. Markup with this prefix scheme is called BIO markup. It is introduced to distinguish between consecutive entities of the same type.

You can interact with the model via the CLI:

python -m deeppavlov install ner_ontonotes_bert_torch.json
python -m deeppavlov interact ner_ontonotes_bert_torch.json -d

or via the Python code:
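A minimal sketch of the Python call (the input sentence is the example from above):

from deeppavlov import build_model, configs

# Build the English NER model; download=True fetches the pretrained files
ner_model = build_model(configs.ner.ner_ontonotes_bert_torch, download=True)

# Returns the tokenized input together with a BIO tag for each token
print(ner_model(['Yan Goodfellow works for Google Brain']))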

Usually, for every Transformer-based model, you can find two versions: the English version and the multilingual version (trained on 103 languages). The multilingual transformer can transfer knowledge between languages; for example, you can fine-tune a model on one language and evaluate it on another. You can find more about language transfer here.

python -m deeppavlov install ner_ontonotes_bert_mult_torch

Then interact with it via the Python code:
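A sketch of the multilingual call (the non-English input sentence is illustrative):

from deeppavlov import build_model, configs

# Build the multilingual NER model
ner_mult_model = build_model(configs.ner.ner_ontonotes_bert_mult_torch, download=True)

# The same model can tag entities in languages other than English
print(ner_mult_model(['Курьер доставил посылку в офис Google в Москве']))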

You can try out our named entity recognition models on the Demo page.

DeepPavlov for Question Answering

One can use DeepPavlov for extractive Question Answering (QA). Question Answering can be achieved by using the Reading Comprehension approach that seeks an answer in the given text. The Natural Language Processing (NLP) community has been working on this task for quite a while. Question Answering on SQuAD dataset is a task to find an answer to a question in a given context (e.g., a paragraph from Wikipedia), where the answer to each question is a segment of the context.

CONTEXT:
In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.
QUESTION:
Where do water droplets collide with ice crystals to form precipitation?

ANSWER:
within a cloud

Install the model's requirements:

python -m deeppavlov install squad_torch_bert

Then interact with the model via the Python code:
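A minimal sketch of the call (the context and question are taken from the example above, with the context shortened for brevity):

from deeppavlov import build_model, configs

# Build the SQuAD-style reading comprehension model
model_qa = build_model(configs.squad.squad_torch_bert, download=True)

# The model takes a batch of contexts and a batch of questions
context = ['In meteorology, precipitation is any product of the condensation of atmospheric '
           'water vapor that falls under gravity. Precipitation forms as smaller droplets '
           'coalesce via collision with other rain drops or ice crystals within a cloud.']
question = ['Where do water droplets collide with ice crystals to form precipitation?']
print(model_qa(context, question))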

The model returns the answer, its position in characters, and a confidence score.

You can try out our question answering models on the Demo page.

Conclusion

We hope this was helpful and that you'll be eager to use DeepPavlov for your own Natural Language Understanding use cases. You can read more about us on our official blog. Also, feel free to test our BERT-based models by using our demo. Please star ⭐️ us on the GitHub page. And don't forget that DeepPavlov has a dedicated forum, where any questions concerning the framework and the models are welcome. We really appreciate your feedback; please let us know what you ❤️ and what you 💔 about DeepPavlov.
