Text/Document Classification using PyText

Mar 9, 2019

Hands-on with Facebook’s newly open-sourced NLP library PyText based on PyTorch

After releasing PyTorch 1.0, Facebook Research recently open-sourced PyText, its natural language modeling framework built on PyTorch. It aims to bridge the gap between experimentation and production deployment, which was difficult with existing libraries.

PyText aims to achieve the following things:

Make experimentation easy and fast
Reduce extra work when using pre-built models on new data
Define a clear path for researchers and engineers to build, test and deploy their models quickly
Ensure high performance

PyText has strong support for rapid prototyping and is faster than other Natural Language Processing (NLP) libraries available. Here’s a comparison of PyText with other NLP libraries:

Facebook now uses PyText in Portal, its video calling device, and in the M suggestions feature in Messenger. The M suggestions feature generates more than a billion predictions a day, which shows that PyText can operate at production scale while keeping latency low.

PyText is built on PyTorch, and it connects to ONNX and Caffe2. With PyText, AI researchers and engineers can convert PyTorch models to ONNX and then export them as Caffe2 for production deployment at scale.

PyText relies on the components displayed in the figure below:

https://code.fb.com/wp-content/uploads/2018/12/06_PyText_Flowchart_hero.png

Task: combines various components required for a training or inference task into a pipeline. It can be configured as a JSON file that defines the parameters of all the child components. We’ll be discussing a sample config for a document classification task later in the post.

Data Handler: processes raw input data and prepares batches of tensors to feed to the model.

Model: defines the neural network architecture.

Optimizer: encapsulates model parameter optimization using the loss from the forward pass of the model.

Metric Reporter: implements the relevant metric computation and reporting for the models.

Trainer: uses the data handler, model, loss, and optimizer to train a model and perform model selection by validating against a holdout set.

Predictor: uses the data handler and model for inference given a test dataset.

Exporter: exports a trained PyTorch model to a Caffe2 graph using ONNX.

Let’s start by building a sentiment classifier using PyText. It’s simple!

To install PyText on your machine, run the following pip command on your command line:

pip install pytext-nlp

Before getting started, let’s get the data right. Here, we will be using an Amazon reviews dataset, which contains positive and negative reviews for various products and has 10,000 examples in total.
PyText needs the data in a .tsv (tab-separated values) file, in the following fashion:

___label___ 'This is a Text'
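A minimal sketch of producing such a file in plain Python, assuming one label and one text per line separated by a single tab and no header row. The helper names, label names, and file name are placeholders for illustration, not taken from the dataset.

```python
def to_tsv_line(label, text):
    """Format one example as 'label<TAB>text'."""
    # Collapse tabs and newlines inside the text so every example
    # stays on one line with exactly two columns.
    clean = " ".join(text.split())
    return f"{label}\t{clean}"


def write_tsv(examples, path):
    """Write an iterable of (label, text) pairs to a .tsv file."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write(to_tsv_line(label, text) + "\n")


# Example call with made-up reviews:
# write_tsv([("positive", "Works great"), ("negative", "Broke in a week")],
#           "train.tsv")
```

Cleaning the whitespace first matters because a stray tab inside a review would otherwise be read as an extra column.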

Here’s how our dataset (.tsv format) looks:

PyText defines a model through a configuration file (Task) in .json format, where you specify the model and its components.

Here’s a configuration file, where we are providing the training, validation and testing data, as well as other details like the number of epochs, batch size, and optimizer.

{
  "task": {
    "DocClassificationTask": {
      "data_handler": {
        "train_path": "data/train.tsv",
        "eval_path": "data/eval.tsv",
        "test_path": "data/test.tsv",
        "train_batch_size": 128,
        "eval_batch_size": 128,
        "test_batch_size": 128
      },
      "trainer": {
        "epochs": 20
      },
      "optimizer": {
        "lr": 0.001,
        "type": "adam",
        "weight_decay": 0.000004
      }
    }
  }
}
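The same configuration can also be generated from Python, which helps when you want to try several batch sizes or learning rates. This is only a sketch using the standard json module: build_config is a hypothetical helper, and its keys simply mirror the file above rather than documenting PyText’s full schema.

```python
import json


def build_config(data_dir="data", batch_size=128, epochs=20, lr=0.001):
    """Return a dict mirroring the config file above (hypothetical helper)."""
    return {
        "task": {
            "DocClassificationTask": {
                "data_handler": {
                    "train_path": f"{data_dir}/train.tsv",
                    "eval_path": f"{data_dir}/eval.tsv",
                    "test_path": f"{data_dir}/test.tsv",
                    "train_batch_size": batch_size,
                    "eval_batch_size": batch_size,
                    "test_batch_size": batch_size,
                },
                "trainer": {"epochs": epochs},
                "optimizer": {
                    "lr": lr,
                    "type": "adam",
                    "weight_decay": 0.000004,
                },
            }
        }
    }


# Serialize with indentation so the result stays readable.
config_text = json.dumps(build_config(), indent=2)

# To save it for the command line tool:
# with open("config.json", "w", encoding="utf-8") as f:
#     f.write(config_text)
```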

Now, to train the model, run the following on the command line:

pytext train < config.json

And boom, it should be training. By default, it uses a bidirectional LSTM model, and with 15 epochs the model reaches around 83%, which is good considering that we didn’t preprocess the text data.

PyText exports the model as a Caffe2 object. To save the trained model:

pytext export --output-path model.c2 < config.json
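If you would rather drive these commands from a script, a sketch with the standard subprocess module could look like this. It assumes the pytext executable is on your PATH and feeds the config on standard input, just as the shell redirection does. run_pytext and its runner parameter are hypothetical names introduced for illustration.

```python
import subprocess


def run_pytext(args, config_path="config.json", runner=subprocess.run):
    """Run `pytext <args> < config_path` and raise if the command fails."""
    with open(config_path, "rb") as config_file:
        # Passing the open file as stdin plays the role of `< config.json`.
        return runner(["pytext", *args], stdin=config_file, check=True)


# Mirroring the two commands used in this post:
# run_pytext(["train"])
# run_pytext(["export", "--output-path", "model.c2"])
```

The runner parameter exists only so the function can be exercised without PyText installed; in normal use you would leave it at its default.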

For prediction, we use a PyText predictor object, which requires a saved model and the configuration file (.json). Here’s a small Python script to predict the sentiment of a given sentence/text:
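A sketch of what such a script could look like, assuming PyText’s load_config and create_predictor helpers and a model whose input field is named raw_text. The field name can differ between PyText versions and tasks, so treat it as an assumption. The file names match the ones used earlier in this post, and predict is a hypothetical wrapper.

```python
def predict(text, config_file="config.json", model_file="model.c2"):
    """Load the exported Caffe2 model and score one piece of text."""
    # Imported inside the function so this file can be loaded
    # even on a machine where PyText is not installed.
    import pytext

    config = pytext.load_config(config_file)
    predictor = pytext.create_predictor(config, model_file)
    # Assumption: the exported model expects its input under "raw_text".
    return predictor({"raw_text": text})


# Example call (needs PyText, config.json and model.c2 on disk):
# print(predict("This product stopped working after two days."))
```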

Those few lines are all it takes to predict the sentiment of a review with the model trained earlier.

The predictor object predicts the sentiment and returns the probability of every label; the label with the highest probability can be chosen as the answer.
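Picking the winning label can be sketched as follows. It assumes the predictor returns a mapping from names such as scores:positive to a sequence whose first element is that label’s score. The prefix, the sample labels, and the numbers are illustrative only, not real model output.

```python
def best_label(result, prefix="scores:"):
    """Return (label, score) for the highest-scoring entry in a result dict."""
    scores = {
        name[len(prefix):]: float(values[0])
        for name, values in result.items()
        if name.startswith(prefix)
    }
    if not scores:
        raise ValueError(f"no keys starting with {prefix!r} in result")
    label = max(scores, key=scores.get)
    return label, scores[label]


# Made-up scores for illustration only.
sample = {"scores:positive": [-0.22], "scores:negative": [-1.61]}
label, score = best_label(sample)
```

Keeping the prefix as a parameter makes it easy to adapt if your model names its outputs differently.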

Conclusion

This was a basic guide to PyText and to getting started with a basic classifier. I will be writing much more about PyText in the coming weeks, so make sure you follow me to learn more about PyText and its applications.

GitHub Repo: https://github.com/jayrodge/PyText-Classifier

If you found this helpful, please share it on LinkedIn, Twitter, Facebook or any of your favorite forums.

Connect with me on LinkedIn, about.me

Written by Jay Rodge