Realize portable and reusable text preprocessing

piqcy · Published in chakki · 2 min read · Sep 28, 2019

Text preprocessing is a necessary step even when you use deep learning. Before you can feed text into a cool deep learning model, you have to tokenize it, build a vocabulary, convert tokens to ids, and ids to vectors… It’s just like the hassle of airport security.

Photo by Fabio Mascarenhas

chariot saves you.
You can implement text preprocessing by stacking premade functions, and the result becomes a portable, reusable module.

Text preprocessing implemented by chariot

In short, chariot lets you assemble preprocessing like a buffet.

Photo by Ramnath Bhat

In this article, I show the features of chariot and how to use it.

Text preprocessing by chariot

chariot has the following three features:

  1. Declarative implementation
  2. Portable save format
  3. Ready-to-train data formatting

As shown above, you can build a preprocessing pipeline by stacking functions declaratively.

Text preprocessing implemented by chariot
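The stacking idea can be sketched in plain Python. Note this is not chariot’s real API; the `Preprocessor` class and `stack` method here are a minimal illustrative assumption.

```python
# Minimal sketch of declarative, stackable preprocessing.
# NOT chariot's real API; names are illustrative assumptions.

class Preprocessor:
    def __init__(self):
        self.steps = []

    def stack(self, func):
        # Each step is a plain callable; return self to allow chaining.
        self.steps.append(func)
        return self

    def transform(self, text):
        # Apply the stacked steps in order.
        for step in self.steps:
            text = step(text)
        return text


p = (Preprocessor()
     .stack(str.lower)   # normalize case
     .stack(str.split)   # naive whitespace tokenizer
     .stack(lambda tokens: [t for t in tokens if t not in {"the", "a"}]))

print(p.transform("The chariot saves You"))  # ['chariot', 'saves', 'you']
```

Because every step is just a function, the whole pipeline reads top-to-bottom as a declaration of what happens to the text.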

Each preprocessing function is implemented as a scikit-learn Transformer. For that reason, you can mix any scikit-learn Transformer into your preprocessing, and save the whole thing as a pickle file (which makes it portable).

preprocessor.save("my_preprocessor.pkl")
loaded = Preprocessor.load("my_preprocessor.pkl")
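Because everything follows the scikit-learn fit/transform convention, a fitted preprocessor carries its learned state through pickling. A toy illustration using only the standard library’s pickle module; the `VocabularyEncoder` class here is an assumption for demonstration, not part of chariot:

```python
import pickle

class VocabularyEncoder:
    """Toy scikit-learn-style Transformer: tokens -> integer ids."""

    def fit(self, texts):
        # Assign an id to every token seen in the training texts.
        self.vocab_ = {}
        for text in texts:
            for token in text.split():
                self.vocab_.setdefault(token, len(self.vocab_))
        return self

    def transform(self, texts):
        # Unknown tokens map to -1.
        return [[self.vocab_.get(t, -1) for t in text.split()]
                for text in texts]


encoder = VocabularyEncoder().fit(["hello world", "hello chariot"])

# Pickle and reload the fitted state, analogous to preprocessor.save/load.
blob = pickle.dumps(encoder)
loaded = pickle.loads(blob)
print(loaded.transform(["hello chariot world"]))  # [[0, 2, 1]]
```

The reloaded object keeps its learned vocabulary, which is exactly what makes a pickled preprocessor portable across scripts and machines.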

When you train a machine learning model, you have to adjust the length of each text (padding), convert labels to one-hot vectors, and so on. In chariot, these kinds of processes are defined as formatters, and you can stack them onto the preprocessor.

Format process after preprocessing

The format process is applied only in the training process.
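For intuition, padding and one-hot formatting look roughly like this. This is a plain-Python sketch of the operations themselves, not chariot’s formatter API:

```python
def pad(ids, length, pad_id=0):
    # Truncate or right-pad a sequence of token ids to a fixed length.
    return (ids + [pad_id] * length)[:length]

def one_hot(index, size):
    # Represent a class index as a one-hot vector.
    vec = [0] * size
    vec[index] = 1
    return vec

print(pad([4, 8, 15], length=5))  # [4, 8, 15, 0, 0]
print(one_hot(2, size=4))         # [0, 0, 1, 0]
```

Since padding and one-hot encoding are only needed to shape batches for the model, it makes sense that chariot applies them at training time rather than baking them into the saved preprocessor.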

You can see examples of chariot in the following Jupyter Notebooks.

Text classification example.

Language modeling example.

Let’s realize portable and reusable text preprocessing with chariot!
