Realize portable and reusable text preprocessing

piqcy · Published in chakki · 2 min read · Sep 28, 2019

Text preprocessing is a necessary step even when you use deep learning. Before you can feed text into a cool deep learning model, you have to tokenize it, build a vocabulary, convert tokens to ids, and ids to vectors… It’s just like the hassle of airport security.

Photo by Fabio Mascarenhas

chariot saves you.
You can implement text preprocessing by stacking premade functions, and the result becomes a portable, reusable module.

Text preprocessing implemented by chariot

In short, chariot lets you assemble preprocessing like a buffet.

Photo by Ramnath Bhat

In this article, I show the features of chariot and how to use it.

Text preprocessing by chariot

chariot has the following three features:

  1. Declarative implementation
  2. Portable save format
  3. Ready-to-train data formatting

As shown above, you can build a preprocessing pipeline by stacking functions declaratively.

Text preprocessing implemented by chariot
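The stacking idea can be sketched in plain Python. Note this is not chariot’s real API; the `Preprocessor` class and `stack` method here are a minimal illustrative assumption.

```python
# Minimal sketch of declarative, stackable preprocessing.
# NOT chariot's real API; names are illustrative assumptions.

class Preprocessor:
    def __init__(self):
        self.steps = []

    def stack(self, func):
        # Each step is a plain callable; return self to allow chaining.
        self.steps.append(func)
        return self

    def transform(self, text):
        # Apply the stacked steps in order.
        for step in self.steps:
            text = step(text)
        return text


p = (Preprocessor()
     .stack(str.lower)   # normalize case
     .stack(str.split)   # naive whitespace tokenizer
     .stack(lambda tokens: [t for t in tokens if t not in {"the", "a"}]))

print(p.transform("The chariot saves You"))  # ['chariot', 'saves', 'you']
```

Because every step is just a function, the whole pipeline reads top-to-bottom as a declaration of what happens to the text.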

Each preprocessing function is implemented as a scikit-learn Transformer. For that reason, you can mix any scikit-learn Transformer into your preprocessing, and save the whole thing as a pickle file (which makes it portable).

preprocessor.save("my_preprocessor.pkl")
loaded = Preprocessor.load("my_preprocessor.pkl")
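Because everything follows the scikit-learn fit/transform convention, a fitted preprocessor carries its learned state through pickling. A toy illustration using only the standard library’s pickle module; the `VocabularyEncoder` class here is an assumption for demonstration, not part of chariot:

```python
import pickle

class VocabularyEncoder:
    """Toy scikit-learn-style Transformer: tokens -> integer ids."""

    def fit(self, texts):
        # Assign an id to every token seen in the training texts.
        self.vocab_ = {}
        for text in texts:
            for token in text.split():
                self.vocab_.setdefault(token, len(self.vocab_))
        return self

    def transform(self, texts):
        # Unknown tokens map to -1.
        return [[self.vocab_.get(t, -1) for t in text.split()]
                for text in texts]


encoder = VocabularyEncoder().fit(["hello world", "hello chariot"])

# Pickle and reload the fitted state, analogous to preprocessor.save/load.
blob = pickle.dumps(encoder)
loaded = pickle.loads(blob)
print(loaded.transform(["hello chariot world"]))  # [[0, 2, 1]]
```

The reloaded object keeps its learned vocabulary, which is exactly what makes a pickled preprocessor portable across scripts and machines.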

When you train a machine learning model, you have to adjust the length of each text (padding), convert labels to one-hot vectors, and so on. In chariot, these kinds of processes are defined as formatters, and you can stack them onto the preprocessor.

Format process after preprocessing

The format process is applied only in the training process.
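For intuition, padding and one-hot formatting look roughly like this. This is a plain-Python sketch of the operations themselves, not chariot’s formatter API:

```python
def pad(ids, length, pad_id=0):
    # Truncate or right-pad a sequence of token ids to a fixed length.
    return (ids + [pad_id] * length)[:length]

def one_hot(index, size):
    # Represent a class index as a one-hot vector.
    vec = [0] * size
    vec[index] = 1
    return vec

print(pad([4, 8, 15], length=5))  # [4, 8, 15, 0, 0]
print(one_hot(2, size=4))         # [0, 0, 1, 0]
```

Since padding and one-hot encoding are only needed to shape batches for the model, it makes sense that chariot applies them at training time rather than baking them into the saved preprocessor.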

You can see examples of chariot in the following Jupyter Notebooks.

Text classification example.

Language modeling example.

Let’s realize portable and reusable text preprocessing with chariot!
