Realize portable and reusable text preprocessing
Text preprocessing is a necessary step even when you use deep learning. Before you can feed text to a cool deep-learning model, you have to tokenize it, build a vocabulary, convert tokens to ids, ids to vectors… It’s just like the hassle of airport security.
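To make those steps concrete, here is a minimal plain-Python sketch of the tokenize → vocabulary → ids pipeline. This is library-free and every name in it is illustrative, not chariot's API:

```python
# A minimal, library-free sketch of the usual preprocessing steps.
# All function names here are illustrative, not chariot's API.

def tokenize(text):
    # Naive whitespace tokenizer; real tokenizers also handle
    # punctuation, casing rules, subwords, and so on.
    return text.lower().split()

def build_vocab(texts):
    # Assign an integer id to every token; reserve 0 for padding.
    vocab = {"<pad>": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_ids(text, vocab):
    # Convert one text into a list of token ids.
    return [vocab[token] for token in tokenize(text)]

texts = ["Deep Learning needs preprocessing", "preprocessing is a hassle"]
vocab = build_vocab(texts)
ids = to_ids(texts[0], vocab)  # [1, 2, 3, 4]
```

Turning those ids into vectors (one-hot or embeddings) is then a lookup per id, which is exactly the kind of boilerplate chariot wants to take off your hands.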
chariot saves you.
You can implement text preprocessing by stacking premade functions, and the result is a portable, reusable module.
In short, chariot enables preprocessing like a buffet.
In this article, I show chariot’s features and how to use it.
Text preprocessing with chariot
chariot has the following three features:
- Declarative implementation
- Portable save format
- Ready-to-train data formatting
As shown above, you build a preprocessor by stacking functions declaratively.
Each preprocessing function is implemented as a scikit-learn Transformer. Because of this, you can use any scikit-learn Transformer in your preprocessing, and you can save the whole pipeline as a pickle file (= portable).
preprocessor.save("my_preprocessor.pkl")
loaded = Preprocessor.load("my_preprocessor.pkl")
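To see why building on the scikit-learn Transformer interface buys portability, here is a stdlib-only sketch of the pattern: each step exposes fit/transform, steps are stacked declaratively, and the fitted stack is a single picklable object. The class and method names below are illustrative, not chariot's actual API:

```python
import pickle

# Illustrative sketch of the scikit-learn Transformer pattern:
# every step has fit()/transform(), and the fitted stack can be
# pickled as one portable file.  Not chariot's real classes.

class Lowercase:
    def fit(self, texts):
        return self  # stateless step, nothing to learn

    def transform(self, texts):
        return [t.lower() for t in texts]

class Tokenize:
    def fit(self, texts):
        return self

    def transform(self, texts):
        return [t.split() for t in texts]

class Stack:
    def __init__(self):
        self.steps = []

    def stack(self, step):
        self.steps.append(step)
        return self  # returning self enables declarative chaining

    def fit(self, texts):
        for step in self.steps:
            texts = step.fit(texts).transform(texts)
        return self

    def transform(self, texts):
        for step in self.steps:
            texts = step.transform(texts)
        return texts

pipeline = Stack().stack(Lowercase()).stack(Tokenize())
pipeline.fit(["Portable Preprocessing"])

# The fitted pipeline is an ordinary object, so pickling it yields
# a portable artifact you can ship alongside your model.
loaded = pickle.loads(pickle.dumps(pipeline))
result = loaded.transform(["Portable Preprocessing"])
```

The point is that the pipeline carries its learned state (vocabulary, statistics, …) with it, so loading the pickle at inference time reproduces exactly the preprocessing used at training time.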
When you train a machine-learning model, you also have to adjust the length of each text (padding), build one-hot vectors, and so on. In chariot, these kinds of processes are defined as formatters,
and you can stack them onto the preprocessor.
The formatting step is applied only at training time.
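The two formatting operations named above are simple to state in plain Python. The sketch below uses illustrative names, not chariot's formatter API:

```python
# Illustrative padding and one-hot formatting, independent of chariot.

def pad(ids, length, pad_id=0):
    # Right-pad with pad_id, then truncate, so every sequence
    # has exactly `length` elements.
    return (ids + [pad_id] * length)[:length]

def one_hot(token_id, vocab_size):
    # A vocab_size-dimensional vector with a single 1 at token_id.
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

padded = pad([3, 1, 4], 5)   # [3, 1, 4, 0, 0]
vector = one_hot(2, 4)       # [0, 0, 1, 0]
```

Keeping formatting separate from preprocessing makes sense: padding length and one-hot width are training-time concerns, while tokenization and vocabulary belong to the reusable pipeline.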
You can try chariot’s examples in Jupyter notebooks:
- Text classification example
- Language modeling example
Let’s realize portable and reusable text preprocessing with chariot!