Multi-Class Text Classification with Kashgari in 15 minutes

Text classification is one of the essential, typical tasks in supervised machine learning (ML). Assigning categories to a text document, which can be a web page, library book, media article, gallery, etc., has many applications, such as spam filtering, email routing, and sentiment analysis. In this article, I will demonstrate how to do text classification using Kashgari, my recently open-sourced tool for text classification and sequence labeling.

We are going to use the US Consumer Finance Complaints dataset to train a multi-class classification model. When a new complaint comes in, we want to assign it to one of 12 categories. The classifier assumes that each new complaint belongs to one and only one category, which makes this a multi-class text classification problem.

Step 1. Prepare environment

First, we need to prepare a Python 3.6 environment with these packages.
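A quick way to confirm the environment is ready is to print the interpreter version and see which packages fail to import. This is only a minimal sketch: the names in `ASSUMED_PACKAGES` are assumptions about what this walkthrough needs, and `missing_packages` is a hypothetical helper, so adjust both to your setup.

```python
import importlib.util
import sys

# Assumption: import names this walkthrough is likely to need; adjust as required.
ASSUMED_PACKAGES = ["kashgari", "nltk", "pandas", "sklearn"]


def missing_packages(names):
    """Return the names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Missing packages:", missing_packages(ASSUMED_PACKAGES))
```

Anything reported as missing can then be installed with pip before moving on.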

Step 2. Prepare dataset

Download the dataset from data.gov and unzip it into the `data` directory. Let’s check out the dataset first.


We will use the Consumer_complaint_narrative column as input and the product column as output. For example,

When the user input is

Our model should assign this complaint to the mortgage class.
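To make the input/output pairing concrete, here is a minimal sketch that reads the CSV with the standard library and keeps only the rows that have a narrative. The column names follow the text above and are passed as parameters, because their exact spelling in the downloaded file is an assumption; `load_pairs` is a hypothetical helper, and the sample rows are placeholders, not records from the dataset.

```python
import csv
import io


def load_pairs(csv_file, text_col="Consumer_complaint_narrative", label_col="product"):
    """Return (texts, labels) for rows whose narrative is not empty."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_col) or "").strip()
        if text:  # skip rows without a narrative
            texts.append(text)
            labels.append(row[label_col])
    return texts, labels


# Placeholder rows for illustration only, not real records from the dataset.
sample = io.StringIO(
    "product,Consumer_complaint_narrative\n"
    "Mortgage,placeholder narrative text\n"
    "Debt collection,\n"
)
texts, labels = load_pairs(sample)
print(texts, labels)
```

For the real data, pass an open file handle for the CSV in the `data` directory instead of the in-memory sample.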

We are going to use NLTK's `nltk.tokenize` module to tokenize each input sentence, converting it to a list of words.
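As a rough illustration of this step, the sketch below wraps `nltk.word_tokenize` in a hypothetical `tokenize` helper. The regex fallback is not part of NLTK; it is an assumption added so the snippet still runs when NLTK or its tokenizer data is not installed.

```python
import re


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    try:
        import nltk  # assumption: NLTK and its tokenizer data are installed

        return nltk.word_tokenize(sentence)
    except (ImportError, LookupError):
        # Fallback (not NLTK): a simple regex split, used only when
        # NLTK or its tokenizer data is missing.
        return re.findall(r"\w+|[^\w\s]", sentence)


print(tokenize("I have a mortgage."))
```

Applying `tokenize` to every narrative gives the list-of-words input that the classifier is trained on.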

Split the data into training, validation, and test sets.
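One way to do the split is sketched below using only the standard library, as a stand-in for a utility such as scikit-learn's `train_test_split`. The 80/10/10 ratios, the fixed seed, and the `split_dataset` helper are assumptions for illustration, not values taken from this article.

```python
import random


def split_dataset(x, y, valid_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle paired samples and split them into train/validation/test."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_test = int(len(indices) * test_ratio)
    n_valid = int(len(indices) * valid_ratio)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]

    def pick(idx):
        # Keep each sample paired with its label.
        return [x[i] for i in idx], [y[i] for i in idx]

    return pick(train_idx), pick(valid_idx), pick(test_idx)
```

The function returns three `(x, y)` pairs, so it can be unpacked as `(train_x, train_y), (valid_x, valid_y), (test_x, test_y) = split_dataset(x, y)`.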

Step 3. Create and train model
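As a minimal sketch of this step, the hypothetical helper below builds a classifier and fits it on the tokenized data. It assumes Kashgari 1.x, where `BiLSTM_Model` is importable from `kashgari.tasks.classification`; other releases expose different model class names, so treat the import as an assumption and check it against your installed version.

```python
def train_classifier(train_x, train_y, valid_x, valid_y, epochs=5):
    """Build and fit a Kashgari classifier on tokenized text.

    Assumption: Kashgari 1.x, where BiLSTM_Model lives in
    kashgari.tasks.classification; other releases use different class names.
    """
    # Imported lazily so this helper can be defined without Kashgari installed.
    from kashgari.tasks.classification import BiLSTM_Model

    model = BiLSTM_Model()
    # x values are lists of tokens, y values are label strings.
    model.fit(train_x, train_y, valid_x, valid_y, epochs=epochs)
    return model
```

After training, `model.evaluate(test_x, test_y)` reports how well the classifier does on the held-out test set.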

Save and load model

Use pre-trained embedding models

If you want to use a pre-trained embedding to improve performance or enhance the model’s generalization capability, this can be done very quickly.

Use tensorboard for visualizing training

Because Kashgari is built directly on Keras, you can use the keras.callbacks.TensorBoard callback.
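A minimal sketch of creating that callback follows. It assumes Keras is available through TensorFlow (with standalone Keras, use `import keras` instead); `make_tensorboard_callback` and the `logs` directory name are hypothetical choices for illustration.

```python
def make_tensorboard_callback(log_dir="logs"):
    """Create a Keras TensorBoard callback that writes event files to log_dir."""
    # Assumption: Keras is available through TensorFlow; with standalone Keras,
    # use `import keras` instead. Imported lazily so the helper can be defined
    # without TensorFlow installed.
    from tensorflow import keras

    return keras.callbacks.TensorBoard(log_dir=log_dir)
```

How the callback reaches Keras depends on the Kashgari release; in releases whose `fit` accepts a `callbacks` argument, it can be passed as `callbacks=[make_tensorboard_callback()]`. Running `tensorboard --logdir logs` then shows the training curves.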
