Multi-Class Text Classification with Kashgari in 15 minutes

BrikerMan
3 min read · Feb 2, 2019


Text classification is one of the most common tasks in supervised machine learning (ML). Assigning a category to a text document, whether a web page, a library book, a news article, or a gallery description, powers applications such as spam filtering, email routing, and sentiment analysis. In this article, I will demonstrate how to do text classification with Kashgari, my recently open-sourced text-classification and sequence-labeling toolkit.

We are going to use the US Consumer Finance Complaints dataset to train a multi-class classification model. When a new complaint arrives, we want to assign it to one of 12 categories. The classifier assumes that each complaint belongs to one and only one category, which makes this a multi-class text classification problem.

Step 1. Prepare environment

First, prepare a Python 3.6 environment and install these packages:

pip install kashgari
pip install tensorflow
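
A note on versions: this article was written against the Kashgari release current in early 2019, which runs on TensorFlow 1.x. If a newer TensorFlow refuses to work with that release, pinning the 1.x line is worth a try (the exact bound here is my assumption, not an official requirement):

pip install 'tensorflow>=1.12,<2.0'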

Step 2. Prepare dataset

Download the dataset from data.gov and unzip it into the `data` directory. Let’s take a first look at the data.

import pandas as pd

# load the complaints CSV (unzipped into the `data` directory) and preview it
df = pd.read_csv('data/consumer_complaints.csv')
df.head()
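
Before going further, it is worth confirming the 12 product categories the classifier will choose from. A quick check with plain pandas, nothing Kashgari-specific:

# count complaints per product category; there should be 12 distinct labels
df['product'].value_counts()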

We will use the consumer_complaint_narrative column as the input and the product column as the output. For example:

Given a complaint like

I am very disappointed that the CFPB did not help to resolve this fraudulant loan. # XXXX between XXXX XXXX and XXXX, FF ( AKA ) One West. Case # XXXX with CFPB.

our model should assign this complaint to the Mortgage category.

We are going to use NLTK’s word tokenizer to tokenize the input sentences, converting each complaint into a list of words:

import nltk

# nltk.download('punkt')  # first run only: fetch the tokenizer models
# keep only rows that actually contain a narrative, then tokenize each one
df = df[df['consumer_complaint_narrative'].notna()]
df['input_x'] = [nltk.word_tokenize(sentence) for sentence in df['consumer_complaint_narrative'].values]
df['input_x'].values[1]
# ['I', 'am', 'very', 'disappointed', 'that', 'the', 'CFPB', 'did', 'not', 'help', 'to', 'resolve', 'this', 'fraudulant', 'loan', '.', '#', 'XXXX', 'between', 'XXXX', 'XXXX', 'and', 'XXXX', ',', 'FF', '(', 'AKA', ')', 'One', 'West', '.', 'Case', '#', 'XXXX', 'with', 'CFPB', '.']

Split the data into train, validation, and test sets.

import numpy as np

# randomly mark ~80% of the rows for training
msk = np.random.rand(len(df)) < 0.8
train_df = df[msk]
test_val_df = df[~msk]
msk = np.random.rand(len(test_val_df)) < 0.5
test_df = test_val_df[msk]
validate_df = test_val_df[~msk]
print(f'train data count: {len(train_df)}')
print(f'test data count: {len(test_df)}')
print(f'validate data count: {len(validate_df)}')
train_x, train_y = list(train_df['input_x']), list(train_df['product'])
test_x, test_y = list(test_df['input_x']), list(test_df['product'])
validate_x, validate_y = list(validate_df['input_x']), list(validate_df['product'])
# train data count: 53368
# test data count: 6618
# validate data count: 6820
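
The random mask above yields roughly an 80/10/10 split, but the exact counts change on every run. If you prefer reproducible, exact-ratio splits, scikit-learn offers an equivalent sketch (assuming scikit-learn is installed; the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

# hold out 20% of the rows, then split that half-and-half into test and validation
train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
test_df, validate_df = train_test_split(rest_df, test_size=0.5, random_state=42)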

Step 3. Create and train the model

# currently Kashgari provides CNNModel, BLSTMModel and CNNLSTMModel classifiers
>>> from kashgari.tasks.classification import BLSTMModel
>>> model = BLSTMModel()
>>> model.fit(train_x, train_y, validate_x, validate_y, epochs=5, batch_size=128)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, 636) 0
_________________________________________________________________
embedding_2 (Embedding) (None, 636, 100) 3019300
_________________________________________________________________
bidirectional_1 (Bidirection (None, 512) 731136
_________________________________________________________________
dense_1 (Dense) (None, 12) 6156
=================================================================
Total params: 3,756,592
Trainable params: 3,756,592
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
120/120 [==============================] - 149s 1s/step - loss: 1.9369 - acc: 0.2598 - val_loss: 1.9113 - val_acc: 0.2578
Epoch 2/5
120/120 [==============================] - 145s 1s/step - loss: 1.7397 - acc: 0.2878 - val_loss: 1.3229 - val_acc: 0.4795
Epoch 3/5
120/120 [==============================] - 146s 1s/step - loss: 1.0026 - acc: 0.6602 - val_loss: 0.7488 - val_acc: 0.7002
...
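
We carved out a test set earlier but have not used it yet. Once training finishes, score the model on that held-out data; the Kashgari release used here ships an evaluate method on classifiers (if your version differs, treat this call as an assumption and check the project README):

# evaluate on the held-out test set; prints precision/recall/F1 per category
model.evaluate(test_x, test_y)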

Save and load the model

model.save('./model')
new_model = BLSTMModel.load_model('./model')
complaint = """I am unable to obtain my experian credit report on line. I am also unable to access my free anual credit report online via the annualcreditreport.com website. I was told the size of the report is causing this and may be related to the number of soft inquiries generated by reviewing my own report. As I had previously been a victim of Identity Theft, I checked my report almost daily for new/unauthorized inquiries or accounts appearing."""
x = nltk.word_tokenize(complaint)
new_model.predict(x)
# Mortgage
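
For scoring many complaints at once, predict can also take a batch. In this release it appears to accept a list of tokenized sentences as well as a single one (an assumption worth verifying against your installed version); the two sample complaints below are made up for illustration:

# batch prediction: one predicted category per tokenized complaint
samples = ['I was charged an overdraft fee that was never refunded.',
           'My mortgage servicer lost my payment records.']
new_model.predict([nltk.word_tokenize(s) for s in samples])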

Use pre-trained embedding models

If you want to use a pre-trained embedding to improve performance or the model’s generalization, it only takes a couple of lines.

# load a word2vec embedding
from kashgari.embeddings import WordEmbeddings
embedding = WordEmbeddings('<embedding-file-path>', sequence_length=600)

# or load a BERT embedding
from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('bert-base-chinese', sequence_length=600)

# use the embedding to initialize the model
from kashgari.tasks.classification import CNNModel
model = CNNModel(embedding)

Use TensorBoard to visualize training

Kashgari is built directly on Keras, so you can pass a keras.callbacks.TensorBoard callback to fit.

import keras
from kashgari.tasks.classification import CNNModel

tf_board_callback = keras.callbacks.TensorBoard(log_dir='./logs', update_freq=1000)
model = CNNModel()
model.fit(train_x,
          train_y,
          validate_x,
          validate_y,
          batch_size=100,
          fit_kwargs={'callbacks': [tf_board_callback]})
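
With the callback writing to ./logs, start TensorBoard from a terminal and open the URL it prints:

tensorboard --logdir ./logs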
