Photo credit: Pixabay

Multi-Class Text Classification with LSTM

How to develop LSTM recurrent neural network models for text classification problems in Python using Keras deep learning library

Published in TDS Archive · 5 min read · Apr 10, 2019

Automatic text classification or document classification can be done in many different ways in machine learning as we have seen before.

This article aims to provide an example of how a Recurrent Neural Network (RNN) using the Long Short-Term Memory (LSTM) architecture can be implemented with Keras. We will use the same data source as we did in Multi-Class Text Classification with Scikit-Learn: the Consumer Complaints data set that originated from data.gov.

The Data

We will use a smaller version of the data set; you can also find the data on Kaggle. In this task, given a consumer complaint narrative, the model attempts to predict which product the complaint is about. This is a multi-class text classification problem. Let’s roll!

import pandas as pd

df = pd.read_csv('consumer_complaints_small.csv')
df.info()
Figure 1
df.Product.value_counts()
Figure 2

Label Consolidation

After a first glance at the labels, we realize there are a few things we can do to make our lives easier.

  • Consolidate “Credit reporting” into “Credit reporting, credit repair services, or other personal consumer reports”.
  • Consolidate “Credit card” into “Credit card or prepaid card”.
  • Consolidate “Payday loan” into “Payday loan, title loan, or personal loan”.
  • Consolidate “Virtual currency” into “Money transfer, virtual currency, or money service”.
  • “Other financial service” has very few complaints and is not informative, so I decided to remove it.
df.loc[df['Product'] == 'Credit reporting', 'Product'] = 'Credit reporting, credit repair services, or other personal consumer reports'
df.loc[df['Product'] == 'Credit card', 'Product'] = 'Credit card or prepaid card'
df.loc[df['Product'] == 'Payday loan', 'Product'] = 'Payday loan, title loan, or personal loan'
df.loc[df['Product'] == 'Virtual currency', 'Product'] = 'Money transfer, virtual currency, or money service'
df = df[df.Product != 'Other financial service']

After consolidation, we have 13 labels:

import cufflinks as cf
cf.go_offline()

df['Product'].value_counts().sort_values(ascending=False).iplot(kind='bar', yTitle='Number of Complaints',
                                                                title='Number of complaints in each product')
Figure 3

Text Pre-processing

Let’s have a look at how dirty the texts are:

def print_plot(index):
    example = df[df.index == index][['Consumer complaint narrative', 'Product']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Product:', example[1])

print_plot(10)
Figure 4
print_plot(100)
Figure 5

Pretty dirty, huh!

Our text preprocessing will include the following steps:

  • Convert all text to lower case.
  • Replace the symbols matched by REPLACE_BY_SPACE_RE with spaces.
  • Remove the symbols matched by BAD_SYMBOLS_RE from the text.
  • Remove “x” characters (the data set masks personal information as runs of “X”).
  • Remove stop words.
  • Remove digits in text.
text_preprocessing_LSTM.py
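The embedded gist is not rendered here; below is a minimal sketch of what text_preprocessing_LSTM.py does, following the steps above. The exact regular expressions and the NLTK stop-word list are assumptions based on the step descriptions:

import re
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop-word list is missing

# Assumed patterns; the original gist may differ slightly.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                        # convert to lower case
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace these symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)        # remove remaining bad symbols
    text = text.replace('x', '')               # drop the 'x' masking characters
    text = ' '.join(w for w in text.split() if w not in STOPWORDS)  # remove stop words
    return text

df['Consumer complaint narrative'] = df['Consumer complaint narrative'].apply(clean_text)
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].str.replace(r'\d+', '', regex=True)  # remove digits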

Now let’s go back and check the quality of our text pre-processing:

print_plot(10)
Figure 6
print_plot(100)
Figure 7

Nice! We are done with text pre-processing.

LSTM Modeling

  • Vectorize the consumer complaint texts by turning each text into either a sequence of integers or a vector.
  • Limit the data set to the top 50,000 words.
  • Set the max number of words in each complaint at 250.
from keras.preprocessing.text import Tokenizer

# The maximum number of words to be used (most frequent).
MAX_NB_WORDS = 50000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['Consumer complaint narrative'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
  • Truncate and pad the input sequences so that they are all the same length for modeling.
from keras.preprocessing.sequence import pad_sequences

X = tokenizer.texts_to_sequences(df['Consumer complaint narrative'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
  • Convert categorical labels to numbers.
Y = pd.get_dummies(df['Product']).values
print('Shape of label tensor:', Y.shape)
  • Train test split.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
  • The first layer is an Embedding layer that uses 100-dimensional vectors to represent each word.
  • SpatialDropout1D drops entire embedding channels at once, which acts like variational dropout for NLP models.
  • The next layer is an LSTM layer with 100 memory units.
  • The output layer must create 13 output values, one for each class.
  • The activation function of the output layer is softmax for multi-class classification.
  • Because it is a multi-class classification problem, categorical_crossentropy is used as the loss function.
consumer_complaint_lstm.py
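This gist is not rendered here either. Below is a minimal sketch of what consumer_complaint_lstm.py builds, matching the bullet points above; the dropout rates, epoch count, batch size, and early-stopping settings are assumptions:

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))                          # assumed dropout rate
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # 100 memory units
model.add(Dense(13, activation='softmax'))                # one output per class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Assumed training settings.
history = model.fit(X_train, Y_train, epochs=5, batch_size=64,
                    validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])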
Figure 8
accr = model.evaluate(X_test,Y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0],accr[1]))
import matplotlib.pyplot as plt

plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show();
Figure 9
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.legend()
plt.show();
Figure 10

The plots suggest that the model has a slight overfitting problem. More data may help, but more epochs will not help with the current data.

Test with a New Complaint

test_new_complaint.py
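The gist is not rendered here; below is a minimal sketch of scoring a new complaint. The sample complaint text is hypothetical, and the label order is recovered from the same pd.get_dummies call used to build Y:

import numpy as np

new_complaint = ['I am disputing the inaccurate information on my credit report.']  # hypothetical example
# In practice, apply the same clean_text pre-processing to the new text first.
seq = tokenizer.texts_to_sequences(new_complaint)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = pd.get_dummies(df['Product']).columns  # same column order used to build Y
print(pred, labels[np.argmax(pred)])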

The Jupyter notebook can be found on Github. Enjoy the rest of the week!

Written by Susan Li
Changing the world, one post at a time. Sr Data Scientist, Toronto Canada. https://www.linkedin.com/in/susanli/