Building a Text Classification model using BiLSTM

Pashupati Gupta · Published in Analytics Vidhya · Jun 1, 2020

Text classification is one of the fundamental tasks in NLP. Almost every NLP system uses text classification somewhere in its backend, for example, the intent classifier of a chatbot, named-entity recognition, auto-tagging, etc.

There are many approaches to this problem, from statistical machine learning models (logistic regression, Naive Bayes, SVM, etc.) to high-end deep learning models (CNN, RNN, Transformers, etc.). This blog covers the practical aspects (coding) of building a text classification model using a recurrent neural network (BiLSTM). We will use Python and Jupyter Notebook along with several libraries to build an offensive language/text classification model. The walkthrough has three parts.

  1. Data Preparation
  2. Model Building
  3. Training and Evaluation

Data Preparation

Selecting the right dataset is key to building a state-of-the-art model, but finding or creating one is a tedious task. Fortunately, we have a readily available dataset for our task released by Harvard University.

Download the Offensive Language Identification Dataset (OLID) from here. You can also use any other dataset for this task. Let’s look at the dataset using the Pandas library.

import pandas as pd
url = 'olid-training-v1.0.tsv'
df = pd.read_csv(url, sep="\t")
df.head()
originally downloaded dataset

This dataset is a collection of tweets labeled as offensive (OFF) or non-offensive (NOT) in the column ‘subtask_a’. We only need the ‘tweet’ column and the ‘subtask_a’ column, which we rename to ‘label’.

del df['subtask_b']
del df['subtask_c']
del df['id']
df.columns = ['tweet', 'label']
df.head()
modified dataset

The next step of data preparation is processing the tweets. We will use the NLTK, BeautifulSoup (bs4), contractions, and re (regex) libraries for this. Let’s import the libraries and write the required functions; each function’s purpose is commented inside the function itself. The first function is denoise_text, used to remove noise from the text.

# Importing required libraries
import nltk
import inflect
import contractions
from bs4 import BeautifulSoup
import re, string, unicodedata
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder

# First function is used to denoise text
def denoise_text(text):
    # Strip HTML if any. For ex. removing <html>, <p> tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    # Replace contractions in the text. For ex. didn't -> did not
    text = contractions.fix(text)
    return text

# Check the function
sample_text = "<p>he didn't say anything </br> about what's gonna <html> happen in the climax"
denoise_text(sample_text)
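Depending on your environment, the NLTK resources used by the functions below (the punkt tokenizer models for word_tokenize, the stopwords list, and the WordNet data for the lemmatizer) may need to be downloaded once. A minimal sketch:

# One-time downloads of the NLTK data used later (skip if already present)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')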

The next function is normalize_text, which involves several steps. A separate function is written for each step, and they are then combined into one function.

# Text normalization includes many steps.
# Each function below serves a step.
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)
    return new_words

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize_text(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    # words = stem_words(words)
    words = lemmatize_verbs(words)
    return words

# Testing the functions
print("remove_non_ascii results: ", remove_non_ascii(['h', 'ॐ', '©', '1']))
print("to_lowercase results: ", to_lowercase(['HELLO', 'hiDDen', 'wanT', 'GOING']))
print("remove_punctuation results: ", remove_punctuation(['hello!!', 'how?', 'done,']))
print("replace_numbers results: ", replace_numbers(['1', '2', '3']))
print("remove_stopwords results: ", remove_stopwords(['this', 'and', 'amazing']))
print("stem_words results: ", stem_words(['beautiful', 'flying', 'waited']))
print("lemmatize_verbs results: ", lemmatize_verbs(['hidden', 'walking', 'ran']))
print("normalize_text results: ", normalize_text(['hidden', 'in', 'the', 'CAVES', 'he', 'WAited', '2', 'ॐ', 'hours!!']))

We can see that each function works correctly. You may have noticed that each function takes a list of tokens (words) as an argument. So, we need a function to tokenize the text. We will use word_tokenize from NLTK for this.

# Tokenize tweet into words
def tokenize(text):
    return nltk.word_tokenize(text)

# Check the function
sample_text = 'he did not say anything about what is going to happen'
print("tokenize results :", tokenize(sample_text))

This also works fine. Now, we have all the functions required for processing the tweets. Let’s apply them to our dataset using a text_prepare function. We will also label-encode our target variable in this step.

def text_prepare(text):
    text = denoise_text(text)
    text = ' '.join([x for x in normalize_text(tokenize(text))])
    return text

df['tweet'] = [text_prepare(x) for x in df['tweet']]
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
df.head()
processed dataset

Yay! This looks great. We are done with the data preparation step. Note that I haven’t used the stem_words function while normalizing the text, since skipping it leads to better results in this particular case. Let’s move on to model building.
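To see why stemming was skipped, you can compare the two functions defined above on a few sample tokens: the Lancaster stemmer is aggressive and often produces truncated, non-dictionary forms, whereas verb lemmatization keeps readable words. A quick illustrative check, reusing the functions already defined:

# Compare aggressive stemming with verb lemmatization on the same tokens
sample_tokens = ['offensive', 'community', 'running', 'hatred']
print("stemmed    :", stem_words(sample_tokens))
print("lemmatized :", lemmatize_verbs(sample_tokens))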

Model Building

Model building is a crucial step, but deep learning frameworks have made it much easier. We will use the Keras library to build a recurrent neural network based on bidirectional LSTMs. Read about LSTMs here. These models take word embeddings as input, so we will use pre-trained GloVe embeddings to build the embedding dictionary. Download the GloVe embeddings from here.
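For context, the GloVe file used later (glove.6B.50d.txt) is a plain-text file in which each line holds a word followed by its 50-dimensional vector, separated by spaces. A minimal sketch of how a single line can be parsed (similar logic appears inside prepare_model_input below):

# Each line looks like: a word followed by 50 floats, e.g. "the 0.418 0.24968 ..."
import numpy as np

with open("glove.6B.50d.txt", encoding="utf8") as f:
    first_line = f.readline().split()
word, vector = first_line[0], np.asarray(first_line[1:], dtype='float32')
print(word, vector.shape)   # expect a 50-dimensional vector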

Before doing so, we first need to tokenize the whole text and turn it into sequences in which each word is assigned an integer. Also, all the sequences must be of the same length, so we need to pad them. Let’s write a function that takes X_train, X_test, MAX_NB_WORDS (maximum number of words in the vocabulary), and MAX_SEQUENCE_LENGTH (maximum length of the text sequences) as input and performs the above-mentioned steps as well as building the embedding dictionary. First, import all required libraries.

from keras.layers import Dropout, Dense, Embedding, LSTM, Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from sklearn.metrics import matthews_corrcoef, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.utils import shuffle
import numpy as np
import pickle
import matplotlib.pyplot as plt
import warnings
import logging
logging.basicConfig(level=logging.INFO)
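As a quick illustration of what the Keras Tokenizer and pad_sequences do (a minimal sketch with made-up sentences, separate from the actual pipeline below):

# Toy example: map words to integers, then pad all sequences to the same length
toy_sentences = ["the cat sat", "the dog barked at the cat"]
toy_tokenizer = Tokenizer(num_words=50)
toy_tokenizer.fit_on_texts(toy_sentences)
toy_sequences = toy_tokenizer.texts_to_sequences(toy_sentences)
print(toy_tokenizer.word_index)                # each word mapped to an integer, most frequent first
print(pad_sequences(toy_sequences, maxlen=6))  # zero-padded so every row has length 6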

Now, implement a prepare_model_input function as below. It returns X_train_Glove, X_test_Glove, word_index (the word-to-integer mapping), and embeddings_dict.

def prepare_model_input(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    # Fit the tokenizer on the whole text and turn it into integer sequences
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    # pickle.dump(tokenizer, open('text_tokenizer.pkl', 'wb'))
    # Uncomment the above line to save the tokenizer as a .pkl file
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train_Glove = text[0:len(X_train), ]
    X_test_Glove = text[len(X_train):, ]
    # Build the embedding dictionary from the pre-trained GloVe file
    embeddings_dict = {}
    f = open("glove.6B.50d.txt", encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # Skip malformed lines instead of reusing a stale vector
            continue
        embeddings_dict[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_dict))
    return (X_train_Glove, X_test_Glove, word_index, embeddings_dict)

# Check the function
x_train_sample = ["Lorem Ipsum is simply dummy text of the printing and typesetting industry", "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout"]
x_test_sample = ["I’m creating a macro and need some text for testing purposes", "I’m designing a document and don’t want to get bogged down in what the text actually says"]
X_train_Glove_s, X_test_Glove_s, word_index_s, embeddings_dict_s = prepare_model_input(x_train_sample, x_test_sample, 100, 20)
print("\n X_train_Glove_s \n ", X_train_Glove_s)
print("\n X_test_Glove_s \n ", X_test_Glove_s)
print("\n Word index of the word testing is : ", word_index_s["testing"])
print("\n Embedding for the word want \n \n", embeddings_dict_s["want"])

Everything is working fine. Now, let’s implement a build_bilstm helper function that returns the BiLSTM model. We will use the Embedding, Dense, Dropout, LSTM, and Bidirectional layers from keras.layers to build a sequential model.

def build_bilstm(word_index, embeddings_dict, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5, hidden_layer=3, lstm_node=32):
    # Initialize a sequential model
    model = Sequential()
    # Make the embedding matrix using the embeddings_dict
    # (words not found in the GloVe dictionary keep their random initialization)
    embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_dict.get(word)
        if embedding_vector is not None:
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("Could not broadcast input array from shape", str(len(embedding_matrix[i])),
                      "into shape", str(len(embedding_vector)),
                      ". Please make sure your EMBEDDING_DIM matches the dimension of the GloVe file.")
                exit(1)
            embedding_matrix[i] = embedding_vector

    # Add the embedding layer initialized with the GloVe weights
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embedding_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    # Add hidden layers
    for i in range(0, hidden_layer):
        # Add a bidirectional LSTM layer
        model.add(Bidirectional(LSTM(lstm_node, return_sequences=True, recurrent_dropout=0.2)))
        # Add a dropout layer after each LSTM layer
        model.add(Dropout(dropout))
    model.add(Bidirectional(LSTM(lstm_node, recurrent_dropout=0.2)))
    model.add(Dropout(dropout))
    # Add a fully connected layer with 256 neurons and relu activation
    model.add(Dense(256, activation='relu'))
    # Add the output layer with softmax activation since we have 2 classes
    model.add(Dense(nclasses, activation='softmax'))
    # Compile the model using sparse_categorical_crossentropy
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

We have implemented the helper function to build the model. Let’s build the actual model for our task.

X = df.tweet
y = df.label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print("Preparing model input ...")
X_train_Glove, X_test_Glove, word_index, embeddings_dict = prepare_model_input(X_train,X_test)
print("Done!")
print("Building Model!")
model = build_bilstm(word_index, embeddings_dict, 2)
model.summary()
summary of the model

Yay again! We are done with the model-building part. Let’s train and evaluate the model.

Training and Evaluation

Training is the part where one can experiment and find the best hyper-parameters for the model. First, let’s implement some utility functions for training and for checking the model’s performance.

def get_eval_report(labels, preds):
    mcc = matthews_corrcoef(labels, preds)
    tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
    precision = (tp)/(tp+fp)
    recall = (tp)/(tp+fn)
    f1 = (2*(precision*recall))/(precision+recall)
    return {
        "mcc": mcc,
        "true positive": tp,
        "true negative": tn,
        "false positive": fp,
        "false negative": fn,
        "precision": precision,
        "recall": recall,
        "F1": f1,
        "accuracy": (tp+tn)/(tp+tn+fp+fn)
    }

def compute_metrics(labels, preds):
    assert len(preds) == len(labels)
    return get_eval_report(labels, preds)

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string], '')
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()
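As a quick sanity check (an illustrative toy example with made-up labels, not part of the actual pipeline), the metrics helper can be run on a handful of values:

# Toy check: 5 labels vs. 5 predictions (one false negative)
print(compute_metrics(np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 1])))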

Now, we are ready to train the model. We will train our model for 5 epochs with a batch size of 128.

history = model.fit(X_train_Glove, y_train,
                    validation_data=(X_test_Glove, y_test),
                    epochs=5,
                    batch_size=128,
                    verbose=1)
model training logs

Well, the accuracy achieved by our model seems good, but let’s look at the graphs of loss and accuracy to see more clearly what’s going on.

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

In the accuracy vs. epochs graph, we observe that validation accuracy stays around 0.74 whereas training accuracy increases continuously. Likewise, in the loss vs. epochs graph, validation loss stays around 0.50 whereas training loss decreases continuously. This is a sign of slight overfitting. We can reduce the complexity of the model, increase the dropout probability, and use regularization to reduce overfitting; a minimal sketch of these tweaks is shown below.
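The following is an illustrative sketch of those counter-measures, not part of the original training run: it assumes the same Keras Sequential setup and uses fewer LSTM units, higher dropout, an L2 kernel regularizer on the dense layer, and an EarlyStopping callback.

from keras.callbacks import EarlyStopping
from keras.regularizers import l2

# Hypothetical, smaller and more regularized variant of the model above
regularized = Sequential()
regularized.add(Embedding(len(word_index) + 1, 50, input_length=500, trainable=True))
regularized.add(Bidirectional(LSTM(16, recurrent_dropout=0.2)))
regularized.add(Dropout(0.6))
regularized.add(Dense(128, activation='relu', kernel_regularizer=l2(0.01)))
regularized.add(Dense(2, activation='softmax'))
regularized.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Stop training when validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
# regularized.fit(X_train_Glove, y_train, validation_data=(X_test_Glove, y_test),
#                 epochs=10, batch_size=128, callbacks=[early_stop])

With that noted, let’s finally evaluate the model we just trained.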

print("\n Evaluating Model ... \n")
predicted = model.predict_classes(X_test_Glove)
print(metrics.classification_report(y_test, predicted))
print("\n")
logger = logging.getLogger("logger")
result = compute_metrics(y_test, predicted)
for key in (result.keys()):
logger.info(" %s = %s", key, str(result[key]))
evaluation results

We can see that, despite the slight overfitting, the model performs pretty well on the test data, with an accuracy of 75% and an F1-score of 0.62.

Finally, yay! We have successfully built and trained a BiLSTM model for text classification. Kudos! Find all the code in this GitHub repository.

Optional: You may consider saving the model and tokenizer as .pkl files for deployment purposes (covered in the next part of this blog).

# To save the tokenizer, follow the instructions in the prepare_model_input function,
# i.e. uncomment the line: pickle.dump(tokenizer, open('text_tokenizer.pkl', 'wb'))
# To save the model, run this line
pickle.dump(model, open('model.pkl', 'wb'))
# You are ready for deployment!
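Note that pickling a Keras model does not always work reliably across versions; as an alternative sketch (a suggestion, not part of the original post), Keras' own save/load API can be used instead:

# Alternative: save and restore with the native Keras API
model.save('model.h5')
# from keras.models import load_model
# model = load_model('model.h5')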

PS: Thanks for reading this far. Any correction, criticism, or compliment is most welcome.
