Multi-Class Text Classification | Practical Guide To Machine Learning

Amila Viraj · Published in Analytics Vidhya · May 11, 2020 · 9 min read

In this article, I will discuss how to perform a multi-class text classification task in a practical way with machine learning. We will also cover several Natural Language Processing (NLP) techniques for data pre-processing, so that the data we feed to our machine learning models is meaningful. You can follow this step-by-step guide to learn and understand every step needed to perform multi-class classification on text data.

Dataset

Here we will use the News Classification Dataset for this tutorial, as it is a manually labeled dataset. It contains articles categorized into 4 different classes (Business, SciTech, Sports, World). You can use any dataset from any domain that is relevant to your problem.

You can download the dataset as a JSON file from the site. For easier use, I prefer creating a CSV file that keeps only the necessary fields from the JSON file. You can also use the original JSON dataset file directly without creating any CSV file; it's up to you (a short sketch of the direct approach appears after the code below).

import pandas as pd
import json

# read the newline-delimited JSON file, one record per line
data = []
for line in open("News Classification DataSet.json", "r"):
    data.append(json.loads(line))

# keep only the article text and the first label of each record
content, label = [], []
for each in data:
    content.append(each['content'])
    label.append(each['annotation']['label'][0])

df = pd.DataFrame([content, label]).T
df.columns = ['content', 'label']
print(df.head())

'''create a csv dataset file using the json dataset file'''
df.to_csv('News_Dataset.csv', index=False)
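As a side note, you could also skip the CSV step and read the newline-delimited JSON file directly with pandas. A minimal sketch, assuming the same 'content' and 'annotation' fields used above:

import pandas as pd

# read the newline-delimited JSON file directly into a dataframe
raw_df = pd.read_json("News Classification DataSet.json", lines=True)

# keep the article text and the first label of each record
df = pd.DataFrame({
    'content': raw_df['content'],
    'label': raw_df['annotation'].apply(lambda a: a['label'][0]),
})
print(df.head())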

From this moment onwards, I will use this generated csv dataset file.

You can use Jupyter notebook or any other preferred IDE to execute and test the code along with the tutorial.

Exploring the Data

First, we load all the libraries needed to implement the entire solution.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

Let’s load and visualize the dataset.

# loading data
df = pd.read_csv("News_Dataset.csv")
print(df.shape)
df.head()

There are a total of 7,600 news records in the dataset. Let's check whether there are any null values.

# Percentage of news with text
total = df['content'].notnull().sum()
round((total/len(df)*100),1)

The result is 100, which means our dataset has no null values in the content column.

Let's find the unique labels in the dataset.

pd.DataFrame(df.label.unique())

You can see there are 4 different classes: Business, SciTech, Sports & World.

# Bar chart showing the number of news articles per category
fig = plt.figure(figsize=(8, 6))
df.groupby('label').content.count().sort_values().plot.barh(
    ylim=0, title='NUMBER OF NEWS DATA IN EACH CATEGORY\n')
plt.xlabel('Number Of Occurrences', fontsize=10)
plt.savefig('category_graph.png', bbox_inches='tight')

It can be observed that each category contains the same amount of news data, so the dataset is balanced.
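If you want the exact counts behind the chart, a quick check with pandas:

# number of news articles in each category
print(df['label'].value_counts())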

Text Pre-Processing

This is the most important part of the process, and you need a clear understanding of what you are doing in this section. Before starting text pre-processing, it is very important to understand the content of your dataset. Without good insight into your dataset, you might end up with garbage data even after cleaning it.

Let's take a look at the content of a few records in our dataset.

df.content[0]
df.content[1]

You can see that there are several unwanted and irrelevant characters, symbols, and digits in the data, and we need to clean the dataset by removing them.

# clean text
import re
import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stopwords corpus is not yet available

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string

    return: modified string
    """
    text = str(text).lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(" \d+", " ", text)  # remove digits
    text = re.sub(" #\d+", " ", text)  # remove digits starting with the # symbol
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # delete stopwords from text
    return text

df['content'] = df['content'].apply(clean_text)
df.head(10)
df.content[1]

We used some standard text cleaning steps such as removing extra white space and numbers, removing stop words, stripping symbols and punctuation, and converting text to lowercase. The cleaning strategy can differ from dataset to dataset and is specific to the content of your dataset. For example, this dataset contains numbers starting with the # symbol, so we used a regex to remove them, and some records carry specific URLs and HTML remnants that should be removed as well. Likewise, depending on the nature of your dataset, you should understand its content and perform the specific text processing / cleaning steps it needs; a small sketch of the extra URL/HTML handling follows below.
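Since the clean_text function above does not handle URLs or HTML remnants explicitly, here is a minimal sketch of how such rules could be added. The regex patterns and the strip_urls_and_html helper are illustrative assumptions, not part of the original pipeline:

import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')  # assumed pattern for bare URLs
HTML_TAG_RE = re.compile(r'<[^>]+>')           # assumed pattern for leftover HTML tags

def strip_urls_and_html(text):
    """Remove URLs and HTML tags before the main clean_text() step."""
    text = URL_RE.sub(' ', text)
    text = HTML_TAG_RE.sub(' ', text)
    return text

# example usage: run this before the existing clean_text
# df['content'] = df['content'].apply(strip_urls_and_html).apply(clean_text)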

Now we need to represent each class as a number, so that our predictive model can better understand the different categories.

# Create a new column 'label_id' with encoded categories 
df['label_id'] = df['label'].factorize()[0]
new_df = df[['label', 'label_id']].drop_duplicates()
# Dictionaries for future use
category_to_id = dict(new_df.values)
id_to_category = dict(new_df[['label_id', 'label']].values)
# New dataframe
df.head()

Feature Extraction

Here we create text features to train and build our classification model.

The TfidfVectorizer class can be initialized with the following parameters:

  • min_df: ignore terms that occur in fewer than 'min_df' documents.
  • max_df: ignore terms that occur in more than 'max_df' documents of the corpus.
  • sublinear_tf: set to True to scale the term frequency on a logarithmic scale.
  • stop_words: remove the predefined stop words for 'english'.
  • use_idf: enable inverse-document-frequency reweighting of the term frequencies.
  • ngram_range: (1, 2) indicates that both unigrams and bigrams will be considered.

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')

# We transform each news article into a TF-IDF vector
features = tfidf.fit_transform(df.content).toarray()
labels = df.label_id

print("Each of the %d news data is represented by %d features (TF-IDF score of unigrams and bigrams)" % (features.shape))

Next, we will find the three most correlated terms with each news category using the chi-squared test.

N = 3
for Category, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("\n==> ", Category, ":")
    print("  * Most Correlated Unigrams are: %s" % (', '.join(unigrams[-N:])))
    print("  * Most Correlated Bigrams are: %s" % (', '.join(bigrams[-N:])))

Split the Dataset into Training & Testing Sets

For the data splitting task, we use train_test_split function from sklearn.

X = df['content']  # collection of news data
y = df['label']    # labels (i.e., the 4 different news categories)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=0)

print("shape of x_train set :", X_train.shape)
print("shape of y_train set :", y_train.shape)
print("shape of x_test set :", X_test.shape)
print("shape of y_test set :", y_test.shape)

As you can see, we will use 5,700 samples for training and 1,900 samples for testing.

Model Selection

Now we have data to train on, and it's time to build the model. Which machine learning algorithm should we use? As a starting point, we can try several standard machine learning classifiers and, based on their performance on the dataset, select the best model for the solution.

We select the following 4 classifiers for the experiment:

  • Random Forest Classifier
  • Linear Support Vector Machine
  • Multinomial Naive Bayes
  • Logistic Regression
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]

# 5-fold cross-validation
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

We used 5-fold cross-validation to measure the performance of each model.

mean_accuracy = cv_df.groupby('model_name').accuracy.mean()
std_accuracy = cv_df.groupby('model_name').accuracy.std()

acc = pd.concat([mean_accuracy, std_accuracy], axis=1,
                ignore_index=True)
acc.columns = ['Mean Accuracy', 'Standard deviation']
acc
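To make this comparison easier to see, you can also plot the per-fold accuracies for each model. A minimal sketch using seaborn (which we import again later for the confusion matrix):

import seaborn as sns

# boxplot of the per-fold cross-validation accuracy for each model
plt.figure(figsize=(8, 5))
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
plt.xticks(rotation=30)
plt.title('CROSS-VALIDATION ACCURACY PER MODEL\n')
plt.show()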

Based on these results, we will select Logistic Regression for the final solution.

Let’s see how to build a Logistic Regression model and evaluate it using standard evaluation techniques.

Build Model & Evaluate

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.25, random_state=1)

model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification report
print('\t\t\tCLASSIFICATION REPORT\n')
print(metrics.classification_report(y_test, y_pred,
                                    target_names=df['label'].unique()))

According to the generated classification report, the model reaches good values for the precision, recall, and f1-score metrics.

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

It is also clear that the test accuracy is around 86.57%.

Next we will get the confusion matrix.

import seaborn as sns

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt='d',
            xticklabels=new_df.label.values,
            yticklabels=new_df.label.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("CONFUSION MATRIX - Logistic Regression\n", size=14)

Finally we can check the misclassified scenarios.

for predicted in new_df.label_id:
    for actual in new_df.label_id:
        if predicted != actual and conf_mat[actual, predicted] >= 20:
            print("actual class '{}' predicted as '{}' : {} examples.".format(id_to_category[actual],
                                                                              id_to_category[predicted],
                                                                              conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['label', 'content']])
            print('')

I attached only one output scenario for the misclassification issue above, but you should see all the scenarios once you run the step yourself.

Save Model

If you want to save the trained model for later use, you can easily save it and load it back from the saved path. Remember that you also need to save the fitted vectorizer and the category_to_id and id_to_category dictionaries along with the model.

import pickle

# save the model
model_filename = open('finalized_model.sav', 'wb')
pickle.dump(model, model_filename)
model_filename.close()

# save the fitted vectorizer to use at prediction time
tf_idf_output = open('fitted_vectorizer.pickle', 'wb')
pickle.dump(tfidf, tf_idf_output)
tf_idf_output.close()

# save the dictionaries
output1 = open('category_to_id.pkl', 'wb')
pickle.dump(category_to_id, output1)
output1.close()

output2 = open('id_to_category.pkl', 'wb')
pickle.dump(id_to_category, output2)
output2.close()
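Loading everything back for prediction is just the reverse of the steps above. A minimal sketch that classifies a new piece of text, assuming clean_text from the pre-processing step is still available and using a made-up sample headline:

import pickle

# load the saved model, fitted vectorizer, and label dictionary
with open('finalized_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)
with open('fitted_vectorizer.pickle', 'rb') as f:
    loaded_vectorizer = pickle.load(f)
with open('id_to_category.pkl', 'rb') as f:
    id_to_category = pickle.load(f)

# clean the new text the same way as the training data, then predict
sample = clean_text("Stocks rally as tech giants report record quarterly earnings")
sample_features = loaded_vectorizer.transform([sample]).toarray()
predicted_id = loaded_model.predict(sample_features)[0]
print("Predicted category:", id_to_category[predicted_id])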

Summary

We successfully built a solution for multi-class text classification with machine learning. You can try this with different datasets and with more advanced text processing. Trying different machine learning models/classifiers will also give you a better understanding and more experience. Here we used TF-IDF vectors as word features, but you could use word embeddings as features instead.

If you are interested in deep learning, you can also try this kind of task using deep learning frameworks such as TensorFlow, Keras, or PyTorch.

I hope you enjoyed the article; stay tuned for the next one. I'd also be happy to hear your feedback.

Amila Viraj
Software & AI/ML Engineer | Deep Learning Enthusiast