Dealing with an Imbalanced Dataset for Multi-Class Text Classification with Multiple Categorical Features

Satish Korapati
5 min read · Jun 19, 2020


SGDClassifier, Natural Language Processing, SMOTE, RandomOverSampler, ColumnTransformer

We often come across text classification use cases such as social media monitoring, customer service, and voice of the customer, where only one text feature is treated as the independent variable. Other features are left out of the classification even though they carry useful signal for prediction. In this article, I will show how to deal with an imbalanced dataset that has several categorical features alongside a text description and a multi-class target.

So, let's get started!

Below are the topics which I will be discussing in detail:

1. Text Cleaning (Stopwords, Stemming, Removing bad characters)

2. Feature Engineering (OneHotEncoder, TFIDFVectorizer) using ColumnTransformer

3. RandomOverSampler

4. SMOTE

A few definitions that may help you follow the rest of the article:

Multi-Class Classification: In machine learning, multi-class or multinomial classification is the problem of classifying instances into one of three or more classes.

Categorical Feature: A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

Imbalanced Datasets: An imbalanced classification problem is an example of a classification problem where the distribution of examples across the known classes is biased or skewed. The distribution can vary from a slight bias to a severe imbalance where there is one example in the minority class for hundreds, thousands, or millions of examples in the majority class or classes.

Dataset

Load the data from the CSV file into a Pandas DataFrame via pd.read_csv(). The dataset consists of 7 categorical features, including one text description feature.

import os
import pandas as pd
import matplotlib.pyplot as plt
# Build the path to the CSV file and load it into a DataFrame
path = os.path.join(os.getcwd(), 'Desktop', 'Machine Learning')
file_loc = os.path.join(path, 'sample_data.csv')
data = pd.read_csv(file_loc)
# Feature columns to clean later (excluding the target column)
categ_feature_list = [col for col in data.columns if col != 'Target']
# Plot the class distribution of the target variable
target = data.groupby('Target')['Target'].count()
target.plot.bar()
plt.show()

From the bar plot above, we can see that the target variable has 15 classes and that the dataset is imbalanced.

Let’s begin with the process of developing a text classification model.

1. Text Cleaning:

The dataset used here has multiple categorical features that contain special characters, stopwords, and words with similar meanings. The Python code below removes stop words and bad characters, strips punctuation, converts the text to lower case, and stems the words.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Download the stop-word list if it is not already available
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def remove_SW_Stem(text):
    # Drop stop words and stem the remaining tokens
    text = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    return ' '.join(text)

special_chars = re.compile('[^0-9a-z #+_]')
add_space = re.compile(r'[/(){}\[\]@;]')

def clean_text(text):
    # Lower-case, replace punctuation and bad characters with spaces,
    # then remove stop words and stem
    text = text.lower()
    text = add_space.sub(' ', text)
    text = special_chars.sub(' ', text)
    text = remove_SW_Stem(text)
    return text

for feature in categ_feature_list:
    data[feature] = data[feature].apply(clean_text)
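As a quick sanity check, running the cleaning function on a made-up string shows what the pipeline produces (the exact output depends on NLTK's stop-word list and the Porter stemmer):

# Hypothetical example string, just to verify the cleaning function
print(clean_text("The server is NOT responding; restarted it twice!!"))
# prints something like: server respond restart twice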

2. Feature Engineering:

TfidfVectorizer (Term Frequency-Inverse Document Frequency) converts a collection of raw documents into a matrix of TF-IDF features. It tokenizes the documents, learns the vocabulary and inverse document frequency weights, and lets you encode new documents. OneHotEncoder encodes categorical features as a one-hot numeric array.

The data is in text format and needs to be converted to a numeric format, so I apply OneHotEncoder to all the categorical features and TfidfVectorizer to the text feature. OneHotEncoder produces a one-hot numeric array, and TfidfVectorizer produces a matrix of TF-IDF features.

The outputs of TfidfVectorizer and OneHotEncoder need to be combined before we can train the model. To combine the two different results (the TF-IDF feature matrix and the one-hot numeric array), I use ColumnTransformer from the sklearn library, which applies each transformer to its columns and returns the combined output as a single csr_matrix.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
# One-hot encode the categorical columns and TF-IDF encode the text column
before_vect = data[['cate_1', 'cate_2', 'cate_3', 'cate_4', 'cate_5', 'Description']]
columnTransformer = ColumnTransformer(
    [('E', OneHotEncoder(dtype='int'), ['cate_1', 'cate_2', 'cate_3', 'cate_4', 'cate_5']),
     ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000), 'Description')],
    remainder='drop')
vector_transformer = columnTransformer.fit(before_vect)
vectorized_df = vector_transformer.transform(before_vect)
# Target labels as a 1-D Series (matching the 'Target' column used above)
y = data['Target']
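Before resampling, it helps to confirm that the combined output really is a single sparse matrix with one row per record (a quick check; the second dimension is the number of one-hot columns plus the TF-IDF vocabulary size):

# vectorized_df is a single SciPy sparse matrix combining both transformations
print(type(vectorized_df))
print(vectorized_df.shape)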

3. RandomOverSampler:

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.

Here I use the RandomOverSampler algorithm to balance the dataset, with random_state=777. After balancing the data, I split it with test_size=0.3 (30% of the data is used for testing and 70% for training). I then train an SGDClassifier, which works well for text classification and scales to large datasets.

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Randomly duplicate minority-class examples until all classes are balanced
ros = RandomOverSampler(random_state=777)
X_ROS, y_ROS = ros.fit_resample(vectorized_df, y)
# 70/30 train-test split, then train a linear classifier with SGD
X_train, X_test, y_train, y_test = train_test_split(X_ROS, y_ROS, test_size=0.3, random_state=42)
sgd = SGDClassifier(max_iter=1000, tol=1e-3)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)
print("Accuracy %s" % accuracy_score(y_test, pred_sgd))
print(classification_report(y_test, pred_sgd))
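To verify that the oversampling actually balanced the classes, a quick check is to compare the class counts before and after resampling:

from collections import Counter
# Every class should now have as many examples as the original majority class
print("Before oversampling:", Counter(y))
print("After oversampling:", Counter(y_ROS))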

4. SMOTE:

Synthetic Minority Over-sampling Technique (SMOTE) is one of the most commonly used oversampling methods for imbalanced problems. Instead of duplicating minority examples, it balances the class distribution by generating synthetic minority samples: each new sample is interpolated between an existing minority example and one of its k nearest minority-class neighbours.
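As a rough illustration of the idea (not imblearn's internal code), a synthetic sample lies on the line segment between a minority example and one of its nearest minority-class neighbours; the values below are made up:

import numpy as np
x_i = np.array([1.0, 3.0])          # a minority-class sample (hypothetical values)
x_neighbor = np.array([2.0, 5.0])   # one of its k nearest minority-class neighbours
lam = np.random.uniform(0, 1)       # random interpolation factor in [0, 1]
x_synthetic = x_i + lam * (x_neighbor - x_i)
print(x_synthetic)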

For the same dataset, I now use the SMOTE algorithm to balance the classes with k_neighbors=5. After balancing the data, I again split it with test_size=0.3 (30% of the data for testing, 70% for training) and train an SGDClassifier.

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Generate synthetic minority-class samples until all classes are balanced
smote = SMOTE(random_state=777, k_neighbors=5)
X_smote, y_smote = smote.fit_resample(vectorized_df, y)
# 70/30 train-test split, then train the same linear classifier
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, random_state=42)
sgd = SGDClassifier(max_iter=1000, tol=1e-3)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)
print("Accuracy %s" % accuracy_score(y_test, pred_sgd))
print(classification_report(y_test, pred_sgd))

I hope this article gave you a better understanding of how to use SMOTE and RandomOverSampler when you have an imbalanced dataset with a text variable and a few categorical variables. You can also use other resampling algorithms such as ADASYN to balance the dataset (as sketched below), and other models such as Logistic Regression, MultinomialNB, RandomForestClassifier, XGBoost, or Word2vec-based features for prediction.
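For example, swapping in ADASYN only changes the resampling step; the train-test split and SGDClassifier training stay exactly the same (a minimal sketch reusing the vectorized_df and y from above):

from imblearn.over_sampling import ADASYN
# ADASYN also synthesises minority samples from nearest neighbours, but generates
# more of them for minority examples that are harder to learn
adasyn = ADASYN(random_state=777, n_neighbors=5)
X_ada, y_ada = adasyn.fit_resample(vectorized_df, y)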


Satish Korapati

Researcher, Data Scientist (Machine Learning, NLP, Deep Learning)