Text Classification — From Bag-of-Words to BERT — Part 1 (BagOfWords)

Anirban Sen
Published in Analytics Vidhya
9 min read · Dec 29, 2020

What is Text Classification?

Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. Using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of pre-defined tags or categories based on its content.

A few applications:

  1. Platforms such as e-commerce sites, news agencies, content curators, blogs, directories, and the like can use automated technologies to classify and tag content and products.
  2. A faster emergency response system can be made by classifying panic conversations on social media.
  3. Marketers can monitor and classify users based on how they talk about a product or brand online.

And many more.

Problem Statement: For practice, we will use the Kaggle competition “Toxic Comment Classification Challenge” by Jigsaw (a subsidiary of Alphabet). In this competition, we’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. The dataset contains comments from Wikipedia’s talk page edits. (So, along with text classification, we will also learn how to implement multi-output/multi-label classification.)

Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive. I do not endorse such language; it appears here only for experimental purposes.

Evaluation: Submissions are evaluated on the mean column-wise ROC AUC. In other words, the score is the average of the individual AUCs of each predicted column.

Models: The models are listed in order of increasing complexity.

  1. Bag Of Words
  2. Word2Vec Embeddings
  3. fastText Embeddings
  4. Convolutional Neural Networks (CNN)
  5. Long Short-Term Memory (LSTM)
  6. Bidirectional Encoder Representations from Transformers (BERT)

So, let’s start 😁

1. Bag Of Words

The bag-of-words model is one of the most commonly used methods for text classification: the (frequency of) occurrence of each word is used as a feature for training a classifier.

Intuition:

Bag of words intuition

Of course, there is more to it, such as preprocessing, but the diagram gives a basic understanding. Let’s dive into the implementation.
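
The counting step in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the sklearn implementation used below; the two example sentences and the helper name bag_of_words are made up for this sketch.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    # Tokenize naively: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct word, in sorted order
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # One row per document, one column per vocabulary word
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

# Hypothetical example sentences
docs = ["you are great", "you are rude rude"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column one word of the vocabulary, so a word that does not occur in a document simply gets a 0 in that row.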

Implementation:

We will use CountVectorizer (the sklearn implementation of Bag-of-Words) to convert the texts into a numerical dataset that can be mapped against the output variables toxic, severe_toxic, obscene, threat, insult, and identity_hate. Any model can then be used to learn how the output variable (the toxicity type, in this case) depends on the occurrence of words. For now, we will train Naive Bayes and Logistic Regression on top of the dataset created by CountVectorizer, and choose the one that gives the best results on the validation dataset to predict on the test dataset. We will use the MultiOutputClassifier wrapper from sklearn to create models for all 6 output variables. For people interested in the full code, you can find it here.

1. Reading the dataset

Training Dataset

We have around 160k training texts and 153k test texts
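
The later steps use the variables X_train, X_val, y_train and y_val. A minimal sketch of how they might be produced is given here, under several assumptions: the file name train.csv, the text column name comment_text, the helper name load_and_split and the random_state value are all assumptions, and the 25% validation share is inferred from the shapes printed in step 3 rather than stated in the article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def load_and_split(csv_source, test_size=0.25, random_state=42):
    """Read the training CSV and split it into train/validation parts."""
    df = pd.read_csv(csv_source)
    X = df["comment_text"]  # raw comment text (assumed column name)
    y = df[LABELS]          # six binary target columns
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Hypothetical usage; "train.csv" is an assumed local path to the Kaggle file:
# X_train, X_val, y_train, y_val = load_and_split("train.csv")
```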

2. Basic Preprocessing

Preprocessing is a vital step in NLP, as in any other ML task. It helps get rid of unhelpful parts of the data, or noise, by converting all characters to lowercase, removing punctuation marks, and removing stop words and typos. In this case, punctuation and numbers are removed, along with stopwords like in, the, and of, since these words don't help in determining the classes (whether a sentence is toxic or not).

import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stop_words = ENGLISH_STOP_WORDS

# Function for basic cleaning/preprocessing of texts
def clean(doc):
    # Lowercase first so that capitalized stopwords ("The", "Of") are matched as well
    doc = doc.lower()
    # Removal of punctuation marks (.,/\][{} etc.) and numbers
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()])
    # Removal of stopwords
    doc = " ".join([token for token in doc.split() if token not in stop_words])
    return doc

3. Creating a Bag of Words vector

We create a bag-of-words model limited to the 5000 most frequent words (including all the words would make the dataset sparse and would only add noise). The texts are also cleaned while the bag of words is built, by passing clean as the preprocessor.

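from sklearn.feature_extraction.text import CountVectorizer

# Keep the 5000 most frequent words; clean() is applied to every text first
vect = CountVectorizer(max_features=5000, preprocessor=clean)
X_train_dtm = vect.fit_transform(X_train)
X_val_dtm = vect.transform(X_val)

print(X_train_dtm.shape, X_val_dtm.shape)
# (119678, 5000) (39893, 5000)

Bag of Words Vector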

Above, we can see the bag of words; e.g., abide occurs 0 times in the 1st sentence. The matrix is very sparse (we can reduce max_features further if required). This will be the input for a machine learning classifier.

4. Creating the Multi-Output Classifier

Since we need to classify each sentence as toxic or not, severe_toxic or not, obscene or not, threat or not, insult or not, and identity_hate or not, we need to classify the sentence against 6 output variables. (This is called multi-label classification, which differs from multi-class classification, where a single target variable has more than 2 mutually exclusive options, e.g. a sentence can be positive, negative, or neutral.)

For this, we will use MultiOutputClassifier from sklearn, which, as mentioned earlier, is a wrapper. This strategy consists of fitting one classifier per target. N.B.: I also tried a Support Vector Classifier, but it took a long time to train without giving the best results.
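
To make the "one classifier per target" idea concrete, here is a small sketch on made-up toy data (the arrays and variable names are illustrative and not taken from the notebook). It compares the wrapper with the same base model fitted by hand on each label column, and also shows the shape of what predict_proba returns.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 6 documents, 3 vocabulary words (made-up numbers)
X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 2, 2], [3, 0, 0], [0, 0, 4]])
# Two binary labels per document (multi-label targets)
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

wrapper = MultiOutputClassifier(MultinomialNB()).fit(X, Y)

# Fitting the same base model on each label column by hand
manual = [MultinomialNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# The wrapper holds one fitted estimator per label column
n_estimators = len(wrapper.estimators_)
# Its predictions match the hand-made per-column models
stacked = np.column_stack([m.predict(X) for m in manual])
same_predictions = np.array_equal(wrapper.predict(X), stacked)

# predict_proba returns a list with one (n_samples, n_classes) array per label
probas = wrapper.predict_proba(X)
print(n_estimators, same_predictions, len(probas), probas[0].shape)
```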

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Initializing and fitting models on the training data
# Naive Bayes model
nb = MultiOutputClassifier(MultinomialNB()).fit(X_train_dtm, y_train)
# Logistic Regression model. As the dataset is unbalanced, class_weight='balanced' weights each
# class inversely to its frequency, so mistakes on samples of class[i] are penalized with
# class_weight[i] instead of 1
lr = MultiOutputClassifier(LogisticRegression(class_weight='balanced', max_iter=3000)) \
    .fit(X_train_dtm, y_train)
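
The effect of class_weight='balanced' can be illustrated with the formula given in the scikit-learn documentation, n_samples / (n_classes * count of the class). The sketch below re-implements that formula in plain Python for a made-up label column; the helper name balanced_class_weights is hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights as documented for class_weight='balanced':
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up label column: 9 non-toxic comments, 1 toxic comment
y = [0] * 9 + [1]
weights = balanced_class_weights(y)
print(weights)
```

The rare class receives a much larger weight than the common one, so a mistake on a toxic sample costs the model more than a mistake on a non-toxic one.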

5. Measuring performance on the validation dataset

Since the competition uses ROC-AUC as the evaluation metric, we will use the same in the notebook and compare the mean ROC-AUC of the two models we have trained. We will use the models' predict_proba function instead of predict, because roc_auc_score needs probability scores rather than class labels obtained by thresholding at 0.5.

from sklearn.metrics import roc_auc_score

# Function for calculating the ROC-AUC per target variable, given the actual binary values
# and the probability scores predicted by the model
def calculate_roc_auc(y_test, y_pred):
    aucs = []
    # Calculate the ROC-AUC for each of the target columns
    for col in range(y_test.shape[1]):
        aucs.append(roc_auc_score(y_test[:, col], y_pred[:, col]))
    return aucs
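
As a cross-check on the column-wise approach, roc_auc_score also accepts 2-D label and score arrays, and with average="macro" it should return the same unweighted mean over columns. The small arrays below are made up for the comparison.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth for 2 labels and matching probability scores
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.4, 0.8]])

# Column-by-column AUC, then the plain mean over columns
per_column = [roc_auc_score(y_true[:, c], y_score[:, c]) for c in range(y_true.shape[1])]
manual_mean = float(np.mean(per_column))

# Same quantity in one call: macro averaging over the label columns
macro = roc_auc_score(y_true, y_score, average="macro")
print(per_column, manual_mean, macro)
```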

Given the performance metric, let’s run the models on the validation dataset.

import numpy as np

# Creating an empty list of results
results = []
# Making predictions from all the trained models and measuring performance for each
for model in [nb, lr]:
    # Extracting the name of the model
    est = type(model.estimator).__name__
    # Actual output variables
    y_vals = y_val.to_numpy()
    # Model probabilities for class 1 of each target variable
    # (predict_proba returns one (n_samples, 2) array per target)
    y_preds = np.transpose(np.array(model.predict_proba(X_val_dtm))[:, :, 1])
    # Calculate the mean of the ROC-AUC
    mean_auc = np.mean(calculate_roc_auc(y_vals, y_preds))
    # Append the name of the model and the mean ROC-AUC to the results list
    results.append([est, mean_auc])

Validation Results

As we can see, both models perform really well, with LR performing slightly better. So, we will use it as the final model to submit the predictions for the test data. Also, these simple models give pretty good results without much hassle or technical know-how, which is why they are still widely used.

A bit on Logistic Regression does no harm.

The logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.

Below is the image most commonly used for Logistic Regression.

Logistic Model

By simple transformation, the logistic regression equation can be written in terms of an odds ratio.

Finally, taking the natural log of both sides, we can write the equation in terms of log-odds (logit) which is a linear function of the predictors.

The coefficient (b1) is the amount the logit (log-odds) changes with a one-unit change in x.
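
Written out in standard notation, with p standing for P(Y = 1 | x) (the symbol p is an assumption; b0 and b1 are the coefficients named in the text), the logistic, odds, and log-odds forms of the model are:

```latex
p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
\qquad
\frac{p}{1 - p} = e^{b_0 + b_1 x}
\qquad
\ln\!\left(\frac{p}{1 - p}\right) = b_0 + b_1 x
```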

For example, if the logit is 1 + 2x, i.e. b0 = 1 and b1 = 2, then increasing x by 1 increases the log-odds by 2, and the odds that Y=1 increase by a factor of e² ≈ 7.39 (since the log-odds here use the natural log).
Note that the probability of Y=1 has also increased, but not by as much as the odds have.
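
A quick numeric check of this example, assuming natural-log odds as in the text. The helper names sigmoid and odds are made up for the sketch.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 1.0, 2.0  # coefficients from the example: logit = 1 + 2x

def odds(x):
    """Odds p / (1 - p) that Y = 1 at the given x."""
    p = sigmoid(b0 + b1 * x)
    return p / (1.0 - p)

# Probabilities at x = 0 and x = 1, and the factor by which the odds grow
p_before, p_after = sigmoid(b0 + b1 * 0), sigmoid(b0 + b1 * 1)
odds_factor = odds(1) / odds(0)

print(round(odds_factor, 3))                  # e**2, about 7.389
print(round(p_before, 3), round(p_after, 3))  # 0.731 0.953
```

The odds grow by a factor of about 7.39, while the probability moves only from roughly 0.73 to 0.95.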

This was about the logistic function.

Now, to find the best dividing line (in other words, to minimize the loss function), Logistic Regression can also be fitted with Gradient Descent, but with a different loss function than Linear Regression, which uses mean squared error. Logistic Regression uses the log loss, which follows from maximum likelihood estimation (MLE).

The cost function for Logistic Regression
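
In standard notation (θ for the model parameters and the superscript (i) for the i-th sample are assumptions; m, y and h(x) are the symbols used in the text), the log loss reads:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
+ \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
```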

where m is the number of samples (as we take the average), y is the actual value and h(x) is the output of the model
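
To see how this loss behaves, here is a minimal plain-Python sketch of the average log loss. The function name and the example probabilities are made up, and it assumes every predicted probability lies strictly between 0 and 1.

```python
import math

def log_loss(y_true, y_prob):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    m = len(y_true)
    total = 0.0
    for y, h in zip(y_true, y_prob):
        # For a binary y, only one of the two terms contributes per sample
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the correct class
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

loss_right = log_loss(y_true, confident_right)
loss_wrong = log_loss(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident but wrong predictions are punished far more heavily than confident and correct ones are rewarded, which is what pushes the model toward well-calibrated probabilities.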

6. Model Interpretation

This is the most exciting part, at least for me. Since we are using a simple Logistic Regression model, we can directly use the coefficient values of the model to understand the predictions made. By doing so, we get to know which features are important, i.e. which words make a sentence toxic. If we had used a more complex model, we could have gone for SHAP or LIME. Also, since we have 6 output variables, we will have 6 sets of feature importances, which will be interesting to see. We will look at the top 5 words that determine whether a sentence is of a given toxic type according to the model.

# Feature names (words) learned by the vectorizer
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
words = vect.get_feature_names_out()
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# For all the models, collect the feature importances.
# lr.estimators_ gives the internal per-target models fitted by the multi-output classifier
coefs = {label: clf.coef_.flatten() for label, clf in zip(labels, lr.estimators_)}
# Saving the coefficients in a dataframe (kept numeric so that sorting by value works)
df_feats_impts = pd.DataFrame({"word": words, **coefs})
# Creating an individual feature importance table by sorting on a specific toxic-type column and selecting the top 5 words
toxic_fi = df_feats_impts[["word", "toxic"]].sort_values(by="toxic", ascending=False).head()

In the last line of the code above, we have created a top 5 feature importance list for the toxic output variable. Similarly, we can create a data frame for each of the other 5 toxic types with the top 5 words based on the coefficients of the respective model.
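
One way to build all six tables at once is sketched below. The helper name top_words_per_label is hypothetical, the sketch assumes the coefficient columns are numeric, and a tiny made-up frame stands in for df_feats_impts so that the example runs on its own.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def top_words_per_label(df_coefs, labels=LABELS, n=5):
    """Return {label: DataFrame of the n words with the largest coefficients}."""
    return {
        label: df_coefs[["word", label]]
        .sort_values(by=label, ascending=False)
        .head(n)
        .reset_index(drop=True)
        for label in labels
    }

# Toy stand-in for df_feats_impts with made-up words and coefficients
toy = pd.DataFrame({
    "word": ["alpha", "beta", "gamma"],
    "toxic": [0.1, 2.5, -1.0],
    "threat": [3.0, -0.2, 0.4],
})
tops = top_words_per_label(toy, labels=["toxic", "threat"], n=2)
print(tops["toxic"])
print(tops["threat"])
```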

We can see that the models are quite rightly selecting the most important features and it makes complete sense. E.g. for threats — words like kill, shoot, destroy, etc are most important. For identity hate — words like nigger, nigga, homosexual, faggot. Most important words for toxic are less extreme than most important words for severe toxic.

7. Results and Scope of Improvements

Results on the Kaggle Leaderboard

TODOs:
1. Try TF-IDF instead of CountVectorizer
TF = Freq(word in sentence) / Length of sentence
IDF = log(# of documents / # of documents that have the word)
TF-IDF = TF * IDF
TF-IDF tends to perform better than CountVectorizer in some cases
2. Try ensemble models instead of vanilla ML models
Bagging and boosting models often give better results than classic ML techniques
3. Better text preprocessing
Typo correction, lemmatization, etc. can be done to further improve the model
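
The TF-IDF definitions from the first TODO can be sketched directly in plain Python. The function name tf_idf and the two example sentences are made up; note that sklearn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

def tf_idf(docs):
    """TF-IDF per document, using the plain TF and IDF definitions."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word
    doc_freq = {}
    for toks in tokenized:
        for word in set(toks):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for toks in tokenized:
        row = {}
        for word in set(toks):
            tf = toks.count(word) / len(toks)           # Freq(word) / length
            idf = math.log(n_docs / doc_freq[word])     # log(N / docs with word)
            row[word] = tf * idf
        scores.append(row)
    return scores

# Hypothetical example sentences
docs = ["you are great", "you are rude"]
scores = tf_idf(docs)
print(scores[0])
```

Words that appear in every document ("you", "are") get a score of 0, while words that set a document apart ("great", "rude") get a positive score.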

This was about Bag of Words; the next part will be about Word2Vec. Stay safe till then. Again, the whole code is available (here). Please do provide your feedback in the form of responses and claps :)
