Toxic Comment Classification

Nupur Baghel
May 30, 2018 · 13 min read

My journey to building a multi-label comment classifier…

I was required to build a final capstone project as part of my Udacity machine learning engineer nanodegree. I picked up a dataset from a live competition at Kaggle and worked through it. This blog demonstrates my learning, problems faced and how I overcame these to finally succeed 🎉

The first step was deciding which problem interested me. I had a list of projects to choose from, in domains such as robotics, healthcare and education. I also had the option of picking a completely different project from Kaggle. I went with the latter option.

Step 1: Problem Statement and Background

While browsing Kaggle for a good problem statement, I landed on the Toxic Comment Classification Challenge, which was running live at the time.

The background for the problem originates from the multitude of online forums in which people actively participate and comment. As comments may sometimes be abusive, insulting or even hate-based, it becomes the responsibility of the hosting organizations to keep these conversations from turning negative. The task was thus to build a model which could classify comments into various categories. Consider the following examples:

Example showing 2 toxic and 1 non-toxic comment.

The exact problem statement was thus as below:

Given a group of sentences or paragraphs, used as a comment by a user in an online platform, classify it to belong to one or more of the following categories — toxic, severe-toxic, obscene, threat, insult or identity-hate with either approximate probabilities or discrete values (0/1).

Multilabel vs Multiclass classification?

As the task was to figure out whether a comment belongs to zero, one, or more than one category out of the six listed above, the first step before working on the problem was to distinguish between multi-label and multi-class classification.

In multi-class classification, we have one basic assumption that our data can belong to only one label out of all the labels we have. For example, a given picture of a fruit may be an apple, orange or guava only and not a combination of these.

In multi-label classification, data can belong to more than one label simultaneously. For example, in our case a comment may be toxic, obscene and insulting at the same time. It may also happen that the comment is non-toxic and hence does not belong to any of the six labels.
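The difference shows up directly in how a target is stored. The snippet below is a small illustrative sketch; the variable names and the example flags are made up, and only the label names come from the problem statement.

```python
# Label names as listed in the problem statement.
LABELS = ["toxic", "severe-toxic", "obscene", "threat", "insult", "identity-hate"]

# Multi-class: exactly one class per sample, so a single index is enough.
fruit_classes = ["apple", "orange", "guava"]
fruit_target = 1  # "orange" and nothing else

# Multi-label: one 0/1 flag per label; any number of flags may be set.
toxic_obscene_insult = [1, 0, 1, 0, 1, 0]
clean_comment = [0, 0, 0, 0, 0, 0]


def active_labels(flags):
    """Return the names of the labels whose flag is set."""
    return [name for name, flag in zip(LABELS, flags) if flag]


print(fruit_classes[fruit_target])
print(active_labels(toxic_obscene_insult))
print(active_labels(clean_comment))
```

A non-toxic comment is simply the all-zeros row, which is why "no label at all" needs no seventh category.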

Hence, I had a multi-label classification problem to solve. The next step was to gain some useful insights from data which would aid further problem solving.

Step 2: Studying data & identifying hidden patterns


I had a dataset of 95981 comments along with their labels. I observed that roughly 1 in 10 samples was toxic and 1 in 50 was obscene and insulting, while samples labelled severe-toxic, threat or identity-hate were extremely rare. The first 5 rows appeared as shown above.

Next I created some visualisations:

With comment length on the independent axis, in buckets of size 200, I counted the number of comments whose character count fell in each bucket. For instance, from the graph we can see there are 10000 comments which have 400 to 600 characters.


From the first visualisation we can observe that comments were of varying lengths from less than 200 characters to 1200 characters. The majority of comments had length up to 200.
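The bucketing behind such a plot can be sketched in a few lines. This is only an illustration: the function name and the toy comments are hypothetical, and the bucket size of 200 is the one mentioned above.

```python
from collections import Counter


def bucket_counts(comments, bucket_size=200):
    """Count comments per length bucket; each key is the bucket's lower bound."""
    counts = Counter((len(text) // bucket_size) * bucket_size for text in comments)
    return dict(sorted(counts.items()))


# Toy comments with lengths 50, 150, 450, 599 and 1100 characters.
comments = ["a" * 50, "b" * 150, "c" * 450, "d" * 599, "e" * 1100]
print(bucket_counts(comments))  # {0: 2, 400: 2, 1000: 1}
```

The key 400 stands for the bucket of 400 to 599 characters, and so on.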

For the next visualisation, I had length of comments on the independent axis again similar to the previous plot. But instead of counting number of comments, I counted comments belonging to each of the different categories.


The second visualisation plots the number of comments belonging to various categories. Toxic comments were highest in number, followed by obscene, insult, severe-toxic, identity-hate and threat in decreasing order.

This analysis gave me some really good insights into the distribution of my data. The next step was to pre-process the data. The volume of data was large enough for a good analysis, but that same volume made it hard to handle. The next section explains the issues I faced on account of it.

Problem 1 : Partitioning into testing and training

Based upon my experience till this point, I thought scikit-learn’s train_test_split would do the magic of dividing the data into testing and training sets regardless of its size. Sigh, it didn’t work out that way and I ended up with an error instead.

To solve the issue, I wrote a custom function which would shuffle the indices randomly, so that later we could split our data into training and testing sets without any bias. The code for shuffling and splitting was as follows:
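A minimal sketch of such a shuffle-and-split helper, assuming NumPy arrays; it is not necessarily the code used in the project. The function name, the 20% test fraction and the seed are hypothetical choices.

```python
import numpy as np


def shuffle_split(features, labels, test_fraction=0.2, seed=42):
    """Shuffle row indices once, then cut them into test and train parts."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))  # random order of row indices
    n_test = int(len(features) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]


# Toy data: 10 rows whose first feature is twice the label, to check alignment.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(X_train.shape, X_test.shape)
```

Because the same index array is applied to features and labels, each comment stays paired with its own label row after shuffling.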

Problem 2: Out of Memory, Kernel Died!

This was the most common error I encountered throughout the project. It would follow me in whichever direction I moved. Even when applying the most basic algorithm with the most basic method (to be discussed later), the program would run for a good ten minutes and finally end with an out-of-memory error (I was using a Jupyter notebook for my work). After scratching my head for two days, the idea💡 finally struck. Trim the data!

Solution: Including very long comments in training multiplied the number of words, and the kernel could not handle the memory this required. The data had to be trimmed carefully, so as not to miss essential features and lose accuracy. Setting 400 characters as the threshold retained up to 80% of the data and hence appeared to be a good choice. “We had fewer words in total, but the percentage of toxic words captured was higher.”
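The trimming step can be sketched as a simple length filter. This assumes that comments longer than the threshold were dropped rather than cut short, which the text suggests but does not state outright; the function name and toy data are hypothetical.

```python
def trim_by_length(comments, labels, max_chars=400):
    """Keep only comments of at most max_chars characters, with their labels."""
    kept = [(text, label) for text, label in zip(comments, labels)
            if len(text) <= max_chars]
    kept_comments = [text for text, _ in kept]
    kept_labels = [label for _, label in kept]
    fraction_kept = len(kept) / len(comments) if comments else 0.0
    return kept_comments, kept_labels, fraction_kept


# Toy data: lengths 5, 400 and 401; only the last one exceeds the threshold.
comments = ["short", "x" * 400, "y" * 401]
labels = [0, 1, 1]
kept_comments, kept_labels, fraction_kept = trim_by_length(comments, labels)
print(len(kept_comments), kept_labels, round(fraction_kept, 2))
```

Returning the kept fraction makes it easy to check how much data a given threshold retains before committing to it.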

PS: I had also tried applying Principal Component Analysis, but it did not work out well because of the humongous number of words present. So, I finally went ahead with trimming based on length.

Step 3: Data Preprocessing

1. Preparation for removal of punctuation marks: I took the set of punctuation characters from Python’s string module and appended the numeric digits to it, as those had to be removed too.
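A minimal sketch of this step using only the standard library. Whether the characters were deleted outright or replaced by spaces is not stated, so deletion here is an assumption, and the function name is hypothetical.

```python
import string

# Punctuation characters plus the digits 0-9, as described above.
chars_to_remove = string.punctuation + string.digits

# Translation table that deletes every character in chars_to_remove.
removal_table = str.maketrans("", "", chars_to_remove)


def strip_punctuation_and_digits(text):
    """Delete punctuation marks and digits from a comment."""
    return text.translate(removal_table)


print(repr(strip_punctuation_and_digits("Hello, world! 123")))  # 'Hello world '
```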

2. Updating the list of stop words: Stop words are words that are used so frequently in both written and verbal communication that they have neither a positive nor a negative impact on a statement. E.g. is, this, us, etc.

Ready-made stop-word lists are available in Python libraries (NLTK and scikit-learn each ship one). I used such a list and also appended single letters like ‘b’, ‘c’ … to it, which might already be present in the comments or have been generated during data preprocessing.

3. Stemming and Lemmatising: Stemming is the process of reducing inflected or derived words to their word stem or root form. Basically, a large number of words of similar origin are converted to the same word. E.g. words like “stems”, “stemmer”, “stemming”, “stemmed” are based on “stem”. This shrinks the vocabulary the model has to learn and helps it train to a better accuracy.

Lemmatising is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. It is quite similar to stemming in its working, but not exactly the same. Lemmatising depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. I used the WordNet lemmatizer in NLTK for this purpose. Both the stemmer and the lemmatizer were imported from NLTK.

applying stemmer and lemmatiser!
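As an illustration of the stemming half, here is a small sketch with NLTK’s PorterStemmer. The choice of the Porter stemmer is an assumption, since the post only says the stemmer came from NLTK. The lemmatizer (NLTK’s WordNetLemmatizer) is left out because it needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Reduce every token to its Porter stem."""
    return [stemmer.stem(token) for token in tokens]


print(stem_tokens(["stems", "stemming", "stemmed"]))  # ['stem', 'stem', 'stem']
```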

4. Applying Count Vectorizer: Count Vectorizer converts a collection of comments into a matrix of word counts. Each row is a comment, the column headers are the words themselves, and the cell values give the number of times the word occurs in that comment.

I passed the custom list of stop words created earlier as the stop_words parameter and kept the default values for ‘lowercase’ and the token ‘regular expression’.
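A minimal sketch of this step with scikit-learn’s CountVectorizer. The stop-word list here is scikit-learn’s ENGLISH_STOP_WORDS plus the single letters, used as a stand-in for the custom list described above; the toy comments are made up.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Stand-in for the custom list: a ready-made stop-word list plus single letters.
custom_stop_words = sorted(set(ENGLISH_STOP_WORDS) | set(string.ascii_lowercase))

# lowercase and token_pattern are left at their defaults.
vectorizer = CountVectorizer(stop_words=custom_stop_words)

comments = ["this comment is rude rude", "this is a polite comment"]
counts = vectorizer.fit_transform(comments)  # sparse matrix: comments x words

print(vectorizer.vocabulary_)  # word -> column index
print(counts.toarray())        # one row of counts per comment
```

The result is a sparse matrix, which matters for memory: most comments use only a tiny fraction of the vocabulary. Note that the default token pattern already ignores one-character tokens, so the single letters in the stop list only matter if that pattern is changed.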

Step 4: Finalising Evaluation Metrics!

Before moving on to applying algorithms, it was necessary to understand which metrics were suitable for our problem and the dataset available. For multi-label classification there exist two major categories of metrics:

  1. Label based metrics: include one-error, average precision, etc. These are calculated separately for each of the labels, and then averaged over all labels without taking into account any relation between them.
  2. Example based metrics: include accuracy, hamming loss, etc. These are calculated for each example and then averaged across the test set. Information about metrics has been obtained from this research paper.

An important observation was that our data was skewed, i.e. only a small percentage of the comments were actually toxic (less than 10%). Therefore, choosing accuracy as the metric would give misleading results. For example, even a basic classifier which predicts everything to be non-toxic would achieve 90% accuracy. It would never find what we are actually looking for, yet would still score well. Hence, scores such as ‘Hamming-loss’ & ‘Log-loss’, which work well even on skewed data, were chosen for comparing the results of different models. Let us look at each of these:


The hamming loss (HL) is defined as the fraction of the wrong labels to the total number of labels. For a multi-label problem, we need to decide a way to combine the predictions of the different labels to produce a single result. The method chosen in hamming loss is to give each label equal weight. The formula thus formed is :

Hamming Loss formula

Here ⊕ denotes exclusive-or between X( i,l ), which is the true value of the l-th label for the i-th comment, and Y( i,l ), which is the predicted value of the same.
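Written out in line with this description, the hamming loss is HL = (1 / (N·L)) Σ_i Σ_l X(i,l) ⊕ Y(i,l), where N is the number of comments and L the number of labels (N and L are notation assumed here). A small NumPy sketch of that computation, with made-up label matrices:

```python
import numpy as np


def hamming_loss(y_true, y_pred):
    """Fraction of (comment, label) cells where prediction and truth differ."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # XOR is True exactly where the two matrices disagree; the mean over all
    # N * L cells gives every label equal weight.
    return float(np.mean(np.logical_xor(y_true, y_pred)))


# Two comments, six labels: one missed label and one false alarm.
y_true = [[1, 0, 1, 0, 1, 0],
          [0, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0, 1]]
print(hamming_loss(y_true, y_pred))  # 2 wrong cells out of 12
```

scikit-learn offers the same metric as sklearn.metrics.hamming_loss.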


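Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Its curve, −log(p), makes this clear: the loss shrinks towards zero as the probability assigned to the correct label approaches 1, and grows without bound as that probability approaches 0.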

It works only for those problems which have two or more labels. In order to calculate Log Loss the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as:

Log loss formula

where N is the number of samples or instances, M is the number of possible labels, y(i, j) is a binary indicator of whether or not label j is the correct classification for instance i, and p(i, j) is the model probability of assigning label j to instance i.
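With these definitions, the standard form of the formula is LogLoss = −(1/N) Σ_i Σ_j y(i,j) · log p(i,j). The NumPy sketch below implements exactly that expression; the function name, the clipping constant eps and the toy numbers are illustrative assumptions.

```python
import numpy as np


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean over instances of -sum_j y(i, j) * log p(i, j)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.sum(y_true * np.log(p_pred)) / y_true.shape[0])


# Two instances, two labels; the correct label gets 0.8 and 0.6 respectively.
y = [[1, 0], [0, 1]]
p = [[0.8, 0.2], [0.4, 0.6]]
print(log_loss(y, p))
```

A confident wrong answer is punished much harder than a hesitant one: assigning 0.01 to the correct label costs about 4.6, while assigning 0.4 costs about 0.9.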

After finalizing hamming-loss and log-loss as evaluation metrics, I was ready to study the different algorithms for multi-label classification -

Step 5: Applying algorithmic techniques to build a multi-label classifier

Finally I reached the core part of the project, where I could start building the classifier. I had two major paths to choose from:

  • Problem transformation methods like binary relevance method, label power set, classifier chain and random k-label sets (RAKEL) algorithm
  • Adaptation algorithms like the AdaBoost MH, AdaBoost MR, k-nearest neighbours, decision trees and back propagation multi-label neural networks (BP-MLL).

I. Problem Transformation Methods

The scikit-multilearn library was used for implementing the various methods. Each method requires a base classifier, which is created for each of the labels and combined in a way specific to the method. The classifiers I used were: Multinomial Naive Bayes, Gaussian Naive Bayes and SVC.

1. Binary Relevance Method : This method does not take into account the interdependence of labels. Each label is solved separately like a single label classification problem. This is the simplest approach to be applied.

Unique classifier for each label

For example, to apply Binary Relevance method using MultinomialNB, the following code would be used:
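The idea can be sketched by hand with plain scikit-learn: one independent MultinomialNB per label column. This is a stand-in that shows what binary relevance does, not necessarily the exact code used in the project; the helper names and toy data are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def fit_binary_relevance(X, Y):
    """Train one independent classifier per label column."""
    return [MultinomialNB().fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_binary_relevance(models, X):
    """Stack the per-label predictions back into a label matrix."""
    return np.column_stack([model.predict(X) for model in models])


# Toy word-count matrix (4 comments x 3 words) and 2 labels per comment.
X = np.array([[3, 0, 0], [0, 3, 0], [3, 3, 0], [0, 0, 3]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

models = fit_binary_relevance(X, Y)
print(predict_binary_relevance(models, X))
```

scikit-multilearn packages the same idea in its BinaryRelevance class, which takes the base classifier as an argument.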

The results of these models on the testing data were:

2. Classifier Chain Method: In this method, the first classifier is trained on the input data alone, and each subsequent classifier is trained on the input data plus the outputs of all the previous classifiers in the chain. Hence this method takes into account some interdependence between labels as well as the input data. Some labels may show dependence, such as toxic and severe_toxic. Hence it is fair to use this method.

Data and classifier are merged to build next classifier
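A hand-rolled sketch of the chain, assuming MultinomialNB as the base classifier and the column order of Y as the chain order. The helper names and toy data are hypothetical, and library implementations may differ in details.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def fit_chain(X, Y):
    """Train classifier j on the features plus the true labels 0..j-1."""
    models = []
    for j in range(Y.shape[1]):
        X_aug = np.hstack([X, Y[:, :j]])  # append earlier labels as features
        models.append(MultinomialNB().fit(X_aug, Y[:, j]))
    return models


def predict_chain(models, X):
    """Predict labels in order, feeding earlier predictions to later models."""
    preds = np.zeros((X.shape[0], 0), dtype=int)
    for model in models:
        X_aug = np.hstack([X, preds])
        next_col = model.predict(X_aug).reshape(-1, 1)
        preds = np.hstack([preds, next_col])
    return preds


# Toy word-count matrix (4 comments x 3 words) and 2 labels per comment.
X = np.array([[3, 0, 0], [0, 3, 0], [3, 3, 0], [0, 0, 3]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

chain = fit_chain(X, Y)
print(predict_chain(chain, X))
```

During training the true earlier labels are used, while at prediction time only the chain’s own earlier predictions are available, so a mistake early in the chain can propagate to later labels.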

3. Label Power Set Method: In this method, we consider all the unique combinations of labels that occur. Each particular combination serves as one class, converting our multi-label problem into a multi-class classification problem. In our dataset, many comments have all six labels set to zero (non-toxic), and many have obscene and insult set together. Hence, this method seemed a good one to apply.

Giving number to each combination of labels
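The numbering of label combinations can be sketched with a plain dictionary. The helper names and the toy label rows are hypothetical; any multi-class classifier can then be trained on the resulting class ids.

```python
def encode_label_sets(Y):
    """Give each distinct label combination its own class id."""
    combo_to_class = {}
    class_ids = []
    for row in Y:
        key = tuple(row)
        if key not in combo_to_class:
            combo_to_class[key] = len(combo_to_class)  # next unused id
        class_ids.append(combo_to_class[key])
    return class_ids, combo_to_class


def decode_class_ids(class_ids, combo_to_class):
    """Map predicted class ids back to their label combinations."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [list(class_to_combo[cid]) for cid in class_ids]


# Four comments: clean, toxic+obscene+insult, clean, toxic only.
Y = [[0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0]]

class_ids, combo_to_class = encode_label_sets(Y)
print(class_ids)  # [0, 1, 0, 2]
print(decode_class_ids(class_ids, combo_to_class) == Y)  # True
```

One consequence of this encoding is that the model can only ever predict combinations that appeared in the training data.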

II. Adaptation Algorithms

1. MLkNN: This is the multi-label adaptation of k-nearest neighbours. BRkNNaClassifier and BRkNNbClassifier are similar classifiers, also based on the k-nearest neighbours method. Since our problem is somewhat similar to the page categorization problem, this algorithm is expected to give acceptable results. However, the time complexity involved is large, so it is preferable to train it on a smaller part of the dataset.

PS: I tried this method on the trimmed dataset, but it was taking too long to complete (even for k=2). Hence, I didn’t use it further.

2. BP-MLL Neural Networks : Back propagation Multi-label Neural Networks is an architecture that aims at minimising pair-wise ranking error.

The architecture of a feed-forward neural network with one hidden layer is as follows: the input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. Since I had six output labels, the output layer has six nodes.
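The shape of such a network can be sketched as a NumPy forward pass. This only illustrates the layer structure: the weights are random, the input size of 50 is hypothetical, and the ReLU and sigmoid activations are assumptions. The 16 hidden nodes match the best setting reported later in this post. The pairwise ranking loss that BP-MLL trains with is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 50, 16, 6  # n_features is a made-up size

# Fully connected layers: input -> hidden and hidden -> output.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_labels))
b2 = np.zeros(n_labels)


def forward(X):
    """Return one independent probability per label for every input row."""
    hidden = np.maximum(0.0, X @ W1 + b1)     # ReLU hidden layer (assumed)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid output per label


# Four fake comments represented as word-count vectors.
X = rng.integers(0, 3, size=(4, n_features)).astype(float)
probs = forward(X)
print(probs.shape)  # (4, 6)
```

A sigmoid on each output node, rather than a softmax across them, lets several labels be likely at once, which is what a multi-label problem needs.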

My basic model architecture was a simple one and appeared as follows:

Hence I created several models based on the problem transformation and adaptation algorithm approaches. The next step was to select the best model, either by fine tuning the parameters of the neural network architecture or by selecting the classifier + transformation method pair with the best results.

Step 6: Refining current models

Refining is an essential step for any machine learning model. GridSearchCV is a common tool for trying out combinations of model parameters, although manual tuning is also an option.

1) I used manual tuning for the problem transformation methods: I put all three of my classifiers (MultinomialNB, GaussianNB and SVC) through all three methods (binary relevance, classifier chain and label power set). The best result was obtained with the MultinomialNB classifier on label power set.

2) I used GridSearchCV to refine the neural network model. The parameter grid varied the number of epochs, the number of nodes in the hidden layer and the learning rate. Batch size was kept fixed.

This part was challenging, since I had to wait almost two nights for the final result (the 3x3x3 parameter grid took 20–24 hours to run with three-fold cross-validation). The resulting best params were:
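To see how quickly a grid search grows, the sketch below enumerates a grid with three candidate values for each of the three parameters and counts the model fits under three-fold cross-validation. The candidate values are hypothetical placeholders, not the grid actually used.

```python
from itertools import product

# Hypothetical candidate values; three options per parameter.
param_grid = {
    "hidden_nodes": [8, 16, 32],
    "learning_rate": [0.0001, 0.001, 0.01],
    "epochs": [5, 10, 20],
}
n_folds = 3

# Every combination of one value per parameter.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)  # 27 combinations, 81 fits
```

Each of the 27 combinations is trained once per fold, so the network is trained 81 times before a winner is chosen, which explains why such a search is slow.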

This concluded my search for the best model using fine tuning. I had chosen SVC classifier as my benchmark model and results of the different models were compared with its results to estimate whether I was actually successful in my efforts.

Step 7: Comparing models and Concluding!

The final step was to figure out which model gave the best results. Both hamming-loss and log-loss were used to make the selection. Comparison of the two best models with the benchmark model is as below:

Graphically, the following plots visualize losses produced by each model. The lower the loss, the better the model!

If we compare all the models on hamming-loss: the best model is LP-MultiNB, i.e. the Label Power Set model with the MultinomialNB classifier. It had a hamming-loss of only 3.17%.

If we compare all the models on log-loss: the best model is the BP-MLL model (neural network) with params = { nodes in hidden layer = 16, learning rate = 0.001, epochs = 10, batch size = 64 } as obtained from grid-search.

A key point to note is that almost all models generated by problem transformation methods had low hamming-loss values, whereas models generated by the adaptation algorithm approach had low log-loss values.

This concludes my work. Thank you!

You can have a look at the project proposal, report and simulations I created for this project on the Github repository. I hope you learnt something new through this read!

