Multilingual Toxic Comment Classification

Sarah Aljudaibi
5 min read · May 22, 2020


Photo by zelle duda on Unsplash

The datasets and notebooks used in this post can be found in the Kaggle competition.

This project was created to improve the environment of social media and eliminate the kind of toxicity that leads to imbalance in society by spreading hatred and fear of self-expression, which in turn kills creative ideas and creates a society with a single voice, devoid of creativity and full of negativity.

To achieve that, we developed three NLP models: Bi-LSTM, BERT, and XLM-RoBERTa. These models can detect the toxicity of a comment or sentence in six different languages. For the simplicity of this post I will go through the BERT model only; if you want to check the rest of the models, you can find them in the Kaggle competition here.

So Let’s Start

In the Jigsaw Multilingual Toxic Comment Classification competition there are three different datasets: train, validation, and test.

We use these three datasets:

1- jigsaw-toxic-comment-train.csv

2- validation.csv

3- test.csv

Exploratory Data Analysis

Exploratory Data Analysis is a critical stage before using the data for modeling. In this stage we investigate the data to discover patterns, form hypotheses, gather insights, and display graphical representations for easier reading.

Our training data contains 223,549 records and 8 columns. The two most important columns are comment_text, which holds all the text we will train our NLP models on, and toxic, which is our target column.

Since we are working on a classification task, our model will have two classes: toxic, referred to as 1, and non-toxic, referred to as 0. None of the three datasets had null values in their columns.
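As a minimal sketch, assuming the three competition files sit in the working directory under the names listed above, the record counts and null checks can be reproduced with pandas:

```python
import pandas as pd

# Assumed file paths -- adjust to where the competition files live on your machine/Kaggle kernel
train = pd.read_csv("jigsaw-toxic-comment-train.csv")
valid = pd.read_csv("validation.csv")
test = pd.read_csv("test.csv")

print(train.shape)                                # (223549, 8) in the competition data
print(train[["comment_text", "toxic"]].head())    # text column and the binary target

# Confirm there are no missing values in any of the three datasets
for name, df in [("train", train), ("valid", valid), ("test", test)]:
    print(name, df.isnull().sum().sum())
```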

NOTE: in this part I added one more dataset for analysis only.

The Percentages Between Toxic Types

Toxic words come in different shapes, and in our data the comments break toxicity down into five sub-types: severe toxic, obscene, threat, insult, and identity hate. From the pie chart, obscene got the highest percentage of the toxic comments in the data, followed by insult.
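As a rough illustration of how those shares can be computed, the sketch below sums the sub-type columns of the training file and plots them; the column names follow the original Jigsaw data and should be treated as an assumption if your copy differs:

```python
import matplotlib.pyplot as plt

# Toxicity sub-type columns in jigsaw-toxic-comment-train.csv
toxic_types = ["severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Count how many comments are flagged with each sub-type and plot the shares
counts = train[toxic_types].sum().sort_values(ascending=False)
counts.plot.pie(autopct="%.1f%%", title="Toxic categories in training dataset")
plt.ylabel("")
plt.show()
```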

Toxic categories in the training dataset
Toxic identity categories by race

After seeing the result of toxic words based on race, I decided to look further into the data and compare toxicity across religions. I found that Christians are the religious group most affected by toxic comments in the data.

Toxic identity categories by religion

In the events we see every day around the world, one can see that the racism and bullying someone experiences is based not only on identity, such as being African American or white, but also on the person's religion and gender.

Languages

After some observation, I decided to change the language codes in the validation and test data to their language names.
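A minimal sketch of that mapping, assuming the column is named lang and holds ISO codes such as "es" and "tr":

```python
# Map ISO language codes to readable names for plotting and counting
lang_names = {
    "es": "Spanish", "it": "Italian", "tr": "Turkish",
    "ru": "Russian", "fr": "French", "pt": "Portuguese",
}

valid["lang"] = valid["lang"].map(lang_names)
test["lang"] = test["lang"].map(lang_names)

print(valid["lang"].value_counts())   # 3 languages in validation
print(test["lang"].value_counts())    # 6 languages in test
```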

Since we are building a model that operates on a diverse range of conversations in different languages, we looked at how many languages we have in both the validation and test datasets. The training dataset contains only English conversations, which means our models will be trained on a single language. The validation and test datasets are where we evaluate the models on languages other than English: in the bar chart on the left, the validation data has only three languages (Spanish, Italian, and Turkish), while in the bar chart on the right, the test dataset has six languages (Spanish, Russian, Italian, French, Portuguese, and Turkish).

Six languages in the test dataset

Cleaning The Data From Emoji

In the datasets you will notice there are emoji and punctuation like ($, ⭐︎, ♦︎) in the comments. We built a function to remove both of them, but eventually we found that the BERT tokenizer had been updated to a new version that supports emoji, so you no longer have to delete emoji from the text before training the model.
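For reference, here is a sketch of the kind of cleaning function we used; the emoji ranges cover only the most common Unicode blocks and, as noted above, this step is no longer required with the updated tokenizer:

```python
import re
import string

# Rough emoji ranges -- one common approach, not an exhaustive list
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats (e.g. ♦)
    "\u2B00-\u2BFF"          # misc symbols and arrows (e.g. ⭐)
    "]+",
    flags=re.UNICODE,
)

def clean_text(text: str) -> str:
    """Strip emoji and ASCII punctuation from a comment (kept here for reference only)."""
    text = EMOJI_PATTERN.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text
```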

Imbalanced Data

The competition datasets are imbalanced: the non-toxic class significantly outnumbers the toxic class. To handle this challenge, we use an appropriate evaluation metric called AUC, which stands for Area Under the Curve. This metric measures the performance of a classification model and tells us how well the model can distinguish between the two classes. The higher the AUC, the better the model is at predicting zeros as 0 and ones as 1.
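As a tiny illustration of the metric itself (toy numbers, not our data), scikit-learn's roc_auc_score compares predicted probabilities against true labels:

```python
from sklearn.metrics import roc_auc_score

# Toy example: true labels vs. predicted toxicity probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, y_prob))  # 0.75 -- perfect separation would give 1.0
```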

Imbalanced data in the target column

BERT Model

BERT stands for Bidirectional Encoder Representations from Transformers. It is a technique for pre-training NLP language representations, which means the machine learning community can now use BERT models that have already been trained on large amounts of text for a wide variety of tasks such as question answering, named entity recognition (NER), and classification tasks like sentiment analysis.

Preparing The Functions

  1. Use the pre-trained BERT model as a tokenizer. Its vocabulary already covers emoji, which is why we don't need to remove emoji from the text.
  2. Call the function regular_encode on the three datasets to convert each tokenized comment into a vector of token ids.
  3. Create and prepare a source dataset from your input data to feed the model in the next step. prefetch(AUTO) allows the next elements to be prepared while the current element is being processed (pipelining). A sketch of these three steps follows this list.
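A sketch of the three steps, assuming the HuggingFace transformers library (a recent version with the padding/truncation arguments shown) and TensorFlow; regular_encode mirrors the helper used in the competition notebooks, while the MAX_LEN and BATCH_SIZE values here are assumptions:

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer

MAX_LEN = 192          # assumed maximum sequence length
BATCH_SIZE = 32        # assumed batch size
AUTO = tf.data.experimental.AUTOTUNE

# 1. Pre-trained multilingual BERT tokenizer (its vocabulary already covers emoji)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# 2. Encode a list of comments into fixed-length arrays of token ids
def regular_encode(texts, tokenizer, maxlen=MAX_LEN):
    enc = tokenizer.batch_encode_plus(
        texts,
        padding="max_length",
        truncation=True,
        max_length=maxlen,
        return_attention_mask=False,
        return_token_type_ids=False,
    )
    return np.array(enc["input_ids"])

# `train` is the dataframe loaded in the EDA sketch earlier
x_train = regular_encode(train["comment_text"].tolist(), tokenizer)
y_train = train["toxic"].values

# 3. Build a tf.data pipeline; prefetch(AUTO) prepares the next batch while the
#    current one is being processed
train_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)
```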

BERT Model

bert-base-multilingual-cased

  • 12 layers, 768 hidden units, 12 attention heads, 110M parameters.
  • Trained on the Wikipedias of 104 languages.
  1. Take the BERT encoder output from transformers and use it as input to the neural network model.
  2. Train on the data and tune the model with the validation dataset, as sketched below.
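A minimal sketch of that model head and training call, continuing from the encoding sketch above and assuming TensorFlow/Keras with HuggingFace transformers; the classification head, learning rate, and epoch count are assumptions rather than the exact configuration behind the 88.5% submission:

```python
from transformers import TFAutoModel

def build_model(transformer, max_len=MAX_LEN):
    input_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")

    # Use BERT's output for the [CLS] token as the sentence representation
    sequence_output = transformer(input_ids)[0]
    cls_token = sequence_output[:, 0, :]

    # Single sigmoid unit for the binary toxic / non-toxic decision
    out = tf.keras.layers.Dense(1, activation="sigmoid")(cls_token)

    model = tf.keras.Model(inputs=input_ids, outputs=out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC()],
    )
    return model

transformer = TFAutoModel.from_pretrained("bert-base-multilingual-cased")
model = build_model(transformer)

# Train on the English data and validate on the multilingual validation set
x_valid = regular_encode(valid["comment_text"].tolist(), tokenizer)
y_valid = valid["toxic"].values

model.fit(train_dataset, validation_data=(x_valid, y_valid), epochs=2)
```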

We got an AUC of 88.5% on the Kaggle submission.

Our models were trained on more than 200,000 English conversation records and then tested on six different languages to predict whether a comment is toxic or not. This system can be useful not just for identifying the toxicity of users' comments; it could be extended to identify those who display suicidal tendencies and evidence of self-harming behavior, because issues like online toxicity can easily drag young people into such health problems.
