TEXT CLASSIFICATION / NLP

Jigsaw Unintended Bias in Toxicity Classification: A Kaggle Case-Study

In this blog, I’ll explain how to build a first-cut solution to classify the toxicity of comments using a simple LSTM model in Keras.

Tulrose Deori
The Startup


image source: uk.norton.com

Table of Contents:

  1. Introduction
  2. Business Problem
  3. Mapping the real-world problem to an ML/DL problem
  4. Understanding the data
  5. Existing Approaches
  6. My Approach
  7. Understanding Embedding, Bidirectional LSTM, GlobalMaxPool1D, GlobalAveragePool1D, and Attention
  8. Model Explanation
  9. Code
  10. Training
  11. Results
  12. Future Works and Conclusions
  13. References

1 Introduction:

At the end of 2017, the Civil Comments platform shut down and chose to make its ~2 million public comments available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort, extended the annotation of this data by human raters for various toxic conversational attributes, and hosted a competition on Kaggle to employ ML/DL to help detect toxic comments. [1]

To read more about the challenge, click here.

2 Business Problem:

When the Conversation AI team (a research initiative founded by Jigsaw and Google) built a toxicity model, they found that the model incorrectly learned to associate the names of frequently attacked identities with toxicity. So the model predicted high toxicity for comments containing words like gay, black, Muslim, white, lesbian, etc., even when the comments were not actually toxic (e.g. "I am a gay woman."). This happened because the dataset was collected from sources where such words (or identities) frequently appear in offensive contexts. We need to build a model that can detect toxicity in comments while minimizing unintended bias with respect to certain identities. [1]

  • Toxic comments are comments that are offensive and can sometimes make people leave a discussion (e.g. on public forums).
  • Unintended bias is unplanned bias that arises because the data was collected from sources where certain words (or identities) appear disproportionately in offensive contexts.

3 Mapping the real-world problem to an ML/DL Problem:

3.1 Type of Machine Learning Problem:

The problem at hand is a binary classification task:

  • Target label 0 means non-toxic comments.
  • Target label 1 means toxic comments.

3.2 Performance Metric:

Source: Kaggle

The competition uses a newly developed metric that combines several submetrics to balance overall performance with various aspects of unintended bias.

First, we’ll define each submetric.

1. Overall AUC: This is the ROC-AUC for the full evaluation set.

2. Bias AUCs: To measure unintended bias, we again calculate the ROC-AUC, this time on three specific subsets of the test set for each identity, each capturing a different aspect of unintended bias. For this, the dataset is divided into two major groups — the Background group and the Identity group.

  • An Identity group can be defined as a bunch of comments that have some mention of a particular ‘identity’ in them.
  • Everything that doesn’t belong to the Identity group goes to the Background group.

Each group can be further divided into positive (toxic) and negative (non-toxic) examples. Therefore there are 4 subsets.

image source: https://medium.com/jash-data-sciences/measuring-unintended-bias-in-text-classification-a1d2e6630742
  • a) Subgroup AUC: Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.
  • b) BPSN (Background Positive, Subgroup Negative) AUC: Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.
  • c) BNSP (Background Negative, Subgroup Positive) AUC: Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.

To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:

image source: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation
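For reference, the generalized mean defined on the evaluation page is

$$M_p(m_s) = \left(\frac{1}{N}\sum_{s=1}^{N} m_s^{\,p}\right)^{1/p}$$

where m_s is the bias metric for identity subgroup s, N is the number of identity subgroups, and the competition fixes p = -5, which pulls the mean toward the worst-performing subgroups.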

We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:

image source: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation
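Written out, the final score is

$$score = w_0 \, AUC_{overall} + \sum_{a=1}^{A} w_a \, M_p(m_{s,a})$$

where A = 3 is the number of bias submetrics (Subgroup, BPSN, BNSP), m_{s,a} is the bias metric for identity subgroup s under submetric a, and all weights w are set to 0.25.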

4 Understanding the data:

4.1 About Data:

Download the data files from here.

The data includes the following:

  • train.csv: the training set, which includes comments, toxicity labels, and subgroups.
  • test.csv: the test set, which contains comment texts but no toxicity labels or subgroups.
  • sample_submission.csv: a sample submission file in the correct format.

The text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. [1]

Although there are many identity columns in the Train data, only a few are required: male, female, homosexual_gay_or_lesbian, christian, jewish, muslim, black, white, psychiatric_or_mental_illness. These identities will help us in calculating the final metric.

4.2 Exploratory Data Analysis:

Let’s study and analyze our data and try to come up with some meaningful insights. EDA helps us in many ways:

  • We might find some patterns in the data which will help us build good models.
  • We might come up with some meaningful insights which would help us make important business decisions.

Let’s first load our CSV datafiles to a pandas data frame:

import pandas as pd
train_df = pd.read_csv('train.csv.zip')
test_df = pd.read_csv('test.csv.zip')

4.2.1 Univariate Analysis of target feature:

This feature is the measure of toxicity for a comment text.

  • We can see that the target feature ranges between 0.0 and 1.0.
  • Most of the comments have a toxicity score in the range 0.0 to 0.2

Let’s try to see the barplot for each class.

  • We can see that the data is highly imbalanced: most of the comments are non-toxic and there are very few toxic comments.
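As a rough sketch of how this check can be reproduced (binarizing target at the 0.5 threshold used by the competition; the plotting choices are mine):

import matplotlib.pyplot as plt

# binarize the continuous toxicity score at the competition's 0.5 threshold
train_df['toxic_class'] = (train_df['target'] >= 0.5).astype(int)

# bar plot of the two classes to visualize the imbalance
train_df['toxic_class'].value_counts().plot(kind='bar')
plt.xlabel('class (0 = non-toxic, 1 = toxic)')
plt.ylabel('number of comments')
plt.show()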

4.2.2 Univariate Analysis of Auxiliary target features:

The data also has several additional toxicity subtype attributes that are highly correlated to the target feature. These features are:

  • severe_toxicity , obscene , threat , insult , identity_attack , sexual_explicit

Let’s try to visualize their distributions and check if they provide any useful pieces of information:

  • Among these subtypes, insult clearly dominates, i.e. most toxic comments are made with the intention to insult someone.

4.2.3 Analysis of Identity features:

A subset of comments has also been labeled with a variety of identity attributes. They can be grouped into five categories: race or ethnicity, gender, sexual orientation, religion, and disability, as follows:

  • race or ethnicity: asian, black, jewish, latino, other_race_or_ethnicity, white
  • gender: female, male, transgender, other_gender
  • sexual orientation: bisexual, heterosexual, homosexual_gay_or_lesbian, other_sexual_orientation
  • religion: atheist, buddhist, christian, hindu, muslim, other_religion
  • disability: intellectual_or_learning_disability, other_disability, physical_disability, psychiatric_or_mental_illness

Analysis w.r.t. “race or ethnicity”:

Analysis w.r.t. “gender”:

Analysis w.r.t. “sexual orientation”:

Analysis w.r.t. “religion”:

Analysis w.r.t. “disability”:

We can derive the following conclusions from the above plots:

  • Among the religion identities, ‘christian’ is mentioned in the most comments; it also accounts for the most toxic comments.
  • Comments that mention ‘gay’ or ‘lesbian’ are more likely to be toxic.

4.2.4 Analysis of comment_text feature:

We check the distribution of numbers of characters present in a comment:

  • We have a bimodal distribution of character lengths in the data.
  • The average character length of a comment is 297.

We check the distribution of the number of words present in a comment:

  • We have a clear unimodal left-skewed distribution for the number of words in the data.
  • The average number of words in a comment is 52.

Next, let’s sample 20,000 comments from both toxic and non-toxic comments and see Wordcloud for the top 100 words. This will give us a sense of “which” words are most frequently used for a certain type of comment.

Wordcloud for Toxic comments:

Wordcloud for Non-Toxic comments:

4.3 Data Preprocessing:

Let’s look at some comment texts at random to get a sense of what they look like. This will help us decide which preprocessing steps should be applied to the text:

Sentence 1:It's ironic to sterilize homosexuals. They won't, by their OWN nature, have sex with the opposite sex. Thereby not contributing their genes to the pool. That is why homosexuality, scientifically speaking, is considered a fatal genetic mutation. (ps I'm still waiting for fallout on my mutant comment) haha
_________________________________________
Sentence 2:Natural Law is neither natural nor is it a law. It is a philosophical opinion. IMHO, it's use in making an argument is invalid.
_________________________________________
Sentence 3:Jail????

For what? (Snork)

So, for saying tax 'Payers' should stop (in perpetuity) supporting tax 'Takers' (aka welfare recipients)?
* I agree with Cory.
.
Or for not being willing to hold "in person" town halls where violence can happen-like with the alt left baseball shooting?
* I agree with Cory.
.
Jail? Laughable.

I support Cory. He doesn't believe in wasting my hard earned tax dollars either. For that I thank him.
_________________________________________
Sentence 4:A non-issue.
_________________________________________
Sentence 5:murdering he says
did you murder a cheeseburger today hypocrite?
_________________________________________

We see that our data points contain lots of punctuation marks, contractions, quotes, etc. So next we will clean them to increase the vocabulary coverage of our embeddings.

4.3.1 Handling Contractions:

We can use the NLTK library for this purpose.

from nltk.tokenize.treebank import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

def handle_contractions(x):
    x = tokenizer.tokenize(x)
    return x

The TreebankWordTokenizer tokenizer performs the following steps:

  • split standard contractions, e.g. don’t -> do n’t and they’ll -> they ‘ll
  • treat most punctuation characters as separate tokens
  • split off commas and single quotes, when followed by whitespace
  • separate periods that appear at the end of the line
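For example, on a made-up sentence (expected output shown as a comment):

print(handle_contractions("They'll say it isn't toxic, right?"))
# ['They', "'ll", 'say', 'it', 'is', "n't", 'toxic', ',', 'right', '?']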

4.3.2 Handling punctuation marks: We will remove the unwanted punctuation marks.
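A minimal sketch of one way to do this on the token lists returned by handle_contractions (the exact punctuation set kept or dropped in the full code may differ):

import string

PUNCTUATION = set(string.punctuation)

def handle_punctuation(tokens):
    # drop tokens that consist only of punctuation characters
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]

train_df['comment_text'] = train_df['comment_text'].apply(
    lambda x: ' '.join(handle_punctuation(handle_contractions(x))))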

We are now done cleaning our data.

4.4 Adding weights to the data samples:

A subset of comments in the dataset has also been labeled with a variety of identity attributes, representing the identities mentioned in the comment. The identity columns listed below are included in the evaluation calculation, so we will add this identity information as weights for our data points.

# https://github.com/jiaruxu233/Jigsaw-Unintended-Bias-in-Toxicity-Classification/blob/master/Custom_Loss.ipynb
import numpy as np

IDENTITY_COLUMNS = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness'
]

identities = train_df[IDENTITY_COLUMNS].fillna(0).values
targets = train_df['target'].values

weights = np.ones((len(train_df),)) / 4
# Subgroup
weights += (identities >= 0.5).sum(axis=1).astype(bool).astype(int) / 4
# Background Positive, Subgroup Negative
weights += (((targets >= 0.5).astype(bool).astype(int) +
             (identities < 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4
# Background Negative, Subgroup Positive
weights += (((targets < 0.5).astype(bool).astype(int) +
             (identities >= 0.5).sum(axis=1).astype(bool).astype(int)) > 1).astype(bool).astype(int) / 4

loss_weight = 1.0 / weights.mean()

4.5 Featurization:

Deep learning or machine learning models cannot understand human language directly. Hence we need to convert our data into a numerical form before we can feed it as input to our model.

Text tokenization is a method to vectorize a text corpus by turning each text into a sequence of integers (each integer is the index of a token in a dictionary). This can be done with a few lines of code using the utility functions available in Keras:

from keras.preprocessing import text, sequence  # 'sequence' is used below for padding

tok = text.Tokenizer()
tok.fit_on_texts(list(x_train))

x_train = tok.texts_to_sequences(x_train)
x_test = tok.texts_to_sequences(x_test)

Note that the sequences are not all of the same length, because the comment texts themselves vary in length. So, we will pad the sequences to a common length, MAX_LEN. Sequences that are shorter than MAX_LEN are padded with a fill value at the start/end, and sequences longer than MAX_LEN are truncated so that they fit the desired length.

x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAX_LEN)

Now, we are ready with our data to train a Deep Learning Model.

5 Existing Approaches:

Most of the top-scoring solutions use an ensemble of multiple models: combinations of several BERT models, GPT-2 models, LSTM models, etc. Building one such ensemble requires many hours of training and tremendous computational power. The model I’ve built, in contrast, can be trained on Google Colab in just a few hours.

6 My Approach:

Unlike a complex ensemble of multiple models, I have built a simple LSTM model for the classification task.

The main ideas that give so much power to this simple LSTM model are:

  • use of Attention Layer
  • instead of making predictions only at the end of model training, we make predictions after each epoch and then combine them using a set of weights.

7 Understanding Embedding, Bidirectional LSTM, GlobalMaxPool1D, GlobalAveragePool1D, and Attention:

Before we dive into the model, we need to understand the different layers and operations we will use in our model.

7.1 Embedding Layer: This layer creates the word embeddings for all the words present in the documents of our corpus. It does so by mapping the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. [2]

This means when we feed the tokenized data obtained in Section 4.5, of shape (samples, indices) , as input to the Embedding layer, the output will be a 3D tensor of shape (samples, sequence_length, embedding_dim).

For the problem at hand, we create the embedding matrix using the pre-trained Crawl and GloVe word embeddings (300 dimensions each, concatenated). As such, the embedding matrix will be of shape (vocab_size, 600).
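A hedged sketch of how such a matrix can be built, assuming the two sets of pre-trained vectors have already been loaded into dictionaries crawl_embeddings and glove_embeddings (word -> 300-d vector; both names are placeholders) and that tok is the fitted Tokenizer from Section 4.5:

import numpy as np

EMBEDDING_DIM = 300  # each of Crawl and GloVe is 300-dimensional

def build_matrix(word_index, embeddings):
    # words missing from the pre-trained vocabulary keep a row of zeros
    matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        if word in embeddings:
            matrix[i] = embeddings[word]
    return matrix

# concatenate the two 300-d embeddings into a single 600-d representation
embedding_matrix = np.concatenate(
    [build_matrix(tok.word_index, crawl_embeddings),
     build_matrix(tok.word_index, glove_embeddings)], axis=-1)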

7.2 Bidirectional LSTM: Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. This structure allows the network to have both backward and forward information about the sequence at every time step, i.e. at any point in time it preserves information from both the past and the future. [3]

image source: https://medium.com/@raghavaggarwal0089/bi-lstm-bc3d68da8bd0

7.3 GlobalAveragePooling1D: For each feature dimension, it takes the average among all time steps. So a tensor with shape (batch_size, step_dim, features_dim) becomes a tensor of shape (batch_size, features_dim) after global average pooling.

7.4 GlobalMaxPooling1D: For each feature dimension, it takes the maximum among all time steps. So a tensor with shape (batch_size, step_dim, features_dim) becomes a tensor of shape (batch_size, features_dim) after global max pooling.
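A tiny NumPy illustration of what these two pooling operations compute (shapes only; this is not part of the model code):

import numpy as np

x = np.random.rand(32, 100, 256)   # (batch_size, step_dim, features_dim)
avg_pool = x.mean(axis=1)          # global average pooling -> (32, 256)
max_pool = x.max(axis=1)           # global max pooling     -> (32, 256)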

7.5 Attention: In psychology, attention is the cognitive process of selectively concentrating on one or a few things while ignoring others. [4]

Similarly, in a sentence, not all words contribute equally to the representation of the sentence’s meaning. Hence, we introduce “attention” to extract the words that are important to the meaning of the sentence and aggregate the representations of those informative words into a sentence vector. This sentence vector can be seen as a high-level representation of the given sentence and can be used as a feature for the classification task. [5]

Specifically, we first feed the word annotations hi, obtained from the biLSTM layer preceding the Attention layer, through an affine transformation (the general alignment score function defined in Luong et al. [6]) to get an alignment score ei for each of the biLSTM hidden states.

Then, the alignment scores of all the hidden states are collected into a single vector and passed through a softmax to obtain the attention weights ai.

After that, we compute the vector representation of the sentence, U (the context vector), as a weighted sum of the word annotations hi based on the weights ai. The context vector we obtain can be seen as a high-level representation of the given sentence and can be used as a feature for the classification task. [5]

The following figure illustrates the Attention architecture:
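The description above can be written as a small custom Keras layer. The following is a hedged sketch (a simplified version of the attention layers commonly used in Kaggle kernels, not necessarily line-for-line identical to the one in my repo):

from keras import backend as K
from keras.layers import Layer

class Attention(Layer):
    def build(self, input_shape):
        # parameters of the affine transformation e_i = h_i . W + b
        self.W = self.add_weight(name='att_weight', shape=(input_shape[-1], 1),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='att_bias', shape=(1,),
                                 initializer='zeros', trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, h):
        # h: word annotations from the biLSTM, shape (batch, steps, features)
        e = K.dot(h, self.W) + self.b    # alignment scores, (batch, steps, 1)
        a = K.softmax(e, axis=1)         # attention weights over the time steps
        return K.sum(a * h, axis=1)      # context vector U, (batch, features)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])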

8 Model Explanation:

The following diagram is the architecture of our model:

  • I used two bi-LSTM layers.
  • The output of the first bi-LSTM layer is fed to an Attention layer, producing an output, say A1.
  • The output of the second bi-LSTM layer is fed to Attention, GlobalAveragePooling1D, and GlobalMaxPooling1D layers; their concatenated outputs form an output, say A2.
  • The outputs A1 and A2 are concatenated and passed to a Dense layer.
  • I used Linear layers with skip connections in the deeper layers.
  • Then, we finally make our predictions using a Dense layer.

Skip connections feed the output of one layer directly to a later layer, skipping a few layers in between. This way, information available in the earlier layers can be passed explicitly to the later layers. Skip connections also help information (and gradients) travel faster through deep neural networks.

9 Code:

The full Keras code to define the above model is available on my GitHub repo (linked at the end of this post).
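As a condensed, hedged sketch of that architecture (the layer sizes and activations are my assumptions; the Attention layer is the one sketched in Section 7.5):

from keras.layers import (Input, Embedding, Bidirectional, LSTM, Dense,
                          GlobalAveragePooling1D, GlobalMaxPooling1D,
                          concatenate, add)
from keras.models import Model

LSTM_UNITS = 128            # DECODER_HIDDEN_DIM from Section 10
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
NUM_AUX_TARGETS = 6         # toxicity subtypes from Section 4.2.2 (assumed count)

def build_model(embedding_matrix):
    words = Input(shape=(MAX_LEN,))
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix],
                  trainable=False)(words)

    # two stacked bidirectional LSTM layers
    lstm_1 = Bidirectional(LSTM(LSTM_UNITS, return_sequences=True))(x)
    lstm_2 = Bidirectional(LSTM(LSTM_UNITS, return_sequences=True))(lstm_1)

    # A1: attention over the first bi-LSTM's outputs
    a1 = Attention()(lstm_1)
    # A2: attention + global average/max pooling over the second bi-LSTM's outputs
    a2 = concatenate([Attention()(lstm_2),
                      GlobalAveragePooling1D()(lstm_2),
                      GlobalMaxPooling1D()(lstm_2)])

    hidden = Dense(DENSE_HIDDEN_UNITS, activation='relu')(concatenate([a1, a2]))
    # linear Dense layers with skip connections in the deeper part of the network
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS)(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS)(hidden)])

    # main toxicity prediction and the auxiliary subtype predictions
    result = Dense(1, activation='sigmoid', name='target')(hidden)
    aux_result = Dense(NUM_AUX_TARGETS, activation='sigmoid', name='aux_target')(hidden)
    return Model(inputs=words, outputs=[result, aux_result])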

10 Training:

  • The model is compiled with the Adam optimizer.
  • We make two predictions with our model: target and aux_target .

Because the data also has several additional toxicity subtype attributes (severe_toxicity, obscene, threat, insult, identity_attack) that are highly correlated with the target, we also use these auxiliary targets and predict their toxicity probabilities.

  • We used a custom_loss function (which applies the sample weights from Section 4.4) to penalize the target and the vanilla binary_crossentropy to penalize the aux_target.
  • We used LearningRateScheduler to schedule a different learning rate at every epoch.
  • We used a batch size of 256.
  • We used DECODER_HIDDEN_DIM = 128

We trained two models, each for 4 epochs. It took 5–6 hours to complete the training process with the free computing resources provided by Google Colab.
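A hedged sketch of what this training loop could look like, given the model and weights defined earlier. Here y_train and y_aux_train are assumed to hold the target and the auxiliary subtype columns, and the exact learning-rate schedule and per-epoch combination weights are illustrative guesses rather than the tuned values:

import numpy as np
from keras import backend as K
from keras.callbacks import LearningRateScheduler

def custom_loss(y_true, y_pred):
    # the last column of y_true carries the per-sample weight from Section 4.4
    return K.mean(K.binary_crossentropy(y_true[:, :-1], y_pred) * y_true[:, -1:])

model = build_model(embedding_matrix)
model.compile(loss=[custom_loss, 'binary_crossentropy'],
              loss_weights=[loss_weight, 1.0], optimizer='adam')

EPOCHS = 4
checkpoint_predictions, checkpoint_weights = [], []

for epoch in range(EPOCHS):
    model.fit(x_train,
              [np.hstack([y_train[:, None], weights[:, None]]), y_aux_train],
              batch_size=256, epochs=1, verbose=1,
              # decay the learning rate every epoch (illustrative schedule)
              callbacks=[LearningRateScheduler(lambda _: 1e-3 * (0.55 ** epoch))])
    # predict after every epoch and give later epochs a larger weight
    checkpoint_predictions.append(model.predict(x_test, batch_size=2048)[0].flatten())
    checkpoint_weights.append(2 ** epoch)

final_predictions = np.average(checkpoint_predictions, weights=checkpoint_weights, axis=0)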

11 Results:

The KAGGLE scores and the LB ranking for the model are summarised below:

  • KAGGLE Score: 0.93623
  • LB: 629/2633

12 Future Works and Conclusions:

We can try using smaller batch sizes and training for more epochs to boost the model’s performance. Tuning the LearningRateScheduler might also lead to a slight improvement in model performance.

And that’s all. Thank you for reading my blog. Please leave comments, feedback, and suggestions if you feel any.

Full code on my GitHub repo, here.

You can find me on LinkedIn, here.
