Jigsaw Unintended Bias in Toxicity Classification Using Bi-directional RNNs

EDA and minimizing unintended bias with respect to mentioned identities.

Vamshi Krishna Dude

Published in

Analytics Vidhya

5 min read · Dec 7, 2019

Overview

The Conversation AI team from Jigsaw and Google hosted a competition on Kaggle to detect toxicity in comments while minimizing unintended bias with respect to identities mentioned in the text, such as male, female, homosexual_gay_or_lesbian, black, and white. I entered the competition with a week left and reached a score of 0.9239 with simple bi-directional LSTMs and a score of 0.939 with a fine-tuned BERT model. In this blog, I’ll share the implementation details of the simple bi-directional LSTM approach.

Understanding as a machine learning problem

The task can be interpreted as a binary classification problem with a toxic/non-toxic target variable, or as a regression problem with a target variable in the range [0.0, 1.0], where 0 is non-toxic and 1 is toxic.
Some form of interpretability for the results is also desirable.
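
As a small illustration of the two views of the target, the sketch below keeps the raw score for the regression view and thresholds it for the classification view. The scores are made-up values, and the 0.5 cut-off follows the competition's convention of treating a target of 0.5 or above as toxic.

```python
# Hypothetical toxicity scores standing in for the competition's target column.
toxicity_scores = [0.0, 0.2, 0.5, 0.83, 1.0]

# Cut-off used by the competition: target >= 0.5 counts as toxic.
THRESHOLD = 0.5

# Regression view: keep the raw score in [0.0, 1.0] as the target.
regression_targets = list(toxicity_scores)

# Classification view: binarize the score into non-toxic (0) / toxic (1).
binary_targets = [int(score >= THRESHOLD) for score in toxicity_scores]

print(binary_targets)
```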

Data

For now I use the Civil Comments data, which is similar to the data available in the Kaggle competition. For the code that displays sample data, and for the full code, check my GitHub notebook.

Some insights on Comments

Only 8% of the data is toxic, so the data is imbalanced.

Here is a plot showing the percentages of toxic and non-toxic comments in the training data.

The identities [female, male, black, white, christian] are the ones mentioned most often in the training data comments. The toxicity percentage is high in comments that mention [homosexual_gay_or_lesbian].

Two new features are extracted: the number of words in a comment and the number of expressive characters (?, !, ., etc.) in a comment.

Comments with num_words in the approximate range [100, 180] and num_expr_words in the approximate range [100, 150] tend to be non-toxic.
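
The two features can be computed along these lines. This is only a sketch: the function name, the dictionary keys, and the exact set of expressive characters are assumptions, since the post only lists "?!. etc.".

```python
# Assumed set of expressive characters; the post only lists "?!. etc.".
EXPRESSIVE_CHARS = set("?!.")

def extract_features(comment):
    """Return the word count and expressive-character count of a comment."""
    num_words = len(comment.split())
    num_expr_chars = sum(1 for ch in comment if ch in EXPRESSIVE_CHARS)
    return {"num_words": num_words, "num_expr_chars": num_expr_chars}

features = extract_features("What?! No way... really?")
print(features)
```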

Preprocessing Comments

Here, the comments are text that needs to be preprocessed before being converted to machine-understandable data, without losing the context of the comment. For full code check my GitHub notebook.

Preprocessing techniques I’ve followed:

Removing special characters and punctuation marks except [?!].
Replacing markup text, ‘https://’, spaces, and numerics with an empty string.
Expanding language contractions (e.g. don’t -> do not).
Removing stopwords using the nltk library, except the keyword ‘not’.
Lemmatizing words using the nltk library.
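
A rough, dependency-free sketch of these steps is shown below. It is not the notebook's code: the contraction and stopword tables are tiny illustrative stand-ins for fuller lists (the post uses nltk's stopwords), lemmatization is left out, and the steps are reordered so that contractions are expanded before apostrophes are stripped.

```python
import re

# Tiny illustrative tables; a real pipeline would use much fuller lists.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of"}  # 'not' is deliberately kept

def preprocess(comment):
    """Clean one comment: drop URLs, markup, digits and most punctuation."""
    text = comment.lower().replace("’", "'")
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)        # drop markup tags
    for short, full in CONTRACTIONS.items():     # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)            # drop numerics
    text = re.sub(r"[^a-z?!\s]", " ", text)     # keep letters, '?' and '!'
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

cleaned = preprocess("Don't visit <b>this</b> site: https://example.com, it's NOT good!!")
print(cleaned)
```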

After preprocessing, here is a word cloud of the most frequent words in the comments:

Most frequent words in toxic and non-toxic comments

Now it’s time to convert the preprocessed comments to machine/model-understandable features…

Tokenization of words in the comments

The tokenizer assigns a rank (a token) to each word based on the frequency of the word across all comments. Fit the tokenizer on both the train and test comments to give the model the maximum vocabulary, as we are not using pre-trained models here. Use pad_sequences() to ensure that all comments have the same length, so that the model can be trained in batches. Based on the num_words distribution from the plot above, a maximum sequence length between 200 and 250 works well for padding. I’ve chosen 220.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_PAD_SEQ_LEN = 220  # maximum comment length chosen above

# Tokenizer
tokenizer = Tokenizer()
# We fit on the test set as well, to give maximum vocabulary to our model.
tokenizer.fit_on_texts(list(train_df_float["comment_text"].values) + list(test_df_float["comment_text"].values))
train_seq = tokenizer.texts_to_sequences(list(train_df_float["comment_text"].values))
# Padding the sequences - for equal comment length.
train_seq = pad_sequences(train_seq, maxlen=MAX_PAD_SEQ_LEN)
train_labels = train_df[TARGET_COLUMN]
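
To make the two steps concrete, here is a pure-Python sketch of frequency-ranked tokenization and padding. It mimics what I understand to be the Keras defaults (indices start at 1, index 0 is reserved for padding, and padding and truncation happen at the front of the sequence); the helper names are hypothetical and this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to its frequency rank; the most frequent word gets 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_pre(sequence, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["not good", "good good comment"]
word_index = fit_word_index(texts)
padded = [pad_pre(seq, 4) for seq in texts_to_sequences(texts, word_index)]
print(word_index, padded)
```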

Build Embedding matrix

The embedding layer uses the embedding matrix to generate a continuous vector representation for each word that the tokenizer represents as a token. I used GloVe and Crawl 300d word vectors and concatenated the two vectors for each word to get maximum context for a word.

# Embedding matrix for embedding layer in the neural network.
embedding_matrix = embedding_matrix(tokenizer.word_index)
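
The helper that builds the matrix is in the notebook and not shown here. A minimal sketch of the idea, assuming each set of pre-trained vectors is available as a dict from word to NumPy array, could look like the following; the function name, its arguments, and the toy vectors are hypothetical.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors_a, vectors_b, dim_a, dim_b):
    """Concatenate two pre-trained vectors per word into one matrix row."""
    # Row 0 stays all-zero because index 0 is reserved for padding.
    matrix = np.zeros((len(word_index) + 1, dim_a + dim_b), dtype=np.float32)
    for word, idx in word_index.items():
        vec_a = vectors_a.get(word)
        vec_b = vectors_b.get(word)
        if vec_a is not None:
            matrix[idx, :dim_a] = vec_a   # first block of columns
        if vec_b is not None:
            matrix[idx, dim_a:] = vec_b   # second block of columns
    return matrix

# Toy stand-ins for the two 300d vector files (2 and 3 dimensions here).
vectors_a = {"good": np.array([1.0, 2.0])}
vectors_b = {"good": np.array([3.0, 4.0, 5.0]), "bad": np.array([6.0, 7.0, 8.0])}
toy_word_index = {"good": 1, "bad": 2, "unseen": 3}
toy_matrix = build_embedding_matrix(toy_word_index, vectors_a, vectors_b, 2, 3)
print(toy_matrix.shape)
```

Words missing from one source keep zeros in that half of the row, and words missing from both stay all-zero.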

Bi-directional LSTM model

Two bidirectional LSTMs capture the context of a sequence well for most NLP use cases. One-dimensional spatial dropout at the start and max pooling with valid padding gave good results. Using sample weights while training is a trick to reduce the model’s false positives.

import keras
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional,
                          CuDNNLSTM, Dropout, Flatten, Dense)
from keras.models import Model

LSTM_UNITS = 128
DROPOUT_RATE = 0.3
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS

# input_shape, input_length, aux_target_count and embedding_matrix
# are assumed to be defined earlier in the notebook.
# input layer
sequence_input = Input(shape=(input_shape,))
# embedding layer
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=input_length,
                            trainable=False)(sequence_input)
# spatial dropout dimen: 1D
sd = SpatialDropout1D(DROPOUT_RATE)(embedding_layer)
# Bidirectional layers
x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(sd)
x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
x = keras.layers.MaxPooling1D(2, padding='valid')(x)
x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
x = Dropout(DROPOUT_RATE)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
# output layer
result = Dense(1, activation='sigmoid')(x)
# auxiliary outputs layer
aux_result = Dense(aux_target_count, activation='sigmoid')(x)

model = Model(inputs=sequence_input, outputs=[result, aux_result])
model.compile(loss='binary_crossentropy', optimizer='adam')

When training the model, we apply sample weights to the training data. Every comment starts with a weight of 1/4, and the weight increases by 1/4 for each of the following cases that applies: the comment has one or more identity features; the comment is toxic and does not have an identity feature; the comment is non-toxic and has an identity feature.

# Credits : https://www.kaggle.com/gpreda/jigsaw-fast-compact-solution
# Note: the ~ operator requires the target and identity columns to hold booleans.
sample_weights = np.ones(len(train_df), dtype=np.float32) / 4
sample_weights += train_df[IDENTITY_COLUMNS].sum(axis=1).astype(bool).astype(int) / 4
sample_weights += train_df_float[TARGET_COLUMN] * (~train_df_float[IDENTITY_COLUMNS]).sum(axis=1).astype(bool).astype(int) / 4
sample_weights += (~train_df_float[TARGET_COLUMN]) * train_df_float[IDENTITY_COLUMNS].sum(axis=1).astype(bool).astype(int) / 4
sample_weights /= sample_weights.mean()
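
For readers who want to check the weighting rule on a few rows, here is a NumPy-only sketch of the scheme as described in the prose: a base weight of 1/4 plus 1/4 per matching case, then normalization to a mean of 1. The function and argument names are hypothetical. It works on two boolean flags per comment instead of the full identity columns, so it is one reading of the rule and not a drop-in copy of the code above.

```python
import numpy as np

def compute_sample_weights(is_toxic, mentions_identity):
    """Weight comments by the three cases described in the text."""
    is_toxic = np.asarray(is_toxic, dtype=bool)
    mentions_identity = np.asarray(mentions_identity, dtype=bool)
    weights = np.full(len(is_toxic), 0.25, dtype=np.float32)  # base weight
    weights += mentions_identity * 0.25                # has an identity feature
    weights += (is_toxic & ~mentions_identity) * 0.25  # toxic, no identity
    weights += (~is_toxic & mentions_identity) * 0.25  # non-toxic, with identity
    return weights / weights.mean()                    # normalize to mean 1

toy_weights = compute_sample_weights(
    is_toxic=[True, False, True, False],
    mentions_identity=[False, True, True, False],
)
print(toy_weights)
```

Non-toxic comments that mention an identity end up with the largest weight, which pushes the model to avoid flagging them as toxic.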

Parameters for the model configuration:

Total params: 169,631,630 
Trainable params: 5,143,430 
Non-trainable params: 164,488,200
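
These totals can be reproduced by hand from the layer definitions. The sketch below assumes a sequence length of 220, a 600-dimensional embedding (two 300d vectors concatenated), and the CuDNNLSTM layout with two bias vectors per gate. The value of aux_target_count is not given in the post; 5 is simply the value that makes the arithmetic match the reported trainable count.

```python
SEQ_LEN = 220      # padding length chosen in the post
EMBED_DIM = 600    # two 300d word vectors concatenated
LSTM_UNITS = 128
AUX_TARGETS = 5    # assumption: inferred from the reported totals

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    return input_dim * units + units

# Each Bidirectional wrapper holds a forward and a backward LSTM.
bi_lstm_1 = 2 * cudnn_lstm_params(EMBED_DIM, LSTM_UNITS)
bi_lstm_2 = 2 * cudnn_lstm_params(2 * LSTM_UNITS, LSTM_UNITS)
bi_lstm_3 = 2 * cudnn_lstm_params(2 * LSTM_UNITS, LSTM_UNITS)

# MaxPooling1D(2) halves the sequence length before the last LSTM and Flatten.
flat_dim = (SEQ_LEN // 2) * 2 * LSTM_UNITS

trainable = (bi_lstm_1 + bi_lstm_2 + bi_lstm_3
             + dense_params(flat_dim, 128)
             + dense_params(128, 1)
             + dense_params(128, AUX_TARGETS))

# The frozen embedding matrix accounts for all non-trainable parameters.
non_trainable = 164_488_200
vocab_rows = non_trainable // EMBED_DIM

print(trainable, vocab_rows, trainable + non_trainable)
```

Under these assumptions the embedding matrix has 274,147 rows, and most of the trainable parameters sit in the first dense layer after Flatten.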

Model Output and Evaluation

The model evaluation code is adapted from the hands-on tutorial by Google.

Output: an AUC score of 0.920 as the final metric.

The heat map below shows the bias metrics for individual identities. bnsp_auc (background negative and subgroup positive) is good for all identity groups, but bpsn_auc (background positive and subgroup negative) is not for black, white, and homosexual_gay_or_lesbian. This suggests that up to about 15% of the comments mentioning the identity black are classified as positive (toxic) even though they are negative (non-toxic).
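
To show what the per-identity metrics measure, here is a small pure-Python sketch. It is not the benchmark kernel's code, and the function names and toy scores are made up. AUC is computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one; the three variants differ only in which examples supply the positives and the negatives.

```python
def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def bias_aucs(scores, labels, in_subgroup):
    """Subgroup, BPSN and BNSP AUC for one identity subgroup."""
    rows = list(zip(scores, labels, in_subgroup))

    def pick(label, subgroup):
        return [s for s, y, g in rows if y == label and g == subgroup]

    return {
        # toxic vs. non-toxic, both within the subgroup
        "subgroup_auc": auc(pick(1, True), pick(0, True)),
        # background positive (toxic) vs. subgroup negative (non-toxic)
        "bpsn_auc": auc(pick(1, False), pick(0, True)),
        # subgroup positive (toxic) vs. background negative (non-toxic)
        "bnsp_auc": auc(pick(1, True), pick(0, False)),
    }

# Toy data: the non-toxic subgroup comment (0.7) is scored above a toxic background comment (0.6).
toy_metrics = bias_aucs(
    scores=[0.9, 0.7, 0.6, 0.8, 0.1, 0.2],
    labels=[1, 0, 1, 1, 0, 0],
    in_subgroup=[True, True, False, False, False, False],
)
print(toy_metrics)
```

A low bpsn_auc, as in this toy case, means non-toxic comments that mention the identity tend to be scored as high as toxic comments that do not, which is the false-positive pattern described above.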

Final metrics (AUC): a representation of the model performance

Sample Results

=======Sample Prediction Comments=======
Because people tend to be emotion-driven idiots which is why politicians and activists are always aiming at your heartstrings.
toxicity value : 0.807018

Spoken like the true libtard you are.  Hahahaha! LIBTARD!!
toxicity value : 0.211538

What an ugly orgy.  Shame on Fish and Game for allowing this.  How many ways can we devise to exploit a resource...............shame on us.
toxicity value : 0.625000

Kind of like you and your small-minded, repetitive and utterly banal drivel.
toxicity value : 0.546667

"The rednecks and morons who elected Trump are in their glory, ..."\n\nIs that you Hillary Clinton?
toxicity value : 0.716216

For full code check .ipynb file.

Conclusions:

I used GloVe and fastText Crawl 300d embeddings for comment text vectorization and modified the sample weights as an improvement. The bias metrics are taken from the Google benchmark kernel.
A major improvement can be achieved with context-based word embeddings from models such as BERT and XLNet, and by fine-tuning them.