Detecting toxic comments with Keras and interpreting the model with ELI5

Develop an estimator with a neural network model for a text classification problem, and use the ELI5 library to explain its predictions

Armand Olivares
5 min read · Aug 19, 2019

The main objective of this article is to build a model with Keras that identifies whether a tweet contains toxic comments or not, a particular case of sentiment analysis.

The datasets

For this article we considered two datasets of tweets, which you can download at:

  1. Dataset 1: this dataset comes from a hate speech identification task, with the following distribution of comments:

2. Dataset 2: this dataset comes from a Twitter sentiment analysis contest whose goal is to predict whether a tweet contains hate speech or not; its class distribution is as follows:

Both datasets are used to classify hate speech; however, in this article we will merge hate and offensive speech into a single category, i.e. “toxic”.

Also, as you can see, both datasets are imbalanced, but they complement each other, so by merging them we created a new dataset with a balanced class distribution:
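The merge itself is a one-liner with pandas. Here is a minimal sketch with toy stand-ins for the two downloaded files; the column names ("tweet", "label") are assumptions, with label 1 meaning toxic:

```python
import pandas as pd

# Toy stand-ins for the two tweet datasets (real ones are loaded from CSV).
df1 = pd.DataFrame({"tweet": ["you are awful", "nice day"], "label": [1, 0]})
df2 = pd.DataFrame({"tweet": ["love this", "total idiot"], "label": [0, 1]})

# Stack the two datasets into one, renumbering the index.
merged = pd.concat([df1, df2], ignore_index=True)
print(merged["label"].value_counts())  # check the class balance
```

With the real files, `value_counts()` is how you verify that the combined class distribution is roughly balanced before training.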

After merging the datasets, it is time to clean and format the data.

Data Pre-processing

Cleaning the data consists of:

  • Remove all punctuation (.,:!? etc.)
  • Remove all links (e.g. https://www.twitter.com)
  • Remove all user names (e.g. @user_example)
  • Remove digits in the text
  • Expand contractions (e.g. change “I’m” to “I am”)
  • Convert words to lowercase
  • Delete all stopwords
  • Lemmatize all the words

Below is the code snippet for the cleaner function:
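A minimal sketch of such a cleaner, using only regular expressions. The stopword set here is a tiny illustrative one (in practice you would use the full NLTK or spaCy list), and lemmatization is only indicated in a comment:

```python
import re

# Small illustrative stopword set; in practice use the full NLTK/spaCy list.
STOPWORDS = {"a", "an", "the", "is", "am", "are", "i", "you", "to", "and", "at"}

def clean_text(text):
    text = text.lower()                        # lowercase
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user names
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Lemmatization is omitted in this sketch; with spaCy it would be
    # [tok.lemma_ for tok in nlp(" ".join(tokens))].
    return " ".join(tokens)

print(clean_text("I am @user_example watching https://www.twitter.com at 10, great!"))
# → "watching great"
```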

And the decontracted function:
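A sketch of a contraction expander, built from substitution rules (this pattern-list approach is a common recipe; the exact rules in the article's snippet may differ). Curly apostrophes, common in tweets, are normalized first:

```python
import re

def decontracted(phrase):
    phrase = phrase.replace("\u2019", "'")    # normalize curly apostrophes
    # Specific cases first
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # General patterns
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

print(decontracted("I'm sure he won't mind"))  # → "I am sure he will not mind"
```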

Let’s check how the cleaner function works on our dataset, for a random tweet:

After applying the cleaner function:

The cleaner function works properly, so it’s time to move on to the modeling part.

Modeling

To use the explanation library ELI5 with deep learning models, you need to create your own estimator compatible with scikit-learn, so we must create a new estimator class.

The process is as follows:

1. Decide what type of object you want to create and select the base classes:

class KerasTextClassifier(BaseEstimator, TransformerMixin)

2. Create the __init__ method inside the class: this method holds the default values of our classifier:

As you can see, in this method you define values such as the number of classes (default 2), the number of epochs (default 15), the embedding dimension (300), and some other initial parameters. The model is also built in this method.
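A sketch of what this __init__ might look like (parameter names beyond those mentioned in the text are assumptions). A useful detail of the sklearn convention: BaseEstimator derives get_params automatically from the __init__ signature, which is why each argument is stored under its own name:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class KerasTextClassifier(BaseEstimator, TransformerMixin):
    """scikit-learn-compatible wrapper for a Keras text model (sketch)."""

    def __init__(self, max_words=30000, input_length=50,
                 emb_dim=300, n_classes=2, epochs=15, batch_size=32):
        self.max_words = max_words        # Tokenizer vocabulary size
        self.input_length = input_length  # padded sequence length
        self.emb_dim = emb_dim            # embedding dimension (300 for spaCy)
        self.n_classes = n_classes        # default 2: toxic / non-toxic
        self.epochs = epochs              # default 15
        self.batch_size = batch_size
        # The article also builds the Keras model here, e.g.:
        # self.model = self._get_model()

clf = KerasTextClassifier()
params = clf.get_params()  # BaseEstimator reads these from __init__
```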

3. Create the _get_model method: this is where the Keras model is built by adding layers as usual:

4. Create the preprocessing methods: in this article we put the methods for cleaning, tokenizing, and padding the data inside the class.

5. Create the fit method: it does all the hard work of checking the parameters, processing the data, and training the model:

The fit method does the following:

  • Build the tokenizer
  • Clean the data
  • Truncate and pad the input sequences so that they all have the same length for modeling
  • Train the Keras model

6. Create the predict, predict_proba, and score methods: the methods for making predictions and evaluating our model.
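Putting the six steps together, here is a minimal, self-contained sketch of such an estimator. The layer sizes and architecture (embedding → bidirectional LSTM → softmax) are assumptions for illustration; the article's snippets show the actual choices, and the text-cleaning helpers from earlier would normally be called inside fit as well:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

class KerasTextClassifier(BaseEstimator, TransformerMixin):
    def __init__(self, max_words=30000, input_length=50,
                 emb_dim=300, n_classes=2, epochs=15, batch_size=32):
        self.max_words = max_words
        self.input_length = input_length
        self.emb_dim = emb_dim
        self.n_classes = n_classes
        self.epochs = epochs
        self.batch_size = batch_size

    def _get_model(self):
        # Assumed architecture: embedding -> BiLSTM -> softmax head.
        model = Sequential([
            Embedding(self.max_words + 1, self.emb_dim),
            Bidirectional(LSTM(32)),
            Dense(self.n_classes, activation="softmax"),
        ])
        model.compile(loss="sparse_categorical_crossentropy",
                      optimizer="adam", metrics=["accuracy"])
        return model

    def _preprocess(self, texts):
        # Tokenize, then truncate/pad so every sequence has the same length.
        seqs = self.tokenizer.texts_to_sequences(texts)
        return pad_sequences(seqs, maxlen=self.input_length)

    def fit(self, X, y):
        self.tokenizer = Tokenizer(num_words=self.max_words)
        self.tokenizer.fit_on_texts(X)
        self.model = self._get_model()
        self.model.fit(self._preprocess(X), np.asarray(y),
                       epochs=self.epochs, batch_size=self.batch_size,
                       verbose=0)
        return self

    def predict_proba(self, X):
        return self.model.predict(self._preprocess(X), verbose=0)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

    def score(self, X, y):
        # Plain accuracy, following the sklearn convention.
        return float(np.mean(self.predict(X) == np.asarray(y)))
```

Because the class follows the sklearn estimator API, it plugs directly into tools like ELI5 that expect fit / predict_proba.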

And we need to build the embedding index to pass to the estimator, using spaCy’s pre-trained embeddings:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # spaCy model with 300-d vectors (model name assumed)

embeddings_index = np.zeros((30000 + 1, 300))
for word, idx in tokenizer.word_index.items():
    try:
        embeddings_index[idx] = nlp.vocab[word].vector
    except (KeyError, IndexError):
        pass  # leave out-of-vocabulary or out-of-range rows as zeros

Now that we have built our Keras estimator, we need to instantiate the model and pass the embedding weights:

And train the model:

The accuracy score on the test data:

94%, not bad.

Interpreting text predictions with ELI5

Now we can explain the predictions by looking at a randomly selected comment in the test set:

For example, for the class “Non Toxic”:

In this diagram, the green words made the classifier think that the comment was less toxic while the red words made the comment seem more toxic.

We can clearly see that the classifier made the right choice by classifying the comment as “Non Toxic”, giving more importance to words such as “golden”, “life”, and “smile”, to mention just a few.

For class “Toxic”:

We see that for the tweet classified as “Toxic”, the model gives more importance to words such as “bitch”, “hoe”, “retard”, and “fuck”, i.e. rude words.

Final Thoughts

This article should give you a rough understanding of how to approach NLP interpretability with Keras and ELI5.

You can try changing the DL model, cleaning the data even further, etc.

You can use this as a starting point for similar problems in NLP.

The code can be found on this Jupyter notebook, and you can browse for more projects on my Github.
