Quora Insincere Question Classification

Detect Toxic Content to Improve Online Conversations

Priyanka Patel
Analytics Vidhya
19 min read · Oct 1, 2019


In this blog I’ll explain how I classified questions from the Quora Insincere Questions dataset using both machine learning and deep learning methods.

Overview Of The Problem:

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions — those founded upon false premises, or that intend to make a statement rather than looking for helpful answers.

Quora has come up with a competition where we develop models that identify and flag insincere questions.

Problem Statement:

Predict whether a question asked on Quora is sincere or not.

Evaluation Metrics:

The metric is the F1 score between the predicted and the observed targets. There are just two classes, but the positive class makes up just over 6% of the total, so the target is highly imbalanced. This is why a metric such as F1, which combines the precision and the recall of the classifier as F1 = 2 · (precision · recall) / (precision + recall), is appropriate for this kind of problem.


Data Overview:

Quora provided a good amount of training and test data for identifying insincere questions. The training data consists of about 1.3 million rows and 3 columns.

File descriptions

  • train.csv — the training set
  • test.csv — the test set
  • embeddings — pre-trained word embeddings (GloVe, fastText wiki-news, Paragram and GoogleNews Word2Vec) provided with the competition

Data fields

  • qid — unique question identifier
  • question_text — Quora question text
  • target — a question labeled “insincere” has a value of 1, otherwise 0

Exploratory Data Analysis:

Load Train and Test Data-set:

First, load the train and test datasets and check their shape and the number of data points in each.
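A minimal loading sketch, assuming pandas and the file names from the competition data page:

```python
import pandas as pd

# File names as provided by the Kaggle competition.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print("Train shape:", train_df.shape)   # roughly (1306122, 3)
print("Test shape:", test_df.shape)
train_df.head()
```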

Distribution of data points among output classes:

We can see that the dataset is highly imbalanced, with only 6.19% insincere questions.
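Continuing from the loading snippet above, the imbalance can be checked with a simple value count:

```python
# Share of insincere (target = 1) questions in the training set.
class_counts = train_df["target"].value_counts()
print(class_counts)
print("Insincere fraction: {:.2%}".format(class_counts[1] / class_counts.sum()))
```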

Basic Feature Engineering:

We can add some features as part of the feature engineering pipeline for the Quora Insincere Questions Classification challenge.

Some features that I have included are listed below:

  • freq_qid = Frequency of qid
  • qlen = Length of Question (in characters)
  • n_words = Number of words in Question
  • numeric_words = Number of numeric words in Question
  • sp_char_words = Number of special characters in Question
  • unique_words = Number of unique words in Question
  • char_words = Number of characters in Question

I added the above features because they help us understand the data better and determine which engineered features are useful and which ones to discard.
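A sketch of how such features could be computed with pandas; the exact definitions below are my reading of the list above, not the author’s original code:

```python
import re

def add_basic_features(df):
    # Add the simple count-based features listed above to a copy of the dataframe.
    df = df.copy()
    df["freq_qid"] = df.groupby("qid")["qid"].transform("count")
    df["qlen"] = df["question_text"].str.len()
    df["n_words"] = df["question_text"].str.split().str.len()
    df["numeric_words"] = df["question_text"].apply(
        lambda q: sum(w.isdigit() for w in q.split()))
    df["sp_char_words"] = df["question_text"].apply(
        lambda q: len(re.findall(r"[^\w\s]", q)))
    df["unique_words"] = df["question_text"].apply(
        lambda q: len(set(q.lower().split())))
    df["char_words"] = df["question_text"].str.replace(" ", "").str.len()
    return df

train_df = add_basic_features(train_df)
```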

Data Preprocessing:

The text data is not entirely clean, thus we need to apply some data preprocessing techniques.

Preprocessing techniques for Data Cleaning:

  1. Removing Punctuation

The data contains a number of special characters; we’ll use replace to remove them.
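A small sketch of this step; the exact character set is an assumption:

```python
import string

# Characters to strip: string.punctuation plus a few common unicode symbols.
PUNCTS = string.punctuation + "“”‘’—…«»"

def clean_punctuation(text):
    # Replace every special character with a space so words are not glued together.
    for p in PUNCTS:
        text = text.replace(p, " ")
    return text

clean_punctuation("Why can’t we just be nice?!")  # -> "Why can t we just be nice  "
```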

2. Cleaning Numbers

3. Correcting Misspelled Words

For better embedding coverage we’ll replace misspelled words using a misspell mapping and regex functions.
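A sketch of the mapping-plus-regex idea; the dictionary entries below are only illustrative:

```python
import re

# A tiny illustrative mapping; the real dictionary would cover many more terms.
MISPELL_MAP = {"colour": "color", "centre": "center", "qoura": "quora",
               "whta": "what", "narcisist": "narcissist"}

_mispell_re = re.compile("(%s)" % "|".join(map(re.escape, MISPELL_MAP.keys())))

def replace_misspellings(text):
    # Substitute each known misspelling with its corrected form.
    return _mispell_re.sub(lambda m: MISPELL_MAP[m.group(0)], text)

replace_misspellings("whta is qoura?")  # -> "what is quora?"
```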

4. Removing Contractions

Contractions are words written with an apostrophe (e.g. “don’t”, “I’m”); we expand them to their full forms.

5. Removing Stopwords

6. Stemming

Stemming is the process of reducing words to their base forms using crude heuristic rules. For example, one rule could be to remove ‘s’ from the end of any word, so that ‘cats’ becomes ‘cat’.

7. Lemmatization

Lemmatization is very similar to stemming but it aims to remove endings only if the base form is present in a dictionary.
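Both steps can be illustrated with NLTK (my choice of library, not necessarily the author’s):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet") may be needed once for the lemmatizer.

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["cats", "studies", "better", "running"]
print([stemmer.stem(w) for w in words])          # ['cat', 'studi', 'better', 'run']
print([lemmatizer.lemmatize(w) for w in words])  # ['cat', 'study', 'better', 'running']
```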

Once the processing functions are defined, we apply the steps above to clean the text in both the train and test data.
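A sketch of how the cleaning functions from the sketches above could be chained and applied to both dataframes:

```python
def clean_text(text):
    # Chain the cleaning steps described above (only two are shown here).
    text = text.lower()
    text = clean_punctuation(text)      # step 1: punctuation
    text = replace_misspellings(text)   # step 3: misspellings
    # ... numbers, contractions, stopwords, stemming/lemmatization go here
    return text

for df in (train_df, test_df):
    df["question_text"] = df["question_text"].apply(clean_text)
```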

Analysis of Extracted Features:

From the word cloud we can see that words like Muslim, Trump, black, Indian, etc. appear frequently in insincere questions.

After applying Data preprocessing and cleaning, our text is ready for Classification. I have applied both Conventional and Deep Learning Methods for Classification.

Let’s first understand Machine Learning Methods of Classification.

Machine Learning Methods

Advanced NLP Text Processing:

1. Bag of Words (CountVectorizer)

Bag of Words (BoW) is a representation of text that describes the presence of words within the text data. The intuition is that two similar text fields will contain similar words and will therefore have similar bags of words; further, from the words alone we can learn something about the meaning of the document.

CountVectorizer converts a collection of text documents to a matrix of token counts. For this I have selected the n-gram range to be 1–3 and min_df to be 3 for building the vocabulary.

We run these features for machine learning models like Logistic Regression, Naive Bayes and LightGBM.
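As an illustration, here is a sketch using scikit-learn with the stated settings (ngram_range=(1, 3), min_df=3); the validation split and the Logistic Regression settings are my own choices, not the author’s exact setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    train_df["question_text"], train_df["target"],
    test_size=0.2, stratify=train_df["target"], random_state=42)

# Bag-of-words features with the settings described above.
bow = CountVectorizer(ngram_range=(1, 3), min_df=3)
X_train_bow = bow.fit_transform(X_train)
X_val_bow = bow.transform(X_val)

clf = LogisticRegression(solver="liblinear", class_weight="balanced")
clf.fit(X_train_bow, y_train)
print("F1:", f1_score(y_val, clf.predict(X_val_bow)))
```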

2. Term Frequency — Inverse Document Frequency (TF-IDF)

Term Frequency (tf): gives the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document. Each document has its own tf.

Inverse Document Frequency (idf): used to weight rare words across all documents in the corpus; words that occur rarely in the corpus have a high idf score. It is given by idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents that contain the word w.

Combining these two gives the TF-IDF score for a word in a document in the corpus; it is simply the product tf-idf(w) = tf(w) × idf(w).

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Here, too, the n-gram range is 1–3 and min_df is 3 for building the vocabulary.
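A corresponding sketch, reusing the X_train/X_val split from the bag-of-words example above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 3), min_df=3)
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
# The same Logistic Regression / Naive Bayes / LightGBM models can now be
# fitted on these features exactly as with the bag-of-words matrix.
```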

We again run these features through machine learning models such as Logistic Regression, Naive Bayes and LightGBM.

3. Hashing Feature (HashingVectorizer)

HashingVectorizer is designed to be as memory efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that once vectorized, the features’ names can no longer be retrieved.

HashingVectorizer converts a collection of text documents to a matrix of token occurrences. Here 2**10 is the number of features (columns) in the output matrix.
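A short sketch with scikit-learn, again reusing the earlier split (alternate_sign=False is my choice, so the counts stay non-negative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_train_hash = hv.transform(X_train)   # hashing is stateless, so no fit is required
X_val_hash = hv.transform(X_val)
```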

For this, I tried Logistic Regression and LightGBM.

4. Word2Vec Feature (Word Embeddings)

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.

Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer and one output layer.

For this too, I tried Logistic Regression and LightGBM.
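One common way to turn pre-trained Word2Vec embeddings into question-level features is to average the word vectors; this sketch assumes gensim and the GoogleNews file shipped with the competition (the path is an assumption):

```python
import numpy as np
from gensim.models import KeyedVectors

# Path is an assumption; the competition provides this file in the embeddings folder.
w2v = KeyedVectors.load_word2vec_format(
    "embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin",
    binary=True)

def question_vector(text, dim=300):
    # Average the Word2Vec vectors of the words found in the vocabulary.
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_train_w2v = np.vstack([question_vector(q) for q in X_train])
X_val_w2v = np.vstack([question_vector(q) for q in X_val])
```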

Now that we have seen the conventional methods in detail, let’s see how I applied deep learning methods.

Deep Learning Methods

I’ll explain the deep learning models that I tried for this classification: TextCNN, BiLSTM and Attention.

Note: For all models, sigmoid is used as the activation function in the output layer, and the models are compiled with the Adam optimizer and binary cross-entropy as the loss function.

1. TextCNN

TextCNN mainly uses a one-dimensional convolutional layer and a max-over-time pooling layer. Suppose the input text sequence consists of n words, and each word is represented by a d-dimensional word vector. Then the input example has a width of n, a height of 1, and d input channels. The calculation of TextCNN can be divided into the following steps:

  1. Define multiple one-dimensional convolution kernels and use them to perform convolution calculations on the inputs. Convolution kernels with different widths may capture the correlation of different numbers of adjacent words.
  2. Perform max-over-time pooling on all output channels, and then concatenate the pooling output values of these channels in a vector.
  3. The concatenated vector is transformed into the output for each category through the fully connected layer. A dropout layer can be used in this step to deal with over-fitting.

A 2D convolution layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. I chose 36 as the dimensionality of the output space (the number of filters) and a list of filter sizes that specifies the height of the Conv2D window, its width being the embedding dimension. ReLU is used as the activation and he_normal as the kernel initializer, and max pooling is applied on top.
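A minimal tensorflow.keras sketch of such a TextCNN; maxlen, vocabulary size, embedding dimension and the list of filter sizes are assumptions, while the 36 filters, ReLU, he_normal and max pooling follow the description above:

```python
from tensorflow.keras import layers, models

maxlen, vocab_size, embed_dim = 70, 50000, 300   # assumptions
num_filters, filter_sizes = 36, [1, 2, 3, 5]

inp = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inp)
x = layers.Reshape((maxlen, embed_dim, 1))(x)    # (width n, channels d) as a 2D "image"

pooled = []
for fs in filter_sizes:
    conv = layers.Conv2D(num_filters, kernel_size=(fs, embed_dim),
                         activation="relu", kernel_initializer="he_normal")(x)
    # Max-over-time pooling collapses each feature map to a single value.
    pooled.append(layers.MaxPool2D(pool_size=(maxlen - fs + 1, 1))(conv))

z = layers.Concatenate(axis=1)(pooled)
z = layers.Flatten()(z)
z = layers.Dropout(0.1)(z)
out = layers.Dense(1, activation="sigmoid")(z)

model = models.Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```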

This model gave an f1-score of 0.6101.

2. Bidirectional LSTM

Bidirectional LSTMs are supported in Keras via the Bidirectional layer wrapper.

This wrapper takes a recurrent layer (e.g. the first LSTM layer) as an argument.

It also allows you to specify the merge mode, that is how the forward and backward outputs should be combined before being passed on to the next layer.

A bidirectional RNN is effectively just two RNNs, where one gets fed the sequence forward while the other gets the sequence fed backward.

[Figure: A bidirectional RNN]

For the simplest explanation of a bidirectional RNN, think of an RNN cell as a black box that takes a hidden state (a vector) and a word vector as input and gives out an output vector and the next hidden state. The box has some weights, which are tuned by backpropagation of the loss. The same cell is applied to all the words, so the weights are shared across the words in the sentence; this is called weight sharing.

For a sequence of length 4, like “The quick brown fox”, the RNN cell gives 4 output vectors, which can be concatenated and then used as part of a dense feedforward architecture.

In a bidirectional RNN, the only change is that we read the text in the usual order as well as in reverse. So we stack two RNNs in parallel, and hence we get 8 output vectors to append.

Once we get the output vectors, we send them through a series of dense layers and finally a softmax layer to build a text classifier.

Here 64 is the size (dimension) of the hidden state vector as well as of the output vector. Setting return_sequences=True keeps the output for the entire sequence. So what is the output dimension of this layer? 70 (maxlen) × 128, i.e. 64 × 2 from concatenating the two directions.

Note: CuDNNLSTM is a fast implementation of the LSTM layer in Keras that only runs on a GPU.
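A minimal tensorflow.keras sketch of the BiLSTM described above; maxlen, vocabulary size and embedding dimension are assumptions, and layers.LSTM is used in place of CuDNNLSTM (in TF 2 it dispatches to the fused cuDNN kernel automatically on GPU):

```python
from tensorflow.keras import layers, models

maxlen, vocab_size, embed_dim = 70, 50000, 300   # assumptions

inp = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inp)
# Bidirectional wrapper concatenates forward and backward outputs: (maxlen, 128).
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(16, activation="relu")(x)
x = layers.Dropout(0.1)(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```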

The BiLSTM model gave an F1-score of 0.6272.

3. Attention Models

Attention was presented by Dzmitry Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate” that reads as a natural extension of their previous work on the Encoder-Decoder model.

Attention is proposed as a solution to the limitation of the Encoder-Decoder model encoding the input sequence to one fixed length vector from which to decode each output time step. This issue is believed to be more of a problem when decoding long sequences.

Attention is proposed as a method to both align and translate.

Alignment is the problem in machine translation that identifies which parts of the input sequence are relevant to each word in the output, whereas translation is the process of using the relevant information to select the appropriate output.

We want to create scores for every word in the text, which is the attention similarity score for a word.

To do this, we start with a weight matrix (W), a bias vector (b) and a context vector (u), all of which are learned by the optimization algorithm. On this note, I’d like to highlight something I like a lot about neural networks: if you don’t know some parameters, let the network learn them. We only have to worry about designing the architecture and choosing the parameters to tune.

Then there is a series of mathematical operations. We can think of u1 as a nonlinearity applied to the RNN output for a word, u1 = tanh(W·h1 + b). Then v1 is the exponential of the dot product of u1 with the context vector u, v1 = exp(u1 · u); intuitively, v1 will be high if u1 and u are similar. Since we want the scores to sum to 1, we divide each v by the sum of all v’s to get the final scores s.

These final scores are then multiplied by RNN output for words to weight them according to their importance. After which the outputs are summed and sent through dense layers and softmax for the task of text classification.
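As a rough sketch, the scoring just described can be packaged as a custom Keras layer; the shapes and initializers here are my assumptions, not the author’s exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

class Attention(layers.Layer):
    # Additive attention over RNN outputs, following the description above (a sketch).

    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="att_W", shape=(dim, dim),
                                 initializer="glorot_uniform")
        self.b = self.add_weight(name="att_b", shape=(dim,),
                                 initializer="zeros")
        self.u = self.add_weight(name="att_u", shape=(dim, 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):                                          # h: (batch, timesteps, dim)
        u1 = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)  # nonlinearity on RNN output
        v = tf.exp(tf.tensordot(u1, self.u, axes=1))            # similarity with context vector u
        s = v / tf.reduce_sum(v, axis=1, keepdims=True)         # scores sum to 1 (a softmax over time)
        return tf.reduce_sum(s * h, axis=1)                     # importance-weighted sum of outputs

# Usage after a Bidirectional LSTM that returns sequences:
#   x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
#   x = Attention()(x)
```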

Note: CuDNNLSTM is a fast implementation of the LSTM layer in Keras that only runs on a GPU.

This model gave an F1-score of 0.6305, which is better than any other model I tried.

Result

The scores reported above were obtained with 5-fold stratified cross-validation.

Future Work

For all of the above models, I did not do any hyperparameter tuning. You can try to improve performance by tuning hyperparameters with Hyperopt, grid search or random search.

Conclusion:

The deep learning model, a CuDNNLSTM with Attention, performed better than any other model, giving an F1-score of 0.6305.

For this classification problem, I first performed the necessary data preprocessing and cleaning, such as removing punctuation, contractions and stopwords, replacing misspelled words, and stemming and lemmatizing the text.

After that I applied machine learning classification methods such as Naive Bayes, Logistic Regression and LightGBM, using sklearn’s text feature extraction methods (CountVectorizer, TF-IDF, HashingVectorizer) and Word2Vec embeddings.

To get better results I also tried deep learning models such as TextCNN and Bidirectional LSTM, with and without Attention.

The code for this implementation is available via the link.

Also, a big shoutout to mlwhiz, whose blog on this project helped me implement it.

Finally, thanks to all of you for reading my blog.

References:

https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/

https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39
