Learning Insincerity of Quora Questions using LSTM Networks

Ravish Chawla
Published in ML 2 Vec
6 min read · Mar 26, 2020

One of the interesting datasets I recently found on Kaggle is the Insincere Questions dataset from Quora. It consists of text questions asked on Quora over a period of time, each labeled as Insincere or Sincere based on its content. As a human, it is usually easy to tell whether a question is sincere, because people can use platforms like Quora to ask inflammatory questions that are meant only to provoke or garner a response when none is necessary. These can be political or religious questions, or questions targeted at a specific audience.

In my investigation of this dataset, I worked through multiple tasks, including cleaning the dataset for proper NLP training, building and training a Neural Network model, and tuning the model by trying different network configurations and layers. My final model, which uses an Attention-based network, achieved an F1 score above 0.80 on training and above 0.60 on a validation dataset. In this post, I will walk through the different network models I tried and how well they worked.

In this post, I will cover the following topics:

  • Data Cleaning and Augmentation
  • Training a Neural Network on the dataset
  • Tuning the Network and Attention Models

Data Cleaning and Augmentation

To start our investigation, let’s look at our dataset and our raw Natural Language questions from Quora. Our dataset consists of two main attributes, the text of the question and the label indicating the insincerity.

Distribution of Data Labels in the Training Dataset

First of all, we look at the distribution of the labels in the dataset. We can see there is a large imbalance between the labels, with only about 6% of questions carrying the Insincere label. This can be a problem during training because it biases the model toward the majority class. We can use Data Augmentation techniques to add examples and reduce this bias.

Let’s look at a few example sentences of both types of labels:

What will happen to the induced EMF in an alternator when mechanical torque is increased. — Label: Sincere

What are the best conditions for growing tomatoes? — Label: Sincere

If Pixar was going to make a nother dystopian, or apocalyptic movie, what would the story be about? — Label: Sincere

<Redacted> people, please minimize your paper usage and carbon footprint. Do ya’ll have any idea the damage that is done? — Label: Insincere

Why does quora only link to <redacted> newspapers with an <redacted> agenda? — Label: Insincere

From 1 to 10, how racist is it to call a <redacted> a "<redacted>"? <redacted> are a stereotype associated with <redacted>. — Label: Insincere

I decided to redact many of the identifiable words in the insincere questions, because they involve political and religious discrimination. For a human, it is easy to tell the sincerity of a question: sincere questions ask for legitimate information, while insincere questions try to provoke disagreement between members.

One more thing we can see is that there is a lot of non-conformity in the text. For instance, there is a mix of upper-cased and lower-cased words (Pixar vs. quora), misspellings (a nother), and contractions (ya'll). To make the text data more uniform, we will use data cleaning techniques, which I covered earlier in a different post.

To evaluate the data cleaning methods used, we will use a word embeddings dictionary, GloVe, to check how many of the individual words in our dataset are covered by the embeddings vocabulary.

Checking Coverage of our Data Vocabulary
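The full check is in the linked notebook; below is a minimal sketch of this kind of coverage check, assuming the GloVe vectors have been loaded into a dictionary keyed by word (the function name check_coverage is illustrative):

from collections import Counter

def check_coverage(sentences, embeddings):
    # Count every word in the corpus, then see which ones have an embedding vector
    vocab = Counter(word for sentence in sentences for word in sentence.split())
    known = {word: count for word, count in vocab.items() if word in embeddings}
    vocab_coverage = len(known) / len(vocab)
    text_coverage = sum(known.values()) / sum(vocab.values())
    print('Vocabulary coverage: {:.2%}, text coverage: {:.2%}'.format(vocab_coverage, text_coverage))
    return vocab_coverage, text_coverage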

The multiple data cleaning techniques I used are the following:

Data Cleaning Techniques
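The exact steps are in the notebook linked below; as a rough illustration of the kinds of cleaning described above (lowercasing, expanding contractions, separating punctuation), a cleaning function might look like this, where the contraction map is an illustrative subset rather than the full mapping used:

import re

# Illustrative subset of a contraction mapping; the full list is in the notebook
contraction_map = {"ya'll": "you all", "can't": "can not", "won't": "will not"}

def clean_text(text):
    text = text.lower()
    for contraction, expansion in contraction_map.items():
        text = text.replace(contraction, expansion)
    # Separate punctuation from words so tokens line up with the embedding vocabulary
    text = re.sub(r"([?!.,\"'])", r" \1 ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()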

The code and notebook for the data cleaning methods are in the GitHub repository for this project, linked here.

The second problem we face is label bias. One label heavily outnumbers the other, which we can address by using Data Augmentation to add new examples of the insincere label. For Data Augmentation, we will use word embeddings (this time, FastText) to find synonyms for all words in our vocabulary. Then, we will iterate through the sentences and replace each word with a synonym with 50% probability. This generates new example sentences that are slightly different from the originals.

Data Augmentation
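As a rough sketch of this synonym-replacement step, assuming the FastText vectors are loaded as a gensim KeyedVectors object (the function name augment_sentence and the 0.5 default are as described above):

import random

def augment_sentence(sentence, fasttext_vectors, p=0.5):
    new_words = []
    for word in sentence.split():
        if word in fasttext_vectors.key_to_index and random.random() < p:
            # Take the nearest neighbor in embedding space as a synonym
            synonym = fasttext_vectors.most_similar(word, topn=1)[0][0]
            new_words.append(synonym)
        else:
            new_words.append(word)
    return ' '.join(new_words)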

With Data Augmentation, Insincere labels now make up 12% of our dataset. We can still add more examples by running this process multiple times, but we’ll use these for now.

Training a Neural Network on the Dataset

For this project, we use PyTorch as the deep learning library. Since this is an NLP problem, the model should use GRU or LSTM layers so that it can learn context between words. As shown earlier, word embeddings help the model understand word context based on vector similarity, so we will structure our network to take input from an Embeddings layer. Let's look at the basic model we will be training:

The first model we train has the following architecture:

  1. Embeddings Layer
  2. 1st LSTM Layer
  3. 1st Dense Fully Connected Layer
  4. ReLU Activation
  5. 2nd LSTM Layer
  6. Global Max-Average Pooling Layer
  7. 2nd Dense Fully Connected Layer

We add a Binary Cross Entropy Loss Function and an Adam Optimizer for training.
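As a minimal PyTorch sketch of this architecture (the hidden size of 64 and the class name are illustrative choices, and embedding_matrix is assumed to be a NumPy array of pre-trained vectors):

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=64):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        # 1. Embeddings layer, initialized from the pre-trained vectors and frozen
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        # 2-5. LSTM -> Dense -> ReLU -> LSTM
        self.lstm1 = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.relu = nn.ReLU()
        self.lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # 7. Final dense layer takes the concatenated max- and average-pooled features
        self.fc2 = nn.Linear(hidden_size * 2, 1)

    def forward(self, x):
        emb = self.embedding(x)                      # (batch, seq_len, embed_dim)
        out, _ = self.lstm1(emb)
        out = self.relu(self.fc1(out))
        out, _ = self.lstm2(out)
        # 6. Global max and average pooling over the sequence dimension
        max_pool, _ = torch.max(out, dim=1)
        avg_pool = torch.mean(out, dim=1)
        pooled = torch.cat([max_pool, avg_pool], dim=1)
        return self.fc2(pooled)                      # raw logits, one per question

model = LSTMClassifier(embedding_matrix)
criterion = nn.BCEWithLogitsLoss()                   # binary cross entropy on the logits
optimizer = torch.optim.Adam(model.parameters())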

Next, let’s look at training. To train the model, we iterate over batches of the prepared dataset and run them through the model, updating the weights at each step. To score the model, we use multiple metrics, including accuracy, precision, recall, and especially the F1 score. The F1 score, the harmonic mean of precision and recall, is important here because it gives a score that is not dominated by the large imbalance between our two labels.

This code has been condensed for readability. You can review the full training code on GitHub to see how the metrics are computed and how evaluation is done.

At each batch iteration, we load the X and Y inputs, run them through the model to obtain predictions, compute the loss, and backpropagate to update the weights.
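A condensed sketch of that loop might look like the following, assuming the model, criterion, and optimizer from the earlier sketch, and standard PyTorch DataLoaders named train_loader and val_loader (both names are assumptions for illustration):

from sklearn.metrics import f1_score

for epoch in range(5):  # 5 epochs, as described below
    model.train()
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(x_batch).squeeze(1)
        loss = criterion(logits, y_batch.float())
        loss.backward()       # backpropagate
        optimizer.step()      # update weights

    # Evaluate with the F1 score on the validation set after each epoch
    model.eval()
    predictions, labels = [], []
    with torch.no_grad():
        for x_batch, y_batch in val_loader:
            probs = torch.sigmoid(model(x_batch).squeeze(1))
            predictions.extend((probs > 0.5).long().tolist())
            labels.extend(y_batch.long().tolist())
    print('Epoch {}: validation F1 = {:.3f}'.format(epoch + 1, f1_score(labels, predictions)))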

After training the model for 5 epochs, it achieved an F1 score of 0.637 on training and 0.540 on validation.

Tuning the Network and Attention Models

After seeing the results of the basic model with 2 LSTM layers, I tried different configurations to see how it could be improved. There were small improvements with each configuration update:

1. Model with 128 hidden units (2x as before) on the first layer, trained for 5 epochs

With this configuration, Training score on 5 epochs came out to 0.625 and Validation to 0.567. The model showed only slight improvement.

2. Same model trained for 10 epochs

With additional epochs, the training score came out to 0.661 and validation to 0.558. The model starts overfitting at this point, with a small dip on the validation data.

3. Model with a Self-Attention Layer

Attention models come from Neural Machine Translation, where they are used for training sequence-to-sequence models. In this project, I use a related mechanism called Self-Attention. With Self-Attention, the model can look at other positions in the sentence to learn a better encoding for each word: the hidden vector of each word incorporates the meaning and context of the other words that are relevant to it.
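As a minimal sketch of such a layer, applied on top of the LSTM outputs from the earlier model (this is a simple additive self-attention; the exact layer used in the project is in the GitHub code):

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # One score per sequence position, computed from its hidden state
        self.attention = nn.Linear(hidden_size, 1)

    def forward(self, lstm_out):
        # lstm_out: (batch, seq_len, hidden_size)
        scores = self.attention(lstm_out)                # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)           # attention weight per word
        # Weighted sum of the hidden states, emphasizing the relevant positions
        context = torch.sum(weights * lstm_out, dim=1)   # (batch, hidden_size)
        return context, weights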

Using attention, the training F1 score came out to 0.771 and the validation F1 score to 0.601 in 5 epochs. With additional epochs, overfitting occurs, although the training F1 reaches 0.815.

Additional configurations were also tried, such as using GloVe embeddings instead of FastText.

Summary of Results

With the final training completed, the highest validation F1 score obtained on unseen data was 0.601. This is lower than my target F1 score, and lower than many submissions on the public Kaggle leaderboard for this competition. However, many of those submissions use techniques not explored here, including transfer learning, data augmentation by pulling in answers from Quora, and BERT Transformers. For the first two, I was limited by hardware resources, and Transformers were not explored because I do not yet understand them well enough. These will be looked at in future work, although perhaps with a different dataset.

With this project, my main takeaway was investigating a dataset from start to finish. Starting from the raw, highly imbalanced dataset, I completed data preprocessing, data augmentation, training, and tuning.

GitHub

All code for this project is available on GitHub at https://github.com/ravishchawla/Reinforcement-Learning-Navigation

Licensing, Authors, Acknowledgements

Credit to Kaggle for providing the data and environment. You can find the licensing for the data and other descriptive information on Kaggle. This code is free to use.
