Detecting duplicate questions on Quora: beating Stanford's accuracy 🎉

Anchit Jain · Data Science 101 · Feb 3, 2019 · 5 min read

Recently, I came across a very interesting problem and thought of sharing it with you all. The problem was to identify whether two questions are duplicates of each other. For example, let’s take these two questions: “How old are you?” and “What is your age?”. Though these questions have no word in common, they have the same intent.

So I thought of building a deep-learning-based model to detect such duplicate questions.

Business Use Case

This model can help a variety of businesses that moderate and consolidate user-generated content.

For example, finding the similarity between texts can greatly reduce human effort in businesses like call centers.

A few days back, I also came across another interesting problem from Quora: detecting toxic content in questions, which can help keep online platforms safe. I plan to work on that problem sometime soon!

Dataset

Coming back to the duplicate questions task — I’ve used the dataset provided by Quora. The data set is available on Kaggle.

To give you a brief introduction to the data set, it has the following fields:

  1. id: The id of the question pair (one per row).
  2. qid1: A unique number identifying question 1.
  3. qid2: A unique number identifying question 2.
  4. question1: The full text of question 1.
  5. question2: The full text of question 2.
  6. is_duplicate: 1 if the questions have the same meaning, otherwise 0.

Using a combination of natural language processing and deep learning, we will see how far this approach can take us. Since this is supervised learning, we need a set of features and a target variable to train our model; here we can ignore all columns other than “question1”, “question2”, and “is_duplicate”.

You could also try a more traditional approach like TF-IDF: an easy solution using sklearn that is going to work fine. It generally involves these steps (a minimal sketch follows the list):

  • Creating a vector representation of each text using TfidfVectorizer.
  • Fitting the vectorizer on your data (after the usual series of preprocessing steps).
  • Transforming new entries with the previously fitted vectorizer.
  • Computing the cosine similarity between the vector representations of the elements in your data set.
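Here is a minimal sketch of that baseline. The example questions and the idea of thresholding the score are my own illustration, not from the original post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy question pairs (illustrative, not from the dataset).
q1 = ["How old are you?", "How do I learn Python?"]
q2 = ["What is your age?", "What is the best way to learn Python?"]

# Fit one vectorizer over all text so both sides share the same
# vocabulary and IDF weights.
vectorizer = TfidfVectorizer().fit(q1 + q2)
sims = cosine_similarity(vectorizer.transform(q1),
                         vectorizer.transform(q2)).diagonal()

# A threshold (e.g. 0.7) turns the score into a duplicate / not-duplicate call.
for a, b, s in zip(q1, q2, sims):
    print(f"{s:.2f}  {a!r} <-> {b!r}")
```

Note how the first pair scores near zero despite having the same intent: with no shared tokens, TF-IDF has nothing to match on. That is exactly the weakness the approach below avoids.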

This approach works well when the data preprocessing has been done carefully, but I wanted to try something that doesn't require much pre-processing.

For better understanding, I have broken the entire process into segments. Let's check them out.

  1. Analyzing the data.

Checking the first few samples of the data.

Then pulling some information that helps summarize the data.
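A quick pandas sketch of these checks (the file path is my assumption; adjust it to where you keep the Kaggle data):

```python
import pandas as pd

# Load the Kaggle training file (path assumed).
df = pd.read_csv("train.csv")

print(df.head())  # first few question pairs
df.info()         # column types, non-null counts, memory usage
```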

From past experience I knew class imbalance can be a challenge, so with that in mind I plotted a bar graph to visualize the class distribution.
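Something along these lines (a sketch, continuing from the DataFrame loaded above):

```python
import matplotlib.pyplot as plt

# Bar graph of the class distribution to check for imbalance.
counts = df["is_duplicate"].value_counts().sort_index()
counts.plot(kind="bar")
plt.xticks([0, 1], ["not duplicate", "duplicate"], rotation=0)
plt.ylabel("number of question pairs")
plt.title("Class distribution")
plt.show()
```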

2. Creating embeddings.

A deep learning model cannot understand raw text, speech, or images; every such input first needs to be vectorized. For this use case I have used the universal-sentence-encoder, and as the name suggests, it creates an embedding for the entire sentence rather than working word by word.

Let me tell you a bit more about this awesome embedding technique. Internally it uses a deep averaging network (DAN) encoder. For more insight, see the research paper, which you can access here.

Moving ahead with the implementation: you just pass in a string and get back an embedding of shape (512,), which you can feed directly as input to your algorithm; here I have used a neural network.
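A minimal sketch of that call, using the TF 1.x-style Hub API (which matches the global-variable initialization mentioned in the training steps below; the module URL is the public TF Hub one):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (DAN variant) from TF Hub.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

sentences = ["How old are you?", "What is your age?"]
with tf.Session() as sess:
    # The Hub module has its own variables and lookup tables to initialize.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))

print(vectors.shape)  # (2, 512): one 512-dimensional vector per sentence
```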

3. Building a neural network.

Once we have successfully created the embedding of each sentence, we concatenate the embeddings and feed them into my all-time favourite: a neural network. A short description of each layer will make you appreciate the beauty of this architecture (a Keras sketch follows the list).

  1. Input layer: takes question1 and question2 as the model's two inputs.
  2. Lambda layer: using Keras's Lambda layer, the sentence embedding is applied to each input before it enters the rest of the model.
  3. Concatenate layer: concatenates the embeddings of question1 and question2.
  4. Batch normalization: the distribution of each layer's inputs changes during training as the parameters of the previous layers change, so I normalize the inputs at each layer.
  5. Dense layer: a fully connected layer on the combined embedding.
  6. Dropout layer: added after the hidden layer to avoid overfitting.
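A sketch of that architecture in Keras (the hidden-layer size, dropout rate, and activations are my assumptions; the original post does not state them):

```python
import tensorflow as tf
import tensorflow_hub as hub
from keras.layers import Input, Lambda, Concatenate, BatchNormalization, Dense, Dropout
from keras.models import Model

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

def universal_embedding(x):
    # Each input arrives as a (batch, 1) string tensor; squeeze it to
    # (batch,) before handing it to the encoder.
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1))

q1_in = Input(shape=(1,), dtype=tf.string)           # question1
q2_in = Input(shape=(1,), dtype=tf.string)           # question2
e1 = Lambda(universal_embedding, output_shape=(512,))(q1_in)
e2 = Lambda(universal_embedding, output_shape=(512,))(q2_in)

x = Concatenate()([e1, e2])                          # (batch, 1024)
x = BatchNormalization()(x)                          # stabilize layer inputs
x = Dense(256, activation="relu")(x)                 # fully connected (size assumed)
x = Dropout(0.3)(x)                                  # fight overfitting (rate assumed)
out = Dense(2, activation="softmax")(x)              # [not duplicate, duplicate]

model = Model(inputs=[q1_in, q2_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```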

4. Training and predicting

After building the model we are now ready to train it. A little preparation of the data won't hurt much! (A condensed sketch of these steps follows the list.)

  1. Splitting the features (question1 & question2) and the target (is_duplicate) 80–20 into training and validation sets.
  2. Converting them into lists.
  3. Creating a one-hot encoding of the target variable by converting the categorical variable into dummy variables; a simple pandas.get_dummies does it in one go.
  4. Initializing the global variables required by the TensorFlow Hub module.
  5. Finally, training the model, saving the weights after every epoch.
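Putting those five steps together (a sketch that continues from the model defined above; the file names and random seed are illustrative):

```python
import numpy as np
import pandas as pd
import tensorflow as tf
import keras.backend as K
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint

df = pd.read_csv("train.csv").dropna(subset=["question1", "question2"])

# One-hot encode the target in one go with get_dummies.
target = pd.get_dummies(df["is_duplicate"]).values

# 80-20 split of the features and the target.
q1_tr, q1_va, q2_tr, q2_va, y_tr, y_va = train_test_split(
    df["question1"].tolist(), df["question2"].tolist(), target,
    test_size=0.2, random_state=42)

# The string inputs must be shaped (batch, 1) to match the Input layers.
def to_col(questions):
    return np.array(questions, dtype=object)[:, None]

with tf.Session() as sess:
    K.set_session(sess)
    # Initialize the Hub module's global variables and lookup tables.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    checkpoint = ModelCheckpoint("weights-{epoch:02d}.h5",
                                 save_weights_only=True)
    model.fit([to_col(q1_tr), to_col(q2_tr)], y_tr,
              validation_data=([to_col(q1_va), to_col(q2_va)], y_va),
              epochs=20, batch_size=512, callbacks=[checkpoint])
```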

After training for 20 epochs with a batch size of 512, I am very happy to share that I achieved higher accuracy than a paper by Stanford University on the same dataset. Paper here.

For visualizing the accuracy and loss we have a great tool, comet.ml. It lets you track hyperparameters, environment packages, and GPU usage, automatically creates graphs for machine learning analysis, and makes experiments easy to reproduce.

I achieved a training accuracy of 88.24% and a validation accuracy of 85.18%, which can be clearly seen in the accuracy graph. The maximum accuracy reported in the aforementioned paper was only 83.8%.

To check how the model judges the intent of questions, let's try some random test questions.

Let’s see what the prediction looks like…
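A sketch of the prediction step, continuing from the training code above (the weights file name is illustrative):

```python
# Two new questions with the same intent.
test_q1 = to_col(["How old are you?"])
test_q2 = to_col(["What is your age?"])

with tf.Session() as sess:
    K.set_session(sess)
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    model.load_weights("weights-20.h5")  # weights saved by the checkpoint above
    probs = model.predict([test_q1, test_q2])

print(probs)  # e.g. [[0.08, 0.92]]: the second column is P(duplicate)
```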

Closing notes: it can be seen from the above output that the questions “How old are you?” and “What is your age?” are duplicates. You can access the full code along with the output from here.

Thank you for your patience… Claps (echoing)!


Anchit Jain
Data Science 101

Machine learning engineer. Loves to work on deep-learning-based image recognition and NLP. Writing to share, because I was inspired when others did.