How Quora suggests similar questions using Machine Learning

Soumya Gupta · Published in Analytics Vidhya · 10 min read · Jan 16, 2020

Quora (/ˈkwɔːrə/) is an American question-and-answer website where questions are asked, answered, and edited by Internet users, either factually or in the form of opinions — Wikipedia.

With over 300 million people visiting Quora every month, it’s no surprise that many people ask duplicate questions, that is, questions that have the same intent.

For example, “How can I be a good geologist?” and “What should I do to be a great geologist?” are duplicate questions because they have the same intent and should only be answered once.

Quora tries to eliminate duplicate questions by suggesting similar existing questions to the user, like so:

Source: Quora

In this post, we’ll try to address this issue using Machine Learning and NLP techniques. Let’s start the process step by step and dive deep into the problem.

Problem statement

  • Identify which questions asked on Quora are duplicates of questions that have already been asked.
  • Our task is to predict whether a pair of questions are duplicates or not.

In Machine Learning terms: this is a binary classification problem. For a given pair of questions, we need to predict whether they are duplicates, with log-loss as the performance metric.

Step 1. Exploratory Data Analysis (EDA)

Let’s explore and visualize our data before applying any Machine Learning algorithm. This gives us a deeper understanding of the dataset and helps us add new features tailored to the problem statement, a process also known as feature engineering.

  • Data is present in a file Train.csv
  • Train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate (1 = duplicate, 0 = not duplicate).
  • Number of rows in Train.csv = 404,290

Let’s look at the data in the CSV file with the help of a pandas DataFrame.
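A minimal sketch of loading and inspecting the data (assuming the file is named Train.csv as above):

import pandas as pd

df = pd.read_csv("Train.csv")
print(df.shape)    # expect 404,290 rows
df.head()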

a) Distribution of Data Points

Now let’s look at the distribution of data points among the output classes. We can see that roughly 63 percent of question pairs are not duplicates and about 37 percent are duplicates.

df.groupby("is_duplicate")['id'].count().plot.bar()

# 63%
print('Question pairs are not Similar (is_duplicate = 0):\n {}%'.format(100 - round(df['is_duplicate'].mean()*100, 2)))

# 36%
print('\n~> Question pairs are Similar (is_duplicate = 1):\n {}%'.format(round(df['is_duplicate'].mean()*100, 2)))

b) Checking for NULL values

We have two rows that contain null values; we’ll fill them with an empty string for consistency.

nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows)

# Filling the null values with ''
df = df.fillna('')

Step 2. Feature Engineering

It’s time to construct a few new features given below. This process is also known as feature engineering and it’s one of the most important steps in building a machine learning pipeline.

Once done, we’ll save a fresh CSV file (df_fe_without_preprocessing_train.csv) with all the features below.

  • freq_qid1 = Frequency of qid1's
  • freq_qid2 = Frequency of qid2's
  • q1len = Length of q1
  • q2len = Length of q2
  • q1_n_words = Number of words in Question 1
  • q2_n_words = Number of words in Question 2
  • word_Common = Number of common unique words in Question 1 and Question 2
  • word_Total = Total number of unique words in Question 1 + total number of unique words in Question 2
  • word_share = word_Common / word_Total
  • freq_q1+q2 = Sum of the frequencies of qid1 and qid2
  • freq_q1-q2 = Absolute difference of the frequencies of qid1 and qid2
df['freq_qid1'] = df.groupby('qid1')['qid1'].transform('count')
df['freq_qid2'] = df.groupby('qid2')['qid2'].transform('count')
df['q1len'] = df['question1'].str.len()
df['q2len'] = df['question2'].str.len()
df['q1_n_words'] = df['question1'].apply(lambda row: len(row.split(" ")))
df['q2_n_words'] = df['question2'].apply(lambda row: len(row.split(" ")))

def normalized_word_Common(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * len(w1 & w2)

df['word_Common'] = df.apply(normalized_word_Common, axis=1)

def normalized_word_Total(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * (len(w1) + len(w2))

df['word_Total'] = df.apply(normalized_word_Total, axis=1)

def normalized_word_share(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * len(w1 & w2) / (len(w1) + len(w2))

df['word_share'] = df.apply(normalized_word_share, axis=1)

df['freq_q1+q2'] = df['freq_qid1'] + df['freq_qid2']
df['freq_q1-q2'] = abs(df['freq_qid1'] - df['freq_qid2'])

# Saving to CSV file
df.to_csv("df_fe_without_preprocessing_train.csv", index=False)
DataFrame after adding the new features

Step 3. Preprocessing of text

After completing Step 2, we define a helper function that cleans the raw text data. It performs the following steps:

  • Removing HTML tags
  • Removing Punctuations
  • Performing stemming
  • Removing Stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# init objects
tokenizer = RegexpTokenizer(r'\w+')
en_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

def getStemmedReview(review):
    review = review.lower()
    review = review.replace("<br /><br />", " ")
    # Tokenize
    tokens = tokenizer.tokenize(review)
    # Drop stopwords, then stem what remains
    new_tokens = [token for token in tokens if token not in en_stopwords]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]
    clean_review = ' '.join(stemmed_tokens)
    return clean_review

In the above function, we lowercase the text, break it into tokens, keep only the tokens that are not in the predefined stopword list, and finally stem each token to its root form. The function then returns the cleaned text.
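Applying this helper to both question columns is then a one-liner per column (a sketch; the author's notebook may wire it up slightly differently):

df['question1'] = df['question1'].astype(str).apply(getStemmedReview)
df['question2'] = df['question2'].astype(str).apply(getStemmedReview)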

Step 4. Advanced Feature Extraction (NLP and Fuzzy Features)

Now we are going to add some more features using FuzzyWuzzy, a handy Python library for fuzzy string matching. Let’s look at a few definitions first, and then at the features themselves.

Definition:

  • Token: obtained by splitting a sentence on spaces
  • Stop_Word: stop words as per NLTK
  • Word: a token that is not a stop_word

Features:

cwc_min : Ratio of common_word_count to the minimum word count of Q1 and Q2
cwc_min = common_word_count / min(len(q1_words), len(q2_words))

cwc_max : Ratio of common_word_count to the maximum word count of Q1 and Q2
cwc_max = common_word_count / max(len(q1_words), len(q2_words))

csc_min : Ratio of common_stop_count to the minimum stopword count of Q1 and Q2
csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))

csc_max : Ratio of common_stop_count to the maximum stopword count of Q1 and Q2
csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))

ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2
ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))

ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2
ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))

last_word_eq : Check whether the last word of both questions is the same
last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])

first_word_eq : Check whether the first word of both questions is the same
first_word_eq = int(q1_tokens[0] == q2_tokens[0])

abs_len_diff : Absolute difference in token length
abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))

mean_len : Average token length of both questions
mean_len = (len(q1_tokens) + len(q2_tokens)) / 2

fuzz_ratio : https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

fuzz_partial_ratio : https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

token_sort_ratio : https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

token_set_ratio : https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

longest_substr_ratio : Ratio of the length of the longest common substring to the minimum token count of Q1 and Q2
longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
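As an illustration, here is a hedged sketch of computing a few of these features for a single question pair. The token-based counts follow the definitions above; the helper name fuzzy_features is mine, not the author's:

from fuzzywuzzy import fuzz

def fuzzy_features(q1, q2):
    q1_tokens = q1.split()
    q2_tokens = q2.split()
    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return {}
    common_token_count = len(set(q1_tokens) & set(q2_tokens))
    eps = 0.0001  # safety term to avoid division by zero
    return {
        'ctc_min': common_token_count / (min(len(q1_tokens), len(q2_tokens)) + eps),
        'ctc_max': common_token_count / (max(len(q1_tokens), len(q2_tokens)) + eps),
        'first_word_eq': int(q1_tokens[0] == q2_tokens[0]),
        'last_word_eq': int(q1_tokens[-1] == q2_tokens[-1]),
        'abs_len_diff': abs(len(q1_tokens) - len(q2_tokens)),
        'mean_len': (len(q1_tokens) + len(q2_tokens)) / 2,
        'fuzz_ratio': fuzz.QRatio(q1, q2),
        'fuzz_partial_ratio': fuzz.partial_ratio(q1, q2),
        'token_sort_ratio': fuzz.token_sort_ratio(q1, q2),
        'token_set_ratio': fuzz.token_set_ratio(q1, q2),
    }

print(fuzzy_features("how can i be a good geologist",
                     "what should i do to be a great geologist"))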

After adding the above features, our final data frame will look like the one below.

Pair plot of the new features ['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio']

n = df.shape[0]
sns.pairplot(df[['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio', 'is_duplicate']][0:n],
             hue='is_duplicate', vars=['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio'])
plt.show()

We can see that the blue and orange points are fairly well separated. There is some overlap, but these look like useful features for building our model.

Step 5. Featurizing text data with TF-IDF

To feed the data to the Machine Learning model, we have to convert categorical data, such as text or words, into a numerical form.

We are going to use TfidfVectorizer for this purpose, which is available in the scikit-learn library. For more details, please refer to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

We are going to vectorize our question data and then merge the newly extracted features with the vectorized ones. After these steps, our data frame will look like the one below. Note that we now have 797 columns in total for our ML model.
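A minimal sketch of the vectorization step (the max_features value is an illustrative assumption, not the setting behind the author's 797-column frame):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit one vectorizer on all question text so both columns share a vocabulary
tfidf = TfidfVectorizer(max_features=300)   # illustrative choice
tfidf.fit(pd.concat([df['question1'], df['question2']]).astype(str))

q1_tfidf = tfidf.transform(df['question1'].astype(str))   # sparse matrix, one row per question
q2_tfidf = tfidf.transform(df['question2'].astype(str))

These sparse vectors are then concatenated column-wise with the hand-crafted features from Steps 2 and 4 to form the final feature table.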

Step 6. Splitting our data for training and testing

After vectorization, we can assign columns to X and y.

X will have all the columns except Unnamed, id, and is_duplicate, whereas our y will have the is_duplicate column only.
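A hedged sketch of that assignment (the exact 'Unnamed: 0' column name is an assumption coming from the re-saved CSV; errors='ignore' keeps the drop safe if it is absent):

from sklearn.model_selection import train_test_split

y = df['is_duplicate']
X = df.drop(['Unnamed: 0', 'id', 'is_duplicate'], axis=1, errors='ignore')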

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Step 7. Applying different Machine Learning models

Finally, it’s time to apply ML models to find the best one for the least value of log-loss.

  • Model1: Logistic Regression

Here we are using the SGDClassifier from scikit-learn with loss="log".

alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.
For values of alpha = 1e-05 The log loss is: 0.592800211149
For values of alpha = 0.0001 The log loss is: 0.532351700629
For values of alpha = 0.001 The log loss is: 0.527562275995
For values of alpha = 0.01 The log loss is: 0.534535408885
For values of alpha = 0.1 The log loss is: 0.525117052926
For values of alpha = 1 The log loss is: 0.520035530431
For values of alpha = 10 The log loss is: 0.521097925307
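The tuning loop that produces numbers like these might look roughly as follows (a sketch under the assumption that plain SGDClassifier probabilities are used; the author's notebook may add probability calibration):

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

for a in alpha:
    # loss="log" gives logistic regression; newer scikit-learn versions call it "log_loss"
    clf = SGDClassifier(loss="log", alpha=a, penalty="l2", random_state=42)
    clf.fit(X_train, y_train)
    predicted_y = clf.predict_proba(X_test)
    print("For values of alpha =", a, "The log loss is:", log_loss(y_test, predicted_y))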

We can see that for alpha = 1 the log-loss is minimum.

So we choose alpha=1 as our hyper-parameter.

For values of best alpha = 1 The train log loss is: 0.513842874233
For values of best alpha = 1 The test log loss is: 0.520035530431

Now, since our train and test log-loss are almost the same, we can conclude that our model is not overfitting.

  • Model2: Linear SVM Model

Here we are going to use loss="hinge" with the same SGDClassifier.
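Since the hinge loss does not produce probabilities directly, one way to obtain log-loss numbers like those below is to wrap the classifier in CalibratedClassifierCV; this is a sketch of that idea, not necessarily the author's exact code:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

for a in [10 ** x for x in range(-5, 2)]:
    svm = SGDClassifier(loss="hinge", alpha=a, penalty="l2", random_state=42)
    # sigmoid (Platt) calibration maps the SVM's decision scores to probabilities
    clf = CalibratedClassifierCV(svm, method="sigmoid")
    clf.fit(X_train, y_train)
    predicted_y = clf.predict_proba(X_test)
    print("For values of alpha =", a, "The log loss is:", log_loss(y_test, predicted_y))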

For values of alpha = 1e-05 The log loss is: 0.657611721261
For values of alpha = 0.0001 The log loss is: 0.489669093534
For values of alpha = 0.001 The log loss is: 0.521829068562
For values of alpha = 0.01 The log loss is: 0.566295616914
For values of alpha = 0.1 The log loss is: 0.599957866217
For values of alpha = 1 The log loss is: 0.635059427016
For values of alpha = 10 The log loss is: 0.654159467907

For alpha = 0.0001 we get the minimum log-loss, so we build our model with this alpha.

Results we got are:

For values of best alpha = 0.0001 The train log loss is: 0.478054677285
For values of best alpha = 0.0001 The test log loss is: 0.489669093534

Here the log-loss values are better than in our previous model, and since the train and test log-loss are almost the same, we can conclude that this model is not overfitting either.

  • Model3: XGBoost

The last model we are going to use is XGBoost, with hyperparameters that have already been tuned. Readers can experiment with other parameters as well.
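Since the original snippet is not reproduced here, the following is only an illustrative sketch; the hyperparameter values are placeholders, not the author's tuned ones:

import xgboost as xgb
from sklearn.metrics import log_loss

# Placeholder hyperparameters, for illustration only
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 5,
}
d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)

bst = xgb.train(params, d_train, num_boost_round=400)
predicted_y = bst.predict(d_test)   # probabilities of the positive class
print("Test log loss:", log_loss(y_test, predicted_y))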

With the above-tuned parameters, we get a log-loss of 0.36.

Final Result

After trying out various models, we can see that XGBoost gives us the minimum log-loss.

Conclusion and insights towards deploying the model in production

Life Cycle of ML Model

We saw that XGBoost gives us the minimum log-loss. Now comes an interesting question:

How is this model going to behave in production?

Let’s say a new question comes up. Are we going to compare it with all the questions already present on Quora? You guessed it right: that would be a recipe for disaster and would cause major latency issues.

Approach

A fair solution to the problem would look something like this:

Once a new question comes up, we first find a set of similar questions already present on Quora. We can find those questions using inverted indices, a technique used by search engines such as Google for information retrieval.

Let’s say we found 5 similar questions for our newly asked question. Now our asked question will be paired with all these 5 questions and each set will be given as an input to our model.

For example Datapoint1: (q,q1) - Probability of similarity p1
Datapoint2: (q,q2) - Probability of similarity p2
Datapoint3: (q,q3) - Probability of similarity p3
Datapoint4: (q,q4) - Probability of similarity p4
Datapoint5: (q,q5) - Probability of similarity p5

Based on a threshold value, let’s say 90%, we can select which of the given 5 questions have a probability of similarity with the newly asked question greater than 90%.

Once we find the questions with more than 90% similarity to the asked one, the user is prompted to check the answers to those questions before posting a new one.
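To make the flow concrete, here is a tiny sketch of candidate retrieval with an inverted index followed by thresholding on the model's probability. All names here, including existing_questions and predict_similarity, are illustrative, not Quora's or the author's implementation:

from collections import defaultdict

# Build an inverted index: word -> set of question ids containing it
inverted_index = defaultdict(set)
for qid, text in existing_questions.items():      # existing_questions: {qid: question text}
    for word in text.lower().split():
        inverted_index[word].add(qid)

def candidate_questions(new_question, top_k=5):
    """Return up to top_k existing questions sharing the most words with the new one."""
    counts = defaultdict(int)
    for word in new_question.lower().split():
        for qid in inverted_index.get(word, ()):
            counts[qid] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:top_k]

def suggest_duplicates(new_question, threshold=0.90):
    """Pair the new question with each candidate and keep those scored above the threshold."""
    suggestions = []
    for qid in candidate_questions(new_question):
        p = predict_similarity(new_question, existing_questions[qid])  # trained model, illustrative name
        if p > threshold:
            suggestions.append((qid, p))
    return suggestions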

Thank you for reading this far. In this post, we learned how to preprocess raw text data, how to add new features, how to perform exploratory data analysis, and finally how to apply various Machine Learning models against a given metric.

I hope you found this tutorial useful. I’m curious about what you think so hit me with some comments. You can also get in touch with me directly through email or connect with me on LinkedIn.
