photo credit: Pixabay

Finding Similar Quora Questions with Word2Vec and Xgboost

How to use Natural Language Processing to identify similar records in any text data set

Susan Li
Towards Data Science
5 min read · Oct 29, 2018


Last week, we explored different techniques for de-duplication, identifying similar documents using BOW, TF-IDF, and Xgboost. We found that traditional methods such as TF-IDF can achieve impressive results; that’s one of the reasons Google has long used TF-IDF in indexing and information retrieval to figure out the importance of a given keyword to a given page.

To continue our learning journey and grow our skills, today we will explore how to solve the same matching and de-duplication problem with a different method. Once again, we will tackle de-duplication as an extension of a classifier. Let’s get started!

The Data

The task of the Quora duplicate question pairs dataset is to determine whether a pair of questions has the same meaning. The data contains pairs of questions and a ground-truth label, assigned by human experts, marking whether the two questions are duplicates. Please note that these labels are subjective, meaning that not all human experts would agree on whether a given pair is a duplicate. Hence, the labels should be taken as informed but not 100% accurate.

import pandas as pd

df = pd.read_csv('quora_train.csv')
df = df.dropna(how="any").reset_index(drop=True)
a = 0
for i in range(a, a + 10):
    print(df.question1[i])
    print(df.question2[i])
    print()
Figure 1

Computing The Word Mover’s Distance (WMD)

WMD is a method that allows us to assess the “distance” between two documents in a meaningful way, even when they have no words in common. It uses the word2vec vector embeddings of words and measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of the other document. Let’s look at an example; the following question pair is labeled as a duplicate:

from nltk.corpus import stopwords

# Stop words from NLTK (a common choice for this preprocessing)
stop_words = set(stopwords.words('english'))

question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'
question1 = question1.lower().split()
question2 = question2.lower().split()
question1 = [w for w in question1 if w not in stop_words]
question2 = [w for w in question2 if w not in stop_words]

We will be using word2vec embeddings pre-trained on the Google News corpus. We load them into a Gensim KeyedVectors model.

import gensim
from gensim.models import Word2Vec

model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)

Let’s compute the WMD of these two sentences using the wmdistance method. Remember, these two sentences express the same meaning, and they are labeled as duplicates in the original Quora data.

distance = model.wmdistance(question1, question2)
print('distance = %.4f' % distance)

distance = 1.8293

The computed distance between these two sentences is pretty large even though they are duplicates. This brings us to normalized WMD.

Normalizing word2vec vectors

When using the wmdistance method, it is beneficial to first normalize the word2vec vectors so they all have unit length.

model.init_sims(replace=True)
distance = model.wmdistance(question1, question2)
print('normalized distance = %.4f' % distance)

normalized distance = 0.7589

After normalization, the distance became much smaller.

Let’s try one more pair; this time, the two questions are not duplicates.

question3 = 'Why am I mentally very lonely? How can I solve it?'
question4 = 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?'
question3 = question3.lower().split()
question4 = question4.lower().split()
question3 = [w for w in question3 if w not in stop_words]
question4 = [w for w in question4 if w not in stop_words]
distance = model.wmdistance(question3, question4)
print('distance = %.4f' % distance)

distance = 1.2637

model.init_sims(replace=True)
distance = model.wmdistance(question3, question4)
print('normalized distance = %.4f' % distance)

normalized distance = 1.2637

After normalization, the distance remains the same (the vectors were already normalized in place by our earlier init_sims call). Either way, WMD considers the 2nd pair much less similar than the 1st pair. It worked!

FuzzyWuzzy

We have covered some basics of Fuzzy String Matching in Python before; let’s take a quick look at whether FuzzyWuzzy can help with our question de-duplication problem.

from fuzzywuzzy import fuzz

question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'
fuzz.ratio(question1, question2)

53

fuzz.partial_token_set_ratio(question1, question2)

100

question3 = 'Why am I mentally very lonely? How can I solve it?'
question4 = 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?'
fuzz.ratio(question3, question4)

28

fuzz.partial_token_set_ratio(question3, question4)

37

Basically, FuzzyWuzzy does not consider the 2nd pair of questions similar. That’s good, because the 2nd pair is not similar according to the human labels either.

Feature Engineering

First, we create a few helper functions: wmd and norm_wmd to compute the plain and normalized Word Mover’s Distance, and sent2vec to turn a question into a single vector representation.
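A minimal sketch of what these helpers might look like, assuming model and norm_model are the pre-trained Gensim models loaded in the next two sections (the exact implementations live in the notebook):

import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def wmd(q1, q2):
    # Word Mover's Distance on the raw (un-normalized) vectors
    q1 = [w for w in str(q1).lower().split() if w not in stop_words]
    q2 = [w for w in str(q2).lower().split() if w not in stop_words]
    return model.wmdistance(q1, q2)

def norm_wmd(q1, q2):
    # Same as wmd, but computed with the L2-normalized model
    q1 = [w for w in str(q1).lower().split() if w not in stop_words]
    q2 = [w for w in str(q2).lower().split() if w not in stop_words]
    return norm_model.wmdistance(q1, q2)

def sent2vec(s):
    # Average the word2vec vectors of all in-vocabulary, non-stopword tokens
    words = [w for w in word_tokenize(str(s).lower())
             if w not in stop_words and w.isalpha()]
    M = [model[w] for w in words if w in model]
    if len(M) == 0:
        return np.zeros(300)  # the Google News vectors have 300 dimensions
    v = np.array(M).sum(axis=0)
    return v / np.sqrt((v ** 2).sum())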

The new features we will be creating are:

  • The word count of each question.
  • The character count of each question.
  • The number of words question1 and question2 have in common.
  • The length difference between question1 and question2.
  • The cosine distance between the vectors of question1 and question2.
  • The city block (Manhattan) distance between the vectors of question1 and question2.
  • The Jaccard distance between the vectors of question1 and question2.
  • The Canberra distance between the vectors of question1 and question2.
  • The Euclidean distance between the vectors of question1 and question2.
  • The Minkowski distance between the vectors of question1 and question2.
  • The Bray-Curtis distance between the vectors of question1 and question2.
  • The skewness and kurtosis of the vectors of question1 and question2.
  • WMD.
  • Normalized WMD.

All of the distance computations can be accomplished with functions from scipy.spatial.distance; the length-based features are sketched below, and the vector distances follow in a later section.
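A sketch of the length-based features, with illustrative column names (the exact code lives in the notebook):

# Length-based features; column names are illustrative
df['len_word_q1'] = df.question1.apply(lambda x: len(str(x).split()))
df['len_word_q2'] = df.question2.apply(lambda x: len(str(x).split()))
df['len_char_q1'] = df.question1.apply(lambda x: len(str(x)))
df['len_char_q2'] = df.question2.apply(lambda x: len(str(x)))
df['common_words'] = df.apply(
    lambda x: len(set(str(x['question1']).lower().split()) &
                  set(str(x['question2']).lower().split())), axis=1)
df['diff_len'] = df.len_char_q1 - df.len_char_q2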


Word2vec Modeling

We will again use the word2vec embeddings pre-trained on the Google News corpus, which I downloaded and saved into the “word2Vec_models” folder. We then load them into a Gensim KeyedVectors model.

model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
df['wmd'] = df.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)

Normalized Word2vec Modeling

norm_model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)
df['norm_wmd'] = df.apply(lambda x: norm_wmd(x['question1'], x['question2']), axis=1)

Get vectors for question1 and question2 with sent2vec, then compute all the distances, as sketched below.
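A sketch of these computations, assuming the sent2vec helper above and illustrative column names:

import numpy as np
from scipy.spatial.distance import (cosine, cityblock, jaccard, canberra,
                                    euclidean, minkowski, braycurtis)
from scipy.stats import skew, kurtosis

question1_vectors = np.array([sent2vec(q) for q in df.question1])
question2_vectors = np.array([sent2vec(q) for q in df.question2])
pairs = list(zip(question1_vectors, question2_vectors))

df['cosine_distance'] = [cosine(x, y) for x, y in pairs]
df['cityblock_distance'] = [cityblock(x, y) for x, y in pairs]
df['jaccard_distance'] = [jaccard(x, y) for x, y in pairs]
df['canberra_distance'] = [canberra(x, y) for x, y in pairs]
df['euclidean_distance'] = [euclidean(x, y) for x, y in pairs]
df['minkowski_distance'] = [minkowski(x, y, 3) for x, y in pairs]
df['braycurtis_distance'] = [braycurtis(x, y) for x, y in pairs]
df['skew_q1vec'] = [skew(x) for x in question1_vectors]
df['skew_q2vec'] = [skew(x) for x in question2_vectors]
df['kur_q1vec'] = [kurtosis(x) for x in question1_vectors]
df['kur_q2vec'] = [kurtosis(x) for x in question2_vectors]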


Xgboost
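A minimal sketch of fitting an Xgboost classifier on the new features might look like the following; the hyperparameters and the features list here are illustrative, not necessarily those used in the notebook.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 'features' stands for the list of engineered columns created above (illustrative)
X = df[features]
y = df['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = xgb.XGBClassifier(max_depth=50, n_estimators=80,
                        learning_rate=0.1, colsample_bytree=0.7)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))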

Figure 2

The Xgboost trained on all the new features achieved a test accuracy of 0.77, which is lower than the 0.80 of Character-Level TF-IDF + Xgboost; however, we were able to increase recall on duplicate questions from 0.67 to 0.73, which is a significant improvement.

The Jupyter notebook can be found on Github. Have a productive week!

Reference:

https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/
