Finding Similar Quora Questions with Word2Vec and XGBoost
How to use Natural Language Processing to identify similar records in any text data set
Last week, we explored different de-duplication techniques for identifying similar documents, using BOW, TF-IDF, and XGBoost. We found that traditional methods such as TF-IDF can achieve impressive results. That is one of the reasons Google has long used TF-IDF in indexing and information retrieval to determine the importance of a given keyword to a given page.
To continue our learning journey and grow our skills, today we will explore how to solve the same matching and de-duplication problem with a different method. Once again, we will tackle de-duplication as a classification task. Let’s get started!
The Data
The Quora duplicate question pair task is to determine whether two questions have the same meaning. The data contains pairs of questions along with a ground-truth label, assigned by human experts, indicating whether the pair is a duplicate. Note that these labels are subjective: human experts will not always agree on whether a given pair is a duplicate. The labels should therefore be treated as informed judgments rather than 100% accurate ground truth.
import pandas as pd

df = pd.read_csv('quora_train.csv')
df = df.dropna(how="any").reset_index(drop=True)

# Preview the first 10 question pairs.
a = 0
for i in range(a, a + 10):
    print(df.question1[i])
    print(df.question2[i])
    print()
Computing The Word Mover’s Distance (WMD)
WMD is a method that allows us to assess the “distance” between two documents in a meaningful way, even when they have no words in common. It uses word2vec embeddings of words and measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of the other. Let’s look at an example; the following question pair is labeled as a duplicate:
question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'

# stop_words is assumed to be NLTK's English stopword list.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

question1 = question1.lower().split()
question2 = question2.lower().split()
question1 = [w for w in question1 if w not in stop_words]
question2 = [w for w in question2 if w not in stop_words]
We will be using word2vec vectors pre-trained on the Google News corpus. We load these into a Gensim KeyedVectors class.
import gensim
from gensim.models import KeyedVectors
model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
Now let’s compute the WMD of these two sentences using the wmdistance method. Remember, these two sentences express the same meaning and are labeled as duplicates in the original Quora data.
distance = model.wmdistance(question1, question2)
print('distance = %.4f' % distance)
distance = 1.8293
The computed distance between these two sentences is pretty large, even though they are duplicates. This brings us to normalized WMD.
Normalizing word2vec vectors
When using the wmdistance method, it is beneficial to normalize the word2vec vectors first, so they all have equal length.
model.init_sims(replace=True)
distance = model.wmdistance(question1, question2)
print('normalized distance = %.4f' % distance)
normalized distance = 0.7589
After normalization, the distance became much smaller.
Let’s try one more pair; this time, the two questions are not duplicates.
question3 = 'Why am I mentally very lonely? How can I solve it?'
question4 = 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?'

question3 = question3.lower().split()
question4 = question4.lower().split()
question3 = [w for w in question3 if w not in stop_words]
question4 = [w for w in question4 if w not in stop_words]

distance = model.wmdistance(question3, question4)
print('distance = %.4f' % distance)
distance = 1.2637
model.init_sims(replace=True)
distance = model.wmdistance(question3, question4)
print('normalized distance = %.4f' % distance)
normalized distance = 1.2637
After normalization, the distance remains the same (the earlier init_sims call with replace=True had already normalized the vectors in place, so a second call has no effect). Either way, WMD considers the 2nd pair much less similar than the 1st pair. It worked!
FuzzyWuzzy
We have covered some basics of Fuzzy String Matching in Python; let’s take a quick look at whether FuzzyWuzzy can help with our question de-dupe problem.
from fuzzywuzzy import fuzz

question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'
fuzz.ratio(question1, question2)
53
fuzz.partial_token_set_ratio(question1, question2)
100
question3 = 'Why am I mentally very lonely? How can I solve it?'
question4 = 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?'
fuzz.ratio(question3, question4)
28
fuzz.partial_token_set_ratio(question3, question4)
37
FuzzyWuzzy does not consider the 2nd pair of questions similar. That’s good, because that pair is also not similar according to the human labels.
Feature Engineering
First, we create a few functions to compute WMD, normalized WMD, and a word-to-vector representation for each question; sketches of these helpers appear in the sections below.
The new features we will be creating are:
- The word count of each question.
- The character count of each question.
- The number of words question1 and question2 have in common.
- The length difference between question1 and question2.
- The Cosine distance between vectors question1 and question2.
- The City block (Manhattan) distance between vectors question1 and question2.
- The Jaccard distance between vectors question1 and question2.
- The Canberra distance between vectors question1 and question2.
- The Euclidean distance between vectors question1 and question2.
- The Minkowski distance between vectors question1 and question2.
- The Bray-Curtis distance between vectors question1 and question2.
- The skewness and kurtosis of vectors question1 and question2.
- The WMD between question1 and question2.
- The normalized WMD between question1 and question2.
All of the vector distance computations can be accomplished with the scipy.spatial.distance module.
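Before computing those distances, each question needs to be turned into a single vector. The post does not show this helper; a common approach, and the assumption made in the sketch below, is to average the word2vec vectors of the question’s non-stopword tokens and L2-normalize the result.

import numpy as np

def sent2vec(s):
    # Hypothetical helper: average the word2vec vectors of a question's
    # alphabetic, non-stopword tokens, then L2-normalize the result.
    words = [w for w in str(s).lower().split()
             if w not in stop_words and w.isalpha()]
    M = np.array([model[w] for w in words if w in model])
    if M.size == 0:
        return np.zeros(300)  # GoogleNews vectors are 300-dimensional
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())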
Word2vec Modeling
We will again use the word2vec model pre-trained on the Google News corpus, which I downloaded and saved into the “word2Vec_models” folder. We then load it into a Gensim KeyedVectors class.
model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
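The wmd helper used in the next line is not shown in the post; a minimal sketch, reusing the stop_words list from earlier and mirroring the preprocessing of the WMD example above:

def wmd(q1, q2):
    # Hypothetical helper: Word Mover's Distance between two questions,
    # computed on lowercased, stopword-filtered tokens.
    q1 = [w for w in str(q1).lower().split() if w not in stop_words]
    q2 = [w for w in str(q2).lower().split() if w not in stop_words]
    return model.wmdistance(q1, q2)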
df['wmd'] = df.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)
Normalized Word2vec Modeling
norm_model = gensim.models.KeyedVectors.load_word2vec_format('./word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)
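Here norm_wmd is assumed to be the same helper as wmd, just pointed at the normalized model:

def norm_wmd(q1, q2):
    # Hypothetical helper: WMD against the L2-normalized vectors in norm_model.
    q1 = [w for w in str(q1).lower().split() if w not in stop_words]
    q2 = [w for w in str(q2).lower().split() if w not in stop_words]
    return norm_model.wmdistance(q1, q2)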
df['norm_wmd'] = df.apply(lambda x: norm_wmd(x['question1'], x['question2']), axis=1)
Next, we get a vector for question1 and question2, then compute all the distances listed above, as sketched below.
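A sketch of that step, assuming the hypothetical sent2vec helper above (cosine and some of the other distances produce NaN for all-zero vectors, so those values should be filled or dropped before training):

import numpy as np
from scipy.spatial.distance import (cosine, cityblock, jaccard, canberra,
                                    euclidean, minkowski, braycurtis)
from scipy.stats import skew, kurtosis

# One 300-dimensional vector per question.
q1_vecs = np.array([sent2vec(q) for q in df.question1])
q2_vecs = np.array([sent2vec(q) for q in df.question2])

pairs = list(zip(q1_vecs, q2_vecs))
df['cosine_distance'] = [cosine(x, y) for x, y in pairs]
df['cityblock_distance'] = [cityblock(x, y) for x, y in pairs]
df['jaccard_distance'] = [jaccard(x, y) for x, y in pairs]
df['canberra_distance'] = [canberra(x, y) for x, y in pairs]
df['euclidean_distance'] = [euclidean(x, y) for x, y in pairs]
df['minkowski_distance'] = [minkowski(x, y, 3) for x, y in pairs]
df['braycurtis_distance'] = [braycurtis(x, y) for x, y in pairs]
df['skew_q1'] = [skew(v) for v in q1_vecs]
df['skew_q2'] = [skew(v) for v in q2_vecs]
df['kurtosis_q1'] = [kurtosis(v) for v in q1_vecs]
df['kurtosis_q2'] = [kurtosis(v) for v in q2_vecs]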
XGBoost
XGBoost trained on all the new features achieved 0.77 test accuracy, lower than the 0.80 of character-level TF-IDF + XGBoost. However, recall on duplicate questions increased from 0.67 to 0.73, a significant improvement.
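For reference, here is a minimal sketch of the kind of model behind those numbers, using xgboost’s scikit-learn API. The hyperparameters and the dropped column names (the standard Quora training columns) are assumptions, not the exact setup that produced the reported scores.

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Keep only the engineered feature columns (assumes standard Quora columns).
X = df.drop(columns=['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'])
y = df['is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = XGBClassifier(max_depth=6, n_estimators=500, learning_rate=0.1)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))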
The Jupyter notebook can be found on GitHub. Have a productive week!
Reference:
https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/