How to find similar Quora questions with Word2Vec+XGBoost #Part-2

Manish Pawar
4 min read · Dec 14, 2018


Earlier, you guys saw how to build a Machine Learning model to classify whether question pairs are duplicates or not, using a Bag-of-Words + XGBoost model (#Part-1).

So now we're here to take a different approach: we'll tackle the de-duplication task as an extension of the classifier.

If you've read Part-1 of this, we'll be using the same data as before.

import pandas as pd

df = pd.read_csv('quora_train.csv')

Now, we will be using WMD(Word mover’s distance). WMD is a method that allows us to assess the “distance” between two documents in a meaningful way, even when they have no words in common. It uses word2vec vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in k-nearest neighbours classification.
Let's see an example: take a question pair from the dataset that is labelled as duplicate.
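Here's a minimal sketch that grabs one pair labelled as duplicate from our dataframe and tokenizes it (Gensim's wmdistance expects lists of words):

dup = df[df.is_duplicate == 1].iloc[0]
question1 = str(dup.question1).lower().split()
question2 = str(dup.question2).lower().split()
print(question1)
print(question2)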

To use WMD we need word embeddings trained on some corpus, so we'll use word2vec vectors pre-trained on the Google News corpus. You can download it here. We load these into a Gensim KeyedVectors model. (If some of these jargon terms are just bouncing over your brain, I've got a link for each of them at the end.)

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)

Now let's do WMD…

distance = model.wmdistance(question1, question2)
print('distance = %.4f' % distance)

which outputs:
distance = 1.773

We see that the distance is quite large, so we need to normalize.

Let’s normalize word2vec vectors first, so they all have similar length.

model.init_sims(replace=True)
distance = model.wmdistance(question1, question2)
print('normalized distance = %.4f' % distance)

which prints:
normalized distance = 0.6473
Feels better now!
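What init_sims(replace=True) actually does is L2-normalize every word vector in place, so each one ends up with unit length. A quick way to convince yourself (assuming 'king' is in the vocabulary):

import numpy as np

print(np.linalg.norm(model['king']))  # ~1.0 after init_sims(replace=True)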

Now for some feature engineering: we create functions to compute WMD, normalized WMD, and a word2vec representation (sentence vector) for each question.

import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def wmd(q1, q2):
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return model.wmdistance(q1, q2)

def norm_wmd(q1, q2):
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return norm_model.wmdistance(q1, q2)

def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except KeyError:
            continue
    # convert to a numpy array (NOTE THAT MANY ERRORS OCCUR BECAUSE OF NOT CONVERTING TO NUMPY)
    M = np.array(M)
    # sum the word vectors and L2-normalize to get one 300-d sentence vector
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
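A quick sanity check on a made-up pair (the strings here are just hypothetical examples):

print(wmd('How do I learn Python?', 'What is the best way to learn Python?'))
print(sent2vec('How do I learn Python?').shape)  # (300,)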

Featurization includes:
the character and word lengths of question1 and question2, the difference in length, the number of common words between them, WMD, and normalized WMD.

df['len_q1'] = df.question1.apply(lambda x: len(str(x)))
df['len_q2'] = df.question2.apply(lambda x: len(str(x)))
df['diff_len'] = df.len_q1 - df.len_q2
df['len_char_q1'] = df.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df['len_char_q2'] = df.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df['len_word_q1'] = df.question1.apply(lambda x: len(str(x).split()))
df['len_word_q2'] = df.question2.apply(lambda x: len(str(x).split()))
df['common_words'] = df.apply(lambda x: len(set(str(x['question1']).lower().split()).intersection(set(str(x['question2']).lower().split()))), axis=1)
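A quick peek to make sure the new columns look sane:

print(df[['len_q1', 'len_q2', 'diff_len', 'common_words']].head())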

Now we apply the WMD function to every question pair:

df['wmd'] = df.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)

Of course, we apply the normalized version too:

norm_model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)
df['norm_wmd'] = df.apply(lambda x: norm_wmd(x['question1'], x['question2']), axis=1)

We also need sentence vectors for question1 and question2:

from tqdm import tqdm_notebook

question1_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question1.values)):
    question1_vectors[i, :] = sent2vec(q)

question2_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question2.values)):
    question2_vectors[i, :] = sent2vec(q)

# XGBoost needs numeric features, so drop the raw text columns first
df = df.drop(['question1', 'question2'], axis=1)
X = df.loc[:, df.columns != 'is_duplicate']
y = df.loc[:, df.columns == 'is_duplicate']
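Notice that question1_vectors and question2_vectors aren't part of X yet. One way to fold them in, sketched below following the Abhishek Thakur article linked at the end (he adds several such distances; only cosine is shown here), is to use distances between the two sentence vectors as extra features:

from scipy.spatial.distance import cosine

# nan_to_num guards against empty questions, for which sent2vec returns NaN
df['cosine_distance'] = [cosine(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors),
                                                        np.nan_to_num(question2_vectors))]
X = df.loc[:, df.columns != 'is_duplicate']  # rebuild X so it picks up the new column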

We now split into train and test sets…

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Now’s the time to use XGBoost and predict

import xgboost as xgb

model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', eta=0.3, silent=1,
                          subsample=0.8).fit(X_train, y_train.values.ravel())

prediction = model.predict(X_test)

Accuracy & formalities…

from sklearn.metrics import accuracy_score, classification_report

print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))

The accuracy we got with BoW was almost 80%, and here with W2V we get 77%.
Still, this approach matters: because it computes distances between word embeddings, it can handle question pairs that share few or no words, which BoW cannot, so even though it's slightly less accurate here, it's the better approach.

USEFUL LINKS:

part-1 of this

https://xgboost.readthedocs.io/en/latest/

https://www.tensorflow.org/tutorials/representation/word2vec

https://stats.stackexchange.com/questions/85930/difference-in-meaning-of-these-terms-dataset-vs-corpus

https://www.journaldev.com/19279/python-gensim-word2vec

REFERENCE : https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/

Originally published at blog.lipishala.com on December 14, 2018.
