How to find similar Quora questions with Word2Vec+XGBoost #Part-2
Earlier, in Part 1, we saw how to build a machine learning model that classifies whether question pairs are duplicates, using a Bag-of-Words + XGBoost model.
This time we take a different approach and tackle de-duplication as an extension of that classifier.
If you have read Part 1, the data will look familiar: we’ll be using the same dataset as before.
import pandas as pd

df = pd.read_csv('quora_train.csv')
Now we will use Word Mover’s Distance (WMD). WMD is a method that lets us assess the “distance” between two documents in a meaningful way, even when they have no words in common. It uses word2vec vector embeddings of words, and it has been shown to outperform many state-of-the-art methods in k-nearest neighbours classification.
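To make the idea concrete, here is a minimal sketch of WMD on toy data. It assumes hypothetical 2-D vectors (invented for illustration, not real word2vec embeddings) and only covers the special case of two documents with the same number of words, where the cheapest transport is a one-to-one matching of words. The real WMD computed by Gensim solves a more general transport problem, so treat this as an intuition aid rather than the actual implementation.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D "embeddings", invented for illustration only.
TOY_VECTORS = {
    "obama": (1.0, 0.9), "president": (1.0, 1.0),
    "speaks": (0.0, 1.0), "greets": (0.1, 1.1),
    "media": (2.0, 0.0), "press": (2.1, 0.1),
    "band": (5.0, 5.0), "plays": (6.0, 5.0), "concert": (5.0, 6.0),
}


def toy_wmd(doc_a, doc_b, vectors=TOY_VECTORS):
    """WMD for two equal-length token lists, each word carrying weight 1/n.

    In this special case the cheapest way to "move" one document onto the
    other is a one-to-one matching of words, so we brute-force every
    permutation. This only scales to very short documents.
    """
    n = len(doc_a)
    if n == 0 or n != len(doc_b):
        raise ValueError("this sketch needs non-empty documents of equal length")
    best_total = min(
        sum(dist(vectors[a], vectors[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best_total / n


doc1 = ["obama", "speaks", "media"]
doc2 = ["president", "greets", "press"]  # no words in common with doc1
doc3 = ["band", "plays", "concert"]

print("doc1 vs doc2: %.4f" % toy_wmd(doc1, doc2))
print("doc1 vs doc3: %.4f" % toy_wmd(doc1, doc3))
```

Even though doc1 and doc2 share no words, their toy vectors sit close together, so their distance comes out much smaller than the distance between doc1 and doc3.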
Let’s look at an example. The following question pair is labelled as duplicate:
To use WMD we need word embeddings, so we’ll use word2vec vectors pre-trained on the Google News corpus. You can download it here. We load them into a Gensim KeyedVectors model. (If some of the jargon is new to you, there are links for each of these at the end.)
import gensim

# The Google News file is in the binary word2vec format, so binary=True is needed.
model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
Now compute the WMD:
distance = model.wmdistance(question1, question2)
print('distance = %.4f' % distance)
which outputs:
distance = 1.773
The distance is greater than 1, which is large for a pair labelled as duplicate, so we need to normalize.
Let’s normalize the word2vec vectors first, so they all have unit length.
# L2-normalize every vector in place (init_sims is deprecated in gensim 4.x).
model.init_sims(replace=True)
distance = model.wmdistance(question1, question2)
print('normalized distance = %.4f' % distance)
which prints:
normalized distance = 0.6473
That looks better!
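As a rough illustration of what the normalization step does, the sketch below scales a vector to unit Euclidean length in plain Python. The function name unit_normalize and the sample vectors are made up for this example; Gensim applies the equivalent operation to every vector in the model.

```python
from math import sqrt


def unit_normalize(vec):
    """Scale a vector to Euclidean (L2) length 1; zero vectors are returned unchanged."""
    norm = sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


short = [3.0, 4.0]    # length 5
long_ = [30.0, 40.0]  # same direction, length 50

print(unit_normalize(short))
print(unit_normalize(long_))
```

Both vectors point in the same direction, so after normalization they coincide: only the direction of a word vector is left to influence the distance, not its magnitude.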
Next comes some feature engineering: we create functions to compute the WMD, the normalized WMD and a word2vec representation of each question.
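import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

# Load the stop-word list once; sent2vec below relies on this module-level name.
stop_words = stopwords.words('english')

def wmd(q1, q2):
    # Lowercase, split on whitespace and drop stop words before computing WMD.
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return model.wmdistance(q1, q2)

def norm_wmd(q1, q2):
    # Same as wmd(), but uses the model with normalized vectors (norm_model).
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return norm_model.wmdistance(q1, q2)

def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except KeyError:
            # Skip words that are not in the word2vec vocabulary.
            continue
    # Convert to a NumPy array (many errors come from skipping this step).
    M = np.array(M)
    # Assumed completion (the original listing stops at the line above):
    # sum the word vectors and scale the result to unit length.
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())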
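The sentence-vector idea can be tried without downloading the Google News model. The sketch below assumes a tiny hypothetical vocabulary of 3-dimensional vectors and a hand-picked stop-word set, and it assumes the sentence vector is the sum of the word vectors scaled to unit length, which is one common way to finish a function like sent2vec. The names toy_sent2vec, TOY_MODEL and TOY_STOP_WORDS are invented for this example.

```python
from math import sqrt

# Hypothetical 3-D vectors and stop words, invented for illustration only.
TOY_MODEL = {
    "learn": [1.0, 0.0, 0.0],
    "python": [0.0, 1.0, 0.0],
    "fast": [0.0, 0.0, 1.0],
}
TOY_STOP_WORDS = {"how", "do", "i", "can"}


def toy_sent2vec(sentence, model=TOY_MODEL, stop_words=TOY_STOP_WORDS):
    """Sum the vectors of known, non-stop words and scale the sum to unit length."""
    words = [w for w in str(sentence).lower().split()
             if w.isalpha() and w not in stop_words and w in model]
    if not words:
        return None  # nothing in the sentence can be embedded
    dim = len(model[words[0]])
    total = [sum(model[w][k] for w in words) for k in range(dim)]
    norm = sqrt(sum(x * x for x in total))
    if norm == 0.0:
        return total  # the word vectors cancelled out
    return [x / norm for x in total]


print(toy_sent2vec("How do I learn Python"))
print(toy_sent2vec("How can I learn Python fast"))
```

Note that a question made up only of stop words or unknown words has no vector at all, which is a case worth handling before the vectors are used as features.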
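The features include:
The character length and word count of each question, the number of unique characters in each question, the number of common words between question1 and question2, the difference in length between question1 and question2, and the WMD and normalized WMD.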
# Character lengths and their difference
df['len_q1'] = df.question1.apply(lambda x: len(str(x)))
df['len_q2'] = df.question2.apply(lambda x: len(str(x)))
df['diff_len'] = df.len_q1 - df.len_q2
# Number of unique non-space characters
df['len_char_q1'] = df.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df['len_char_q2'] = df.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
# Word counts
df['len_word_q1'] = df.question1.apply(lambda x: len(str(x).split()))
df['len_word_q2'] = df.question2.apply(lambda x: len(str(x).split()))
# Number of distinct lowercase words shared by both questions
df['common_words'] = df.apply(lambda x: len(set(str(x['question1']).lower().split()).intersection(set(str(x['question2']).lower().split()))), axis=1)
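To check what these features look like for a single pair, here is a plain-Python sketch that mirrors the pandas expressions above on one made-up question pair. The function name pair_features and the example questions are hypothetical.

```python
def pair_features(q1, q2):
    """Compute the basic length and overlap features for one question pair."""
    q1, q2 = str(q1), str(q2)
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    return {
        "len_q1": len(q1),                            # characters, spaces included
        "len_q2": len(q2),
        "diff_len": len(q1) - len(q2),
        "len_char_q1": len(set(q1.replace(" ", ""))),  # unique non-space characters
        "len_char_q2": len(set(q2.replace(" ", ""))),
        "len_word_q1": len(q1.split()),
        "len_word_q2": len(q2.split()),
        "common_words": len(words1 & words2),         # shared lowercase words
    }


features = pair_features("How do I learn Python", "How can I learn Python fast")
for name, value in features.items():
    print(name, value)
```

Running it on one or two pairs by hand is a cheap way to confirm that each column measures what its name suggests before applying it to the whole DataFrame.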
Now we apply the WMD function to every question pair, using our Gensim model:
df['wmd'] = df.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)
We do the same with the normalized vectors:
# Load a second copy of the vectors and L2-normalize it in place.
norm_model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)
df['norm_wmd'] = df.apply(lambda x: norm_wmd(x['question1'], x['question2']), axis=1)
Next, we need the sentence vectors for question1 and question2:
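from tqdm import tqdm_notebook

# One 300-dimensional sentence vector per question.
question1_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question1.values)):
    question1_vectors[i, :] = sent2vec(q)

question2_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question2.values)):
    question2_vectors[i, :] = sent2vec(q)

# Features are every column except the label. XGBoost needs numeric input,
# so any raw text columns still in df have to be dropped before fitting.
X = df.loc[:, df.columns != 'is_duplicate']
y = df.loc[:, df.columns == 'is_duplicate']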
We now split the data into train and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
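Now it’s time to train XGBoost and predict:
import xgboost as xgb

# Note: this rebinds the name `model`, which previously held the word2vec vectors.
# eta is an alias for learning_rate in XGBoost, so setting both is redundant.
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', eta=0.3, silent=1,
                          subsample=0.8).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)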
Accuracy and the usual formalities:
from sklearn.metrics import accuracy_score, classification_report

print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
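For readers who want to see what these numbers mean, the sketch below computes accuracy, precision, recall and F1 for the duplicate class by hand. The labels are made up for illustration and are not results from the model, and binary_report is a hypothetical helper, not part of scikit-learn.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = duplicate, 0 = not duplicate.
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]

for name, value in binary_report(toy_true, toy_pred).items():
    print("%s = %.4f" % (name, value))
```

Accuracy alone can hide how the model treats each class, which is why the classification report breaks the result down into precision and recall.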
The accuracy we got with BoW was almost 80%, and here with W2V we got 77%.
Still, the result is significant: this kind of approach computes distances between word embeddings, so it can recognise similar questions even when they share no words. That makes it a worthwhile alternative even though it is less accurate here.
USEFUL LINKS :
https://xgboost.readthedocs.io/en/latest/
https://www.tensorflow.org/tutorials/representation/word2vec
https://www.journaldev.com/19279/python-gensim-word2vec
REFERENCE : https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/
Originally published at blog.lipishala.com on December 14, 2018.