Finding similar Quora questions with Bag of Words (BoW) + XGBoost #Part-1

Manish Pawar
Dec 14, 2018


As we all know, Quora is a question-and-answer website where questions are asked, answered, edited, and organized by its community of users.

We will explore Quora question analysis with different approaches and different optimizations across parts (#Part-1, #Part-2).

In September 2018, Quora reported hitting 300 million monthly users. And it’s no surprise that many people ask duplicate questions. For example, “websites to study deep learning?” and “Online sources for deep learning?” are duplicates because they both have the same intent.

So, for #Part-1, we’ll build a machine learning model to classify whether question pairs are duplicates or not, starting with a Bag of Words representation. We’ll pair it with XGBoost, which tends to perform well and has earned plenty of hype in the ML community as well as on Kaggle.

We will be using a dataset from Kaggle. It contains over 400K pairs of Quora questions.

# all imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# our dataset
df = pd.read_csv('quora_train.csv')
df.dropna(axis=0, inplace=True)  # drop rows with missing (empty) questions

Now, we have to clean our dataset. (Citation: https://www.kaggle.com/currie32/the-importance-of-cleaning-text)

We need to…

  • Expand abbreviations to their full terms.
  • Remove commas between digits (e.g. 15,000 → 15000).
  • Spell out special characters (e.g. $ → dollar).
  • Skip stemming, so word forms stay intact.
  • Remove punctuation.
  • Keep stop words, since words like “which” and “how” can carry strong emphasis in a question.

See, we gotta do a lot of work, so our code gets a bit lengthy…

import re
from string import punctuation

def clean(text):
    # normalize one question string; stemming and stop-word removal
    # are deliberately skipped (see the list above)
    if pd.isnull(text):
        return ''

    # expand contractions and abbreviations
    text = re.sub(r" whats ", " what is ", text, flags=re.IGNORECASE)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am", text, flags=re.IGNORECASE)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"e\.g\.", " eg ", text, flags=re.IGNORECASE)
    text = re.sub(r"b\.g\.", " bg ", text, flags=re.IGNORECASE)
    text = re.sub(r"(\d+)[kK]", r" \g<1>000 ", text)  # 4k -> 4000
    text = re.sub(r"e-mail", " email ", text, flags=re.IGNORECASE)
    text = re.sub(r"\(s\)", " ", text, flags=re.IGNORECASE)

    # remove comma between numbers, i.e. 15,000 -> 15000
    text = re.sub(r'(?<=[0-9])\,(?=[0-9])', "", text)

    # spell out special chars, we still need them later
    text = re.sub(r'\$', " dollar ", text)
    text = re.sub(r'\%', " percent ", text)
    text = re.sub(r'\&', " and ", text)

    # cleaning rules from: https://www.kaggle.com/currie32/the-importance-of-cleaning-text
    text = re.sub(r" (the[\s]+|The[\s]+)?US(A)? ", " America ", text)
    text = re.sub(r" UK ", " England ", text, flags=re.IGNORECASE)
    text = re.sub(r" india ", " India ", text)
    text = re.sub(r" switzerland ", " Switzerland ", text)
    text = re.sub(r" china ", " China ", text)
    text = re.sub(r" chinese ", " Chinese ", text)
    text = re.sub(r" imrovement ", " improvement ", text, flags=re.IGNORECASE)
    text = re.sub(r" intially ", " initially ", text, flags=re.IGNORECASE)
    text = re.sub(r" quora ", " Quora ", text, flags=re.IGNORECASE)
    text = re.sub(r" dms ", " direct messages ", text, flags=re.IGNORECASE)
    text = re.sub(r" demonitization ", " demonetization ", text, flags=re.IGNORECASE)
    text = re.sub(r" actived ", " active ", text, flags=re.IGNORECASE)
    text = re.sub(r" kms ", " kilometers ", text, flags=re.IGNORECASE)
    text = re.sub(r" cs ", " computer science ", text, flags=re.IGNORECASE)
    text = re.sub(r" upvote", " up vote", text, flags=re.IGNORECASE)
    text = re.sub(r" iPhone ", " phone ", text, flags=re.IGNORECASE)
    text = re.sub(r" \0rs ", " rs ", text, flags=re.IGNORECASE)
    text = re.sub(r" calender ", " calendar ", text, flags=re.IGNORECASE)
    text = re.sub(r" ios ", " operating system ", text, flags=re.IGNORECASE)
    text = re.sub(r" gps ", " GPS ", text, flags=re.IGNORECASE)
    text = re.sub(r" gst ", " GST ", text, flags=re.IGNORECASE)
    text = re.sub(r" programing ", " programming ", text, flags=re.IGNORECASE)
    text = re.sub(r" bestfriend ", " best friend ", text, flags=re.IGNORECASE)
    text = re.sub(r" dna ", " DNA ", text, flags=re.IGNORECASE)
    text = re.sub(r" III ", " 3 ", text)
    text = re.sub(r" banglore ", " Banglore ", text, flags=re.IGNORECASE)
    text = re.sub(r" J K ", " JK ", text, flags=re.IGNORECASE)
    text = re.sub(r" J\.K\. ", " JK ", text, flags=re.IGNORECASE)

    # remove punctuation and lowercase the cleaned string
    text = ''.join([c for c in text if c not in punctuation]).lower()

    return text

Now we apply this cleaning function to our dataframe (df):

df['question1'] = df['question1'].apply(clean)
df['question2'] = df['question2'].apply(clean)

Now our dataset looks much cleaner.
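To eyeball the result, here is a quick inspection snippet (the column names come straight from the Kaggle CSV we loaded above):

# peek at a few cleaned question pairs and their labels
print(df[['question1', 'question2', 'is_duplicate']].head())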

Now, we will combine Bag of Words with an XGBoost model.
Go through this for an excellent grasp of BoW.
And refer to Kaggle XGBoost kernels to see how it’s implemented in various scenarios.

BoW here is basically scikit-learn’s CountVectorizer: it tokenizes (separates) the documents, counts the occurrences of each token, and returns the counts as a sparse matrix.
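To see what that means concretely, here is a tiny standalone sketch (the two toy sentences are just the duplicate-question example from earlier):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['websites to study deep learning', 'online sources for deep learning']
vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
bow = vect.fit_transform(docs)  # sparse matrix: 2 documents x 8 unique tokens

print(vect.get_feature_names_out())  # learned vocabulary (get_feature_names() on older scikit-learn)
print(bow.toarray())                 # per-document token counts

Each row is one document and each column is one vocabulary token, so similar questions end up with overlapping non-zero columns.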

This is how we start…

from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse

# fit the vocabulary on every unique question from both columns
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(pd.concat((df['question1'], df['question2'])).unique())

# encode each question column separately, then stack them side by side
trainq1_trans = count_vect.transform(df['question1'].values)
trainq2_trans = count_vect.transform(df['question2'].values)

labels = df['is_duplicate'].values
X = scipy.sparse.hstack((trainq1_trans, trainq2_trans))
y = labels

Then we split into train and validation sets…

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33)

Now we fit the classifier:

import xgboost as xgb

# note: eta is just an alias of learning_rate, so only one of them is passed
xgb_model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                              colsample_bytree=0.7, gamma=0, reg_alpha=4,
                              objective='binary:logistic',
                              subsample=0.8).fit(X_train, y_train)

# prediction
xgb_prediction = xgb_model.predict(X_valid)

Now we check the scores (macro F1) and print them…

from sklearn.metrics import f1_score, classification_report, accuracy_score

print('training score:', f1_score(y_train, xgb_model.predict(X_train), average='macro'))
print('validation score:', f1_score(y_valid, xgb_model.predict(X_valid), average='macro'))

which outputs…

training score: 0.807145348750
validation score: 0.75432398438329
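As a quick sanity check, we can score a brand-new pair by encoding it exactly like the training data. This is only a sketch reusing the count_vect and xgb_model objects fitted above; the question pair itself is made up:

q1 = clean('websites to study deep learning?')
q2 = clean('Online sources for deep learning?')

# encode both questions and stack them side by side, same as at training time
pair = scipy.sparse.hstack((count_vect.transform([q1]),
                            count_vect.transform([q2]))).tocsr()

print(xgb_model.predict(pair))        # 1 = duplicate, 0 = not a duplicate
print(xgb_model.predict_proba(pair))  # class probabilities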

Great! That’s what we achieve with such a simple model.

Can it be improved? Or can we analyze it in a different way? Of course… Let’s see that in the next part (#Part-2). Have a great day. See ya!

Originally published at blog.lipishala.com on December 14, 2018.
