NLP for beginners: How a simple machine learning model competes with complex neural networks on a Quora task (Part 1)
co-authors: Nanxi Wang, Shea Thomas; co-editors: Haolin Hong, Nancy Fei
Find the code here; for beginners, a simple way to start is to initialize the ‘process’ and ‘model’ classes and use the functions in the code to transform (model.dtm) and train (model.nb_logreg) your own text data for sentiment or other classification analysis.
Natural Language Processing (NLP) is a big topic in content management for online platforms where people are free to post their opinions, questions, and so forth. Many websites face the problem of handling toxic and divisive content so as to provide their audience with a more comfortable setting. Since “Quora is a well-known question-and-answer website where questions are asked, answered, edited, and organized by its community of users in the form of opinions” (Wikipedia), the platform has always focused on screening out insincere questions, those based on a false premise or intended to make a statement rather than ask a question.
In this article, a simple machine learning model (Naive Bayes logistic regression) will be introduced and trained on the Quora questions dataset retrieved from Kaggle.com, achieving performance on par with RNN (LSTM) models. Before that, some basic text preprocessing methods will be reviewed.
- Data exploration (word frequency, wordcloud, etc.)
- Data engineering (Lemmatization, dtm/tfidf, etc.)
- Data training (Naive Bayes, Logistic Regression, LSTM, etc.)
DATA EXPLORATION
Read the data and view the last 10 lines:
import pandas as pd

train_raw = pd.read_csv('train.csv')
train_raw.tail(10)
The data has three fields: qid, the unique question identifier; question_text, the Quora question text; and target, where a question labeled “insincere” has a value of 1, otherwise 0.
We will treat question_text as the predictor and target as the dependent variable.
Let’s look at some basic statistics. The dataset is obviously imbalanced, with insincere questions accounting for only 6.19%, which means that if the model makes the naive guess that every question is ‘sincere’, it will still reach 93.81% accuracy. For this reason, we use the F1 score and ROC AUC as metrics instead of accuracy.
import matplotlib.pyplot as plt

# p_n holds the question count per class; its construction is not shown in the
# original post, e.g. p_n = train_raw.groupby('target').size().to_frame('target')
bars = plt.bar(range(2), p_n.target, color=['g','r'])
plt.xticks([0,1])
for bar in bars:
    h = bar.get_height()
    plt.text(bar.get_x()+bar.get_width()/2, h,
             str(h)+' (%.2f%%)' % (h/len(train_raw)*100),
             ha='center', va='bottom')
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
So, what do sincere and insincere questions look like, and what kinds of words and phrases contribute the most to them? We will show two examples each of sincere (target = 0) and insincere (target = 1) questions, as well as two word clouds.
ins_ques = train_raw.question_text[train_raw.target==1]  # insincere questions
s_ques = train_raw.question_text[train_raw.target==0]    # sincere questions
ins_ques.sample(2, random_state=1).values
s_ques.sample(2, random_state=1).values
NLTK
Natural Language Toolkit (NLTK) is a commonly used Python library for text analysis:
Frequencies — bar chart
There are different ways to count the frequency of each token; we used the functions in NLTK. The question_text column is split into the two target groups, and word frequencies are counted separately for each, with symbols and stop words removed.
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
import re
import string
# nltk.download('stopwords')  # run once if the stop-word list is not yet downloaded
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.util import ngrams

stop = stopwords.words('english')
question_text = train_raw['question_text']
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # strip punctuation/special symbols and split on whitespace
    return re_tok.sub('', s).split()

def group(sin_text, ins_text, stopwords):
    # tokenize the sincere and insincere questions and drop the stop words
    stop = stopwords.words('english')
    sin_words = sin_text.apply(tokenize)
    sins = [j for i in sin_words for j in i]
    sin_filtered = [w for w in sins if w not in stop]
    ins_words = ins_text.apply(tokenize)
    ins = [j for i in ins_words for j in i]
    ins_filtered = [w for w in ins if w not in stop]
    return sin_filtered, ins_filtered

sin_f, ins_f = group(question_text[train_raw['target']==0],
                     question_text[train_raw['target']==1],
                     stopwords)

bigram = list(ngrams(sin_f, 2))
bigram_ins = list(ngrams(ins_f, 2))

def print_top_20(text, ngram=1):
    # horizontal bar chart of the 20 most frequent tokens or n-grams
    freq = FreqDist(text)
    most_freq = freq.most_common(20)
    if ngram == 1:
        words = [w[0] for w in most_freq]
    else:
        words = [' '.join(w[0]) for w in most_freq]
    num = [n[1] for n in most_freq]
    plt.barh(words, num, alpha=0.8)

print_top_20(sin_f)
print_top_20(ins_f)
print_top_20(bigram, 2)
print_top_20(bigram_ins, 2)
Frequencies — word cloud
from wordcloud import WordCloud

def wordcloud(text, stopwords, ngram=1):
    # text: if ngram > 1, text should be a dictionary of {phrase: frequency}
    wc = WordCloud(width=1400,
                   height=800,
                   background_color='black',
                   stopwords=stopwords)
    if ngram == 1:
        wc.generate(' '.join(text))
    else:
        wc.generate_from_frequencies(text)
    plt.figure(figsize=(20,10), facecolor='k')
    plt.imshow(wc)
    plt.axis('off')
    plt.tight_layout(pad=0)

wordcloud(train_raw.question_text[train_raw['target']==1], stop)

bigram_ins = list(ngrams(ins_f, 2))
freq_bi_top1000 = FreqDist(bigram_ins).most_common(1000)

# Join each bigram into a single string for the dictionary key
# and set the value to its frequency.
text_dict = {}
for i in range(len(freq_bi_top1000)):
    text_dict['_'.join(freq_bi_top1000[i][0])] = freq_bi_top1000[i][1]

wordcloud(text_dict, stop, ngram=2)
It appears that some words and phrases, such as Donald Trump, Indian, Black people, and Muslim, tend to appear in the insincere questions. This could be because those words are used in political statements, personal complaints, discrimination, and so forth.
DATA ENGINEERING (DTM or TFIDF)
For this part, we wrote a class, ‘process’, that encapsulates the functions that process the dataset in various ways. We will present the corresponding functions for each part; the entire code is available on GitHub.
We encode the text into numeric vectors as input to the training models. The two most common methods in machine learning are the Document-Term Matrix (DTM) and TF-IDF. Before that, we can optionally lemmatize the text in order to shrink the data size.
Lemmatization
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
For example, “am, are, is” will be lemmatized into “be”, and “car, car’s, cars’, cars” into “car”. After lemmatization, the importance of some specific words will be enhanced.
def lem(self):
    # lemmatize each token of question_text as a verb and rebuild the sentences
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    question_text = self.data['question_text']
    question_token = question_text.apply(self.tokenize)
    lemmatizer = WordNetLemmatizer()
    self.question = []
    for line in question_token:
        lemmatized = [lemmatizer.lemmatize(w, 'v') for w in line]
        self.question.append(' '.join(lemmatized))
The resulting question attribute is a list holding the lemmatized question_text data; we can tell from the example that the words ‘were’ and ‘founded’ in the original dataset turn into ‘be’ and ‘found’ after lemmatization.
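As a quick standalone check of that behavior (a minimal sketch using NLTK’s WordNetLemmatizer directly, outside the ‘process’ class):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# lemmatize as verbs ('v'), matching the lem() method above
print(lemmatizer.lemmatize('were', 'v'))     # -> 'be'
print(lemmatizer.lemmatize('founded', 'v'))  # -> 'found'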
Document Term Matrix
Before building a model on the training data, we need to create a numerical representation that the computer can process.
A DTM is a fairly simple way to represent documents as a numeric structure. Representing text numerically is a common starting point for text mining and analytics such as search and ranking, creating taxonomies, categorization, document similarity, and text-based machine learning. If you want to compare two documents for similarity, you will usually start with a numeric representation of the documents. If you want to do machine learning magic on your documents, you may start by creating a DTM representation of the documents and using data derived from it as features.
In the DTM, each row represents one observation of the original dataset and each column is a token. The sklearn package provides the convenient CountVectorizer class to build a DTM.
Specifically, a DTM will look like this:
We may call this a bag of words representation.
Each document corresponds to a sentence, while the terms are all the words that appear across the sentences.
Take the above matrix as an example. For the first question, “What are amazing facts about your race?”, all the words that appear in this question are labeled 1. In the second question, “What are amazing facts about your country?”, the word “race” no longer appears, so we label “race” as 0 for this sentence.
Since both sentences are classified as sincere, we label both of them 0; insincere questions are labeled 1.
In this way, we create a Document-Term Matrix, which converts the text input into numeric data that is much easier to process.
For now, what we need to do is choose appropriate models to fit on the training dataset, apply them to the test set, and see whether the labels we assign to the test sentences (either 0 or 1) correspond with the actual results.
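As a minimal sketch of the toy DTM described above (using sklearn’s CountVectorizer with its default unigram settings, rather than the dtm method of our ‘process’ class shown next), on the two example questions:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["What are amazing facts about your race?",
        "What are amazing facts about your country?"]

veczr = CountVectorizer()          # default: lowercase, unigram bag of words
dtm = veczr.fit_transform(docs)    # 2 x 8 sparse matrix

print(veczr.get_feature_names_out())
# ['about' 'amazing' 'are' 'country' 'facts' 'race' 'what' 'your']
print(dtm.toarray())
# [[1 1 1 0 1 1 1 1]   <- the "race" question has no 'country'
#  [1 1 1 1 1 0 1 1]]  <- the "country" question has no 'race'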
def dtm(self, stop_words=None, ngram_range=(1,3), max_features=800000):
    veczr = CountVectorizer(tokenizer=self.tokenize,
                            stop_words=stop_words,
                            ngram_range=ngram_range,
                            max_features=max_features)
    self.X_text = veczr.fit_transform(self.X_train)
    self.val_text = veczr.transform(self.X_valid)
    return self.X_text, self.val_text
TF-IDF
One problem with text mining using a DTM is that some words with high occurrence but little importance may be assigned a higher weight, which is not what we want. For example, take the corpus [“I come to China to travel”, “This is a car popular in China”, “I love to drink tea and Apple”, “The work is to write some papers in science”]: the first document contains two occurrences of ‘to’ and one each of ‘China’ and ‘travel’, yet we would not assume that ‘to’ is more important than ‘China’ or ‘travel’.
Thus, we introduced Tf-idf to deal with that.
TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus … The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
Let’s take the corpus above as an example, using the plain definition idf = log(N / df) with the natural logarithm. The first document (i.e., the first sentence) contains 6 words, and the word to appears 2 times, so its term frequency (tf) is 2/6 ≈ 0.33. We have 4 documents and to appears in three of them, so its inverse document frequency (idf) is log(4/3) ≈ 0.29, and its tf-idf weight is the product 0.33 × 0.29 ≈ 0.10. With the same calculation, travel has tf = 1/6 ≈ 0.17 and idf = log(4/1) ≈ 1.39, giving a tf-idf weight of about 0.23, which is bigger than 0.10. We would therefore say the word travel is more significant than to in this context.
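A quick check of that arithmetic (a sketch using the plain idf = log(N/df) definition with the natural logarithm; note that sklearn’s TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ):

import math

corpus = ["I come to China to travel",
          "This is a car popular in China",
          "I love to drink tea and Apple",
          "The work is to write some papers in science"]

docs = [d.lower().split() for d in corpus]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)      # term frequency within this document
    df = sum(word in d for d in docs)    # number of documents containing the word
    return tf * math.log(N / df)

print(f"{tf_idf('to', docs[0]):.2f}")      # 0.10
print(f"{tf_idf('travel', docs[0]):.2f}")  # 0.23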
def tfidf(self, stop_words=None, ngram_range=(1,3), max_features=800000):
    tfi = TfidfVectorizer(tokenizer=self.tokenize,
                          ngram_range=ngram_range,
                          stop_words=stop_words,
                          max_features=max_features)
    self.X_text = tfi.fit_transform(self.X_train)
    self.val_text = tfi.transform(self.X_valid)
    return self.X_text, self.val_text
TRAINING (MACHINE LEARNING)
As before, we composed a class named ‘model’ to wrap all the model functions; you will find the code on GitHub if interested.
For DTM
- Naive Bayes;
- Logistic Regression;
- NaiveBayes-LogisticRegression;
- SVM
For Tfidf
- Logistic Regression;
- SVM;
We created a Naive Bayes classifier as the starting model to get a baseline; based on its results, we built a logistic regression model, and then a combination of Naive Bayes and regularized logistic regression. In the next article, we turn to a more powerful SVM and to Word2Vec embeddings with the deep learning neural network LSTM, seeking better performance.
Model 1: Naive Bayes Classification
Remember, our ultimate objective is to predict whether a given Quora question should be classified as 0, which signifies sincere, or 1, which signifies insincere.
The first method we use is Naive Bayes classification. The core idea is that if the probability that a given question should be labeled 1 is greater than the probability that it should be labeled 0, we predict that the question is insincere. We introduce the following expression: if its value is greater than 1, we say the question is insincere.
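A reconstruction of the expression referred to here: the ratio of the two class probabilities for a given question,

$$\frac{P(1 \mid \text{question})}{P(0 \mid \text{question})} > 1 \;\Rightarrow\; \text{predict } 1 \text{ (insincere)}$$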
It is not easy to compute the probability for a whole question directly. Instead, we can first calculate the probabilities related to the individual words of the question (written below), and then use matrix multiplication to obtain the value of the probability for the corresponding question.
Applying Bayes’ theorem, we could rewrite the expression as follows:
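A reconstruction of that rewrite on a per-word basis, using the quantities defined below:

$$\frac{P(1 \mid \text{word})}{P(0 \mid \text{word})} \;=\; \frac{P(\text{word} \mid 1)\,P(1)}{P(\text{word} \mid 0)\,P(0)}$$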
P(1) means the probability of a question labeled as 1.
P(0) means the probability of a question classified as 0.
P(Word|1) means for all the questions classified as 1, the probability that the specific word occurs.
P(Word|0) means for all the questions labeled as 0, the probability that the designated word appears.
So far, we can say that if we obtain the four probabilities P(Word|1), P(1), P(Word|0) and P(0), we are able to calculate the word probabilities, then the question probabilities, and thus predict whether a question should be classified as 0 or 1.
With the help of the DTM we created before, it’s easy to find the values of the four probabilities that we are investigating. Take the word “country” as an example,
P(1) means the probability of a question labeled as 1. It equals 2/4 = 0.5
P(0) means the probability of a question classified as 0. It equals 2/4 = 0.5
P(country|1) means for all the questions classified as 1, the probability that the word “country” occurs. It equals 1/2 = 0.5
P(country|0) means for all the questions labeled as 0, the probability that the word “country” appears. It equals 1/2 = 0.5
Because most of the questions on Quora are sincere, the probability that a question is classified as insincere is very small. As a result, the value of the fraction expression is close to zero.
For mathematical convenience, fraction expressions are usually converted to log expressions.
We name the first part of the log expression “r” and the second part “b”. In this way, a new expression is created that substitutes perfectly for the original fraction we wrote at the very beginning, which only involves question probabilities.
If the value of DTM * r + b is greater than 0, the question is classified as insincere, and vice versa.
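Putting this together (a reconstruction of the formulas the text describes), with r the element-wise log ratio of the word conditional probabilities and b the log ratio of the class priors:

$$r = \log\frac{P(\text{word} \mid 1)}{P(\text{word} \mid 0)}, \qquad b = \log\frac{P(1)}{P(0)}, \qquad \text{DTM}\cdot r + b > 0 \;\Rightarrow\; \text{insincere}$$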
Since the Naive Bayes classifier’s coefficients “r” and “b” are computed directly from counts on the training set (rather than fitted by optimization), they are simply applied to the test dataset to predict whether the questions in the test set are insincere or not.
def condition(self, y_i):
    # smoothed probability of each token appearing in questions of class y_i
    p = self.X[self.y==y_i].sum(0)
    return (p+1) / ((self.y==y_i).sum()+1)

def nb(self, binary=False, threshold=0):
    # b: log ratio of the class priors; r: log ratio of the per-token
    # conditional probabilities
    b = np.log((self.y==1).mean()/(self.y==0).mean())
    self.r = np.log(self.condition(1)/self.condition(0))
    prob_preds = self.val_text @ self.r.T + b
    if binary == True:
        # binarized version: only presence/absence of a token counts
        self.X = self.X.sign()
        self.r = np.log(self.condition(1)/self.condition(0))
        prob_preds = self.val_text.sign() @ self.r.T + b
    cl_preds = [1 if i > threshold else 0 for i in prob_preds]
    return prob_preds, cl_preds
Model 2: Logistic Regression
The Bayes classifier gave us a starting point for modeling the text dataset; however, the model assumes that all the variables are independent, which is unrealistic in a real language context, because the words in a sentence are likely to affect each other. Reviewing the Bayes formula after taking the log, we find a linear model in it: the DTM is the data, and r and b are the coefficients. Applying the sigmoid function to it gives us logistic regression.
In contrast to the Bayes classifier, where the coefficients are calculated from theory, logistic regression is a parametric model whose coefficients are learned during the training process.
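As a sketch of that step, the sigmoid function maps the linear score to a probability, and the weights w and intercept b are now learned from the data rather than computed from counts:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(1 \mid \text{question}) = \sigma(\text{DTM}\cdot w + b)$$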
Model 3: NB_Logistic Regression
We give credit to Jeremy Howard for what he has done in his fastai courses, and also to Sida Wang and Chris Manning, the authors of the paper that initiated this technique. The basic idea is: is there a way to merge the theoretically derived coefficients (Naive Bayes) with the learned ones (logistic regression) to develop a new model?
Reviewing the formulas of Naive Bayes and logistic regression again, we made the attempt shown below: multiply the dataset by the conditional probability ratio derived from Naive Bayes, then feed it into the linear function of the logistic regression.
The rationale behind this can be interpreted as the power of regularization. Below is the loss function with L2 regularization for the linear function. Normally, since the objective is to minimize the loss, the non-zero coefficients are shrunk hard toward zero; but in this situation the dataset is multiplied by the conditional probability ratio r, so the regularization in effect penalizes coefficients where we vary from our Naive Bayes conditional probabilities.
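A sketch of the L2-regularized logistic loss referred to above (our reconstruction, with labels y_i ∈ {−1, +1}); in the NB-LR variant, the features x_i are replaced by x_i ∘ r, the element-wise product with the Naive Bayes log-count ratio:

$$\mathcal{L}(w, b) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(w \cdot x_i + b)}\right) + \lambda \lVert w \rVert_2^2$$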
def logreg(self, c=0.1, naive=False, threshold=0.5):
    # the dual formulation is only supported by the liblinear solver
    m_logi = LogisticRegression(dual=True, C=c, solver='liblinear')
    X_logit = self.X.copy()
    val_X_logit = self.val_text.copy()
    if naive == True:
        # NB-LR: scale the features by the Naive Bayes log-count ratio r
        X_logit = self.X.multiply(self.r)
        val_X_logit = self.val_text.multiply(self.r)
    m_logi.fit(X_logit, self.y)
    prob_preds = m_logi.predict_proba(val_X_logit)[:,1]
    cl_preds = [1 if x > threshold else 0 for x in prob_preds]
    return prob_preds, cl_preds
PERFORMANCE
For the first trial, we only used DTM.
The AUC and F1 score improve accordingly, which shows that logistic regression outperforms the plain Naive Bayes classifier, while the Naive Bayes logistic regression, with a proper level of regularization, works best so far. The best F1 score so far is 0.645.
TBC
In Part 2, we will continue with SVM and LSTM. In addition, we will discuss data augmentation, which generates more insincere questions so as to make the dataset more balanced.
I am a new grad student actively looking for a job in the field of data analytics or data science. Here is my resume; feel free to contact me! Thank you so much!
If you like what you read, please hit that ♥ button below — as a writer, it means the world. Also, leave a comment if you spot any mistakes I made. Love & Peace.