Applied Machine Learning: Part 3

Classification Using Naive Bayes, Linear SVM, Logistic Regression, and Random Forest

XQ
The Research Nest
10 min read · Dec 24, 2018


In this digital age, data is everywhere, and on the internet most of it is text. We have previously seen how to predict a value using linear regression and its variants, and then how to perform image recognition using a CNN (links to Parts 1 & 2 of this series are provided at the end of the article). Now it’s time to explore another major field, Natural Language Processing (NLP). As the name suggests, it’s all about the natural language we use in everyday communication and deriving insights from it.

Unlike numerical data, textual data is difficult to handle. For one thing, mathematical models cannot be applied to it directly. Now, let us formulate a problem statement and see how we can solve it using NLP and some basic machine learning techniques. Like the previous articles in the series, this one focuses primarily on practical implementation rather than on the theoretical or mathematical understanding behind the techniques used.

Step 1: Setting up the Development Environment

  • We will be using the same software and tools we used in our previous projects.
  • We will be using the Spyder IDE that comes with the Anaconda installation to do all our programming.

Step 2: Choosing Your Dataset

  • As in most machine learning programs, we first need data. You can get textual data from any website like a movie review website, or Amazon product reviews, and so on.
  • Here, I’ll be using a labeled textual dataset, that can be downloaded here (Edit- Link updated as the dataset was removed from the previous link).
  • We import the basic libraries and then read the dataset. As discussed in my previous articles, use ‘#%%’ to split your code into code blocks for easy execution by pressing ‘Ctrl + Enter’.
#%%
import pandas as pd
import numpy as np
data = pd.read_csv('text_emotion.csv')

Step 3: Understanding What’s Inside Your Dataset

  • This one’s a simple dataset with just four columns, the tweet ID, emotion depicted by the tweet, the author, and the text content of the tweet.
  • We do not necessarily need the author column. Hence we can drop it.
data = data.drop('author', axis=1)
  • The dataset has 40,000 tweets in total, labeled into 13 different human sentiments. Our task here is to build a model such that, given a new tweet or text sentence, it can accurately identify which of the emotions it was trained to recognize the text depicts.
  • For this tutorial let us consider just two of these sentiments for simplicity, ‘happiness’ and ‘sadness’ (which constitute a total of about 10,000 tweets from the entire sample of data). We can thus drop rows with all other labels.
# Dropping rows with other emotion labels (keeping only 'happiness' and 'sadness')
data = data[data.sentiment.isin(['happiness', 'sadness'])]
  • Remember that if you are missing any library used in this tutorial, you can always download it using the pip install command in Anaconda Prompt.

Step 4: Preprocessing the Data

  • Obviously, we can’t perform math on text and machine learning models are all mathematical models.
  • So, how do we convert all this textual data into mathematical data? Remember that we have to take care of countless combinations, special characters, and, not to mention, the SMS lingo and slang for which even a dictionary can’t be used for reference.
  • First, let’s bring some uniformity to the text by making everything lowercase and removing punctuation and stop words (very common words such as prepositions).
#Making all letters lowercase
data['content'] = data['content'].apply(lambda x: " ".join(x.lower() for x in x.split()))
#Removing Punctuation, Symbols
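# regex=True is needed on recent pandas versions, where str.replace treats the pattern literally by default
data['content'] = data['content'].str.replace(r'[^\w\s]', ' ', regex=True)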
#Removing Stop Words using NLTK
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['content'] = data['content'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
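The three cleaning steps above can be traced on a single sentence. This is a minimal sketch using only the standard library: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop word list, and `clean` is an illustrative helper that is not part of the original code.

```python
import re

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english')
STOP_WORDS = {"i", "am", "the", "a", "is", "so", "of"}

def clean(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # same pattern as used on the dataframe
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

cleaned = clean("I am SO happy!!! The sun is out :)")
print(cleaned)
```

Lowercasing comes first on purpose: the stop word list is lowercase, so "The" would slip through if the order were reversed.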
  • To gain any proper insight, we need to get all the words to their root form, i.e. the variants of a word within the text (for example plural forms, past tense, etc.) must all be converted to the base word they represent. This is called lemmatisation. (Note that TextBlob’s lemmatize() treats every word as a noun unless a part of speech is passed, so the code below mostly normalises plurals and leaves verb forms largely untouched.) Along with that, I have added code to collapse repeated letters in a word, on the assumption that hardly any word has a letter repeated more than twice consecutively. Though not very accurate, it can help in some corrections.
#Lemmatisation
from textblob import Word
data['content'] = data['content'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
#Correcting Letter Repetitions
import re
def de_repeat(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)
#%%
data['content'] = data['content'].apply(lambda x: " ".join(de_repeat(x) for x in x.split()))
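To see what the letter-repetition rule does and does not fix, here is a small standalone sketch of the same regular expression applied to a few made-up words (the sample words are my own, not taken from the dataset).

```python
import re

REPEAT = re.compile(r"(.)\1{2,}")  # any character repeated three or more times in a row

def de_repeat(text):
    """Collapse runs of 3+ identical characters down to exactly two."""
    return REPEAT.sub(r"\1\1", text)

samples = ["sooooo", "happpy", "cool", "yesss"]
collapsed = [de_repeat(w) for w in samples]
print(collapsed)
```

Words with a legitimate double letter, such as "cool", pass through unchanged, while "sooooo" ends up as "soo" rather than "so". That is the inaccuracy mentioned above: the rule limits the noise but does not always recover the correct spelling.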
  • Next consideration is the idea that if a word is appearing only once in the entire sample of data, then it most likely has no influence in determining the sentiment of the text. Hence we can remove all the rarely occurring words from the dataset which are generally proper nouns and other insignificant words with respect to the current context.
# Code to find the 10,000 rarest words appearing in the data
freq = pd.Series(' '.join(data['content']).split()).value_counts()[-10000:]
# Removing all those rarely appearing words from the data
# (a set makes the membership test below much faster than a list)
freq = set(freq.index)
data['content'] = data['content'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
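The same idea can be sketched on a toy corpus with the standard library. Note that this variant removes words by a count threshold (seen exactly once) instead of slicing off a fixed number of the rarest words as the code above does; the corpus and the threshold are assumptions made for illustration.

```python
from collections import Counter

docs = ["happy birthday john", "happy day", "sad day", "sad monday xyzzy"]

# Count every token across the whole corpus
counts = Counter(word for doc in docs for word in doc.split())

# Words seen only once are treated as noise (hypothetical threshold)
rare = {word for word, n in counts.items() if n == 1}

filtered = [" ".join(w for w in doc.split() if w not in rare) for doc in docs]
print(filtered)
```

A threshold keeps the rule independent of vocabulary size, whereas a fixed slice of 10,000 words may also cut into words that appear several times if the vocabulary is small.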
  • The next biggest challenge with natural text is dealing with spelling mistakes, especially when it comes to tweets. Apart from that, what do we do about sarcasm or irony in the text? Due to the complexity of dealing with these issues, let us ignore them for now.
  • Extending further, one can think of replacing words with their most common synonyms. That could help in building better models. That is being skipped here as well.
[Figure: An Overview of Approaches to Sentiment Analysis]

Step 5: Feature Extraction

  • Once the text data is as clean and consistent as we can make it, each tweet is represented by a group of keywords. Now, we need to perform ‘Feature Extraction’, i.e. extracting some parameters from the data that can be presented numerically. In this article, we consider two different features, TF-IDF & Count Vectors (Remember, we need numeric data for the math!).
  • Split the data into training and testing parts before performing feature extraction.
#Encoding output labels 'sadness' as '1' & 'happiness' as '0'
from sklearn import preprocessing
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(data.sentiment.values)
# Splitting into training and testing data in 90:10 ratio
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data.content.values, y, stratify=y, random_state=42, test_size=0.1, shuffle=True)
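The comment about the encoding ('happiness' as 0, 'sadness' as 1) follows from the fact that LabelEncoder numbers the classes in sorted order. A plain-Python sketch of that mapping, using made-up labels and not scikit-learn's actual implementation:

```python
labels = ["sadness", "happiness", "happiness", "sadness"]

# Classes in sorted order; the position in this list is the integer code
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

y = [to_code[name] for name in labels]
print(classes, y)
```

With the real encoder, the same ordering can be inspected through its `classes_` attribute after fitting.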
  • Term Frequency-Inverse Document Frequency (TF-IDF): This parameter gives the relative importance of a term in the data. A term scores high when it appears frequently in a given tweet but rarely across the rest of the tweets. It can be extracted directly in Python as follows-
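# Extracting TF-IDF parameters
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, analyzer='word',ngram_range=(1,3))
X_train_tfidf = tfidf.fit_transform(X_train)
# Use transform (not fit_transform) on the validation data so that it shares the
# vocabulary and IDF weights learned from X_train. Refitting on X_val scrambles the
# meaning of the columns and gives near-chance accuracy, which is what the TF-IDF
# scores reported in Step 6 reflect.
X_val_tfidf = tfidf.transform(X_val)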
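To make the TF-IDF weighting concrete, here is a hand-rolled sketch on three toy documents. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which, as far as I understand the scikit-learn documentation, matches the defaults of TfidfVectorizer. The documents and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

docs = ["happy happy day", "sad day", "happy smile"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_weights(doc):
    """Smoothed TF-IDF for one tokenized document, L2-normalised."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf_weights(tokenized[1])  # the document "sad day"
print(weights)
```

In the document "sad day", both words occur once, but "sad" appears in only one document while "day" appears in two, so "sad" receives the larger weight.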
  • Count Vectors: This is another feature we consider and as the name suggests we transform our tweet into an array having the count of appearances of each word in it. The intuition here is that the text that conveys similar emotions may have the same words repeated over and over again. This is more like the direct approach.
# Extracting Count Vectors Parameters
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(data['content'])
X_train_count = count_vect.transform(X_train)
X_val_count = count_vect.transform(X_val)
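The fit/transform split used above matters for both feature types: the vocabulary is learned once and then reused, so that column i means the same word for training and validation data. A plain-Python stand-in for a count vectorizer shows the idea (toy texts and helper names are assumptions; this sketch learns the vocabulary from the training texts only and is not scikit-learn's implementation).

```python
train = ["happy day", "sad day", "happy happy smile"]
val = ["sad sad smile", "gloomy day"]

# "fit": build the vocabulary from the training texts only
vocab = sorted({w for doc in train for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_counts(doc):
    """'transform': count words using the fixed training vocabulary."""
    row = [0] * len(vocab)
    for w in doc.split():
        if w in index:  # unseen words such as 'gloomy' are ignored
            row[index[w]] += 1
    return row

X_train_toy = [to_counts(d) for d in train]
X_val_toy = [to_counts(d) for d in val]
print(vocab)
print(X_val_toy)
```

If the vocabulary were rebuilt from `val` instead, the columns of the validation matrix would refer to different words than the ones a model saw during training.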

Step 6: Training Our Models

  • With the numerical representations of the tweets ready, we can directly use them as inputs for some classic machine learning models.
  • Here, we trained four different machine learning models as demonstrated in the code below. We are focusing only on the implementation part. These four methods can, in fact, be used for tackling any kind of classification problem. In our case, we want to classify if a given tweet is a happy tweet or a sad tweet.
  • With that being said, I am not going into the details of the inner workings of these algorithms (However, if you are interested to learn more, a simple google search should help you out). For now, being aware of them should suffice. Also, please note that the syntax for implementing these models is standard.
  • First, let us build some models using the TF-IDF features-
from sklearn.metrics import accuracy_score
# Model 1: Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
y_pred = nb.predict(X_val_tfidf)
print('naive bayes tfidf accuracy %s' % accuracy_score(y_pred, y_val))
naive bayes tfidf accuracy 0.5289017341040463
# Model 2: Linear SVM
from sklearn.linear_model import SGDClassifier
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_tfidf, y_train)
y_pred = lsvm.predict(X_val_tfidf)
print('svm tfidf accuracy %s' % accuracy_score(y_pred, y_val))
svm tfidf accuracy 0.5404624277456648
# Model 3: logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1)
logreg.fit(X_train_tfidf, y_train)
y_pred = logreg.predict(X_val_tfidf)
print('log reg tfidf accuracy %s' % accuracy_score(y_pred, y_val))
log reg tfidf accuracy 0.5443159922928709
# Model 4: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_tfidf, y_train)
y_pred = rf.predict(X_val_tfidf)
print('random forest tfidf accuracy %s' % accuracy_score(y_pred, y_val))
random forest tfidf accuracy 0.5385356454720617
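  • The best model had an accuracy of just 54.43% (Logistic Regression), which is barely better than guessing between two classes, so the models are hardly classifying anything properly. This is no good. The complex nature of the textual dataset may play a part, but scores this close to chance usually point to the pipeline rather than the data: these numbers were produced with the TF-IDF vectorizer refit on the validation tweets (fit_transform instead of transform), so the validation columns no longer correspond to the features the models were trained on.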
  • Now, let’s build models using count vectors features-
# Model 1: Multinomial Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_count, y_train)
y_pred = nb.predict(X_val_count)
print('naive bayes count vectors accuracy %s' % accuracy_score(y_pred, y_val))
naive bayes count vectors accuracy 0.7764932562620424
# Model 2: Linear SVM
from sklearn.linear_model import SGDClassifier
lsvm = SGDClassifier(alpha=0.001, random_state=5, max_iter=15, tol=None)
lsvm.fit(X_train_count, y_train)
y_pred = lsvm.predict(X_val_count)
print('lsvm using count vectors accuracy %s' % accuracy_score(y_pred, y_val))
lsvm using count vectors accuracy 0.7928709055876686
# Model 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1)
logreg.fit(X_train_count, y_train)
y_pred = logreg.predict(X_val_count)
print('log reg count vectors accuracy %s' % accuracy_score(y_pred, y_val))
log reg count vectors accuracy 0.7851637764932563
# Model 4: Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500)
rf.fit(X_train_count, y_train)
y_pred = rf.predict(X_val_count)
print('random forest with count vectors accuracy %s' % accuracy_score(y_pred, y_val))
random forest with count vectors accuracy 0.7524084778420038
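To put the accuracy figures above in context, it helps to compare them with a trivial baseline that always predicts the most common class. The sketch below computes accuracy by hand on a hypothetical set of validation labels (the labels are made up; with two classes of similar size the baseline sits near 50%).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical validation labels: 0 = happiness, 1 = sadness
y_val_toy = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]

# Baseline: always predict the most common class
majority = Counter(y_val_toy).most_common(1)[0][0]
baseline_pred = [majority] * len(y_val_toy)

baseline = accuracy(y_val_toy, baseline_pred)
print(baseline)
```

A model is only useful to the extent that it beats this number, which is why an accuracy in the low fifties on a roughly balanced two-class problem is a warning sign.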
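  • By using count vectors, we have a significant improvement in performance. The best model, linear SVM, achieved 79.28% accuracy.
  • This might be because of the nature of this specific dataset, where the emotion of the text depends heavily on the presence of a few significant adjectives. Note also that the count vectors for training and validation share one fitted vocabulary, which was not the case for the TF-IDF scores above, so the two feature sets are not being compared on an equal footing.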
  • Let us now test how it performs in reality by giving this model some random text input.
#Below are 8 random statements.
#The first 4 depict happiness
#The last 4 depict sadness
tweets = pd.DataFrame(['I am very happy today! The atmosphere looks cheerful',
'Things are looking great. It was such a good day',
'Success is right around the corner. Lets celebrate this victory',
'Everything is more beautiful when you experience them with a smile!',
'Now this is my worst, okay? But I am gonna get better.',
'I am tired, boss. Tired of being on the road, lonely as a sparrow in the rain. I am tired of all the pain I feel',
'This is quite depressing. I am filled with sorrow',
'His death broke my heart. It was a sad day'])
# Doing some preprocessing on these tweets as done before
tweets[0] = tweets[0].str.replace(r'[^\w\s]', ' ', regex=True)
from nltk.corpus import stopwords
stop = stopwords.words('english')
tweets[0] = tweets[0].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
from textblob import Word
tweets[0] = tweets[0].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
# Extracting Count Vectors feature from our tweets
tweet_count = count_vect.transform(tweets[0])
#Predicting the emotion of the tweet using our already trained linear SVM
tweet_pred = lsvm.predict(tweet_count)
print(tweet_pred)
[0 0 0 0 1 1 1 1]
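The printed array contains integer codes, not label names. With the fitted encoder from Step 5, `lbl_enc.inverse_transform(tweet_pred)` should map them back to names; the plain-Python sketch below does the same by hand, assuming the sorted class order 'happiness', 'sadness'.

```python
classes = ["happiness", "sadness"]         # index = integer code
tweet_pred_toy = [0, 0, 0, 0, 1, 1, 1, 1]  # the predictions printed above

names = [classes[code] for code in tweet_pred_toy]
print(names)
```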
  • That’s interesting. Remember our encodings for the output: ‘0’ is for happiness and ‘1’ is for sadness. Our model detected the emotion correctly for all 8 sentences!
  • But then why is our best accuracy only 79.28%? Notice that the sentences I used for testing are standard, grammatically correct, and direct. There were no typos, slang, irony, or other complex figures of speech, which made them easy for our model to classify.
  • The actual Twitter data can be quite difficult to preprocess. Nevertheless, these eight examples suggest that our model works pretty well for normal, grammatically correct tweets. Using this, we could gauge in real time whether a group of people is feeling sad or happy about a certain incident or topic. We could also train the model to detect other specific emotions.
  • There are several ways to further improve our accuracy, like using better preprocessing techniques or more relevant features. One can also tweak some parameters of the models to get higher scores.
  • And so, our prototype text emotion detection machine is ready!
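One low-effort way to make the prototype more robust is to bundle the vectorizer and the classifier into a single scikit-learn pipeline, so the vocabulary is always learned from the training texts and reused at prediction time. This is a minimal sketch on toy data that I made up for illustration, not the setup used for the results above.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, hypothetical data: 0 = happiness, 1 = sadness
train_texts = ["happy great day", "good wonderful smile",
               "sad terrible pain", "awful lonely sorrow"]
train_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer on the training texts only and
# reuses that fitted vocabulary whenever predict() is called.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1))
model.fit(train_texts, train_labels)

pred = model.predict(["great smile", "terrible sorrow"])
print(pred)
```

Because raw text goes in and labels come out, there is no separate validation matrix to build by hand, which removes the opportunity to fit the vectorizer twice by accident.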

End Notes:

Note that the approaches discussed in this article can be used on practically any textual dataset with minor modifications as per the application. For additional reference, you can look up the official documentation of the various libraries used. It is pretty detailed and comes with useful code samples.

These aren’t the only models that can be used in text classification or NLP in general. In fact, we haven’t touched any deep learning techniques yet. Some of the popular ones include RNNs, LSTM, GRU etc.

Let us save it for another day. Until then, stay tuned for more insightful updates from The Research Nest.

Check out Parts 1 & 2 of this series here (if you haven’t already):

  1. Applied Machine Learning: Part 1 (Prediction Using Linear Regression, LassoCV, ElasticNet, RidgeCV, and xgboost)
  2. Applied Machine Learning: Part 2 (Convolutional Neural Networks for Image Recognition)

Clap if you found this useful and feel free to ask any questions in the responses below!
