In the previous part, we learned the theory behind the genius of the Naive Bayes Classifier in Sentiment Analysis. If you haven’t read it, I highly recommend you go and give it a read to understand the basics of Conditional Probability, Bayes’ Theorem and Laplace Smoothing. I’ll be here waiting for you!
In this tutorial, we’ll build a Naive Bayes Classifier in Golang purely with the knowledge of Bayes’ Theorem & Laplace Smoothing and without any 3rd-party libraries. I do believe that even if Golang isn’t your cup of tea 🍵, this tutorial can easily be followed in any programming language. Build it in your favorite language and link it in the comments; I would love to check it out! By the end of this tutorial, we expect our classifier to run like this:
If you cannot wait for the source code, it can be found here https://gist.github.com/port3000/27c62464c4aaa83dba5becbcfa78f134
The Dataset
The training dataset used in this tutorial is a set of sentences labeled with positive or negative sentiment. It was created for the paper “From Group to Individual Labels using Deep Features”, Kotzias et al., KDD 2015. Each sentence has a score of either 1 (positive) or 0 (negative). The sentences come from three different websites: imdb.com, amazon.com and yelp.com. For each website there are 500 positive and 500 negative sentences, selected randomly from larger datasets of reviews. The selected sentences have clearly positive or negative connotations; the goal was for no neutral sentences to be included. The link to the dataset is https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
To model this dataset, we use constants to describe the two classes, positive and negative. For a different problem such as email spam detection, the classes could be spam and ham.
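A minimal sketch of those constants (the names here are illustrative and may differ slightly from the gist):

```go
package main

// The two class labels used throughout the classifier. A spam filter would
// use something like "spam" and "ham" instead.
const (
	positive = "positive"
	negative = "negative"
)
```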
The Classifier
The classifier uses a word-frequency algorithm, which works extremely well for simple Machine Learning models. Intuitively, if a word appears frequently in the negative training sentences and the input sentence contains it, then the input sentence has a higher chance of being negative.
We use two structs to build our classifier: wordFrequency and classifier. classifier has train and classify methods. Right now those two methods are empty; we’ll populate them later. classifier also has dataset and words properties.
dataset maps positive and negative to the lists of sentences from the training data. This tells us the ratio of positive to negative sentences in the training data (usually 50%). words maps each word to its wordFrequency. A wordFrequency counts the appearances of a word in positive and negative sentences. This tells us whether a word leans more positive or negative.
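One possible shape for the two structs (a sketch; the field names may differ from the full gist):

```go
// wordFrequency counts how many times a single word was seen in each class.
type wordFrequency struct {
	word    string
	counter map[string]int // class -> number of appearances
}

// classifier holds the training sentences per class and the frequency of
// every word seen during training.
type classifier struct {
	dataset map[string][]string      // class -> sentences from the training data
	words   map[string]wordFrequency // word -> its per-class frequency
}
```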
newClassifier builds a new, clean classifier with an empty dataset and words. train will make this new classifier useful.
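Under those struct definitions, newClassifier could look like this (a sketch, not necessarily the gist’s exact code):

```go
// newClassifier returns a classifier with empty dataset and words maps,
// ready to be filled in by train.
func newClassifier() *classifier {
	return &classifier{
		dataset: map[string][]string{positive: {}, negative: {}},
		words:   map[string]wordFrequency{},
	}
}
```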
Train
train takes a map of sentences to their class and populates the classifier with that knowledge. For each sentence in our training dataset, we:
- add it and its class to the dataset property
- tokenize it (transform to lowercase, remove stopwords and split into an array of single words), then add each word to the words property
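A sketch of train along those lines, assuming the struct fields and helpers from the sketches above:

```go
// train populates the classifier from a map of sentence -> class.
func (c *classifier) train(data map[string]string) {
	for sentence, class := range data {
		c.addSentence(sentence, class)
		for _, word := range tokenize(sentence) {
			c.addWord(word, class)
		}
	}
}
```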
addSentence (adding to the dataset property) and addWord (adding to the words property) can be implemented as below:
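For example (again a sketch against the struct fields assumed above):

```go
// addSentence records a training sentence under its class.
func (c *classifier) addSentence(sentence, class string) {
	c.dataset[class] = append(c.dataset[class], sentence)
}

// addWord increments the per-class counter of a single word.
func (c *classifier) addWord(word, class string) {
	wf, ok := c.words[word]
	if !ok {
		wf = wordFrequency{word: word, counter: map[string]int{}}
	}
	wf.counter[class]++
	c.words[word] = wf
}
```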
Notice how we call tokenize in train; we’ll have to create that utility method. For clarity, let’s put it in another file, utilities.go:
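A sketch of utilities.go; the stopword set here is heavily abbreviated for illustration:

```go
package main

import (
	"regexp"
	"strings"
)

// A tiny illustrative stopword set; a real list would be much longer.
var stopwords = map[string]struct{}{
	"a": {}, "an": {}, "the": {}, "is": {}, "and": {}, "of": {}, "to": {},
}

// nonAlnum matches every character that is not a lowercase letter, a digit
// or whitespace, so it can be stripped out after lowercasing.
var nonAlnum = regexp.MustCompile(`[^a-z0-9\s]`)

// tokenize lowercases a sentence, removes non-alphanumeric characters,
// drops stopwords and returns the remaining words as a slice.
func tokenize(sentence string) []string {
	cleaned := nonAlnum.ReplaceAllString(strings.ToLower(sentence), "")
	var words []string
	for _, w := range strings.Fields(cleaned) {
		if _, isStopword := stopwords[w]; !isStopword {
			words = append(words, w)
		}
	}
	return words
}
```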
Here, we first clean up the sentence by transforming all characters to lowercase and removing anything that is not alphanumeric (letters and digits). Then we split the sentence into words, remove stopwords and return the remaining words as an array. train uses that array of words to populate the words property. We store stopwords in a map purely for performance: looking up a key in a map is much faster than searching through an array!
Classify
The classify method takes a sentence and returns the probability of it being positive and of it being negative. First we tokenize the sentence into words, then we evaluate those words in the probability helper method.
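A sketch of classify under the same assumptions:

```go
// classify tokenizes the input sentence and returns the probability of it
// belonging to each class.
func (c *classifier) classify(sentence string) map[string]float64 {
	words := tokenize(sentence)
	return map[string]float64{
		positive: c.probability(words, positive),
		negative: c.probability(words, negative),
	}
}
```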
The probability method is the implementation of Bayes’ Theorem with Laplace Smoothing.
Recall the equation we derived in part 1 (The Theory); the probability of A being true given B is:
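P(A|B) = P(A) × P(b1|A) × P(b2|A) × … × P(bn|A) / P(B)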
where A is positive or negative, B is the input sentence, and bi is each word in the sentence. For each P(bi|A):
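P(bi|A) = (number of times bi appears in class A + 1) / (total word count of class A + total distinct word count of the training data)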
P(A) in the equation is returned by the priorProb method. The first loop in probability calculates the numerator; the second loop divides the numerator by the denominator.
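Here is one way probability and its helpers could be written (a sketch; the method names and exact counting details may differ from the full gist):

```go
// priorProb returns P(A): the share of training sentences that belong to class.
func (c *classifier) priorProb(class string) float64 {
	total := len(c.dataset[positive]) + len(c.dataset[negative])
	return float64(len(c.dataset[class])) / float64(total)
}

// totalWordCount returns how many word occurrences were seen for a class,
// or for both classes when class is the empty string.
func (c *classifier) totalWordCount(class string) int {
	total := 0
	for _, wf := range c.words {
		if class == "" {
			total += wf.counter[positive] + wf.counter[negative]
		} else {
			total += wf.counter[class]
		}
	}
	return total
}

// totalDistinctWordCount counts how many distinct words appear in the
// training data, using zeroOneTransform so each word is counted once.
func (c *classifier) totalDistinctWordCount() int {
	total := 0
	for _, wf := range c.words {
		total += zeroOneTransform(wf.counter[positive] + wf.counter[negative])
	}
	return total
}

// probability applies Bayes' Theorem with Laplace Smoothing to the words.
func (c *classifier) probability(words []string, class string) float64 {
	// First loop: the numerator P(A) * Π (count(bi, A)+1)/(wordCount(A)+distinct).
	prob := c.priorProb(class)
	for _, w := range words {
		count := 0
		if wf, ok := c.words[w]; ok {
			count = wf.counter[class]
		}
		prob *= float64(count+1) / float64(c.totalWordCount(class)+c.totalDistinctWordCount())
	}
	// Second loop: divide by the denominator P(B), smoothed the same way but
	// counting each word across both classes.
	for _, w := range words {
		count := 0
		if wf, ok := c.words[w]; ok {
			count = wf.counter[positive] + wf.counter[negative]
		}
		prob /= float64(count+1) / float64(c.totalWordCount("")+c.totalDistinctWordCount())
	}
	return prob
}
```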
We use one utility method here called zeroOneTransform. The idea behind it is simple: if the argument x is 0, it returns 0; otherwise, it returns 1. We use it to help calculate the totalDistinctWordCount.
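A sketch of zeroOneTransform:

```go
// zeroOneTransform returns 0 when x is 0 and 1 otherwise, so that
// totalDistinctWordCount counts each observed word exactly once.
func zeroOneTransform(x int) int {
	if x == 0 {
		return 0
	}
	return 1
}
```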
main.go
Now that we have the implementation of the Naive Bayes Classifier in place, let’s actually use it! First, we open the dataset file (Yelp reviews in this example), build a map from it and feed it to the classifier by calling train. We then prompt for the user’s input and classify that sentence.
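A sketch of main.go, assuming the UCI file name yelp_labelled.txt and its tab-separated sentence and score columns:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// The UCI file is tab-separated: "<sentence>\t<score>", score is 1 or 0.
	file, err := os.Open("yelp_labelled.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// Build a map of sentence -> class to feed into train.
	dataset := map[string]string{}
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		parts := strings.Split(scanner.Text(), "\t")
		if len(parts) != 2 {
			continue
		}
		class := negative
		if strings.TrimSpace(parts[1]) == "1" {
			class = positive
		}
		dataset[parts[0]] = class
	}

	c := newClassifier()
	c.train(dataset)

	// Prompt for a sentence and print its class probabilities.
	fmt.Print("Enter a sentence to classify: ")
	reader := bufio.NewReader(os.Stdin)
	sentence, _ := reader.ReadString('\n')
	for class, prob := range c.classify(strings.TrimSpace(sentence)) {
		fmt.Printf("%s: %f\n", class, prob)
	}
}
```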
Build it by running $ go build -o naivebayes main.go Naivebayes.go utilities.go
Try it out
Yay!
Summary
In this tutorial, we have successfully built a Natural Language Processing model in Golang. Our Naive Bayes Classifier works great, but there are things that can be improved:
- Word lemmatization: Group different inflections of a word together. For example, eat, ate and eating should be interpreted as one single word.
- Word chunking: Group words that form a phrase into a single token, so one token can be “chicken chop” instead of the two tokens “chicken” and “chop”.
- Advanced algorithms: Word frequency works, but we can combine it with other algorithms to interpret sentences better. TF-IDF is a good candidate.
Once again, the full source code is here https://gist.github.com/port3000/27c62464c4aaa83dba5becbcfa78f134
I hope you have enjoyed the tutorial and learned something new today! If you have, please consider giving it a clap 👏 and follow me so you never miss future tutorials. Happy coding! 🖥