Naive Bayes Classifier From Scratch | Part 2 (NLP in Golang)

Harry Cao
5 min read · Jan 13, 2019


In the previous part, we learned the theory behind the genius of the Naive Bayes Classifier in Sentiment Analysis. If you haven’t read it, I highly recommend you go and give it a try to understand the basics of Conditional Probability, Bayes’ Theorem and Laplace Smoothing. I’ll be here waiting for you!

In this tutorial, we’ll build a Naive Bayes Classifier in Golang purely with the knowledge of Bayes’ Theorem and Laplace Smoothing, without any 3rd party libraries. I do believe that even if Golang isn’t your cup of tea 🍵, this tutorial can easily be followed in any programming language. Build it in your favorite language and link it in the comments, I would love to check it out! At the end of this tutorial, our classifier will run like this:

Naive Bayes Classifier in Golang

If you cannot wait for the source code, it can be found here: https://gist.github.com/port3000/27c62464c4aaa83dba5becbcfa78f134

The Dataset

The training dataset used in this tutorial is a set of sentences labeled with positive or negative sentiment, created for the paper From Group to Individual Labels using Deep Features, Kotzias et al., KDD 2015. The score of each sentence is either 1 (positive) or 0 (negative). The sentences come from three different websites/fields: imdb.com, amazon.com and yelp.com. For each website, there are 500 positive and 500 negative sentences, selected randomly from larger datasets of reviews. Only sentences with clearly positive or negative connotations were selected; the goal was for no neutral sentences to be included. The link to the dataset is https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

Snippets of yelp dataset

To model this dataset, we use constants to describe the two classes, positive & negative. For a different problem such as email spam detection, the classes could be spam & ham.

Constants describe two classes
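In code, that’s just two string constants (a minimal sketch; the gist linked above has the exact version):

```go
package main

// The two classes our classifier can output. For a spam
// detector these could be "spam" and "ham" instead.
const (
	positive = "positive"
	negative = "negative"
)
```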

The Classifier

The Classifier uses a word-frequency algorithm, which works surprisingly well for simple Machine Learning models. Intuitively, if a word appears frequently in the negative dataset and the input sentence contains it, then the input sentence has a higher chance of being negative.

The Classifier

We use two structs to build our classifier, wordFrequency and classifier. classifier has train and classify methods. Right now, those two methods are empty; we’ll populate them later.

classifier also has dataset and words properties.

  • dataset maps positive and negative to lists of sentences from the training data. This tells us the ratio of positive to negative sentences in the training data (here, 50% each).
  • words maps each word to its wordFrequency, which counts the word’s appearances in positive and in negative sentences. This tells us whether a word leans positive or negative.

newClassifier builds a new, clean classifier with empty dataset and words. train is what makes this new classifier useful.
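Sketched out, the two structs and the constructor look roughly like this (the field and method names follow the description above; consult the gist for the exact code):

```go
package main

const (
	positive = "positive"
	negative = "negative"
)

// wordFrequency counts how often one word appears in
// positive and in negative training sentences.
type wordFrequency struct {
	word    string
	counter map[string]int // class -> count
}

// classifier holds everything learned from training: dataset maps
// a class to its sentences, words maps each word to its frequency.
type classifier struct {
	dataset map[string][]string
	words   map[string]wordFrequency
}

// newClassifier builds a clean classifier with empty dataset
// and words; train populates it.
func newClassifier() *classifier {
	return &classifier{
		dataset: map[string][]string{positive: {}, negative: {}},
		words:   map[string]wordFrequency{},
	}
}

func (c *classifier) train(dataset map[string]string) {
	// populated in the Train section below
}

func (c *classifier) classify(sentence string) map[string]float64 {
	// populated in the Classify section below
	return nil
}
```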

Train

Train method

train takes a map of sentences to their class and populates the classifier with that knowledge. Here, for each sentence in our training dataset, we:

  • add it and its class to the dataset property.
  • tokenize it (transform it to lowercase, remove stopwords and split it into an array of single words), then add each word to the words property.

addSentence (which updates the dataset property) and addWord (which updates the words property) can be implemented as below:

Helper methods for Train
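Putting those steps together, train and its two helpers can be sketched like this (consistent with the description above rather than a verbatim copy of the gist; the tokenize here is a simplified stand-in for the utilities.go version we build next):

```go
package main

import "strings"

const (
	positive = "positive"
	negative = "negative"
)

type wordFrequency struct {
	word    string
	counter map[string]int
}

type classifier struct {
	dataset map[string][]string
	words   map[string]wordFrequency
}

// train feeds a map of sentence -> class into the classifier.
func (c *classifier) train(dataset map[string]string) {
	for sentence, class := range dataset {
		c.addSentence(sentence, class)
		for _, word := range tokenize(sentence) {
			c.addWord(word, class)
		}
	}
}

// addSentence records a sentence under its class in dataset.
func (c *classifier) addSentence(sentence, class string) {
	c.dataset[class] = append(c.dataset[class], sentence)
}

// addWord bumps the word's counter for the given class.
func (c *classifier) addWord(word, class string) {
	wf, ok := c.words[word]
	if !ok {
		wf = wordFrequency{word: word, counter: map[string]int{}}
	}
	wf.counter[class]++
	c.words[word] = wf
}

// Simplified stand-in for the tokenize utility in utilities.go:
// lowercase and split; the real one also strips stopwords.
func tokenize(sentence string) []string {
	return strings.Fields(strings.ToLower(sentence))
}
```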

Notice how we call tokenize in train; we’ll have to create that utility method. For clarity, let’s put it in another file, utilities.go:

Utility methods for Train

Here, we first clean up the sentence by transforming all characters to lowercase and removing everything that isn’t alphanumeric. Then we split the sentence into words, remove stopwords and return the remaining words as an array. train uses that array of words to populate the words property. We store stopwords in a map purely for performance: looking up a key in a map is much faster than searching through an array!
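utilities.go can be sketched as follows (the stopword list here is truncated to a handful of entries for space; the real one is far longer):

```go
package main

import (
	"regexp"
	"strings"
)

// A tiny sample of stopwords; the full list is much longer. A map
// is used because key lookups are much faster than scanning a slice.
var stopwords = map[string]struct{}{
	"a": {}, "an": {}, "the": {}, "is": {}, "was": {}, "and": {}, "of": {},
}

// Matches anything that is not a lowercase letter, digit or space.
var nonAlnum = regexp.MustCompile(`[^a-z0-9 ]`)

// cleanup lowercases the sentence and strips every character
// that is not alphanumeric or a space.
func cleanup(sentence string) string {
	return nonAlnum.ReplaceAllString(strings.ToLower(sentence), "")
}

// tokenize cleans a sentence, splits it into words and drops
// the stopwords.
func tokenize(sentence string) []string {
	words := []string{}
	for _, w := range strings.Fields(cleanup(sentence)) {
		if _, isStop := stopwords[w]; !isStop {
			words = append(words, w)
		}
	}
	return words
}
```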

Classify

Classify method

The classify method takes a sentence and returns the probability of it being positive and negative. First we tokenize the sentence into words, then we evaluate those words in the probability helper method.

The probability method is the implementation of Bayes’ Theorem with Laplace Smoothing.

Helper methods for Classify

Recall the equation we derived in part 1 (The Theory): the probability of A being true given B is

P(A|B) = P(A) × P(b1|A) × P(b2|A) × … × P(bn|A) / P(B)

where A is positive or negative, B is the input sentence, and bi is each word in the sentence. With Laplace Smoothing, each P(bi|A) becomes

P(bi|A) = (count of bi in class A + 1) / (total word count of class A + total distinct word count)

P(A) in the equation is returned by the priorProb method. The first loop in probability calculates the numerator; the second loop divides the numerator by the denominator P(B).

We use one utility method here called zeroOneTransform. The idea behind this method is simple: if the argument x is 0, it returns 0; otherwise, it returns 1. We use it to help calculate totalDistinctWordCount.

Utility method for Classify
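Putting the whole Classify side together, a sketch consistent with the description above (again, the tokenize here is a simplified stand-in for the utilities.go version):

```go
package main

import "strings"

const (
	positive = "positive"
	negative = "negative"
)

type wordFrequency struct {
	word    string
	counter map[string]int
}

type classifier struct {
	dataset map[string][]string
	words   map[string]wordFrequency
}

// Simplified stand-in for the tokenize utility in utilities.go.
func tokenize(sentence string) []string {
	return strings.Fields(strings.ToLower(sentence))
}

// classify returns the probability of the sentence being
// positive and negative.
func (c *classifier) classify(sentence string) map[string]float64 {
	words := tokenize(sentence)
	return map[string]float64{
		positive: c.probability(words, positive),
		negative: c.probability(words, negative),
	}
}

// priorProb is P(A): the share of training sentences in a class.
func (c *classifier) priorProb(class string) float64 {
	total := len(c.dataset[positive]) + len(c.dataset[negative])
	return float64(len(c.dataset[class])) / float64(total)
}

// totalWordCount counts word occurrences in one class, or in
// both classes when class is the empty string.
func (c *classifier) totalWordCount(class string) int {
	pos, neg := 0, 0
	for _, wf := range c.words {
		pos += wf.counter[positive]
		neg += wf.counter[negative]
	}
	switch class {
	case positive:
		return pos
	case negative:
		return neg
	default:
		return pos + neg
	}
}

// totalDistinctWordCount sums, per class, the number of distinct
// words seen; zeroOneTransform turns any non-zero count into 1.
func (c *classifier) totalDistinctWordCount() int {
	pos, neg := 0, 0
	for _, wf := range c.words {
		pos += zeroOneTransform(wf.counter[positive])
		neg += zeroOneTransform(wf.counter[negative])
	}
	return pos + neg
}

// probability implements Bayes' Theorem with Laplace Smoothing.
// The first loop builds the numerator P(A)·ΠP(bi|A); the second
// divides by the denominator ΠP(bi).
func (c *classifier) probability(words []string, class string) float64 {
	prob := c.priorProb(class)
	for _, w := range words {
		count := 0
		if wf, ok := c.words[w]; ok {
			count = wf.counter[class]
		}
		prob *= float64(count+1) / float64(c.totalWordCount(class)+c.totalDistinctWordCount())
	}
	for _, w := range words {
		count := 0
		if wf, ok := c.words[w]; ok {
			count = wf.counter[positive] + wf.counter[negative]
		}
		prob /= float64(count+1) / float64(c.totalWordCount("")+c.totalDistinctWordCount())
	}
	return prob
}

// zeroOneTransform maps 0 to 0 and any positive count to 1.
func zeroOneTransform(x int) int {
	if x == 0 {
		return 0
	}
	return 1
}
```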

main.go

Now that we have the implementation of our Naive Bayes Classifier in place, let’s actually use it! First, we open the dataset file (yelp reviews in this example), build a map from it and feed it to the classifier by calling train. We then prompt for the user’s input and classify that sentence.

Naive Bayes Classifier usage in main.go
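The dataset-loading half of main.go can be sketched like this (parseLine is a helper name introduced here for illustration; each line of the yelp file is tab-separated, sentence then score):

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

const (
	positive = "positive"
	negative = "negative"
)

// dataset reads the labelled file into a map of sentence -> class.
func dataset(path string) map[string]string {
	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	data := map[string]string{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if sentence, class, ok := parseLine(scanner.Text()); ok {
			data[sentence] = class
		}
	}
	return data
}

// parseLine splits one "<sentence>\t<score>" line; a score of
// "1" means positive and "0" means negative.
func parseLine(line string) (sentence, class string, ok bool) {
	parts := strings.Split(line, "\t")
	if len(parts) != 2 {
		return "", "", false
	}
	class = negative
	if strings.TrimSpace(parts[1]) == "1" {
		class = positive
	}
	return parts[0], class, true
}
```

main itself then just wires things up: nb := newClassifier(), nb.train(dataset("yelp_labelled.txt")) (the path being wherever you saved the file), and a loop that reads a sentence from stdin with bufio.NewReader(os.Stdin) and prints the two probabilities returned by nb.classify.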

Build it by running $ go build -o naivebayes main.go naivebayes.go utilities.go

Try it out

The results

Yay!

Summary

In this tutorial, we have successfully built a Natural Language Processing model in Golang. Our Naive Bayes Classifier works great, but there are things that can be improved:

  • Word lemmatization: Group different inflections of a word together. For example, eat, ate, eating should be interpreted as one single word.
  • Word chunking: Group different words that make up a phrase together. That way, one token can be “chicken chop” instead of the two tokens “chicken” and “chop”.
  • Advanced algorithms: Word frequency works, but we can combine it with other algorithms to interpret sentences better. TF-IDF is a good candidate.

Once again, the full source code is here https://gist.github.com/port3000/27c62464c4aaa83dba5becbcfa78f134

I hope you have enjoyed the tutorial and learned something new today! If you have, please consider giving it a clap 👏 and following me to never miss future tutorials. Happy coding! 🖥
