A Glimpse into Machine Learning through Sentiment Analysis

Amirhossein Goudarzi · Published in Annotdot · Dec 9, 2016

Thanks to the fast development of technology, probably anyone living in the 21st century has at least thought of or heard of intelligent robots, or perhaps dreamed of them during childhood. However, we normally don’t think about how and with what these robots are built. Today, I’d like to explain a system that could well become part of a complete robot: emotion detection. As we all know, humans can distinguish happiness, sadness and many other emotions with ease, but it is quite difficult for a computer to sense them, since a machine only understands 0s and 1s. In this post, I will explain one of the easiest ways of detecting emotions using Twitter, which will hopefully give you a glimpse of Machine Learning and of the complexity of building a complete, autonomous, emotional robot.

Sentiment analysis is the fancy term that data scientists use instead of emotion detection. Many scientists are trying to train computers to understand common human feelings using sentiment analysis techniques. Here, I will show you the basic concept of sentiment analysis by detecting the emotions of tweets; the same system can be extended to shorter or even longer texts. Of course, sentiment covers a very broad spectrum, but to keep things as simple as possible, let’s say we only want two labels: “positive” and “negative”. A positive tweet would be a sentence like “I love you”, while a negative tweet would be a sentence like “I hate you”.

Generally, in Machine Learning we have to provide pre-labeled texts to train our machine before it can make predictions. Preparing the labeled data and training the machine are therefore the most important tasks to take care of. Since we are dealing with a set of data called ‘tweets’, we can simply collect it using the Twitter API. In other Machine Learning projects, however, preparing the data may be the most costly and time-consuming part.

First, let’s have a look at the steps required for our analysis. We will go through each of them in turn.

1. Downloading a large set of data (Tweets)

2. Refining and cleaning the tweets

3. Using a classifier to train the system

4. Testing the classifier

In a more complicated project there would be more steps, but you will be surprised at the interesting results these four simple steps can give you. To implement them on a computer, I have prepared a set of code samples that I will use throughout this tutorial.

Now let’s get started!

1. Preparing a large set of data

Preparing the right set of data is the most important step in any Machine Learning technique. In sentiment analysis, we need to make sure we select and collect the most representative data, and that this data isn’t biased, damaged, or wrong; biased data will skew the output and the predictions. The collected data will be used to teach the computer how to sort positive and negative tweets. Using a classification technique, we will train the computer sentence by sentence and word by word so that it can mimic human behavior. This process is explained in more detail in the classification section.

We need a large set of tweets for this analysis. If you want to build a sentiment analysis system for a specific field such as Biology or Politics, it is best to search for tweets with the relevant tags or keywords. The accuracy of the system will also depend on the number of tweets. In this project, however, you don’t have to download a data set yourself, as I have provided a preprocessed data module in the sentiment analysis code I mentioned previously. If you wish to use another set of tweets for your model, you can read through this walk-through for the Twitter API. The code is written in JavaScript, but you can find many alternatives in other languages such as C++ or Python.
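For illustration only, here is a minimal sketch of how tweets could be collected with the `twitter` npm package and the standard search endpoint. This is not part of the tutorial code, and the credentials and keyword are placeholders you would replace with your own.

```js
// Hypothetical collection script using the `twitter` npm package (npm install twitter).
// The credentials below are placeholders -- get real ones from Twitter's developer site.
var Twitter = require('twitter');

var client = new Twitter({
  consumer_key: 'YOUR_CONSUMER_KEY',
  consumer_secret: 'YOUR_CONSUMER_SECRET',
  access_token_key: 'YOUR_ACCESS_TOKEN_KEY',
  access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
});

// Search for recent tweets containing a keyword (e.g. 'startup') and print their text.
client.get('search/tweets', { q: 'startup', count: 100 }, function (error, tweets, response) {
  if (error) return console.error(error);
  tweets.statuses.forEach(function (tweet) {
    console.log(tweet.text);
  });
});
```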

2. Refining and cleaning the tweets

In Machine Learning, data can’t always be used as it is. Usually, it has to be cleaned and refined before it can be used in the training process. In the previous step the data was collected but not cleaned or labeled, which means that unnecessary data needs to be removed and the corresponding labels attached. By unnecessary data, I mean words and symbols that are unrelated to the sentiment you wish to analyze.

Let’s say you are teaching a child the words “positive” and “negative” through tweets. Tweets mostly look like this:

“I’m in love with my new company #working #startup http://hub.am/1m6RPK4 by @kellykranz”

As you can see, the child wouldn’t understand how the username, hashtags, and URL contribute to making the sentence “positive” or “negative”. We can therefore call these kinds of data “unnecessary”. In the cleaning process, we remove them until only “I’m in love with my new company” is left. This can be done in different ways: we could remove hashtags and usernames, and we could also remove words that don’t carry the sentiment of the sentence, such as “with” or “and”, keeping only the main intended words. Removing “with” and “and” from the text does not change the sentiment, but it decreases the processing time as there are fewer words to process. You can find this process in getFeautures.js in the sentiment analysis code I provided. In that piece of code, there are various functions that remove unwanted marks or tags from the tweet, and they can also detect negations and emoticons.
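As an illustration only, here is a simplified sketch of what such a cleaning step could look like; it is not the actual getFeautures.js code, and the stop-word list is just an example.

```js
// A simplified cleaning step: strip URLs, @usernames, and '#' marks,
// then drop a few common stop words that carry no sentiment.
var STOP_WORDS = ['with', 'and', 'the', 'a', 'an', 'my', 'to', 'of'];

function cleanTweet(text) {
  var cleaned = text
    .replace(/https?:\/\/\S+/g, ' ')   // remove URLs
    .replace(/@\w+/g, ' ')             // remove @usernames
    .replace(/#/g, ' ')                // keep hashtag words, drop the '#'
    .toLowerCase()
    .replace(/[^a-z' ]/g, ' ');        // drop punctuation and other symbols

  return cleaned
    .split(/\s+/)
    .filter(function (word) {
      return word.length > 0 && STOP_WORDS.indexOf(word) === -1;
    });
}

console.log(cleanTweet("I'm in love with my new company #working #startup http://hub.am/1m6RPK4 by @kellykranz"));
// -> [ "i'm", 'in', 'love', 'new', 'company', 'working', 'startup', 'by' ]
```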

After the cleaning process, we need to match each tweet with a label. In this analysis, the labels are “positive” and “negative”. Through these labels, the computer will sort out the tweets representing each emotion. We must prepare this part very carefully: computers can’t tell the difference between good and bad on their own, so their whole vocabulary and way of “thinking” is based on how broad and unbiased the injected information is, just like a little child’s. Once we run the preprocess.js file in the code I provided, our file of labeled tweets is automatically turned into a cleaned JSON data file. JSON is a data format, most commonly used in JavaScript, for keeping structured data.
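Purely as an illustration, the cleaned and labeled data might be shaped roughly like this; the exact structure produced by preprocess.js may differ.

```js
// Illustrative shape of cleaned, labeled training data.
// The exact format written by preprocess.js may differ.
var labeledTweets = [
  { words: ['love', 'new', 'company'],   label: 'positive' },
  { words: ['hate', 'waiting', 'lines'], label: 'negative' }
];
```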

3. Using a classifier to train the system

Classifiers are the core of a Machine Learning system. They are pieces of code that break sentences down into words and analyze the probability of each word appearing in different situations. Let’s say we analyze the word “love” and find 100 tweets containing it. If 70 of these tweets carry a positive label, we estimate a probability of 70/100 = 0.7 that “love” signals a positive tweet. There are also other statistical techniques that can help us estimate, for example, the probability of a sentence being positive when both “love” and “mother” appear in it. If we put a series of these probabilities together, we can make an accurate prediction about the emotion of a new tweet.
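Here is a minimal sketch of that word-level counting, using the illustrative labeledTweets structure from above; the actual classifier described next uses more sophisticated machinery than this.

```js
// Estimate P(positive | word) as:
//   (# positive tweets containing the word) / (# tweets containing the word)
function wordPositiveProbability(word, labeledTweets) {
  var total = 0;
  var positive = 0;
  labeledTweets.forEach(function (tweet) {
    if (tweet.words.indexOf(word) !== -1) {
      total += 1;
      if (tweet.label === 'positive') positive += 1;
    }
  });
  return total === 0 ? 0.5 : positive / total; // 0.5 = no evidence either way
}

// With 100 tweets containing "love", 70 of them labeled positive,
// wordPositiveProbability('love', data) would return 0.7.
```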

In the sentiment analysis code, an SVM (Support Vector Machine) is used as the classifier. The SVM is one of the most famous classifiers in the Machine Learning world. It tries to separate the two groups of words and sentences by drawing a boundary between their feature representations. This approach is not perfectly accurate, but in sentiment analysis it usually gives an accuracy above 60% when the training data is chosen properly. Of course, you are not limited to this classifier; however, an SVM library (node-svm) is embedded in the sentiment analysis code, so you can learn more about it by following the process. The library is imported and used in the train.js file, which takes the JSON data created in the previous section and uses the SVM classifier to build a trained model for making predictions. The trained model is saved in model.json so that training won’t need to be run again in the future.
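To give a rough idea of the training step, here is a hedged sketch based on node-svm’s promise-style API; the feature vectors are placeholders (a real run would first turn each tweet’s words into numbers and would use far more examples), and the actual train.js may look different.

```js
// Rough sketch of training with node-svm (not the tutorial's train.js).
// SVMs work on numeric feature vectors, so each tweet must first be
// converted into numbers, e.g. one position per known word (1 = present).
var fs = require('fs');
var svm = require('node-svm');

var clf = new svm.CSVC(); // C-Support Vector Classification

// Placeholder dataset: [featureVector, label] pairs, 1 = positive, 0 = negative.
var trainingSet = [
  [[1, 0, 1, 0], 1], [[1, 1, 0, 0], 1], [[1, 0, 0, 1], 1], [[1, 1, 1, 0], 1],
  [[0, 1, 0, 1], 0], [[0, 0, 1, 1], 0], [[0, 1, 1, 0], 0], [[0, 0, 0, 1], 0]
  // ...a real training set would contain thousands of tweets
];

clf.train(trainingSet)
  .spread(function (model, report) {
    // Persist the trained model so training doesn't have to be repeated.
    fs.writeFileSync('model.json', JSON.stringify(model));
    console.log('training report:', report);
  });
```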

4. Testing the classifier

After creating the model.json file through the classifier, we can feed in new tweets and observe the predicted sentiment. The test.js file demonstrates how to find the sentiment of a newly added tweet. However, in order to test the system and measure its accuracy, we need to test it against other sets of labeled tweets, which means preparing more of those cleaned and labeled tweets. The proportion of correct predictions then shows the accuracy of the classifier, and we can increase this accuracy by using a different data set, cleaning method, labeling method, or classification method. To test our system, we run the classifier in a Node environment; index.js makes this clearer. Once we add new tweets, we can see that the predict function works well and returns a predicted sentiment for each tweet.
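As another hedged sketch (again not the tutorial’s test.js or index.js), evaluating the saved model on a held-out labeled set could look roughly like this, assuming node-svm’s restore and predictSync functions and the same placeholder feature vectors as before:

```js
// Rough sketch: load the saved model and measure accuracy on held-out data.
var fs = require('fs');
var svm = require('node-svm');

var model = JSON.parse(fs.readFileSync('model.json', 'utf8'));
var clf = svm.restore(model);

// Placeholder test set: [featureVector, expectedLabel] pairs.
var testSet = [
  [[1, 0, 1, 1], 1],
  [[0, 1, 0, 0], 0]
];

var correct = 0;
testSet.forEach(function (example) {
  if (clf.predictSync(example[0]) === example[1]) correct += 1;
});

console.log('accuracy:', correct / testSet.length);
```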

So, this was a very basic Machine Learning method! I hope you found it helpful. We will talk more about Machine Learning techniques in future posts, so stay tuned!

Please visit us at annot.io and check out what our awesome developers are working on! Also follow us on Twitter @annotdot for more info about us or for any feedback! :)
