Sentiment Analysis with Supervised and Unsupervised Learning

The Portuguese version of this article is available at Análise de Sentimento com aprendizado supervisionado e não supervisionado

Sentiment analysis, or opinion mining, aims to identify someone's sentiment about something from natural language text. The goal is to find the polarity of the text, that is, whether a phrase is positive or negative, rather than to detect more detailed emotions. Companies frequently use this technique to measure, for example, how well a product is received.

The analysis combines statistical and machine learning techniques. In this article I'll explain how to classify these texts. The supervised approach will use neural networks and the unsupervised approach will use semantic orientation (for a better understanding of basic machine learning concepts, see this article here).


Recurrent Neural Networks (RNN)

RNNs recognize patterns in sequential inputs and can be used with several types of input. The decision made at time t-1 affects the decision made at time t. Unlike ordinary feed-forward neural networks, recurrent networks receive not only the current input from the dataset but also the state produced at the previous time step.
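To make the recurrence concrete, here is a minimal sketch of a single recurrent unit with a scalar state. The weights `w_x`, `w_h` and `b` are arbitrary illustrative values, not trained parameters.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a minimal scalar RNN: the new state depends on the
    current input x_t and on the previous state h_prev."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the unit, carrying the state forward."""
    h = 0.0  # initial state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet the later states are still
# non-zero because each step receives the previous state.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```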

However, basic RNNs do not handle sequences with long-range dependencies well. For this reason we'll use the LSTM (Long Short-Term Memory), an RNN-based architecture.

LSTM

σ = sigmoid function; tanh = hyperbolic tangent

The image above represents the architecture of an LSTM. This kind of network contains three components:

forget gate: Here the network "forgets" what is no longer necessary. The new input and the previous step's output go through a sigmoid layer, which outputs values between 0 and 1; everything to be "forgotten" gets a value close to 0. The result is multiplied by the previous cell state (memory).

input gate: This part is responsible for adding information to the cell state. First, a sigmoid layer, like the one in the forget gate, selects which information is important enough to be added. Then a tanh layer creates a vector of candidate values that could be added to the cell state. The results of these two steps are multiplied together and added to the forget gate's output, producing the new cell state.

output gate: In this step the network decides its output. As in the previous steps, a sigmoid layer selects which values should remain in the output. The current cell state is also passed through a tanh, which squashes its values to between -1 and 1. These two results are multiplied to create the output.
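The three gates can be sketched as a single step of a scalar LSTM cell. This is a simplified illustration, not the implementation used by any library: the state is one number instead of a vector, and the weights in `params` are hypothetical values chosen only so the example runs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.
    params maps each layer name to hypothetical (w_x, w_h, b) weights."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: near 0 drops old memory
    i = sigmoid(pre("input"))        # input gate: how much new info to admit
    g = math.tanh(pre("candidate"))  # candidate values for the cell state
    c = f * c_prev + i * g           # new cell state (memory)
    o = sigmoid(pre("output"))       # output gate: which part of memory to expose
    h = o * math.tanh(c)             # new output
    return h, c

names = ("forget", "input", "candidate", "output")
params = {name: (0.5, 0.5, 0.0) for name in names}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Setting the forget gate's bias to a large negative number pushes `f` towards 0 and wipes the previous memory, while a large positive bias keeps it almost unchanged.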

Hands on!

The dataset used here is the public "Yelp" dataset, which contains thousands of reviews of different types of businesses.

This dataset comes with an attribute called "stars", which holds the business score. I added a "sentiment" attribute to label each review's polarity.
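One possible way to derive such a label is sketched below. The thresholds (more than 3 stars is positive, fewer than 3 is negative, exactly 3 is neutral) are an assumption made for illustration, and the sample reviews are invented.

```python
def stars_to_sentiment(stars):
    """Map a star rating to a polarity label.
    The thresholds are an assumption, not taken from the original project."""
    if stars > 3:
        return "pos"
    if stars < 3:
        return "neg"
    return "neutral"

# Invented sample records that mimic the shape of a review.
reviews = [
    {"text": "Great food and friendly staff", "stars": 5},
    {"text": "It was fine, nothing special", "stars": 3},
    {"text": "Cold meal and rude service", "stars": 1},
]

for review in reviews:
    review["sentiment"] = stars_to_sentiment(review["stars"])

print([r["sentiment"] for r in reviews])
```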

Any text mining task requires some preprocessing. In this case we'll use tokenization, which takes a text sequence and splits it into words, removing some punctuation. In this project all special characters were also deleted.
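A minimal sketch of this kind of preprocessing, using only the standard library, could look like the following. The exact cleaning rules of the original project are not shown in the text, so the regex here is an assumption: it lowercases the text and replaces every character that is not a letter, digit or whitespace with a space.

```python
import re

def tokenize(text):
    """Lowercase the text, replace special characters with spaces,
    and split the result on whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return cleaned.split()

tokens = tokenize("The food was GREAT!!! (5/5) Would come back.")
print(tokens)
```

Note that this simple rule also splits contractions: "don't" becomes the two tokens "don" and "t".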

The model is very simple: just one LSTM layer followed by one Dense layer.

This very simple network reaches 75% accuracy with 15,000 data instances (very little data).
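A model of this shape could be sketched in Keras roughly as follows. This is not the original code: the Embedding layer in front of the LSTM, the layer sizes, the vocabulary size and the compile settings are all assumptions chosen for illustration.

```python
VOCAB_SIZE = 5000  # hypothetical number of distinct tokens kept
NUM_CLASSES = 3    # pos, neutral, neg

def build_model(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES):
    """Embedding -> one LSTM layer -> one Dense softmax layer."""
    # Imported inside the function so that TensorFlow is only required
    # when the model is actually built.
    from tensorflow import keras

    model = keras.Sequential([
        # Assumption: token ids are mapped to dense vectors first.
        keras.layers.Embedding(input_dim=vocab_size, output_dim=128),
        keras.layers.LSTM(64),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_model()` returns a compiled model that expects padded sequences of integer token ids as input and one-hot encoded labels as targets.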

Train on 10800 samples, validate on 1200 samples
Epoch 1/1
10800/10800 [==============================] - 1334s 124ms/step - loss: 0.7541 - acc: 0.6944 - val_loss: 0.6426 - val_acc: 0.7492
Loss score: 0.61
Test Accuracy: 75.37

This dataset (pos = 9908; neutral = 2385; neg = 2707) has "pos" as its predominant class, amounting to 66% of the data. A reasonable baseline for this study is therefore an accuracy of 66%, because "guessing" that every instance is positive would already be right 66% of the time. Since the model reaches 75%, it actually learned something.
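The baseline can be checked directly from the class counts given above:

```python
from collections import Counter

class_counts = Counter({"pos": 9908, "neutral": 2385, "neg": 2707})

majority_class, majority_count = class_counts.most_common(1)[0]
total = sum(class_counts.values())
baseline_accuracy = majority_count / total

print(f"Always predicting '{majority_class}' gives {baseline_accuracy:.2%} accuracy")
```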

Be mindful, though. The sampling method used here, although it ended up with the same predominant class ratio as the dataset, doesn't take the data distribution into account when selecting instances. This may introduce bias. A better method is to select instances in a way that preserves the original class distribution (stratified sampling).

What if the data doesn't have any score or numerical rating, that is, no label?
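As a sketch of what preserving the class distribution means, the function below splits a list of labels class by class. The function name, the 20% test fraction and the toy label list are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps roughly the
    same share in both parts as in the full dataset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with roughly the same proportions as the dataset above.
labels = ["pos"] * 66 + ["neutral"] * 16 + ["neg"] * 18
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In practice, scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.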


Semantic Orientation

There are other ways to classify text without labels. The method used in this article is the semantic orientation (SO) of a word, which measures how closely a term is associated with another term such as 'good' or 'bad'. This association is measured by PMI (Pointwise Mutual Information), where t1 and t2 can be any two words and P(tx) is the probability of term tx appearing in the text.

The semantic orientation of a word is calculated from PMI results, comparing a term (t) of the analyzed sentence with each term (t') belonging to a set of positive (V+) and negative (V-) terms.
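Written out, one common formulation of these two quantities is the following. The base-2 logarithm and the notation P(t1 ∧ t2) for the probability of both terms appearing together are assumptions of this sketch.

```latex
\mathrm{PMI}(t_1, t_2) = \log_2 \frac{P(t_1 \wedge t_2)}{P(t_1)\, P(t_2)}

\mathrm{SO}(t) = \sum_{t' \in V^{+}} \mathrm{PMI}(t, t') \;-\; \sum_{t' \in V^{-}} \mathrm{PMI}(t, t')
```

A positive SO means the term co-occurs more with the positive vocabulary than with the negative one, and a negative SO means the opposite.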

Hands on!

For this task I collected some tweets with the #WomensWave hashtag (one of Twitter's Trending Topics about the American elections).

As in the previous code, tokenization is done here as well. Tweets can contain many expressions that a plain tokenizer does not recognize as a single token, such as hashtags, @-mentions and URLs. That's why it's necessary to handle them explicitly, which is done by matching them with a regex.

Besides this preprocessing, it's necessary to remove stopwords. Stopwords are words that usually carry little meaning on their own, like conjunctions and articles.
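A tweet-aware tokenizer along these lines can be sketched with a single verbose regex. The pattern below is an illustrative assumption, not the one used in the original project, and it covers only URLs, @-mentions, hashtags and plain words.

```python
import re

# Order matters: the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+      # URLs
    | @\w+            # @-mentions
    | \#\w+           # hashtags
    | \w+(?:'\w+)?    # words, optionally with an apostrophe
    """,
    re.VERBOSE,
)

def tokenize_tweet(text):
    """Return lowercase tokens, keeping URLs, mentions and hashtags whole."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

tokens = tokenize_tweet("So proud of the #WomensWave! @someone https://t.co/abc123")
print(tokens)
```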
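Removing stopwords is a simple filter over the tokens. The tiny stopword set below is hand-written for illustration only; a real project would use a fuller list, for example the English list shipped with NLTK.

```python
# A tiny, hand-written stopword set for illustration only.
STOPWORDS = {
    "a", "an", "the", "and", "or", "but",
    "of", "to", "in", "on", "is", "are",
}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]

filtered = remove_stopwords(["the", "women", "of", "congress", "are", "the", "future"])
print(filtered)
```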

In this part, we collect the frequency of each word and the frequency of each co-occurrence between two words in the text.

When a text is analyzed, using the context makes the analysis more realistic than looking at words in isolation. That is why the co-occurrence matrix is built.

Here, each term's probability of occurrence is calculated first, followed by the PMI and then the SO.
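The counting step can be sketched as follows. Counting each term at most once per tweet, and treating "co-occurrence" as "appearing in the same tweet", are assumptions of this sketch; the toy documents are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations

def count_terms(documents):
    """Count in how many documents each term appears, and in how many
    documents each pair of terms appears together."""
    term_counts = Counter()
    pair_counts = defaultdict(Counter)
    for tokens in documents:
        unique_terms = sorted(set(tokens))
        term_counts.update(unique_terms)
        for t1, t2 in combinations(unique_terms, 2):
            # Store both directions so lookups work in either order.
            pair_counts[t1][t2] += 1
            pair_counts[t2][t1] += 1
    return term_counts, pair_counts

docs = [
    ["women", "win", "great"],
    ["women", "vote"],
    ["great", "win"],
]
term_counts, pair_counts = count_terms(docs)
print(term_counts["women"], pair_counts["great"]["win"])
```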

The positive and negative vocabulary and the PMI and SO calculation are shown below.
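A compact sketch of the PMI and SO calculation is given here. It is not the original code: the vocabularies, the toy documents, the base-2 logarithm, and the choice to give a PMI of 0 to pairs that never co-occur are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def build_pmi(documents):
    """Return the per-term document counts and a pmi(t1, t2) function."""
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for tokens in documents:
        terms = sorted(set(tokens))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))  # keys are sorted pairs

    def pmi(t1, t2):
        joint = pair_counts[tuple(sorted((t1, t2)))]
        if joint == 0:
            return 0.0  # assumption: unseen pairs contribute nothing
        p_joint = joint / n_docs
        p1 = term_counts[t1] / n_docs
        p2 = term_counts[t2] / n_docs
        return math.log2(p_joint / (p1 * p2))

    return term_counts, pmi

def semantic_orientation(term, pmi, positive, negative):
    """Sum of PMI with positive terms minus sum of PMI with negative terms."""
    return (sum(pmi(term, p) for p in positive)
            - sum(pmi(term, n) for n in negative))

POSITIVE = {"good", "great"}  # hypothetical vocabularies
NEGATIVE = {"bad", "sad"}

docs = [
    ["diversity", "great"],
    ["diversity", "good"],
    ["control", "bad"],
    ["control", "sad"],
]
term_counts, pmi = build_pmi(docs)
so = {t: semantic_orientation(t, pmi, POSITIVE, NEGATIVE) for t in term_counts}
print(sorted(so.items(), key=lambda item: item[1], reverse=True))
```

In this toy corpus "diversity" only appears next to positive terms and "control" only next to negative ones, so they end up at opposite ends of the ranking.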

The most positive and negative words:
TOP POS:
[('diversity', 13.247582522786834),
('#usmidtermelections', 11.880054410636202),
('guy', 10.995531628056138),
('beating', 10.647393653845928),
('general', 10.647393653845928),
('function', 10.440942776378503),
('adding', 10.358613835175579),
('diverse', 9.880054410636202),
('goddamned', 8.866248611111173),
('#electionresults2', 8.866248611111173)]
TOP NEG:
[('control', -6.451211111832329),
('sad', -9.451211111832329),
('believe', -9.451211111832329),
('1950', -9.451211111832329),
('blinded', -10.451211111832329),
('https://t.co/nsvwwzd1dx', -10.451211111832329),
('cant', -10.451211111832329),
('feel', -10.451211111832329),
('civilty', -10.451211111832329),
('flow', -10.451211111832329)]

You can see that some words don't seem to match the class they were assigned. This happened because the analysis done here was very simple, looking only at nearby terms. Better unsupervised learning may also require a linguistic analysis of how the sentence is built, in order to identify natural expressions.

To finish, I created a word cloud from the resulting positive terms:


References

This article was mainly based on these two other articles about Twitter data mining and LSTM. Below are some other references and my GitHub repository with both complete projects, including the tweet collection and the code for the word cloud.

My GitHub
Generating WordClouds in Python
LSTM Networks
Yelp dataset