Natural Language Processing

Sentiment Analysis on Tweets with NLP Achieving 96% Accuracy

Full code is available in my GitHub repository. The primary source is available on ODSC.

Michelangiolo Mazzeschi
Jun 6 · 4 min read
Photo by Sara Kurfeß on Unsplash

One of the most challenging applications of AI is NLP (Natural Language Processing), because language differs from every other kind of data. While numerical and image data, for example, have the advantage of being objective, written text is relative: its interpretation varies across cultures, and the same sentence can mean completely different things to two different people.

Of course, data analysts have found working solutions that give a machine a ‘basic understanding’ of the content of a text.

Sentiment Analysis

The first step in learning NLP models is building a sentiment analyzer. Given some text, the AI should be trained to recognize whether its meaning is positive or negative. In practice, you can use this tool to understand the overall customer perception of products or news, especially when there are no numerical measures (such as ratings), only text.

Machine Learning vs. Deep Learning

In my experience, we have to use different tools depending on the complexity of the problem. For movie reviews, given the complexity of every single element, we would need neural networks; for tweets, machine learning alone gives very promising results.

nltk

I will be using a machine learning library specialized for NLP, called nltk. I prefer using scikit-learn for creating machine learning models, but it is a library specialized for tabular data rather than natural language processing.

Steps

In this article, I will follow the following steps:

  1. Importing Modules
  2. Creating Features and Labels (encoding)
  3. Creating train and test (splitting)
  4. Using the model: Naive Bayes Classifier
  5. Performance Estimation

As usual in AI, nothing is standardized: there are several ways of reaching the same result. In a regular NLP pipeline we would need to preprocess the data in this way:

  • Tokenization: splitting sentences into individual words
  • Encoding: converting these individual words to numbers
  • Creating the NLP Model

I will be using dictionaries, therefore I won’t be encoding text into numbers, but rather into Boolean values.

1. Importing Modules

!pip install nltk
import nltk
# download the 'punkt' tokenizer data (needed by word_tokenize, otherwise it raises an error)
nltk.download('punkt')

# tokenizer: converts a sentence into a dictionary {word: True, ...}
def format_sentence(sent):
    return {word: True for word in nltk.word_tokenize(sent)}
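
As a quick sanity check, this is the kind of dictionary format_sentence produces (the sample sentence is my own):

format_sentence("I love this playlist")
# {'I': True, 'love': True, 'this': True, 'playlist': True}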

2. Creating Features and Labels

In this section I import the datasets containing the positive and the negative tweets and preprocess them separately.

The dataset in question contains a sample of 617 positive tweets and 1,387 negative tweets, roughly 2,000 tweets in total.

# positive tweets: X + y
total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/pos_tweets.txt')
Xy_pos = list()
# word tokenization
for sentence in total:
    # print(sentence)
    Xy_pos.append([format_sentence(sentence), 'pos'])
    # saves the sentence in the format: [{tokenized sentence}, 'pos']
# Xy_pos

# negative tweets: X + y
total = open('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/neg_tweets.txt')
Xy_neg = list()
# word tokenization
for sentence in total:
    # print(sentence)
    Xy_neg.append([format_sentence(sentence), 'neg'])
    # saves the sentence in the format: [{tokenized sentence}, 'neg']
# Xy_neg

As a result, I will have a dictionary nested in a list:

[dictionary, sentiment]

If we have a look at the first element of Xy_pos, we can see how our data has been encoded:

Xy_pos[0]
[{"''": True,
"'m": True,
',': True,
'.': True,
':': True,
'Ballads': True,
'Cellos': True,
'Genius': True,
'I': True,
'``': True,
'and': True,
'by': True,
'called': True,
'cheer': True,
'down': True,
'iPod': True,
'listening': True,
'love': True,
'music': True,
'my': True,
'myself': True,
'of': True,
'playlist': True,
'taste': True,
'to': True,
'up': True,
'when': True},
'pos']

3. Creating train and test

We are working in the domain of supervised learning. Unfortunately, compared with the analysis of tabular data, preprocessing works a bit differently, at least with nltk.

If I were analyzing tabular data, I would split my data into:

X_train, y_train, X_test, y_test

In this particular case I need to merge labels and features, leaving me with:

Xy_train, Xy_test

The reason for this change is that the nltk classifier accepts a single training parameter: X_train and y_train merged into a list of (features, label) pairs, in our case Xy_train.

Splitting

To create my train and test portions, I have to merge pos and neg (each contains both training and test examples) and then split the combined dataset.

def split(pos, neg, ratio):
    # the first (1 - ratio) share of each class goes to training,
    # the rest to testing, so the two portions do not overlap
    train = pos[:int((1-ratio)*len(pos))] + neg[:int((1-ratio)*len(neg))]
    test = pos[int((1-ratio)*len(pos)):] + neg[int((1-ratio)*len(neg)):]
    return train, test

Xy_train, Xy_test = split(Xy_pos, Xy_neg, 0.1)
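
As a quick sanity check on the split (with a ratio of 0.1, roughly 90% of each class ends up in the training portion and the remaining 10% in the test portion):

print(len(Xy_train), len(Xy_test))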

4. Using the model: Naive Bayes Classifier

It is now time to create the Machine Learning model.

from nltk.classify import NaiveBayesClassifier

# the features are already encoded through dictionaries
classifier = NaiveBayesClassifier.train(Xy_train)
classifier.show_most_informative_features()

Most Informative Features
          no = True              neg : pos    =     20.6 : 1.0
     awesome = True              pos : neg    =     18.7 : 1.0
    headache = True              neg : pos    =     18.0 : 1.0
   beautiful = True              pos : neg    =     14.2 : 1.0
        love = True              pos : neg    =     14.2 : 1.0
          Hi = True              pos : neg    =     12.7 : 1.0
         fan = True              pos : neg    =      9.7 : 1.0
       Thank = True              pos : neg    =      9.7 : 1.0
        glad = True              pos : neg    =      9.7 : 1.0
        lost = True              neg : pos    =      9.3 : 1.0

The model has associated a value with each word in the dataset. It combines the contributions of all the words contained in a tweet it has to analyze, and then makes its estimation: positive or negative.
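
To try the classifier on new text, pass any sentence through format_sentence and then to classify (the example sentences below are my own):

print(classifier.classify(format_sentence("I love this, it made my day")))
# expected: 'pos'
print(classifier.classify(format_sentence("this gave me a headache")))
# expected: 'neg'

# prob_classify also exposes the probability assigned to each label
probs = classifier.prob_classify(format_sentence("I love this, it made my day"))
print(probs.prob('pos'), probs.prob('neg'))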

5. Performance Estimation

As we can see, we obtained (rounding up) 96% accuracy on the test set. An amazing result!

from nltk.classify.util import accuracy
print(accuracy(classifier, Xy_test))
0.9562326869806094
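
If you want to see where this number comes from, the accuracy helper essentially classifies every test example and counts how many predicted labels match the true ones. A minimal hand-rolled sketch (my own, not from the original post):

# classify every test example and count the matches
correct = sum(1 for features, label in Xy_test
              if classifier.classify(features) == label)
print(correct / len(Xy_test))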
