Sentiment Analysis Using Python and NLTK
How are we going to be doing this?
Python, being Python, has some remarkable libraries at hand on top of its incredible readability. One of them is NLTK. NLTK, or the Natural Language Toolkit, is one of the best Python NLP libraries out there. The functionality it leaves at your fingertips, while maintaining its ease of use and, again, readability, is just fantastic.
In fact, we’re going to complete this mini project in under 25 lines of code. And you’re most probably going to understand each line as you read through it. Crazy, I know.
Let’s get right into it!
1. IDE
Personally, whenever I’m doing anything even remotely fancy in Python, I use JupyterLab. Being able to see what each line does makes it really easy to debug, and it’s also strangely therapeutic. Shrugs.
But you’re free to use whatever you want. It’s a free world. Mostly.
2. Dependencies
Now, we’ve got to get hold of the libraries we need. Just four super-easy-to-get libraries.
- NLTK
- Numpy
- Pandas
- Scikit-learn
To install NLTK, run the following in the terminal
pip install nltk
To install Numpy, run the following in the terminal
pip install numpy
To install Pandas, run the following in the terminal
pip install pandas
To install Scikit-learn, run the following in the terminal
pip install scikit-learn
So intuitive. I mean, come on, it really can’t get any easier.
Time to code
First things first. Let’s import NLTK.
import nltk
Now, there’s a slight hitch. I did say 4 dependencies, didn’t I? OK, here’s the last one, I swear. But this one’s programmatic.
nltk.download('vader_lexicon')  # one time only
This is going to go ahead and grab, well, the vader_lexicon.
What is this ‘VADER’?
The official NLTK page for VADER actually hosts the code rather than an explanation of VADER, which, by the way, does not refer to Darth Vader. Very sad, I know.
It actually stands for Valence Aware Dictionary and sEntiment Reasoner. It’s basically going to do all the sentiment analysis for us. So convenient. I mean, at this rate jobs are definitely going to be vanishing faster. (No, I’m kidding.)
The way this magical download works is by mapping each word you pass in to lexical features with emotional intensities. In English, since you ask, that means it looks the word up in a dictionary of, let’s just call them sentiment-bearing words for now, and gives it a score. A sentiment score, to be precise.
So now that each word has a sentiment score, the score of a paragraph of words is going to be, you guessed it, derived from the sum of all the sentiment scores. Shocking, I know.
Now, you might be thinking: OK, fine, it goes ahead and gets the score of each word. But does it understand context? Like, for example, the difference between ‘did work’ and ‘did not work’?
DUH !!!
I mean otherwise why would it be ‘one of the best’ ?
Another really important thing to keep in mind is that VADER actually pays attention to capitalization and exclamation marks. It will give ‘The movie was AWESOME!!!’ a higher positive score than ‘The movie was AWESOME’, which in turn beats ‘The movie was awesome’.
That’s it class, theory’s over.
Now, back to business
Let’s now import the downloaded VADER module.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
and then make an instance of the SentimentIntensityAnalyzer, like this:
vader = SentimentIntensityAnalyzer()  # or whatever you want to call it
By now your code should just be the import, the download, and the analyzer instance. Upon running it, you may see a warning. If you get the same warning as me, don’t worry; it’s basically telling you that the Twitter support package for NLTK (twython) is not installed, so you won’t be able to tap into that functionality.
Now let’s try out what this ‘VADER’ can do. Write the following and run it:
sample = 'I really love NVIDIA'
vader.polarity_scores(sample)
So, it was 69.2% positive. Which might not be perfect, but it definitely gets the job done, as you’ll see.
In case you’re wondering, the compound value isn’t simply the average of the negative, neutral and positive values; it’s the sum of the individual word valences, normalized to lie between -1 (most negative) and +1 (most positive).
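If you’re curious, that normalization squashes an unbounded summed valence into the open interval (-1, 1); in VADER’s source the normalizing constant is 15. Here’s a toy sketch of just that mapping (the function name is my own, purely illustrative):

```python
import math

def vader_normalize(score, alpha=15):
    # Map an unbounded summed valence onto the open interval (-1, 1).
    # alpha=15 matches the constant used in VADER's implementation.
    return score / math.sqrt(score * score + alpha)

print(vader_normalize(0))     # neutral text stays at 0.0
print(vader_normalize(3.1))   # roughly what one strongly positive word yields
print(vader_normalize(-3.1))  # symmetric for negative valence
```

Notice that no matter how extreme the raw sum gets, the compound score never quite reaches -1 or +1.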
Now, try this
sample = "I really don't love NVIDIA"
vader.polarity_scores(sample)
54.9% negative, whew, by the skin of its teeth.
Now let’s work on some real world data
Here’s a file with Amazon reviews of a product, from which we’re going to be extracting sentiments. Go ahead and download it, and ensure that it’s in the same directory as the Python file you’re working on. Otherwise, remember to add the correct path to it.
We’re going to be needing both pandas and NumPy now.
import numpy as np
import pandas as pd

df = pd.read_csv('wherever you stored the file.tsv', sep='\t')
df.head()
In the above code, we’ve read the file into a pandas DataFrame and called df.head() to view the first five rows.
This dataset already has all the reviews categorized as positive or negative. This is just for you to cross-check the values you get back from VADER and calculate your metrics.
To see how many positive and negative reviews we have, type in the following
df['label'].value_counts()
Let’s try one of the reviews out, shall we?
But before we do that, let’s ensure that our dataset is nice and clean, i.e., ensure that there aren’t any blank rows.
df.dropna(inplace=True)

empty_objects = []
for index, label, review in df.itertuples():
    if type(review) == str:
        if review.isspace():
            empty_objects.append(index)

df.drop(empty_objects, inplace=True)
This little bit of housekeeping will drop any blank rows. The
inplace=True
argument ensures that the DataFrame keeps the changes made by dropping any blank rows, rather than cheekily throwing them away despite all our effort. Very much like a commit in Git.
As it happens, this particular dataset has no empty rows, but still, it doesn’t hurt to be careful.
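Here’s a tiny, self-contained illustration of what that cleaning step does, on a made-up three-row DataFrame (the data is invented just for this demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['pos', 'neg', 'pos'],
                   'review': ['Great product', np.nan, '   ']})

df.dropna(inplace=True)  # removes the row whose review is NaN, in place

# Collect the indices of reviews that are only whitespace, then drop them.
empty_objects = [index for index, review in df['review'].items()
                 if isinstance(review, str) and review.isspace()]
df.drop(empty_objects, inplace=True)

print(df)  # only the 'Great product' row survives
```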
Currently there are a couple of problems:
- We can’t compare the extracted sentiment to the original sentiment, as doing that for each review by hand is time-consuming and, quite frankly, completely caveman.
- The extracted sentiment is just printed out, which, in my opinion, is plain flimsy.
Let’s fix it.
Let’s add the sentiment to the dataframe alongside its original sentiment.
df['scores'] = df['review'].apply(lambda review: vader.polarity_scores(review))
df.head()
The above code will create a new column called ‘scores’ which will contain the extracted sentiments.
But currently the ‘scores’ column holds raw score dictionaries, which we can’t really compare programmatically with the ‘label’ column, so let’s find a workaround.
Let’s use the compound value.
If the compound value is greater than 0, we can safely say that the review is positive; otherwise it’s negative. Great! Let’s implement that now!
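A sketch of that thresholding step might look like this. I’m standing in a couple of handmade rows for the real ‘scores’ column, and the ‘compound’ and ‘predicted’ column names are my own choices:

```python
import numpy as np
import pandas as pd

# Stand-in rows: dicts shaped like vader.polarity_scores() output.
df = pd.DataFrame({
    'label': ['pos', 'neg'],
    'scores': [{'neg': 0.0, 'neu': 0.3, 'pos': 0.7, 'compound': 0.69},
               {'neg': 0.55, 'neu': 0.45, 'pos': 0.0, 'compound': -0.55}],
})

# Pull the compound value out of each score dict, then threshold at 0.
df['compound'] = df['scores'].apply(lambda s: s['compound'])
df['predicted'] = np.where(df['compound'] > 0, 'pos', 'neg')

print(df[['label', 'predicted']])
```

On the real dataset, the same two lines run unchanged once the ‘scores’ column exists.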
Well then, let’s check our score now, shall we?
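One way to score it, finally putting that scikit-learn install to use, is accuracy_score, which compares a column of true labels against a column of predictions. The four-row data here is invented just to show the call:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true labels vs. VADER-derived predictions for four reviews.
df = pd.DataFrame({'label':     ['pos', 'neg', 'pos', 'neg'],
                   'predicted': ['pos', 'neg', 'neg', 'neg']})

print(accuracy_score(df['label'], df['predicted']))  # 0.75
```

scikit-learn’s classification_report and confusion_matrix take the same two arguments, if you want a more detailed breakdown.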
There’s definitely room for improvement. But do keep in mind that we got this score without making any changes to VADER, and without writing any custom code to figure out the sentiment ourselves.
Alright then, if you have any queries feel free to post them in the comments and I’ll try to help out ! Peace.