Sentiment Analysis on Reddit Tech News with Python
A quick guide to sentiment analysis with NLTK on the subreddit r/technews.
Sentiment Analysis is the process of determining whether a piece of text is considered to be positive, negative, or neutral.
It’s an application of Natural Language Processing that has tons of use cases.
As stated in Wikipedia:
Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
Imagine you’re a business owner, and you have over 10,000 product reviews for your product. You want to know what your customers think about your product, but you don’t have the time to sift through them one by one.
With sentiment analysis, you can automate that process or even set up real-time monitoring to deal with feedback swiftly.
Below is an example of sentiment analysis in action on product reviews.
To showcase how you can perform sentiment analysis in Python, in this article, I will use the PRAW library to interact with the Reddit API to grab posts from the subreddit technews.
Then, I’ll use the NLTK library, specifically its VADER sentiment analyzer, to score the sentiment of the post titles.
As always, here’s where you can find the code for this article:
This post was inspired by the article “Sentiment Analysis on Reddit News Headlines with Python’s Natural Language Toolkit (NLTK)” on learndatasci.com.
Create a Reddit application
The first step is to create a Reddit app. To do so, you would first need a Reddit account. If you don’t have one, you can register one here.
After you’re logged in, head over to reddit.com/prefs/apps, and you will see this interface.
There are 3 essential things you need to do:
1. select the script option
2. name: your_reddit_username
3. redirect url: http://localhost
After that, you can hit create app, and in the upper left corner, you will see something like this.
From the above image, what you want to note down is the client_id and client_secret, which you’ll use to build a Reddit client.
Now that you have the credentials, we can move on to the code!
Load Libraries
First things first, we import all the necessary libraries for this project.
- pprint: a data pretty printer that outputs data structures in a cleaner format.
- itertools: iterators for efficient looping, one of which is chain, which I used to join multiple lists together into a single list.
- NLTK: Natural Language Toolkit, an open-source Python library for NLP, containing a set of text processing libraries for classification, tokenization, stemming, and tagging.
- PRAW: the Python Reddit API Wrapper, which allows you to interact with the Reddit API using Python.
Downloading NLTK’s databases
nltk.download() is used to download a particular dataset/model. For this article, there are three things to download.
- vader_lexicon: the lexicon of sentiment-rated words and symbols that powers VADER sentiment analysis.
- punkt: pre-trained models that help us tokenize sentences.
- stopwords: a dataset of common stopwords in English.
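The download step can be sketched roughly like this. The helper name `download_nltk_resources` is made up for illustration, and the import sits inside the function only so the snippet loads even where NLTK is not installed. Newer NLTK releases may ask for `punkt_tab` in place of `punkt`, so treat the identifiers as assumptions to check against your installed version.

```python
# Resource identifiers named in the text; each one is passed to nltk.download().
NLTK_RESOURCES = ("vader_lexicon", "punkt", "stopwords")


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource (hypothetical helper, not part of NLTK)."""
    import nltk  # third-party: pip install nltk

    for name in resources:
        # quiet=True suppresses the progress output of the downloader
        nltk.download(name, quiet=True)


# Run once per environment (needs network access):
# download_nltk_resources()
```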
With that, we can set up the client.
Setting up Reddit client
With the credentials you generated earlier, you can build the client: pass in your Reddit username as the user_agent and the rest as follows. Note that check_for_async is set to False just so that it won’t generate warnings later on.
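Here is a minimal sketch of that setup. The function names `reddit_settings` and `make_reddit_client` are hypothetical helpers, and the credential strings are placeholders, not real values. The only PRAW call used is the `praw.Reddit` constructor with the keyword arguments mentioned above.

```python
def reddit_settings(client_id, client_secret, username):
    """Collect the keyword arguments for praw.Reddit in one place."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "user_agent": username,    # the article passes the Reddit username here
        "check_for_async": False,  # avoids PRAW's async-environment warnings
    }


def make_reddit_client(settings):
    """Build the client; imported lazily so this file loads without PRAW."""
    import praw  # third-party: pip install praw

    return praw.Reddit(**settings)


# Placeholders only: substitute the values from reddit.com/prefs/apps.
settings = reddit_settings("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "your_reddit_username")
# reddit = make_reddit_client(settings)
```

Keeping the credentials out of the source file (for example in environment variables) is a sensible next step if you plan to share the code.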
Selecting subreddit and sorting type
As mentioned in the subtitle of this article, we’ll be scraping the subreddit r/technews, but you can analyze any subreddit you want by replacing 'technews' with the subreddit name of your choosing.
Here I’m getting the top posts of all time, and I set the limit to None to get the maximum number of posts possible (the cap is 1,000 posts).
You can find more options, such as sorting by new, hot, rising, etc., in PRAW’s quick start guide.
Notice the * symbol. This is known as the star expression, and it unpacks iterables. In this case, it unpacks the output generated by the function into a list.
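To see the star expression in isolation, here is a small sketch that uses a plain generator as a stand-in for the lazy listing PRAW returns. `fake_listing` is invented for this example, and the commented-out PRAW call at the end is only an approximation of what the real line might look like.

```python
def fake_listing(n):
    """Stand-in for a lazy PRAW listing: yields items one at a time."""
    for i in range(n):
        yield f"post {i}"


posts = [*fake_listing(3)]          # star expression unpacks the generator into a list
same_posts = list(fake_listing(3))  # equivalent spelling

print(len(posts))  # 3

# With a real client, the analogous line would be along the lines of:
# posts = [*reddit.subreddit("technews").top(time_filter="all", limit=None)]
```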
Printing the length tells us we obtained a total of 967 posts.
Grabbing the first post we scraped by indexing 0, you can see that you can get various kinds of information from the post: number of upvotes, date and time, number of comments, total upvotes, and number of awards given.
You can run vars on the first post object to see all the information contained within a single post (warning: the output is huge).
For this article, we only need the title, so what we’ll do is extract the title for each post and dump it into a list.
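A rough sketch of that extraction, using `SimpleNamespace` objects as stand-ins for PRAW posts so it runs without a Reddit connection. The headlines are placeholders, and the column name in the final comment is an assumption.

```python
from types import SimpleNamespace

# Stand-ins for PRAW post objects; only the .title attribute matters here.
posts = [
    SimpleNamespace(title="Example headline one", score=120),
    SimpleNamespace(title="Example headline two", score=45),
]

# vars(posts[0]) would show every attribute stored on the first post.
headlines = [post.title for post in posts]
print(headlines)

# A data frame could then be built with something like:
# df = pandas.DataFrame(headlines, columns=["title"])
```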
With this list of headlines, we can now form a Pandas data frame.
Going to the subreddit on Reddit, you can see we grabbed the post titles!
With over 900 post titles in a data frame, it’s time for some sentiment analysis!
Sentiment Analysis with VADER
What is VADER?
According to their GitHub:
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.
In other words, it’s a ready-made sentiment analysis tool that needs no training on your part. It relies on the vader_lexicon dataset we downloaded earlier, which maps lexical features (words, emoticons, and so on) to sentiment scores.
When given a string of words, VADER returns a dictionary containing four scores:
- neg: negative
- neu: neutral
- pos: positive
- compound: a single overall score, normalized to lie between -1 (most negative) and +1 (most positive)
Below you see examples of VADER in action.
Notice that the words ‘awesome’ and ‘bad’ skew towards positive and negative polarity, respectively.
Also, the intensity of emotion is taken into account: capitalizing the word ‘awesome’ and adding an exclamation mark both increase the positive score.
You can view more examples on their GitHub.
Now that you know a little bit about what VADER is and what it can do, let’s apply it to our data frame.
With the scores calculated as dictionaries, we create a data frame using from_records and then concatenate it to our data frame of titles with an inner join.
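One way to sketch the scoring step is to keep the scorer pluggable. Below, `vader_scorer` wraps NLTK's real `SentimentIntensityAnalyzer`, while `fake_scorer` is a made-up stand-in that returns fixed numbers so the snippet runs without NLTK or the lexicon. Both helper names and the `title` key are assumptions for illustration.

```python
def vader_scorer():
    """Return VADER's scoring function (needs the vader_lexicon download)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer  # third-party

    return SentimentIntensityAnalyzer().polarity_scores


def score_titles(titles, scorer):
    """Build one flat record per title: the title plus its score dictionary."""
    return [{"title": title, **scorer(title)} for title in titles]


def fake_scorer(text):
    """Stand-in scorer with fixed values; swap in vader_scorer() for real scores."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


records = score_titles(["Example headline"], fake_scorer)
print(records)

# pandas.DataFrame.from_records(records) would turn these into a data frame.
```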
Now that we have the scores, the next step is to choose a threshold to label the text as positive, negative, or neutral.
Choosing the threshold
The VADER GitHub readme tells us that the typical threshold is 0.05 (a compound score of at least 0.05 counts as positive, and at most -0.05 as negative). But following this article, which also did sentiment analysis on news headlines, I’ll use the value 0.2.
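The labeling rule can be sketched as a small function on the compound score. The name `label_sentiment` and the choice of strict inequalities at the boundary are assumptions; only the 0.2 threshold comes from the text.

```python
def label_sentiment(compound, threshold=0.2):
    """Map a VADER compound score to a label using a symmetric threshold."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


for score in (0.6, -0.45, 0.1):
    print(score, label_sentiment(score))
```

With a data frame, the same function could be applied to the compound column to produce a label column.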
VADER on individual words
If you’re curious about how VADER ended up labeling the sentiment of the titles, here’s a broken-down version that shows which words it categorizes as positive, neutral, and negative.
Notice there were no positive words in this sentence, and there were three negative words. Since there are more negatives than positives, it makes sense that this was labeled as negative.
If you want to go a step further and learn how the compound score is calculated, check out this StackOverflow post.
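As a rough idea of the final step in that calculation: to my understanding of VADER's reference implementation, the summed word valences are squashed into the range -1 to 1 with the formula below, using a constant of 15. Treat both the formula and the constant as assumptions to verify against the source, and note that this sketch skips all of VADER's rule-based adjustments (capitalization, punctuation, negation, and so on).

```python
import math


def normalize(raw_sum, alpha=15):
    """Squash a summed valence score into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Illustrative inputs only, not values looked up in the lexicon.
for raw in (0.0, 3.1, -100.0):
    print(raw, round(normalize(raw), 4))
```

The larger the summed valence, the closer the result gets to 1 or -1 without ever reaching it.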
Now that we have our labels, we can do a quick value count on each label.
With our selected threshold, we have mostly neutral titles and more negative titles than positive titles.
Are the labels accurate?
Taking random samples of each label and using a custom function that outputs the news titles, we can get a sense of how well our threshold performs in categorizing news as positive, neutral, and negative.
From the output, the labels seem to be pretty accurate.
A side tangent: sentiment analysis usually makes more sense when applied to a “target subject”, such as reviews of a book or comments on a YouTube video. News headlines, on the other hand, are pretty descriptive and neutral, so sentiment analysis on them might be misleading.
Let’s now move on to tokenization.
Tokenization
What is it?
Tokenization is the process of breaking down a piece of text into smaller components known as tokens. A token can be a word, a part of a word, or any character such as punctuation, a symbol, or even an emoji 🤯.
Why do we do it?
Tokenization builds the foundation for any NLP task, as these tokens provide context and help computers interpret the meaning of the text. Different kinds of tokens can serve different purposes, but the main idea is to turn text into a form that computers can use.
You can use many different tools to tokenize strings, but NLTK already has a set of tokenizers we can utilize.
NLTK tokenizers
NLTK has many built-in tokenizers that you can use for specific purposes.
A few notable tokenizers are:
- word_tokenize: splits a string into words, separating out punctuation other than periods
- sent_tokenize: splits a string into sentences
- RegexpTokenizer: splits a string based on a regular expression
- more in their documentation
Above, you can see an example of a text being split by the tokenizers.
Notice how each of the tokenizers works differently based on how it splits the text.
The first one splits by punctuation, which splits the word “Let’s” into "Let" and "'s", whereas the second one, which splits by whitespace, keeps the word Let's intact. As for the last one, splitting by word results in the punctuation being removed.
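The whitespace and word-based behaviors can be approximated with the standard library alone. The snippet below is a stand-in, not NLTK itself: `str.split` plays the role of a whitespace tokenizer and `re.findall(r"\w+", ...)` plays the role of a regex tokenizer with the pattern `\w+`. The sample sentence is made up, and `word_tokenize` is not reproduced here.

```python
import re

text = "Let's see how tokenizers split this sentence, shall we?"

# Splitting on whitespace keeps punctuation attached to the words.
whitespace_tokens = text.split()

# Matching runs of word characters drops punctuation entirely.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(word_tokens)
```

Note that the word-based version also breaks "Let's" into "Let" and "s", which is one reason extra preprocessing is often needed.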
One thing that comes up when you learn about tokenization is stop words. They’re basically the most common words in the English language, and we remove them so we can focus on more important features (words) instead.
By downloading the ‘stopwords’ database with NLTK earlier on, we have access to a total of 179 stop words, which we will filter out of our text.
Custom tokenize
In some cases, you would also do further preprocessing to get the result that you want.
In this function, I remove single quotes so that words like “Let’s” become “Lets”, and I also remove hyphens so the word “covid-19” becomes “covid19” instead of being split into “covid” and “19”.
Note: I removed the single quote because I’m only using the tokens for visualization. If you decide to use the tokens to build a model, this would destroy the meaning behind the original words, e.g. turning it’s into its, which are two different things.
The text is also lowercased, and stop words are filtered out with a list comprehension.
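Those steps can be sketched with the standard library as follows. This is an approximation rather than the original function: `re.findall` stands in for an NLTK tokenizer, and the tiny `demo_stop_words` set is a made-up replacement for NLTK's English stop word list.

```python
import re


def custom_tokenize(text, stop_words):
    """Lowercase, strip quotes and hyphens, split into words, drop stop words."""
    text = text.lower()
    text = re.sub(r"['’\-]", "", text)  # "let's" -> "lets", "covid-19" -> "covid19"
    tokens = re.findall(r"\w+", text)   # keep runs of word characters only
    return [token for token in tokens if token not in stop_words]


demo_stop_words = {"about", "the", "a"}  # placeholder for NLTK's stopwords list
print(custom_tokenize("Let's talk about Covid-19", demo_stop_words))
# ['lets', 'talk', 'covid19']
```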
Using Pandas’ nifty apply function, we can apply our custom function to each title in our data frame.
The tokens object is a nested list (multiple lists within a list). Since we want all the words in a single list, the chain function from the itertools library helps us do exactly that.
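Here is what that flattening looks like on a toy nested list. The tokens are invented for the example.

```python
from itertools import chain

# One inner list of tokens per title (made-up data).
tokens = [["apple", "launches"], ["google", "ban"], ["tesla"]]

words = list(chain(*tokens))
print(words)  # ['apple', 'launches', 'google', 'ban', 'tesla']

# chain.from_iterable(tokens) gives the same result without unpacking the outer list.
same_words = list(chain.from_iterable(tokens))
```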
The end result is two lists, containing the words of titles that were labelled as positive and negative.
Visualize tokens
Top 20 words
With our list of words, we can use NLTK’s built-in FreqDist class as a counter for the words within our list, and its most_common method to return the top words based on the count.
From our list of positive words, we see that “apple” and “google” are the top words. Notice how the numbers 5 and 000 are present in our list. They can also be filtered out with more preprocessing if you want.
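FreqDist is built on top of Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The word list below is made up, and dropping purely numeric tokens with `str.isdigit` is just one possible way to do the extra filtering.

```python
from collections import Counter

# Made-up tokens, including the kind of numeric leftovers mentioned above.
words = ["apple", "google", "apple", "5", "000", "apple", "google", "new"]

# Optional extra preprocessing: drop tokens made up only of digits.
filtered = [word for word in words if not word.isdigit()]

counts = Counter(filtered)
top_two = counts.most_common(2)
print(top_two)  # [('apple', 3), ('google', 2)]
```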
Usually, when visualizing tokens, a better option is to use word clouds, as the size of the words correlates with their count, so you have a better idea of which words are important.
Word clouds
Here are the word clouds generated for the positive and negative word lists.
From the positive word cloud, we can guess what kind of positive news these words came from.
The words “Apple” and “Google” could come from news about good deeds by the big tech companies.
We also see the words “Elon Musk”, “Tesla”, and “SpaceX” amongst the top positive words, most likely from news about technological advancements or maybe Elon’s philanthropic work.
To find out the exact news, I wrote up a function to extract the titles.
When given the words Elon Musk, these titles were extracted.
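A function like that might look roughly like the sketch below. The name `titles_containing`, the case-insensitive matching, and the placeholder headlines are all assumptions, not the original implementation.

```python
def titles_containing(titles, phrase):
    """Return the titles that mention the phrase, ignoring case."""
    phrase = phrase.lower()
    return [title for title in titles if phrase in title.lower()]


# Placeholder headlines, invented for the example.
titles = [
    "Placeholder headline mentioning Elon Musk",
    "Placeholder headline about something else",
    "Another placeholder with elon musk in lowercase",
]

for title in titles_containing(titles, "Elon Musk"):
    print(title)
```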
Now let’s have a look at the negative words.
At first glance, we can tell the big tech companies are more prominent in the negative words, along with the words “ban”, “internet”, “data”, and “Trump”. This suggests the news was about Donald Trump being banned by social media companies.
In this word cloud, negative words are also more evident, with words like “fake”, “misinformation”, “lawsuit”, “hacked”, “attack”, and “blocking” popping up.
Extracting the titles containing the word “Facebook” confirms it: sure enough, the news was about him being banned.
Notice that the second title, despite being positive news, is labeled as negative because of the words “banning” and “misinformation”, which shows you a limitation of VADER.
There you go! You scraped Reddit tech news headlines, did sentiment analysis on them, tokenized the titles, and generated word clouds!
This was just a glimpse into what NLTK can achieve in terms of NLP, and there are definitely improvements you can make to the sentiment analysis to label the posts more accurately.
If you want to know more, I listed a few articles below for you to dive deeper into this topic!
That’s all for this article, and I hope you learned something new from it!
Thanks for reading 😉 !
Links
Further readings
- Sentiment Analysis: A Definitive Guide
- Simplifying Sentiment Analysis using VADER in Python (on Social Media Text) by Parul Pandey
- Tokenization in NLP — Types, Challenges, Examples, Tools
- How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)