Sentiment Analysis on Twitter Data with R

Patrick Gichini
MindNinja
Mar 29, 2019 · 5 min read

Hello fellow ninjas!

Today is a beautiful day to do some cool data stuff.

In the spirit of nerd-ness and great boredom alleviation, we are going to fetch some streamed tweets and do some sentiment analysis on them using R. That sounds like a plan.

R is a statistical programming language that is just awesome when it comes to machine learning. With lots of packages for doing almost anything, it’s one of the easiest languages to learn, yet very powerful when it comes to actually achieving cool stuff. It does require some knowledge of math to use it to its full potential, but it’s still usable at the noob level, often with very fascinating results.

In the most layman way I can muster up, Sentiment Analysis (which is what we’ll be doing today) is the art of taking text data, analyzing it, and then classifying it as either positive or negative. We will be doing this on Twitter data.

The first thing you’ll need to have is R.

Here is a link to a very nicely explained way to do it on Windows, Mac, and Linux. You can also use R with Visual Studio Code or Sublime Text, but you’ll have to install a bunch of extra packages.

The next thing you’ll need to do is create a Twitter app that’ll allow you to connect to the Twitter API. You can follow this tutorial right here.

Update: I’m not sure, but I think Twitter went ahead and started charging for making these apps. They used to be free. I tried to make extra ones but I couldn’t. Somebody, please confirm this.

Now that we are all set with R and the Twitter app, we can get our hands dirty.

Open up RStudio, navigate to your working directory and create a new file: sentiments.R

To do the magic we are intending, you’ll need a couple of libraries:

  1. rtweet: which allows you to connect to Twitter and fetch the data
  2. dplyr: which provides tools for manipulating datasets
  3. tidytext: which allows conversion of text to and from tidy formats. [I would recommend installing the whole tidyverse. It will come in handy with all your R magic stuff]
  4. ggplot2: which is a super awesome plotting library

To install the packages, simply run:

install.packages('<package name>')
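For example, to grab all four packages used in this post in one go:

# install the packages used in this walkthrough
install.packages(c("rtweet", "dplyr", "tidytext", "ggplot2"))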

To install the rtweet development version, run:

## install the remotes package if it's not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

## install dev version of rtweet from github
remotes::install_github("mkearney/rtweet")

Load all your libraries then create a connection token as a global variable:

# load all required libraries
library(rtweet)
library(dplyr)
library(tidytext)
library(ggplot2)

# create the connection token
token <- create_token(
  app = "smugninja",
  consumer_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
  access_token = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  access_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

Copy your tokens and keys and paste them in their respective slots.
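By the way, if you’d rather not paste your keys directly into the script, one option is to keep them in environment variables (for example in your .Renviron file) and read them with Sys.getenv(). A minimal sketch; the variable names below are just placeholders I made up:

# read credentials from environment variables instead of hardcoding them
# (the TWITTER_* names are placeholders; use whatever you set in .Renviron)
token <- create_token(
  app = "smugninja",
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_SECRET"))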

Then, we’ll define what kind of tweets we want to fetch from Twitter. We can do this by defining what keywords we want to look out for.

query <- "Climate,meme,challenge,Climatechange,trashtag"

With our query defined, we set the amount of time we want to stream. I’ll set mine to thirty minutes.

#define stream period
streamtime <- 30 * 60

We’ll also need to define where to store this data and then start the stream:

#define storage file for streamed content
filename <- "stream.json"
#start the stream and store in a df
streamdata <- stream_tweets(q = query, timeout = streamtime, file_name = filename)

The stream will run for however long you specified in streamtime.
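Since the raw tweets are also written to stream.json, you don’t have to stream all over again if your session dies. A minimal sketch, assuming your rtweet version still exports the parse_stream() helper:

# re-load a previously saved stream file into a data frame
streamdata <- parse_stream("stream.json")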

Twitter data comes in dirty, so you’ll always need to clean it up before you can analyze it.

The first thing we can do is remove links. There are better ways to do this with tidy tools, but I’m too lazy to debug the errors I’m getting, so I’ll just do it old-school style:

# clean up links from the data
streamdata$clean_text <- gsub("http.*", "", streamdata$text)
streamdata$clean_text <- gsub("https.*", "", streamdata$clean_text)

I will also strip out punctuation and lowercase everything by tokenizing the text into individual words:

# tokenize into one word per row; this also strips punctuation and lowercases
streamdata_clean <- streamdata %>%
  dplyr::select(clean_text) %>%
  unnest_tokens(word, clean_text)

While doing sentiment analysis, and other types of analysis too, one has to consider stop words. Stop words are the most commonly used words in a language, and they tend to skew analysis. The best practice is to remove them.

First, I load the stop words:

data("stop_words")

Then, I count the number of words in our dataset. This is not a must; I just do it to get a picture of what is changing.

# count the number of words
nrow(streamdata_clean)

Keep that number in mind. Then I remove all the stop words and count the words again.

# remove all stop words
streamdata_cleanwords <- streamdata_clean %>%
  anti_join(stop_words)

# count the words again
nrow(streamdata_cleanwords)

You’ll notice the number of words reduces by close to half. Now, the data is clean enough for some very simple analysis.

The first thing we can do is plot a graph of the top 10 most mentioned words from our stream.

streamdata_cleanwords %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Frequency of Words",
x = "words",
title = "Top 10 most used words in tweets",
subtitle = "Stop words have beenremoved")

The graph will look like:

To do the sentiment analysis, we need to understand a few things. One of the most popular ways of doing sentiment analysis is by using lexicons. A lexicon is essentially a collection of words, with each word assigned a sentiment such as positive or negative (or a numeric score). To determine whether a piece of text leans negative or positive, you can sum up the values of all the words it contains and see which side it falls on.

There are three general purpose lexicons:

  1. AFINN
  2. bing
  3. nrc

We won’t get much into the lexicons here; you can read up on them later.
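To make the idea concrete, here is a toy sketch of lexicon-based scoring on a made-up sentence (the sentence is my own; since the bing lexicon labels words as positive or negative rather than giving numeric scores, we simply count the labels):

# toy example: score one made-up sentence with the bing lexicon
library(dplyr)
library(tidytext)

toy <- data.frame(text = "what a horrible, ugly day to write beautiful code",
                  stringsAsFactors = FALSE)

toy %>%
  unnest_tokens(word, text) %>%                        # lowercase, strip punctuation, one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  count(sentiment)                                     # tally positive vs negative words
# expect roughly: 2 negative (horrible, ugly) vs 1 positive (beautiful)

More negative words than positive, so this sentence leans negative.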

For this small exercise, we’ll use bing. Below is the code for the analysis:

sentiments_tweet <- streamdata_cleanwords %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("red2", "green3")) +
  facet_wrap(~sentiment, scales = "free_y") +
  ylim(0, 2500) +
  labs(y = NULL, x = NULL) +
  coord_flip() +
  theme_minimal()

Plot the graph by running:

sentiments_tweet

This will give you a plot of the 10 most positive and negative words used in your dataset:
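If you also want a single overall figure rather than a per-word plot, one option (not part of the original walkthrough) is to count how many positive and negative words the whole stream contains and take the difference, using the objects we already built above:

# overall sentiment of the stream: positive word count minus negative word count
streamdata_cleanwords %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = sum(sentiment == "positive") - sum(sentiment == "negative")
  )

A positive net value means the stream leans positive overall; a negative one means it leans negative.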

Our analysis is done! Now you can marvel at your genius over a bucket of nuggets.

All the code for this exercise can be found on this GitHub repo.
