Analyzing Tweets with R

Arindam Mitra
Feb 3, 2023


With the twitteR Package

This is a simple tutorial on reading tweets and analyzing them in R. The twitteR package provides a convenient way to do this.

Reproduced from my old blog, January 2014.

Authentication with ROAuth

It’s mandatory to use OAuth to access Twitter programmatically; we use the ROAuth package for this.
We must first create a Twitter Application and get the keys and URLs required for making a connection.

Twitter Application

You may follow the step-by-step instructions as described in twitteR documentation on CRAN.
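
Once your application is created, copy its consumer key and secret into R. A minimal sketch with hypothetical placeholder values (substitute your own credentials):

# Hypothetical placeholders; replace with the values from your Twitter Application
yourconsumerKey <- "YOUR_CONSUMER_KEY"
yourconsumerSecret <- "YOUR_CONSUMER_SECRET"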

Next, authenticate the application so it can interface with R. An OAuth object, cred, is created with all the information needed for authentication.


library(ROAuth) # For OAuth in R

# Use the keys and URLs from Twitter Application to create an OAuth object
cred <- OAuthFactory$new(consumerKey=yourconsumerKey,
                         consumerSecret=yourconsumerSecret,
                         requestURL="https://api.twitter.com/oauth/request_token",
                         accessURL="https://api.twitter.com/oauth/access_token",
                         authURL="https://api.twitter.com/oauth/authorize")

# Use the credential to do an OAuth handshake.
# You will receive a PIN to enter in the R session.
# The cainfo argument passed to handshake() is necessary on Windows.
cred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

library(twitteR) # For Twitter OAuth registration (and to read tweets)

# Channel the Twitter API call via Twitter OAuth mechanism
registerTwitterOAuth(cred)
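
The handshake requires entering a PIN each time, so it is convenient to save the authorized credential and reload it in later sessions. A small sketch using base R's save() and load():

# Save the authorized credential once...
save(cred, file="twitter_cred.RData")
# ...then, in a later session, reload and re-register it
load("twitter_cred.RData")
registerTwitterOAuth(cred)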

Read and Analyze Tweets

Let’s read what people are saying about Amazon by searching for @amazon tweets. We then clean the tweets up by removing punctuation and numbers and stripping out other distractions. Finally, we make a wordcloud.

# Read the tweets
amTweets <- searchTwitter('@amazon', n=1500, lang='en', cainfo="cacert.pem")
# '@amazon', to list conversations that are pertaining to Amazon.
# n=1500, as Twitter API has an upper limit of retrieving 1500 tweets per call.
# lang='en', to exclude non-English tweets.
# cainfo="cacert.pem". Necessary for Windows - check the reference for details.
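
A quick sanity check confirms how many tweets came back and what they look like:

# searchTwitter returns a list of status objects
length(amTweets)  # number of tweets retrieved (up to 1500)
head(amTweets, 3) # peek at the first few tweets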

library(tm) #text mining library
# Back up the tweets and make a working copy of the data
baseTweets <- amTweets

# The tweets are first converted to a data frame using twitteR's twListToDF function
# This also can be done by do.call("rbind", lapply(baseTweets, as.data.frame))
baseTweetsDF <- twListToDF(baseTweets) #use builtin function
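
You can verify the conversion by checking the dimensions and column names; the statusSource column used later in this post is among them:

dim(baseTweetsDF)   # rows = tweets, columns = tweet metadata
names(baseTweetsDF) # includes "text" and "statusSource", both used below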

# Build a corpus
# VectorSource specifies that the source is character vectors
baseTweetsCorpus <- Corpus(VectorSource(baseTweetsDF$text))

# Remove punctuation
baseTweetsCorpus <- tm_map(baseTweetsCorpus, removePunctuation)

# Change to lower case (do after removePunctuation)
# Note: tm versions 0.6+ require wrapping base functions in content_transformer()
baseTweetsCorpus <- tm_map(baseTweetsCorpus, content_transformer(tolower))

# Remove numbers
baseTweetsCorpus <- tm_map(baseTweetsCorpus, removeNumbers)

# Remove URLs (the custom function also needs content_transformer in newer tm)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
baseTweetsCorpus <- tm_map(baseTweetsCorpus, content_transformer(removeURL))

# Identify stopwords:
# standard English stopwords plus a few more specific to @amazon tweets, like "rt", "amazon", "via"
stopwordsMod <- c(stopwords('english'), "rt", "amazon", "via")

# remove stopwords from corpus
baseTweetsCorpus <- tm_map(baseTweetsCorpus, removeWords, stopwordsMod)
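
Before plotting, it helps to see which terms are actually frequent. A short sketch using tm's TermDocumentMatrix and findFreqTerms (the threshold of 10 is an arbitrary choice):

# Build a term-document matrix from the cleaned corpus
tdm <- TermDocumentMatrix(baseTweetsCorpus)
# List the terms that appear at least 10 times
findFreqTerms(tdm, lowfreq=10)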

library(wordcloud)
wordcloud(baseTweetsCorpus, min.freq=2, max.words=100, random.order=FALSE, colors=brewer.pal(6, "Dark2"))
Wordcloud: @amazon tweets

List the devices that tweets were sent from

It would be interesting to learn which platforms/devices were used to send these tweets.

# From the dataframe, read source device where the tweet has originated from
baseTweetDevice <- baseTweetsDF$statusSource

# Our text of interest lies between the <a> </a> tags. Extract and retain it as follows:
baseTweetDevice <- gsub("</a>", "", baseTweetDevice)
baseTweetDevice <- strsplit(baseTweetDevice, ">")
baseTweetDevice <- sapply(baseTweetDevice, function(x) ifelse(length(x) > 1, x[2], x[1]))

# Convert the list of characters to a data frame (for plotting later)
baseTweetDeviceDF <- as.data.frame(baseTweetDevice)
# change the column name to 'Source'
colnames(baseTweetDeviceDF) <- c("Source")
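
A quick tabulation, sorted most popular first, previews what the plot will summarize:

# Count tweets per source device
sort(table(baseTweetDeviceDF$Source), decreasing=TRUE)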

# Convert 'Source' from character to categorical.
# Specifically, it becomes a factor with one level per distinct source.
# The levels are sorted in increasing order of frequency, so that after
# coord_flip() the most popular sources appear at the top of the plot.
baseTweetDeviceDF <- within(baseTweetDeviceDF,
                            Source <- factor(Source,
                                             levels=names(sort(table(Source),
                                                               decreasing=FALSE))))

## Plot the source devices from which the tweets originated
library(ggplot2) # use ggplot2 for better-looking plots
ggplot(baseTweetDeviceDF, aes(x=Source)) +
  geom_bar() + # binwidth applies to histograms, not bar charts of a factor
  coord_flip() +
  labs(x="", y="Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Devices used to tweet @amazon
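
To keep the chart, ggplot2's ggsave() writes the most recent plot to disk (the filename and size here are just examples):

# Save the last plot as a PNG
ggsave("tweet_sources.png", width=6, height=8)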

There is much more we can do by analyzing these tweets. In the coming days I will explore further, such as plotting trending tweets on a map.

References

twitteR package documentation on CRAN: https://cran.r-project.org/package=twitteR
