Twitter API meets Text Sentiment Analysis: [part I]

Ariel Goldberger
Duke AI Society Blog
Sep 23, 2019 · 6 min read


Text sentiment analytics: a pragmatic tool that can help companies improve their services.

Watching the emotions of your customers in real time is valuable when you are interested in providing the best customer experience. Being able to identify problems, who is affected by them, and to act fast can make a difference when it comes to customer retention and profits.

Think about the thousands of flights and the passengers who face scheduling issues: an airline could immediately identify the passenger, the location, and the potential issue. The same goes for any other company, whether an Internet Service Provider, a credit card provider, or even an air conditioning provider: being in tune with customer sentiment is increasingly important.

One way to learn what your customers are thinking (and feeling) about your service or product is to see what they are writing on Twitter (or any other social network). This article is an example of how to scrape information from Twitter about a particular company and study its customers' feedback in real time.

Natural Language Processing

The tools below come out of Natural Language Processing, an interdisciplinary science combining computer science and linguistics that deals with natural languages: languages with large and diverse vocabularies, multiple meanings, linguistic faux pas, and ambiguities that humans nevertheless understand. Specifically, I will show you how to implement some basic ideas using the R language and the internet.

Connecting to Twitter

It is possible to download tweets (text) as data at a larger scale using the Twitter API. For that, we first need to set up an app with our Twitter account to obtain the keys and tokens (codes) that let us connect directly to Twitter's servers and download information from R. To do this, register at https://developer.twitter.com/en/apps. It can take up to a few days until you gain access to the API.

The first step is to log in and click on "Create an app".

Then fill out the rest of the form with the requested information.

After submitting the required information to Twitter, you will have access to your secret tokens and keys.

The algorithm

We will use the twitteR package to set up the connection. Plug in your keys and secrets and log in from R in the following way:

library(twitteR)

consumerKey    = "##########################################"
consumerSecret = "##########################################"
accessToken    = "##########################################"
accessSecret   = "##########################################"

options(httr_oauth_cache = TRUE)
setup_twitter_oauth(consumer_key = consumerKey,
                    consumer_secret = consumerSecret,
                    access_token = accessToken,
                    access_secret = accessSecret)

Selecting a user

Once we are successfully connected, we can select the user (in our case, the company) whose latest tweets we are interested in analyzing:

user_id = 'SouthwestAir'
user_id = as.character(user_id)

Querying tweets

As an example, I will request the 2,000 most recent tweets that mention our subject company and save that information in a variable (an R object) called tweets:

# n is the number of the most recent tweets that mention (@tag) the selected user
n = 2000

# getting tweets as a list:
tweets <- searchTwitter(user_id, n)
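Each element of the returned list is a twitteR status object; a quick sanity check shows the fields we will extract below:

# Peek at the returned status objects:
length(tweets)          # how many tweets came back
tweets[[1]]$text        # raw text of the most recent tweet
tweets[[1]]$screenName  # author's handle
tweets[[1]]$created     # creation timestamp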

Naturally, we want this information organized in a data frame, since we are going to manipulate the data. For that, we can use the following packages and build this data-wrangling pipeline:

library(stringr)
library(dplyr)
library(purrr)
library(tidytext)

tidy_tweets <- tibble(
  screen_name       = tweets %>% map_chr(~.x$screenName),
  tweetid           = tweets %>% map_chr(~.x$id),
  created_timestamp = tweets %>% map_chr(~as.character(.x$created)),
  is_retweet        = tweets %>% map_lgl(~.x$isRetweet),
  text              = tweets %>% map_chr(~.x$text)
) %>%
  mutate(created_date = as.Date(created_timestamp)) %>%
  # to avoid counting plain re-tweets
  filter(!is_retweet, substr(text, 1, 2) != "RT")

Now our data looks like a table with one row per tweet.
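We can quickly confirm that shape with dplyr's glimpse():

# One row per tweet, one column per field:
glimpse(tidy_tweets)
head(tidy_tweets$text, 3)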

However, we need the text in a one-token-per-row format: every word in its own row. The process of splitting text into individual words is called tokenization, and this is a simple way to do it in R:

tweet_words = tidy_tweets %>%
  select(tweetid, screen_name, text, created_timestamp) %>%
  unnest_tokens(word, text)
tweet_words

Removing ‘stop words’ and numbers

Now that we have tokenized the text, we can see that some tokens are words that appear frequently but will not help our analysis (I, you, he, on, in, or, etc.). These are called ‘stop words’, and we will remove them, along with numbers, since they provide little value to the study. To do that, load a dictionary that contains the most common stop words. Keep in mind that we can also add new words that we believe should be removed.

# Custom stop words list:
my_stop_words <- tibble(
  word = c("https", "t.co", "rt", "amp", "rstats", "gt",
           tolower(as.character(user_id))),
  lexicon = "twitter"
)

# Pre-set stop words list:
all_stop_words = stop_words %>%
  bind_rows(my_stop_words)

# Remove numbers:
suppressWarnings({
  no_numbers = tweet_words %>%
    filter(is.na(as.numeric(word)))
})

no_stop_words = no_numbers %>%
  anti_join(all_stop_words, by = "word")

Sentiment dictionary

Now it is time to relate every word to a particular sentiment. Some words, depending on the context, might need their sentiment adjusted. To associate a sentiment with each word, we can use one of the following dictionaries:

(1) AFINN from Finn Årup Nielsen, from Informatics and Mathematical Modelling, Technical University of Denmark (2011)

(2) BING from Bing Liu and collaborators, and

(3) NRC Word-Emotion Association Lexicon from Saif Mohammad and Peter Turney.

We can access all of them thanks to the previously loaded packages. For instance, if we want a positive or negative label for each word, we can use the Bing dictionary:

# Loading the binary sentiment dictionary
bing = get_sentiments("bing")

# Joining our data with the sentiments
bing_words = no_stop_words %>%
  inner_join(bing, by = "word")
bing_words

If we are looking to capture multiple sentiments like fear, surprise, and others, then we can use the NRC dictionary:

# Loading the multiple sentiment dictionary
nrc_dic = get_sentiments("nrc")

# Joining our data with the sentiments
nrc_words = no_stop_words %>%
  inner_join(nrc_dic, by = "word")
nrc_words

Now that we have every word assigned to a sentiment, we can start exploring some interesting insights, for example:

Which are the most common words customers use when talking about our service? At what time of day is anger the dominant sentiment? And so on.

library(wordcloud)
library(reshape2)
library(ggplot2)
library(ggridges)

# Comparison word cloud: negative vs. positive words
bing <- get_sentiments("bing")
nrc_words %>%
  inner_join(bing, by = c("word", "sentiment")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#D22903", "#037DD2"),
                   max.words = 350)  # <-- number of words to be shown

# Sentiments across time (converting the timestamp back to date-time)
ggplot(nrc_words, aes(x = as.POSIXct(created_timestamp), y = sentiment,
                      group = sentiment, fill = sentiment)) +
  geom_density_ridges(scale = 3, size = 0.25, rel_min_height = 0.03) +
  theme_ridges()

The most common words in this dataset can also be plotted:

# Build top_words: the 10 most frequent cleaned tokens
top_words <- no_stop_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity",
           fill = "#6600CC",
           width = 0.80) +
  geom_text(aes(label = n),
            size = 5.5,
            hjust = -0.15) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, max(top_words$n) * 1.3)) +
  theme_minimal()

Of course, it is useful to study each word separately; however, it is also possible to consider the relationships between words and their contexts (for example, correlations between words) and build word networks, as sketched below.
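A minimal sketch of this idea with the widyr package computes correlations between words that tend to appear in the same tweets (the minimum frequency of 20 is an illustrative choice, not a rule):

library(widyr)

# Correlation between words that co-occur within the same tweet.
# Keeping only words that appear at least 20 times reduces noise:
word_cors <- no_stop_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, tweetid, sort = TRUE)
word_cors

The resulting word pairs can then be drawn as a network graph, for example with the igraph and ggraph packages.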

We can also address linguistic variation using a method called lemmatization (or the related technique of stemming):

library(textstem)

vector <- c("gets", "getting", "got", "get", "gotten")
lemmatize_words(vector)
#> "get" "get" "get" "get" "get"
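For contrast, stemming truncates each word to a root instead of mapping it to a dictionary form, so irregular forms are not always unified. A small sketch with textstem's stem_words() (the exact stems depend on the underlying stemmer):

# Stemming chops suffixes rather than looking up dictionary lemmas:
stem_words(c("gets", "getting", "got"))
#> "get" "get" "got"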

The conquest of learning is achieved through the knowledge of languages.

— Roger Bacon

Ariel Goldberger
M.Sc. Quantitative Analytics, Duke University - M.Sc. Financial Engineering, Adolfo Ibanez University