Exploring tweets in R

Trafford Data Lab
6 min read · Jan 8, 2020


This tutorial shows you how to use the rtweet R package to retrieve tweets from Twitter’s REST API and explore the results using functions from the tidyverse and tidytext packages. We’ll look at tweets with the hashtag #ClimateEmergency over the New Year period and identify the most common emoji, hashtags, username mentions and words.

Install and load rtweet

You can install the rtweet package from CRAN by executing install.packages("rtweet"). We then need to load the package along with the tidyverse suite of R packages to manipulate the data and tidytext to tokenise the tweet text into words.

library(rtweet) ; library(tidyverse) ; library(tidytext)

Getting Twitter API access

Before you can retrieve tweets from Twitter’s REST API you need a Twitter account and authorisation to use the API. There are a couple of ways to gain access: you can either apply for a developer account and register a Twitter application, or you can authorise rtweet’s embedded rstats2twitter app via your web browser. The latter is much easier, but you will need to install the httpuv package to enable browser-based authentication. To authorise the rstats2twitter app you just need to run one of the functions from the rtweet package. You will then receive a message in your web browser like: “Authorize the rstats2twitter app by logging into Twitter, or selecting ‘Authorize app’”.
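As a minimal sketch of the browser-based route (the small test search here is just an illustrative way to trigger the prompt):

```r
# One-time setup: httpuv enables rtweet's browser-based authentication
install.packages("httpuv")

library(rtweet)

# The first call to the Twitter API (here, a small search) opens the
# rstats2twitter authorisation prompt in your default web browser
first_tweets <- search_tweets("#rstats", n = 10)
```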

Searching for tweets

The search_tweets() function from the rtweet package retrieves tweets from the last 6–9 days that match the submitted query. The function is very flexible: you can search for @mentions, hashtags, keywords and exact phrases, and even use boolean operators. By default the function returns the 100 most recent matching tweets, but we can increase this up to the API’s limit of 18,000 per request. Those who want more than 18,000 tweets can set the retryonratelimit argument to TRUE.

Here we’ll search for tweets that contain the #ClimateEmergency hashtag. We use the include_rts = FALSE argument to exclude retweets, `-filter` = "replies" to exclude replies and lang = "en" to return only English language tweets.

tweets <- search_tweets(q = "#ClimateEmergency",
                        n = 18000,
                        include_rts = FALSE,
                        `-filter` = "replies",
                        lang = "en")
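If you need more than the 18,000-tweet ceiling, a sketch using retryonratelimit might look like the following (search_tweets() will pause each time it hits the rate limit, which resets every 15 minutes, so large collections take a while):

```r
# retryonratelimit = TRUE makes search_tweets() wait out each
# rate-limit window (~15 minutes) and keep collecting until n is reached
many_tweets <- search_tweets(q = "#ClimateEmergency",
                             n = 50000,
                             include_rts = FALSE,
                             retryonratelimit = TRUE)
```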

The search returned 17,540 tweets covering the period between 28 December 2019 and 3 January 2020, along with a number of useful variables including “screen_name”, “created_at”, “text”, “favorite_count” and “retweet_count”. We can pull out a random sample of tweets and look at them more closely.

tweets %>% 
  sample_n(5) %>% 
  select(created_at, screen_name, text, favorite_count, retweet_count)
Random sample of 5 tweets with a #ClimateEmergency hashtag

If we want to export our tweets as a CSV for safe keeping we can use the write_as_csv() function.

write_as_csv(tweets, "tweets.csv")

The tweets that we retrieved for this tutorial are available here.
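To load a saved file back into R, rtweet provides a companion read_twitter_csv() function; a quick sketch:

```r
# read_twitter_csv() restores the list-columns (e.g. hashtags,
# urls_expanded_url) that a generic CSV reader would flatten to strings
tweets <- read_twitter_csv("tweets.csv")
```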

Exploring tweets

Timeline of tweets

The ts_plot() function from the rtweet package enables us to plot the frequency of tweets over a variety of time intervals (e.g. “secs”, “mins”, “hours”, “days”, “weeks”, “months”, “years”) in a ggplot2 plot.

ts_plot(tweets, "hours") +
  labs(x = NULL, y = NULL,
       title = "Frequency of tweets with a #ClimateEmergency hashtag",
       subtitle = paste0(format(min(tweets$created_at), "%d %B %Y"), " to ",
                         format(max(tweets$created_at), "%d %B %Y")),
       caption = "Data collected from Twitter's REST API via rtweet") +
  theme_minimal()
Frequency of tweets with a #ClimateEmergency hashtag

Top tweeting location

Most Twitter users turn off location sharing in their privacy settings, but those who don’t add valuable location information to their tweets. We can count unique values of the “place_full_name” variable to find the most frequent tweet locations. Here we exclude missing values of “place_full_name”, count, sort in descending order and print the top 5.

tweets %>% 
  filter(!is.na(place_full_name)) %>% 
  count(place_full_name, sort = TRUE) %>% 
  top_n(5)
Top 5 locations of tweets with a #ClimateEmergency hashtag

Most frequently shared link

The “urls_expanded_url” variable provides the full URL of shared links. Here we exclude tweets without a shared link, count, sort the frequency of links in descending order and print the top 5.

tweets %>% 
  filter(!is.na(urls_expanded_url)) %>% 
  count(urls_expanded_url, sort = TRUE) %>% 
  top_n(5)

Most retweeted tweet

The Twitter API helpfully returns a “retweet_count” variable whose values can easily be sorted. Here we sort all the tweets in descending order by the size of the “retweet_count”, slice off the top row and print the date, handle, text and retweet count.

tweets %>% 
  arrange(-retweet_count) %>% 
  slice(1) %>% 
  select(created_at, screen_name, text, retweet_count)

If you want a screenshot of the most retweeted tweet you can use the tweet_screenshot() function from the tweetrmd package. Just provide the “screen_name” and “status_id”.

library(tweetrmd)

tweet_screenshot(tweet_url("MikeHudema", "1212806892390666241"))
Most retweeted tweet with a #ClimateEmergency hashtag

Most liked tweet

To find the most liked tweet we can sort our tweets by the “favorite_count” variable in descending order and print the rows with the top 5 highest counts.

tweets %>% 
  arrange(-favorite_count) %>% 
  top_n(5, favorite_count) %>% 
  select(created_at, screen_name, text, favorite_count)

Top tweeters

To identify the most active tweeters we can use the “screen_name” variable to tot up the number of tweets by Twitter handle. We can then add back the @ symbol using the paste0() function.

tweets %>% 
  count(screen_name, sort = TRUE) %>% 
  top_n(10) %>% 
  mutate(screen_name = paste0("@", screen_name))

Top emoji

To identify the most frequently used emoji we can use the ji_extract_all() function from the emo package. This function extracts all the emojis from the text of each tweet. We can then use the unnest() function from the tidyr package to split out the emojis, count, sort in descending order and identify the top 10.

library(emo)

tweets %>% 
  mutate(emoji = ji_extract_all(text)) %>% 
  unnest(cols = c(emoji)) %>% 
  count(emoji, sort = TRUE) %>% 
  top_n(10)
Most frequently used emoji in tweets with a #ClimateEmergency hashtag

Top hashtags

To pull out the hashtags from the text of each tweet we first need to convert the text into a one word per row format using the unnest_tokens() function from the tidytext package. We then select only those terms that have a hashtag, count them, sort in descending order and pick the top 10.

tweets %>% 
  unnest_tokens(hashtag, text, "tweets", to_lower = FALSE) %>% 
  filter(str_detect(hashtag, "^#"),
         hashtag != "#ClimateEmergency") %>% 
  count(hashtag, sort = TRUE) %>% 
  top_n(10)
Other most frequently used hashtags in tweets with a #ClimateEmergency hashtag

Top mentions

Here we tokenise the text of each tweet and use str_detect() from the stringr package to keep only the tokens that start with an @.

tweets %>% 
  unnest_tokens(mentions, text, "tweets", to_lower = FALSE) %>% 
  filter(str_detect(mentions, "^@")) %>% 
  count(mentions, sort = TRUE) %>% 
  top_n(10)
Most username mentions in tweets with a #ClimateEmergency hashtag

Top words

To extract the words from the text of each tweet we need to use several functions from the tidytext package. First we remove the HTML entities for ampersand, less-than and greater-than characters (&amp;, &lt;, &gt;), URLs and non-ASCII characters such as emoji from the text. Then we tokenise the text into a row per word format, filter out stop words such as “the”, “of” and “to”, remove any numbers, and filter out hashtags and mentions of usernames. Finally we count the frequency of each word and sort in descending order.

words <- tweets %>% 
  mutate(text = str_remove_all(text, "&amp;|&lt;|&gt;"),
         text = str_remove_all(text, "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"),
         text = str_remove_all(text, "[^\x01-\x7F]")) %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"),
         !str_detect(word, "^#"),
         !str_detect(word, "@\\S+")) %>% 
  count(word, sort = TRUE)

Then we use the wordcloud package to create a visualisation of the word frequencies.

library(wordcloud)

words %>% 
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = "#F29545"))
Most common words in tweets with a #ClimateEmergency hashtag

Conclusion

We’ve only provided a glimpse of what’s possible using the rtweet and complementary tidytext packages. If you want further inspiration and guidance, we’d suggest exploring the many books, blogs and workshop materials available on text mining in R.
