Published in Analytics Vidhya

Analyzing Donald Trump's Tweets in R

Over the past 5 years we have seen the impossible come to fruition. The man who starred in The Celebrity Apprentice became President of the United States. Many people, experts and critics alike, were in awe of this transition and of Donald Trump's unconventional ways of achieving his goals. He also set a new standard for sharing information, one as unconventional as his path to the presidency. By tweeting frequently, he rallied his supporters, became a worldwide presence and shared information in the way our modern world knows best: 280 characters or fewer.

As Donald Trump's term and 2020 come to an end, we can look back on the past 5 years and reflect. In this case, we reflect by analyzing Trump's tweets. We will examine the most frequently used words, calculate the tf-idf (term frequency-inverse document frequency) of each word, and use sparse regression to determine which words have the largest effect on the total number of retweets.

The data used in this project was collected and saved as a CSV file for import into RStudio. We start by loading the needed packages: readr, tidyverse, tidytext, stringr and tokenizers. We then import the dataset using read_csv, setting the id column to type character, and clean the data. We tokenize the tweets to split each one into individual words. Because we want to keep only meaningful words, we remove stop words, tokens with no spaces, usernames, the words "Donald" and "Trump", URLs, and HTML residue such as "&amp;" and "amp" using the str_detect function.
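The import-and-clean step might look like the sketch below. The file name trump_tweets.csv and the text column name content are assumptions, as is the use of tidytext's "tweets" tokenizer (available in older tidytext/tokenizers versions), which keeps hashtags and @-mentions intact:

```r
library(readr)
library(dplyr)
library(tidytext)
library(stringr)

# id read as character so long tweet ids are not mangled as doubles
trump <- read_csv("trump_tweets.csv",
                  col_types = cols(id = col_character()))

tidy_trump <- trump %>%
  # one row per word; the "tweets" tokenizer preserves #hashtags and @mentions
  unnest_tokens(word, content, token = "tweets") %>%
  anti_join(stop_words, by = "word") %>%       # drop common stop words
  filter(
    !str_detect(word, "^@"),                   # usernames
    !str_detect(word, "^http|t\\.co"),         # urls and link shorteners
    !word %in% c("donald", "trump", "amp")     # self-references and HTML residue
  )
```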

We then convert the date column to show only the year, select tweets between 2015 and 2020, and rename the column to reflect this change. Following this conversion, we plot the top 20 words used and display them in a bar chart using ggplot2.
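A minimal sketch of this step, continuing from the tokenized tidy_trump data frame; the date column name and the use of lubridate's year() are assumptions:

```r
library(dplyr)
library(ggplot2)
library(lubridate)

tidy_trump <- tidy_trump %>%
  mutate(year = year(date)) %>%          # keep only the year
  filter(year >= 2015, year <= 2020)

tidy_trump %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%               # top 20 words overall
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count",
       title = "Top 20 words in Trump's tweets, 2015-2020")
```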

Next we want to look at the top 20 words in each year from 2015-2020. We plot word vs. count, faceted by year, using ggplot2 and filling with the color palette "Set1".
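One way to build the faceted chart, using tidytext's reorder_within() so bars sort correctly within each facet (a helper the article does not name, so treat it as one possible approach):

```r
library(dplyr)
library(ggplot2)
library(tidytext)   # reorder_within(), scale_x_reordered()

tidy_trump %>%
  count(year, word) %>%
  group_by(year) %>%
  slice_max(n, n = 20) %>%               # top 20 words per year
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, year), n, fill = factor(year))) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Set1") +  # one Set1 color per year
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ year, scales = "free_y") +
  labs(x = NULL, y = "Count")
```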

Next we calculate the tf-idf, treating the year as the document and the word as the term, using the bind_tf_idf function. To view the resulting data frame, we use the arrange function in descending order by tf_idf value. The data frame is far too large to display accurately here, but the top entry is the hashtag #celebapprentice. After creating the tf_idf column, we group the data frame by year and plot the top 20 words by tf-idf value, faceted by year. It is really interesting to see how the most distinctive words change from year to year: the transition from starring in Celebrity Apprentice to serving as president is very evident in this graph.
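The tf-idf computation and plot can be sketched as follows, with year as the document unit as described above:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

trump_tf_idf <- tidy_trump %>%
  count(year, word) %>%
  bind_tf_idf(word, year, n) %>%   # term = word, document = year
  arrange(desc(tf_idf))            # #celebapprentice surfaces at the top

trump_tf_idf %>%
  group_by(year) %>%
  slice_max(tf_idf, n = 20) %>%    # 20 most distinctive words per year
  ungroup() %>%
  ggplot(aes(reorder_within(word, tf_idf, year), tf_idf)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ year, scales = "free") +
  labs(x = NULL, y = "tf-idf")
```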

Next we use the glmnet package to perform sparse regression and model how many retweets a tweet will get. We want to examine the coefficients relating each word to the number of retweets a tweet receives. To do this we create a document-term matrix of all the tweets in the tidy_trump data frame and plot every coefficient. The goal of sparse regression (the lasso) is to eliminate variables by shrinking their coefficients to zero.
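A sketch of this step, assuming a retweets column in the original data; the log1p transform of the counts is an assumption added here to tame the heavy skew of retweet counts, not something the article specifies:

```r
library(dplyr)
library(tidytext)   # cast_sparse()
library(glmnet)

# Sparse document-term matrix: one row per tweet, one column per word
dtm <- tidy_trump %>%
  count(id, word) %>%
  cast_sparse(id, word, n)

# Align the response with the matrix rows; "retweets" column is assumed
retweets <- trump$retweets[match(rownames(dtm), trump$id)]

fit <- glmnet(dtm, log1p(retweets))   # default alpha = 1 gives the lasso
plot(fit, xvar = "lambda")            # coefficient paths shrinking to zero
```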

We determine the value of the sparsity parameter lambda using cross-validation and report the number of nonzero coefficients at each candidate value. As seen in the images below, lambda.1se returns the smaller number of nonzero coefficients, so we select this value to build our data frame of words and their respective coefficients.
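The cross-validation step, continuing from the document-term matrix and response above, might look like this (cv.glmnet picks both lambda.min and the more conservative lambda.1se):

```r
library(glmnet)

set.seed(42)                               # reproducible CV folds (assumption)
cv_fit <- cv.glmnet(dtm, log1p(retweets))
plot(cv_fit)                               # CV error across the lambda path

# Count nonzero coefficients at each candidate lambda;
# lambda.1se typically yields the sparser model
sum(coef(cv_fit, s = "lambda.min") != 0)
sum(coef(cv_fit, s = "lambda.1se") != 0)
```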

To view the words and their coefficients at the sparsity parameter lambda.1se, we convert the c2 output to a matrix, then to a data frame, and perform some tidying. After the tidying is complete, we can see the words and their respective coefficient values below, sorted in descending order with the arrange function.
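The tidying described above can be sketched as follows, where c2 is the sparse coefficient matrix extracted from the cross-validated fit at lambda.1se:

```r
library(dplyr)
library(tibble)

c2 <- coef(cv_fit, s = "lambda.1se")   # sparse one-column matrix

coef_df <- as.matrix(c2) %>%           # densify the sparse matrix
  as.data.frame() %>%
  rownames_to_column("word") %>%       # words are stored as row names
  rename(coefficient = 2) %>%
  filter(coefficient != 0,             # keep only words the lasso retained
         word != "(Intercept)") %>%
  arrange(desc(coefficient))

head(coef_df, 10)                      # strongest positive words on top
```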

The word with the strongest positive relationship to the number of retweets is "#fnn", for fake news network, followed by "quarantine". Looking at this list, it is easy to see which words rally the Twitter world as well as Donald Trump's base of supporters.

Throughout this project we have seen the trends of Donald Trump's rhetoric change from TV star to president. In each year, the most prevalent terms reflect the socioeconomic environment of that year. In 2016 and 2017, his most common words centered on running for election, with attacks and hashtags aimed at reaching a large audience. In 2018 and 2019, the top terms include "witch hunt" and other words surrounding the impeachment. Using sparse regression, we determined that "#fnn" had the strongest relationship with retweet count. As we look forward into 2021, it is important to review the past 5 years and learn from our societal trends.


Cole Crescas