What do the Top Trending Tweets tell us about Trump, Obama and Biden?
From 56,000+ Tweets gathered with Python, dating back to the start of 2008
Introduction
In the lead-up to the 2020 US Presidential election, and having been bamboozled by the amount of stuff Donald Trump gets away with on Twitter (for instance the tweet below, which happens to be his 7th most retweeted tweet since 2008)…
I started to get curious and wanted to find a way to do a side-by-side comparison of how Joe Biden, Barack Obama and Donald Trump utilise Twitter as a means of communication.
With a bit of Python code and some head scratching, voila! I was able to gather some interesting observations by extracting tweet-level data going back to the start of 2008. Let’s start by taking a look at some top-line numbers.
How frequently did Trump, Obama and Biden tweet over the years?
- We can see that Obama was quite an active tweeter during his second presidential term (2013–2016), particularly during the early years (2013–2014). Though since leaving office, he has become quite a passive tweeter.
- Trump on the other hand has been a highly active tweeter, with his output of tweets really exploding from 2013 onwards. In a TV interview with CBS after winning the US presidency, he stated his use of social media would be “very restrained, if I use it at all.” This doesn’t appear to be the case. I wonder what he’s most often posting… #China, #MakeAmericaGreatAgain, #MAGA etc.
- Joe Biden only started tweeting in 2012. Whilst he was quite an active tweeter during the course of 2012, he very rarely tweeted from 2013 through to 2018, a period during much of which he was serving as Vice President. His frequency of usage has certainly increased from 2019 onwards, though, in the lead-up to the 2020 Presidential elections. We’ll see further along what his most popular tweets look like.
Looking at this graph, many questions pop to mind, one of which being: how many retweets and engagements are coming off the back of these tweets? Let’s take a look…
What are the average retweets per tweet for Trump, Obama and Biden over the years?
Look at that, Obama is way way way ahead!
- It’s interesting to see how Obama’s average retweets per tweet have really exploded from 2017 onwards, despite him having become a far more passive tweeter since his second presidential term ended.
- Worth noting that the follower base for each has been growing over the years, which would largely explain why we are seeing greater retweet numbers in recent years. Particularly with Trump, whose follower base has exploded since taking up the presidency, gaining a whopping ~70 million extra followers from a base of ~18 million. See the figures below for follower growth stats over the years (sourced from trackalytics.com).
Let’s have a look and see if there are any common trends across each of Obama, Trump and Biden’s most retweeted tweets.
Based on the observations above, I focused on producing a custom WordCloud with Python to surface the most frequent words across the top 250 most retweeted tweets for each of Obama, Trump and Biden from 2017 onwards.
What does the Custom WordCloud show us for Trump, Obama and Biden?
It isn’t at all surprising, and a tad worrying, to see that Trump’s most retweeted tweets take a rather nationalistic tone, with words such as:
‘Make’, ‘America’, ‘Great’, ‘Iran’, ‘Law’, ‘Order’, ‘enemy’ and ‘China’
frequently appearing.
It is interesting to see how Biden’s most retweeted tweets are those where he goes on the attack against Donald Trump in the build-up to the 2020 Presidential elections. This election is certainly proving to be one of the most polarising of modern times.
Read on to see how I produced the above visualisations and extracted tweet-level data going all the way back to 2008 with Python…
How is this done?
Step 1: Import necessary packages
import csv
import datetime as dt
import glob
import re

import GetOldTweets3 as got
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
Step 2: Extract Tweet level data
Given that free basic Twitter API access only lets you gather a user’s most recent 3,200 tweets, the ‘GetOldTweets3’ library is a useful workaround for scraping an effectively unlimited amount of tweet-level data. It works by scraping the Twitter web feed directly, rather than going through the official API. Long story short: as long as the data is visible on the webpage (timestamp, text, mentions, hashtags etc.), we can gather it.
The function built below enables us to scrape any user’s Twitter feed within a specified start and end date (inspiration for the code taken from a fellow Medium blogger).
def get_tweets(username, start_date, end_date):
    # Specify the tweet search criteria
    tweetCriteria = got.manager.TweetCriteria()\
                       .setUsername(username)\
                       .setSince(start_date)\
                       .setUntil(end_date)
    # Scrape tweets based on the criteria
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Create a list of tweets with the tweet attributes
    # specified in the list comprehension
    text_tweets = [[tw.username,
                    tw.id,
                    tw.text,
                    tw.date,
                    tw.retweets,
                    tw.favorites,
                    tw.mentions,
                    tw.hashtags] for tw in tweets]
    # Create a DataFrame, with column names matching the
    # order of the tweet attributes above
    tweet_df = pd.DataFrame(text_tweets,
                            columns=['User', 'Tweet_id', 'Text',
                                     'Date', 'Retweets',
                                     'Favorites', 'Mentions',
                                     'HashTags'])
    return tweet_df
We are looking to scrape tweet-level data going all the way back to 2008, and the sheer volume of tweets to gather (~56K!) means the function could crash midway. To guard against this, I created the looping script below, which stores the tweet-level data in a series of CSV files broken out by year. With the handy ‘glob’ module, we can then concatenate the CSV files into a single Pandas DataFrame.
# Define the list of Twitter users we want to scrape tweet-level data from
user_names = ["JoeBiden", "realDonaldTrump", "BarackObama"]

# List of year ranges we want to extract tweets from
year_range = [["2020-01-01", "2020-09-15"],
              ["2019-01-01", "2020-01-01"],
              ["2018-01-01", "2019-01-01"],
              ["2017-01-01", "2018-01-01"],
              ["2016-01-01", "2017-01-01"],
              ["2015-01-01", "2016-01-01"],
              ["2014-01-01", "2015-01-01"],
              ["2013-01-01", "2014-01-01"],
              ["2012-01-01", "2013-01-01"],
              ["2011-01-01", "2012-01-01"],
              ["2010-01-01", "2011-01-01"],
              ["2009-01-01", "2010-01-01"],
              ["2008-01-01", "2009-01-01"]]

# Scrape the Twitter user feeds and save the data into a series of
# CSV files broken out by year
for year in year_range:
    tweet_df = get_tweets(user_names,
                          start_date=year[0],
                          end_date=year[1])
    year_name = year[0][:4]
    file_name = 'tweets_{year_name}.csv'.format(year_name=year_name)
    tweet_df.to_csv(file_name, index=False)

# Gather the yearly CSV files with the handy glob method
files = glob.glob('*.csv')

# Convert the CSV files into DataFrames and store these in a list
list_df = []
for file in files:
    df = pd.read_csv(file)
    list_df.append(df)

# Concatenate the list of DataFrames into a single big DataFrame
user_names_df = pd.concat(list_df)
And voila! The above code can be used to scrape any Twitter user’s feed within a specified date range. After a bit of data cleaning, we can now jump into the juicy exploratory analysis.
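The plotting code in the next step groups by a ‘Year’ column, which the scraper doesn’t produce directly, so the cleaning has to derive it from the tweet timestamp. A minimal sketch of that step, using a hypothetical two-row stand-in for the scraped DataFrame (not the real data), might look like:

```python
import pandas as pd

# Hypothetical stand-in for the scraped DataFrame (illustrative only)
user_names_df = pd.DataFrame({
    "User": ["BarackObama", "realDonaldTrump"],
    "Date": ["2015-06-01 12:00:00", "2018-03-15 09:30:00"],
    "Retweets": ["1200", "3400"],
})

# Derive a 'Year' column from the tweet timestamp
user_names_df["Date"] = pd.to_datetime(user_names_df["Date"])
user_names_df["Year"] = user_names_df["Date"].dt.year

# Retweet counts read back from CSV as strings need to be numeric
# for the groupby sums later on
user_names_df["Retweets"] = pd.to_numeric(user_names_df["Retweets"],
                                          errors="coerce")
print(user_names_df[["User", "Year", "Retweets"]])
```

The same `dt.year` trick works for any datetime column, so the sketch carries over directly to the real scraped data.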
Step 3: Exploratory Analysis — Visual graphs with Seaborn
You have the option of using either Seaborn or Matplotlib to produce the graphs. I prefer Seaborn, which is what I used here, as it requires less syntax than Matplotlib.
# Code to produce the Frequency of Tweets by Year graph
plt.figure(figsize=(15, 10))
sb.set_context('notebook', font_scale=1.75)
palette = {"JoeBiden": "C0", "BarackObama": "C1",
           "realDonaldTrump": "C2"}
sb.countplot(data=user_names_df, x='Year', hue='User',
             palette=palette, hue_order=["JoeBiden",
                                         "BarackObama",
                                         "realDonaldTrump"])
plt.ylabel('Frequency of Tweets')
plt.title('Frequency of Tweets by year')
plt.yticks(np.arange(0, 8001, 1000),
           [0, '1k', '2k', '3k', '4k', '5k', '6k', '7k', '8k'])
plt.show()
# Code to produce the Average Number of Retweets per Tweet by Year graph
# We'll need to do some wrangling first.

# Groupby to obtain the sum of retweets by year
tweet_retweet = user_names_df.groupby(['User', 'Year'])\
                             .Retweets.sum().reset_index()

# Groupby to obtain the frequency of tweets by year
tweet_frequency = user_names_df[['User', 'Year', 'Tweet_id']]\
                  .groupby(['User', 'Year'])\
                  .Tweet_id.count().reset_index()
tweet_frequency.rename(columns={'Tweet_id': 'Tweet_frequency'},
                       inplace=True)

# Merge the above two DataFrames so we get retweets and frequency
# of tweets by year in the same table
tweet_frequency_retweet = pd.merge(tweet_retweet, tweet_frequency)

# Create an additional column with the average retweets per tweet
tweet_frequency_retweet['Retweet_per_Tweet_ratio'] = \
    tweet_frequency_retweet.Retweets / \
    tweet_frequency_retweet.Tweet_frequency

# Plot the graph
plt.figure(figsize=(15, 10))
sb.set_context('notebook', font_scale=1.75)
sb.barplot(data=tweet_frequency_retweet, x='Year',
           y='Retweet_per_Tweet_ratio', hue='User',
           palette=palette,
           hue_order=["JoeBiden",
                      "BarackObama",
                      "realDonaldTrump"])
plt.ylabel('Average Number of Retweets per Tweet (million)')
label_range_2 = np.arange(0, 8, 1)
plt.yticks(label_range_2 * 100000, label_range_2 / 10)
plt.title('Average Number of Retweets per Tweet by Year')
plt.legend(loc='center')
plt.show()
We now have graphs plotted with Seaborn. Next step…
Step 4: Building the CustomWordCloud
Here we get to work with the WordCloud library in order to generate our very own custom WordCloud.
1. To visualise a WordCloud in a customised shape (in our case, the silhouettes of Trump, Obama and Biden), we’ll need to source black & white background images to serve as masks (the images below will suffice).
2. We’ll need to break the dataset out into individual DataFrames for Trump, Obama and Biden, with each DataFrame ordered from most retweeted to least retweeted.
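As a quick illustration of step 2, assuming a tiny made-up stand-in for one user’s tweets, sorting by retweets and dropping rows without text works like this:

```python
import pandas as pd

# Tiny made-up stand-in for one user's tweets (illustrative only)
df = pd.DataFrame({
    "User": ["JoeBiden"] * 3,
    "Text": ["First tweet", None, "Another tweet"],
    "Retweets": [500, 900, 1400],
})

# Most retweeted first; the row with no text is dropped
top_df = df.sort_values(by=["Retweets"], ascending=False)\
           .dropna(subset=["Text"])
print(top_df.Retweets.tolist())  # [1400, 500]
```

The real code below applies exactly this sort-and-drop pattern to each of the three users’ DataFrames.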
# Create individual DataFrames for Obama, Trump and Biden
obama_df = user_names_df[user_names_df.User == 'BarackObama']
trump_df = user_names_df[user_names_df.User == 'realDonaldTrump']
biden_df = user_names_df[user_names_df.User == 'JoeBiden']

# Sort the DataFrames in order of most retweeted and drop tweets
# without text
obama_retweet_df = obama_df.sort_values(by=['Retweets'],
                                        ascending=False)\
                           .dropna(subset=['Text'])
trump_retweet_df = trump_df.sort_values(by=['Retweets'],
                                        ascending=False)\
                           .dropna(subset=['Text'])
biden_retweet_df = biden_df.sort_values(by=['Retweets'],
                                        ascending=False)\
                           .dropna(subset=['Text'])

# Set a global variable for the range of years in our dataset
year_range = np.arange(2008, 2021, 1)
Now we can go ahead and build out the function below…
def wordcloud_retweet(top_number, trump_year=year_range,
                      obama_year=year_range, biden_year=year_range):
    # Loop through the individual DataFrames, strip out URLs and
    # store the top tweets in lists
    obama_ls = [re.sub(r"http\S+", "", tweet).strip() for tweet
                in obama_retweet_df[obama_retweet_df.Year
                                    .isin(obama_year)]
                .dropna(subset=['Text']).Text.head(top_number)]
    trump_ls = [re.sub(r"http\S+", "", tweet).strip() for tweet
                in trump_retweet_df[trump_retweet_df.Year
                                    .isin(trump_year)]
                .dropna(subset=['Text']).Text.head(top_number)]
    biden_ls = [re.sub(r"http\S+", "", tweet).strip() for tweet
                in biden_retweet_df[biden_retweet_df.Year
                                    .isin(biden_year)]
                .dropna(subset=['Text']).Text.head(top_number)]
    # Create empty lists to house the individual words from the
    # tweets gathered
    obama_series = []
    trump_series = []
    biden_series = []
    # Loop through obama_ls, trump_ls and biden_ls and populate the
    # word lists, stripping surrounding punctuation, given how Trump
    # often finishes off tweets with excessive punctuation i.e. (?!?!!!)
    punctuation = '-".:,?!'
    for tweet in obama_ls:
        for word in tweet.split():
            obama_series.append(word.strip(punctuation))
    for tweet in trump_ls:
        for word in tweet.split():
            trump_series.append(word.strip(punctuation))
    for tweet in biden_ls:
        for word in tweet.split():
            biden_series.append(word.strip(punctuation))
    # Combine all the words into one big text per person
    obama_text = " ".join(obama_series)
    trump_text = " ".join(trump_series)
    biden_text = " ".join(biden_series)
    # Generate masks for Obama, Trump and Biden
    mask_obama = np.array(Image.open('obama_1.png'))
    mask_trump = np.array(Image.open('trump_3.png'))
    mask_biden = np.array(Image.open('biden_edit.png'))
    # Create a stopword list, i.e. a list of words to ignore when
    # generating the custom WordCloud
    stopwords = set(STOPWORDS)
    stopwords.update(['will', 'of', 'a', 'well', 'way', 've',
                      'don', 'let', 'thing', 'day', 'keep',
                      'two', 'see', 're', 'today', 'week',
                      'far', 'now', 'act'])
    # Set up the parameters for each WordCloud image
    wordcloud_obama = WordCloud(background_color="white",
                                stopwords=stopwords,
                                max_words=50, mask=mask_obama,
                                contour_width=3,
                                contour_color='steelblue',
                                collocations=False)\
                                .generate(obama_text)
    wordcloud_trump = WordCloud(background_color="white",
                                stopwords=stopwords,
                                max_words=50, mask=mask_trump,
                                contour_width=3,
                                contour_color='steelblue',
                                collocations=False)\
                                .generate(trump_text)
    wordcloud_biden = WordCloud(background_color="white",
                                stopwords=stopwords,
                                max_words=50, mask=mask_biden,
                                contour_width=3,
                                contour_color='steelblue',
                                collocations=False)\
                                .generate(biden_text)
    # Display the custom WordCloud images
    plt.figure(figsize=(32, 16))
    plt.subplot(1, 3, 1)
    plt.imshow(wordcloud_trump, interpolation='bilinear')
    plt.axis('off')
    plt.subplot(1, 3, 2)
    plt.imshow(wordcloud_obama, interpolation='bilinear')
    plt.axis('off')
    plt.subplot(1, 3, 3)
    plt.imshow(wordcloud_biden, interpolation='bilinear')
    plt.axis('off')
    plt.show()
Finally, we now have a function which enables us to generate a custom WordCloud from the top retweeted tweets for a given year range. In our case we want a WordCloud built from the top 250 most retweeted tweets for each person from 2017 onwards, which we can do by calling the function with the following parameters:
wordcloud_retweet(250, np.arange(2017, 2021, 1), np.arange(2017, 2021, 1), np.arange(2017, 2021, 1))
And there you have it! Our very own custom WordCloud, built to work with any tweet-level dataset. The data has been made available through the following DataStudio link, if you wish to have a gander: Link Here
You can also check out my GitHub repo below:
References:
https://medium.com/@AIY/getoldtweets3-830ebb8b2dab
https://pypi.org/project/GetOldTweets3/
https://github.com/amueller/word_cloud
https://www.trackalytics.com/