Tokenizing Thoughts on Twitter on Election Day 2020

Source: https://www.wsj.com/articles/in-first-debate-trump-and-biden-to-compete-for-few-persuadable-voters-11601387621

It’s Tuesday, November 3rd, and whether or not you’re an American citizen, if you currently reside in the United States, you know that the 2020 presidential election is going to be an important one for this country. With the COVID-19 pandemic still going on, the recent nomination of Amy Coney Barrett to the Supreme Court, Black Lives Matter protests, and the general political climate overall, voting in this election was important to many. And so as Election Day came, millions of Americans watched the results roll in, even though a winner was not declared that night.

During the day, many people stated their thoughts on Twitter. What was on the minds of Americans throughout the day and night?

At the time I’m writing this article, it’s been a couple of weeks since Election Day and we know that Joe Biden is projected to be the 46th President of the United States. And while Biden did win the popular vote (in addition to the electoral college), what kinds of users were tweeting on the day itself, and what were they saying? In my opinion, this election was like no other in United States history. The opinions of individuals can help determine what the political climate will be moving forward. Before diving into who these users are and what they were saying, I’ll discuss how I cleaned up the data and share some initial analysis of the dataset of tweets.

I used a Twitter dataset obtained from Kaggle that contains tweets that used the hashtags #Biden, #JoeBiden, #Trump, and #DonaldTrump from 10/15/20 until 11/8/20. This dataset covers tweets using these hashtags around election season and can be analyzed as such. For those interested, the name of this dataset is US Election Dataset 2020. However, I wanted to focus on what was happening on Election Day specifically, so I kept only tweets that were posted on 11/3/20 from midnight (0:00) until 11:59 pm (23:59). For the purposes of this analysis, the primary focus will be comparing tweets and Twitter activity between people who used #Biden/#JoeBiden and people who used #Trump/#DonaldTrump in their posts. It’s important to note that people who use either hashtag do not necessarily support or vote for that candidate (i.e. this is not an analysis of tweets between Democrats and Republicans or Biden supporters and Trump supporters). This dataset of tweets is also not representative of American public opinion.

Cleaning the Dataset

There were two datasets used in this analysis: one that consisted of Biden tweets and one that consisted of Trump tweets. These were provided by Kaggle as separate CSV files. Cleanup was done in Python, which is also where the majority of my analysis was done. Packages I used for both cleaning up the data and for the analysis as a whole were pandas, csv, nltk, matplotlib, re, ipykernel, numpy, and matplotlib.pyplot. As stated earlier, the dataset I obtained from Kaggle was filtered to include only tweets that were posted between 11/3/20 0:00 and 11/3/20 23:59. Note that these timestamps were not time zone aware: there is no specific time zone associated with these tweets, so the hours in this analysis may not line up with any particular local time. Columns I decided to include were the time the tweet was posted (created_at), the actual tweet (tweet), number of likes (likes), number of retweets (retweet_count), bio of the user (user_description), number of followers of the user (user_follower_count), and location of the user (user_location). I also looked through the datasets to see if there were any outlying tweets that weren’t “correctly filled out” (missing values, no valid timestamp under “created_at”, or NAs for values). These were excluded from the analysis. A total of 41497 tweets were analyzed from the Biden set and 67267 tweets from the Trump set, for a total of 108764 tweets. These datasets were converted to pandas data frames as well.
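A minimal sketch of the Election Day filter described above, using pandas. It is an illustration under assumptions rather than the exact code used for this article: the function name `filter_election_day` and the sample rows are made up, a tiny in-memory frame stands in for the Kaggle CSV files, and dropping only rows with a missing timestamp or tweet text is my guess at what “correctly filled out” means.

```python
import pandas as pd

# Column names as listed in this section.
COLUMNS = ["created_at", "tweet", "likes", "retweet_count",
           "user_description", "user_follower_count", "user_location"]

def filter_election_day(df, day="2020-11-03"):
    """Keep rows posted on `day` that have a valid timestamp and tweet text."""
    out = df[COLUMNS].copy()
    # Unparseable timestamps become NaT so they can be dropped below.
    out["created_at"] = pd.to_datetime(out["created_at"], errors="coerce")
    # Assumption: only the timestamp and the tweet text must be present.
    out = out.dropna(subset=["created_at", "tweet"])
    start = pd.Timestamp(day)
    end = start + pd.Timedelta(days=1)
    # Timestamps are treated as naive because the data has no time zone.
    return out[(out["created_at"] >= start) & (out["created_at"] < end)]

# Hypothetical rows standing in for one of the Kaggle files.
sample = pd.DataFrame({
    "created_at": ["2020-11-02 23:59:59", "2020-11-03 00:00:00",
                   "2020-11-03 15:30:00", "not a date", "2020-11-04 00:00:00"],
    "tweet": ["too early", "midnight", "afternoon", "bad timestamp", "too late"],
    "likes": [1, 2, 3, 4, 5],
    "retweet_count": [0, 1, 2, 3, 4],
    "user_description": ["a", "b", "c", "d", "e"],
    "user_follower_count": [10, 20, 30, 40, 50],
    "user_location": ["NJ", "NY", "PA", "GA", "WI"],
})
election_day = filter_election_day(sample)
```

With the real files, `sample` would be replaced by a frame loaded with `pd.read_csv` from wherever the CSV was saved.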

Peak Twitter Activity

Below are Figure 1a and Figure 1b, which depict the number of tweets posted on Election Day for each set of hashtags. Users with Trump-related tweets were much more active in general than users with Biden-related tweets. As can be seen, the graphs follow a similar pattern as to when users on Twitter were active, with peaks around hour 15 (3:00 pm) and later at night (hour 18 until hour 23). This makes sense since the results began coming in and were streamed around 5 pm EST that day, and more and more users tweeted about the election results as they continued to come in. I would assume that a similar trend would occur when results in key battleground states came in as well (WI, MI, GA, PA).

Figure 1a) Number of Tweets Tweeted Throughout Election Day By Military Time (0:00–23:59) Containing #Trump or #DonaldTrump | Figure 1b) Number of Tweets Tweeted Throughout Election Day By Military Time (0:00–23:59) Containing #Biden or #JoeBiden

Around 15:00 is when there’s a peak for Twitter users that used either hashtag. I wanted to take a look at what users were saying at that time. There was also a peak for Trump-related tweets at approximately 22:00 as well as a peak for Biden-related tweets around 23:00.
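The hourly counts behind plots like Figures 1a and 1b can be sketched roughly as follows. This is one possible approach and not necessarily how the figures were produced; the function name `tweets_per_hour` and the sample timestamps are hypothetical.

```python
import pandas as pd

def tweets_per_hour(created_at):
    """Count tweets in each hour 0-23, including hours with no tweets."""
    hours = pd.to_datetime(created_at).dt.hour
    # reindex fills in hours that never appear so the result always has 24 rows.
    return hours.value_counts().reindex(range(24), fill_value=0)

# Hypothetical timestamps; the real input would be the created_at column.
timestamps = pd.Series([
    "2020-11-03 15:01:00",
    "2020-11-03 15:45:00",
    "2020-11-03 22:10:00",
])
counts = tweets_per_hour(timestamps)
peak_hour = counts.idxmax()
```

Calling `counts.plot(kind="bar")` would then draw a bar chart of tweets per hour with matplotlib.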

Most Retweeted

Below are some of the top tweets posted on Election Day, labeled by hashtag and time of day. As expected, most of the highly retweeted tweets posted that day are by people who are verified (and who likely have many followers). There is also a variety of opinions voiced in these tweets regardless of political affiliation.

Some Top Tweets Throughout The Whole Day with #Biden or #JoeBiden

Some Top Tweets Throughout The Whole Day with #Trump or #DonaldTrump

Top Tweet During Hour 15 for #Biden

Some Top Tweets During Hour 15 with #Biden or #JoeBiden

Some Top Tweets During Hour 15 with #Trump or #DonaldTrump
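Lists of top tweets like the ones above could be pulled with a small helper along these lines. This is a sketch under assumptions: the name `top_retweeted` and the sample rows are hypothetical, and ranking by retweet_count is my reading of “most retweeted”.

```python
import pandas as pd

def top_retweeted(df, n=5, hour=None):
    """Return the n most retweeted rows, optionally restricted to one hour."""
    if hour is not None:
        df = df[pd.to_datetime(df["created_at"]).dt.hour == hour]
    return df.nlargest(n, "retweet_count")

# Hypothetical rows for illustration only.
sample_tweets = pd.DataFrame({
    "created_at": ["2020-11-03 09:00:00", "2020-11-03 15:05:00",
                   "2020-11-03 15:40:00", "2020-11-03 22:30:00"],
    "tweet": ["morning", "afternoon one", "afternoon two", "night"],
    "retweet_count": [50, 10, 30, 90],
})
top_all_day = top_retweeted(sample_tweets, n=2)
top_hour_15 = top_retweeted(sample_tweets, n=1, hour=15)
```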

Tokenizing Tweets

The largest part that I focused on during this brief analysis was tokenizing the tweets in this dataset. In order to do this, I had to create a list of all of the tweets from the data frames. For each of the Biden and Trump data sets I did the following:

  1. Create an empty list and select only the tweet column from the data frame.
  2. Turn the data frame (with only the tweet column) into a list so that it’s a list of tweets instead.
  3. Using TweetTokenizer from the nltk package, tokenize each tweet and append each token to the empty list.
  4. Using for loops, make each token lowercase, then remove English stop words, punctuation (except for #), and Spanish stop words (as there were a number of tweets in Spanish).
  5. For each candidate’s set of tweets, make an additional token set containing only the tokens that start with # (i.e. hashtags only).
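The steps above can be sketched as follows. This is a minimal illustration and not the exact code behind this article. The small `STOP_WORDS` set is a stand-in so the example runs without downloading any corpora; in practice the English and Spanish lists from `nltk.corpus.stopwords` would be used instead. The function names and sample tweets are hypothetical, and dropping tokens made up entirely of punctuation is my interpretation of step 4.

```python
import string
from nltk.tokenize import TweetTokenizer

# Stand-in stop words (assumption); the real lists would come from
# nltk.corpus.stopwords.words("english") and words("spanish").
STOP_WORDS = {"the", "is", "a", "to", "and", "of", "in",
              "el", "la", "de", "que", "y", "en"}
# Every punctuation character except '#', per step 4.
PUNCTUATION = set(string.punctuation) - {"#"}

def tokenize_tweets(tweets):
    """Tokenize each tweet, lowercase, and drop stop words and punctuation."""
    tokenizer = TweetTokenizer()
    tokens = []
    for tweet in tweets:
        for token in tokenizer.tokenize(tweet):
            token = token.lower()
            if token in STOP_WORDS:
                continue
            # Drop tokens made up only of punctuation, such as "!" or ",".
            if all(ch in PUNCTUATION for ch in token):
                continue
            tokens.append(token)
    return tokens

def hashtag_tokens(tokens):
    """Keep only the tokens that start with '#'."""
    return [t for t in tokens if t.startswith("#")]

# Hypothetical tweets; the real input would be df["tweet"].tolist().
sample_tweets = [
    "Vote today! #ElectionDay is here, @JoeBiden",
    "La eleccion de hoy #USAelections",
]
tokens = tokenize_tweets(sample_tweets)
hashtags = hashtag_tokens(tokens)
```

TweetTokenizer keeps hashtags and @mentions together as single tokens, which is why tokens such as #electionday and @joebiden survive the punctuation filter.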

The top tokens (apart from #biden, #joebiden, #trump, and #donaldtrump) of each category are all very similar to each other. Some of these tokens include #electionday, #election2020, election, and #usaelections. Two of the tokens were @joebiden and @realdonaldtrump. Other common tokens include #covid19, #maga, win, and president. These are all fairly neutral terms and do not reveal anything surprising.
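Ranking tokens like this can be done with the standard library alone. The sketch below is one way to do it, with a hypothetical helper name `top_tokens` and made-up sample tokens; it leaves out the four candidate hashtags before counting.

```python
from collections import Counter

# Hashtags used to collect the dataset, excluded from the ranking.
CANDIDATE_TAGS = {"#biden", "#joebiden", "#trump", "#donaldtrump"}

def top_tokens(tokens, n=10, exclude=CANDIDATE_TAGS):
    """Return the n most common tokens as (token, count) pairs."""
    counts = Counter(t for t in tokens if t not in exclude)
    return counts.most_common(n)

# Hypothetical token list for illustration.
sample_tokens = ["#biden", "#electionday", "win", "#electionday",
                 "president", "win", "#electionday"]
ranked = top_tokens(sample_tokens, n=2)
```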

However, I will note that further down the list of top tokens are some emojis. Since Twitter is commonly used on mobile devices, emojis are expected in some tweets. Some of the most commonly tweeted emojis from the top 100 tokens of the Biden set are 💙, 😂, 🤣, and 🙏. From the top 100 tokens of the Trump set, some of the most commonly tweeted emojis are 😂, ☠️, ❤️, 🤣, 👇, and 🙏.
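One rough way to separate emoji tokens from word tokens uses only Python’s built-in `unicodedata` module. This is a heuristic sketch and not something used in this analysis: it flags any token containing a character in the Unicode “Symbol, other” category, which covers the emojis listed above but also symbols such as © or ™. The helper names are hypothetical.

```python
import unicodedata

def is_emoji_like(token):
    """Heuristic: True if any character is in Unicode category 'So'."""
    return any(unicodedata.category(ch) == "So" for ch in token)

def split_emoji(tokens):
    """Split tokens into (emoji-like tokens, everything else)."""
    emoji = [t for t in tokens if is_emoji_like(t)]
    words = [t for t in tokens if not is_emoji_like(t)]
    return emoji, words

# Blue heart, face with tears of joy, and skull and crossbones
# (the last one followed by a variation selector).
sample_tokens = ["#electionday", "\U0001F499", "win",
                 "\U0001F602", "\u2620\ufe0f"]
emoji, words = split_emoji(sample_tokens)
```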

General Conclusions

There weren’t any outstanding tweets or tokens from my brief analysis. As I mention in my limitations below, the computing power of my machine restricted how much I could complete in the span of a couple of days (or is it the bad procrastination habits of a college student?). There were a few more potential trends I wanted to explore, including whether tweets were being posted from any particular locations on Election Day, whether the number of followers really mattered for the number of retweets and likes a tweet would get, common tokens in user profile bios, and tokens of tweets from verified versus non-verified users.

The tweets span all political views and kinds of support (or lack thereof) for both candidates, so there didn’t appear to be one overwhelming kind of tweet identified from the tokens. It was identified earlier, though, that there were more tweets with #Trump or #DonaldTrump than with #Biden or #JoeBiden. While I was investigating some individual tweets, I saw tweets that had both a Biden and a Trump hashtag, so such a tweet would get counted twice when looking at the two datasets side by side.
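The size of this overlap could be estimated with a check like the one below. It is a sketch that was not part of the original analysis: it matches tweets on their exact text in the tweet column, which is an assumption (a unique tweet identifier would be more reliable if one is available), and the function name and sample rows are hypothetical.

```python
import pandas as pd

def shared_tweets(biden_df, trump_df, key="tweet"):
    """Return the rows of biden_df whose `key` value also appears in trump_df."""
    return biden_df[biden_df[key].isin(trump_df[key])]

# Hypothetical rows for illustration only.
biden = pd.DataFrame({"tweet": ["#Biden wins?", "#Biden vs #Trump tonight"]})
trump = pd.DataFrame({"tweet": ["#Trump rally", "#Biden vs #Trump tonight"]})

overlap = shared_tweets(biden, trump)
# Combined count without double counting, assuming no repeats within a frame.
unique_total = len(biden) + len(trump) - len(overlap)
```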

What was somewhat surprising, however, is that there were tweets in languages other than English. I removed Spanish stop words because I noticed early on that some of the tokens were Spanish words and (based on my limited high school Spanish knowledge) I knew they didn’t mean anything significant. Other than Spanish, I only noticed other languages because of the characters being used, such as ones from the Arabic alphabet. This may simply mean that there are American users who are writing in a language other than English. However, I noted that some of the user_location values were locations outside the United States. This was a surprise to me, but after a quick Google search, I found that it’s apparently more common for international folks to watch US election results than I realized. After all, CNN has a guide on how to watch the election results as a non-American.

Limitations and Ethics

  1. There may be ethical concerns to inferring information about individuals even though I did not particularly infer anything about the users in these tweets.
  2. There is a lot of overlap between the two datasets so some of the same tweets are included in both of the datasets as previously stated.
  3. Filtering emojis was something I wanted to implement. While there is a way to do it, it’s quite the process and, to my knowledge, there’s not yet a direct way to do this in Python with pandas or nltk.
  4. This is a personal limitation (and not of one of the datasets) but computing time and power are limited on my machine and hence, running the code for this analysis took longer than expected. Occasionally, the page would crash.

So what does that mean for the political climate? Nothing conclusive, unfortunately. Many people have been keeping an eye on this election for the past couple of years and speculating on social media about what may happen or what is happening with it. Between users who used #Biden/#JoeBiden and users who used #Trump/#DonaldTrump, the line that separates them is rather blurred (which is where analyzing Twitter bios would come in handy). All we can say is that even if you’re not tweeting about the election, there’s a good chance you’re liking, retweeting, or at least reading about the election on your Twitter feed. (Trump hasn’t even conceded at this point in time.) Even though the election is essentially over and Joe Biden is projected to be the next president, I doubt we will stop talking about this election for a while.

Megan Resurreccion
Social Media: Theories, Ethics, and Analytics

Hello! I’m a PhD student in Information Systems at NJIT. Feel free to connect with me through LinkedIn! https://www.linkedin.com/in/megan-resurreccion/