Political Polarization on Twitter

May 1, 2019

Alex Sahai, Chadd Yuan, Lilian Gjertsson, Marissa Baker, & Ooga Nam

Twitter is a social media platform with hundreds of millions of posts per day from countries all around the world. With comments ranging in topic from Game of Thrones to presidential debates, Twitter is a data goldmine.

For our final project, we wanted to look into politics in the United States. What can be gleaned about President Trump from his midnight Twitter rampages? How about from a live stream of all mentions of Trump? More generally, we wanted a picture of Twitter as a global forum for discussion, so the project grew to include an interactive map of the topics trending around the world.

We start with one of the most prolific (and controversial) tweeters of all time, President Trump, whose tweets capture the attention of the globe. With over 59.9 million followers, his words reverberate across the world. We wanted to understand a few key trends in his tweeting habits. Luckily for us, there is plenty of material: President Trump tweets more than 10 times per day and has more than 41,600 tweets on his timeline as of May 1st, 2019.

Data (mis)adventures

To analyze his tweets, we first needed to collect the tens of thousands of tweets from the past 10 years of his Twitter history. Our first thought was to use Twitter’s API through Tweepy. However, the Twitter API limits developers to a user’s most recent 3,200 tweets, which obviously would not be HUGE enough. As we all know, Trump is the biggest, baddest, and greatest tweeter there ever was. After banging our heads against the wall, thinking of workarounds, and seriously considering a pivot, we were blessed with www.trumptwitterarchive.com, which maintains an up-to-date archive of President Trump’s tweets and allows users to export to both CSV and JSON formats. We chose JSON, a format we had previously worked with in class.

This website saved us a lot of time and energy, and if you are analyzing Trump’s tweets, it is the easiest way to gather all of them. It even has handy filters if you want to narrow down the collection of tweets. Next, we used our various Python IDEs and Jupyter notebooks to analyze the data.
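As a rough idea of what working with such an export looks like, here is a minimal sketch that parses a JSON export and drops retweets. The field names (`text`, `created_at`, `is_retweet`) and the two sample records are assumptions made for illustration, not a guaranteed description of the archive's format.

```python
import json

# Two invented records shaped like an archive export. The field names
# ("text", "created_at", "is_retweet") are assumptions about the format.
sample_export = """
[
  {"text": "Example tweet one",
   "created_at": "Wed May 01 12:00:00 +0000 2019",
   "is_retweet": false},
  {"text": "RT @someone: Example retweet",
   "created_at": "Tue Apr 30 23:15:00 +0000 2019",
   "is_retweet": true}
]
"""

def load_original_tweets(raw_json):
    """Parse an export and keep only tweets the account wrote itself."""
    records = json.loads(raw_json)
    return [r for r in records if not r.get("is_retweet", False)]

tweets = load_original_tweets(sample_export)
print(len(tweets), tweets[0]["text"])
```

With a real export you would read the file contents instead of the inline string and pass them to the same function.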

Twitter Key-Word Live Streams

While the Trump Twitter archive allowed us to access and aggregate the historical data we needed, we were also interested in how the world reacted to and conversed about all things Trump in real time. Using Twitter’s live stream API to access tweets containing a given keyword as they were posted, we could track how Trump was perceived throughout the day. However, the data returned was extremely disorganized. Simple query parameters, such as limiting the language to English, filtered out a lot of unwanted data, but the format of the JSON returned was still inconsistent. To solve this, we wrote separate “cases” to handle the different formats, which let us extract just the full text of each streamed tweet. We then cleaned each tweet by removing stopwords, URLs, user handles (@user_name), and similar noise in order to get more accurate sentiment scores.
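The “cases” and the cleaning step can be sketched roughly as follows. This is a minimal illustration rather than our exact code: it assumes each streamed status has already been parsed into a dictionary with the standard `retweeted_status` and `extended_tweet` fields, and the stopword list is a tiny stand-in for a fuller one.

```python
import re

# A tiny stopword list for illustration only; a real run would use a
# fuller list (this particular set is an assumption, not our original).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "rt"}

def full_text(status):
    """Return the longest available text from a streamed status dict."""
    # Case 1: retweets carry the original tweet under "retweeted_status".
    if "retweeted_status" in status:
        status = status["retweeted_status"]
    # Case 2: long tweets keep their full body under "extended_tweet".
    if "extended_tweet" in status:
        return status["extended_tweet"]["full_text"]
    # Case 3: short tweets only have "text" (or "full_text" in extended mode).
    return status.get("full_text") or status.get("text", "")

def clean(text):
    """Lowercase, then drop URLs, @handles, punctuation, and stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    words = re.findall(r"[a-z']+", text)
    return " ".join(w for w in words if w not in STOPWORDS)

# Invented example status, truncated at the top level as retweets often are.
status = {
    "text": "RT @user_name: Trump is on the news",
    "retweeted_status": {
        "text": "Trump is on the news",
        "extended_tweet": {
            "full_text": "Trump is on the news again https://t.co/abc123 @user_name"
        },
    },
}
cleaned = clean(full_text(status))
print(cleaned)
```

Checking the retweet case first matters because the top-level text of a retweet is often cut off, while the nested original is complete.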

Using the Afinn and Plotly libraries, we calculated sentiment scores and visualized the data as a histogram showing the distribution of how people felt about Trump. By building a Dash app, we were also able to add a callback function that continuously updates the histogram as more data streams in, so we could see whether the distribution of tweet sentiments varied as the day went on.

Tweets involving Trump consistently showed a slightly negative sentiment. However, because the standard Twitter API limits the number of requests per minute, we were not able to track sentiments for very long. With an enterprise Twitter API, you could track sentiments for days on end and easily implement tools to flag drastic changes in the distribution of sentiment scores across a day, month, or year. You could also run multiple streams at once to compare the sentiment of multiple keywords.

For example, comparing tweets containing the word “Trump” versus “Obama” for a month might produce insights on global sentiments in a certain time period. By saving some sample data to JSON files from independent search queries, we were able to create a mock graph of what such a feature might look like. With such “small” amounts of data you cannot really say much, as the sentiments are generally neutral for both people. Still, it is interesting to notice the extreme spike in highly negative sentiment towards topics surrounding Obama alongside a greater cumulative, albeit less extreme, negative sentiment towards Trump. Perhaps the people who dislike Obama simply express it in a more extreme way.

To see this graph in action or play around with different ways of visualizing the data, you can clone everything you need from this repository: https://github.com/cyuan01/birdnest

Politician Comparisons

On the topic of comparing Trump to other politicians, we were also interested in seeing whether we could distinguish his tweets from those of Hillary Clinton. We trained a random forest classifier on a bag-of-words matrix created from a sample of their tweets. We also appended two features to the matrix: the sentiment and subjectivity scores of each tweet. The best score obtained was around 0.77, though it is interesting to note that Hillary’s tweets are guessed correctly more often than Trump’s tweets are. Consider the following visualization of our confusion matrix:
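To make the relationship between an overall score and per-author results concrete, here is a minimal sketch of how a confusion matrix, overall accuracy, and per-class recall are computed. The labels below are made up for illustration; they are not our model’s predictions and do not reproduce the 0.77 figure.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Share of all tweets whose author was guessed correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Share of tweets truly from `label` that were predicted as `label`."""
    total = sum(t == label for t in y_true)
    hits = sum(t == p == label for t, p in zip(y_true, y_pred))
    return hits / total

# Made-up labels for illustration only (NOT our actual results).
y_true = ["clinton"] * 5 + ["trump"] * 5
y_pred = ["clinton"] * 5 + ["trump"] * 3 + ["clinton"] * 2

matrix = confusion_counts(y_true, y_pred)
print(matrix[("trump", "clinton")])          # Trump tweets mistaken for Clinton
print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred, "clinton"), recall(y_true, y_pred, "trump"))
```

A single overall score can hide an asymmetry like this one, which is why the per-author breakdown is worth looking at.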

These results are surprising in light of the idea that Trump is seen as being more outspoken than Hillary and therefore presumably would use stronger, more polarizing language. To test this belief, we analyzed the most frequent two- and three-word phrases in Trump’s tweets. Phrase frequency analysis revealed that his top phrases over the years include “crooked Hillary”, “make america great”, and “fake news”, all of which are either subjective (“fake”) or carry an extreme sentiment in either direction (“great”, “crooked”). As a first look at his most common language, we displayed a word cloud of his 2019 tweets.

In addition to the word cloud, we used sklearn’s CountVectorizer to find phrases that President Trump commonly uses. One can specify the length of the phrases to look for through the vectorizer’s `ngram_range` parameter; since we wanted the most common two- and three-word phrases, we passed in those two numbers. The CountVectorizer produces a matrix of phrase counts, from which the most frequent phrases of the specified length(s) can be ranked. See the following image:

According to the above graphic, Trump’s most common phrases include “Kim Jong Un” and the “Southern Border”. Foreign policy clearly matters a lot to Trump, given how often he raises those issues on Twitter.
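The phrase counting itself is simple enough to sketch with the standard library. The example below is a stand-in for what CountVectorizer does with `ngram_range=(2, 3)`, not the sklearn code we ran, and the three sample tweets are invented.

```python
from collections import Counter
import re

def ngram_counts(texts, sizes=(2, 3)):
    """Count every run of `n` consecutive words, for each n in `sizes`."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Invented sample tweets for illustration.
tweets = [
    "Make America great again",
    "We will make America great",
    "Fake news again",
]
counts = ngram_counts(tweets)
top_phrases = {phrase for phrase, _ in counts.most_common(3)}
print(counts.most_common(3))
```

Phrases are counted within a single tweet only, so a phrase never spans the boundary between two tweets.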

The code for this section can be found in this GitHub repo!

Country Sentiment Analysis

Trump often speaks ill of North Korea and Mexico, and we wanted to see whether sentiment analysis of his tweets mentioning those countries supported that, and how his feelings towards them have changed, if at all, over time. On the flip side, we know that Trump views Canada and Israel relatively favorably, so we performed the same sentiment analysis on those countries to test our hypothesis. Here are the results over time:

He views Israel the most favorably of the four countries analyzed. His affinity for Israel peaked around 2017 and has waned ever so slightly since then. On the other hand, Trump’s sentiment toward Mexico is the most negative. Interestingly, in 2016 his tweets seemed to favor Mexico more than any of the other countries, but his sentiment has declined sharply ever since. Why would Trump, ever the opponent of Mexico, view it so positively? Perhaps his tweets mentioning Mexico also contained such positive phrases as “Make America Great Again”. That alone cannot explain it, however: according to our results, “Make America Great” is one of Trump’s top phrases year after year. So why would sentiment towards Mexico be so high in 2016 alone? This is a potential future direction of research: we could filter for tweets containing the word Mexico and examine the words and phrases most associated with those tweets.
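The grouping behind a chart like this can be sketched as follows. This is a minimal illustration under stated assumptions: each tweet is a dictionary with `text`, `created_at` (in Twitter’s timestamp format), and a precomputed sentiment `score`, and the sample tweets and scores are invented.

```python
from collections import defaultdict
from datetime import datetime

TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"
COUNTRIES = ["Mexico", "Israel", "Canada", "North Korea"]

def yearly_sentiment(tweets, countries=COUNTRIES):
    """Average the score of tweets mentioning each country, per year."""
    buckets = defaultdict(list)
    for tweet in tweets:
        year = datetime.strptime(tweet["created_at"], TWITTER_FORMAT).year
        for country in countries:
            # Simple case-insensitive substring match on the country name.
            if country.lower() in tweet["text"].lower():
                buckets[(country, year)].append(tweet["score"])
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}

# Invented tweets and scores, for illustration only.
sample = [
    {"text": "Great meeting with Mexico today",
     "created_at": "Wed Jun 15 14:00:00 +0000 2016", "score": 3.0},
    {"text": "Mexico must pay",
     "created_at": "Fri Jun 15 14:00:00 +0000 2018", "score": -1.0},
    {"text": "Mexico is not helping",
     "created_at": "Fri Jun 15 18:00:00 +0000 2018", "score": -3.0},
    {"text": "Israel is a great friend",
     "created_at": "Fri Jun 15 20:00:00 +0000 2018", "score": 3.0},
]
averages = yearly_sentiment(sample)
print(averages)
```

Because a yearly average can rest on only a handful of tweets, it is worth checking the number of tweets in each bucket before reading much into a spike like the one for Mexico in 2016.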

The code for this section can be found in this GitHub repo!

Global Trends Map

While working with the large number of tweets regarding Trump, we became curious about what other topics were trending around the world. We started by attempting to use the Twitter API to retrieve trending topics from every country in the world. However, we quickly realized that Twitter simply does not record trends for every country (e.g. Twitter is blocked in Iran, so there are no trends there!). Luckily, the Twitter API can return the name of each location for which it does have trends, so after pulling those we used them to retrieve trends for every country that Twitter records. After parsing the responses, we stored the top three trends for each available country in a dictionary, with the country name as the key and the trends as the value.

In order to visualize this data, we decided to create an interactive map where, upon hovering over a country, a small box appears with the country name and the current top three trends in that country. We used Plotly to create the map and discovered that we would need to move all our data into a CSV file to make it easily readable for Plotly. To do so, we simply used csv.writer to write one row for each country in the dictionary along with its trends. We decided to use a choropleth as the foundation of our map, which had advantages and disadvantages. Choropleths fill in geographical areas based on a quantitative value, which we initially planned to be the number of Twitter users. However, this data is not accessible through Tweepy and is not reliable elsewhere on the Internet. We decided to continue with the choropleth anyway, which would allow that feature to be implemented should the data become available. In the meantime, we filled in every country where trends are available with a single solid color. That way, users can easily tell which countries have data and which do not. Plotly also used different country codes than Twitter did, so we needed to import the correct codes and map them to their countries.

Another issue we encountered when creating this map was that some trends in non-English-speaking countries used a different alphabet, and when we converted our data to a CSV file, these characters were turned into garbled symbols. We attempted to fix the problem by using ‘utf-8’ encoding, but when that didn’t work we decided to filter the top trends down to those using only English characters. If we had more time to work on the project, we would focus on getting the actual characters from each country into the top trends, so that each country’s trends are depicted more accurately. We think this reflects one of the major issues of processing large amounts of data: minority groups can often be overlooked and underrepresented.
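A minimal sketch of the dictionary-building step is shown below. It assumes the list of available locations and the per-location trends have the shape returned by the Twitter v1.1 trends endpoints (which Tweepy 3.x exposes as `api.trends_available()` and `api.trends_place(woeid)`). The function name, the stubbed data, and the placeholder WOEIDs are hypothetical, so that the example runs without credentials.

```python
def top_trends_by_country(locations, fetch_trends, limit=3):
    """Map each country name to the names of its top `limit` trends.

    `locations` is a list of location dicts and `fetch_trends` is any
    callable that takes a WOEID and returns the trends response for it.
    """
    trends = {}
    for place in locations:
        # Keep country-level entries only; towns and cities are skipped.
        if place["placeType"]["name"] != "Country":
            continue
        response = fetch_trends(place["woeid"])
        trends[place["name"]] = [t["name"] for t in response[0]["trends"][:limit]]
    return trends

# Stubbed data with placeholder WOEIDs (1 and 2 are NOT real identifiers).
sample_locations = [
    {"name": "Denmark", "woeid": 1, "placeType": {"code": 12, "name": "Country"}},
    {"name": "Copenhagen", "woeid": 2, "placeType": {"code": 7, "name": "Town"}},
]
sample_responses = {
    1: [{"trends": [{"name": "Messi"}, {"name": "#dkpol"},
                    {"name": "Liverpool"}, {"name": "Barcelona"}]}],
}
result = top_trends_by_country(sample_locations, lambda woeid: sample_responses[woeid])
print(result)
```

Passing the fetch function in as an argument keeps the parsing logic separate from the network call, so a real API method can be swapped in for the stub.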
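The dictionary-to-CSV step can be sketched like this. The column names and the sample trends are assumptions made for illustration; any layout with one row per country would work equally well.

```python
import csv
import os
import tempfile

def write_trends_csv(trends, path, limit=3):
    """Write one row per country: its name followed by up to `limit` trends."""
    header = ["country"] + [f"trend_{i + 1}" for i in range(limit)]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for country, names in sorted(trends.items()):
            # Pad with empty strings so every row has the same width.
            padded = (list(names) + [""] * limit)[:limit]
            writer.writerow([country] + padded)

# Invented sample data for illustration.
trends = {
    "Denmark": ["#dkpol", "Messi", "Liverpool"],
    "Brazil": ["Messi", "#BBB19"],
}
path = os.path.join(tempfile.mkdtemp(), "trends.csv")
write_trends_csv(trends, path)

with open(path, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Padding short rows keeps the file rectangular, which makes it easier to load as a table for plotting.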
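The filter we fell back on, and one possible explanation for the garbling, can be sketched as below. The second half is only a guess at the cause: it shows that UTF-8 text survives a round trip when it is read back as UTF-8, but turns into garbled symbols when the same bytes are read with a different encoding (Latin-1 here). The sample trends are invented.

```python
import csv
import io

def english_only(trend_names):
    """Keep only trends made up entirely of ASCII characters."""
    return [name for name in trend_names if name.isascii()]

# Invented sample trends in three scripts.
trends = ["Messi", "Месси", "リヴァプール", "Liverpool"]
kept = english_only(trends)

# Write the row as CSV text, then encode it to bytes as a file would store it.
buffer = io.StringIO()
csv.writer(buffer).writerow(trends)
raw_bytes = buffer.getvalue().encode("utf-8")

# Reading the bytes back with the same encoding restores the original text.
restored = next(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))))
# Reading them with a different encoding produces garbled characters.
garbled = next(csv.reader(io.StringIO(raw_bytes.decode("latin-1"))))

print(kept)
print(restored == trends, garbled == trends)
```

If the mismatch is indeed on the reading side, the file itself may be fine and only the program opening it needs to be told the encoding.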

After developing the map, it became much easier to spot trends and interesting differences between countries. For example, after Messi scored a crowd-inspiring 27-meter free kick against Liverpool, he was trending in countries around the world. The map makes it clear which events reach a global scale and which matter only within a single country or region.

See the interactive map below, and check out the code at this GitHub repo.

Bonus!

Now that we have learned a lot about Trump and global trends, we thought “why not apply what we have found to an equally polarizing political figure in Denmark?”

Ulf Aslak.

Although there is no www.aslaktwitterarchive.com, we were lucky that Ulf only has 950 tweets, so we were able to use the Twitter API and Tweepy to scrape his entire timeline. Last semester, several students in Big Data did a project on whether Trump’s tweets influence the stock market, and part of their analysis was relevant to our bonus endeavor. Here are links to their Blog Post and their GitHub Repo. We modified some of their code and created comparable graphics for Ulf so we could compare the two in terms of word clouds, sentiment scores, tweets by day, and tweets by time. Here is a link to our repo.

Word Clouds

Word clouds are a simple way to visualize common words in a corpus of text. We used the wordcloud library to create ours below.

**Left:** Last Semester’s Trump Word Cloud **Right:** Our Ulf Word Cloud

It is clear from the word clouds that Trump’s tweets are more about abstract concepts like greatness, thanks, and, shockingly, himself, whereas Ulf’s tweets primarily revolve around data (so surprising), people, and coding.

When creating our text cleaning functions, we initially removed mentions of other Twitter users, but they turned out to be quite a significant portion of the word clouds, so we put them back in. Trump often mentions himself and Barack Obama’s account, whereas Ulf often tags suneman (Sune Lehmann, who runs the lab Ulf works in) and colleagues BenFMaier and DirkBrockmann. Looks like Ulf could improve on his political mentions for his future career in politics.

Now that we had a sense of what words are used in their tweets, we became curious: is there a difference in sentiment between Trump’s and Ulf’s tweets?

Sentiment Analysis

In class, we learned about the Afinn library which allows us to create sentiment scores for text, and in this case, tweets. The library has a score method that returns a positive score if the language of the tweet is overall positive and a negative score if the language of a tweet is overall negative. A sentiment score at or around zero is considered neutral.
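To show the idea behind this kind of score, here is a minimal sketch of lexicon-based scoring: each word has an integer valence, and a tweet’s score is the sum of the valences of its words. The tiny lexicon and its values are invented for illustration; the real Afinn word list is far larger, and with the library itself the equivalent call is `Afinn().score(text)`.

```python
import re

# Invented toy lexicon in the AFINN style (integer valences from -5 to +5).
# These entries and values are assumptions, not the real word list.
TOY_LEXICON = {
    "great": 3, "good": 3, "thanks": 2,
    "fake": -3, "crooked": -2, "bad": -3,
}

def score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words count as zero."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

examples = ["Make America great", "Fake news", "Data and people"]
scores = [score(text) for text in examples]
average = sum(scores) / len(scores)
print(scores, average)
```

Because unknown words contribute nothing, tweets full of names, hashtags, or technical vocabulary tend to land at or near zero, which helps explain why so many tweets look neutral.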

As we can see, the majority of both Trump’s and Ulf’s tweets are centered around zero. However, Trump has a more positive average sentiment of 0.172 compared to Ulf’s 0.073. We initially expected Ulf’s tweets to be more on the positive side, but perhaps Trump uses more powerful and polarizing positive language, such as the “make america great” phrasing mentioned above, which may push his average sentiment score higher. Show some excitement, Ulf!

Tweets by Day

Beyond what Trump and Ulf say, it’s also interesting to analyze when they say it. We know Trump is well-known for his weekend golf trips but does Ulf exhibit a similar work-hard, play-hard mentality?

According to our analysis, it looks like Ulf just works hard (except on Sundays, when he takes a bit of a break). Have some fun, Ulf!
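Counting tweets per weekday takes only a few lines. The sketch below assumes timestamps in Twitter’s `created_at` format; the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_weekday(timestamps):
    """Count tweets by the weekday name of their timestamp."""
    days = (
        datetime.strptime(ts, TWITTER_FORMAT).strftime("%A")
        for ts in timestamps
    )
    return Counter(days)

# Invented timestamps for illustration.
timestamps = [
    "Mon Apr 29 08:15:00 +0000 2019",
    "Mon Apr 29 09:40:00 +0000 2019",
    "Wed May 01 07:05:00 +0000 2019",
    "Sun Apr 28 13:30:00 +0000 2019",
]
by_day = tweets_per_weekday(timestamps)
print(by_day)
```

Weekdays with no tweets simply do not appear as keys, although looking one up in the `Counter` still returns zero.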

Tweets by Time

In addition to the day of the week on which Trump and Ulf post, we thought it would be interesting to analyze the time of day, as we know Trump has a tendency to post late-night Twitter barrages.

Unlike Trump, Ulf does the majority of his tweeting in the morning and early afternoon hours. He must be following the Danish way and winding down his tweeting later in the afternoon. He also does not seem to be a night owl, with very few tweets coming in between the hours of 11 PM and 6 AM. In fact, he might be a morning person, with a tweet-peak between 7–10 AM.

Ultimately, we think that Trump might be better positioned for the political world than Ulf, but we have a few recommendations for Ulf to improve his Twitter game and be more like Trump:
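One detail worth handling in an analysis like this is the time zone, since Twitter timestamps are in UTC. The sketch below converts to local time before bucketing by hour. The fixed UTC+2 offset is an assumption standing in for Danish summer time (it ignores daylight-saving changes), and the timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"
# Assumed fixed offset for illustration; a time zone database would be
# needed to handle daylight-saving transitions correctly.
LOCAL = timezone(timedelta(hours=2))

def tweets_per_local_hour(timestamps, tz=LOCAL):
    """Count tweets by the local hour (0-23) in which they were posted."""
    hours = (
        datetime.strptime(ts, TWITTER_FORMAT).astimezone(tz).hour
        for ts in timestamps
    )
    return Counter(hours)

def night_share(by_hour):
    """Share of tweets posted between 11 PM and 6 AM local time."""
    total = sum(by_hour.values())
    night = sum(n for hour, n in by_hour.items() if hour >= 23 or hour < 6)
    return night / total

# Invented timestamps for illustration.
timestamps = [
    "Mon Apr 29 06:15:00 +0000 2019",  # 08:15 local
    "Mon Apr 29 06:40:00 +0000 2019",  # 08:40 local
    "Wed May 01 22:30:00 +0000 2019",  # 00:30 local, the next day
]
by_hour = tweets_per_local_hour(timestamps)
print(by_hour, night_share(by_hour))
```

Without the conversion, a morning tweet in Copenhagen and a morning tweet in Washington would land in very different hour buckets, which would distort the comparison.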

Use powerful phrases like “Make Denmark Great” to stoke the excitement of your supporters
Take a break on the weekends to recharge and cheat at golf
Tweet later in the day — it shows that the grind doesn’t stop

It is important to note that the sample sizes for President Trump and Ulf differ by about 42x, so the conclusions drawn from Ulf’s tweets are statistically limited: for some bins in the histograms he has fewer than 10 tweets. In terms of future exploration, it could be interesting to look more deeply at Ulf’s tweets over time, although the data was fairly limited, so we did not notice any meaningful trends besides a weird affinity for data and people.

We hope that you’ve enjoyed our exploration into Twitter. From sentiment analysis to an interactive global trend map to a comparison between Trump and our teacher, we’ve shown how powerful Twitter data can be!