The 2nd and 3rd presidential debates through tweets

Sara Robinson
Published in Extra Newsfeed
6 min read · Oct 25, 2016

Continuing with my Twitter analysis of the debates, I streamed and analyzed tweets from the last two presidential debates and put together some visualizations comparing the results. I used BigQuery to analyze tweet data from the Cloud Natural Language API (sentiment, syntax) and the Twitter Streaming API (tweet text, hashtags, user location).

Tweet subjects by sentiment

I used the Natural Language API’s sentiment analysis feature to calculate the sentiment of each tweet. It returns two values: polarity, a number in the [-1, 1] range indicating how positive or negative the text is, and magnitude, a number from 0 to infinity (normalized to the length of the text) indicating the overall strength of the text’s sentiment regardless of polarity.
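
For illustration, a sentiment response from the API (v1beta1 at the time) looks roughly like this, with made-up values:

{
  "documentSentiment": {
    "polarity": 0.8,
    "magnitude": 1.6
  },
  "language": "en"
}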

Since no one likes to read too far without a visual, let’s start with a graph. To create it, I ran a query to see which subjects were tweeted about most frequently, then cross-referenced the results with the percentage of positive tweets for each subject:

% of positive tweets by subject

There’s a lot of data here, so let’s break it down. I pulled the subject of each debate tweet using the Natural Language API’s NSUBJ (nominal subject) dependency label. For the subjects with the most mentions, I counted the number of positive tweets for each one (trying to be more of a glass-half-full person here), then divided the number of positive tweets by the total number of tweets to get a percentage for each subject. The blue bars measure tweets from the second presidential debate on October 9th and the orange bars are from the third debate on October 19th. Some notes:

  • Since the Natural Language API isn’t trained on short text like tweets, I counted a tweet as positive only if it had a polarity value of 1 (the strongest positive score)
  • I filtered out pronouns from the results (i, you, he, she, they, we, etc.)

What can we conclude from this?

  • For topics that trended at both debates, positivity increased slightly from debate 2 to 3, with the exception of ‘clinton’, ‘obama’, and ‘trump’. While sentiment around these names declined, the percentage of positive tweets for Clinton’s and Trump’s official Twitter handles both increased.
  • Some topics trended at only one debate. For example, during the second debate there were many tweets with the subjects ‘bill’, ‘bush’, ‘giuliani’, ‘god’, ‘gop’, ‘tape’, and ‘pence’. Most of these relate to the tape of Trump’s comments about women released just before that debate.
  • Other topics trended only at the final debate, like ‘moore’ (after he declared Trump won the third debate), ‘wallace’ (Chris Wallace, the debate moderator), and ‘wikileaks’ (the emails released from Hillary’s campaign).
  • Tweets about the upcoming election are very negative, which is sad. But it does go along with the theory that negative tweets tend to be more popular.

How did I get from tweets to subject-based sentiment analysis?

I used the Twitter Streaming API to stream all tweets related to the election for ~24 hours around both the second and third presidential debates. Then I ran the text from each tweet through the Natural Language API’s sentiment and syntax analysis and streamed the results into a BigQuery table with the following schema:
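
Only the tokens and polarity fields appear in the queries below; the other fields in this sketch are illustrative:

id: STRING           // tweet id
text: STRING         // raw tweet text
created_at: TIMESTAMP
polarity: FLOAT      // sentiment polarity from the Natural Language API
magnitude: FLOAT     // sentiment magnitude from the Natural Language API
tokens: STRING       // the syntax analysis token list, stored as a JSON string
coordinates: STRING  // the tweet’s location data from Twitter, stored as a JSON string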

With the data in BigQuery, I wrote a user-defined function to get the subject and polarity value of each tweet. Then I counted the number of positive tweets (with polarity = 1) for each subject, and divided this by the total number of tweets for that subject to get the percentage. I ran this query over the tables for both debates:

SELECT
  subject,
  ROUND((pos_count / subject_count) * 100, 4) AS percent_pos,
  subject_count
FROM (
  SELECT
    subject,
    COUNT(*) AS subject_count,
    -- Count only the strongest positive tweets (polarity = 1)
    COUNT(CASE WHEN polarity = 1 THEN 1 END) AS pos_count
  FROM JS(
    (SELECT tokens, polarity FROM [sara-bigquery:syntax.debate_1019]),
    tokens,
    polarity,
    "[{ name: 'subject', type: 'string' }, { name: 'polarity', type: 'float' }]",
    "function(row, emit) {
      try {
        var x = JSON.parse(row.tokens);
        x.forEach(function(token) {
          // Emit one row per nominal subject (NSUBJ) in the tweet
          if (token.dependencyEdge.label === 'NSUBJ') {
            emit({ subject: token.text.content.toLowerCase(),
                   polarity: parseFloat(row.polarity) });
          }
        });
      } catch (e) {}
    }"
  )
  GROUP BY subject
  ORDER BY subject_count DESC
  LIMIT 100
)
ORDER BY subject_count DESC
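
Note that the pronoun filtering mentioned earlier doesn’t appear in this version of the query; one way to do it inside the UDF is to also check each token’s part-of-speech tag, along these lines:

// Skip nominal subjects that are pronouns (i, you, he, she, etc.)
if (token.dependencyEdge.label === 'NSUBJ' && token.partOfSpeech.tag !== 'PRON') {
  emit({ subject: token.text.content.toLowerCase(), polarity: parseFloat(row.polarity) });
}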

To see if the changes in positive and negative subjects were correlated, I ran the same analysis as above, this time looking at the percentage of tweets with a polarity value of -1.
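
Only the polarity filter and its alias change from the positive query above:

SELECT
  subject,
  ROUND((neg_count / subject_count) * 100, 4) AS percent_neg,
  subject_count
FROM (
  SELECT
    subject,
    COUNT(*) AS subject_count,
    -- Count only the strongest negative tweets (polarity = -1)
    COUNT(CASE WHEN polarity = -1 THEN 1 END) AS neg_count
  FROM JS(
    (SELECT tokens, polarity FROM [sara-bigquery:syntax.debate_1019]),
    tokens,
    polarity,
    "[{ name: 'subject', type: 'string' }, { name: 'polarity', type: 'float' }]",
    "function(row, emit) {
      try {
        var x = JSON.parse(row.tokens);
        x.forEach(function(token) {
          if (token.dependencyEdge.label === 'NSUBJ') {
            emit({ subject: token.text.content.toLowerCase(),
                   polarity: parseFloat(row.polarity) });
          }
        });
      } catch (e) {}
    }"
  )
  GROUP BY subject
  ORDER BY subject_count DESC
  LIMIT 100
)
ORDER BY subject_count DESC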

Some of the subjects trending in both debates are correlated. For example, we see an increase in positive tweets with ‘america’ and ‘hillary’ from debate 2 to 3 and a decrease in negative tweets for the same subjects. Conversely, for ‘trump’ we see a decrease in positive tweets and an increase in negative tweets. For other subjects like ‘hillaryclinton’ and ‘realdonaldtrump’, both the positive and negative percentages increase, which suggests sentiment becoming more polarized.

Comparing adjectives from the debates

Which adjectives were used most to describe each debate on Twitter? In the grouped bar graph below, I compared adjectives from both debates.

During this campaign in particular, we’ve seen topics quickly go viral (like Ken Bone, Trump’s comments from the 2005 tape, Hillary’s campaign emails, etc.). This campaign has also been fraught with sexism, as indicated by the trending adjectives ‘sexual’ and ‘nasty’. Here’s the query behind this graph:

SELECT
  adjective,
  COUNT(*) AS adj_count
FROM JS(
  (SELECT tokens FROM [sara-bigquery:syntax.debate_1019]),
  tokens,
  "[{ name: 'adjective', type: 'string' }]",
  "function(row, emit) {
    try {
      var x = JSON.parse(row.tokens);
      x.forEach(function(token) {
        // Emit the lemma of every adjective (ADJ) in the tweet
        if (token.partOfSpeech.tag === 'ADJ') {
          emit({ adjective: token.lemma.toLowerCase() });
        }
      });
    } catch (e) {}
  }"
)
GROUP BY adjective
ORDER BY adj_count DESC
LIMIT 100

Mapping campaign hashtags

What about tweet location data? Around 5% of the tweets I collected across both debates had location data returned from the Twitter API (92k tweets with location data out of 1.7M total), and I wanted to see if there were any location trends for particular hashtags. I created two heatmaps, one for each campaign’s top hashtags.

Trump’s hashtags

The Trump hashtag map below includes #MAGA, #NeverHillary, and #TrumpTrain.

Find the full interactive map here.

Hillary’s hashtags

Hillary’s hashtag map includes #ImWithHer, #NeverTrump, and #StrongerTogether.

Full map is here.

Although the maps are similar, if you zoom in on the full maps you’ll notice that the pro-Hillary tweets are more concentrated in big cities and outside the US, whereas we see more pro-Trump tweets in the South and Midwest.

To create the maps, I stored the JSON location data from the Twitter API as a string in BigQuery and wrote a user-defined function to extract the latitude and longitude for each hashtag with location data. Then I exported my query output as a CSV and uploaded the result to Carto.
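
Here’s a sketch of that query. The table and column names (a hypothetical tweets table with text and coordinates columns) are assumptions:

SELECT hashtag, lat, lng
FROM JS(
  (SELECT text, coordinates FROM [sara-bigquery:tweets.debate_1019]
   WHERE coordinates IS NOT NULL),
  text,
  coordinates,
  "[{ name: 'hashtag', type: 'string' }, { name: 'lat', type: 'float' }, { name: 'lng', type: 'float' }]",
  "function(row, emit) {
    try {
      // Twitter's GeoJSON stores points as [longitude, latitude]
      var point = JSON.parse(row.coordinates).coordinates;
      var hashtags = row.text.match(/#\\w+/g) || [];
      hashtags.forEach(function(tag) {
        emit({ hashtag: tag.toLowerCase(), lat: point[1], lng: point[0] });
      });
    } catch (e) {}
  }"
)
WHERE hashtag IN ('#maga', '#neverhillary', '#trumptrain')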

Most tweeted emojis

Last but not least, it’s hard to analyze sentiment on Twitter without looking at emojis. I ran a query to get the most-used emojis; here are the top 10:

Top emojis, presidential debates 2 and 3
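
The counts came from another inline UDF over the raw tweet text. Here’s a sketch, again assuming a hypothetical tweets table with a text column; the regex only covers the main surrogate-pair emoji range:

SELECT emoji, COUNT(*) AS emoji_count
FROM JS(
  (SELECT text FROM [sara-bigquery:tweets.debate_1019]),
  text,
  "[{ name: 'emoji', type: 'string' }]",
  "function(row, emit) {
    try {
      // Most emoji are encoded as surrogate pairs in this range
      var matches = row.text.match(/[\\uD83C-\\uD83E][\\uDC00-\\uDFFF]/g) || [];
      matches.forEach(function(m) {
        emit({ emoji: m });
      });
    } catch (e) {}
  }"
)
GROUP BY emoji
ORDER BY emoji_count DESC
LIMIT 10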

And here they are represented in an emoji tag cloud:

What’s next?

Start analyzing your own streams of text data with the Cloud Natural Language API and BigQuery. Here are all the tools I used: the Twitter Streaming API, the Cloud Natural Language API, BigQuery, and Carto.

Have questions or suggestions for future posts? Let me know what you think in the comments or find me on Twitter @SRobTweets.
