Machine Learning and the VP Debate

Sara Robinson
Oct 5, 2016


Using a similar approach to my Twitter analysis here, I analyzed tweets from last night's VP debate with the Cloud Natural Language API, BigQuery, and Exploratory for visualization. This time, in addition to running syntax analysis on every tweet, I also used the NL API's sentiment analysis feature:

Twitter sentiment during the debate

The NL API returns two values for sentiment: polarity and magnitude. polarity is a number from -1 to 1 indicating how positive or negative the text is. magnitude indicates the overall strength of the statement regardless of whether it is positive or negative, and is a number ranging from 0 to infinity. A good way to gauge sentiment is to multiply the two values so that statements with a stronger sentiment (higher magnitude) are weighted accordingly.
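As a quick illustration of this weighting (the sentiment values below are made up for the example, not returned by the API):

```javascript
// polarity is in [-1, 1]; magnitude is in [0, ∞).
// Multiplying them weights stronger statements more heavily.
function weightedSentiment(polarity, magnitude) {
  return polarity * magnitude;
}

// A mildly positive tweet stated weakly...
console.log(weightedSentiment(0.5, 0.5)); // 0.25
// ...vs. the same polarity expressed much more forcefully
console.log(weightedSentiment(0.5, 3.0)); // 1.5
```

Two tweets with identical polarity can end up with very different scores once magnitude is factored in, which is exactly the effect we want when averaging over a minute of tweets.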

I used the Twitter npm package to stream tweets, filtering on the following search terms:

var searchTerms = '#debates,#debates2016,#debatenight,#vpdebate,Mike Pence,Tim Kaine';

With all the tweets I collected (299k total) as rows in a BigQuery table, I wrote the following query to get the average sentiment for all tweets in a given minute:

Note: when this was initially published I incorrectly cast the polarity and magnitude values to integers. I’ve updated it to cast these strings to floats — it didn’t significantly change my results or conclusions.

SELECT
  LEFT(STRING(SEC_TO_TIMESTAMP(INTEGER(created_at)/1000)), 16) AS minute,
  AVG(FLOAT(polarity) * FLOAT(magnitude)) AS sentiment
FROM [sara-bigquery:syntax.vpdebate]
GROUP BY 1
ORDER BY 1
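The LEFT(…, 16) truncation keeps the timestamp down to the minute. Assuming created_at is stored as a string of epoch milliseconds (which the division by 1000 in the query implies), the same bucketing could be sketched in JavaScript as:

```javascript
// Truncate a tweet's created_at (epoch-millisecond string) to its
// minute bucket, "YYYY-MM-DD HH:MM", matching LEFT(STRING(...), 16).
function minuteBucket(createdAtMs) {
  const d = new Date(parseInt(createdAtMs, 10));
  // toISOString gives "YYYY-MM-DDTHH:MM:SS.sssZ" in UTC
  return d.toISOString().slice(0, 16).replace('T', ' ');
}

console.log(minuteBucket('1475629200000')); // "2016-10-05 01:00"
```

Grouping on this string is what collapses ~299k individual tweets into one averaged sentiment value per minute.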

Then I graphed it with Exploratory (thanks Felipe Hoffa for showing me this awesome new data viz tool!):

Twitter sentiment during the October 4th VP debate

The graph starts at 8pm Eastern (an hour before the debate) and ends at 5am the next day (October 5). We can see that sentiment fluctuated slightly during the debate (from 9 to 10:30pm), but was largely negative in the hours that followed, specifically from 1 to 3am. This is around the same time news outlets started publishing articles analyzing the outcome of the debate.

While I haven’t yet come across a timed transcript of the debate, bonus points for anyone who wants to feed an audio file of the debate to the Speech API, run the transcription through the NL API for sentiment analysis, and compare it to the graph above.

We can also get the average sentiment for all tweets collected during the debates (including a few hours before and after):

SELECT
  ROUND(AVG(FLOAT(polarity) * FLOAT(magnitude)), 2) AS overall_sentiment,
  COUNT(*) AS num_tweets
FROM [sara-bigquery:syntax.vpdebate]

And here’s the result:

Overall sentiment for tweets during the VP debate

We can compare this to the sentiment for tweets mentioning tax by adding the following to our query:

WHERE LOWER(text) CONTAINS 'tax'

Sentiment for tweets containing ‘tax’

Syntactic analysis

Using the NL API’s text annotation method, we can break down a tweet by parts of speech and use BigQuery to find linguistic trends. For each sentence, the NL API will tell us which word is the subject (labeled as NSUBJ). Since I’ve got the JSON response from the NL API saved in BigQuery, I can write a user-defined function to find the top subjects in tweets about the VP Debate:

SELECT
  COUNT(*) AS subject_count, subject
FROM JS(
  (SELECT tokens FROM [sara-bigquery:syntax.vpdebate]),
  tokens,
  "[{ name: 'subject', type: 'string' }]",
  "function(row, emit) {
    try {
      var x = JSON.parse(row.tokens);
      x.forEach(function(token) {
        if (token.dependencyEdge.label === 'NSUBJ') {
          emit({ subject: token.lemma.toLowerCase() });
        }
      });
    } catch (e) {}
  }"
)
GROUP BY subject
ORDER BY subject_count DESC
LIMIT 100
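The extraction logic inside the UDF is easy to exercise locally with a hand-written tokens payload (the field names below follow the NL API's token response; the sample row is invented for illustration):

```javascript
// Same logic as the UDF body: parse the stored NL API tokens JSON and
// collect the lowercased lemma of every word labeled NSUBJ (sentence subject).
function extractSubjects(tokensJson) {
  const subjects = [];
  try {
    JSON.parse(tokensJson).forEach(function (token) {
      if (token.dependencyEdge.label === 'NSUBJ') {
        subjects.push(token.lemma.toLowerCase());
      }
    });
  } catch (e) {} // skip rows with malformed JSON, as the UDF does
  return subjects;
}

// A made-up row resembling the stored NL API response
const sample = JSON.stringify([
  { lemma: 'Pence', dependencyEdge: { label: 'NSUBJ' } },
  { lemma: 'win',   dependencyEdge: { label: 'ROOT' } }
]);

console.log(extractSubjects(sample)); // [ 'pence' ]
```

Wrapping the parse in try/catch means a single malformed row is skipped rather than failing the whole query, which matters at ~299k rows.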

And graph the results:

Interestingly, there were many more tweets about Pence than Kaine (48k vs 34k).

Top debate emojis

Last but not least, how did people express their feelings about the debate in emojis? Here are the results in an emoji tag cloud:

Top emojis used in tweets about the VP debate on Oct 4th

I’m not sure how the taco emoji snuck in there, but I’m guessing it has something to do with October 4th being National Taco Day.
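The post doesn't show the emoji-extraction step; one minimal way to pull emojis out of tweet text (this sketch uses a Unicode property escape, supported in Node 10+, rather than whatever method was used for the tag cloud):

```javascript
// Match characters with default emoji presentation; the /u flag makes
// the regex operate on full code points, so surrogate pairs stay intact.
function extractEmojis(text) {
  return text.match(/\p{Emoji_Presentation}/gu) || [];
}

console.log(extractEmojis('Taco day 🌮 during the #vpdebate 😂😂'));
// [ '🌮', '😂', '😂' ]
```

Counting the extracted emojis per tweet in BigQuery, the same GROUP BY / ORDER BY pattern used for subjects above would produce the frequency data behind a tag cloud.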

What’s next

Have questions or more ideas for natural language processing? Find me on Twitter @SRobTweets or let me know what you think in the comments. And here are the tools I used:

- Cloud Natural Language API
- BigQuery
- Exploratory
- The Twitter npm package
