World Series Game 7, as seen on Twitter

Published in

baseballongcp

3 min readNov 3, 2016

To complement my colleague Eric’s World Series analysis, I’ve been streaming all tweets with the following terms for the first hour of game 7:

var searchTerms = '#worldseries,@cubs,@indians,#cubs,#indians,world series,#flythew,#rallytogether';

For each tweet with one of those search terms, I send the text of the tweet to the Cloud Natural Language API for sentiment and syntax analysis, and insert the result into a BigQuery table. During the first hour of the game, I’ve collected 35,846 tweets.

Disclaimer: while I am from Chicago I know almost nothing about sports, so proceed with caution.

Subject-based sentiment analysis

With the NL API, I can find the average sentiment of all tweets by multiplying the polarity (number between -1 and 1 indicating how positive or negative the text is) by the magnitude (number from 0 to infinity indicating the strength of the sentiment in the text, regardless of polarity). I can further break down sentiment by subject to get the following graph showing subjects with the most tweets:

In terms of players it looks like Crisp and Santana have had the most positive tweets so far. And just when I thought I wasn’t analyzing the elections anymore, it seems to have snuck in there.

Here’s the query I used to build the above graph:

SELECT 
  COUNT(*) as subject_count, subject, avg(sentiment) as sentiment
FROM 
  JS(
    (SELECT tokens, polarity,magnitude FROM [sara-bigquery:world_series.game_6_20161101] ),
    tokens,
    polarity,
    magnitude,
    "[{ name: 'subject', type: 'string'}, { name: 'sentiment', type: 'float'}]",
    "function(row, emit) { 
      try {
        x = JSON.parse(row.tokens);
        x.forEach(function(token) {
          
          if (token.dependencyEdge.label === 'NSUBJ') {
            emit({ subject: token.lemma.toLowerCase(), sentiment: parseFloat(row.polarity) * parseFloat(row.magnitude) });
          }
        });
      } catch (e) {}
    }" 
  )
GROUP BY subject
ORDER BY subject_count DESC
LIMIT 100

We can also get the sentiment of all tweets over the past hour with this query:

SELECT 
  LEFT(STRING( SEC_TO_TIMESTAMP(INTEGER(created_at )/1000)),16) as minute, 
  AVG(float(polarity) * float(magnitude)) as sentiment,
  COUNT(*) as c
FROM [sara-bigquery:world_series.game_6_20161101] 
WHERE timestamp_to_msec(SEC_TO_TIMESTAMP(INTEGER(created_at))) > DATE_ADD(current_timestamp() , -1, 'HOUR')
GROUP BY 1
ORDER BY 1

And graph it over time:

Hashtag-based sentiment analysis

Since there are specific hashtags for each team — #FlyTheW for the Cubs and #RallyTogether for the Indians, I can run a simple query to get the average sentiment for each team’s hashtag:

SELECT AVG(FLOAT(polarity) * FLOAT(magnitude)), COUNT(*) as num_tweets
FROM [sara-bigquery:world_series.game_6_20161101] 
WHERE LOWER(text) CONTAINS '#flythew'

And doing the same for #rallytogether I get:

We can see that Cubs fans are slightly more positive than Indians fans. We can also graph the sentiment of each hashtag, using a similar query to the one with the user-defined function above:

Fan emojis

And because we all love emojis (or at least I do), let’s get the top 5 emojis for Cubs fans:

And the top 5 for Indians fans:

Cubs fans appear to be much more emotive than Indians fans, and we can see that they’ve adopted the bear emoji to represent their team.

What’s next

Want to learn more about analyzing baseball with Google Cloud Platform? Check out the rest of the Baseball on GCP posts, or find Eric and I on Twitter at @notthateric and @SRobTweets.

World Series Game 7, as seen on Twitter

Subject-based sentiment analysis

Hashtag-based sentiment analysis

Fan emojis

What’s next

Written by Sara Robinson