Chatter Charts Product Methodology

ChatterCharts · Published in The Startup · 8 min read · Sep 2, 2020

I. Introducing Chatter Charts

Chatter Charts is a sports visualization that mixes statistics with social media data to create a storyboard retelling of the game through the collective fanbase’s perspective.

Chatter Charts splits a sports game’s social media comments into two-minute intervals and treats these comments as if they are part of a book telling a linear story where each interval is a chapter.

With this approach, I can apply a statistical method called TF-IDF to rank every word in each interval and keep the best-performing one.

II. Understanding TF-IDF

For stats nerds, TF-IDF calculates the relative word count in an interval (term frequency) and weights it by how rarely that word appears throughout the entire game (inverse document frequency).

In short, TF-IDF does two things well: it punishes generic words such as “the”, and it lets modifying words outperform their subjects, so “hooking” or “dive” can outrank “penalty” when multiple penalties happen in a game.
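The idea can be sketched in a few lines. This is a minimal stdlib-Python illustration, not the author's R pipeline: each two-minute interval is treated as a document, and the top word per interval is whatever maximizes tf × idf. The sample comments are made up.

```python
import math
from collections import Counter

def tf_idf_top_word(intervals):
    """Rank words per interval (document) by TF-IDF; return each interval's top word."""
    n_docs = len(intervals)
    # document frequency: in how many intervals each word appears at least once
    df = Counter()
    for words in intervals:
        df.update(set(words))
    tops = []
    for words in intervals:
        counts = Counter(words)
        total = len(words)
        scored = {
            w: (c / total) * math.log(n_docs / df[w])  # tf * idf
            for w, c in counts.items()
        }
        tops.append(max(scored, key=scored.get))
    return tops

# Two made-up intervals: "the" and "penalty" appear in both, so idf = log(1) = 0
intervals = [
    ["the", "penalty", "hooking", "hooking", "ouch"],
    ["the", "penalty", "dive", "dive", "what"],
]
print(tf_idf_top_word(intervals))  # → ['hooking', 'dive']
```

Note how the generic words score exactly zero, while the interval-specific reactions win, which is the "hooking can outrank penalty" effect described above.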

The two data frames below demonstrate the difference between raw counts and the TF-IDF approach. The left data frame ranks words using raw counts, while the right ranks them using TF-IDF. My charts use the top result in every interval.

On the left, ranking by raw count. On the right, ranking by tf-idf.

At a glance, this showcases the strength of TF-IDF, particularly how effective IDF weighting is at punishing common words and letting more reactive vocabulary rise to the top.

After applying this technique to each interval, here’s a sample of what a final data frame for Chatter Charts looks like:

  • interval is the rounded two-minute interval
  • interval_volume is the number of full text comments inside that interval
  • n is the word’s raw count inside that interval
  • tf is the percentage of that word relative to the total number of words in that interval — ex. 2.4% of words were “scorianov”
  • idf is a weighting based on how often the word occurs throughout the entire corpus — 3.5x is quite high because “scorianov” rarely occurs throughout the entire game
  • tf_idf is tf multiplied by idf
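Using the rounded “scorianov” values quoted in the bullets above, the last column multiplies out like this:

```python
tf = 0.024   # 2.4% of the interval's words were "scorianov"
idf = 3.5    # high weight: the word is rare across the whole game
tf_idf = tf * idf
print(round(tf_idf, 3))  # → 0.084
```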

You can read more about TF-IDF from the Wikipedia page.

III. Getting Quality Hockey Comments

The challenge for most data science problems is getting quality data. For Chatter Charts, I combine two sources where fans gather to talk about sports during the game.

Reddit Game Threads

r/Canucks game thread

A game thread is a dedicated forum where subreddit members can talk about a specific game. Every sports team has a subreddit. Some are larger than others, but the members are obviously hardcore fans.

There is no shortage of live reactions and hockey camaraderie. But, if you know Reddit, it can be quite crass — like r/Canucks’ “WIN DA TURD” second-intermission chant.

For Reddit, I scrape all the comments off game threads using Python’s {PRAW} package. All you need is the thread’s URL. If you want the code, let me know in the comments!

Note: You’ll need to create a Reddit Web app to use {PRAW}

Twitter Accounts, Hashtags, and Keywords

A tweet that uses #GoStars

Twitter has a much larger professional presence than Reddit. Pundits, fan blogs, and beat reporters share their insights on the game here. So to find quality team-related tweets, I leverage a few techniques.

First, I store a list of team-specific keywords, accounts, and hashtags to search every game. For example, these are the terms I use to find Toronto Maple Leafs tweets.

@MapleLeafs
#TMLTalk
#LeafsForever
#GoLeafsGo
#MapleLeafs
#LeafsNation
#leafs
leafs
TML

This covers tweets that mention the team’s main account, use popular hashtags, or contain a team-specific keyword like “leafs”.

Further, I search a game-specific hashtag like #TORvsDAL. I can’t hard-code these, so I write a dynamic line of code to create it every game.
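The post doesn't show the exact rule behind that dynamic line of code, but a hashtag like #TORvsDAL can be assembled from the two teams' abbreviations. A hypothetical sketch (which team goes first, and the tri-code abbreviations themselves, are my assumptions):

```python
def game_hashtag(team_a: str, team_b: str) -> str:
    """Build a game-day hashtag like #TORvsDAL from two team abbreviations."""
    return f"#{team_a.upper()}vs{team_b.upper()}"

print(game_hashtag("tor", "dal"))  # → #TORvsDAL
```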

Lastly, I have a VIP list of Twitter accounts for each team. It comprises accounts that tweet a lot about their team but might not use keywords or hashtags all the time.

To build the VIP list, I search team hashtags and dig into suggested accounts. I collect some active tweeters in the community and simply ask them to nominate the other fan accounts they like to follow.

Building my VIP list by asking the communities for nominations

I take each VIP user’s 50 most recent tweets and add them to the Reddit and Twitter data — removing duplicates, of course.
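The merge-and-dedupe step might look like the following sketch. Using the (author, text) pair as the duplicate key is my assumption; the post only says duplicates are removed.

```python
def merge_comments(*sources):
    """Combine comment streams, dropping exact duplicate (author, text) pairs."""
    seen = set()
    merged = []
    for source in sources:
        for comment in source:
            key = (comment["author"], comment["text"])
            if key not in seen:
                seen.add(key)
                merged.append(comment)
    return merged

reddit = [{"author": "a", "text": "WIN DA TURD"}]
twitter = [{"author": "a", "text": "WIN DA TURD"},  # duplicate, dropped
           {"author": "b", "text": "goal!"}]
print(len(merge_comments(reddit, twitter)))  # → 2
```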

All of this is possible using {rtweet}’s search_tweets() and get_timeline() functions.

Note: You’ll need a Twitter developer account to get access to API calls and make requests using {rtweet}.

Together, these sources net me enough quality comments to produce reliable results.

IV. Creating an Efficient Workflow

I’ve built out a workflow where I only need to provide a few details about a game and everything else will populate. It is written 97% in R and 3% in Python — Python only fetches the Reddit comments for me.

This is my command center:

Canucks POV example: I just need to fill this out and scripts will do the rest.

In the first chunk, team-specific data is pulled using the main_team and opponent variables. My script looks up colours, logos, and social media info, and tracks down a list of fans who tweet about the team, all from a metadata.csv I’ve created, seen below.

metadata.csv

In the second chunk of my workflow, I paste event-based markers. I have to open Twitter and manually copy the links of tweets marking the game start/end, goals, and intermissions. What you see pasted is the string of numbers at the end of a tweet’s URL.

https://twitter.com/<account>/status/1300474445925167104

My script looks up those numbers, also known as the status_id, then grabs their timestamps and plots them in the correct positions with the correct colours and markers.
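Pulling the status_id out of a pasted link is just string work. A stdlib-Python sketch (the author's pipeline does this in R; this is only an illustration of the parsing step):

```python
import re

def status_id(tweet_url: str) -> str:
    """Extract the status_id, the trailing run of digits, from a tweet URL."""
    match = re.search(r"/status/(\d+)", tweet_url)
    if match is None:
        raise ValueError(f"no status id found in {tweet_url!r}")
    return match.group(1)

print(status_id("https://twitter.com/<account>/status/1300474445925167104"))
# → 1300474445925167104
```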

a plot with goal markers, intermissions, and game start/end

I went with this workflow because it is flexible enough to build Chatter Charts for any sport. For instance, football, soccer, and baseball all have large events that define a game — touchdowns, goals, and RBIs respectively.

V. Tokenizing and Performing TF-IDF

So now I have my raw comments pulled from Reddit and Twitter, as well as the event markers. Let’s walk through how the data is processed.

First, group the comments into two-minute intervals. I do this with the round_date function from {lubridate}. Super easy to use.

### ROUND DATES
rounded_interval_df <- raw_df %>%
  mutate(interval = round_date(created_at, unit = "2 mins"))
rounded interval output, see how `created_at` rounds into `interval`

Why two minutes, you ask?
Hockey is fast. Things happen quickly. Anything longer can drown out events.

Why not one minute?
There’s usually not enough volume to satisfy TF-IDF, especially with smaller fan bases. However, I can use one-minute intervals for ad-hoc charts — like a third-period collapse.
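For readers outside R, rounding to the nearest two-minute boundary works like this stdlib-Python sketch (halfway cases follow Python's rounding, which may differ from lubridate's):

```python
from datetime import datetime, timedelta

def round_to_interval(ts: datetime, minutes: int = 2) -> datetime:
    """Round a timestamp to the nearest `minutes` boundary, like round_date."""
    step = timedelta(minutes=minutes)
    base = ts.replace(minute=0, second=0, microsecond=0)  # top of the hour
    offset = ts - base
    return base + round(offset / step) * step

# 19:03:12 is closer to 19:04 than to 19:02
print(round_to_interval(datetime(2020, 9, 2, 19, 3, 12)))  # → 2020-09-02 19:04:00
```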

Next, calculate the comment volume for each interval.

This allows me to plot the line in the chart and acts as the y-axis guide for the animated words to follow.

### CALCULATE VOLUME
interval_volume_df <- rounded_interval_df %>%
  count(interval, name = "interval_volume")
interval volume output

Next, tokenize. This means I take data that is currently structured as one comment per row and break it up so each word in a comment has its own row.

{tidytext} does this with unnest_tokens().

### TOKENIZE
unnested_df <- rounded_interval_df %>%
  unnest_tokens(word, text, token = "tweets")

Also, remove stop-words. These are words like “I” and “the” — the stop_words variable is made available when you load {tidytext}. TF-IDF does discount these, but I find it’s friendlier to remove them outright.

### TOKENIZED AND PROCESSED
processed_df <- unnested_df %>%
  anti_join(stop_words, by = "word")
A tokenized data frame: see how each non-stop-word is pulled from the sentence

Next, count the words in each interval as the last data preparation before TF-IDF.

### COUNT TOKENS
counted_token_df <- processed_df %>%
  count(word, interval)
word count of a single interval, excuse the cussing!

Lastly, apply TF-IDF. I use the bind_tf_idf() function from {tidytext}. You can also try log-odds from {tidylo} for some variation!

### TF-IDF
important_word_df <- counted_token_df %>%
  bind_tf_idf(word, interval, n) %>% # one line!
  filter(n >= 3) %>% # number of occurrences to be considered
  filter(idf < 4) %>% # limit VERY random words (typically noise)
  arrange(interval, desc(tf_idf)) %>%
  distinct(interval, .keep_all = TRUE) # take the top term

### COMBINE WITH VOLUME
full_data <- interval_volume_df %>%
  full_join(important_word_df, by = "interval") %>%
  filter(interval >= min_hour,
         interval <= max_hour) %>%
  arrange(interval) %>%
  fill(word, .direction = "down")

Note: full_join and fill make sure that any interval that does not meet the minimum number of occurrences for TF-IDF is forward-filled with the previous interval’s word.
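The forward-fill behaviour of tidyr's fill(word, .direction = "down") boils down to carrying the last known word across gaps. A minimal Python sketch of that step:

```python
def forward_fill(words):
    """Replace None gaps with the most recent non-None word, like tidyr's fill."""
    last = None
    filled = []
    for w in words:
        if w is not None:
            last = w
        filled.append(last)
    return filled

# Intervals 2 and 3 had too few occurrences, so they inherit "goal"
print(forward_fill(["goal", None, None, "save"]))  # → ['goal', 'goal', 'goal', 'save']
```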

And that is how I get the output we looked at above.

VI. Plotting & Animating

The Chatter Chart actually looks like this before animation.

base plot before animation

At this stage, I animate the plot using {gganimate}.

animated_plot <- base_plot +
  transition_reveal(interval) # animate over interval

animate(plot = animated_plot,
        fps = 25, duration = 38,
        height = 608, width = 1080,
        units = 'px', type = "cairo", res = 144,
        renderer = av_renderer("file-name.mp4"))

By using transition_reveal(interval), I can build dynamic features into the chart. For instance, my scoreboard updates and grows when someone scores. It’s quite similar to creating markers in Adobe After Effects.

When the score changes, the board’s size increases for two minutes, then returns to its base size of 12.
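That grow-then-return rule is just a per-interval size lookup. A hypothetical Python sketch, where the base size 12 comes from the text but the boosted size 18 is my own made-up value:

```python
def board_size(interval, goal_intervals, base=12, boost=18):
    """Enlarge the scoreboard only for intervals in which a goal lands."""
    return boost if interval in goal_intervals else base

# A goal in interval 3 bumps the board for that one two-minute interval
sizes = [board_size(i, goal_intervals={3}) for i in range(5)]
print(sizes)  # → [12, 12, 12, 18, 12]
```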

The rest is intermediate styling in ggplot. Some of the packages I leverage include {ggtext} for adding HTML styling to the title, {shadowtext} for the white background behind the words, and {extrafont} for importing custom fonts.

Thanks for Reading

I hope that gives you a better grasp on what’s going on behind the scenes.

Of course, I invite you to follow along with me on Twitter or join r/ChatterCharts. My DMs are open for feedback.

Finally, I’m looking for sponsors, affiliates, and hitting up your podcast. Email: chattercharts@gmail.com.

Cheers!
