Doing reproducible data science on Kaggle

Analyzing the first 2016 presidential debate

Megan Risdal
Extra Newsfeed
6 min read · Sep 28, 2016


What can we say about the words exchanged and the one-upping that happened in last night’s presidential throwdown? Thanks to The Washington Post making the full, annotated transcript available online, we can say a lot, and we have the data to back up our claims.

But my analysis won’t actually live in this blog post. Instead, I’ll tell you about the reproducible analytics environment on Kaggle where you can run scripts, fork others’ code, discuss techniques, and generally further our collective insight into the madness that is the 2016 US election. I think that’s a much better way to share data and analysis!

Disclaimer

Before getting to the actual transcript data, I should warn that this post isn’t so much about revealing nuanced details in the recent debate between Hillary Clinton and Donald Trump as it is an excuse for me to talk about this thing I’m very excited about:

Open data and reproducible research.

I’ve recently joined Kaggle’s marketing team. You may know of Kaggle as a competitive machine learning platform, but it’s moving in a powerful new direction: becoming the home for doing data science. Thanks to the new ability to upload datasets to Kaggle and analyze them in its data science platform (Docker containers with R, Python, and Julia installed), rich data like last night’s transcript doesn’t have to live crystallized in an article on The Washington Post.

Some of the “hot” datasets on Kaggle right now.

Ben Hamner, Kaggle’s CTO, recently gave some great examples of who may want to share a dataset for reproducible analysis:

  • scientists & students
  • package authors
  • data vendors
  • companies & non-profits
  • government agencies

My favorite way to use the datasets and analytics environment (called Kernels) is to try out some new code. So, after eagerly cleaning up the presidential debate transcript, I uploaded it as a dataset on Kaggle and set out to calculate tf-idf scores for the candidates’ responses during the debate. This identifies the word types most distinctive of Trump’s and Clinton’s responses, respectively.
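As a back-of-the-napkin sketch of what tf-idf does (the real analysis, in R, comes later in this post), here it is in plain Python: a word scores high for a speaker when that speaker uses it often and the other speaker barely uses it at all. The toy tokens below are made up purely for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping speaker -> list of word tokens.
    Returns dict mapping speaker -> {word: tf-idf score}."""
    n_docs = len(docs)
    # Document frequency: how many speakers' responses contain each word
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for speaker, words in docs.items():
        counts = Counter(words)
        total = len(words)
        # tf-idf = (term frequency) * log(number of docs / document frequency)
        scores[speaker] = {
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
    return scores

# Toy example with made-up tokens
docs = {
    "Clinton": "economy jobs economy plan".split(),
    "Trump": "jobs deal deal china".split(),
}
scores = tf_idf(docs)
# A word both speakers use ("jobs") gets idf = log(2/2) = 0,
# so only words distinctive of one speaker score above zero.
```

This is the standard weighting; the R package used later does the same bookkeeping for you.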

Making data open

I’ll briefly cover the steps I took to get this unique dataset up on Kaggle, ready for community analysis.

Step 1 — Cleaning up the transcript

I grabbed the text from The Washington Post’s article and put it into a nice computer-readable format. Hopefully this makes it as easy as possible for people unfamiliar with the data to work with it. Here’s how I describe the dataset once it’s been cleaned up:
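The cleanup itself was ad hoc, but the core reshaping is turning “SPEAKER: remarks” lines into one row per response. Here’s a hypothetical Python sketch of that idea (not the code I used; the `Speaker` and `Text` column names match the dataset, everything else is illustrative):

```python
import csv
import io
import re

# A few raw lines in the style of the published transcript (illustrative)
raw = """HOLT: Good evening from Hofstra University.
CLINTON: Thank you, Lester.
TRUMP: Thank you.
And this untagged line continues the previous speaker's response."""

rows = []
for line in raw.splitlines():
    match = re.match(r"^([A-Z]+):\s*(.*)$", line)
    if match:
        # A new "SPEAKER: remarks" response starts here
        rows.append({"Speaker": match.group(1).title(),
                     "Text": match.group(2)})
    elif rows:
        # Untagged lines belong to the most recent response
        rows[-1]["Text"] += " " + line.strip()

# Write a CSV with one row per response
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Speaker", "Text"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

One row per response (rather than per transcript line) is what makes the tokenizing and counting later so painless.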

The dataset and field descriptions on Kaggle.

Step 2 — Uploading the data

From the Datasets page, it’s as simple as clicking on “New Dataset” and filling out all of the details. Here are some helpful things to have prepared ahead of time:

  1. A dataset in a format that is easy for others (woman & machine) to work with. For this reason, CSV files are a good idea! You can also upload a SQL database. You can upload multiple files; no need to zip them first. If you need to make changes or have a new version, you’ll have the chance to upload new files.
  2. A description of your dataset. A helpful title, detailed overview, and description of the files you upload including their fields will all make the data more approachable which means more community engagement!
  3. A license. How do you want your data to be used? Are you providing accurate attribution to the dataset’s source or owner? Read about some common open source licenses here.
  4. A banner image. You’ll need an attractive image to upload as the banner to your dataset. From this banner image, you’ll crop an icon as well. Some great free photos are available at Unsplash. The minimum size is 1900x400 to ensure quality on retina displays.

Collecting and cleaning data is really the hard part, so these steps shouldn’t take you too much effort! The payoff is definitely worth it.

Step 3 — Analysis!

We’ve made it to the best part. Once the dataset is uploaded, anyone can download it or analyze it in a Kernel right on Kaggle. In fact, once you upload a dataset, it’s a great idea to create a “starter script” to show others how to work with the dataset. I chose to write an R Markdown notebook calculating tf-idf for the candidates’ responses in the first debate.

You should check out my Kernel, “The winner of the first debate: Definitive proof” on Kaggle for the full effect, but here’s a snippet of what you’ll get:

# Load libraries: dplyr for data manipulation, tidytext for tokenization
# and tf-idf, ggplot2 and gridExtra for plotting
library(dplyr)
library(tidytext)
library(ggplot2)
library(gridExtra)

# Calculate word frequencies
debate_words <- debate %>%
  filter(Speaker %in% c("Clinton", "Trump")) %>%
  unnest_tokens(word, Text) %>%
  count(Speaker, word, sort = TRUE) %>%
  ungroup()

# Calculate tf-idf
debate_words <- debate_words %>%
  bind_tf_idf(word, Speaker, n) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

# Create plots of the top ten tf-idf words for each candidate
# Clinton
clinton <- ggplot(debate_words %>%
                    filter(Speaker == "Clinton") %>%
                    top_n(10, tf_idf),
                  aes(x = word, y = tf_idf)) +
  geom_bar(aes(alpha = tf_idf),
           stat = "identity",
           fill = "#4169E1") +
  coord_flip() +
  #scale_y_continuous(limits = c(0, 0.002)) +
  labs(x = NULL, y = "tf-idf", title = "Clinton") +
  scale_alpha_continuous(range = c(0.6, 1), guide = FALSE)

# Trump
trump <- ggplot(debate_words %>%
                  filter(Speaker == "Trump") %>%
                  top_n(10, tf_idf),
                aes(x = word, y = tf_idf)) +
  geom_bar(aes(alpha = tf_idf),
           stat = "identity",
           fill = "#E91D0E") +
  coord_flip() +
  labs(x = NULL, y = "tf-idf", title = "Trump") +
  scale_alpha_continuous(range = c(0.6, 1), guide = FALSE)

# Arrange the two plots side by side
grid.arrange(clinton, trump, ncol = 2)
Top tf-idf words calculated from Clinton and Trump’s 2016 presidential debate responses.

The end result is a nice figure showing the terms most relevant to each candidate. I used code from Julia Silge’s nice blog post on computing tf-idf using tidy data principles with her R package `tidytext`. Now anyone can fork the code and improve it, or start their own Python or R analysis.

An analysis of Hillary Clinton’s sentiment in emails about various countries. Check out the code in this Kernel on Kaggle.

Once a dataset has received a lot of attention from the community, it in effect becomes a code library for top-notch analyses that can be done with the type of data in question. For example, you can check out a similar dataset of Hillary Clinton’s emails here — so far over 400 kernels have been shared!

If you’d like to see more (and better!) analyses of the first presidential debate, check out the other Kernels shared on the dataset’s page; they all use the same data.

Conclusion

I hope this demonstrates how easy it is to go from spotting interesting data out in the wild to making it publicly available for shared analysis. More importantly, the data is now usable by everyone rather than sitting frozen behind my personal blog.

You can easily imagine how a dataset like the Clinton–Trump debate transcript could be of interest to data journalists and to researchers studying language or political science. That’s not to mention the undoubtedly huge number of people with an interest in data science who benefit greatly from learning by doing, especially with fresh data. Sorry not sorry, iris!

And don’t worry — I’ll do my best to upload the transcripts from the debates as they become available!


Kaggle / Google Product Manager. Former Stack Overflow Product Manager. Passionate about open communities and open knowledge.