Introducing Kaggle Datasets

Ben Hamner
Jan 28, 2016

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility into the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Are you interested in publishing one of your datasets on kaggle.com/datasets? Click “New Dataset” on Kaggle Datasets.

Access

Simple, consistent access to the data

The homepage for the 2013 American Community Survey Dataset

You have a cleanly designed page with a few basic elements: a single download link to get the entire dataset, and a clear description of the data. License information is also explicitly required for each dataset, so you know whether and how you may use it.

Analysis

A way to explore the data without downloading it

Screenshot of a Kaggle Script being edited on the 2013 American Community Survey

Kaggle Scripts is enabled on every dataset published through Kaggle Datasets. This lets you run code directly on the datasets, publish the results, and fork others’ scripts in a reproducible way, without ever needing to download the data.

Our standardized R, Python, and Julia computational environments come preloaded with all the analytics and visualization packages data scientists normally use. You don’t need to worry about broken package installs or software version conflicts — you can jump in and start coding right away.

For those curious about the technical details behind Kaggle Scripts, you have 8GB of RAM and 2 compute cores to work with. Code runs in the kaggle/rstats, kaggle/python, and kaggle/julia Docker containers, which you can also pull from Docker Hub.
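If you want to approximate that environment on your own machine, here is a minimal sketch. It only builds a `docker run` command line; the image names are the ones listed above, while the `/data` mount point, the read-only flag, and the helper name `docker_run_command` are illustrative assumptions and not a description of how Kaggle Scripts is wired up internally.

```python
import shlex

# Image names come from the post; everything else here is an illustrative assumption.
KAGGLE_IMAGES = {
    "r": "kaggle/rstats",
    "python": "kaggle/python",
    "julia": "kaggle/julia",
}


def docker_run_command(language, data_dir, memory="8g", cpus=2):
    """Build a `docker run` argument list that roughly mirrors the Scripts limits."""
    image = KAGGLE_IMAGES[language]
    return [
        "docker", "run", "--rm", "-it",
        f"--memory={memory}",          # cap RAM at roughly 8GB
        f"--cpus={cpus}",              # cap CPU usage at 2 cores
        "-v", f"{data_dir}:/data:ro",  # hypothetical mount point for a downloaded dataset
        image,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand.
    print(shlex.join(docker_run_command("python", "/path/to/dataset")))
```

Running the printed command requires a local Docker install, and the path passed in is a placeholder for wherever you saved the dataset.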

Results

Visibility into previous work that’s been created on the dataset

The Kaggle Scripts Page for the 2013 American Community Survey dataset

Work done in Kaggle Scripts is saved and published publicly by default. This means that, when you’re coming to a new dataset, you don’t have to start from scratch. You can use all the work that other data scientists have already created on it as a starting point.

You can quickly flip through the most popular scripts that have been published to get a better understanding of what’s in the data and what you can do with it. You can even fork any script (which creates an editable copy) and extend it to create your own work. You don’t have to start from a completely blank slate.

The Kaggle Datasets + Kaggle Scripts environment provides a cool way for you to share the insights you discover on the data. Others will have more confidence in your results, as they have the code and data you used to create them. As you use Kaggle more, this has the added benefit of building out your data science portfolio. Every script you publish is automatically saved to your Kaggle profile.

A Kaggle user’s profile, highlighting their scripts

Conversation

Forums and script comments for discussing the nuances of the data

The forum for the 2013 American Community Survey dataset

Every dataset has a story behind it. Real-world data doesn’t come from an artificial clean room or a mathematical equation; it’s messy and noisy.

After running hundreds and hundreds of machine learning competitions, we’ve seen our share of messy datasets. A handful of examples include:

  • flights landing before they took off
  • bulldozers manufactured in the year 1000 CE
  • a photo of a defecating right whale
  • a perfect-scoring essay that just said “This essay got good marks, but as far as I can tell, it’s gibberish”

Tossing data over a wall and expecting people to do great things with it doesn’t work. The context and the story behind the data matter, and the forums let you discover them through discussions among data scientists and with the organizations publishing the data.

Seeding Kaggle Datasets

All of this functionality is meaningless without fun, interesting, and insightful datasets to access through it. We’ve seeded kaggle.com/datasets with a small number of interesting, popular datasets. Some of our favorites include:

Exploratory scripts on these datasets illustrate the benefits of capturing code and results alongside the data: you don’t need to load the data and work with it to have a good understanding of what it contains.

This is our initial foray into the world of public datasets, and it is far from complete. We’ll be actively developing and iterating on this section of the site for the near future. Let us know any feedback you have on it through the forums.

We’ll be expanding the datasets available through Kaggle in the coming weeks, and ultimately enabling any researcher or organization to directly publish data on our platform. Do you have any datasets that you’d love to see available on Kaggle? Let us know by providing a sample through this short form.

This is cross-posted on Medium from Kaggle’s blog. Thanks to Anna Montoya for reviewing drafts of this.

Ben Hamner

Kaggle CTO. Seeking to understand intelligence through data and espressos, and occasionally on a kiteboard