How are Kagglers using 60 minutes of free compute in Kernels?

More powerful visualizations, more data sources, and more multiprocessing

Megan Risdal
5 min read · Oct 9, 2017

Kaggle Kernels now gives people from across the world access to machines about as powerful as the one I’m typing from to work on data science projects right in their browsers. Check out the increased resources our team recently made available to the Kaggle community for free: four CPU cores (up from two), doubled RAM, and a tripled execution limit of 60 minutes.

Since it’s been about a month since our engineers supercharged Kaggle Kernels, I thought I’d explore how our users are spending their new CPU cycles and tripled execution time.

To sum up what I saw from browsing hundreds of popular, compute-intensive Python and R kernels:

  • More powerful visualizations
  • More data
  • More multiprocessing

In the rest of this post, I showcase these exciting examples of community code — from topic modeling of LEGO color sets to parallelized XGBoost code — that until recently couldn’t be shared publicly on Kaggle due to resource limits.

Click on any of the following visualizations created by users to check out their code or create a new kernel of your own.

Create more powerful visualizations

Many Kagglers write kernels to create data science portfolios, and one of the best ways to get your work noticed is through beautiful data visualizations. With expanded resources, users are graduating from basic bar charts to more computationally intensive visualization techniques.

Finding LEGO Color Themes with Topic Models

In an Rmarkdown script kernel, Nathan Aff uses the LEGO database on Kaggle to model LEGO set color themes with Latent Dirichlet Allocation (LDA). Nathan bases his code on Julia Silge and David Robinson’s new Text Mining with R book to implement this computationally intensive unsupervised machine learning algorithm in a rather colorful analysis.
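Nathan’s kernel is written in R, but the core idea translates directly: treat each LEGO set as a “document” and its part colors as “words”, then fit LDA to uncover recurring color themes. Here’s a rough Python sketch of that pattern — the file and column names are placeholders, not the actual dataset schema:

```python
# Hedged sketch: each LEGO set is a "document", its part colors are "words".
# File and column names below are assumptions for illustration.
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

parts = pd.read_csv("lego_parts.csv")  # assumed columns: set_id, color_name

# One "document" per set: a space-separated bag of its part colors
docs = (parts.groupby("set_id")["color_name"]
             .apply(lambda colors: " ".join(c.replace(" ", "_") for c in colors)))

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=6, random_state=0)
lda.fit(X)

# Show the most probable colors for each discovered "color theme"
colors = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [colors[i] for i in topic.argsort()[::-1][:5]]
    print(f"Theme {k}: {', '.join(top)}")
```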

TED Talk Data Analysis: Finding Related Videos Using Network Analysis

In this Python notebook kernel (based on Jupyter), Kaggler Rounak Banik examines how TED Talks are related (and a lot more) using network analysis. He says:

We will have a look at how every TED Talk is related to every other TED Talk by constructing a graph with all the talks as nodes and edges defined between two talks if one talk is on the list of recommended watches of the other. Considering the fact that TED Talks are extremely diverse, it would be interesting to see how dense or sparse our graph will be.
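For readers who want to try that construction themselves, here’s a hedged Python sketch using networkx — the file and column names are placeholders rather than the actual dataset’s schema:

```python
# Sketch: build a graph of TED Talks where an edge connects two talks if one
# appears in the other's list of recommended/related talks.
# "ted_talks.csv", "title", and "related_titles" are assumed names.
import pandas as pd
import networkx as nx

talks = pd.read_csv("ted_talks.csv")  # assumed: one row per talk

G = nx.Graph()
G.add_nodes_from(talks["title"])

for _, row in talks.iterrows():
    # assume related_titles holds a semicolon-separated list of recommended talks
    for related in str(row["related_titles"]).split(";"):
        related = related.strip()
        if related in G:
            G.add_edge(row["title"], related)

print(nx.number_of_nodes(G), "talks,", nx.number_of_edges(G), "edges")
print("Graph density:", nx.density(G))
```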

I encourage readers to fork this kernel and create an interactive version of this network visualization (click the blue “Fork Notebook” button, make your changes to the code, and click “Publish”).

Work with more data(sets)

More data equals more resource usage. And adding more data sources to their kernels is another way our community is putting the doubled RAM to work.

Using Transfer Learning with Pre-Trained Keras Models to Distinguish Dog Breeds

Although GPUs aren’t yet available in Kernels, users can publish pre-trained models as public datasets on Kaggle. In this case, Kaggler beluga published CNNs with pre-trained ImageNet weights as a dataset and uses it in his kernel to distinguish dog breeds in photographs.
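To give a sense of the general pattern (a minimal sketch, not beluga’s actual code): load a network pre-trained on ImageNet as a fixed feature extractor, then train a lightweight classifier on the extracted features. In a kernel without internet access, you would point `weights=` at the weights file published as a dataset instead of downloading it.

```python
# Hedged transfer-learning sketch: a pre-trained CNN as a fixed feature
# extractor, plus a simple classifier on top. Data loading is assumed elsewhere.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.linear_model import LogisticRegression

# weights="imagenet" downloads weights; in a Kernel you could instead pass the
# path to a weights file published as a Kaggle dataset.
base = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
    return base.predict(preprocess_input(images.copy()), verbose=0)

# X_train / X_valid are image arrays and y_train / y_valid are breed labels,
# assumed to be loaded from the competition data elsewhere:
# features = extract_features(X_train)
# clf = LogisticRegression(max_iter=1000).fit(features, y_train)
# print("Validation accuracy:", clf.score(extract_features(X_valid), y_valid))
```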

Analysis of Scientific Researcher Migrations

Given Kaggle’s global community, I found Kaggler Eidan Cohen’s analysis of scientific researcher migrations super interesting. Eidan combines multiple data sources in his Python notebook kernel to compare countries that attract and retain talent to those that lose their brilliant minds.
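The general recipe, sketched below with made-up file and column names, is simply to join the datasets on a shared country field so they can be compared side by side:

```python
# Hypothetical sketch of combining two Kaggle datasets on a country column.
# File and column names are assumptions, not the actual datasets used.
import pandas as pd

migrations = pd.read_csv("researcher_migrations.csv")   # e.g. country, net_inflow
indicators = pd.read_csv("country_indicators.csv")      # e.g. country, gdp_per_capita

combined = migrations.merge(indicators, on="country", how="inner")
print(combined.sort_values("net_inflow", ascending=False).head(10))
```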

Do more with multiprocessing

The new machines available in Kernels have 4 CPUs, up from 2, enabling multithreaded workloads.

Regression During Insomnia

Kagglers like the1owl are making more powerful submissions to competition leaderboards thanks to doubled CPU resources. In their kernel, the1owl uses multiprocessing to train an XGBoost model in the KKBox customer churn challenge.
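The pattern looks roughly like this (a minimal sketch, not the1owl’s actual pipeline): split CPU-bound preprocessing across a multiprocessing pool and let XGBoost use all four cores via its nthread parameter.

```python
# Hedged sketch: parallel preprocessing with multiprocessing, then a
# multithreaded XGBoost fit. Data here is random for illustration only.
from multiprocessing import Pool
import numpy as np
import xgboost as xgb

def engineer(chunk):
    # placeholder for a CPU-bound feature-engineering step
    return chunk * 2.0

X = np.random.rand(10000, 20)
y = np.random.randint(0, 2, size=10000)  # churned / not churned

with Pool(processes=4) as pool:          # spread preprocessing over 4 CPUs
    X = np.vstack(pool.map(engineer, np.array_split(X, 4)))

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "nthread": 4, "max_depth": 6, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=100)
```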

A Look at the forecastXGB Package

Kaggler Mayur Hampiholi’s R notebook kernel demonstrates parallelization with the foreach and doParallel packages, experimenting with the forecastXGB R package to predict pageviews in the Wikipedia Web Traffic competition.
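The kernel itself is in R, but for Python users the same fit-many-models-in-parallel pattern might look like the following joblib sketch; the data layout (one column of pageviews per page) and the simple trend model are illustrative assumptions, not the kernel’s approach.

```python
# Hedged sketch: fit one small forecasting model per page in parallel.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression

def forecast_page(series, horizon=7):
    """Fit a simple trend model to one page's traffic and forecast `horizon` days."""
    t = np.arange(len(series)).reshape(-1, 1)
    model = LinearRegression().fit(t, series)
    future = np.arange(len(series), len(series) + horizon).reshape(-1, 1)
    return model.predict(future)

# traffic: rows = days, columns = pages (assumed shape, random for illustration)
traffic = pd.DataFrame(np.random.poisson(100, size=(365, 50)))

forecasts = Parallel(n_jobs=4)(          # use all 4 CPUs
    delayed(forecast_page)(traffic[col].values) for col in traffic.columns
)
print(len(forecasts), "page forecasts of length", len(forecasts[0]))
```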

What’s Next

If you’re inspired to try out these new resources for yourself, go check out Kaggle Kernels here and read my getting started guide. Your kernels are private by default so you can experiment first and share when you’re ready.

Interested in starting a data project on Kaggle? Reach out to me at @MeganRisdal on Twitter and I’d be happy to help you or your organization get started.
