How are Kagglers using 60 minutes of free compute in Kernels?
More powerful visualizations, more data sources, and more multiprocessing
Kaggle Kernels now gives people from across the world access to machines about as powerful as the one I’m typing on to work on data science projects right in their browsers, thanks to the increased resources our team recently made available to the Kaggle community for free.
Since it’s been about a month since our engineers supercharged Kaggle Kernels, I thought I’d explore how our users are spending their new CPU cycles and tripled execution time.
To sum up what I saw from browsing hundreds of popular, compute-intensive Python and R kernels:
- More powerful visualizations
- More data
- More multiprocessing
In the rest of this post, I showcase these exciting examples of community code, from topic modeling of LEGO color sets to parallelized XGBoost code, that until recently couldn’t be shared publicly on Kaggle due to resource limits.
Click on any of the following visualizations created by users to check out their code or create a new kernel of your own.
Create more powerful visualizations
Many Kagglers write kernels to build data science portfolios, and one of the best ways to get your work noticed is through beautiful data visualizations. With expanded resources, users are graduating from basic bar charts to more computationally demanding visualization techniques.
Finding LEGO Color Themes with Topic Models
In an R Markdown script kernel, Nathan Aff uses the LEGO database on Kaggle to model LEGO set color themes with Latent Dirichlet Allocation (LDA). Nathan bases his code on Julia Silge and David Robinson’s new Text Mining with R book to implement this computationally intensive unsupervised machine learning algorithm in a rather colorful analysis.
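Nathan’s kernel is in R, but the same idea translates readily to Python. Here’s a minimal sketch using scikit-learn’s LatentDirichletAllocation, where each LEGO set is a “document” whose “words” are its brick colors (the file name and columns below are hypothetical stand-ins for the LEGO database tables):

```python
# Minimal sketch: treat each LEGO set as a "document" whose "words" are
# its brick colors, then fit LDA to surface recurring color themes.
# NOTE: set_colors.csv and its columns are hypothetical stand-ins for
# the joined tables of the LEGO database on Kaggle.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv("set_colors.csv")  # one row per brick: set_id, color_name

# Underscores keep multi-word colors like "Dark Bluish Gray" as one token
df["color_name"] = df["color_name"].str.replace(" ", "_")
docs = df.groupby("set_id")["color_name"].apply(" ".join)

# Document-term matrix: color counts per set
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=6, random_state=0)
lda.fit(X)

# Show the top colors in each discovered "theme"
colors = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [colors[i] for i in weights.argsort()[::-1][:5]]
    print(f"Theme {k}: {', '.join(top)}")
```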
TED Talk Data Analysis: Finding Related Videos Using Network Analysis
In this Python notebook kernel (based on Jupyter), Kaggler Rounak Banik examines how TED Talks are related (and a lot more) using network analysis. He says:
We will have a look at how every TED Talk is related to every other TED Talk by constructing a graph with all the talks as nodes and edges defined between two talks if one talk is on the list of recommended watches of the other. Considering the fact that TED Talks are extremely diverse, it would be interesting to see how dense or sparse our graph will be.
I encourage readers to fork this kernel and create an interactive version of this network visualization (click the blue “Fork Notebook” button, make your changes to the code, and click “Publish”).
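If you want to experiment with the approach Rounak describes before forking, the core graph construction is only a few lines of networkx. In this hypothetical sketch, the column names are placeholders rather than the dataset’s actual schema:

```python
# Sketch: build a graph with talks as nodes and an edge between two
# talks when one appears on the other's recommended-watches list.
# NOTE: the column names below are hypothetical placeholders.
import ast
import networkx as nx
import pandas as pd

talks = pd.read_csv("ted_main.csv")

G = nx.Graph()
G.add_nodes_from(talks["title"])

for _, row in talks.iterrows():
    # 'related_titles' is assumed to hold a stringified list of titles
    for rec in ast.literal_eval(row["related_titles"]):
        if rec in G:
            G.add_edge(row["title"], rec)

# One number answers "how dense or sparse is our graph?"
print(f"Density: {nx.density(G):.4f}")
```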
Work with more data(sets)
More data equals more resource usage. And adding more data sources to their kernels is another way our community is putting the doubled RAM to work.
Using Transfer Learning with Pre-Trained Keras Models to Distinguish Dog Breeds
Although GPUs aren’t yet available in Kernels, users can publish pre-trained models as public datasets on Kaggle. In this case, Kaggler beluga published CNNs with pre-trained ImageNet weights and uses this dataset in his kernel to distinguish dog breeds in photographs.
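The general transfer-learning pattern (a sketch, not beluga’s exact code) is to load a network minus its ImageNet classification head and use its output as features for a simpler downstream classifier:

```python
# Sketch of transfer learning: a CNN pre-trained on ImageNet as a
# fixed feature extractor for dog-breed photos.
# NOTE: the model choice and file path are illustrative; in Kernels,
# the weight files from the attached dataset are typically copied to
# ~/.keras/models first so no download is needed.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

# include_top=False drops the 1000-way ImageNet classifier; pooling="avg"
# collapses the final feature maps into a single vector per image
base = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return base.predict(x).ravel()

# Vectors like this can train a lightweight classifier (e.g. logistic
# regression) to predict breeds far faster than training a CNN from scratch
features = extract_features("train/affenpinscher_001.jpg")  # hypothetical path
```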
Analysis of Scientific Researcher Migrations
Given Kaggle’s global community, I found Kaggler Eidan Cohen’s analysis of scientific researcher migrations super interesting. Eidan combines multiple data sources in his Python notebook kernel to compare countries that attract and retain talent to those that lose their brilliant minds.
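Mixing datasets like this usually comes down to reading each file Kaggle mounts into the kernel and joining on a shared key; a hypothetical sketch of the pattern:

```python
# Sketch: combining two data sources on a shared country key.
# NOTE: file names and columns are hypothetical placeholders, not the
# actual schema of the datasets Eidan uses.
import pandas as pd

migrations = pd.read_csv("researcher_migrations.csv")  # country, inflow, outflow
indicators = pd.read_csv("country_indicators.csv")     # country, gdp_per_capita

combined = migrations.merge(indicators, on="country", how="inner")

# Positive net migration means a country gains more researchers than it loses
combined["net_migration"] = combined["inflow"] - combined["outflow"]
print(combined.sort_values("net_migration", ascending=False).head())
```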
Do more with multiprocessing
The new machines available in Kernels have 4 CPUs, up from 2, enabling multithreaded workloads.
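From Python, you can confirm the core count and spread CPU-bound work across a process pool with nothing but the standard library; a minimal sketch:

```python
# Minimal sketch: fan an embarrassingly parallel, CPU-bound workload
# out across all of the kernel's cores.
from multiprocessing import Pool, cpu_count

def expensive(n):
    # Stand-in for any CPU-bound task (feature engineering, CV folds, ...)
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    print(f"Cores available: {cpu_count()}")  # should report 4 in Kernels
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(expensive, [10**6] * 8)
```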
Regression During Insomnia
Kagglers like the1owl are making more powerful submissions to competition leaderboards thanks to doubled CPU resources. In their kernel, the1owl uses multiprocessing to train an XGBoost model in the KKBox customer churn challenge.
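XGBoost parallelizes tree construction internally, so the usual way to put all four cores to work is its nthread parameter; here is a minimal sketch on placeholder data (not the competition’s actual features):

```python
# Sketch: training XGBoost across all available cores via nthread.
# NOTE: X and y are synthetic stand-ins for the churn competition data.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "nthread": 4,  # use all four CPUs now available in Kernels
}
model = xgb.train(params, dtrain, num_boost_round=100)
```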
A Look at the forecastXGB Package
Kaggler Mayur Hampiholi’s R notebook kernel demonstrates parallelization using foreach and doParallel to experiment with the forecastXGB R package to predict pageviews in the Wikipedia Web Traffic competition.
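foreach and doParallel are R packages; for Python readers, here is a loose analogue of the same fit-one-model-per-series pattern using joblib instead (everything here is hypothetical, down to the placeholder forecasting function):

```python
# Loose Python analogue of the foreach/doParallel pattern: fit one
# forecasting model per Wikipedia page's traffic series, in parallel.
# NOTE: the data and forecast_series() are hypothetical placeholders;
# Mayur's actual kernel uses R's forecastXGB.
import numpy as np
from joblib import Parallel, delayed

def forecast_series(series, horizon=30):
    # Placeholder model: a naive mean forecast stands in for forecastXGB
    return np.repeat(np.nanmean(series), horizon)

series_list = [np.random.rand(550) for _ in range(1000)]  # fake pageview series

# n_jobs=4 plays the role of registerDoParallel(cores = 4) in R
forecasts = Parallel(n_jobs=4)(
    delayed(forecast_series)(s) for s in series_list
)
```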
What’s Next
If you’re inspired to try out these new resources for yourself, go check out Kaggle Kernels here and read my getting started guide. Your kernels are private by default so you can experiment first and share when you’re ready.
Plus, check out these extras:
- Read about other new features in Kernels including code folding in notebooks and improved reliability across the board.
- Every Thursday, our Datasets team convenes virtually to pore over the latest kernels to select a winner for our weekly $500 Kernels Award (it’s getting more difficult to choose each week!). Learn how to participate and win.
- We’re also awarding $10,000 in prizes each month to the top three high quality datasets published on our public data platform. Check out the details here and see what our community can do with your dataset.
Interested in starting a data project on Kaggle? Reach out to me at @MeganRisdal on Twitter and I’d be happy to help you or your organization get started.