My Bot to Write Baseline Kaggle Kernels

Kaggle is the world’s largest data science community and one of the best places to learn practical, applied data science. Apart from data science competitions, Kaggle is also popular for its kernels. The community shares a wide variety of them, including exploratory data analysis kernels, data visualisation kernels, baseline models, end-to-end frameworks, and deep learning models. These kernels are a great resource for learning how data science concepts are implemented in practice.

A good way to get started with any new competition or dataset is to read the baseline starter kernels. These kernels give an end-to-end walkthrough of data preprocessing, analysis, and one or more simple models. By following them, one can understand how to proceed with a given dataset or competition.

I have personally learned a lot from such kernels, and recently I started sharing baseline analysis and modelling kernels of my own on Kaggle datasets and a few competitions. After writing a couple of them, I realised that most of the content of every new kernel follows the same fundamental process, so many parts of the code and markdown text are quite similar from one kernel to the next. Every time I created a new baseline kernel, I had to write the same content from scratch or, in the worst case, copy and paste code and text from my previous kernels.

Recently, the Kaggle team updated the Kaggle public API with new features for creating and maintaining kernels. The team also launched a bot, kerneler, that writes starter kernels for datasets.

Kerneler: a Kaggle bot

I decided to use the API to build my own bot and automate the process of creating baseline kernels on Kaggle. Using the same public API, I created a Python module, “aster”, which can be used to write baseline analysis and modelling kernels. With this bot, I was able to create 5 kernels on different datasets in less than a minute. Not only is most of the process automated, but the time spent writing new kernels from scratch is also greatly reduced.
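To give a feel for what pushing a kernel through the public API involves, here is a minimal sketch that writes the kernel-metadata.json file the Kaggle command-line tool reads before a push. This is not aster's actual code: the function name and its arguments are hypothetical, and the metadata field names follow the Kaggle API's kernel metadata format as I understand it, so check them against the current documentation before relying on them.

```python
import json
from pathlib import Path


def write_kernel_metadata(folder, username, slug, title, code_file, competition):
    """Write kernel-metadata.json into `folder` and return its path.

    Hypothetical helper for illustration; the field names are assumed to
    match the Kaggle API's kernel metadata format.
    """
    metadata = {
        "id": f"{username}/{slug}",       # owner/slug identifier of the kernel
        "title": title,
        "code_file": code_file,           # notebook or script to upload
        "language": "python",
        "kernel_type": "notebook",
        "is_private": True,
        "competition_sources": [competition],
    }
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "kernel-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path
```

Calling write_kernel_metadata("titanic-baseline", "some-user", "titanic-baseline", "Titanic Baseline", "kernel.ipynb", "titanic") and then running `kaggle kernels push -p titanic-baseline` from a shell would, under these assumptions, upload the notebook in that folder.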

This module acts like a bot: it accepts user inputs in a config and generates code and markdown templates accordingly. I derived the name from ASTER, an instrument aboard NASA’s Terra satellite that was designed to provide next-generation remote sensing images from space.

Some cool features of aster are dynamic code selection and multi-dataset kernel capabilities. It can create baseline modelling kernels for datasets with binary or multi-class targets, and for datasets with text fields, numerical columns, or both. As part of the kernel preparation process, aster dynamically chooses the most relevant code and markdown cells from a template repository. For example, it will create word clouds for text fields and pair-plots for numerical fields.
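The idea of dynamic code selection can be sketched in a few lines. The snippet below is only an illustration under my own assumptions, not aster's implementation: the `TEMPLATES` mapping, the cell names, and both helper functions are hypothetical stand-ins for a real template repository.

```python
# Hypothetical template repository: column kind -> names of cells to include.
TEMPLATES = {
    "text": ["word_cloud"],
    "numerical": ["pair_plot", "correlation_heatmap"],
}


def infer_kind(values):
    """Guess whether a column is numerical or text from a few sample values."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "text"


def select_cells(columns):
    """Return the ordered, de-duplicated template cells for a dataset.

    `columns` maps a column name to a list of sample values.
    """
    selected = []
    for values in columns.values():
        for cell in TEMPLATES[infer_kind(values)]:
            if cell not in selected:
                selected.append(cell)
    return selected


sample = {"Age": [22, 38.0], "Name": ["Braund", "Cumings"]}
print(select_cells(sample))  # ['pair_plot', 'correlation_heatmap', 'word_cloud']
```

A real bot would inspect the actual data types of the dataset rather than a handful of sample values, but the lookup-and-collect pattern stays the same.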

The major components of a baseline kernel created by aster are: Quick Exploration, Preprocessing, Feature Engineering, and Modelling. Aster is controlled and run through a config file in which the user provides key-value pairs that decide the kernel content. Let's look at a few examples. The following generates a simple baseline kernel on the Titanic dataset hosted on Kaggle.

# import the library
from aster.aster import aster
# prepare the config
config = {"COMPETITION": "titanic",
          "_TARGET_COL": "Survived",
          "_ID_COL": "PassengerId"}
ast = aster(config)  # aster object with config
ast._prepare()  # prepare the kernel
ast._push()  # push the kernel to Kaggle

Here is another example, which creates a text classification kernel on the spooky-author-identification dataset. Notice that in this case additional key-value pairs are defined, corresponding to the text column in the data.

from aster.aster import aster
config = {"COMPETITION": "spooky-author-identification",
          "_TARGET_COL": "author",
          "_ID_COL": "id",
          "_TAG": "doc",
          "_TEXT_COL": "text"}
ast = aster(config)  # aster object with config
ast._prepare()  # prepare the kernel
ast._push()  # push the kernel to Kaggle
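To make the link between the config and the generated kernel more concrete, here is a rough sketch of how a config like the ones above could be turned into a notebook with the four components mentioned earlier. It is an assumption-laden illustration rather than what aster does internally: `build_notebook`, `markdown_cell`, and `code_cell` are hypothetical names, and the placeholder code cells stand in for real templates. The dictionary layout follows the Jupyter notebook format, version 4.

```python
SECTIONS = ["Quick Exploration", "Preprocessing", "Feature Engineering", "Modelling"]


def markdown_cell(text):
    """Build a notebook markdown cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": [text]}


def code_cell(code):
    """Build a notebook code cell that has not been executed yet."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": [code],
    }


def build_notebook(config):
    """Assemble a skeleton notebook: one heading and one code cell per section."""
    cells = [markdown_cell(f"# Baseline kernel: {config['COMPETITION']}")]
    for number, section in enumerate(SECTIONS, start=1):
        cells.append(markdown_cell(f"## {number}. {section}"))
        # Placeholder only; a real bot would insert a template cell here.
        cells.append(code_cell(f"# {section} for target '{config['_TARGET_COL']}'"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
```

Serialising the returned dictionary with `json.dumps` and saving it with an .ipynb extension gives a file that Jupyter can open.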

Let's look at the content of a kernel generated by this bot.

Dataset Snapshot:

Variable Correlations:

Model Evaluations:

Here are a few links to kernels generated by aster:

1. Binary Classification on Numerical Data, Competition Data

2. Multi Classification on Text Data, Competition Data

3. Binary Classification, Non Competition Data

Thanks for reading, and please share your suggestions and feedback. Do send a PR on GitHub if you have cool features in mind that could be implemented.

Edit: This bot was chosen as the winner of the mini bot challenge on Kaggle and helped me win a Kaggle Swag Prize. :)