How My First Kaggle Competition Changed My Data Science Learning Experience

Beng Chew
4 min read · Feb 15, 2022

Photo by Giorgio Trovato on Unsplash

“Don’t waste your time on Kaggle competitions. They won’t be helpful when dealing with real business problems.”

Someone told me this when I first started my data science career. As a young, clueless data scientist, I followed this advice for 2 years and avoided Kaggle competitions. Recently, I stumbled upon Abhishek Thakur’s LinkedIn post promoting the Song Popularity Prediction Kaggle competition.

After reading his post, I had the urge to join the competition and find out whether that advice was true. So, here are my 5 key takeaways after completing the competition.

1. Generous code sharing in the community forum

At first, I thought the overall experience would mainly be about competing with everyone and finding out who had the best score. To my surprise, I was amazed by how generous the community is. Even at the beginning of the competition, people were already sharing useful EDA code, feature engineering techniques and more. To me, it felt more like everyone working together than competing with each other.

2. Enriched my experience in other types of ML problems

My job experience has mainly focused on time series forecasting & optimization, so this was a good hands-on opportunity to experience common issues faced in other types of ML problems. For example, this Song Popularity Prediction (classification) problem let me try out resampling methods to tackle imbalanced data and understand how to choose the right metrics to evaluate models.
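To make the resampling and metric point concrete, here is a minimal sketch using scikit-learn. It is not the code I used in the competition: the data is synthetic (generated with `make_classification`), random oversampling stands in for whichever resampling method you prefer, and the logistic regression model is just a placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (an assumption): roughly 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Random oversampling: draw minority rows with replacement until both classes
# have the same size. Only the training split is resampled, never the test split.
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
X_balanced = np.vstack([majority, minority_upsampled])
y_balanced = np.concatenate(
    [np.zeros(len(majority), dtype=int), np.ones(len(minority_upsampled), dtype=int)]
)

model = LogisticRegression(max_iter=1000).fit(X_balanced, y_balanced)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

# Accuracy alone can look good on imbalanced data, so compare it with
# metrics that account for the minority class.
scores = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(scores)
```

On imbalanced data, a model that always predicts the majority class can still score high accuracy, which is why F1 and ROC AUC are reported alongside it here.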

3. Exposed to creative feature engineering & missing data imputation techniques

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos, University of Washington

With many models to choose from, many people have shifted their focus to model/algorithm development rather than data quality or feature engineering. Personally, I find it more important to focus on data quality and advanced feature engineering techniques while using simple models.

In the forum, there were tons of feature engineering and missing data imputation techniques being shared. To be honest, I was not able to try out all of the ones I had bookmarked. Here are the techniques I found interesting:

  • Clustering models (e.g. KMeans) to generate new features
  • PCA, a dimensionality-reduction method, to reduce the number of features
  • Log, Box-Cox and other power transformations for rescaling data
  • LGBM imputer for missing data
  • Feature interaction between 2–3 variables based on domain knowledge
  • Polynomial feature transformation
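Several of the techniques in this list can be sketched in a few lines of scikit-learn. The example below is only an illustration on randomly generated, right-skewed data (an assumption, not the competition dataset), and the choices of 3 clusters, 2 principal components and degree 2 are arbitrary. The LGBM imputer and domain-specific interactions are left out because they depend on the actual data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical strictly positive, right-skewed features standing in for real columns.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 4))

# Log and Box-Cox transforms; Box-Cox requires strictly positive inputs.
X_log = np.log1p(X)
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)

# Scale before distance-based or variance-based methods.
X_scaled = StandardScaler().fit_transform(X_log)

# KMeans cluster label as a new categorical feature.
cluster_id = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# PCA components as compact summaries of the original columns.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Degree-2 polynomial features: originals, squares and pairwise interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)

# Stack everything into one feature matrix: 4 + 1 + 2 + 14 columns.
features = np.column_stack([X_boxcox, cluster_id, X_pca, X_poly])
print(features.shape)  # (200, 21)
```

In a real project, each transformer should be fitted on the training folds only and then applied to the validation data, to avoid leaking information.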

4. Learnt how to write efficient data analysis & plotting code

As I have only recently transitioned from R to Python, I am more familiar with ggplot than with the matplotlib/seaborn code structure. Oftentimes, I either run into many code errors or end up with not-so-pretty plots. So, what I really like about the code sharing forum is that people don’t just share code on how to build models; they also share many detailed EDA notebooks with their own findings. Here are two of my personal favorites from the competition.

5. Learnt how to use the Optuna package to tune models

As I was scrolling through the shared notebooks, I found a few that mentioned an interesting term: Optuna. This got me curious, so I decided to check one of them out. And that’s it! I will definitely be using it for my next ML problem.

If you have not heard of it, just like me, click this link to find out more. Basically, it is an open source hyperparameter optimization framework that automates hyperparameter search.

Closing thoughts —

No doubt, Kaggle is a great platform for beginner or intermediate data scientists to practise their coding skills.

For advanced users, I think joining distinguished competitions such as the M5 Forecasting competition would be a challenging yet rewarding experience.

I write about my data science journey & learning experience. Follow me if you want to learn more.

Hope you enjoyed this read. Have a nice day!

Beng Chew

Data Scientist. I share about technology, analytics & learning experience.