First Thoughts on Kaggle
About two weeks ago, I registered for my first Kaggle competition, the Mercedes-Benz Greener Manufacturing contest. My code can be found here, and a log of what I attempted can be found here. Here are my first impressions of Kaggle:
The Learning Curve
Entering the contest, my only knowledge of machine learning came from MIT’s introductory class (6.036). So, I was pleasantly surprised to find that Kaggle contests are perfectly accessible to those with minimal prior experience. This is largely due to Kaggle’s kernels, which let more experienced users share their code publicly. Kernels allow even those completely new to machine learning to be competitive in the rankings — by simply copying code, anyone can achieve results on par with Kaggle veterans.
Through my attempts to improve code from kernels, I got a brief introduction to multiple concepts in machine learning, many of which I hope to write about more thoroughly in the future:
- Gradient Boosted Trees
- Hyperparameter Tuning
- Dimensionality Reduction: PCA, ICA, t-SNE, logistic PCA, truncated SVD, Gaussian and sparse random projections
- Overfitting, k-fold cross-validation, out-of-fold predictions
- Ensembling, Stacking, and Averaging
- Sklearn Models: LassoLars, ElasticNet, etc.
- Basic Feature Selection and Feature Engineering
- Likelihood encoding (post-contest)
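Several of the concepts above compose naturally into one pipeline. As a rough sketch (using scikit-learn on synthetic data, since the real contest data isn’t reproduced here), this is the kind of recipe many kernels followed: append a few dimensionality-reduction components as extra features, fit gradient boosted trees, and score with 5-fold cross-validation on R², the contest’s metric:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Toy stand-in for the contest data (hypothetical shapes, not the real dataset)
X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)

# Dimensionality reduction: append a few PCA / truncated-SVD components
# as extra features alongside the originals
pca = PCA(n_components=5, random_state=0)
svd = TruncatedSVD(n_components=5, random_state=0)
X_aug = np.hstack([X, pca.fit_transform(X), svd.fit_transform(X)])

# Gradient boosted trees, evaluated with 5-fold cross-validation on R^2
model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
scores = cross_val_score(model, X_aug, y, cv=5, scoring="r2")
print(f"mean CV R^2: {scores.mean():.3f}")
```

Kernels in the contest typically used XGBoost rather than scikit-learn’s booster, but the overall shape of the pipeline is the same.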
This contest seemed to me like a great way to quickly ‘learn by doing’. It would be hard to find another online resource that facilitates learning data science concepts as well as Kaggle does.
Unpredictability of the Leaderboard
As evidenced by the massive shakeup in the final rankings, the public leaderboard was entirely unreliable for predicting the private leaderboard: almost everyone in the lead throughout the contest dropped hundreds of ranks at the end. Unfortunately, even cross-validation proved unreliable: my final model, evaluated with 5-fold CV, performed no better than the heavily overfitted XGBoost model I built on my fourth day.
In the end, it turned out that there were reliable ways to validate a model; for the most part, though, contestants (myself included) weren’t thorough enough in evaluating their models’ performance.
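One lesson I took from this: a single 5-fold score is itself a noisy estimate, so comparing two models on one split can be misleading. A minimal sketch of a more thorough evaluation (again on synthetic data with a deliberately low signal-to-noise ratio, as an assumption standing in for the contest’s) is to repeat the k-fold split many times and look at the spread of scores, not just the mean:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Hypothetical small, noisy dataset mimicking a low signal-to-noise contest
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=50.0, random_state=0)

model = Ridge(alpha=1.0)

# Repeating the 5-fold split 10 times exposes how much the CV estimate
# itself varies from split to split; a score difference smaller than that
# spread is probably not a real improvement.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(f"R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Had I compared my models against the split-to-split spread rather than a single CV mean, I suspect the “improvements” over my day-four model would have looked a lot less convincing.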
Kaggle’s Contest Community
I have nothing but positive things to say about Kaggle’s community. User-submitted kernels and competition discussion threads do a lot to encourage collaboration among contestants. During the contest, many users worked together in open forums, improving each other’s models and discussing properties of the dataset. In this competition, one of the many results of the contestants’ combined efforts was the discovery of 32 y-values from the test dataset, obtained through leaderboard probing.
I found participating in this contest to be very enjoyable! While my final ranking (~1400th place) was a bit disappointing, the competition was extremely fun and taught me a lot, and I plan on becoming more active on Kaggle in the future. Please let me know if you have any feedback; perhaps more Kaggle write-ups will be coming soon. :)