First Attempts at Kaggle
Near the end of June, I signed up on Kaggle and registered for the Mercedes-Benz Greener Manufacturing contest with about two weeks until the end of the competition. Here is a log of what I attempted — my code can be found here. This story is intended as a supplement to this one.
I spent the first day getting acquainted with what to do:
The task given in the contest is a regression problem. The contestant is given a training dataset consisting of 4209 rows. For each row, 8 categorical variables and 368 binary variables are given. We are asked to predict a single variable which usually falls somewhere in the range [75, 175], with a single outlier at 275. The contest rankings are evaluated using R².
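For readers unfamiliar with the metric, here is a minimal sketch of how R² is computed, using the standard definition 1 - SS_res / SS_tot. The function name `r_squared` and the target values are made up for illustration; they are not taken from the contest data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical targets, chosen only to sit inside the range described above.
y_true = [90.0, 100.0, 110.0, 120.0]

perfect = r_squared(y_true, y_true)        # exact predictions
baseline = r_squared(y_true, [105.0] * 4)  # always predicting the mean
```

A perfect prediction scores 1 and always predicting the mean scores 0, which puts scores in the 0.55 to 0.57 range into context: the models explain a little over half of the variance in the target.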
This kernel was a particularly good introduction to the dataset. After reading it, I made my first attempt at the problem: logistic regression implemented in TensorFlow, with and without adding PCA dimensions as additional features. This performed horribly, and I scrapped the code by the end of the day.
After reading other kernels, I copied code from a kernel providing a baseline for XGBoost. This kernel uses PCA and ICA for additional features, and then uses XGBoost to train a gradient-boosted tree model. Even though I had no previous knowledge of gradient boosting or ICA, running this put me around the 50th percentile of the leaderboard, at around rank 1600.
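The general idea of appending PCA and ICA components as extra features can be sketched as follows. This is not the kernel's code: it assumes scikit-learn, swaps in `GradientBoostingRegressor` as a stand-in for XGBoost, and runs on synthetic data with hypothetical settings such as `n_comp = 5`.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a binary feature matrix (not the contest data).
X = rng.integers(0, 2, size=(200, 30)).astype(float)
y = 100.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=200)

n_comp = 5  # hypothetical number of components per technique
pca = PCA(n_components=n_comp, random_state=0)
ica = FastICA(n_components=n_comp, random_state=0, max_iter=1000)

# Keep the original columns and append the projected components.
X_aug = np.hstack([X, pca.fit_transform(X), ica.fit_transform(X)])

# Gradient-boosted trees trained on the augmented feature matrix.
model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_aug, y)
train_r2 = model.score(X_aug, y)  # R² on the training data only
```

In practice the decompositions would be fitted on the training set and then applied to the test set with `transform`, and a training-set R² like the one above says nothing about generalization.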
I then read through multiple kernels which all detailed a similar approach: XGBoost, with dimensionality reduction techniques for additional features. Each of these scored about R²=0.555, which didn’t change much even after tuning hyperparameters. However, one of the kernels averaged an XGBoost model with a stacked model, resulting in an improved score of R²=0.568. At this point, I was ranked at about 575th out of 3300 contestants without having contributed much original thought to the problem.
This helpful guide to ensembling and stacking encouraged me to find additional models to include in the average. After browsing some more kernels, I decided to average this model with my previously copied XGBoost/stacking model, obtaining an R² of 0.56943, raising my rank to 294/3464 and placing me within the cutoff for a bronze medal (top 10%).
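Averaging models in this sense just means combining their per-row predictions. A minimal sketch, with a hypothetical helper `average_predictions` and made-up prediction values:

```python
def average_predictions(predictions, weights=None):
    """Blend per-row predictions from several models into one list.

    `predictions` is a list of equally long lists, one per model.
    `weights` is optional; equal weights are used when it is omitted.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Made-up predictions from two models for three rows.
preds_a = [100.0, 110.0, 90.0]
preds_b = [104.0, 106.0, 94.0]

blended = average_predictions([preds_a, preds_b])
weighted = average_predictions([preds_a, preds_b], weights=[3.0, 1.0])
```

Blending tends to help most when the models make different kinds of errors; two near-identical models average to roughly the same predictions as either one alone.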
At this point, I decided to keep track of my progress in the contest on this post, and to upload my code here.
Time to completely refactor my code — I moved each model into separate files to make them easier to test and run individually. Apart from this, I made little progress; none of my various attempts at feature engineering seemed to work.
After realizing that the public leaderboard scores might not be an accurate indicator of my models’ performance, I decided to refactor my code again and run cross-validation tests. After tweaking hyperparameters, I reached a validation MSE of 68.78, which corresponds to a leaderboard score of 0.55775, but with a model that no longer overfits. I also tried playing with Keras and neural nets, but they didn’t validate well: every architecture I tried had extremely variable training and validation scores, both between epochs and between separate trials.
Thanks to the Kaggle discussion threads, I realized that a simple ElasticNetCV model with the MaxAbsScaler transformation performed better than my current mixture model, so I incorporated MaxAbsScaler into each of my individual models. This yielded a significant improvement in my local CV (R² score went from 0.567 to 0.575), but no change in the leaderboard. This discussion confirmed my suspicions that the public leaderboard was unreliable.
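A scaler followed by `ElasticNetCV` can be wired together as a scikit-learn pipeline. The sketch below is an assumption about how such a model might look, not the code from the discussion thread; the data is synthetic and the `l1_ratio` grid is hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)

# Synthetic features; one column is deliberately on a much larger scale.
X = rng.integers(0, 2, size=(200, 20)).astype(float)
X[:, 0] *= 50.0
y = 100.0 + 0.1 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(0.0, 1.0, size=200)

# MaxAbsScaler divides each column by its largest absolute value, so every
# feature ends up in [-1, 1]. ElasticNetCV then chooses its regularization
# strength (and l1_ratio from the given list) by internal cross-validation.
model = make_pipeline(
    MaxAbsScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0),
)
model.fit(X, y)

scaled = model.named_steps["maxabsscaler"].transform(X)
train_r2 = model.score(X, y)  # R² on the training data only
```

Putting the scaler inside the pipeline means it is refitted on each training split whenever the whole pipeline is cross-validated, which keeps validation rows out of the scaling statistics.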
On Day 10, adding out-of-fold R² cross-validation and hyperparameter tuning didn’t significantly improve my score. For the remainder of the week, I didn’t make much progress: I tried various binary classification models to detect outliers and tried using t-SNE, but neither seemed to help much.
Last-minute tweaking improved my cross-validation score considerably! I moved away from stacking models after not being able to improve them further, and used feature engineering to reduce the number of features in each model. Each model now trains and runs much faster, with improved accuracy. My final local CV achieved an R² score of 0.5777 on a 5-fold out-of-fold prediction. My final model consisted of an average of:
- Model 1: Gradient Boosted Tree (CV ~0.572)
- Model 2: Random Forest Regressor (CV ~0.572)
- Model 3: Ridge Regressor (CV ~0.573)
- Model 4: ElasticNet (CV ~0.572)
- Model 5: Decision Tree Regressor (CV ~0.573)
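An average of five such models can be scored with 5-fold out-of-fold predictions roughly as follows. This sketch assumes scikit-learn; the hyperparameters and the synthetic data are placeholders rather than the settings actually used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the contest data).
X = rng.integers(0, 2, size=(300, 15)).astype(float)
y = 100.0 + 6.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(0.0, 2.0, size=300)

# Placeholder hyperparameters for the five model families listed above.
models = {
    "gbt": GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0),
    "rf": RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0),
    "ridge": Ridge(alpha=1.0),
    "enet": ElasticNet(alpha=0.01),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# The fixed seed gives every model the same five folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold: each row is predicted by a model that never saw it in training.
oof = {name: cross_val_predict(m, X, y, cv=cv) for name, m in models.items()}
oof_avg = np.mean(list(oof.values()), axis=0)

scores = {name: r2_score(y, pred) for name, pred in oof.items()}
avg_score = r2_score(y, oof_avg)
```

Because every row is scored by a model that was trained without it, the out-of-fold R² is a less optimistic estimate than a training-set score, though it can still drift from the private leaderboard.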
My predictions with the averaged model looked something like this:
In total, I probably spent about 50–60 hours working on this. In the end, none of my efforts in the last week mattered: my final model placed 1400th, while one of my models from Day 8 would have earned a bronze medal! My thoughts about this competition can be found here.