The Ultimate Student Hunt was a week-long machine learning competition hosted by AnalyticsVidhya, in which I placed 3rd as a solo competitor. Here’s a blog post going in-depth into my solution and thought process.
Step 1: Identify the problem
The moment the competition started, the first thing I did was click download on the data. As it was downloading, I had a quick look through the problem statement. The important information was that we were trying to predict the number of people that would visit parks on future days, using information such as weather conditions. One key thing about this competition is that it was a time-series problem, something that I will go into in more detail later.
Step 2: Preliminary model
Unlike a lot of people, the first thing I always do in a competition like this one is to submit a simple preliminary model. This is to ensure that I am treating the data correctly, to check that my first validation strategy works, and to set a ‘benchmark’ score that I can compare future, more complex submissions against. I find this vitally important, as it lets me know how much improvement basic feature engineering gives.
In this competition, like most competitions, I decided to just select all the numeric features and run an untuned XGBoost (my favourite algorithm) model on the data. As a validation strategy, I knew that a random split was not going to work. Because the competition is a time series prediction problem, meaning that the train/test split in the data was based on time, I also split the training set into train and validation sets based on time, taking the last three years of training data as a validation set. My first model had 190 RMSE in validation, and scored 200 on the leaderboard!
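For anyone who wants to try the same validation idea, here is a minimal sketch of a time-based split. It is not the code from the competition: the column names `Date` and `Footfall`, the helper names `time_split` and `rmse`, and the toy data are all assumptions made for illustration, and the "model" is just the training mean rather than XGBoost.

```python
import numpy as np
import pandas as pd

def time_split(df, date_col="Date", valid_years=3):
    """Hold out the final `valid_years` years of rows as a validation set."""
    cutoff = df[date_col].max() - pd.DateOffset(years=valid_years)
    train = df[df[date_col] <= cutoff]
    valid = df[df[date_col] > cutoff]
    return train, valid

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data (made up): one row per day over ten years
dates = pd.date_range("1990-01-01", "1999-12-31", freq="D")
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Date": dates, "Footfall": rng.normal(1200, 100, len(dates))})

train, valid = time_split(toy)

# Trivial benchmark: predict the training mean for every validation day
baseline = rmse(valid["Footfall"], np.full(len(valid), train["Footfall"].mean()))
```

The point of the split is that every validation row is later than every training row, which mimics how the real test set relates to the training set.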
Oh, and it meant that I got to be the first (and only) person on the leaderboard, even if only for a few minutes :)
Step 3: Basic feature engineering
Now that I had a presence on the leaderboard, I decided to do some quick feature engineering. The first obvious thing that I had missing was the date variable, which I had excluded previously as it was not numeric. With time series competitions, there are very often patterns in the date, where certain dates are more popular than others. The first thing I did was extract the month variable from the date and add that as a feature, which brought my score down to 117 on the leaderboard!
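Extracting the month is a one-liner in pandas. The snippet below is only a sketch on made-up rows, assuming the date column is called `Date` and the target `Footfall`.

```python
import pandas as pd

# Toy rows (made up) with the date stored as text, as it often is in a CSV
df = pd.DataFrame({
    "Date": ["1990-09-01", "1990-12-25", "1991-01-03", "1991-12-25"],
    "Footfall": [800.0, 1500.0, 1000.0, 1200.0],
})

# Parse the text into real datetimes, then pull out the month number (1-12)
df["Date"] = pd.to_datetime(df["Date"])
df["Month"] = df["Date"].dt.month

# A quick look at whether the target varies by month
monthly_mean = df.groupby("Month")["Footfall"].mean()
```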
Inspired by this improvement, I tried to get more information out of the date. I tried using the exact day of year (maybe there are certain dates like Christmas with less/more visitors every year?) with target rate encoding — meaning the feature was the mean Footfall for that day in the past (a useful way of dealing with large categorical features). However, this overfit my model quite badly. I added random noise to the feature to make it less potent, and continued from there.
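A rough sketch of this kind of encoding is below. The function name `day_of_year_target_encoding`, the column names, and the choice to add Gaussian noise to the training rows only are assumptions for illustration; the exact noise scheme used in the competition is not described here.

```python
import numpy as np
import pandas as pd

def day_of_year_target_encoding(train, test, noise_std=0.0, seed=0):
    """Encode day-of-year as the mean training Footfall seen on that day."""
    train_doy = train["Date"].dt.dayofyear
    test_doy = test["Date"].dt.dayofyear

    # Mean target per day-of-year, computed from the training rows only
    means = train["Footfall"].groupby(train_doy).mean()
    global_mean = train["Footfall"].mean()

    train_enc = train_doy.map(means)
    # Days never seen in training fall back to the overall mean
    test_enc = test_doy.map(means).fillna(global_mean)

    # Assumption: noise goes on the training copy, since those rows
    # contributed their own targets to the means
    rng = np.random.default_rng(seed)
    train_enc = train_enc + rng.normal(0.0, noise_std, len(train_enc))
    return train_enc, test_enc

# Toy data (made up)
train = pd.DataFrame({
    "Date": pd.to_datetime(["1990-12-25", "1991-12-25", "1990-06-01"]),
    "Footfall": [1000.0, 1200.0, 900.0],
})
test = pd.DataFrame({"Date": pd.to_datetime(["1993-12-25", "1993-03-01"])})

train_enc, test_enc = day_of_year_target_encoding(train, test, noise_std=0.0)
```

With `noise_std` above zero, the training feature becomes a blurred version of the target mean, which makes it harder for the model to lean on it too heavily.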
Usually, the day of week is a good feature to investigate — maybe people go to the park more on weekends — but I found that actually the days of week had very similar mean footfall, and I was not able to make a good feature out of this. This led me to believe that the dataset might have anonymized dates (1990–2005 is a weird date range for something in 2015) or that it was synthesized data without day of week consideration.
I made one-hot features from the Park ID, so XGBoost could better model each park individually.
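In pandas, one-hot features like these can be built with `get_dummies`. The column name `Park_ID` and the toy values are assumptions for the sake of the example.

```python
import pandas as pd

# Toy rows (made up): three parks
df = pd.DataFrame({
    "Park_ID": [12, 13, 12, 14],
    "Footfall": [1100, 900, 1150, 1300],
})

# One 0/1 column per park, e.g. Park_12, Park_13, Park_14
park_dummies = pd.get_dummies(df["Park_ID"], prefix="Park", dtype=int)
df = pd.concat([df, park_dummies], axis=1)
```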
Lastly, I added a feature importance function to my XGBoost model, and took a look at the importances. Strangely, Direction_Of_Wind was the most important feature, and contributed a lot to the score. This perplexed me, as the feature seemed to have no visible correlation with Footfall, even in individual parks. I added a little bit of noise to the feature to reduce overfitting, and left it as it was, since the feature gave my score a big improvement.
Step 4: Advanced Feature Engineering
Now onto the less obvious stuff. After doing all the initial feature engineering, it usually takes some time for me to think of some new, super features. I find that forcefully staring at graphs etc. doesn’t help me find new things in the data, but rather that taking a step back, and looking at the problem from a simpler viewpoint usually helps me find the things that others don’t. The question you have to ask yourself is “What information would affect whether people go to the park?”
After a day or two of thinking, I had a eureka moment. My thinking was that data about the recent past would affect the current day. For example, if it had been raining the past week that would have a bigger effect than if it had been raining just for one day, and if it had just stopped raining then more people would visit the park as they had not visited the past few days.
Based on this, I made ‘lead’ and ‘lag’ features, which meant that the features for the last two days and next two days in the dataset were also included as features for the current day, allowing the model to learn these cases. The features were a success, improving my score to 109.
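Here is a minimal sketch of lead and lag features using a grouped `shift`. The helper `add_lead_lag` and the column names `Park_ID`, `Date` and `Pressure` are hypothetical, and the sketch assumes each park has one row per consecutive day, so shifting by rows is the same as shifting by days.

```python
import pandas as pd

def add_lead_lag(df, cols, shifts=(1, 2), group_col="Park_ID", date_col="Date"):
    """Add previous-day (lag) and next-day (lead) copies of `cols` per park."""
    df = df.sort_values([group_col, date_col]).copy()
    for col in cols:
        for k in shifts:
            # shift(k) looks k rows back, shift(-k) looks k rows ahead,
            # and grouping stops values leaking from one park into another
            df[f"{col}_lag{k}"] = df.groupby(group_col)[col].shift(k)
            df[f"{col}_lead{k}"] = df.groupby(group_col)[col].shift(-k)
    return df

# Toy data (made up): two parks, four consecutive days each
toy = pd.DataFrame({
    "Park_ID": [1] * 4 + [2] * 4,
    "Date": list(pd.date_range("1990-01-01", periods=4)) * 2,
    "Pressure": [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0],
})

out = add_lead_lag(toy, ["Pressure"])
```

Rows at the start or end of a park's history get NaN for the missing neighbours, which XGBoost can handle directly.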
I tried a few other similar features that did not help. For example, I added features which were the mean pressure, mean moisture etc. for the last week, which did not help.
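For completeness, a trailing-mean feature of this kind might look like the sketch below. The helper `add_trailing_mean` and the column names are hypothetical, and excluding the current day from the window is my own assumption about what "the last week" means.

```python
import pandas as pd

def add_trailing_mean(df, col, window=7, group_col="Park_ID", date_col="Date"):
    """Mean of `col` over the previous `window` rows of each park."""
    df = df.sort_values([group_col, date_col]).copy()
    df[f"{col}_mean{window}"] = df.groupby(group_col)[col].transform(
        # shift(1) drops the current day so the window only looks backwards
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return df

# Toy data (made up): one park, four consecutive days, window of two days
toy = pd.DataFrame({
    "Park_ID": [1, 1, 1, 1],
    "Date": pd.date_range("1990-01-01", periods=4),
    "Pressure": [1.0, 2.0, 3.0, 4.0],
})

out = add_trailing_mean(toy, "Pressure", window=2)
```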
To combat the NaN problem, I tried making features which were the mean of that feature for the other parks for that day. While they were very good at replicating the true values of NaNs, they did not help my validation and so I left them out. In hindsight, I think they would have helped a lot on the private leaderboard so I regret not using them.
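One way to build such a feature is sketched below: for each row, take the day's total across all parks, subtract the row's own value, and divide by the number of other parks that reported a value. The helper `other_parks_mean` and the column names are assumptions, not the original code.

```python
import numpy as np
import pandas as pd

def other_parks_mean(df, col, date_col="Date"):
    """Mean of `col` on the same day across all other rows (parks)."""
    by_day = df.groupby(date_col)[col]
    day_sum = by_day.transform("sum")      # NaNs are skipped in the sum
    day_count = by_day.transform("count")  # number of non-NaN values that day

    own_value = df[col].fillna(0.0)        # a missing own value adds nothing
    own_count = df[col].notna().astype(int)

    others = day_count - own_count
    # No other park reported that day: leave the feature as NaN
    return (day_sum - own_value) / others.where(others > 0)

# Toy data (made up): three parks on day one, a single park on day two
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1990-01-01"] * 3 + ["1990-01-02"]),
    "Park_ID": [1, 2, 3, 1],
    "Pressure": [10.0, np.nan, 14.0, 5.0],
})

toy["Pressure_others"] = other_parks_mean(toy, "Pressure")
```

In the toy data, park 2 has no pressure reading on the first day, and its feature comes out as the mean of the other two parks.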
Nearer to the end of the competition, I also made some additional features which were the differences between the current day and last day’s features, which helps XGBoost map the relationship between the days. This gave a small improvement.
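A difference feature like this can be sketched with a grouped `diff`. As before, `add_day_diffs` and the column names are hypothetical, and the sketch assumes one row per park per consecutive day.

```python
import pandas as pd

def add_day_diffs(df, cols, group_col="Park_ID", date_col="Date"):
    """Add the change in each column since the previous row of the same park."""
    df = df.sort_values([group_col, date_col]).copy()
    for col in cols:
        # diff() is the current value minus the previous one within each park
        df[f"{col}_diff1"] = df.groupby(group_col)[col].diff()
    return df

# Toy data (made up): two parks
toy = pd.DataFrame({
    "Park_ID": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["1990-01-01", "1990-01-02", "1990-01-03", "1990-01-01", "1990-01-02"]
    ),
    "Pressure": [10.0, 11.0, 13.0, 20.0, 18.0],
})

out = add_day_diffs(toy, ["Pressure"])
```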
Step 5: External Weather Data
One thing which I spent a lot of time on and failed with was trying to apply external weather data. Because I did not know where the parks were (and the admins refused to tell me :P), I wrote a scraper to go through the weather channel’s website and download past data going back to the nineties about a bunch of major cities in India.
Once I had this data, I tried to add the additional data such as temperature and rainfall (data which was not included in the original dataset), city by city, to my model to see if it improved. If it improved, that would mean that I had found the location of the parks. However, none of the data seemed to match, and I hit a dead end.
However, I don’t think that time was wasted, because there was no way to know that it wouldn’t work without trying. You mustn’t be afraid of trying things that will most likely fail. The people who win are the ones who had more failed features than you, but kept going.
Step 6: Ensembling
The last step of any competition for me is to make an ensemble model or meta-model. For this model I went for a very simple ensemble. My first model was a standard XGBoost, with tuned parameters.
In addition, I used a “bagged” neural network in Keras. After every epoch, I saved the weights to disk, and after the network finished, I took all the epochs with less than a certain validation loss, and averaged them to make my final neural network predictions.
At the end, I took a simple mean of my two models, and used this as the final submission. Sometimes simple is better :), although I would have loved to make a 3-layer meta-model with hundreds of base models if I had the time — it’s actually surprisingly fun.
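The averaging steps can be sketched with plain NumPy. This is a simplified illustration, not the competition code: it assumes the predictions from each saved epoch have already been computed and stored as rows of an array, and that "averaging" means averaging predictions rather than weights. The function name `average_good_epochs`, the threshold and all the numbers are made up.

```python
import numpy as np

def average_good_epochs(epoch_preds, epoch_val_losses, max_val_loss):
    """Average the predictions of every epoch whose validation loss is
    below `max_val_loss`. `epoch_preds` has shape (n_epochs, n_samples)."""
    preds = np.asarray(epoch_preds, dtype=float)
    losses = np.asarray(epoch_val_losses, dtype=float)
    keep = losses < max_val_loss
    if not keep.any():
        raise ValueError("no epoch is below the validation-loss threshold")
    return preds[keep].mean(axis=0)

# Toy numbers (made up): four epochs, two test rows
epoch_preds = [[100, 200], [110, 210], [130, 190], [300, 400]]
epoch_val_losses = [120, 110, 112, 150]

# Only the second and third epochs beat the threshold of 115
nn_preds = average_good_epochs(epoch_preds, epoch_val_losses, max_val_loss=115)

# Final blend: a simple mean of the two models' predictions
xgb_preds = np.array([130.0, 210.0])
final_preds = (xgb_preds + nn_preds) / 2
```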
Why did you do so well on the Private leaderboard?
This is still a mystery, even to me. However, it may have been one of a few things. Firstly, I did not overfit the public leaderboard. Trust your validation as much as the public leaderboard, however tempting it may be to chase the leaderboard score. Another reason may be that I did things differently from the other teams. I did not do any data cleaning (I rarely do this), and I did not remove any outliers, as it goes against my philosophy that all data is useful data.
I think the most likely reason is how I handled NaNs. I did not do any clever imputation, instead leaving them as NaN for XGBoost and imputing them with 0 for my neural network. The intention of this was to allow those models to learn different behaviour in the case of NaNs, for example making more conservative predictions or relying less on those features.
In the end, the best thing about the competition was that I learnt a lot, and I got to talk to the wonderful AnalyticsVidhya community. Thanks guys!