One week of Machine Learning madness with HackerRank: Part 1
Since the beginning of my journey in Data Science and Machine Learning, I have been actively looking for opportunities to apply what I learn from MOOCs and books to (almost) real data and problems. Naturally, I have come across the most famous Machine Learning competition platform, Kaggle.
One issue I soon ran into was that many of its hosted competitions require powerful hardware, or at least very smart use of it, meaning that much of the effort goes into fitting the modeling framework to the available computing power.
This made me look for alternatives. One platform that piqued my interest was HackerRank. HackerRank isn’t exactly a Machine Learning competition platform like Kaggle. Instead, it’s a host for coding contests, and one of its (sub)domains is Machine Learning.
The contest that I participated in was the Machine Learning CodeSprint: a one-week competition composed of two separate challenges, a binary classification problem and a recommendation problem. The final submissions had to include not only the results, but also (readable) source code and documentation, making an already tight time frame even tighter. If you are interested in the classification problem, keep reading; otherwise, head to part 2 of the series (to be released soon!).
For those of you who are still here, let’s get started!
The first challenge is “Predict Email Opens”, and as the name suggests, the objective is to predict whether a HackerRank user will open a given email. For this task, we have access to a training set and a test set, both containing information about the email sent and the user who received it. The datasets have 48 features in common, with the training set having 6 additional features, including the target variable. The features can be divided into the following categories:
- Profile features: basic user information such as the user ID, whether the account has been verified, and the date the user signed up;
- Email features: email information like the email ID, email category and the date it was sent;
- Action features: information about a user’s behavior, such as the number of logins, posts, and contests. Most of these are split into multiple time frames (1, 7, 30 and 365 days), and they make up the majority of the features;
- Target features: features associated with the target variable (opened), such as the date the email was opened, whether the user clicked a link in it, and whether they unsubscribed from that email. These are only available in the training set.
(Light) Data Exploration and Feature Selection
The most important lesson I have learned from ML competitions is that you should begin working from the end. Wait, what? Yes, that’s right. The first thing to do is to set up the end of the modeling framework. In other words, your validation strategy.
When designing prediction models for competitions, the one thing we have to make sure of is that our model will generalize to the test set. To accomplish this, we inspect the test data looking for possible garbage features and distortions. For this challenge, I found 12 features that had less than 1% variation in the test set; for some, only a single value was present. Discarding these variables makes the model more robust and also reduces training time, which is very valuable given the one-week time frame.
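As a sketch of this filtering step (the helper function, the 1% threshold, and the toy columns below are my own illustration, not the contest’s actual data), flagging near-constant features with pandas could look like this:

```python
import pandas as pd

def low_variation_features(df, threshold=0.01):
    """Return columns whose most frequent value covers more than
    (1 - threshold) of the rows, i.e. less than `threshold` variation."""
    flagged = []
    for col in df.columns:
        top_freq = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_freq > 1.0 - threshold:
            flagged.append(col)
    return flagged

# Toy frame: `constant` has a single value, `near_constant` varies in <1% of rows
toy = pd.DataFrame({
    "constant": [1] * 1000,
    "near_constant": [0] * 995 + [1] * 5,
    "useful": list(range(1000)),
})
print(low_variation_features(toy))  # ['constant', 'near_constant']
```

The same check is run on the test set rather than the training set, since it is generalization to the test data that we care about.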
One quirk that I discovered was regarding one of the email features, the “email category”, as shown in the following histogram:
There were 18 different categories, with category 15 being the most common in the training set. Surprisingly, this category was absent from the test set, alongside category 13. What to do, then? I tried the following:
- Transform missing categories in the test set into NA
- Remove cases with missing categories in the test set entirely
- Remove the email category feature
- Do nothing
My initial bet was on the third option, but it turned out that “do nothing” led to the best results. In this case it made little difference, but this is the kind of distortion we must be wary of, as it could negatively impact our performance in the contest rankings.
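Spotting this kind of mismatch is a quick set-arithmetic check (the frames and the column name below are made-up stand-ins for the real data):

```python
import pandas as pd

# Hypothetical stand-ins for the competition's train and test frames
train = pd.DataFrame({"mail_category": [15, 15, 13, 2, 7]})
test = pd.DataFrame({"mail_category": [2, 7, 7, 4]})

train_only = set(train["mail_category"]) - set(test["mail_category"])
test_only = set(test["mail_category"]) - set(train["mail_category"])
print("only in train:", train_only)  # {13, 15}
print("only in test:", test_only)    # {4}
```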
Having removed the garbage, the next step is to come up with some new features. For problems involving interactions between two or more actors, in this case a user and an email, computing the mean of the target variable for each individual actor, which translates to P(target|actor), usually gives good results. You can also try P(target|actor_1, actor_2, …, actor_n); however, due to the number of combinations, this requires an amount of data that was not available in this competition. Some of these features are summarized in the following graphs:
It’s clear that a user-email combination gives almost perfect separation, but don’t be deceived: this is because many of the pairs are unique. The same goes for user-email category. P(target|user), on the other hand, seems very promising, while P(target|email) looks like it has some significance, but not as much.
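A minimal sketch of the P(target|actor) idea with pandas groupby (the data and column names are invented for illustration; in a real pipeline these means should be computed out-of-fold, otherwise the target leaks into its own feature):

```python
import pandas as pd

# Toy interaction data; column names are illustrative, not the contest's
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "mail_id": [10, 11, 12, 10, 11, 10],
    "opened":  [1, 1, 0, 0, 0, 1],
})

# P(opened | user): the mean of the target per user, merged back as a feature
p_user = df.groupby("user_id")["opened"].mean().rename("p_open_user")
df = df.merge(p_user, on="user_id")

print(df[["user_id", "p_open_user"]].drop_duplicates())
```

The same pattern with `groupby("mail_id")` gives P(target|email), and grouping on both keys gives the (mostly unique, hence deceptive) user-email combination.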
Another set of features to generate are time-derived features. Since we have features like when the email was sent and when a user registered, we can break them into hour, day of the month, day of the week, month, and so on, to capture any inherent seasonality.
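With pandas this decomposition is a few accessor calls (the timestamps and column names here are invented for illustration):

```python
import pandas as pd

# Illustrative timestamps; the contest's actual column names differ
df = pd.DataFrame({"sent_time": pd.to_datetime([
    "2017-01-09 08:30:00", "2017-01-14 22:15:00", "2017-02-01 12:00:00",
])})

dt = df["sent_time"].dt
df["sent_hour"] = dt.hour
df["sent_day"] = dt.day
df["sent_weekday"] = dt.dayofweek  # Monday=0 .. Sunday=6
df["sent_month"] = dt.month

print(df)
```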
With the data clean and plenty of new features, we can now move on to training the model.
Model Training and Results
There are many options when it comes to supervised classification algorithms. I chose XGBoost due to its history of winning classification contests. Tuning its endless parameters is a little bothersome, but it outperforms most alternatives while being very fast. Later I discovered auto-sklearn, a framework that combines hyperparameter optimization with model ensembles. I did not test it, but if it delivers what it promises, it would certainly be a worthy rival for a standalone XGBoost model.
Beyond raw performance, XGBoost has a couple of neat features: it accepts a great number of evaluation metrics and can output the importance of the model features. The competition metric was the F1 score, which is not natively supported by XGBoost, so I picked AUC as a replacement, since it also handles class imbalance well.
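One practical wrinkle with this substitution: AUC is threshold-free, while F1 needs hard 0/1 predictions, so a model optimized for AUC still needs a decision threshold before submitting. A sketch with synthetic scores standing in for model output (none of this is the contest’s actual data):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Noisy scores correlated with the labels (stand-in for model predictions)
y_score = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, size=1000), 0, 1)

print("AUC:", roc_auc_score(y_true, y_score))

# F1 needs a hard threshold; scan a grid and keep the best one
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, y_score >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print("best threshold:", best_t, "F1:", max(f1s))
```

The threshold is chosen on validation data only, for the same leakage reasons as any other tuned quantity.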
After all these steps I finally moved on to training the prediction model to make a submission. For the validation method I chose a holdout dataset instead of cross-validation to make iterations faster. After some hours spent (in vain) trying to optimize the hyperparameters, the results were:
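The holdout setup itself is straightforward with scikit-learn; here is a sketch on synthetic data, with logistic regression as a stand-in for the actual XGBoost model (which I am not assuming is installed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

# A single holdout split: faster iteration than k-fold cross-validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("train AUC:", roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))
print("valid AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```

The trade-off is that a single split gives a noisier score estimate than cross-validation, which matters when the train/validation/leaderboard gaps are as large as the ones below.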
- Training score: 0.76
- Validation score: 0.70
- Public Leaderboard score: 0.58
- Private Leaderboard score: 0.59
For this challenge, the above score put me in place 36/411. Not amazing, but not bad either: I was only 0.01 short of the winning solution. This small difference also suggests that the data did not carry much signal to begin with.
There is also some clearly bad overfitting going on, as can be seen from the gap between the scores. An interesting concept to explore in future projects is adversarial validation. Unfortunately, I only found out about it after the competition; it would have helped in crafting a more robust validation framework.
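For completeness, here is one common way to sketch adversarial validation (on synthetic, deliberately shifted data, not the competition’s): label each row by whether it comes from the train or the test set and see how well a classifier can tell them apart. An AUC near 0.5 means the two sets look alike, so a random holdout is trustworthy; a high AUC signals distribution shift, and the validation set should then be built from the training rows that most resemble the test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
train_X = rng.normal(0.0, 1.0, size=(1000, 5))
test_X = rng.normal(0.5, 1.0, size=(1000, 5))  # deliberately shifted

# Label rows by origin and try to tell them apart
X = np.vstack([train_X, test_X])
origin = np.array([0] * len(train_X) + [1] * len(test_X))

probs = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, origin, cv=3, method="predict_proba")[:, 1]
auc = roc_auc_score(origin, probs)
print("adversarial AUC:", auc)  # near 0.5 would mean train and test look alike
```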
This is it for now, see you in part 2!