How we won a silver medal on Kaggle

In February our team Pendulibrium (Ilija Lalkovski, David Simeonovski, and Simona Ivanova) took part in the Santander Customer Transaction Prediction competition on Kaggle. We finished in 148th place out of more than 8800 registered teams, ending up in the top 2% and winning a silver medal. In this post, we summarize everything we tried and learned during this competition.

The competition

The competition was organized by the largest Spanish bank, Santander, and hosted on Kaggle. It lasted for 8 weeks and the goal was to build a binary classifier to predict which customers will make a specific transaction in the future. Very little info was given on the nature of the problem and the data, which severely limited the ability to use domain-specific knowledge.

Because of the straightforward nature of the problem, it quickly became the most popular competition on Kaggle to date, surpassing the previous record of 7,198 competitors.

The data

We were given 2 datasets: a labeled training set and an unlabeled testing set. Both datasets contained 200,000 rows and 200 columns of numeric data.

The data was fully anonymized: the ID codes were replaced with generic “train_x” and “test_x” names, and the columns were simply named “var_x”. Such anonymization is necessary when sharing private banking data, but it also made our job significantly harder.

Exploratory Data Analysis (EDA)

The first step towards our solution was to familiarize ourselves with the data. We analyzed the distributions of the variables and did some statistical analysis.

From the distribution of the target variable, we can see that the dataset is imbalanced: only 10% of the samples belong to the positive class.

This can be problematic because most machine learning models tend to develop a bias for the majority class when trained on imbalanced data.

Although there are techniques to address this problem, such as undersampling or oversampling the data, applying them to this dataset had a negative effect on our performance, so we decided to work with the original imbalanced data.
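
As a quick sanity check, the class balance can be read directly from the training labels. A minimal sketch, assuming the competition’s train.csv with its target column:

```python
import pandas as pd

# Labeled training set: 200,000 rows, 200 "var_*" columns plus "target"
train = pd.read_csv("train.csv")

# Fraction of samples per class: roughly 90% zeros and 10% ones
print(train["target"].value_counts(normalize=True))
```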

Next, we took a look at the distributions of the 200 variables in the training and test sets.

All 200 variables followed approximately normal distributions, and there were no significant differences between the training and test distributions. This is a good sign, because models that perform well on the training data should perform similarly on the test data.
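
One way to quantify this comparison (not necessarily how we did it at the time) is a two-sample Kolmogorov-Smirnov test per column. A small sketch, assuming the train and test DataFrames are loaded from the two CSVs:

```python
from scipy.stats import ks_2samp

var_cols = [c for c in train.columns if c.startswith("var_")]

# KS statistic per variable: values close to 0 mean the train and test
# distributions of that column are very similar
ks_stats = {c: ks_2samp(train[c], test[c]).statistic for c in var_cols}
print(sorted(ks_stats.items(), key=lambda kv: kv[1], reverse=True)[:5])
```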

Finally, we wanted to see whether there were differences between the distributions of the two target classes.

Some variables have identical distributions for both classes, while others show obvious differences. These differences can be useful, since the models can exploit them to better discriminate between the two classes.
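
A quick way to spot such variables, sketched here for illustration, is to compare the per-class means of every column:

```python
# Mean of each variable per target class; the largest gaps point to the
# variables whose distributions differ most between the two classes
class_means = train.groupby("target")[var_cols].mean()
gaps = (class_means.loc[1] - class_means.loc[0]).abs()
print(gaps.sort_values(ascending=False).head())
```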

We also analyzed the correlation between the variables by calculating the Pearson correlation coefficient for each pair of variables.

What we noticed surprised us: the variables were almost completely uncorrelated, with the maximum pairwise correlation below 1%. This rarely happens by itself in real datasets, which led us to believe that the original data had been transformed with some technique such as PCA, probably with the intent to further anonymize the information.
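
A sketch of that check with pandas, assuming the train DataFrame from before:

```python
import numpy as np

var_cols = [c for c in train.columns if c.startswith("var_")]

# Pearson correlation matrix of the 200 original variables, as absolute values
corr = train[var_cols].corr().abs().to_numpy()

# Zero out the diagonal and look at the largest pairwise correlation,
# which on this dataset stays below 0.01
np.fill_diagonal(corr, 0.0)
print(corr.max())
```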

Basic ML models

In order to get a feel for which classification models work well with our data, we trained a bunch of out-of-the-box models and evaluated them on a 10% holdout validation set.

The models based on gradient tree boosting such as XGBoost and LightGBM showed the best performance, achieving AUC scores of 89.8% with minimal parameter tuning.

With some careful parameter tuning and absolutely no feature engineering, one could easily achieve an AUC score of 90.0%. However, breaking the 90.0% barrier proved a difficult challenge.
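
For illustration, here is a minimal LightGBM baseline evaluated with AUC on a 10% holdout split. The hyperparameters are illustrative placeholders, not our tuned configuration:

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

var_cols = [c for c in train.columns if c.startswith("var_")]
X_tr, X_val, y_tr, y_val = train_test_split(
    train[var_cols], train["target"], test_size=0.1,
    stratify=train["target"], random_state=42,
)

# Illustrative hyperparameters only (not our tuned values)
model = lgb.LGBMClassifier(
    n_estimators=2000, learning_rate=0.05, num_leaves=4, colsample_bytree=0.1
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric="auc")

print("holdout AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```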

Feature engineering

In order to improve the performance of our models, we tried engineering new features from the original 200 variables. We used various math operations (multiplication, addition, power) as well as statistical operations (mean, rank, round), and trained a LightGBM model on the enhanced datasets for validation.
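
To give a flavor of the kinds of transformations we experimented with, here is a sketch with a few illustrative examples (not our exact feature set):

```python
def add_basic_features(df, var_cols):
    """Add a few simple engineered features to a copy of df."""
    out = df.copy()
    # Row-wise statistics over all original variables
    out["row_mean"] = df[var_cols].mean(axis=1)
    out["row_std"] = df[var_cols].std(axis=1)
    # Per-column transformations: powers, rounding, and ranks
    for c in var_cols:
        out[f"{c}_sq"] = df[c] ** 2
        out[f"{c}_round1"] = df[c].round(1)
        out[f"{c}_rank"] = df[c].rank()
    return out

train_fe = add_basic_features(train, var_cols)
```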

We spent a few weeks trying different combinations of features and fine-tuning our models, but in the end, we didn’t manage to engineer features that would significantly improve the models from the baseline of 90.0% AUC.

An interesting feature that we discovered during this process was the distance from the centroid. If we imagine the data as points in 200-dimensional space, then the centroid is the mean of all points in the dataset per dimension. Plotting the distribution of the distances from the centroid reveals an interesting pattern.

It is obvious that samples from the positive class are, on average, further from the centroid than samples from the zero class. In other words, the larger the distance, the higher the probability of the positive class.
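
A small sketch of how such a feature can be computed with NumPy, assuming the train DataFrame from before:

```python
import numpy as np

var_cols = [c for c in train.columns if c.startswith("var_")]
X = train[var_cols].to_numpy()

# The centroid is the per-dimension mean of all training points
centroid = X.mean(axis=0)

# Euclidean distance of every sample from the centroid
train["dist_from_centroid"] = np.linalg.norm(X - centroid, axis=1)
```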

This feature seemed promising at first, but after weeks of trying to make it work, it proved useless. We were still stuck at 90.0% AUC.

Search for the magic

Most of the other teams were also stuck at 90.0% AUC, and everyone was trying to find the “magic” that would push the score forward. The discussion board was full of threads speculating about potential magic features and hints.

Some competitors who had already found the magic gave hints and directions in the discussions. One particularly helpful hint came from the Kaggler Chris Deotte, who pointed out that the magic could be achieved with standard “textbook” techniques, by analyzing the data and searching for things that look weird.

The magic

Going back to basics, we returned to EDA, and the first odd thing we noticed was that the distributions of some variables had spikes near the tails, with a higher than expected concentration of positive samples.

Taking a closer look at these spikes, we noticed an interesting property. The samples from the zero class have mostly unique values, while samples from the positive class have lots of duplicate values.

Could this be the magic everyone was searching for?

Frequency encoding

One way to encode this property into our dataset is by creating frequency encoding features. For each variable, we create an additional feature that encodes, for every row, how many times that row’s value occurs in the dataset.

This technique is often used on categorical variables and can help the models in cases where the target variable is somewhat related to the frequency of the values. Using it on numeric data such as ours isn’t as straightforward, however, and requires some sort of grouping (binning) of the values, because otherwise all frequencies would equal 1.

In our case, we didn’t have to do this binning because the provided datasets already had all values rounded to 4 decimal places, effectively binning similar values together.

With the frequency encoding technique, we created 200 additional features, one for each of the variables. In that way, we increased the size of our dataset from 200 to 400 features in total.
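
A minimal sketch of this encoding with pandas, using var_cols as before. Counting the values over the concatenated train and test sets is one possible choice shown here, not a prescribed one:

```python
import pandas as pd

def add_frequency_features(train, test, var_cols):
    """For every var_*, add a var_*_count column with that value's frequency."""
    full = pd.concat([train[var_cols], test[var_cols]], axis=0)
    for c in var_cols:
        counts = full[c].value_counts()
        train[f"{c}_count"] = train[c].map(counts)
        test[f"{c}_count"] = test[c].map(counts)
    return train, test

train, test = add_frequency_features(train, test, var_cols)
```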

When trained on this new dataset, both LightGBM and XGBoost managed to achieve scores of more than 91.0% AUC, which is a significant improvement from the previous 90.0% baseline. We finally cracked the magic!

Why frequency encoding works

To get a better idea of why frequency encoding improved our score, we can analyze the splits that the LightGBM model makes.

What is immediately noticeable from these heatmaps is the very steep change in the predicted probabilities at the split between count=1 and count=2. This indicates that the model makes use of the uniqueness of the values, classifying unique values as zeroes with high certainty.

The model trained with the magic features is also way more confident about the positive classifications as indicated by the darker red color in the heatmap.
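
For reference, one way to inspect those splits programmatically is to dump the trees of a fitted LGBMClassifier (called model in this sketch, trained on the frequency-encoded data) and look at the thresholds chosen for the count features:

```python
# Dump every split of the fitted LightGBM model into a DataFrame
splits = model.booster_.trees_to_dataframe()

# Keep only splits on the frequency features; a threshold of 1.5 on an
# integer count is exactly what separates count == 1 from count >= 2
count_splits = splits[splits["split_feature"].str.endswith("_count", na=False)]
print(count_splits["threshold"].value_counts().head())
```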

As to why the uniqueness of the values is a good predictor for the target class, nobody knows for sure. One theory is that the positive samples have been augmented by oversampling from some smaller set of values, which would explain why most values are repeating.

Whatever the case may be, the targets were leaked in the frequency of the values, and all top teams made use of this property.

Our silver medal solution 🥈

On the final day of the competition, we discovered that multiplication interactions between the original variables and their respective frequencies improved our score by an additional 0.5%.

For our final submission, we used an ensemble of LightGBM and XGBoost models trained on the 200,000 x 600 training dataset (original features + frequencies + multiplication interactions). Ensembling is a technique for combining the outputs of multiple different models with the purpose of reducing the variance of the single models and improving their ability to generalize on unseen data. Although there are many complicated techniques for ensembling models, we found that a simple average of the outputs worked well enough in our case.
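
A sketch of the interaction features and the averaging step. The variables lgb_model and xgb_model below are hypothetical placeholders for the already-fitted models:

```python
import numpy as np

# Multiplication interactions between each original variable and its frequency,
# giving 600 features in total (originals + counts + interactions)
for c in var_cols:
    train[f"{c}_x_count"] = train[c] * train[f"{c}_count"]
    test[f"{c}_x_count"] = test[c] * test[f"{c}_count"]

feature_cols = [c for c in test.columns if c.startswith("var_")]

# Simple ensemble: a plain average of the two models' predicted probabilities
final_pred = np.mean(
    [m.predict_proba(test[feature_cols])[:, 1] for m in (lgb_model, xgb_model)],
    axis=0,
)
```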

Our score on the final leaderboard was 91.35% AUC, earning us the 148th place.

Gold medal solutions 🥇

The top finishers in the competition had scores of around 92.5% AUC, substantially better than our 91.35%. The key insight that we missed was to exploit the independence of the original variables and prevent interactions between them.

Since all 200 variables are independent, searching for dependencies and allowing interactions between them could only result in overfitting and finding false dependencies that would hurt the performance of the model on unseen data.

The best way to prevent these interactions is to train 200 separate classifiers, each on a single variable and its related frequency features, and then combine the 200 outputs of these classifiers into a single final prediction using a top-level classifier.
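
A rough, heavily simplified sketch of the idea (the actual gold medal solutions were considerably more elaborate): one tiny model per variable, trained only on that variable and its count, with the 200 out-of-fold predictions stacked by a simple top-level classifier.

```python
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

y = train["target"]
per_var_preds = []

for c in var_cols:
    # Each model sees only one variable and its frequency, so no
    # cross-variable interactions can ever be learned
    clf = lgb.LGBMClassifier(n_estimators=100, num_leaves=4, learning_rate=0.1)
    preds = cross_val_predict(
        clf, train[[c, f"{c}_count"]], y, cv=5, method="predict_proba"
    )[:, 1]
    per_var_preds.append(preds)

# Combine the 200 per-variable predictions with a top-level classifier
stacked = np.column_stack(per_var_preds)
meta_model = LogisticRegression(max_iter=1000).fit(stacked, y)
```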

The 1st place team used some additional tricks such as custom neural network architectures and pseudo-labeling to eke out the last bit of performance from their models, but generally, the top teams’ solutions were similar and revolved around frequency encoding features and preventing variable interactions.

Conclusion

Although most teams (including us) were relying on gradient tree boosting methods such as LightGBM and XGBoost due to their ease of use and high performance, in the end, the winning solution was a carefully constructed neural network.

This confirms the dominance of neural networks over other ML models, but it also emphasizes the need for appropriate feature engineering and a proper understanding of the algorithms, the data, and the problem at hand.

For us, this was an exciting and enlightening experience. In the 8 weeks that we competed, we learned a ton of new approaches, algorithms, and concepts, while having lots of fun along the way. We strongly recommend Kaggle competitions to everyone looking to learn about state-of-the-art machine learning and data science, and we certainly plan to participate in similar challenges in the future.
