Kaggle Home Credit, a silver solution (Top 5%)

germayne
Published in eat-pred-love
8 min read · Dec 21, 2018

Overview

Since the competition ended 3 months ago, a heavier workload and more real-life commitments have kept me busy, so this write-up is long overdue. But better late than never. I briefly posted my thoughts previously here. In this article, I will not go through the details (I believe there are many gold solutions out there) but will instead share what I did differently.

Back to business: this competition required Kagglers to build a credit-scoring model of sorts. The main objective is to classify whether a given individual had difficulty repaying the loan (1) or not (0). The fun part is that we are given lots of data:

Source: https://www.kaggle.com/c/home-credit-default-risk/data

Essentially, we have the main data (application) as well as different information about the users (credit card, previous loans, installments, etc.). In other words, we have to find methods to combine them, since the target variable lives in the main application dataframe. This also leaves room for creativity when it comes to feature engineering.

Feature Engineering

Early whiteboarding

Feature engineering is where we data scientists express our creativity.

The feature engineering process can be split into two phases:
1) Side Data

2) Main Data

Side data includes the bureau, credit card, installments and cash tables, as well as the previous loans. They have to be aggregated before merging into the main data (application_train/test). For example, the side data may contain multiple entries for customer A, while the main data has exactly one row per customer.

A look at preprocessing.py. Feature engineering of side data.

Some of these tables are time series, so we can also take into account things like the latest or earliest K records before aggregation.

Aggregation usually involves the standard statistics: mean, mode, median, sum, variance/std, kurtosis and skew, as well as linear regression. The linear regression variant regresses the grouped data against a time index and uses the beta coefficient as a feature.
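To make this concrete, here is a minimal sketch of one such aggregation, assuming a hypothetical installments side table keyed on SK_ID_CURR with an AMT_PAYMENT column (the actual pipeline lives in preprocessing.py):

```python
import numpy as np
import pandas as pd

def slope(series):
    """Beta coefficient from regressing the values on a simple time index."""
    y = series.dropna().to_numpy(dtype=float)
    if len(y) < 2:
        return np.nan
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]

# `installments` has multiple rows per SK_ID_CURR; aggregate it down to one row
agg = installments.groupby("SK_ID_CURR")["AMT_PAYMENT"].agg(
    ["mean", "median", "sum", "std", "skew", slope]
)
agg.columns = [f"INS_AMT_PAYMENT_{c.upper()}" for c in agg.columns]

# Merge back onto the one-row-per-customer main table (application_train/test)
main = application.merge(agg, left_on="SK_ID_CURR", right_index=True, how="left")
```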

An illustrated example of side data. Multiple entries for a customer

Apart from the multiple numerical entries that we need to aggregate, there are also various categorical features. Since the mapping to the main data is many-to-one, we need methods to aggregate them as well. One option is to one-hot-encode them and take the average; another is to target-encode them and average the encoded values.
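As a rough illustration of the one-hot-and-average idea (the prev table and column name here are just for illustration):

```python
import pandas as pd

# `prev` is a side table with one categorical column and the customer key
dummies = pd.get_dummies(prev[["SK_ID_CURR", "NAME_CONTRACT_STATUS"]],
                         columns=["NAME_CONTRACT_STATUS"])

# Averaging the one-hot columns per customer gives the share of each category,
# collapsing the many-to-one mapping into a single row per customer
cat_agg = dummies.groupby("SK_ID_CURR").mean()
main = application.merge(cat_agg, left_on="SK_ID_CURR", right_index=True, how="left")
```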

Once feature engineering is done on the side data, we can apply the same ideas to the main data.

Feature Embeddings

Similar to word2vec, we can also learn embedding spaces for some of the categorical variables. For more information, you can refer to this paper. By training on all the data with embedding layers for the categorical variables, we can later extract the embedding weights as features for our main models (XGB/LGBM).

At the same time, the feed-forward NN used for the embedding extraction can also serve as an additional model for ensembling. However, it was not really useful (not surprising, since I did not do much tuning of its overall architecture).

The structure of the NN is as follows (a rough Keras sketch appears after the list):

  • Multi-head input model taking in the different categorical variables
  • Embedding layer for each input to reduce the dimension space
  • Flatten layer after each embedding layer
  • Usually two large dense ReLU layers (1000 and 500 units)
  • End with a sigmoid
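A minimal sketch of this architecture, written against tf.keras; the embedding sizes and helper name are illustrative rather than my exact settings:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_embedding_nn(cat_cardinalities, n_numeric):
    """Multi-head model: one embedding per categorical input, plus a numeric input."""
    cat_inputs, flattened = [], []
    for name, cardinality in cat_cardinalities.items():
        inp = layers.Input(shape=(1,), name=name)
        emb = layers.Embedding(cardinality, min(50, (cardinality + 1) // 2),
                               name=f"emb_{name}")(inp)
        flattened.append(layers.Flatten()(emb))
        cat_inputs.append(inp)

    num_input = layers.Input(shape=(n_numeric,), name="numeric")
    x = layers.Concatenate()(flattened + [num_input])
    x = layers.Dense(1000, activation="relu")(x)
    x = layers.Dense(500, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=cat_inputs + [num_input], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model

# After fitting, the per-category embedding rows can be extracted as features
# for the boosted-tree models, e.g. model.get_layer("emb_<col>").get_weights()[0]
```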

Feature Interactions

Feature interactions are something I focused on as well. There are two main ways in which I induced feature interactions in the model. The first is to compute a target/mean encoding grouped by a collection of variables, i.e. if I need the interaction of V1 and V2, I can compute the mean encoding of the V1-V2 combination.

I used a custom function because I wanted more flexibility and control during the genetic algorithm stage (see below). Otherwise, you can find a module that implements it here (I have not used it personally, so I am not sure if it behaves the same).

The target encoding is introduced inside each cross-validation fold, after the split and before modeling. This ensures there is no leakage from the validation set, since the encoding is computed only from the training portion. As an additional safeguard, I split the training portion into another inner 20 folds before computing the encoding.
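A sketch of how such a fold-safe encoder could look; the function and argument names are illustrative stand-ins, not the exact code I ran:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def fold_safe_target_encode(train, valid, cols, target, n_inner=20, seed=42):
    """Mean-encode the combination of `cols` using only the training fold.

    The training fold is itself encoded out-of-fold with an inner split,
    so no row ever sees its own target value.
    """
    key = train[cols].astype(str).agg("-".join, axis=1)
    valid_key = valid[cols].astype(str).agg("-".join, axis=1)
    prior = train[target].mean()

    train_enc = pd.Series(np.nan, index=train.index)
    inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in inner.split(train):
        means = train.iloc[fit_idx].groupby(key.iloc[fit_idx])[target].mean()
        train_enc.iloc[enc_idx] = key.iloc[enc_idx].map(means).to_numpy()
    train_enc = train_enc.fillna(prior)

    # The validation fold only ever sees statistics from the training fold
    valid_enc = valid_key.map(train.groupby(key)[target].mean()).fillna(prior)
    return train_enc, valid_enc
```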

The second way to induce feature interactions is to make use of libFM. The paper is described here. Of course, I was inspired by CPMP for this approach. Using Keras, we can build an NN and extract embeddings trained on the features whose interactions we want to induce. This concept borrows heavily from factorization machines used in recommendation systems.

We will extract these paired interactions as:

Factor + Bias

Another plus is the flexibility to define the number of latent factors we want. With the factors and biases extracted, we can add them as features to the main model. I created several interactions and extracted the factor + bias terms as features for various models, and I include them later in the stacking process.
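A rough Keras sketch of this factorization-machine-style setup for one pair of categorical features; the layer names and number of factors are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fm_pair(card_a, card_b, n_factors=8):
    """FM-style model for one pair of categoricals:
    sigmoid(bias_a + bias_b + <factor_a, factor_b>)."""
    in_a = layers.Input(shape=(1,), name="feat_a")
    in_b = layers.Input(shape=(1,), name="feat_b")

    factor_a = layers.Flatten()(layers.Embedding(card_a, n_factors, name="factor_a")(in_a))
    factor_b = layers.Flatten()(layers.Embedding(card_b, n_factors, name="factor_b")(in_b))
    bias_a = layers.Flatten()(layers.Embedding(card_a, 1, name="bias_a")(in_a))
    bias_b = layers.Flatten()(layers.Embedding(card_b, 1, name="bias_b")(in_b))

    interaction = layers.Dot(axes=1)([factor_a, factor_b])
    logit = layers.Add()([interaction, bias_a, bias_b])
    out = layers.Activation("sigmoid")(logit)

    model = keras.Model([in_a, in_b], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# After fitting on the target, the per-category factors and biases become features:
# model.get_layer("factor_a").get_weights()[0]   # shape (card_a, n_factors)
# model.get_layer("bias_a").get_weights()[0]     # shape (card_a, 1)
```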

Modeling & Cross-validation

My main go-to algorithm in this competition was LightGBM. Given the large dimensionality of the dataset (especially after feature engineering), I preferred the speed that LGBM offers. One thing I found out late was that from LGBM version 2.1.2 onwards, there is a parameter known as boost_from_average, which defaults to True. As such, most of my models from AWS were trained with that setting. I had to retrain some of the models with it turned to False, but this actually gave me even more diverse models, adding to the diversification needed for stacking later on.
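For illustration, flipping that single flag is enough to produce a second, more diverse model; the other parameters below are placeholders, not my actual configuration:

```python
import lightgbm as lgb

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.02,        # placeholder values
    "num_leaves": 32,
    "boost_from_average": True,   # the default from LightGBM 2.1.2 onwards
}

model_true = lgb.train(params, lgb.Dataset(X_train, label=y_train),
                       num_boost_round=1000)
model_false = lgb.train({**params, "boost_from_average": False},
                        lgb.Dataset(X_train, label=y_train),
                        num_boost_round=1000)
```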

I also tried XGBoost early on in the competition but ended up sticking with LightGBM throughout. CatBoost was not used since I have my own encoding pipeline, but I might try it out in the future (CatBoost's core advantage over the others is its built-in categorical encoding).

Not forgetting Keras 2.0, which I used for the embedding weights as well as libFM. I quite like the new standard way of defining a neural network.

As for cross-validation, I stuck with the usual 5-fold CV. There is no particular reason it is preferred over stratified 5-fold CV apart from the fact that it moved in the same direction as the leaderboard (positive correlation). This meant I could trust my local 5-fold CV and not overfit to the public leaderboard.

An important point is to use the same validation split across models, in order to have:

  1. Fair comparison across models
  2. No leakage when stacking later on (since the split affects the OOF predictions)

I also spent time in this competition setting up my own personal modeling code structure (including the validation structure). This helps in the future, and especially later on for the genetic selection, where having more control makes it easier to automate the modeling process.
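In practice this just means generating the folds once with a fixed seed and reusing them for every model; fit_model below is a stand-in for whichever model is being trained:

```python
import numpy as np
from sklearn.model_selection import KFold

# Build the 5-fold split once and reuse it everywhere, so the out-of-fold
# predictions of every model line up exactly at the stacking stage
folds = list(KFold(n_splits=5, shuffle=True, random_state=2018).split(X))

oof = np.zeros(len(X))
for train_idx, valid_idx in folds:
    model = fit_model(X.iloc[train_idx], y.iloc[train_idx])      # any of the models
    oof[valid_idx] = model.predict_proba(X.iloc[valid_idx])[:, 1]
```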

Genetic Selection

This was the main reason for my USD 100 AWS bill at the end of the competition.

Bot timestamps (collected on Telegram) of each genetic run on AWS

This was also inspired by a fellow Kaggler, Jacek, who briefly described it in a previous competition (Porto). The best Kagglers are often the ones who learn from the best. Based on my understanding of his description, I implemented my own version. The idea is very simple:

  1. Select a base of good features. I believe I started with 160 features. There are many ways to do this, either via forward selection or by feature importance gain. I ran a base model with LightGBM on the entire feature space, extracted the feature importances, and kept those with importance gain > 1000.
  2. Create a master list of features (3000) as well as interactions (any possible interactions). I played it pretty safe and only interacted non-categorical with non-categorical, and categorical with categorical.
  3. Generation 1: Randomly select features not in your base model (160) from the master list and interaction list.
  4. Run Generation 1 about 20 times and select the best 2 models.
  5. Take the total distinct features of the 2 models. Simply: Distinct(Features_Model_1 + Features_Model_2)
  6. Create a “mutation layer” for (5): add random features from the master list of features (3000) as well as interactions (any possible interactions). This layer also keeps only a proportion of the new subset of features (I set P = 0.8). The reason is that I do not want the feature subset to grow with every generation, so I introduce a form of ‘dropping’ features.
  7. Generation 2: Repeat (3) and (4) while using the product from (6) as the base.
  8. Run this for N generations, where each generation takes a good subset of features and adds some random elements.

Basically, I wanted an automated approach to selecting the best features. This approach lets me systematically try different feature interactions, and it is largely hands-off: the algorithm runs on AWS while I monitor the progress on my phone (I ran it mostly overnight and through work), and I could shut the instance down easily via the AWS mobile app.
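A condensed sketch of one generation of this loop; evaluate_with_lgbm_cv, master_feats, interactions and the counts are illustrative stand-ins for my actual setup:

```python
import random

def run_generation(base_feats, master_feats, interactions,
                   n_candidates=20, n_add=40, keep_prob=0.8):
    """One generation: spawn candidate feature sets, score them with 5-fold CV,
    take the union of the best two, then randomly drop features (the 'mutation')."""
    scored = []
    pool = master_feats + interactions
    for _ in range(n_candidates):
        extra = random.sample(pool, n_add)                     # random new features
        candidate = list(dict.fromkeys(base_feats + extra))    # de-duplicated
        scored.append((evaluate_with_lgbm_cv(candidate), candidate))

    best_two = sorted(scored, key=lambda t: t[0], reverse=True)[:2]
    union = list(dict.fromkeys(best_two[0][1] + best_two[1][1]))

    # Keep only a proportion so the feature subset does not grow without bound
    return [f for f in union if random.random() < keep_prob]

base = initial_base_features        # e.g. the ~160 features with gain > 1000
for _ in range(n_generations):
    base = run_generation(base, master_feats, interactions)
```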

Final Ensemble/Stacking and fingers crossed

I had the advantage of a diverse set of models run on different subsets of features and interactions. By the end of the competition, I had nearly 1000 automatically generated models.

The final solution is a blend of 2 different stacks:

Typical stacking pipeline

Stacking is where we train a meta-model on the out-of-fold predictions of each model and apply it to the test predictions. The idea is that it optimizes the weight of each model (think regression: intercept + w1 * Model_1 + w2 * Model_2 + …).

In this case, I tried using XGBoost as the stacker, but through experimentation Ridge regression/logistic regression gave the best results (CV). In the end, I decided to use Ridge regression as the main (level 1) stacker. A simplified sketch of the stacking and blending step follows the list below.

  1. Stack A using Ridge regression on a myriad of Good Models (Genetic), as well as different libFM models
  2. Stack B using Ridge regression on other sets of Good Models (Genetic) as well as different libFM models. Some of these models were re-run using LGBM’s boost_from_average = False.
  3. Apply a blend (weighted average) of 1 + 2. The blend weight is estimated using 5-fold CV (the same folds used in modeling) to maximize ROC-AUC.
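As promised, a simplified sketch of the level-1 Ridge stack and the final blend; the oof/test matrices and the stack_*_oof arrays are placeholders for predictions produced with the shared 5-fold split:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score

# oof_preds / test_preds: (n_samples, n_models) matrices of base-model predictions
stacker = Ridge(alpha=1.0)
stacker.fit(oof_preds, y_train)              # learns one weight per base model
stack_a_test = stacker.predict(test_preds)
# stack_a_oof should itself be produced out-of-fold on the same 5-fold split,
# and stack B is built the same way on its own set of base models.

# Final blend: grid-search the weight on the shared OOF predictions
weights = np.linspace(0.0, 1.0, 101)
best_w = max(weights, key=lambda w: roc_auc_score(
    y_train, w * stack_a_oof + (1 - w) * stack_b_oof))
final_test = best_w * stack_a_test + (1 - best_w) * stack_b_test
```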

Closing

Final closing placement. Total of 92 submissions

This was definitely a good experience. Not my first solo Kaggle competition, and not my last. I definitely learned a lot more when it comes to optimizing my code (as I structured it for AWS). I am also pretty content that my leaderboard position stayed stable (no overfitting).

Now, when shall I begin my journey towards a gold medal? (The last piece to becoming a Kaggle master.)
