Part I: Conducting Exploratory Data Analysis (EDA) for the Kaggle Home Credit Default Competition
Follow along as the Comet.ml team competes to win the Kaggle Home Credit Default Competition — this is the first of a series of posts on our modeling process!
In this first post, we are going to conduct some preliminary exploratory data analysis (EDA) on the datasets provided by Home Credit for their credit default risk Kaggle competition (with a 1st place prize of $35,000!).
Home Credit is a loan provider for people with little or no credit history. They use a variety of alternative data sources such as transactional or telco information to evaluate a client’s repayment abilities.
We’re going to break down our analysis into three posts:
- Exploratory data analysis (EDA), feature pre-processing, and initial modeling with LightGBM and Random Forest (this post!)
- Creating hand-engineered features from a master dataset of all available Kaggle datasets.
- Comparing model performance using the hand-engineered features versus features generated with a Keras neural network, plus a summary and final takeaways about the competition.
Whether you’re an experienced Kaggler or someone who is just starting out in Kaggle competitions, this series is for you!
All the code for this post can be found here, and model results, figures, and notes can be found in this Comet.ml public project.
Setting up the environment
First, let’s set up our experiment in Comet.ml, and grab our API key.
At Comet.ml, we help data scientists and machine learning engineers automatically track their datasets, code, experiments, and results, creating efficiency, visibility, and reproducibility.
# Import Comet.ml and pandas, and log an experiment with your API key
from comet_ml import Experiment
import pandas as pd

experiment = Experiment(api_key="YOUR API KEY", project_name="home-credit")

# Read in the data and log a hash of the dataset
df = pd.read_csv('./application_train.csv', sep=',')
experiment.log_dataset_hash(df)
Getting an initial view of the data
Let’s start by looking at the features in the application_train.csv file. This file contains 121 features and 307,511 examples. Before we begin any sort of modeling, it’s important to get a sense of the distribution of the data and the correlations between individual features.
First, let’s check the distribution of our target variable, and log that visualization to our Comet project.
import matplotlib.pyplot as plt

feature = "TARGET"
ax = df[feature].value_counts().plot(kind='bar',
                                     figsize=(15,10),
                                     color='blue')
ax.set_xlabel(feature)
ax.set_ylabel("Count")
experiment.log_figure(figure_name=feature, figure=plt)
We can see that our figure has been uploaded to the Graphics page of our experiment on Comet.ml. Having the figure ready at hand will be useful for reference as we progress through the competition and for collaboration!
It’s clear that our target distribution is highly imbalanced, with the vast majority of clients repaying their loans on time. This is great for Home Credit, but it will definitely inform how we evaluate our classifier: when your target variable’s distribution is imbalanced, accuracy is not a good metric for evaluating model performance.
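To see why accuracy misleads here, consider a sketch on a hypothetical 90/10 imbalanced target (the proportions are illustrative, not the exact competition numbers): a model that always predicts the majority class scores high accuracy while learning nothing, whereas its AUC reveals it is no better than chance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical illustration: a ~90/10 imbalanced binary target.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.10).astype(int)  # ~10% positives (defaults)

# A "classifier" that always predicts the majority class (no default).
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)               # high, despite learning nothing
auc = roc_auc_score(y_true, np.full(len(y_true), 0.5))  # constant scores -> 0.5
print(f"accuracy={acc:.2f}, auc={auc:.2f}")
```

This is why the competition (and this series) uses AUC rather than accuracy to compare models.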
Next, let’s check the number of categorical and numerical features to see the split.

Categorical: 16
Numerical: 105  # can be further divided into integer and float types
We seem to have a larger presence of numerical features in our dataset. These numerical features can be divided into integer type features and float type features. On closer examination, we see that a majority of our numerical features, such as FLAG_DOCUMENT, FLAG_EMAIL, and REG_CITY_NOT_LIVE_CITY, are actually encoding categorical information, so we will include them in our categorical feature set.
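The dtype-based split and the reassignment of flag-like integer columns can be sketched as follows. The toy frame below only mimics a few application_train columns, and the "few distinct values" threshold is an assumption for illustration:

```python
import pandas as pd

# Toy frame standing in for application_train.csv; column names mirror the
# real dataset, but the values are made up.
df = pd.DataFrame({
    "TARGET": [0, 1, 0, 0],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],   # float
    "FLAG_EMAIL": [0, 1, 0, 0],                                # int, but categorical
    "CNT_CHILDREN": [0, 0, 1, 2],                              # genuinely numeric int
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans",
                           "Revolving loans", "Cash loans"],   # object/categorical
})

categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.drop("TARGET").tolist()

# Integer columns with very few distinct values (e.g. FLAG_*, REG_*) are
# really encoding categories; move them over. The <=2 cutoff is a guess.
for col in list(numerical):
    if pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() <= 2:
        categorical.append(col)
        numerical.remove(col)

print(categorical)  # ['NAME_CONTRACT_TYPE', 'FLAG_EMAIL']
print(numerical)    # ['AMT_CREDIT', 'CNT_CHILDREN']
```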
We’ll take a look at these three types of features in order: (1) float valued, (2) categorical, and (3) integer valued.
Let’s first take a look at the correlation matrix for our float valued features and our target.
This figure illustrates that there seems to be little correlation between our target label (feature no. 65) and our float valued features. However, we do see that features 11 to 53 are highly correlated, and on further inspection, we find that these are all features related to the client’s home (interesting 🧐). We can make a note of this in the Notes tab of our Comet.ml experiment page.
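A correlation matrix like the one described above can be computed with pandas and rendered with matplotlib. This is a minimal sketch on fabricated data: two strongly correlated columns stand in for the housing-related block, and the exact plotting calls are an illustration rather than the post's original code.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Fabricate a few float columns: the two APARTMENTS_* columns share a latent
# signal, mimicking the correlated housing features in application_train.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
df = pd.DataFrame({
    "APARTMENTS_AVG": base + rng.normal(scale=0.1, size=500),
    "APARTMENTS_MODE": base + rng.normal(scale=0.1, size=500),
    "EXT_SOURCE_1": rng.normal(size=500),
    "TARGET": rng.integers(0, 2, size=500),
})

corr = df.corr()
fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(corr.values, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=90)
ax.set_yticks(range(len(corr)), labels=corr.columns)
fig.colorbar(im)
# experiment.log_figure(figure_name="float_corr", figure=plt)  # as in the post
print(corr.loc["APARTMENTS_AVG", "APARTMENTS_MODE"])
```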
# highly correlated float valued features (features 11 - 53)
# Conduct PCA on these features to reduce down to 10
['APARTMENTS_AVG',
'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG',
'COMMONAREA_AVG',
'ELEVATORS_AVG',
'ENTRANCES_AVG',
'FLOORSMAX_AVG',
'FLOORSMIN_AVG',
'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG',
'LIVINGAREA_AVG',
'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG',
'APARTMENTS_MODE',
'BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE',
'YEARS_BUILD_MODE',
'COMMONAREA_MODE',
'ELEVATORS_MODE',
'ENTRANCES_MODE',
'FLOORSMAX_MODE',
'FLOORSMIN_MODE',
'LANDAREA_MODE',
'LIVINGAPARTMENTS_MODE',
'LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE',
'NONLIVINGAREA_MODE',
'APARTMENTS_MEDI',
'BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MEDI',
'YEARS_BUILD_MEDI',
'COMMONAREA_MEDI',
'ELEVATORS_MEDI',
'ENTRANCES_MEDI',
'FLOORSMAX_MEDI',
'FLOORSMIN_MEDI',
'LANDAREA_MEDI',
'LIVINGAPARTMENTS_MEDI',
'LIVINGAREA_MEDI',
'NONLIVINGAPARTMENTS_MEDI',
'NONLIVINGAREA_MEDI']
These features are good candidates for dimensionality reduction, since they add redundant information to our model. We’ll run a Principal Component Analysis (PCA) transformation over these features, and use the top 10 principal components in our classifier. These 10 components are able to explain about 77% of the variance in our dataset.
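The PCA step can be sketched as below. The data is simulated here to stand in for the 42 housing columns listed above, and the median imputation is an assumption (the real columns contain many missing values, and scikit-learn's PCA does not accept NaNs):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulate 42 correlated columns driven by a handful of latent factors,
# mimicking the highly correlated housing features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))
housing_feats = latent @ rng.normal(size=(5, 42)) + rng.normal(scale=0.5, size=(1000, 42))

# Impute, standardize, then reduce to the top 10 principal components.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=10),
)
components = pipeline.fit_transform(housing_feats)
explained = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(components.shape)  # (1000, 10)
print(f"variance explained: {explained:.0%}")
```

On the real dataset, this is where the ~77% explained-variance figure quoted above comes from.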
Next we’ll take a look at the categorical features. In order to use them in our model, we will have to one-hot encode them into binary vectors (basically create dummy variables). After encoding these variables, we can run Random Forest and LightGBM models with similar parameters over the data to extract an estimate of feature importance.
Our models consist of 100 trees, with 31 leaves, and produce the following feature rankings.
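A sketch of the encode-then-rank step is below, on a toy frame with made-up values. To keep the snippet dependent only on scikit-learn, it uses the Random Forest half of the comparison; the LightGBM run in the post is analogous (`num_leaves=31`, 100 trees).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the categorical columns of application_train.
df = pd.DataFrame({
    "NAME_EDUCATION_TYPE": ["Higher education", "Secondary", "Secondary",
                            "Higher education"] * 50,
    "FLAG_OWN_CAR": ["Y", "N", "N", "Y"] * 50,
    "TARGET": [0, 1, 0, 0] * 50,
})

# One-hot encode the categoricals into dummy variables.
X = pd.get_dummies(df.drop(columns="TARGET"))
y = df["TARGET"]

# 100 trees, capped at 31 leaves each, matching the parameters above.
rf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=31, random_state=0)
rf.fit(X, y)

# Impurity-based importances give the feature ranking.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```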
Some takeaways: LightGBM and Random Forest both rank the type of income, type of education, family status, and car ownership in the top 15 categorical features.
From these categorical features, I’ve included some of the more informative plots of features that showed up in both the LightGBM and Random Forest feature rankings below. The full list can be found here.
- Gender distribution
- Loan type distribution
- Family status distribution
- Occupation distribution
Finally, let’s take a look at our integer type features. After filtering out integer features that represent categories, we are left with only 7 integer valued features.
['CNT_CHILDREN',
'DAYS_BIRTH',
'DAYS_EMPLOYED',
'DAYS_ID_PUBLISH',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
'HOUR_APPR_PROCESS_START']
These features are on different scales, so we will just use LightGBM and Random Forest on this data, and plot the most important features.
Both algorithms rank the integer features in a similar way. We will now combine these filtered features into a new dataset, and train three models: (1) Logistic Regression, (2) Random Forest, and (3) LightGBM on this data. These models will serve as our baselines.
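The baseline comparison can be sketched as follows. The data is synthetic (standing in for our filtered feature set), and scikit-learn's GradientBoostingClassifier fills in for LightGBM so the snippet needs only scikit-learn; the commented log_metric call shows where the Comet.ml logging would go.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the filtered feature set.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),  # LightGBM stand-in
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={auc:.3f}")
    # experiment.log_metric(f"{name}_auc", auc)  # as in the post
```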
We’ve logged all three models in Comet.ml and also used the Hyperparameter Optimizer feature. Our initial run of LightGBM results in an AUC score of 0.745, which is significantly higher than both Logistic Regression and Random Forest.
Not bad for a baseline model, but we can definitely do better! In our next post, we’ll explore some automatic feature engineering using a neural network.
👉🏼 Follow us on Medium to stay tuned for our next two posts for our Kaggle Home Credit Default Risk competition submission! 👈🏼
Dhruv Nair is a Data Scientist on the Comet.ml team. 🧠🧠🧠 Before joining Comet.ml, he worked as a Research Engineer in the Physical Analytics team at the IBM T.J. Watson Lab.
About Comet.ml — Comet.ml is doing for ML what GitHub did for code. Our lightweight SDK enables data science teams to automatically track their datasets, code changes, and experimentation history. This way, data scientists can easily reproduce their models and collaborate on model iteration amongst their team!