Predicting Financial Transactions With Catboost, LGBM, XGBoost and Keras (AUROC Score of 0.892)

Tackling the Santander Customer Transaction Prediction challenge from Kaggle

Sebastien Callebaut
9 min read · Aug 31, 2020

The goal of this challenge is to predict whether a customer will make a transaction (“target” = 1) or not (“target” = 0). For that, we get a data set of 200 anonymized variables, and our submission is judged on the Area Under the Receiver Operating Characteristic Curve (AUROC), which we have to maximise.

This project is somewhat different from others: you basically get a huge amount of data with no missing values and only numbers. A dream come true for any data scientist. Of course, that sounds too good to be true! Let’s dive in.

An ATM by Jan Antonin Kolar on Unsplash

I. Set up

We start by loading the data and getting a quick overview of what we will have to handle. We do so by calling the describe() and info() functions.

# Load the data sets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Create a merged data set and review initial information
combined_df = pd.concat([train_df, test_df])
print(combined_df.describe())
print(combined_df.info())

We have a total of 400,000 observations, 200,000 of which are in our training set. We can also see that we will have to deal with a class imbalance issue, as the mean of the target column is only about 0.1.
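A quick way to quantify that imbalance is to look at the class frequencies directly (a small check using the train_df loaded above):

# Relative frequency of each class in the target column
print(train_df["target"].value_counts(normalize=True))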

II. Missing values

Let’s check whether we have any missing values. For that, we print the column names that contain missing values.

# Check missing values
print(combined_df.columns[combined_df.isnull().any()])

We have zero missing values. Let’s move forward.

III. Data types

Let’s check the data we have. Are we dealing with categorical variables? Or text? Or just numbers? We print a dictionary containing the different data types present and their number of occurrences.

# Get the data types
print(Counter([combined_df[col].dtype for col in combined_df.columns.values.tolist()]).items())

Apart from the ID_code column, we only have float data, so we don’t have to create dummy variables.
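Had the data contained categorical columns, we would have needed to encode them first, for instance with pd.get_dummies; a minimal sketch on a hypothetical column named var_cat (not present in this data set):

# Hypothetical example: one-hot encode a categorical column 'var_cat'
dummies = pd.get_dummies(combined_df['var_cat'], prefix='var_cat')
combined_df = pd.concat([combined_df.drop('var_cat', axis=1), dummies], axis=1)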

IV. Data cleaning

We don’t want to use our ID column to make our predictions, so we move it into the index.

# Set the ID col as index
for element in [train_df, test_df]:
    element.set_index('ID_code', inplace = True)

We now separate the target variable from our training set and create a new dataframe for our target variable.

# Create X_train_df and y_train_df set
X_train_df = train_df.drop("target", axis = 1)
y_train_df = train_df["target"]

V. Scaling

We haven’t done any data exploration or outlier analysis yet, although both are always highly recommended. Given the anonymized nature of the challenge, however, we suspect that the variables taken individually might not be too interesting.

In order to compensate for our lack of outlier detection, we scale the data using RobustScaler().

# Scale the data and use RobustScaler to minimise the effect of outliers
scaler = RobustScaler()

# Scale the X_train set
X_train_scaled = scaler.fit_transform(X_train_df.values)
X_train_df = pd.DataFrame(X_train_scaled, index = X_train_df.index, columns= X_train_df.columns)

# Scale the X_test set
X_test_scaled = scaler.transform(test_df.values)
X_test_df = pd.DataFrame(X_test_scaled, index = test_df.index, columns= test_df.columns)
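Under the hood, RobustScaler centers each column on its median and divides by the interquartile range, which keeps extreme values from dominating the scale. We can inspect the statistics it learned from the training data:

# Per-column median and IQR fitted on X_train_df (first 5 columns)
print(scaler.center_[:5])
print(scaler.scale_[:5])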

We now create X_train, y_train, X_test and y_test sets for training our models and then testing them on hold-out data.

# Split our training sample into train and test, leave 20% for test
X_train, X_test, y_train, y_test = train_test_split(X_train_df, y_train_df, test_size=0.2, random_state = 20)
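Note that train_test_split can also preserve the class ratio in both splits through its stratify argument, which is worth considering given the imbalance; a variant of the call above:

# Stratified variant: both splits keep the ~10% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X_train_df, y_train_df, test_size=0.2, random_state=20, stratify=y_train_df)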

When it comes to outliers, one could use IsolationForest() to automatically identify and remove outlier rows. This technique is often used for data sets with many variables. This code chunk has been borrowed from Machine Learning Mastery.

# OUTLIERS

# Remove outliers automatically
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
print(yhat)

# Select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train.loc[mask, :], y_train.loc[mask]

Note that this automated outlier removal did not add any predictive power to our model, so we decided to comment it out.

VI. Class Imbalance

In our data, we have seen that far fewer observations made a transaction than did not. If we want our model to be equally capable of predicting both outcomes, we should make sure we don’t feed it skewed data.

We can correct for class imbalance by resampling: downsampling the majority class, upsampling the minority class, or creating synthetic samples. These techniques are inspired by this excellent article by Tara Boyle. We start with downsampling.

# CLASS IMBALANCE

# Downsample majority class

# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

not_transa_down = resample(not_transa,
                           replace = False,  # sample without replacement
                           n_samples = len(transa),  # match minority n
                           random_state = 27)  # reproducible results

# Combine minority and downsampled majority
downsampled = pd.concat([not_transa_down, transa])

# Checking counts
print(downsampled.target.value_counts())

# Create training set again
y_train = downsampled.target
X_train = downsampled.drop('target', axis=1)

print(len(X_train))

Here is the code for upsampling the minority class.

# Upsample minority class

# Concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# Separate minority and majority classes
not_transa = X[X.target==0]
transa = X[X.target==1]

transa_up = resample(transa,
                     replace = True,  # sample with replacement
                     n_samples = len(not_transa),  # match majority n
                     random_state = 27)  # reproducible results

# Combine upsampled minority and majority
upsampled = pd.concat([transa_up, not_transa])

# Checking counts
print(upsampled.target.value_counts())

# Create training set again
y_train = upsampled.target
X_train = upsampled.drop('target', axis=1)

print(len(X_train))

And here is the code for creating synthetic samples with SMOTE.

# Create synthetic samples

sm = SMOTE(random_state=27, sampling_strategy='minority')
X_train, y_train = sm.fit_resample(X_train, y_train)

print(y_train.value_counts())

VII. Modelling

We now dive deeper into the models. The plan is to create 4 different models and then combine their predictions by majority vote into an ensemble that yields the final prediction. We do not plan to fine-tune the models extensively, so we leave GridSearch out of scope.

1. Neural Network With Keras

# NEURAL NETWORK

# Build our neural network with input dimension 200
classifier = Sequential()

# First Hidden Layer
classifier.add(Dense(150, activation='relu', kernel_initializer='random_normal', input_dim=200))

# Second Hidden Layer
classifier.add(Dense(350, activation='relu', kernel_initializer='random_normal'))

# Third Hidden Layer
classifier.add(Dense(250, activation='relu', kernel_initializer='random_normal'))

# Fourth Hidden Layer
classifier.add(Dense(50, activation='relu', kernel_initializer='random_normal'))

# Output Layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

# Compile the network
classifier.compile(optimizer ='adam',loss='binary_crossentropy', metrics =['accuracy'])

# Fitting the data to the training data set
classifier.fit(X_train,y_train, batch_size=100, epochs=150)

# Evaluate the model on training data
eval_model=classifier.evaluate(X_train, y_train)
print(eval_model)

# Make predictions on the hold out data
y_pred=classifier.predict(X_test)
y_pred =(y_pred>0.5)

# Get the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))

# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))

# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions and create submission file
predictions = (classifier.predict(X_test_df) > 0.5).ravel()
my_pred_ann = pd.DataFrame({'ID_code': X_test_df.index, 'target_ann': predictions})

# Set 0s and 1s instead of True and False
my_pred_ann["target_ann"] = my_pred_ann["target_ann"].map({True: 1, False: 0})

# Create CSV file (Kaggle expects a 'target' column)
my_pred_ann.rename(columns={'target_ann': 'target'}).to_csv('pred_ann.csv', index=False)
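Since the competition is judged on AUROC, it is also worth scoring the hold-out set with roc_auc_score, using the predicted probabilities rather than the thresholded labels; a quick check with the objects defined above:

# AUROC is computed from probabilities, not hard 0/1 labels
y_proba = classifier.predict(X_test).ravel()
print("AUROC of {}".format(roc_auc_score(y_test, y_proba)))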

This model is built upon the excellent review from Renu Khandelwal. We haven’t modified the original script except for adding some layers and increasing the number of neurons per layer.

Our first submission with this neural network gives us a score of 0.80882.

2. LightGBM

# LIGHT GBM

# Get the train and test data for the training sequence
train_data = lgbm.Dataset(X_train, label=y_train)
test_data = lgbm.Dataset(X_test, label=y_test)

# Set parameters
parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

# Train our classifier
classifier = lgbm.train(parameters,
                        train_data,
                        valid_sets = test_data,
                        num_boost_round = 5000,
                        early_stopping_rounds = 100)


# Make predictions (probabilities)
predictions = classifier.predict(X_test_df.values)

# Create submission file
my_pred_lgbm = pd.DataFrame({'ID_code': X_test_df.index, 'target': predictions})

# Create CSV file
my_pred_lgbm.to_csv('pred_lgbm.csv', index=False)

# Keep binary votes for the ensemble below
my_pred_lgbm['target_lgbm'] = (predictions > 0.5).astype(int)
my_pred_lgbm = my_pred_lgbm.drop('target', axis=1)
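Because we trained with a validation set and early stopping, the booster records its best iteration; we can use it to sanity-check the hold-out AUROC before submitting (a quick check, assuming the objects above):

# Hold-out AUROC at the best iteration found by early stopping
holdout_proba = classifier.predict(X_test, num_iteration=classifier.best_iteration)
print("Hold-out AUROC of {}".format(roc_auc_score(y_test, holdout_proba)))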

This code chunk is based on some work from this Kaggle Notebook by E. Zietsman. If you want a complete overview of how LightGBM works and how to optimally tune it, make sure you read this article from Pushkar Mandot.

This gives us a score of 0.89217.

3. XGBoost

# XGBOOST

# Instantiate classifier
classifier = XGBClassifier(tree_method = 'hist',
                           objective = 'binary:logistic',
                           eval_metric = 'auc',
                           learning_rate = 0.01,
                           max_depth = 2,
                           colsample_bytree = 0.35,
                           subsample = 0.8,
                           min_child_weight = 53,
                           gamma = 9,
                           verbosity = 0)  # 'silent' was removed in recent XGBoost versions

# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold out data
y_pred = (classifier.predict_proba(X_test)[:,1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))

# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))

# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:,1] >= 0.5).astype(int)

# Create submission file
my_pred_xgb = pd.DataFrame({'ID_code': X_test_df.index, 'target_xgb': predictions})

# Create CSV file (rename the column to 'target' for submission)
my_pred_xgb.rename(columns={'target_xgb': 'target'}).to_csv('pred_xgb.csv', index=False)

We also rely on XGBoost and the helpful insights from Félix Revert.

This gives us a score of 0.59283.

4. Catboost

# CATBOOST

# Instantiate classifier
classifier = cb.CatBoostClassifier(loss_function = "Logloss",
                                   eval_metric = "AUC",
                                   learning_rate = 0.01,
                                   iterations = 1000,
                                   random_seed = 42,
                                   od_type = "Iter",
                                   depth = 10,
                                   early_stopping_rounds = 500)


# Fit the data
classifier.fit(X_train, y_train)

# Make predictions on the hold out data
y_pred = (classifier.predict_proba(X_test)[:,1] >= 0.5).astype(int)

# Get the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Get the accuracy score
print("Accuracy of {}".format(accuracy_score(y_test, y_pred)))

# Get the f1-Score
print("f1 score of {}".format(f1_score(y_test, y_pred)))

# Get the recall score
print("Recall score of {}".format(recall_score(y_test, y_pred)))

# Make predictions
predictions = (classifier.predict_proba(X_test_df)[:,1] >= 0.5).astype(int)

# Create submission file
my_pred_cat = pd.DataFrame({'ID_code': X_test_df.index, 'target_cat': predictions})

# Create CSV file (rename the column to 'target' for submission)
my_pred_cat.rename(columns={'target_cat': 'target'}).to_csv('pred_cat.csv', index=False)
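One detail worth flagging: CatBoost’s overfitting detector (od_type with early_stopping_rounds) only kicks in when a validation set is passed to fit, which the call above does not do; a variant that monitors our hold-out split:

# Give the overfitting detector a validation set to monitor
classifier.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=100)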

This part is inspired by Wakame on Kaggle.

This gives us a score of 0.78769.

5. Ensemble

In this last part, we take the 4 models we created and ensemble them to generate our final answer. An observation is only classified as 1 if at least 3 of the 4 models predict a 1.

# ENSEMBLE

# Create data frame
my_pred_ens = pd.concat([my_pred_ann, my_pred_xgb, my_pred_cat, my_pred_lgbm], axis = 1, sort=False)

# Review our frame
print(my_pred_ens.describe())

# Sum the votes of the four models
my_pred_ens["target"] = my_pred_ens["target_ann"] + my_pred_ens["target_xgb"] + my_pred_ens["target_lgbm"] + my_pred_ens["target_cat"]

# Assign a 1 if sum is higher than 2
my_pred_ens["target"] = np.where(my_pred_ens["target"] > 2, 1, 0)

# Remove other target cols
my_pred_ens = my_pred_ens.drop(["target_ann", "target_lgbm", "target_xgb", "target_cat"], axis = 1)

# Create submission file
my_pred = pd.DataFrame({'ID_code': X_test_df.index, 'target': my_pred_ens["target"]})

# Create CSV file
my_pred.to_csv('pred_ens.csv', index=False)

This gives us a score of 0.78627.
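Since AUROC rewards well-ranked probabilities rather than hard labels, a soft-voting ensemble that averages the four probability vectors might suit the metric better than majority voting; a sketch, with hypothetical Series prob_ann, prob_xgb, prob_cat and prob_lgbm holding each model’s predicted probabilities:

# Soft voting: average the predicted probabilities of the four models
probs = pd.concat([prob_ann, prob_xgb, prob_cat, prob_lgbm], axis=1)
my_pred_soft = pd.DataFrame({'ID_code': X_test_df.index, 'target': probs.mean(axis=1).values})
my_pred_soft.to_csv('pred_ens_soft.csv', index=False)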

VIII. Conclusion

Our best model was LightGBM. To improve our score, we might rely on stratified k-folds or another cross-validation technique. We might also fine-tune our models in more detail.
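As a pointer, a stratified k-fold loop for the LightGBM model might look like the sketch below (averaging the test predictions over the folds, reusing the parameters dictionary from above):

# Sketch: average LightGBM predictions over 5 stratified folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
test_preds = np.zeros(len(X_test_df))

for train_idx, valid_idx in skf.split(X_train_df, y_train_df):
    fold_train = lgbm.Dataset(X_train_df.iloc[train_idx], label=y_train_df.iloc[train_idx])
    fold_valid = lgbm.Dataset(X_train_df.iloc[valid_idx], label=y_train_df.iloc[valid_idx])
    booster = lgbm.train(parameters, fold_train, valid_sets=fold_valid,
                         num_boost_round=5000, early_stopping_rounds=100)
    test_preds += booster.predict(X_test_df.values, num_iteration=booster.best_iteration) / skf.n_splits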

Packages used

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.utils import resample
import lightgbm as lgbm
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import IsolationForest
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import catboost as cb
from catboost import Pool
from sklearn.model_selection import KFold
