German Bank Data Default Prediction

Ayush Tiwari
12 min read · May 8, 2024


Code: https://github.com/ayush-477/Bank_default_prediction_code

Introduction

The objective of this project is to build a fully reproducible project report that showcases the use of machine learning models to predict loan defaults based on customer attributes from the German bank dataset.

Context

Banks, as institutions, thrive on the fundamental practice of lending money. However, a serious challenge they face is loan defaults. Defaults threaten the financial stability of a bank and, if not kept in check, can lead to heavy losses and ultimately the closure of the business.

Each day banks receive a large number of loan applications, each requiring careful consideration of many factors. The challenge lies in approving the applications with a very low potential of defaulting and rejecting the others, thereby minimizing losses.

To address this issue, banks require a predictive mechanism that assesses the likelihood of a loan default using the information already available to them.

The German bank data is a historical dataset that contains personal and financial information about customers, such as age, income and existing loans, along with their “default” status. Using this, predictive modelling can be done to forecast the probability of a loan default for a new loan applicant.

This way banks can harness the power of data to minimize their losses, and ensure a more secure lending environment, both for their customers and themselves.

Some questions that I explored and tried to answer during the course of this project are:

What does the data and its distribution look like as a whole?

What does the distribution of each feature look like, and how does each factor, be it personal or financial, and their combinations impact the likelihood of loan default?

Which machine learning model performs the best in predicting the loan default outcome based on the dataset?

Which hyperparameters work best with each model to give the best accuracy, and which model should be used to reduce false negatives and increase true positives based on recall performance?

Methods and Materials

Exploratory Data Analysis

The dataset contains 1000 rows, each representing a historical loan applicant, and 17 columns (features). The first 16 features specify either personal or financial information about the customer, and the 17th specifies whether or not the customer defaulted on the loan.

The features included are:

checking_balance — Amount of money available in the customer's checking account

months_loan_duration — Duration of the loan in months

credit_history — Credit history of each customer

purpose — Purpose for which the loan has been taken

amount — Amount of loan taken

savings_balance — Balance in the customer's savings account

employment_duration — Duration of employment

percent_of_income — Loan installment as a percentage of monthly income

years_at_residence — Duration of current residence

age — Age of customer

other_credit — Any other credits taken

housing — Type of housing, rent or own

existing_loans_count — Existing count of loans

job — Job type

dependents — Number of dependents of the customer

phone — Whether or not the customer has a phone

default — Default status (Target column)

Looking deeper into the dataset, I found that there are no missing values and no duplicated rows, and that the features are a combination of numerical and string types (int64 and object).

Describing the dataset revealed that there are 7 numerical columns in total: 3 continuous numerical columns (age, amount and months_loan_duration) and 4 discrete numerical columns (percent_of_income, years_at_residence, existing_loans_count and dependents).

Other than that, there are 2 ordinal columns: credit_history and employment_duration. Some more columns could be considered ordinal, but they contain values that do not fit a clear order, so it is difficult to establish an ordering and they are treated as categorical or numerical columns only.

The rest are categorical features (columns). The summary statistics of the numerical, categorical and ordinal columns, described below, give a deeper look into the distributions.
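A minimal sketch of these initial checks, assuming the raw data is loaded into a pandas DataFrame called df (the file name is illustrative, not necessarily the one used in the project):

```python
import pandas as pd

# File name assumed for illustration
df = pd.read_csv("german_credit.csv")

print(df.shape)                    # expect (1000, 17)
print(df.isna().sum().sum())       # total missing values (0 here)
print(df.duplicated().sum())       # duplicated rows (0 here)
print(df.dtypes.value_counts())    # mix of int64 and object columns

# Summary statistics for numerical and categorical/ordinal columns
print(df.describe())
print(df.describe(include="object"))
```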

Diving deeper into the values and their frequencies for the categorical columns reveals an interesting potential typo.

The loan purpose column has one value called car0, which occurs 12 times and could be a potential typo. So I changed it to “car”, adding its frequency to the “car” category.
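A one-line clean-up for this, continuing with the DataFrame df from the sketch above:

```python
# Merge the likely typo "car0" into the existing "car" category
df["purpose"] = df["purpose"].replace("car0", "car")

# Check the updated frequencies of the purpose column
print(df["purpose"].value_counts())
```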

Visualizations

I have included some visualizations to better understand the distribution of individual features and the relationships between them. The images below contain various plots such as histograms, boxplots and density plots. Apart from showing the distribution of features, these plots also help in understanding other factors like outliers and the correlation between input and target features, which helps in determining the most appropriate machine learning model to train on the data.

Frequency Distribution of Numerical Features

Boxplots for Numerical Features with Default Parameter as Hue (Color)

Pairplot of Numerical Columns with Default Parameter as Hue (Color)

Correlation Matrix Between Numerical Features

Count Plots of Categorical Variables with Default Parameter as Hue

Count Plots of Categorical Variables with other Categorical Variables as Hue

Some observations about the dataset from the EDA and visualizations are:

The proportion of customers with a very good or perfect credit history who default is higher than that of customers with a poor or critical credit history. This is contrary to the layman's guess that customers with lower credit scores are more likely to default.

As the amount of money in a customer's current account (checking_balance) increases, the proportion of loan defaults decreases.

People who own their houses are more likely to take loans for different purposes than people living in rented or other accommodation.

Except for amount and months_loan_duration, there is no very strong positive or negative correlation between the numerical features.

Modelling

After the EDA, the data is preprocessed to make it ready for predictive modelling. I scaled the numerical features using StandardScaler so that they are standardized with mean 0 and standard deviation 1. This is very important for distance-based machine learning algorithms, to avoid bias toward features with a larger scale. I then encoded the ordinal features using OrdinalEncoder from scikit-learn to preserve the order of their values, and encoded the remaining categorical features using OneHotEncoder to convert them into numerical values for fitting the models.

Finally, I split the data using a train-test split, keeping 800 rows for training and 200 rows for testing.
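Below is a minimal sketch of this preprocessing, continuing from the cleaned DataFrame df above. The column groupings, the ordinal handling and the stratified split are illustrative choices, not necessarily the exact setup used in the project (sparse_output requires scikit-learn 1.2 or newer):

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Column groups as described in the EDA section
numerical_cols = ["months_loan_duration", "amount", "percent_of_income",
                  "years_at_residence", "age", "existing_loans_count", "dependents"]
ordinal_cols = ["credit_history", "employment_duration"]
categorical_cols = ["checking_balance", "purpose", "savings_balance",
                    "other_credit", "housing", "job", "phone"]

preprocessor = ColumnTransformer([
    # Standardize numerical features to mean 0 and standard deviation 1
    ("num", StandardScaler(), numerical_cols),
    # In practice the category order would be passed explicitly via `categories=`
    ("ord", OrdinalEncoder(), ordinal_cols),
    # Dense output keeps the encoded matrix usable by all five models
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
])

# Target: 1 for "yes" (default), 0 for "no"
X = df.drop(columns=["default"])
y = (df["default"] == "yes").astype(int)

# 800 training rows and 200 test rows out of the 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
```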

Models Used

I employed 5 supervised learning models for the classification task of predicting loan default: Logistic Regression, Quadratic Discriminant Analysis, Support Vector Classifier, Random Forest Classifier and Gradient Boosting Classifier. I used GridSearchCV along with stratified K-fold cross-validation for hyperparameter tuning, with 'recall' as the evaluation metric, since the objective was to minimize false negatives and find potential loan defaulters. I then fitted each model on the training data with its best hyperparameters and evaluated performance on both the training and test sets, using metrics such as precision, recall, F1 score and accuracy.
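A minimal sketch of this tuning loop, using the preprocessed arrays from the previous sketch; the parameter grids below are illustrative, not the exact grids used in the project:

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative parameter grids; probability=True enables predict_proba for the
# precision-recall analysis later on
models = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "qda": (QuadraticDiscriminantAnalysis(), {"reg_param": [0.0, 0.1, 0.5]}),
    "svc": (SVC(probability=True), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
    "gradient_boosting": (GradientBoostingClassifier(random_state=42),
                          {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_estimators = {}

for name, (model, grid) in models.items():
    # Tune for recall to minimize false negatives (missed defaulters)
    search = GridSearchCV(model, grid, scoring="recall", cv=cv, n_jobs=-1)
    search.fit(X_train_prepared, y_train)
    best_estimators[name] = search.best_estimator_
    print(name, "best params:", search.best_params_)
    print(classification_report(y_test, search.best_estimator_.predict(X_test_prepared)))
```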

Results

Key Findings and Model Statistics

Logistic Regression

Logistic Regression is a machine learning model used mainly for binary classification. It predicts the probability of a given instance belonging to a certain class. The probability is calculated by applying the logistic (sigmoid) function to a linear combination of the features, so the predicted probability is always a value between 0 and 1. On fitting the logistic regression model on the training dataset, I got the performance shown below.

The Logistic Regression model has an average test accuracy compared with the other models, but has the lowest test recall and F1 score for the positive class.

The training accuracy for the Logistic Regression model is also average when compared with other models.
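As a small illustration of the logistic function described above (the weights, intercept and feature values below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights, intercept and scaled feature values for one applicant
weights = np.array([0.8, -0.5, 1.2])
intercept = -0.3
x = np.array([1.5, 0.2, -0.7])

z = np.dot(weights, x) + intercept   # linear combination of the features
print(sigmoid(z))                    # predicted probability of default, strictly between 0 and 1
```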

Quadratic Discriminant Analysis

Quadratic Discriminant Analysis is a classification algorithm that is similar to LDA but allows for non-linear decision boundaries by letting each class have its own covariance matrix. This makes QDA a better fit for cases where the boundary between classes is non-linear. It performs classification by modelling the conditional probability densities of the features given each class and then calculating the posterior probabilities using Bayes' theorem.

On fitting the QDA classification model on the training dataset, I got the performance shown below.

The QDA classification model has the lowest test accuracy of all the models and the lowest test precision score for the positive class.

The training accuracy for the QDA model is also the lowest among the models.
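A minimal sketch of fitting QDA on the preprocessed data from earlier; reg_param is an illustrative regularization value, and predict_proba returns the Bayes posterior probabilities described above:

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# reg_param regularizes the per-class covariance estimates (value is illustrative)
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
qda.fit(X_train_prepared, y_train)

# Posterior probabilities P(class | features), computed via Bayes' theorem
posteriors = qda.predict_proba(X_test_prepared)
print(posteriors[:5])
```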

Support Vector Classifier

The Support Vector Classifier is a powerful supervised learning algorithm which separates classes by finding the optimal hyperplane that divides the points of different classes in such a way that the margin between the hyperplane and the nearest points is maximized. It performs well in high-dimensional feature spaces.

On fitting the Support Vector classification model on the training dataset, I got the performance shown below.

The Support Vector classification model has better test accuracy than Logistic Regression and QDA, but lower than the other models. The precision, recall and F1 score are also decent for the positive class on the test data.

The training accuracy for the Support Vector Classifier is quite high when compared with other models, so the model fits the training data pretty well.
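A minimal sketch using the preprocessed data from earlier; decision_function returns the signed distance of each sample from the separating hyperplane, which relates directly to the margin the model maximizes during training (the kernel and C value are illustrative):

```python
from sklearn.svm import SVC

# Kernel and regularization strength are illustrative choices
svc = SVC(kernel="rbf", C=1.0)
svc.fit(X_train_prepared, y_train)

# Signed distance from the separating hyperplane: the sign gives the predicted
# class, the magnitude shows how far a sample sits from the decision boundary
distances = svc.decision_function(X_test_prepared)
print(distances[:5])
print(svc.predict(X_test_prepared)[:5])
```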

Random Forest Classifier

The Random Forest classifier is an ensemble learning method used for classification. It works by constructing a multitude of decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. Each tree is built from a randomly selected subset of the training data and features. The randomness in feature selection allows the Random Forest to capture complex relationships between features and classes, making it robust to noisy data and outliers. Additionally, by aggregating predictions from multiple trees, the Random Forest tends to generalize well to unseen data, resulting in robust and accurate classification performance.

On fitting the Random Forest classification model on the training dataset, I got the performance shown below.

The Random Forest classification model has the best accuracy amongst all the models. The precision, recall and F1 score are average for the positive class.

The training accuracy for the Random Forest model is 1, indicating that it fits the training data perfectly.
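A small sketch of the aggregation idea on the preprocessed data from earlier (hyperparameters are illustrative; note that scikit-learn averages the per-tree class probabilities rather than taking a strict majority vote):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are illustrative; limiting max_depth or min_samples_leaf is
# one way to rein in the perfect fit on the training data
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train_prepared, y_train)

# Each fitted tree produces its own class-probability estimate; the forest's
# prediction averages these per-tree estimates
per_tree = np.stack([tree.predict_proba(X_test_prepared) for tree in rf.estimators_])
averaged = per_tree.mean(axis=0)

# The averaged probabilities should match the forest's own predict_proba output
print(np.allclose(averaged, rf.predict_proba(X_test_prepared)))
```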

Gradient Boosting Classifier

The Gradient Boosting Classifier is a strong ensemble learning method which combines boosting principles with gradient descent to create a powerful predictive model. During training, a loss function is minimized by gradually fitting new trees to the residuals of prior predictions, improving predictive power step by step. Gradient boosting works well because it concentrates on difficult examples and learns from its past mistakes, allowing it to detect complex relationships within the data more effectively.

On fitting the Gradient Boosting classification model on the training dataset, I got the performance shown below.

The Gradient Boosting classification model has close to the best accuracy amongst all the models and is comparable to the Random Forest Classifier. The precision and recall are decently good when compared with the other models, and the F1 score is the highest for the positive class.

The training accuracy for the Gradient Boosting model is close to 1, indicating that it fits the training data very well.
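A small sketch of the stage-by-stage improvement described above, using scikit-learn's staged_predict on the preprocessed data from earlier (hyperparameters are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score

# Hyperparameters are illustrative
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
gbc.fit(X_train_prepared, y_train)

# staged_predict yields predictions after each boosting stage, showing how
# performance evolves as more trees are fitted to the residuals
for stage, y_pred in enumerate(gbc.staged_predict(X_test_prepared), start=1):
    if stage % 50 == 0:
        print(f"stage {stage}: test recall = {recall_score(y_test, y_pred):.3f}")
```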

Discussion

Summary of major findings

To select the best model, I used the standard precision-recall curve. The area under the precision-recall curve (AUC-PR) was used instead of the ROC curve because the dataset is slightly imbalanced, with 700 'no' values and 300 'yes' values, and true positives are more important in our scenario. The precision-recall curves attached below reveal insights into the classification predictions, highlighting the trade-off between precision and recall.
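A minimal sketch of how such a curve and its AUC can be computed for any one of the fitted models, assuming a fitted classifier clf that supports predict_proba (for example one of the tuned estimators from the sketch earlier):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve

# Probability of the positive (default) class from a fitted classifier `clf`
y_scores = clf.predict_proba(X_test_prepared)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)

plt.plot(recall, precision, label=f"AUC-PR = {pr_auc:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```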

On the basis of the AUC of the precision-recall plot, the Random Forest Classifier and Gradient Boosting Classifier perform equally well and also have the highest accuracy on the test data. The Random Forest Classifier fits the training data perfectly, which can indicate overfitting, but we cannot say for sure until we have more test data to evaluate on. The QDA classifier performs the worst for our problem, and the results of Logistic Regression and SVC are comparable and average.

On other metrics like precision and F1 score, Logistic Regression and the Support Vector Classifier tend to perform averagely, Random Forest and Gradient Boosting perform decently well, while QDA consistently shows poor performance. In terms of sensitivity (also known as recall), however, the Support Vector Classifier and QDA prove to be the best models, since they are able to identify more true positive cases than any other model. On the other hand, Logistic Regression has the lowest sensitivity, which means it missed many true positives.

Based on these results, we can conclude that among all the algorithms considered here, the Random Forest and Gradient Boosting classifiers are the best for modelling our dataset.

Limitations of study and Potential Future Directions

The dataset is pretty small, consisting of just 1000 customers (rows), so the model is less generalizable and its performance on new customers may suffer. A potential future direction could be to collect more data or generate synthetic data to make the model more generalizable.

More work can be done on feature engineering, deriving new features from the existing ones that are more representative of the target variable and also reduce the curse of dimensionality.

The dataset is slightly imbalanced, with 700 non-default cases and just 300 default cases. This imbalance can lead to biased results where the model favours the majority class. More work can be done in the future, such as undersampling, to achieve a more balanced representation of the target class in the training data.
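A minimal sketch of one such option, undersampling the majority class with the third-party imbalanced-learn library (using it here is an assumption; class weights inside the models are another common alternative):

```python
# Requires the third-party package imbalanced-learn: pip install imbalanced-learn
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class (non-default) rows until both classes are equal
sampler = RandomUnderSampler(random_state=42)
X_train_balanced, y_train_balanced = sampler.fit_resample(X_train_prepared, y_train)

print(Counter(y_train))           # original class counts
print(Counter(y_train_balanced))  # balanced class counts
```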

More advanced models like XGBoost and neural networks can be deployed in the future, which may fit the data better and give better results on the test data.

Conclusion

In this project, I sought to develop a predictive model for loan default prediction using historical data from a German bank. To achieve this, I carried out data exploration and visualization and applied different machine learning algorithms, in order to come up with an accurate model that can help the bank identify potential loan defaulters and mitigate financial risks.

I discovered meaningful patterns through careful examination of the dataset. These patterns served as indicators of how different variables relate to each other and influence the chances of a bad debt. Model selection was based on the best performance as measured by important metrics, particularly the area under the precision-recall curve (AUC-PR).

Looking at Gradient Boosting, an ensemble method, it can be said that with appropriate decision thresholds it becomes a very good model for predicting whether somebody will default on a loan or not. It consistently showed high recall together with decent AUC-PR figures, recognizing most cases where defaults might occur while keeping false negatives low, which is equally important if not more so. This allowed me to tune the Gradient Boosting model so that more emphasis is put on improving the true positive rate relative to the false positive rate, making sure both are given due priority.

This approach led to much better results than any other method I tried during the study, since it gave fairly balanced precision and recall while optimizing for both at once.
