Data Science Practice for Loan Applications

Ömer Faruk Eker
Published in CodeX · 14 min read · Oct 29, 2021

The following is a data science exercise covering data wrangling, exploratory data analysis, predictive modelling and evaluation of results on a loan application dataset.

Image taken from: https://www.template.net/business/letters/sample-loan-application-letters/

This article contains Jupyter notebook cells to help data scientists reproduce the results and perhaps pick up new numpy, pandas, seaborn and sklearn tricks along the way.

Dataset

The dataset is a modified version of the well-known “Statlog German Credit Data”, which can be accessed from the following link:

https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

The dataset is a CSV file of German loan applications provided by Professor Dr. Hans Hofmann’s team in 1994.

The data is in structured table form; each row represents a loan application in which an individual applies for a loan for one of various reasons. It contains 1,003 rows and 22 variables describing the applicant and their loan history.

Exploratory Data Analysis

The next step is to explore the dataset. We have a target column named “accepted” denoting whether the application was rejected or accepted. The target variable in the original dataset actually measures the credit risk of the applicant; its values are either “good” or “bad”. “Good” means the applicant was worth extending the credit to, and “bad” means the opposite. 70% of the target values in the original data fall in the “good” category and the remaining 30% are “bad”. In our modified dataset, zeros represent the 70% and ones the 30%. Therefore we map the zeros to “good”, which means loan approval, and the ones to “bad”, which means loan rejection, using the map() function of the pandas library.
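A minimal sketch of that relabelling, assuming the data has been loaded into a pandas DataFrame called df (the file name below is hypothetical) and the target column is called "accepted":

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the modified CSV lives
df = pd.read_csv("german_credit_modified.csv")

# 0 -> "good" (loan approved), 1 -> "bad" (loan rejected)
df["accepted_label"] = df["accepted"].map({0: "good (approved)", 1: "bad (rejected)"})
print(df["accepted_label"].value_counts(normalize=True))
```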

Following the correction, the distribution of loan approvals and rejections is shown in the Matplotlib donut chart below.
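One way to draw such a donut chart with Matplotlib, reusing the label column created above:

```python
import matplotlib.pyplot as plt

counts = df["accepted_label"].value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(counts, labels=counts.index, autopct="%1.1f%%",
       wedgeprops=dict(width=0.4))  # a wedge width below 1 turns the pie into a donut
ax.set_title("Loan approvals vs. rejections")
plt.show()
```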

Let’s explore the personal_status field using the unique() function of pandas. It has four distinct categorical values, each starting with either “male” or “female” followed by an underscore.

Now let’s take the “male”/“female” part of each string and create a new variable called “gender”, using a combination of the split() and apply() functions, as sketched below.
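A sketch of both steps; the exact category strings in the modified file may differ, so treat this as illustrative:

```python
# Inspect the four distinct categories, i.e. values of the form "male_..." / "female_..."
print(df["personal_status"].unique())

# Keep only the part before the first underscore as the gender
df["gender"] = df["personal_status"].apply(lambda value: value.split("_")[0])
```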

By using the pivot_table() function of pandas we can group and aggregate the target variable by gender.
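For example, assuming 0 in the accepted column marks an approved application, as described above:

```python
# Share of approved applications (accepted == 0) per gender
approval_by_gender = (df.assign(approved=df["accepted"].eq(0))
                        .pivot_table(values="approved", index="gender", aggfunc="mean"))
print(approval_by_gender)
```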

The male approval rate is 72.4% and the female rate is 64.8%. Seaborn’s barplot() function helps display the results, and Matplotlib’s text() function is used to embed the percentages in the graph.
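One way to produce that bar chart, reusing the aggregation above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

rates = approval_by_gender.reset_index()

ax = sns.barplot(data=rates, x="gender", y="approved")
for i, rate in enumerate(rates["approved"]):
    ax.text(i, rate + 0.01, f"{rate:.1%}", ha="center")  # write the percentage above each bar
ax.set_ylabel("approval rate")
plt.show()
```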

Data Imputation

Now let’s fill in the missing data (if any). To check for missing values we will use the isnull() function of pandas. The “age” and “asnm” columns each have 50 missing values, whereas the “credit_amount” column has only one.
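For instance:

```python
# Number of missing values per column, largest first
print(df.isnull().sum().sort_values(ascending=False).head())
```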

Before imputing the missing data, it is better to look at the histograms of the variables first. We will use the distplot() function of the Seaborn package, and Matplotlib’s vlines() function for the vertical lines (mean, median and mode of the age variable) in the graph. Mean, median and mode are the most commonly used central tendency metrics. The mean is suitable for non-skewed distributions such as the Normal/Gaussian. In our case the age variable is positively skewed, so we rule the mean out.
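A sketch of that plot (note that distplot has been deprecated in recent seaborn releases; histplot/displot are the modern equivalents):

```python
import seaborn as sns
import matplotlib.pyplot as plt

age = df["age"].dropna()

ax = sns.distplot(age, bins=30)
ymax = ax.get_ylim()[1]
for value, colour, name in [(age.mean(), "red", "mean"),
                            (age.median(), "green", "median"),
                            (age.mode()[0], "purple", "mode")]:
    ax.vlines(value, 0, ymax, colors=colour, linestyles="dashed",
              label=f"{name}: {value:.1f}")
ax.legend()
plt.show()
```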

For skewed distributions like this one, the choice is therefore between the median and the mode. We will go for the mode, as it does not distort the shape of the distribution as much as the mean and median do. The mean, median and mode imputation scenarios are visualised in the comparison graph below.
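The mode imputation itself is a one-liner:

```python
# Replace missing ages with the most frequent age value
df["age"] = df["age"].fillna(df["age"].mode()[0])
```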

Now let’s create a categorical variable age_group with the groups: under 25, 25–34, 35–49 and 50+. We will use numpy’s select() function for this purpose.
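A sketch using numpy’s select():

```python
import numpy as np

conditions = [df["age"] < 25,
              df["age"].between(25, 34),
              df["age"].between(35, 49),
              df["age"] >= 50]
labels = ["under 25", "25-34", "35-49", "50+"]

df["age_group"] = np.select(conditions, labels)
```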

Credit_amount is the amount of money requested in the application. Let’s plot a histogram of the variable.
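For example:

```python
# Raw histogram of the requested amounts
sns.distplot(df["credit_amount"])
plt.show()
```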

The histogram is not very helpful because of the outlier values. Due to the huge gap between the outliers and the rest of the data, a single bin ends up representing almost the entire population.

Now let’s take a look at the outliers in the credit_amount field and weed them out. The outliers can be seen in the following boxplot; some candidates appear to be asking for loans of over 10 million Deutsche Marks. A log scale is used for the x-axis to give a better view. The interquartile range (IQR) is one of the standard methods for filtering out outliers: values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR are considered outliers.
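A sketch of the boxplot on a log-scaled axis:

```python
ax = sns.boxplot(x=df["credit_amount"])
ax.set_xscale("log")  # log scale keeps the bulk of the data visible next to the extreme outliers
plt.show()
```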

Let’s calculate the 25th and 75th percentiles and measure the range between them. This range is multiplied by 1.5, and values falling outside the resulting fences are treated as outliers and removed from the dataset.
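A minimal version of that filter:

```python
q1, q3 = df["credit_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose credit_amount lies inside the IQR-based fences
df = df[df["credit_amount"].between(lower, upper)]
```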

After weeding out the outliers, the histogram looks much better. The credit amount variable has a positively skewed distribution, which means the tail is on the right side and most of the data sits towards the lower end. For this type of distribution, the mean is greater than the median.

What can we tell about the relationship between the applicant’s age and the requested credit amount?

First we calculate the correlation between the age and credit_amount variables in our dataset. It comes out at roughly 0.036: a positive value, but one that indicates a very weak linear relationship between the two.

A scatter plot of the two variables also gives an idea of the relationship, and it suggests the absence of any linear relationship between them.
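For example:

```python
# Pearson correlation between age and requested amount (~0.036 in this dataset)
print(df["age"].corr(df["credit_amount"]))

sns.scatterplot(data=df, x="age", y="credit_amount")
plt.show()
```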

However, it is also worth checking the “age group” variable we created earlier. By the look of it, applicants under 25 request slightly smaller amounts than the rest of the applicants. Let’s conduct a t-test to see whether this assumption holds.

Our null hypothesis is as follows:

H0: The average credit amount requested by applicants under age 25 and by applicants aged 25 and over is the same

Let’s set the significance level to 0.05, which corresponds to a 95% confidence level for either rejecting or failing to reject the null hypothesis.
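A sketch of the test with scipy (whether the original notebook assumed equal variances is not stated, so Welch’s variant is used here):

```python
from scipy import stats

under_25 = df.loc[df["age"] < 25, "credit_amount"]
rest = df.loc[df["age"] >= 25, "credit_amount"]

print(stats.ttest_ind(under_25, rest, equal_var=False))
```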

T-test results based on that are:

  • statistic: -2.17
  • pvalue: 0.03

The p-value is less than our significance level (0.05), which means we can reject the null hypothesis. Therefore we can say with 95% confidence that applicants under age 25 request slightly smaller amounts than the rest of the applicants.

Now let’s check the relationship between age and this variable called “asnm”.

It is always a good idea to start with a scatter plot to visualise the relationship. As we can see in the plot below, the two variables seem to have a parabolic (non-linear) relationship, so the Pearson correlation value will mislead us. As expected, it does: the Pearson correlation is calculated at -0.093, which suggests a weak negative correlation, yet if we look at the graph:

  • when the applicant’s age is greater than 42, ASNM values tend to increase with age in an exponential fashion, i.e. a strong positive correlation
  • on the other hand, when the applicant’s age is less than 42, the correlation between ASNM and age becomes strongly negative

In conclusion, these two variables are strongly related. More than that, the graph below suggests that the ASNM values were obtained simply by passing the age values through a function.

For non-linear relationships of this type, the distance correlation metric can be used to quantify the strength of the dependence.
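One option is the third-party dcor package (pip install dcor); the call below is a hedged sketch and assumes both columns are numeric with no missing values:

```python
import dcor

x = df["age"].to_numpy(dtype=float)
y = df["asnm"].to_numpy(dtype=float)

# Distance correlation is 0 only under independence and also captures non-linear dependence
print(dcor.distance_correlation(x, y))
```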

Hypothesis Testing

Let’s try to find an answer to the following question by constructing a hypothesis and testing it:

What’s the effect of owning a telephone on the likelihood of a credit application being accepted?

When an experiment aims to compare two categorical random variables, or to find a difference between them, we can use the chi-square test to assess the statistical difference.

Both phone ownership (yes & none) and credit approval (0 & 1) are categorical, or let’s say non-continuous, variables. Therefore our approach will be to state a hypothesis and test it with the chi-square testing mechanism.

So our null hypothesis will be as follows:

  • H0: “Owning a phone has no effect on getting the credit approval”

Since both variables have only two outcomes, the contingency table is 2×2 and the degrees of freedom are (2-1)×(2-1) = 1. The alternative hypothesis is likewise of only one kind:

  • H1: “Owning a phone affects getting the credit approval”

The contingency table below gives the counts for each combination. Out of 351 candidates who own a phone, only 82 got the approval, an approval rate of around 23%. On the other side, candidates without a phone seem to have a higher approval rate (~31%). Almost two thirds of the applicants do not own a phone, which indicates that the data was collected in the 90s; nowadays, since almost everyone has a phone, this analysis may not carry over to the present day. Let’s continue testing our hypothesis regardless.

Let’s set the significance level to 0.05, which corresponds to a 95% confidence level for either rejecting or failing to reject the null hypothesis.

Calculate Chi-square score

Calculate critical value given the confidence level and degrees of freedom:
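A sketch of both calculations with scipy; the column name own_telephone is an assumption based on the original Statlog attribute names:

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# 2x2 contingency table: phone ownership ("yes"/"none") vs. the binary target
contingency = pd.crosstab(df["own_telephone"], df["accepted"])

chi2_score, p_value, dof, expected = chi2_contingency(contingency)

# Critical value for a 95% confidence level and 1 degree of freedom
critical_value = chi2.ppf(0.95, df=1)
print(chi2_score, critical_value, p_value)
```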

If the calculated chi-square value is greater than the critical value, we reject the null hypothesis. That is the case here, so we reject it. This means “owning a phone affects getting the credit approval”. Based on the approval rates calculated earlier, one can say that owning a phone decreases the chances of getting the approval.

Before we delve into the modelling part, let’s explore the variables more.

Now we visualise the distributions of the variables that lend themselves to a histogram. Initial thoughts and interesting observations:

  • These four variables take numeric (integer) values over a wide range, so a histogram is the better way to visualise their distributions
  • The remaining variables in the dataset are either categorical or ordinal, so we will use donut charts to see their distributions

Credit amount and repayment duration are correlated, as expected; the correlation between them is calculated at 0.62.

Let’s visualise the distribution of the categorical variables. During the feature engineering and model improvement phases, it is worth coming back to this graph from time to time to keep a solid grasp of the variable distributions.

Modelling

Since we have:

  • A target variable (the accepted column) to train our models on
  • And target variable takes binary values only (1 & 0)

This becomes a binary classification problem. Applying Occam’s Razor to machine learning, it is better to choose a simpler model over a complex, black-box one “if their performance is similar”. However, if the complex/advanced model performs significantly better than the simpler one, then we have to go with the better performer anyway.

Logistic regression is a good simple model to start with: its way of modelling is easy to understand, and its results are easy to interpret as its parameters change.

In order to use categorical variables in our models we will have to convert them to one-hot-encoded versions first:
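For instance, with pandas get_dummies (the helper label column from the earlier sketch is dropped first):

```python
# One-hot encode every object-typed (categorical) column; dtype=int keeps the dummies numeric
features = df.drop(columns=["accepted_label"])
categorical_cols = features.select_dtypes(include="object").columns
df_encoded = pd.get_dummies(features, columns=categorical_cols, dtype=int)
```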

Let’s find the features most correlated with the target variable. Note that the two most correlated features come from the checking_status variable, which tells us checking status is a valuable input for our ML models.
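A quick way to inspect that ranking:

```python
# Absolute correlation of every encoded feature with the target, strongest first
correlations = df_encoded.corr()["accepted"].drop("accepted")
print(correlations.abs().sort_values(ascending=False).head(10))
```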

If we compare the approval rates of the checking_status groups, the group with a negative checking account balance has the lowest approval rate, and the approval rate increases with the balance in the checking account. However, the applicants with no running account seem to have the highest approval rate of all.

Let’s dig in further. The following graph shows that if you have a balance of over 1000 DM in your savings account, you are very likely to get approval no matter what your checking status is (purple bars below). Still, this graph does not really explain why the applicants with no running account have a higher approval rate.

When we check the distribution of savings_status over checking_status, we can see that, in this dataset, the applicants with no running account tend to have a higher balance in their savings account. This could be a bias specific to this dataset, so it is better to check this phenomenon against other credit datasets.

Standardization of a dataset is a common requirement for many machine learning estimators like Logistic Regression: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
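A sketch of the standardisation step:

```python
from sklearn.preprocessing import StandardScaler

X = df_encoded.drop(columns=["accepted"])
y = df_encoded["accepted"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # every feature ends up with zero mean and unit variance
```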

Now let’s initialise a logistic regression model.
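For example:

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)  # default parameters, with a higher iteration cap for convergence
```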

A cost matrix is provided with the original dataset that assigns a different penalty to misclassification errors involving the positive class. The positive class is the one with a value of 1 in the target variable, i.e. the applicants who are not credit-worthy.

Cost for False Negative > Cost for False Positive

This suggests that the positive class is the focus of the prediction task and that it is more costly to the bank or financial institution to give money to a bad customer than to not give money to a good customer. We will take this into account when selecting a performance metric.

Therefore we will use the fbeta score from scikit-learn, which is a more general version of the well-known F1 score. The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score: beta < 1 lends more weight to precision, while beta > 1 favours recall. Since we are interested in higher recall, we set beta=2.
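With scikit-learn this can be wrapped as a reusable scorer:

```python
from sklearn.metrics import fbeta_score, make_scorer

# beta=2 weights recall more heavily than precision
f2_scorer = make_scorer(fbeta_score, beta=2)
```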

To avoid sampling bias and the effects of randomness, we split the data dynamically using stratified k-fold cross-validation. We set the number of splits to 10, dividing the dataset into 10 groups; in each turn, 9 of the groups are used for training the model and the remaining group for testing. As opposed to regular k-fold, stratified k-fold builds the folds so that each one preserves the class distribution of the target variable.
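A sketch of the cross-validation run, using the scorer defined above:

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(log_reg, X_scaled, y, cv=cv,
                        scoring=f2_scorer, return_train_score=True)

print("mean train F2:", scores["train_score"].mean())
print("mean test F2:", scores["test_score"].mean())
```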

In conclusion, the Logistic Regression model (with default parameters) gets an F2 score of around 0.50 on the testing folds on average, which is insufficient performance. There are no signs of overfitting, as the performance is already low and the train and test results are not significantly different. In the next section we will try to increase the model performance.

How can we improve the performance?

Logistic Regression has a hyperparameter called class_weight which handles the class imbalance problem. Since our target variable is split roughly 30%/70%, adjusting this parameter will help obtain better recall values. When we set the class_weight parameter to “balanced”, the target values are used to automatically adjust weights inversely proportional to class frequencies in the input data. This means the weights of the minority class (the candidates who are not credit-worthy) are proportionally increased.
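In code this is a single argument:

```python
# Inversely-proportional class weights boost the 30% minority class
log_reg_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
```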

Let’s also try to optimise the hyperparameters of the model. We will use a grid search for this purpose, aiming for high values of the custom F2 score we defined earlier:
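A sketch of the search; the parameter grid below is an assumption, not necessarily the one used in the original notebook:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],            # liblinear supports both l1 and l2 penalties
    "class_weight": [None, "balanced"],
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring=f2_scorer, cv=cv, return_train_score=True)
grid.fit(X_scaled, y)

print(grid.best_params_, grid.best_score_)
```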

Now we will use the optimised parameters to obtain the final results and performance.

If we compare the results before and after these adjustments, we can say that we made progress:

  • The F2 score on the test folds increased from around 0.50 to 0.72, which is a significant improvement
  • Recall, which is very important for this kind of problem, also increased significantly (from 0.47 to 0.81)

Further Work

There are many further steps that could be applied to get better results, some of which are:

  • Feature Engineering: this refers to creating the most useful features out of the data. For instance, new features can be extracted from the existing data using domain expertise. Feature selection, i.e. choosing the most important features, or other dimensionality reduction methods (e.g. PCA) can also be applied. There are many techniques available for both extracting and selecting features
  • More Advanced Models: decision-tree-based ensemble models such as bagging and boosting can also be used to increase performance. Random Forests, XGBoost and LightGBM are known to perform well on structured data problems
