Predicting the Default Risk

Vu Ngo · Published in The Startup · Jan 8, 2020 · 14 min read

Business Understanding

Consumption needs sometimes take unexpected turns, such as replacing major appliances, fixing up a house, or paying unplanned expenses. In these situations, consumers can be left strapped for cash to deal with the immediate payment obligation. The consumer loans industry therefore exists to provide consumers with short-term financial relief or to help them achieve short-term goals.

Home Credit, a leading international multi-channel provider of consumer finance, is looking to broaden its offerings to more customers. Reaching a greater number of customers allows Home Credit to grow its top line. However, the more loans the company offers, the more high-risk individuals it will be exposed to; a critical decision for the company is therefore to determine the likelihood that a customer will default. If the default rate among customers rises, the company will incur losses in its expansion. The objective for Home Credit is to offer loans to individuals who can pay them back and to turn away those who cannot. The challenge lies in the great diversity of backgrounds among the individuals who come to Home Credit for a loan. In this analysis, I aim to apply a variety of techniques to assess the idiosyncrasies of each customer and determine whether a customer will default.

Data Understanding

The data consists of seven different sets. The main dataset is “Application_train.csv.” It contains 307,511 observations of 122 variables and provides static data for all applicants. The target variable resides in this dataset and indicates whether a client had difficulties meeting payments. For my analysis, I treat this as default, although not everyone who has payment difficulties will actually default. Each observation is a loan application and includes the target value, demographic variables, and some other information. The other datasets contain data on previous applications, credit card balances, repayment history, and credit balances reported to the Credit Bureau. Given the size of the data and my limited time and resources, this project focuses solely on the main dataset. Its 122 variables can be broken into five categories: personal information about the customer, information about the loan, information about the area where the customer lives, the documents the customer provided, and the inquiries made to the Credit Bureau.

Data Preparation

Before conducting the analysis, the following transformations were made to make the dataset usable.

The first step in cleaning the data was to decide how to deal with missing values. Based on the meaning of each variable and the extent of missingness, I chose to either 1) exclude the variable entirely, 2) drop the observations containing missing data, or 3) recode the missing values in a way consistent with their meaning.
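As a rough illustration, the three strategies look like this in base R (the column names come from the Home Credit application file and serve only as examples of each case, not as the exact choices I made for every variable):

```r
app <- read.csv("application_train.csv", stringsAsFactors = FALSE)

# 1) Exclude a variable that is mostly missing
app$COMMONAREA_AVG <- NULL

# 2) Drop observations that are missing a key variable
app <- app[!is.na(app$AMT_ANNUITY), ]

# 3) Recode missing values consistently with their meaning
#    (e.g. a missing number of Credit Bureau enquiries treated as zero)
app$AMT_REQ_CREDIT_BUREAU_YEAR[is.na(app$AMT_REQ_CREDIT_BUREAU_YEAR)] <- 0
```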

Next, I re-coded some of the binary categorical variables into dummy variables: Gender was re-coded to Female, and Car Ownership and Realty Ownership were re-coded from Y/N to 0/1. Furthermore, some variables are conditional on others; for example, Car Age only has a value when Car Ownership is 1. For these variables, an interaction term was created between the dummy and the continuous variable, with missing values recoded to zero, and the original continuous variable was dropped. This way, my subsequent analyses can distinguish the effect of the dummy being true from the additional effect of the continuous variable, without the latter being pushed downward by the zero values. Refer to the Appendix for an overview of these variables.
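In code, the recoding and the ownership-by-age interaction look roughly like this (the column names follow the standard Home Credit file and are my assumption about the exact fields used):

```r
# Binary categoricals to 0/1 dummies
app$FEMALE     <- ifelse(app$CODE_GENDER == "F", 1, 0)
app$OWN_CAR    <- ifelse(app$FLAG_OWN_CAR == "Y", 1, 0)
app$OWN_REALTY <- ifelse(app$FLAG_OWN_REALTY == "Y", 1, 0)
app$CODE_GENDER <- app$FLAG_OWN_CAR <- app$FLAG_OWN_REALTY <- NULL

# Car Age exists only when a car is owned: recode missing to zero and keep
# the interaction in place of the raw variable
car_age <- ifelse(is.na(app$OWN_CAR_AGE), 0, app$OWN_CAR_AGE)
app$OWN_CAR_X_AGE <- app$OWN_CAR * car_age
app$OWN_CAR_AGE <- NULL
```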

The following step was to find extreme outliers and remove them from the data. Since my goal is to build a model that predicts whether a debtor will default on a loan, including this small number of extreme observations could substantially bias the model. Based on the histogram of Total Income, I decided to drop all observations with a value greater than 600,000. Next, uninformative variables, such as the application ID, were dropped from the dataset. The last transformation was to recode all categorical variables as factor variables, so that R treats them as categories rather than as continuous values.
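A sketch of the outlier filter, the ID removal and the factor conversion (again assuming the standard Home Credit column names):

```r
# Drop extreme incomes and the uninformative application ID
app <- app[app$AMT_INCOME_TOTAL <= 600000, ]
app$SK_ID_CURR <- NULL

# Turn the remaining character columns, and the target, into factors
char_cols <- names(app)[sapply(app, is.character)]
app[char_cols] <- lapply(app[char_cols], factor)
app$TARGET <- factor(app$TARGET)
```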

Finally, because of the lack of computational power, none of my code would run on the full dataset. Hence, I use a randomly sampled subset instead: of the 305,880 observations that remain after cleaning, I use only 1/30th (10,196 observations) for the analyses. This is my “train” dataset. I use one of the remaining 29 subsets as the “test” dataset.
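The split itself is a simple random partition into 30 chunks (the seed is my own addition, for reproducibility):

```r
set.seed(2020)                                  # assumed seed
chunk <- sample(rep(1:30, length.out = nrow(app)))
train <- app[chunk == 1, ]                      # ~10,000 observations
test  <- app[chunk == 2, ]                      # one of the remaining 29 chunks
```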

Exploratory Analysis

The histogram on the left shows the frequency distribution of Home Credit customers’ income. The right-skewed distribution indicates that the bulk of incomes fall between 0 and 300,000.

The next plot shows the relationship between the total loan amount and the loan annuity for Home Credit applicants. There is a clear positive relationship between the two variables, and the spread widens as the loan amount increases. The figure also shows the points separating into several distinct lines, mixing different income levels.

The figures above demonstrate the effect of ownership, gender and loan type on the default rate. People who do not own a car are somewhat more likely to default, which makes intuitive sense, since car owners generally have more assets with which to afford the loan payments. The default rate for males is about 35% higher than the default rate for females, suggesting that female applicants may be preferable to male applicants for Home Credit. Finally, the default rate for cash loans is much higher than that of revolving loans, so Home Credit should probably pay closer attention to cash loans to avoid losses and maximize profits in its expansion.

The plot on the left-hand side shows the total loan amount for different family types across the two loan types. The loan amounts of revolving loans are much smaller than those of cash loans, and the majority of applicants for both cash and revolving loans are married.

The pairwise correlation matrix plot provides a useful illustration of the relationships between my target variable (0 indicates no default, 1 indicates default) and several other variables (contract type, total income, total loan amount, etc.). It shows that, regardless of contract type, the median income of customers who did not default is higher than that of customers who defaulted.

Modeling

The objective I identified for my analysis is to classify each application as default or not; in other words, this is a classification problem. I chose classification techniques that output a probability rather than a simple “Yes” or “No,” which gives me some flexibility in choosing the relevant parameters.

Classification and regression tree (CART) & Principal component analysis (PCA)

The first classification method used is the classification tree. It is easy to build and easy to interpret for anyone, including non-technical stakeholders. To find the best set of variables for CART, I tried different combinations of variables and eventually obtained a model with 4 nodes, split on External 2 and External 3. In other words, in the CART model, whether someone will default is determined mainly by two standardized scores from external data sources.
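A minimal sketch of the tree fit, using the rpart package as one common choice (the exact package and variable combination used in the post are not shown here, so treat this as an assumption):

```r
library(rpart)
library(rpart.plot)

# Classification tree on the training data; the cp parameter is loosened a bit
# so the imbalanced classes do not prune the tree back to the root.
# In the post, the selected model ends up splitting only on the two external scores.
cart_fit <- rpart(TARGET ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0.001))
rpart.plot(cart_fit)

# Predicted default probabilities (class "1") on the test set
cart_prob <- predict(cart_fit, newdata = test)[, "1"]
```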

However, a limitation of the classification tree implementation I used is that it can handle at most 31 variables. Therefore, I complemented the initial CART with a Principal Component Analysis (PCA). PCA summarizes the predictors into a small number of components that capture most of their variation, so I can feed only the components that make a sizeable contribution into the tree.

The plots above show that the variance explained by the components becomes relatively flat after the fifth one, while the cumulative proportion of variance explained increases steadily with the number of components. In other words, apart from the first component, the latent features individually do not explain much additional information. In the CART model using the top 30 components as variables, I obtained only 2 nodes, split on PC2.
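Roughly, the PCA and the tree on the top 30 components could look like this (prcomp on the scaled numeric predictors; constant columns are dropped first since they cannot be scaled, and missing values are assumed to have been handled during data preparation):

```r
# PCA on the numeric predictors
num <- train[, sapply(train, is.numeric)]
num <- num[, apply(num, 2, sd) > 0]             # drop constant columns before scaling
pca <- prcomp(num, center = TRUE, scale. = TRUE)

screeplot(pca, type = "lines")                  # variance flattens after ~5 PCs
summary(pca)$importance[3, 1:10]                # cumulative proportion of variance

# Tree on the top 30 principal components
pc_train <- data.frame(TARGET = train$TARGET, pca$x[, 1:30])
cart_pca <- rpart(TARGET ~ ., data = pc_train, method = "class",
                  control = rpart.control(cp = 0.001))
```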

Logistic regression

The regression type I chose is logistic regression; it predicts the default probability for each loan. I run the logistic regression without interactions, since my limited computational power does not allow running lassoes with interactions on a dataset of this magnitude. However, I do include a few interactions manually where appropriate, such as between car ownership and car age. Logistic regression is more robust and allows a wider variety of predictions, but it requires a good selection of variables. The output of both the classification tree and the logistic regression is the probability that a loan will default. These probabilities need to be converted into classes of “Default” or “Not Default,” which requires a threshold: when the predicted probability exceeds the threshold, the application is classified as “Default.”
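A sketch of the fit and of the thresholding step (the 0.08 value is only a placeholder; the actual threshold comes from the ROC analysis below):

```r
# Logistic regression on all prepared variables (the car ownership x car age
# interaction was already created during data preparation)
logit_fit <- glm(TARGET ~ ., data = train, family = "binomial")

# Probabilities to classes via a threshold
# (assumes the test chunk contains no factor levels unseen in training)
logit_prob  <- predict(logit_fit, newdata = test, type = "response")
threshold   <- 0.08                             # placeholder; see ROC analysis below
logit_class <- ifelse(logit_prob > threshold, "Default", "Not Default")
```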

To determine thresholds for the various models, I drew on the ROC curve. I plotted the in-sample ROC curves for three models: Logistic Regression (blue), Classification Tree (green), and Classification Tree with PCA (red). The logistic model performs best in-sample, with the highest Area Under the Curve (AUC) of the three. Using the ROC curves, I identified the optimal threshold for each model, i.e., the level at which that model achieves its highest accuracy. These thresholds are also applied in the OOS evaluation.
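With the pROC package, the curves, AUCs and an accuracy-maximising threshold can be obtained roughly as follows (pROC is my assumption; any ROC implementation works the same way):

```r
library(pROC)

# In-sample ROC curves for the three models
roc_logit <- roc(train$TARGET, predict(logit_fit, type = "response"))
roc_cart  <- roc(train$TARGET, predict(cart_fit)[, "1"])
roc_pca   <- roc(train$TARGET, predict(cart_pca)[, "1"])

plot(roc_logit, col = "blue")
lines(roc_cart, col = "green")
lines(roc_pca,  col = "red")
auc(roc_logit); auc(roc_cart); auc(roc_pca)

# Threshold that maximises in-sample accuracy for the logistic model
cc <- coords(roc_logit, x = "all", ret = c("threshold", "accuracy"), transpose = FALSE)
threshold <- cc$threshold[which.max(cc$accuracy)]
```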

Lasso: Regularization and Selection

With so many variables in the dataset (77 variables, which expand to 192 once all categories are dummy-coded), there is ample opportunity for overfitting, creating the need for a method that selects variables by how much they contribute relative to the risk of under- and overfitting. The least absolute shrinkage and selection operator (Lasso) is an excellent means to this end. Since I am doing predictive modeling, I set the penalty parameter λ both at the value that minimizes the cross-validated deviance (λmin) and at the most regularized value within one standard error of it (λ1SE). The λs and the corresponding numbers of selected variables are summarized below. Using these selected variables should give the best balance between over- and underfitting. The lasso path below shows how λ relates to the number of variables and the deviance. A disadvantage is that the lasso can drop highly correlated variables arbitrarily, meaning that among such variables it may discard the more relevant ones.
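A sketch with cv.glmnet (the post does not name the lasso implementation; glmnet is one standard option, so treat the details as an assumption):

```r
library(glmnet)

# Expand factors into dummies (~192 columns) and run 10-fold cross-validated lasso
x <- model.matrix(TARGET ~ . - 1, data = train)
y <- train$TARGET
cv_lasso <- cv.glmnet(x, y, family = "binomial", nfolds = 10)

plot(cv_lasso)                                  # CV deviance against log(lambda)
plot(cv_lasso$glmnet.fit, xvar = "lambda")      # lasso path
cv_lasso$lambda.min                             # lambda minimising CV deviance
cv_lasso$lambda.1se                             # most regularised lambda within 1 SE

# Variables surviving at lambda.1se (used for the Post-Lasso refit)
b_1se   <- as.matrix(coef(cv_lasso, s = "lambda.1se"))
sel_1se <- setdiff(rownames(b_1se)[b_1se != 0], "(Intercept)")
```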

Evaluation

In the evaluation section, the different models are compared on OOS R-squared and OOS accuracy using K-fold cross-validation. Following convention, I used 10 folds. In total, I compare the performance of 8 models: Post-Lasso and Lasso, each with the minimum-lambda and 1se-lambda choices (4 models), the logistic regression, the classification tree with selected variables, the classification tree with the top 30 principal components, and the null model.
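The cross-validation itself is just a random assignment of the training observations to 10 folds, with every model refitted 10 times; a skeleton (the seed is my own addition):

```r
set.seed(2020)                                  # assumed seed
K <- 10
fold <- sample(rep(1:K, length.out = nrow(train)))

for (k in 1:K) {
  fit_data  <- train[fold != k, ]               # fit each model on 9 folds...
  eval_data <- train[fold == k, ]               # ...and evaluate on the held-out fold
  # refit the 8 models on fit_data, predict on eval_data, and record
  # the OOS R-squared and accuracy defined below
}
```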

OOS R-squared

The OOS R-squared of the 10-fold cross-validation is shown below.

The bar charts omit negative values to preserve visibility. The logistic regression performs poorly, with an extremely negative R-squared, which indicates overfitting. This is understandable, as the logistic regression has the largest number of variables of all the models, 192 in total. The best-performing model in terms of average OOS R-squared is Post-Lasso with the 1se choice of lambda.
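For a binary outcome, OOS R-squared is typically computed on the deviance scale relative to the null model; a small helper of that form (treat the exact formula as an assumption about what the charts use):

```r
# Deviance-based OOS R-squared relative to the null model, which always
# predicts the training default rate p_null.
# y: 0/1 outcomes in the held-out fold, p: predicted probabilities.
dev_r2 <- function(y, p, p_null) {
  p    <- pmin(pmax(p, 1e-6), 1 - 1e-6)         # guard against log(0)
  dev  <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))
  dev0 <- -2 * sum(y * log(p_null) + (1 - y) * log(1 - p_null))
  1 - dev / dev0
}
```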

OOS Accuracy

The accuracy measure is calculated as the share of correct predictions (true positives plus true negatives) over the total number of observations. Below are the 10-fold cross-validation results for accuracy.
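The corresponding helper is a one-liner:

```r
# Share of correct predictions (true positives plus true negatives)
accuracy <- function(y, yhat) mean(y == yhat)

# e.g. accuracy(eval_data$TARGET == "1", prob > threshold)
```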

The worst-performing model is the classification tree using the latent variables from PCA, at only 81.45% accuracy. In general, all the models seem to perform at a high accuracy rate. However, a closer look shows that the null model already achieves 91.88% accuracy. This is the result of a very small default rate in the population; the base default rate in the “train” dataset is 8.12%. The OOS accuracy therefore paints a rather discouraging picture for the predictive models: with such a small default rate, it is very difficult to beat the “Majority Rule” baseline of always predicting the most common class (not default).

Results on the Test dataset

To choose among the models, I balance the results of OOS R-squared and OOS accuracy. The model of choice is Post-Lasso with the 1se lambda selection, since it has the highest OOS R-squared and an OOS accuracy only slightly below the highest. I then use the Post-Lasso 1se lambda model to predict values in the test dataset. The results follow.
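Scoring the test set looks roughly like this; for brevity the sketch uses the lasso fit at λ1SE directly (the Post-Lasso refit follows the same pattern on the selected columns), and it assumes the train and test chunks share the same factor levels:

```r
# Predict default probabilities on the test chunk with the lambda.1se model
x_test <- model.matrix(TARGET ~ . - 1, data = test)
p_test <- predict(cv_lasso, newx = x_test, s = "lambda.1se", type = "response")
class_test <- ifelse(p_test > threshold, "Default", "Not Default")

# Model accuracy versus the "Majority Rule" baseline (always "Not Default")
mean(class_test == ifelse(test$TARGET == "1", "Default", "Not Default"))
mean(test$TARGET != "1")
```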

The comparison between the accuracy of my prediction and the accuracy obtained from using the “Majority Rule” shows that my model was not able to beat the baseline.

While it is rather disappointing that my model does not perform better than a simple rule such as assigning non-default to all applicants, my analysis still has significant value for two reasons. First, the “Majority Rule” is not practical in a business setting. If Home Credit starts accepting all loan applications, even if the company keeps its policy concealed, sooner or later consumers will catch on. Higher-risk individuals will thus be drawn to Home Credit, leading to a much higher default rate. Second, my model’s predictions outperform a random assignment of default. Using the base default rate, a random assignment produces an accuracy of 85.43%, while my model achieves an accuracy of 92.15%.

Deployment

The result of the data mining will be deployed by implementing the algorithm behind a web interface that sales representatives can use to enter applicants’ data. Once the data are entered, a suggestion of whether the application carries a risk of default will appear on the screen, and the sales rep can use it for further assessment. Some of the obstacles I identify for implementation are below.

First, data accuracy might be an issue, since I don’t have the means to verify every response. Some applicants might provide wrong data to increase their chance of getting the loan. However, this is also the case for the status quo: even with this noise, I was still able to build a predictive model, and there is no reason to believe that new customers from the expansion will produce noisier data. This risk is therefore relatively low.

Second, it might be time-consuming to acquire the data needed from customers. If customers have to spend more time getting evaluated for a loan, we might lose some of them during the data-input process. However, since all the variables the model uses are already being recorded, deploying the model does not add to the time required. On the contrary, an improvement my model can bring is that it filters out many demographic variables that turn out to be irrelevant for predicting default; these no longer need to be recorded, simplifying the loan application process. This could both cut costs and lower the bar for potential new customers to apply for a loan.

Furthermore, since I do not have full visibility into customers’ profiles, I cannot consider factors such as geography or applicant behavior, so the model might not be applicable in a different market. Entering a new market is always a risky step. To mitigate some of this risk, it should be entered in slow, small steps: by starting with a small group first, Home Credit can establish whether the market is similar to the ones it is currently active in and whether the product is successful there. By not going all-in right away, the risk is reduced.

Last but not least, using the optimal model raises an ethical problem: the variable Female is relevant in predicting whether an applicant will default. However, including it in a model that determines whether to approve a loan would, by definition, lead to gender discrimination, which is both unethical and unlawful. The only way to solve this problem is to take the variable out of the model entirely. Instead, we could include latent variables related to both gender and default probability, such as a proxy for risk-seeking behavior.

Improvement

I believe the current model is useful and will help Home Credit significantly in its expansion. Nevertheless, there are ample opportunities to improve the model as Home Credit’s needs become more complex. Additional sources of data, such as payment history, should be used to improve accuracy. Moreover, for now I only consider a yes/no default outcome without taking the loan amount into account. Naturally, the loan amount is a relevant factor: defaults on larger loans are far more detrimental than defaults on smaller ones. A more robust model could also use extra information such as the demand curve; this way, I could find the optimal interest rate and improve the current loan terms. For example, for a high-risk loan, a higher interest rate could make the expected profit worthwhile. Finally, since I currently use one general model for all loan types, it is probably not optimal. Different models should be created for different loan types to avoid losses and maximize profits in Home Credit’s expansion. It is very likely that the quite different nature of revolving loans and cash loans calls for different models.
