Machine learning for Banking: Loan approval use case

Youssef Fenjiro
5 min read · Jul 24, 2018


Banks' fundamental business model relies on financial intermediation: raising funds and lending them out (mortgages, real estate, consumer, and corporate loans). Lending is the major source of credit risk, which revolves around two main issues: loan approval and fraud. In this post we will focus on loan approval using machine learning models.

Granting credit to retail and corporate customers based on credit scoring is a key risk-assessment tool. It allows a bank to manage, understand, and quantify a potential obligor’s credit risk through a “creditworthiness score”, which is a more robust and consistent evaluation technique than judgmental scoring.

Credit scoring in retail portfolios reflects the default risk of a customer at the moment of loan application. It helps decide whether to accept or reject a credit application based on four main categories of input data:

· Customer information: age, gender, marital status, job, income/salary, housing (rent, own, rent-free), geography (urban/rural), residential status, existing client (Y/N), number of years as a client, total debt, account balance.

· Credit information: total amount, purpose, monthly payment amount, interest rate, …

· Credit history: payment history and delinquencies (payment delays), amount of current debt, number of months in payment arrears, length of credit history, time since last credit, types of credit in use.

· Bank account behavior: average monthly savings amount, maximum and minimum balance levels, credit turnover, trend in payments, trend in balance, number of missed payments, number of times the credit limit was exceeded, number of times the home address changed (a sample application record is sketched below).
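To make these categories concrete, here is a minimal sketch of a single application record in Python; all field names and values are hypothetical and would come from the bank’s own systems.

```python
# A hypothetical loan application record grouped by the four input categories above.
application = {
    "customer_information": {
        "age": 34, "marital_status": "married", "job": "engineer",
        "income": 3200, "housing": "rent", "existing_client": True,
        "years_as_client": 5, "total_debt": 4000, "account_balance": 1800,
    },
    "credit_information": {
        "total_amount": 15000, "purpose": "car",
        "monthly_payment": 320, "interest_rate": 0.045,
    },
    "credit_history": {
        "delinquencies": 1, "months_in_arrears": 0,
        "length_of_credit_history_years": 7, "types_of_credit_in_use": ["consumer"],
    },
    "account_behavior": {
        "avg_monthly_savings": 250, "max_balance": 4200, "min_balance": 120,
        "missed_payments": 0, "times_exceeded_credit_limit": 1,
    },
}
```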

Feature Selection and Models

Machine learning also increases understanding by showing which factors most affect a given outcome. A correlation matrix helps discard highly correlated variables, and feature-selection methods (in particular multivariate ones) such as stepwise regression are used to filter out irrelevant predictors: at each round the method adds the best feature (or removes the worst one) and evaluates the model error using cross-validation, finally keeping the best subset of predictors (feature selection will be tackled in a separate post).
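As a rough illustration, here is a minimal sketch of forward stepwise selection using scikit-learn’s SequentialFeatureSelector; the loan data and column names are hypothetical, not a real bank dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical application data; in practice this comes from the bank's systems.
loans = pd.DataFrame({
    "age":             [25, 40, 35, 52, 46, 29, 61, 33],
    "income":          [2200, 5400, 3100, 7600, 4800, 2500, 6900, 3000],
    "savings_amount":  [150, 900, 300, 1500, 700, 100, 1200, 250],
    "credit_duration": [36, 24, 48, 12, 36, 60, 12, 48],
    "granted":         [0, 1, 0, 1, 1, 0, 1, 0],   # label: 1 = granted
})
X, y = loans.drop(columns="granted"), loans["granted"]

# Forward selection: add the feature that most improves cross-validated error
# at each round, and keep the best subset (here, the best 2 of 4 predictors).
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=2,
)
selector.fit(X, y)
print("Selected predictors:", list(X.columns[selector.get_support()]))
```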

Logistic regression and decision trees are both popular classification techniques (supervised learning) used to build behavioral scorecards. They are statistical methods that analyse a dataset to bring out the relationship between “predictors” (or explanatory variables), which are independent, and a “response” (or outcome variable), which is dependent. In our case we try to estimate the probability of granting a loan given the values of the input variables seen above. For simplicity, we will restrict ourselves to the following four predictors: age, income, average monthly savings amount, and credit duration.

Logistic Regression

In logistic regression, the target y is binary (granted y = 1 / not granted y = 0) and p is the probability of granting the credit. The goal is to find the coefficients αi of the formula below, which predicts the logit transformation of p.

Logit(p) = log(p / (1 − p)) = α0 + α1 · age + α2 · income + α3 · savings amount + α4 · credit duration

To find the coefficients αi, we train the classification model on labelled historical data, where the decision “granted” / “not granted” is already known, using cross-entropy as the loss function to compare the predictions ŷ against the labels y:

L(α0, …, α4) = −(1/N) · Σi [ yi · log(ŷi) + (1 − yi) · log(1 − ŷi) ]

The values of αi are those that minimize L(α0, …, α4); they are found from its first derivative (the gradient) using an optimization algorithm such as gradient descent.
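Below is a minimal NumPy sketch of this procedure: it fits the coefficients α0…α4 by gradient descent on the cross-entropy loss. The small dataset is hypothetical and only meant to show the mechanics.

```python
import numpy as np

# Hypothetical training data: columns are age, income, savings_amount, credit_duration.
X_raw = np.array([
    [25, 2200,  150, 36],
    [40, 5400,  900, 24],
    [35, 3100,  300, 48],
    [52, 7600, 1500, 12],
    [46, 4800,  700, 36],
    [29, 2500,  100, 60],
    [61, 6900, 1200, 12],
    [33, 3000,  250, 48],
], dtype=float)
y = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)   # 1 = granted, 0 = not granted

# Standardize predictors and prepend a column of ones for the intercept alpha_0.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(X), 1)), X])

alpha = np.zeros(X.shape[1])     # coefficients alpha_0 .. alpha_4
learning_rate = 0.1

for _ in range(2000):
    y_hat = 1.0 / (1.0 + np.exp(-X @ alpha))      # predicted probabilities ŷ
    grad = X.T @ (y_hat - y) / len(y)             # gradient of the cross-entropy loss
    alpha -= learning_rate * grad                 # gradient-descent step

print("Fitted coefficients alpha_0..alpha_4:", np.round(alpha, 3))
```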

Decision Tree

With decision trees (such as CART, CHAID, QUEST, or C5.0), we build a classification model that learns decision rules inferred from the data features to make predictions, generating a tree structure whose decision nodes correspond to attributes (input variables). The tree is built in four steps (a code sketch follows the list):

· Step 1: use a splitting criterion (such as information gain, gain ratio, or the Gini index) to select the attribute with the best score, i.e. the one that produces the purest nodes with respect to the target variable (in our case, the attribute that best separates “Granted” from “Not granted”).

· Step 2: create the root split node and its resulting subsets, then repeat step 1 on each subset, reusing the splitting criterion to select the next best attribute and produce the purest sub-nodes with respect to the target variable.

· Step 3: repeat step 2 until a stopping criterion is reached, for instance: the purity of a node exceeds a pre-specified limit, the depth of a node exceeds a pre-specified limit, or the predictor values for all records are identical (no further rule can be generated).

· Step 4: apply pruning to avoid overfitting, using a criterion to remove sections of the tree that contribute little classification power and to determine the optimal tree size. To do so, we create distinct “training” and “validation” sets to evaluate the effect of pruning, and use a statistical test (such as the Chi-square test in CHAID) to estimate whether pruning or expanding a given node yields an improvement. There are two types of pruning:

o Pre-pruning stops growing the tree early, before it perfectly classifies the training set.

o Post-pruning lets the tree grow fully and then prunes it back (see the code sketch below).
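As a rough sketch of these four steps, here is how they map onto scikit-learn’s CART implementation (DecisionTreeClassifier). The data is hypothetical; max_depth plays the role of pre-pruning, while ccp_alpha triggers cost-complexity post-pruning.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical applications: age, income, savings_amount, credit_duration.
X = np.array([
    [25, 2200,  150, 36], [40, 5400,  900, 24], [35, 3100,  300, 48],
    [52, 7600, 1500, 12], [46, 4800,  700, 36], [29, 2500,  100, 60],
    [61, 6900, 1200, 12], [33, 3000,  250, 48],
])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # 1 = granted, 0 = not granted

# Hold out a validation set to judge the effect of pruning (step 4).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

# criterion="gini" is the splitting rule (steps 1-2); max_depth is a pre-pruning
# stopping rule (step 3); ccp_alpha applies cost-complexity post-pruning (step 4).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=["age", "income", "savings", "duration"]))
print("Validation accuracy:", tree.score(X_valid, y_valid))
```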

Conclusion

Logistic regression is popular for modeling scorecards because it produces a continuous range of scores between 0 and 1. Decision trees, by contrast, yield only a limited set of score values (every leaf node corresponds to a particular score), which may not be enough to finely distinguish obligors in terms of default risk.

In addition, we can use other models like discriminant analysis, neural networks, and support vector machines (SVMs); or we can combine them by using ensemble methods such as bagging for more stability and boosting for more accuracy (ensemble methods will be addressed in a separate post).
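For instance, a quick sketch of the ensemble idea in scikit-learn, with a random forest as a bagging-style ensemble and gradient boosting as a boosting ensemble; both expose the same fit/predict API as the single models above.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Bagging-style ensemble of trees: averages many trees for more stability.
bagged = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: builds trees sequentially, each correcting the previous errors.
boosted = GradientBoostingClassifier(n_estimators=200, random_state=0)
# Usage mirrors the earlier models, e.g.:
# bagged.fit(X_train, y_train); boosted.fit(X_train, y_train)
```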


Youssef Fenjiro

Data scientist, Machine learning & Artificial intelligence.