A short story of credit scoring and the Titanic dataset

Sasiwut Chaiyadecha
Published in Analytics Vidhya · 7 min read · Aug 5, 2020

Hello everyone. Today we are getting to know the “credit scoring” concept, and it is easier to follow with a concrete dataset. I will use the Titanic dataset to show how to assign a score to surviving passengers.


Our goals

Today’s topics cover:

  1. What is credit scoring?
  2. How to build the scoring model?
  3. How to assign the score?

What is credit scoring?

Credit scoring is one of the most widely used tools in the financial and banking industries. The score supports lenders’ decision-making on whether their clients are likely to repay a loan or not. It also helps lenders measure the credit risk when offering a loan, i.e. whether the risk is within an acceptable range. There are several kinds of credit scoring models, such as:

  • Application score
  • Behavior score
  • Collection score
  • Deposit score

Today, we are focusing on the Application score (A-Score), which fits our example dataset best.

How to calculate the score?

Now, we are going to build the A-Score model, but before doing that you might have a question like…

Why does it have to be A-Score?

Well, the answer is quite simple: it depends on which data types you have. A-Score is based on application data. Here, “application” means data that represent characteristics, for example sex, age, location, or education: things that usually define who/what you are and rarely change. On the other hand, if you want to build other models such as B-Score, you could use transaction data, number of clicks per day, or other things that usually change with behavior. Hopefully, this is clear enough.

1. Dataset

As I mentioned earlier, today’s data come from the Titanic dataset, a classic choice for classification model tutorials. We are doing pretty much the same, but in more detail, applying the credit scoring concept to assign a survival score. The dataset can be found here. Below is a data dictionary giving you a brief summary of the data structure.

  • Survived: 0 = No, 1 = Yes (1 is our target)
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • SibSp: Number of siblings / spouses aboard the Titanic
  • Parch: Number of parents / children aboard the Titanic
  • Fare: Passenger fare
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Quickly checking the dataset, we find a significant number of missing values in the ‘Age’ column and a few missing values in ‘Embarked’ as well.

Checking missing data
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
Embarked 2
dtype: int64

Since the total size of the dataset is quite limited, we want to keep as much data as possible. Hence, we are going to fill the missing values with the mode, and then we are ready for model building.
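Here is a minimal sketch of this step (the file name train.csv and the use of pandas are assumptions; the column names follow the data dictionary above):

import pandas as pd

# Load the Titanic training data (path is an assumption)
df = pd.read_csv('train.csv')

# Check missing values per column
print(df.isnull().sum())

# Fill the missing values with the mode of each affected column
for col in ['Age', 'Embarked']:
    df[col] = df[col].fillna(df[col].mode()[0])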

2. Model building

Before building the scoring model, there are two things that you should know first: 1. Weight of Evidence (WOE) and 2. Information Value (IV). We don’t model with the raw values; instead, we are going to transform them into WOE. We then use IV for feature selection.

The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. (Deepanshu, founder of ListenData)

WOE = ln(% of non-events / % of events)

Information value is one of the most useful techniques to select important variables in a predictive model. It helps to rank variables on the basis of their importance. (Deepanshu, founder of ListenData)

IV = Σ (% of non-events − % of events) × WOE

To calculate those values, we can bin (group) our continuous numeric factors, e.g. Age, Passenger fare, or Number of siblings (if you want). Factors that already represent groups with their own values, e.g. Ticket class and Embarked, can be used as they are. In the pandas library, you can use the qcut() function, as in pandas.qcut(df['factor'], 10, duplicates='drop'), for decile binning.
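As a sketch of how these formulas come together, here is one way to compute the WOE table and total IV for a single binned factor (the helper name woe_iv is my own, and events are taken to be Survived = 1):

import numpy as np
import pandas as pd

def woe_iv(data, factor, target='Survived'):
    # Events are target == 1 (survived); non-events are target == 0
    grouped = data.groupby(factor)[target].agg(['count', 'sum'])
    grouped.columns = ['total', 'events']
    grouped['non_events'] = grouped['total'] - grouped['events']
    # Share of all events / non-events that fall into each bin
    grouped['pct_events'] = grouped['events'] / grouped['events'].sum()
    grouped['pct_non_events'] = grouped['non_events'] / grouped['non_events'].sum()
    # WOE = ln(% of non-events / % of events); an empty bin yields +/- infinity,
    # which is the cue to re-bin (coarse-classing)
    grouped['WOE'] = np.log(grouped['pct_non_events'] / grouped['pct_events'])
    grouped['IV'] = (grouped['pct_non_events'] - grouped['pct_events']) * grouped['WOE']
    return grouped, grouped['IV'].sum()

# Decile binning for a continuous factor, then its WOE table and total IV
df['Age_bin'] = pd.qcut(df['Age'], 10, duplicates='drop')
age_table, age_iv = woe_iv(df, 'Age_bin')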

The primary results might not be good enough (e.g. an infinite IV), because some bins might end up empty when grouping into deciles. You might then consider re-binning, called coarse-classing. The process could be based on quartiles (instead of deciles), the trend of the WOE, the trend of the rate (in our case, the survival rate), or any other expert sense. All of these could be leveraged to perform the analysis. After the coarse-classing, the results should look like:

Factors
Age_bin 0.097745
Embarked 0.119923
Fare_bin 0.625860
Parch_bin 0.089718
Pclass 0.500950
Sex 1.341681
SibSp_bin 0.055999
Name: IV, dtype: float64

As I mentioned earlier, we generally don’t use all factors to develop the model, because a model with too many factors is difficult to maintain. We perform a univariate analysis using an IV cut-off, so some factors are eliminated based on a pre-set threshold. (In our example, no factor has an IV of less than 0.02, so we keep them all! The filter itself is shown after the thresholds below.)

General IV thresholds used for the univariate analysis
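With the per-factor IVs collected into a pandas Series (called iv_series here, an assumed name matching the output above), the cut-off is a one-line filter:

# Keep only factors whose IV clears the pre-set threshold
iv_threshold = 0.02
selected = iv_series[iv_series > iv_threshold].index.tolist()
print(selected)  # in our example, every factor survives the cut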

The next step is to perform the logistic regression. You could split the dataset into training and testing sets; this is to ensure model performance on out-of-sample data (i.e. that over-fitting is not present). Today, we use statsmodels.api to build the model by calling the sm.Logit() function. Our target (Y) is ‘Survived’ passengers, indicated as 1. For our features (X), we first take all factors that passed the univariate analysis step. Be aware that the features are NOT the raw values; instead, we use the WOE values to fit the logistic model.
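A sketch of this step, assuming each selected factor has a matching WOE-transformed column named ‘<factor>_woe’ (the naming convention is mine):

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# WOE-transformed feature columns, e.g. 'Sex_woe', 'Pclass_woe', ...
woe_cols = [c + '_woe' for c in selected]

X_train, X_test, y_train, y_test = train_test_split(
    df[woe_cols], df['Survived'], test_size=0.3, random_state=42)

# Fit the logistic regression on the WOE values, not the raw values
logit_model = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(logit_model.summary())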

**Important** We have to check the model coefficient of each factor (except the intercept). They all have to point in the same direction (i.e. all positive or all negative), because the direction matters when assigning the score. Also, any factor with a p-value greater than 0.05 should be eliminated. Hence, the final model might not contain all of the features.
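These checks can be read straight off the fitted model, for example (a rough sketch; ‘const’ is the intercept name produced by sm.add_constant):

# All coefficients (excluding the intercept) should share one sign,
# and every p-value should be below 0.05
params = logit_model.params.drop('const')
pvalues = logit_model.pvalues.drop('const')

same_direction = (params > 0).all() or (params < 0).all()
weak = pvalues[pvalues > 0.05].index.tolist()
print('Same direction:', same_direction)
print('Factors to consider dropping:', weak)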

Below is my final model. It can be seen that all feature coefficients point in the positive direction and their p-values are less than 0.05.

This is what the final model looks like

How to assign the score?

The final goal for today has arrived. We are now ready to calculate the scorecard. What we need as inputs are the logistic regression equation (the model coefficients) and the WOE values. Together, these give you values on the “logit” scale, also called “log odds”.

ln(odds) = ln(p / (1 − p)) = β₀ + β₁·WOE₁ + … + βₙ·WOEₙ

Next, we need to convert the logit into the point system of our scorecard. This methodology is called “scaling”. The main assumption of the method is that the log odds and the scores are related linearly (as can be seen from the formula above). There are 3 main scorecard setup variables:

  • Target score is the baseline score to which the scale is anchored (you could set whatever you want; it doesn’t affect model performance at all)
  • Target odds is the specified odds ratio at that baseline score
  • Points to double the odds is the number of points the score increases by whenever the odds double

Let’s set up these variables:

target_score = 600
target_odds = 30
pts_double_odds = 20

Now, we are good to calculate the score for each passenger by applying the formulas below:

  • Factor = pts_double_odds / ln(2)
  • Offset = target_score − Factor × ln(target_odds)
With these, we now have the final formula for the scorecard calculation: Score = Offset + Factor × ln(odds).
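A minimal sketch tying it together, assuming the survival odds p / (1 − p) from the fitted model are what we scale:

import numpy as np

target_score = 600
target_odds = 30
pts_double_odds = 20

factor = pts_double_odds / np.log(2)               # ~28.85 points per unit of ln(odds)
offset = target_score - factor * np.log(target_odds)

# Log odds per passenger, recovered from the model's predicted probabilities
prob = logit_model.predict(sm.add_constant(df[woe_cols]))
df['score'] = offset + factor * np.log(prob / (1 - prob))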

Result

I applied this to both the training and testing datasets. Below is what the score distributions look like. It can be seen that the final model gives quite similar results on both.

Score distributions of the training and testing datasets

We could do something like finding the passenger with the maximum score and looking at what her characteristics were:

The result is passenger ID 330, with a score of 599.96. She was a 16-year-old female at the time, held a first-class ticket, and embarked at the pier in Cherbourg.
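For reference, that lookup is just an idxmax over the score column computed above:

# Passenger with the highest score and her characteristics
best = df.loc[df['score'].idxmax()]
print(best[['PassengerId', 'Sex', 'Age', 'Pclass', 'Embarked', 'score']])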

Before ending

You can see that it’s not that difficult to develop a scoring model. With different types of data, you could leverage the same logic and add more advanced statistical techniques to build a fancier (I mean, more accurate) model.

For example, when building the B-Score model (or even the A-Score), you might face issues such as having too many features, so that a univariate analysis by IV alone is not enough. What you need then is a bivariate or multivariate analysis (e.g. correlation or clustering) to perform feature selection.

I will finish here. If you have any questions, please let me know. You can contact me over LinkedIn below. See you there!

Update

The full code related to this post can be found in the notebook above.
