Machine Learning Credit Risk Modelling: A Supervised Learning Approach, Part 1

Wibowo Tangara
5 min read · Jan 19, 2024


Part 1: Understanding The Data

Business Understanding

Credit risk refers to the potential that a borrower may fail to meet their financial obligations, such as repaying a loan or meeting interest payments. It is a fundamental component of lending and financial services, and understanding and managing credit risk is crucial for banks, financial institutions, and lenders.

Assessing credit risk helps a financial institution manage and minimize the probability of false positives (i.e., lending money to someone who cannot repay) and false negatives (i.e., rejecting a credit application from someone who can actually repay).
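To make the two error types concrete, here is a minimal sketch using invented labels, following the article's convention that the positive class means "can repay" (so a false positive lends to a defaulter and a false negative rejects a good borrower):

```python
# Hypothetical labels: 1 = borrower can repay, 0 = borrower defaults.
actual    = [1, 0, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 0, 0, 1, 1, 0, 1]

# False positive: predicted "can repay" but the borrower defaults
# (money is lent and lost).
false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

# False negative: predicted "defaults" but the borrower can repay
# (a profitable application is rejected).
false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(false_positives, false_negatives)  # 1 1
```

The costs of the two errors are usually asymmetric, which is why credit models are evaluated on more than plain accuracy.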

In this project we will build, train, and evaluate a machine learning model to assess credit risk, using Python with Jupyter Notebook as the IDE.

Load and Exploring The Data

The first step is to load the data we will use to build the model and inspect the shape of the data itself. The data is stored locally in CSV format, and we will write a few Python helper functions to make the raw data easier to understand.

import pandas as pd              
pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)


Import the pandas library and set display options to show a maximum of 1000 columns and 1000 rows when displaying a DataFrame.

def inspect_data(df, col=None, n_rows=5):
    print(f'data shape: {df.shape}')

    if col is None:
        col = df.columns

    display(df[col].head(n_rows))


Define a support function named inspect_data that takes a DataFrame (df), an optional list of columns (col), and an optional number of rows (n_rows) as input parameters. This function prints the shape of the DataFrame and displays the first n_rows of the specified columns using the display function.

def check_missing(df, cut_off=0, sort=True):
    freq = df.isnull().sum()
    percent = df.isnull().sum() / df.shape[0] * 100
    types = df.dtypes
    unique = df.apply(pd.unique).to_frame(name='Unique Values')['Unique Values']
    unique_counts = df.nunique(dropna=False)
    df_miss = pd.DataFrame({'missing_percentage': percent,
                            'missing_frequency': freq,
                            'types': types,
                            'count_value': unique_counts,
                            'unique_values': unique})
    if sort:
        df_miss.sort_values(by='missing_frequency', ascending=False, inplace=True)
    return df_miss[df_miss['missing_percentage'] >= cut_off]


Define another support function named check_missing that takes a DataFrame (df), a missing data cutoff percentage (cut_off), and a sorting parameter (sort) as input parameters. This function calculates and returns a DataFrame with information about missing values, including the percentage of missing values, frequency of missing values, data types, count of unique values, and unique values.

freq = df.isnull().sum(): Calculates the frequency of missing values for each column in the DataFrame and stores the result in the freq variable.

percent = df.isnull().sum() / df.shape[0] * 100: Computes the percentage of missing values for each column and stores it in the percent variable.

types = df.dtypes: Retrieves the data types of each column in the DataFrame and stores them in the types variable.

unique = df.apply(pd.unique).to_frame(name='Unique Values')['Unique Values']: Extracts unique values for each column and stores them in the unique variable. This step ensures that unique values are obtained, especially for non-numeric columns.

unique_counts = df.nunique(dropna=False): Counts the number of unique values in each column, including missing values, and stores the result in the unique_counts variable.

df_miss = pd.DataFrame({…}): Creates a new DataFrame df_miss using the collected information about missing values, including missing percentages, frequencies, data types, unique value counts, and unique values.

if sort: df_miss.sort_values(by='missing_frequency', ascending=False, inplace=True): Sorts the df_miss DataFrame by the frequency of missing values in descending order if the sort parameter is set to True.

return df_miss[df_miss['missing_percentage'] >= cut_off]: Returns a subset of the df_miss DataFrame, containing only rows where the missing percentage is greater than or equal to the specified cut_off value. This cut-off can be adjusted to filter out columns with a certain level of missing data.
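To illustrate the cut_off parameter, here is the same function applied to a toy DataFrame (column names and values invented for the example); only columns whose missing percentage meets the cut-off survive:

```python
import pandas as pd
import numpy as np

def check_missing(df, cut_off=0, sort=True):
    freq = df.isnull().sum()
    percent = df.isnull().sum() / df.shape[0] * 100
    types = df.dtypes
    unique = df.apply(pd.unique).to_frame(name='Unique Values')['Unique Values']
    unique_counts = df.nunique(dropna=False)
    df_miss = pd.DataFrame({'missing_percentage': percent,
                            'missing_frequency': freq,
                            'types': types,
                            'count_value': unique_counts,
                            'unique_values': unique})
    if sort:
        df_miss.sort_values(by='missing_frequency', ascending=False, inplace=True)
    return df_miss[df_miss['missing_percentage'] >= cut_off]

# Toy data: 'income' is 50% missing, 'age' is complete.
toy = pd.DataFrame({'age': [25, 40, 31, 58],
                    'income': [50000, np.nan, np.nan, 72000]})

# With cut_off=25, only columns missing at least 25% of values remain.
report = check_missing(toy, cut_off=25)
print(report.index.tolist())  # ['income']
```

With the default cut_off=0, every column would appear in the report, which is handy for a first look at a new dataset.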

df = pd.read_csv('your_data_here.csv')


Load a CSV file (replace your_data_here with the raw data file) into a DataFrame variable df.

inspect_data(df)


Call the inspect_data function to display information about the loaded DataFrame df, including its shape and the first 5 rows of data (default behavior).

import pandas as pd
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

def inspect_data(df, col=None, n_rows=5):
    print(f'data shape: {df.shape}')

    if col is None:
        col = df.columns

    display(df[col].head(n_rows))

def check_missing(df, cut_off=0, sort=True):
    freq = df.isnull().sum()
    percent = df.isnull().sum() / df.shape[0] * 100
    types = df.dtypes
    unique = df.apply(pd.unique).to_frame(name='Unique Values')['Unique Values']
    unique_counts = df.nunique(dropna=False)
    df_miss = pd.DataFrame({'missing_percentage': percent,
                            'missing_frequency': freq,
                            'types': types,
                            'count_value': unique_counts,
                            'unique_values': unique})
    if sort:
        df_miss.sort_values(by='missing_frequency', ascending=False, inplace=True)
    return df_miss[df_miss['missing_percentage'] >= cut_off]

df = pd.read_csv('loan_data.csv')

inspect_data(df)


The complete code block is shown above.

The output is shown above when you run the block.

check_missing(df)


Call the check_missing function to display information about the missing values in the loaded DataFrame df.

The output is shown above when you run the function.

df.describe()


Call the describe() method: This method is called on the DataFrame df to generate descriptive statistics of the numeric (quantitative) columns in the DataFrame. It provides various summary statistics, including measures of central tendency, dispersion, and shape of the distribution.

The output of df.describe() will include the following statistics for each numeric column in the DataFrame:

  • Count: Number of non-null values.
  • Mean: Mean (average) value.
  • std: Standard deviation, a measure of the amount of variation or dispersion of a set of values.
  • min: Minimum value.
  • 25%: First quartile or 25th percentile.
  • 50%: Median or 50th percentile.
  • 75%: Third quartile or 75th percentile.
  • max: Maximum value.

Keep in mind that .describe() only provides statistics for numeric columns by default. If your DataFrame contains non-numeric columns and you want to include them in the summary, you can use df.describe(include='all').
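A quick sketch of the difference, on a toy DataFrame (column names invented for the example):

```python
import pandas as pd

# Toy DataFrame with one numeric and one non-numeric column.
toy = pd.DataFrame({'amount': [100, 250, 175],
                    'grade': ['A', 'B', 'A']})

# Default: only the numeric column is summarised.
print(toy.describe().columns.tolist())               # ['amount']

# include='all' adds object columns, with extra rows such as
# 'unique', 'top', and 'freq' for categorical data.
print(toy.describe(include='all').columns.tolist())  # ['amount', 'grade']
```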

The output is shown above.

df.duplicated().any()


Lastly, we call the duplicated() method to check whether any row is duplicated. The output will be True or False; in this case the output is False, indicating there are no duplicated rows in the DataFrame.
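If the check had returned True, a common next step is drop_duplicates(). A small sketch on toy data (values invented for the example):

```python
import pandas as pd

# Toy DataFrame with one exact duplicate row.
toy = pd.DataFrame({'id': [1, 2, 2],
                    'amount': [100, 250, 250]})

print(toy.duplicated().any())    # True: at least one row repeats

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(len(deduped))              # 2
```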

This concludes the first part. We will continue this project in the second part: Defining The Label and Making Target Column.


You can also visit my GitHub public repository for this project below.

github repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-1/
