Machine Learning Credit Risk Modelling: A Supervised Learning Approach, Part 2

Wibowo Tangara
6 min readJan 19, 2024


Part 2: Defining The Label and Making Target Column

Part 1: Understanding The Data


What is Label and Target Column

In machine learning, the concept of a label and a target column is crucial for supervised learning tasks. Let’s break down these terms:

Label:

  • In machine learning, a label refers to the output or the dependent variable that the model is trying to predict.
  • It represents the “answer” or the expected outcome for each input example in the dataset.
  • In classification tasks, the label is often a categorical variable, indicating the class or category to which an input belongs.
  • In regression tasks, the label is a continuous variable, representing a quantity or a numerical value.

Target Column:

  • The target column is the specific column in your dataset that contains the labels or the values you are trying to predict.
  • It is also known as the dependent variable or response variable.
  • When you’re working with a dataset, you typically have features (independent variables) and a target column.
  • The features are the input variables used by the model to make predictions, and the target column is what the model aims to predict.
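To make the distinction concrete, here is a minimal sketch in pandas (the column names here are hypothetical, chosen only for illustration):

```python
import pandas as pd

# A toy dataset with two feature columns and one target column
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "int_rate": [10.5, 13.2, 9.8],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
})

# Features (independent variables) are every column except the target...
X = df.drop(columns=["loan_status"])

# ...and the target column holds the labels the model should predict.
y = df["loan_status"]
```

This `X`/`y` split is the shape that most supervised-learning libraries expect as input.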

Based on these terms, we will identify the column in the dataset that contains the information we want our model to predict. This column becomes our target column, or label.

Defining The Label and Making Target Column

In the first part, after exploring the data, we saw the names of all the columns. Based on expert judgment, or subject matter expertise (SME), we will use the loan_status column as the label from which to build the target.

df.loan_status.value_counts()


The code above counts the occurrences of each unique value in the ‘loan_status’ column.

The output shows that there are nine unique values, with the number of occurrences listed next to each.

df.loan_status.value_counts(normalize=True)*100


We can also check the percentage of occurrences of each unique value in the label column.
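As a toy illustration of what `value_counts(normalize=True)` returns (made-up values, not the real distribution of this dataset):

```python
import pandas as pd

# A made-up label series, just to show the mechanics
s = pd.Series(['good', 'good', 'bad', 'good'])

# normalize=True returns fractions; multiplying by 100 gives percentages
pct = s.value_counts(normalize=True) * 100
# → good: 75.0, bad: 25.0
```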

Based on subject matter expertise (SME), we will treat the label column as follows:

The loan_status column will be our label for building the target:

  • rows with a loan status of ‘Fully Paid’ or ‘Does not meet the credit policy. Status:Fully Paid’ will be labeled ‘good’
  • rows with a ‘Current’ loan status will be dropped, since we cannot yet determine whether they are good or bad
  • rows with any other loan status will be labeled ‘bad’

df = df[df['loan_status'] != 'Current']


This drops every row in the dataset whose loan_status value is ‘Current’.

import numpy as np


This imports the NumPy library under the alias ‘np’, the common convention.

conditions = [
    (df['loan_status'].isin(['Fully Paid', 'Does not meet the credit policy. Status:Fully Paid'])),
    (df['loan_status'].isin(['Charged Off', 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)',
                             'Default', 'Does not meet the credit policy. Status:Charged Off']))
]


Define Conditions: The conditions list contains two conditions, each corresponding to a different category of loan status:

  • The first condition identifies instances where the loan status is either ‘Fully Paid’ or ‘Does not meet the credit policy. Status:Fully Paid’.
  • The second condition identifies instances where the loan status is any of the specified categories indicating a problematic situation (‘Charged Off’, ‘Late (31-120 days)’, ‘In Grace Period’, ‘Late (16-30 days)’, ‘Default’, ‘Does not meet the credit policy. Status:Charged Off’).

values = ['good','bad']


The values list contains two corresponding categorical labels: ‘good’ for the first condition and ‘bad’ for the second condition.

Choosing whether to treat a target column as categorical or numerical depends on the nature of the problem you are trying to solve and the characteristics of the data. Here are some considerations:

Nature of the Problem:

  • Classification Tasks: If your problem involves predicting categories or classes, your target variable is likely categorical. For example, predicting whether an email is spam or not spam, classifying images into different categories, or predicting the outcome of a loan application (approved or denied) are all classification tasks where the target is categorical.
  • Regression Tasks: If your goal is to predict a numerical value or a continuous variable, then your target variable is continuous, and you are dealing with a regression problem.

Data Characteristics:

  • Categorical Data: If the target variable represents categories without a specific order (e.g., colors, types of fruit), it is categorical. In this case, you might use techniques like one-hot encoding to represent the categories numerically.
  • Ordinal Data: If the categories have a meaningful order, but the differences between them are not well-defined, the target variable is still categorical but may have an ordinal nature. For example, a survey question with responses like “Low,” “Medium,” and “High” is ordinal.
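For example, a nominal column can be one-hot encoded with pandas; a sketch on made-up data:

```python
import pandas as pd

# A hypothetical nominal column: categories with no inherent order
s = pd.Series(["red", "green", "red", "blue"], name="color")

# One-hot encoding turns each category into its own 0/1 indicator column
encoded = pd.get_dummies(s, prefix="color")
# columns: color_blue, color_green, color_red
```

For ordinal data, an explicit ordered mapping (e.g. Low=0, Medium=1, High=2) is usually preferred over one-hot encoding, since it preserves the order.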

Statistical Analysis:

  • Categorical variables are often used in statistical analyses like logistic regression for classification tasks.
  • Nominal variables are those without an inherent order, while ordinal variables have a meaningful order. The choice may depend on the statistical method used.

Algorithm Requirements:

  • Some machine learning algorithms are designed to handle categorical data directly, while others may require numerical input. In the case of categorical targets, you might need to encode them appropriately.

Interpretability:

  • Depending on the problem and the audience, it might be easier to interpret and communicate results when the target variable is categorical.

In summary, the choice between treating a target column as categorical or numerical depends on the problem context, the nature of the data, and the requirements of the chosen machine learning algorithm. Understanding the characteristics of your target variable helps guide your choice and informs the preprocessing steps you may need to take before building and training your machine learning model.

In this case we treat the target column as categorical, and we will convert it to a numerical representation at the model evaluation stage.
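One common way to do that conversion is a simple mapping; a sketch, assuming a ‘good’/‘bad’ label column like the one this article builds:

```python
import pandas as pd

# Hypothetical labels like those produced by the labeling step
labels = pd.Series(['good', 'bad', 'good', 'good'], name='loan_label')

# Map the two categories to 0/1 so numerical metrics can be computed
numeric = labels.map({'good': 0, 'bad': 1})
```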

df['loan_label'] = np.select(conditions, values, default='Unknown')


After running the code above, the DataFrame df will have a new column ‘loan_label’ that categorizes loans as either ‘good’ or ‘bad’ based on the specified conditions.
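As a self-contained sketch of how np.select behaves here (a shortened condition list on a toy frame; the ‘Current’ row is included only to show the default branch):

```python
import numpy as np
import pandas as pd

# Toy frame: two statuses from the article's lists plus one matching neither
df = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current']})

conditions = [
    df['loan_status'].isin(['Fully Paid',
                            'Does not meet the credit policy. Status:Fully Paid']),
    df['loan_status'].isin(['Charged Off', 'Default']),
]
values = ['good', 'bad']

# Rows matching neither condition receive the default value
df['loan_label'] = np.select(conditions, values, default='Unknown')
# → ['good', 'bad', 'Unknown']
```

In the article's pipeline the ‘Current’ rows were already dropped beforehand, so the default ‘Unknown’ should not appear in practice.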

import numpy as np

conditions = [
    (df['loan_status'].isin(['Fully Paid', 'Does not meet the credit policy. Status:Fully Paid'])),
    (df['loan_status'].isin(['Charged Off', 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)',
                             'Default', 'Does not meet the credit policy. Status:Charged Off']))
]

values = ['good', 'bad']

df['loan_label'] = np.select(conditions, values, default='Unknown')


The full code block is shown above.

df.shape

df['loan_label'].value_counts(normalize=True)*100

df.loan_label.value_counts()

The three calls above check the shape of the DataFrame and then the percentage and count of each value in the new target column.

After generating the target column and checking the dataset, we have a DataFrame of 242,059 rows and 76 columns, of which 186,727 rows (77.14%) carry the ‘good’ value in the target column loan_label and 55,332 rows (22.86%) carry the ‘bad’ value.

This concludes the second part. We will continue this project in the third part: Feature Engineering and Selection.


You can also visit my public GitHub repository for this project below.

github repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-2/
