Machine Learning Credit Risk Modelling: A Supervised Learning Approach. Part 3

Wibowo Tangara
9 min read · Jan 22, 2024


Part 3: Feature Engineering and Selection

Previous part: Part 2, Defining The Label and Making Target Column (Medium.com)

What Are Feature Engineering and Feature Selection, and Why Are They Important?

Feature Engineering:

Feature engineering involves creating new features or modifying existing features in a dataset to improve the performance of machine learning models. It’s a crucial step in the data preprocessing phase. The goal is to provide the model with more meaningful and relevant information, helping it to better understand patterns and relationships within the data.

Feature engineering can include tasks such as:

  • Creating Interaction Terms: Combining two or more features to capture potential interactions between them.
  • Polynomial Features: Introducing higher-order terms to capture non-linear relationships between features.
  • Handling Missing Values: Imputing or removing missing values in a way that aligns with the characteristics of the data.
  • Normalization and Scaling: Scaling numerical features to a common scale to prevent certain features from dominating others.
  • Encoding Categorical Variables: Converting categorical variables into numerical representations, such as one-hot encoding.
  • Time-Based Features: Extracting features related to time, day of the week, or seasonality in time series data.

Feature engineering requires a good understanding of the domain and the problem at hand. Well-engineered features can significantly enhance a model’s ability to learn and generalize.
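
As a small illustration (a sketch, not taken from the credit dataset used later in this series; the column names here are hypothetical), a couple of these tasks look like this in pandas:

import pandas as pd

data = pd.DataFrame({
    'income': [4200.0, 3100.0, 5800.0],
    'loan_amount': [12000.0, 9000.0, 20000.0],
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
})

# interaction/ratio term: debt burden relative to income
data['loan_to_income'] = data['loan_amount'] / data['income']

# one-hot encoding of a categorical variable
data = pd.get_dummies(data, columns=['home_ownership'], prefix='home')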

Feature Selection:

Feature selection involves choosing a subset of the most relevant features from the original set of features. The goal is to reduce dimensionality and eliminate features that do not contribute significantly to the predictive performance of the model. Feature selection methods can be categorized into three types:

  • Filter Methods: These methods evaluate the relevance of features based on statistical measures and ranking. Common metrics include correlation, chi-squared, and mutual information.
  • Wrapper Methods: These methods assess feature subsets by training and evaluating the model with different combinations of features. Recursive Feature Elimination (RFE) is an example of a wrapper method (a short sketch follows this list).
  • Embedded Methods: These methods incorporate feature selection as part of the model training process. For example, decision trees and LASSO regression automatically perform feature selection during training.
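
To make this concrete, here is a minimal sketch of a wrapper method, RFE with a logistic-regression estimator, using scikit-learn on toy data; it is an illustration, not part of this project's code:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# toy data standing in for a real feature matrix X and label vector y
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# recursively eliminate features until only the 4 most useful remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)   # boolean mask marking the selected features
print(selector.ranking_)   # rank 1 indicates a selected feature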

Why Feature Engineering and Feature Selection are Important:

  • Improved Model Performance: Well-engineered features can expose underlying patterns in the data, leading to better model performance.
  • Reduced Overfitting: Feature selection helps to eliminate irrelevant or redundant features, reducing the risk of overfitting and improving a model’s ability to generalize to new, unseen data.
  • Faster Training and Inference: Removing irrelevant features can lead to faster model training and prediction times, especially important in real-time applications.
  • Enhanced Interpretability: By selecting the most important features, models become more interpretable, making it easier to understand and explain the factors influencing predictions.

In summary, feature engineering and feature selection are critical steps in the machine learning pipeline. They contribute to model robustness, interpretability, and overall performance. The choice of which techniques to apply depends on the specific characteristics of the data and the goals of the machine learning task.

Feature Engineering and Feature Selection

In this part, where needed, we will:

  • drop columns with a high proportion of missing values (20% or more)
  • drop columns in which every value is unique
  • drop columns that have only 1 unique value
  • drop columns that contain free-text values
  • drop the column the label is based on (the loan_status column)
  • drop columns with 1 dominant category
  • drop numerical columns that are highly correlated with other columns (excluding the target column)
  • drop categorical columns with high cardinality
  • drop other columns we won’t be using in the model (expert judgement/subject-matter expertise)
  • manipulate values in the features, including handling missing values

Scaling and Encoding will be conducted in the next part.

# keep only columns with at least 80% non-missing values (i.e. drop columns with more than 20% missing)
threshold = len(df) * 0.8
df = df.dropna(axis=1, thresh=threshold)


This code removes columns from the DataFrame df that have more than 20% (1 − 0.8) missing values. The threshold is set on the assumption that columns with a large share of missing values are unlikely to be informative or useful for the analysis, and removing them simplifies the dataset and can improve the quality of the data for further processing or modelling.

# drop columns in which every row has a distinct value (e.g. identifiers), since they carry no general pattern
unique_cols = [col for col in df.columns if df[col].nunique() == len(df)]
df = df.drop(columns=unique_cols)


The overall purpose of this code is to identify and remove columns from the DataFrame df where every value is unique across all rows. These columns are considered to be less informative and might be dropped to simplify the dataset and potentially improve the efficiency of subsequent analyses or modeling efforts.

# drop columns that contain only a single unique value, since they add no information
single_value_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=single_value_cols)


The overall purpose of this code is to identify and remove columns from the DataFrame df where every value is the same across all rows, as these columns do not contribute meaningful information to the dataset. This step can be particularly useful in data preprocessing to clean up the dataset and improve the efficiency of subsequent analyses or modeling efforts.

df = df.drop(columns=['desc'])


For columns with free-text values we use the drop method to remove them; as we saw in the first part of this project, the desc column contains free text. Note: this line is commented out in the Python code because the column has already been dropped due to its high proportion of missing values.

df = df.drop(columns=['loan_status'])


Why do we remove the label column?

Removing the label column (the target variable) from the feature set during preprocessing is generally done to separate the features (independent variables) from the target variable (dependent variable) before training a machine learning model. This separation serves a few purposes:

Model Training:

During the training phase, the model needs to learn patterns and relationships within the data. The target variable is what the model is trying to predict, so it’s kept separate during training to ensure the model is not learning from the target variable itself.

Supervised Learning Setup:

In supervised learning, the goal is to train a model to make predictions based on input features. The target variable is the output or label we want the model to predict. By separating the target variable from the features, you create a clear distinction between what the model is trying to predict and the information it uses for prediction.

Preventing Data Leakage:

Including the target variable in the feature set could lead to data leakage, where the model inadvertently learns patterns that won’t generalize well to new, unseen data. The model should not have direct access to the target variable during training; otherwise, it might simply memorize the target values rather than learning underlying patterns.
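
In the usual workflow (a sketch rather than the exact code of this notebook), the label is copied into its own variable before being dropped from the features, and the data is then split so that evaluation happens on rows the model never saw during training:

from sklearn.model_selection import train_test_split

# keep the label in its own variable before removing it from the feature table
y = df['loan_status']                  # the label; kept aside before dropping the column
X = df.drop(columns=['loan_status'])   # features only; the model never sees the label directly

# hold out a test set so performance is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)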

# print the value distribution of object (categorical) columns where a single category exceeds 80% of rows
for col in df.select_dtypes(include='object').columns.tolist():
    value_counts_percentage = df[col].value_counts(normalize=True) * 100
    if any(value_counts_percentage > 80):
        print(value_counts_percentage)
        print('\n')


This code is intended to identify and print out the percentage distribution of values for columns where any single category constitutes more than 80% of the total values in that column. This can be useful for detecting highly imbalanced or dominated categorical columns in a dataset.

The output is shown above. As we can see, there is only one column (pymnt_plan) in which a single value is dominant (above 80%).

# drop object (categorical) columns where a single category exceeds the 80% threshold
for col in df.select_dtypes(include='object').columns.tolist():
    value_counts_percentage = df[col].value_counts(normalize=True) * 100
    if any(value_counts_percentage > 80):
        df = df.drop(columns=col)


To drop columns with 1 dominant category automatically, we modify the previous code by replacing the print call with the pandas drop method.

import matplotlib.pyplot as plt
import seaborn as sns

# pairwise correlations between the numerical columns (numeric_only avoids errors on object columns in newer pandas)
correlation_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='Spectral', center=0, annot_kws={'size': 6})
plt.title('Correlation Heatmap')
plt.show()


The code above uses the matplotlib and seaborn libraries to generate a visual representation of the correlation between the numerical variables in the DataFrame df as a heatmap. The colors in the heatmap indicate the strength and direction of each correlation, making it easy to identify patterns and relationships in the data.

The heatmap produced can be seen above. The strength of a correlation is typically assessed using the absolute value of the correlation coefficient. The correlation coefficient ranges from -1 to 1, where:

  • 1: Perfect positive correlation
  • 0: No correlation (uncorrelated)
  • -1: Perfect negative correlation

The closer the absolute value of the correlation coefficient is to 1, the stronger the correlation. Commonly used thresholds for interpreting the strength of correlation are:

  • 0.00 to 0.19: Very weak correlation
  • 0.20 to 0.39: Weak correlation
  • 0.40 to 0.59: Moderate correlation
  • 0.60 to 0.79: Strong correlation
  • 0.80 to 1.00: Very strong correlation

It’s important to note that correlation does not imply causation. Even if two variables have a strong correlation, it doesn’t necessarily mean that changes in one variable cause changes in the other. Correlation only measures the strength and direction of a linear relationship between two variables.
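
Rather than reading the pairs off the heatmap by eye, a small helper can list the feature pairs whose absolute correlation exceeds the 0.4 threshold used below. This is a sketch, not code from the original notebook; it only reports candidates, and deciding which column of each pair to keep still relies on subject-matter expertise:

import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """List feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-correlations
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pd.DataFrame(pairs, columns=['feature_1', 'feature_2', 'abs_correlation'])

# example usage on the current DataFrame:
# correlated_pairs(df).sort_values('abs_correlation', ascending=False)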

# drop columns that are strongly correlated with another column, keeping one representative (e.g. loan_amnt) from each group
df = df.drop(columns=['funded_amnt','funded_amnt_inv','installment','total_pymnt','total_pymnt_inv','total_rec_prncp',
                      'total_rec_int','last_pymnt_amnt','collection_recovery_fee','out_prncp_inv','open_acc'])


We drop the columns that have a correlation coefficient > 0.4 or < -0.4 with another column, keeping only one column from each correlated group (i.e. we drop ‘funded_amnt’, ‘funded_amnt_inv’, ‘installment’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’, ‘total_rec_int’ and ‘last_pymnt_amnt’, and keep ‘loan_amnt’). The decision of which column to keep is based on SME (subject-matter expertise).

The result, shown above, is that no remaining column has a correlation coefficient > 0.4 or < -0.4 with any other column.

At this point, if we check the data shape, we find that we have 242059 rows and 32 columns. Next we will remove the columns that have high cardinality (I use 32 as the threshold for high cardinality in categorical columns) and remove a column based on SME (the sub_grade column).
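
As a quick check (a sketch, not from the original notebook), the high-cardinality candidates can be listed by counting the distinct values in each remaining object column against that threshold:

# list categorical (object) columns whose number of distinct values exceeds the threshold
cardinality_threshold = 32
high_cardinality_cols = [
    col for col in df.select_dtypes(include='object').columns
    if df[col].nunique() > cardinality_threshold
]
print(high_cardinality_cols)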

# drop high-cardinality categorical columns (identifiers, dates, locations) plus sub_grade, an SME-based decision
df = df.drop(columns=['emp_title','last_pymnt_d','earliest_cr_line','last_credit_pull_d','title','addr_state','zip_code',
                      'issue_d','sub_grade'])


As a result, we now have 242059 rows and 23 columns, of which 9 columns still contain missing values.

# columns that still contain missing values, grouped by type
categorical_columns = ['emp_length']
numerical_columns = ['revol_util','collections_12_mths_ex_med','inq_last_6mths','acc_now_delinq','delinq_2yrs','total_acc',
                     'pub_rec','annual_inc']

# impute categorical columns with the mode (most frequent value)
for col in categorical_columns:
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)

# impute numerical columns with the median (robust to outliers)
for col in numerical_columns:
    median_value = df[col].median()
    df[col] = df[col].fillna(median_value)


This code handles the missing values in our dataset. We first list the columns that still have missing values, then impute the categorical columns with the mode, because this preserves the overall distribution of the categorical data and is suitable when values are assumed to be missing at random. For the numerical columns, we impute with the median, because the median is less sensitive to extreme values (outliers).

If we check again at this point, we have 242059 rows and 23 columns with no missing values.
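
A quick way to verify this (a sketch, not from the original notebook):

# confirm the shape and that no missing values remain
print(df.shape)                  # expected: (242059, 23)
print(df.isnull().sum().sum())   # expected: 0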

This concludes the third part. We will continue this project in the fourth part: Feature Scaling and Encoding.


You can also visit my public GitHub repository for this project below:

github repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-3/
