Machine Learning Credit Risk Modelling: A Supervised Learning. Part 4

Wibowo Tangara
6 min read · Jan 23, 2024


Part 4: Feature Scaling and Encoding

Part 3: Feature Engineering and Selection


Feature Scaling

Feature scaling is a preprocessing step in machine learning that standardizes or normalizes the range of independent variables or features of the dataset. The goal is to bring all features to a similar scale, preventing some features from dominating others and ensuring that the model can learn more effectively. Two common methods for feature scaling are:

  • Min-Max Scaling (Normalization): Scales the values in a feature to a range between 0 and 1.
  • Standardization (Z-score Scaling): Scales the values in a feature to have a mean of 0 and a standard deviation of 1.

Feature scaling is essential for algorithms that rely on distances between data points, such as k-nearest neighbors or clustering algorithms, and for optimization algorithms like gradient descent, which converge more efficiently when features are on a similar scale.
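To make the two methods concrete, here is a minimal sketch using scikit-learn on a toy column (the income name and its values are illustrative, not from the project's dataset):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# An illustrative toy feature, not from the credit risk data
toy = pd.DataFrame({'income': [20_000, 35_000, 50_000, 120_000]})

# Min-max scaling: (x - min) / (max - min), values land in [0, 1]
minmax = MinMaxScaler().fit_transform(toy)

# Standardization: (x - mean) / std, values get mean 0 and std 1
zscore = StandardScaler().fit_transform(toy)

print(minmax.ravel())  # [0.   0.15 0.3  1.  ]
print(zscore.ravel())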

Feature Encoding

Feature encoding is the process of converting categorical data into a numerical format that can be fed into machine learning algorithms. Categorical data consists of categories or labels and does not have a natural numerical representation. Two common methods for feature encoding are:

  • One-Hot Encoding: Represents each category as a binary vector, creating a new binary column for each category, where 1 indicates the presence of that category and 0 its absence. Suitable for nominal categorical variables (categories with no inherent order).
  • Label Encoding: Assigns a unique integer to each category, converting categories into numerical labels. Suitable for ordinal categorical variables (categories with a meaningful order).

The choice between one-hot encoding and label encoding depends on the nature of the categorical variable and the requirements of the machine learning algorithm. One-hot encoding is often preferred for nominal variables to avoid introducing false ordinal relationships.
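As a small illustration of the difference (the home_ownership column and its categories are made up for this example, not taken from the project's data):

import pandas as pd

toy = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT']})

# One-hot encoding: one binary column per category; drop_first=True
# drops one level to avoid the dummy variable trap
print(pd.get_dummies(toy['home_ownership'], drop_first=True))

# Label encoding via category codes: note the implied order
# MORTGAGE(0) < OWN(1) < RENT(2), which is meaningless for this
# nominal variable; this false ordinality is what one-hot avoids
print(toy['home_ownership'].astype('category').cat.codes.tolist())  # [2, 1, 0, 2]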

Both feature scaling and feature encoding are crucial preprocessing steps to ensure that machine learning models can effectively learn from the data. The specific techniques chosen depend on the characteristics of the dataset and the requirements of the chosen machine learning algorithm.

Process of Feature Encoding and Feature Scaling

These are the steps we will carry out in this part:

  • Numerical columns will be scaled using StandardScaler.
  • Categorical columns will be encoded using one-hot encoding.
  • A DataFrame for machine learning modeling will be developed.

# Identify the categorical (object-dtype) columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()

# One-hot encode them; drop_first=True drops one level per variable
onehot = pd.get_dummies(df[categorical_cols], drop_first=True)

# Remove the encoded target column so it does not remain among the features
onehot = onehot.drop(columns=['loan_label_good'])


This code prepares the categorical data for machine learning by first identifying the categorical columns, then one-hot encoding them, and finally dropping the resulting 'loan_label_good' column. That last drop removes the encoded target from the feature set, since 'loan_label' is the variable we want to predict and should not leak into the features. The drop_first=True parameter on the second line drops the first level of each categorical variable to avoid the dummy variable trap (perfect multicollinearity among the dummies), a common practice when one-hot encoding, especially for algorithms like linear regression. The resulting onehot DataFrame can later be concatenated with the scaled numerical columns to create a dataset suitable for machine learning models that require numerical input.
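A quick way to sanity-check the result (the expected shape is taken from this article; the exact column names depend on the categories present in the data):

# The encoded frame should come out as 242,059 rows x 38 columns
print(onehot.shape)

# Each dummy column is named <original_column>_<category>
print(onehot.columns.tolist()[:5])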

The reasons we use one-hot encoding for categorical columns:

  • Preventing Ordinal Relationships: One-hot encoding eliminates any ordinal relationship that may be incorrectly inferred by the algorithm from the original categorical values. In other words, it treats each category as independent and avoids introducing unintended ordinal relationships.
  • Handling Nominal Categories: For nominal categorical variables (categories without inherent order), one-hot encoding is particularly useful. It ensures that all categories are treated equally and prevents the algorithm from misinterpreting the nominal categories as having an ordinal relationship.
  • Avoiding Misinterpretation as Numeric Values: Without one-hot encoding, some algorithms may incorrectly interpret categorical variables with numeric labels as having a meaningful numeric relationship. One-hot encoding avoids this misinterpretation.

The onehot DataFrame will have 242,059 rows and 38 columns.

from sklearn.preprocessing import StandardScaler

# Numerical columns: everything except the categorical columns and the target
numerical_cols = [col for col in df.columns.tolist() if col not in categorical_cols + ['loan_label']]

# Instantiate the scaler
ss = StandardScaler()

# Fit on the numerical columns and transform them to mean 0, std 1
std = pd.DataFrame(ss.fit_transform(df[numerical_cols]), columns=numerical_cols)


The explanation for each line of code above:

  • Imports the StandardScaler class from scikit-learn, a preprocessing utility for standardizing numerical features.
  • Defines a list of numerical columns by excluding the categorical columns and the target column 'loan_label' from the full list of columns in the DataFrame df.
  • Instantiates a StandardScaler object, which will be used to scale the numerical features.
  • Applies the fit_transform method of the StandardScaler to the numerical columns (df[numerical_cols]). fit_transform computes the mean and standard deviation of each numerical column and scales the values based on those statistics.

After this code segment, the std DataFrame contains the standardized (scaled) values of the numerical features. Standardization ensures that the numerical features have a mean of 0 and a standard deviation of 1, making them more suitable for certain machine learning algorithms, particularly those that rely on distance-based calculations (e.g., k-nearest neighbors, support vector machines). Standardization helps ensure that features are on a similar scale, preventing some features from dominating others during the learning process.
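As a quick sanity check, added here for illustration, every scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Means should be ~0 and standard deviations ~1. (pandas' .std()
# uses ddof=1 while StandardScaler divides by n, so the printed
# values will be very close to, but not exactly, 1.)
print(std.mean().round(4))
print(std.std().round(4))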

The reasons we use StandardScaler for numerical columns:

  • Normalization of Scale: StandardScaler transforms the numerical features in such a way that they have a mean of 0 and a standard deviation of 1. This brings all the features to a common scale, preventing features with larger scales from dominating those with smaller scales.
  • Improving Model Convergence: Many machine learning algorithms, especially those that involve gradient descent optimization, converge faster when the input features are on a similar scale. Standardizing features using StandardScaler can accelerate the convergence of these algorithms.
  • Equal Weight to Features: StandardScaler ensures that all numerical features contribute equally to the model’s learning process. Without standardization, features with larger scales may have a disproportionately larger impact on the model.
  • Enhancing Interpretability: In some models, interpretability is important. Standardizing features makes it easier to interpret the model coefficients, as they represent the change in the target variable associated with a one-standard-deviation change in the corresponding feature.
  • Supporting Regularization: Regularization techniques, such as L1 or L2 regularization, penalize large coefficients. StandardScaler can help prevent certain features from dominating the regularization term, making regularization more effective.
  • Centering the Data: Some machine learning models, such as linear regression, are often described under the assumption of centered, well-scaled inputs. StandardScaler satisfies the centering and scale part of such assumptions by transforming the features to have a mean of 0 and a standard deviation of 1; note, however, that it only shifts and rescales values and does not change the shape of a feature's distribution.
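For reference, StandardScaler computes nothing more exotic than the per-column z-score; a manual equivalent (a sketch reusing the df and numerical_cols names from above) would be:

# StandardScaler is equivalent to the z-score with ddof=0
# (dividing by n rather than n - 1)
manual = (df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std(ddof=0)

# manual now matches std up to floating-point error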

The output of fit_transform is a NumPy array, which is converted back into a DataFrame (std) with the original column names, consisting of 242,059 rows and 15 columns.

The last step in this part is to develop the DataFrame for modeling using onehot, std, and the target column.

# Reset all three indices so the rows align positionally when concatenated
df_reset = df.reset_index(drop=True)
onehot_reset = onehot.reset_index(drop=True)
std_reset = std.reset_index(drop=True)

# Combine encoded categoricals, scaled numericals, and the target column
df_model = pd.concat([onehot_reset, std_reset, df_reset[['loan_label']]], axis=1)


Resetting Index:

  • df_reset = df.reset_index(drop=True): Resets the index of the original DataFrame df. The drop=True parameter discards the old index and replaces it with a new one starting from 0, ensuring that the row numbers are sequential and will line up across the DataFrames (illustrated after this list).
  • onehot_reset = onehot.reset_index(drop=True): Resets the index of the one-hot encoded DataFrame onehot in a similar manner.
  • std_reset = std.reset_index(drop=True): Resets the index of the standardized DataFrame std in a similar manner.
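To see why this matters, here is a small sketch (the DataFrames a and b are made up for illustration) of what happens when pd.concat meets mismatched indices:

import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'y': [4, 5, 6]}, index=[5, 6, 7])

# concat along axis=1 aligns on the index: with disjoint indices we
# get 6 rows padded with NaN instead of 3 side-by-side rows
print(pd.concat([a, b], axis=1))

# After resetting both indices, the rows line up positionally
print(pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1))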

Developing DataFrame for Machine Learning Modeling:

df_model = pd.concat([onehot_reset, std_reset, df_reset[['loan_label']]], axis=1):

  • Uses pd.concat to concatenate the three DataFrames (onehot_reset, std_reset, and a subset of df_reset containing the ‘loan_label’ column) along the columns (axis=1).
  • The resulting DataFrame (df_model) is intended for use in machine learning modeling.
  • It combines the one-hot encoded categorical variables, the standardized numerical variables, and the target variable (‘loan_label’).

The purpose of resetting the index is to ensure that the row indices are consistent across the three DataFrames when concatenating them side by side, preventing issues related to mismatched indices. The final df_model DataFrame, consisting of 242,059 rows and 54 columns, is now structured for machine learning modeling, with the features and the target variable ready for training and testing.
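A final sanity check (the expected numbers come from this article):

# The shape should match the reported 242,059 rows x 54 columns
print(df_model.shape)

# Had the index resets been skipped, misaligned rows would surface
# here as NaN values after the concatenation
print(df_model.isna().sum().sum())  # expected: 0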

This concludes the fourth part. We will continue this project in the fifth part: Modeling — Train, Test and Evaluate.


You can also visit my public GitHub repository for the project below.

GitHub repository

This article was first published at https://www.xennialtechguy.id/posts/credit-risk-modelling-part-4/
