Building a predictive credit risk analysis model using XGBoost
Click here to read this article in Portuguese.
Credit risk analysis is a key component in maintaining the health of financial institutions’ balance sheets. Keeping the default rate low ensures that the loans being made are profitable. To this end, the use of machine learning to build models capable of identifying patterns and predicting whether a customer is likely to default has intensified.
In this article, I show the path I took to find the best predictive model for this problem: exploratory and statistical analysis (descriptive, diagnostic, and prescriptive) with indicators and charts; data cleaning and treatment; data preparation with standardization (StandardScaler) and balancing (undersampling); treatment of categorical variables with LabelEncoder and dummy variables; creation of a function to build and evaluate models; construction of models with 7 different algorithms for comparison; optimization of XGBoost hyperparameters using cross-validation (StratifiedKFold) and grid search (GridSearchCV); feature engineering; and performance evaluation with confusion matrices and the Area Under the Curve (AUC) and Recall metrics. Finally, a z hypothesis test is used to show that one model performs better than the other.
* Note
This is a complete study report, including the code and methodology used. I have also published a shorter, more direct version that presents only the main results of this research.
To check the summarized article click here.
Summary
- About the Project
- General Objective
2.1. Specific Objectives
- Obtaining the Data
- Variable Dictionary
- Data and Library Importation
- Exploratory Data Analysis
- Data Cleaning and Processing
7.1. Remove attributes
7.2. Remove missing data in target_default
7.3. Correct data type errors in email
7.4. Handle missing data
7.5. Change the data type of variables
7.6. Remove outliers in income
- Data Preparation
8.1. Handle categorical variables
8.2. Split the target class from the independent classes
8.3. Training and test sets
8.4. val_model function
8.5. Baseline model
8.6. Data standardization and balancing
- Performance Assessment Metrics
- Creating Machine-Learning Models
- Why XGBoost?
- XGBoost: Hyperparameter Optimization
- Feature Engineering
13.1. Remove variables
13.2. Create new variables
13.2.1. Extract information from lat_lon
13.2.2. Extract information from shipping_state
13.3. Data Preparation
13.3.1. Handle categorical variables
13.3.2. Split the target class from the independent classes
13.3.3. Training and test set
13.3.4. Baseline model
13.3.5. Data standardization and balancing
13.4. Creating Machine-Learning Models
- XGBoost: Feature Engineering + Hyperparameter Optimization
- XGBoost Model Comparison
15.1. Confusion Matrix
15.2. AUC
15.3. Recall
- Hypothesis Test
- Conclusion
1. About the Project
Credit analysis in financial institutions is crucial to assess whether a loan borrower has the potential to fulfill the contract or if they might become delinquent, that is, fail to pay back the loan. This is because the decision to lend or not will directly impact the financial institution’s balance sheet, potentially leveraging the results or significantly harming it.
This credit evaluation usually involves analyzing the customer’s credit history, such as whether they have previously defaulted or not, and if they have assets that could serve as collateral, among various other factors that aim to measure their ability to meet their obligations to return the granted amount.
When a customer becomes delinquent, that is, fails to pay their debt, they are considered in default. Due to the risk that defaults pose to financial institutions, to the entire financial system, and even to a country’s economy, there is increasing investment in artificial intelligence models that predict potential defaults and minimize them, avoiding greater losses.
These models seek to identify patterns and trends in historical data that are not obvious to the human eye, and thus, they can predict future behavior with greater accuracy, such as in our case, what the risk would be of a client becoming delinquent, or a default. In this way, losses are avoided and profits for lending institutions are increased. It also makes the entire system more resilient.
One of the financial institutions that has been investing in this type of tool, partly because it was created to provide credit to people who previously did not have access to such products at other banks, is the Brazilian fintech NuBank. For this reason, both its data science team and its uses of artificial intelligence have become a reference in this area. As a result, the bank has managed to keep its delinquency rate lower than traditional banks, according to a 2022 report published by UBS BB.
With all these factors in mind, this study, which is based on data made available by Nubank in a competition it held to reveal talents, will aim to create an artificial intelligence model capable of predicting whether a client is likely to become a default or not.
2. General Objective
Develop a machine-learning model that predicts whether a new customer is likely to default.
2.1. Specific Objectives
- Conduct an exploratory data analysis to get to know the dataset and extract insights that may assist in later stages.
- Create and evaluate machine-learning models with various types of supervised algorithms.
- Optimize the hyperparameters of the XGBoost algorithm to seek better model performance.
- Perform feature engineering to improve the default prediction of the model created with XGBoost.
- Evaluate the two models produced and determine which one performs best.
3. Obtaining the Data
The data used in this project were originally made available by Nubank. As a precaution, in case the original is removed, the complete dataset has been saved in the cloud and can be accessed at this link.
4. Variable Dictionary
Understanding the dataset involves checking the available variables in it so that a good analysis can be performed. Although there is no official documentation on the meaning of each variable, it was possible to infer the meaning of some of them, according to the records in the data frame. Thus, they were separated into 2 different tables: with and without inferred meaning.
In alphabetical order:
- application_time_applied: time when the request was applied (string) (HH:MM:SS)
- application_time_in_funnel: customer position in the sales funnel at the time of application (int)
- credit_limit: credit limit (float)
- email: customer email provider (string)
- external_data_provider_fraud_score: fraud score (int)
- facebook_profile: does the person have a Facebook profile? Yes (True)/No (False) (string)
- ids: unique customer identification (string)
- income: customer income (float)
- last_amount_borrowed: total value of the last loan/credit granted (float)
- last_borrowed_in_months: when the last loan was made (in months) (float)
- lat_lon: customer location (latitude and longitude) (string — tuple)
- marketing_channel: channel through which the customer made the application (string)
- n_defaulted_loans: number of defaulted loans (float)
- profile_phone_number: phone number (string)
- reported_income: income value reported by the client (float)
- risk_rate: risk score (float)
- shipping_state: card delivery location (country and state acronym) (string)
- shipping_zip_code: zip code for sending the card (int)
- target_default: target variable, indicates whether the customer defaulted (True) or not (False) (string)
- target_fraud: target variable related to fraud detection (string)
- user_agent: type of device the user is using (PC, cell phone, brand, operating system, etc.) (string)
Other variables:
- channel: string (anonymized)
- external_data_provider_credit_checks_last_2_year: float
- external_data_provider_credit_checks_last_month: int
- external_data_provider_credit_checks_last_year: float
- external_data_provider_email_seen_before: float
- external_data_provider_first_name: string
- job_name: string (anonymized)
- n_accounts: float
- n_bankruptcies: float
- n_issues: float
- ok_since: float
- profile_tags: string (dictionary)
- real_state: string (anonymized)
- reason: string (anonymized)
- score_1: string (anonymized)
- score_2: string (anonymized)
- score_3: float
- score_4: float
- score_5: float
- score_6: float
- state: string (anonymized)
- zip: string (anonymized)
5. Data and Library Importation
When starting a project, it’s necessary to install packages, import libraries that have specific functions to be used in the following lines of code, and make the necessary configurations for the code’s output. Additionally, you proceed with importing the dataset, saving it in a specific variable for later use.
# install additional packages
!pip install scikit-plot -q # data visualization
!pip install scipy -q # statistical analysis
# import libraries
import pandas as pd # data manipulation
import numpy as np # array manipulation
import missingno as msno # missing data evaluation
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
import scikitplot as skplt # data visualization and machine-learning metrics
from sklearn.impute import SimpleImputer # handling missing values
from sklearn.preprocessing import LabelEncoder # categorical data transformation
from sklearn.model_selection import train_test_split # split into training and test sets
from sklearn.pipeline import make_pipeline # pipeline construction
from sklearn.model_selection import cross_val_score # performance assessment by cross-validation
from sklearn.preprocessing import StandardScaler # data standardization
from imblearn.under_sampling import RandomUnderSampler # data balancing
from sklearn.model_selection import StratifiedKFold # performance assessment with stratified data
from sklearn.model_selection import GridSearchCV # creating grid to evaluate hyperparameters
from sklearn.metrics import classification_report # performance report generation
from sklearn.metrics import roc_auc_score # performance evaluation by AUC
from sklearn.metrics import recall_score # recall performance assessment
from scipy.stats import norm # statistical analysis (normal distribution)
# data classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings # notifications
warnings.filterwarnings('ignore') # set warnings to be ignored
# additional settings
plt.style.use('ggplot')
sns.set_style('dark')
np.random.seed(123)
# configure the output to show all rows and columns
pd.options.display.max_columns = None
# import data sets and save them into variables
data_path = "http://dl.dropboxusercontent.com/s/xn2a4kzf0zer0xu/acquisition_train.csv?dl=0"
df_raw = pd.read_csv(data_path)
6. Exploratory Data Analysis
This is an essential step in data science projects where the goal is to better understand the data, whether by identifying patterns, outliers, possible relationships between variables, etc. In this study, information that was relevant to guide the responses to the previously indicated objectives (see General Objective and Specific Objectives) was explored.
For this purpose, various techniques and tools will be used, such as graphs, frequency tables, and statistical data, among other methods that are deemed necessary. In this phase, the data scientist becomes a detective in search of things that are not explicit in the data frame. For this reason, the data will also be plotted in different ways, to better visualize them and test initial hypotheses, to obtain insights that may guide the rest of the project.
First, I generated a view of the first 5 and the last 5 entries to check the composition of the dataset and to verify that there were no inappropriate records at the end of it, such as total sums.
# print the 5 first entries
df_raw.head()
# print the 5 last entries
df_raw.tail()
Upon first inspection of this dataset, one can notice that the data has been anonymized, which is very common nowadays to comply with the General Data Protection Law (known as LGPD in Brazil), as well as international legislation, when applicable.
Furthermore, we can observe, in addition to some missing values:
- ids probably represents the unique identification of a client; therefore, it is an attribute that should be removed, as it does not contain any relevant information for the construction of the machine-learning model and, moreover, it could even hinder the development of the model if the algorithm manages to relate it to other data in some way.
- target_default indicates, in boolean values, the risk of default for that specific client, that is, it is our target variable. True points to a client who became delinquent, while False indicates that the client did not default.
- Next, we have 6 attributes named score, with the first two (score_1 and score_2) in string format, while the others (score_3, score_4, score_5, and score_6) are of float type. From their values and information, it was not possible to infer an initial meaning.
- risk_rate, of float type, indicates some risk score.
- last_amount_borrowed, also of float type, shows the total value of the last loan/credit granted. In this attribute, it is possible to see missing values, which may indicate that the person did not take out any loan or use their credit limit.
- last_borrowed_in_months, also of float type, shows when the last loan/credit was granted, in months. Here too, missing values may indicate that the person did not take out any loan or use their credit limit.
- In the credit_limit attribute, of float type, we have the value of the credit limit. In this attribute, it is possible to see missing values, as well as values equal to 0, which means that the person has no credit limit to be granted.
- income, of float type, is the client's income.
- facebook_profile, boolean, indicates whether the person has a profile on the social network Facebook (True) or not (False). In this attribute, it is possible to see missing values, which may indicate that the field was not filled out and, therefore, it can be inferred that the person does not have a profile on the network.
- The variables reason, state, zip, channel, job_name (with missing values), and real_state, of string type, were anonymized.
- ok_since (with missing values), n_bankruptcies, n_defaulted_loans, n_accounts, and n_issues (with missing values) are in float format.
- application_time_applied refers to the time the client applied for the request, in (HH:MM:SS) format.
- application_time_in_funnel is in int format and indicates the client's position in the sales funnel at the time of application.
- email, of string type, indicates the client's email provider.
- external_data_provider_credit_checks_last_2_year (with missing values, float type), external_data_provider_credit_checks_last_month (int type), external_data_provider_credit_checks_last_year (with missing values, float type), external_data_provider_email_seen_before (float type), external_data_provider_first_name (string type), and external_data_provider_fraud_score (int type) indicate information captured from external providers to fill this database, with the last one (fraud_score) seeming to indicate the Serasa score, since its range is from 0 to 1000 points.
- lat_lon, of string type, indicates the client's location as a tuple containing latitude and longitude.
- marketing_channel, of string type, shows through which channel the client made the application.
- profile_phone_number, of string type, informs the client's phone number.
- reported_income, of float type, is the income reported by the client.
- shipping_state, of string type, shows where the card should be sent to the client; it has 5 characters, with the first 2 letters being the country code, a "-" separator, and two letters informing the state code.
- shipping_zip_code, of int type, refers to the postal code for sending the card.
- profile_tags, of string type, is in dictionary format.
- user_agent, of string type, seems to indicate the type of device the user is using (whether it's a PC, mobile phone, brand, operating system, among others).
- target_fraud (with missing values) appears to be a target variable for fraud detection.
The next step is to know the size of this dataset.
# check the data set size
print('Dataset Size')
print('-' * 30)
print('Total entries: \t {}'.format(df_raw.shape[0]))
print('Total attributes:\t {}'.format(df_raw.shape[1]))
'''
Dataset Size
------------------------------
Total entries: 45000
Total attributes: 43
'''
Let’s take a closer look at these 43 variables. The goal will be to understand the type of variable present in these attributes, check for missing values, data distribution, outliers, etc.
# generate data frame information
df_raw.info()
With the output above, we can confirm that there are variables with missing data and that some have the incorrect variable type:
- target_default and facebook_profile are of string type but should be boolean; due to the presence of missing data, this treatment should be done before the type conversion.
- application_time_applied is in string format and could be converted to datetime (HH:MM:SS); however, Scikit-Learn does not accept Timestamp formats. Therefore, this attribute will be discarded.
In addition to these changes, it was observed that it would be possible to apply feature engineering to the following variables:
- In lat_lon, the information is joined in a tuple. We could separate this information, as well as reduce the number of decimal places to approximate the location (see the sketch after this list).
- The separation of information can also be done in shipping_state, which contains the country and state abbreviation separated by a hyphen.
- In user_agent, the information related to the brand and model of the device used for the connection could be segregated into distinct attributes, that is, brand and model.
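As a rough illustration of what this splitting could look like, the sketch below uses plain pandas string operations on a hypothetical sample that follows the formats described above (the actual feature engineering is done later, in the Feature Engineering section; the values here are made up for the example).
# hypothetical sample following the formats described above
import pandas as pd
sample = pd.DataFrame({
    'lat_lon': ['(-23.5505, -46.6333)', '(-22.9068, -43.1729)'],
    'shipping_state': ['BR-SP', 'BR-RJ'],
})
# split 'lat_lon' into two numeric columns and round to approximate the location
coords = sample['lat_lon'].str.strip('()').str.split(',', expand=True)
sample['lat'] = coords[0].astype(float).round(2)
sample['lon'] = coords[1].astype(float).round(2)
# split 'shipping_state' into country and state code
sample[['country', 'state_code']] = sample['shipping_state'].str.split('-', expand=True)
print(sample[['lat', 'lon', 'country', 'state_code']])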
As observed, we have a reasonable amount of missing values. Let’s check this in more detail by printing the amount (in percentage) of missing data in each attribute, organized in descending order, from highest to lowest.
# check amount of missing data
print(((df_raw.isnull().sum() / df_raw.shape[0]) * 100).sort_values(ascending=False).round(2))
At the top of the column above are the attributes with the highest amount of missing data. That is, target_fraud, last_amount_borrowed, last_borrowed_in_months, ok_since, and external_data_provider_credit_checks_last_2_year have more than half of their data missing. Some of these can be improved with a data processing treatment; for example, in last_amount_borrowed and last_borrowed_in_months, a missing value may simply mean that the client did not take out any loan and, therefore, there would also be no value filled in for the amount of this loan, since it never occurred.
For the variables external_data_provider_credit_checks_last_year, credit_limit, and n_issues, the amount of missing data represents between 25 and 35% of the data.
In our target variable target_default, we have 7.24% of missing data, which will need to be excluded from the dataset for the modeling of the predictive algorithm.
Therefore, in general, it is necessary to point out the need to treat the missing values.
Let’s visualize the amount of missing data for each attribute below, to facilitate the understanding of the quality of this dataset.
# print chart to check for missing data
msno.bar(df_raw, figsize=(10,4), fontsize=8);
Next, it’s worth looking at the number of unique entries in each variable.
# check unique input values for each attribute
df_raw.nunique().sort_values()
From top to bottom, in the output above, we have that:
- external_data_provider_credit_checks_last_2_year and channel have only one input value; therefore, as we do not have access to the original dictionary of attributes, I will remove these attributes since they will not add information to the machine-learning model.
- The attributes score_4, score_5, score_6, and profile_phone_number each have 45,000 unique entries, meaning every record has a distinct value that does not add information for the algorithm. In the case of the phone number this makes sense, that is, each person has a different number; therefore, as it does not add information, this attribute will be removed. As for the other score variables, since they are numerical and probably result from mathematical calculations and/or have undergone some type of normalization, these float variables will be kept.
Now, for the attributes that have up to 5 unique values, I will check to confirm that there are no inappropriate values that have been entered erroneously. These attributes will be: target_fraud, target_default, external_data_provider_credit_checks_last_year, facebook_profile, last_borrowed_in_months, external_data_provider_credit_checks_last_month, n_defaulted_loans, and real_state.
# create variable with attributes with up to 5 unique entries
five_unique = df_raw[['target_fraud', 'target_default',
'external_data_provider_credit_checks_last_year',
'facebook_profile', 'last_borrowed_in_months',
'external_data_provider_credit_checks_last_month',
'n_defaulted_loans', 'real_state']]
# generate result for each attribute
for i in five_unique:
print('{}'.format(i), df_raw[i].unique())
'''
target_fraud [nan 'fraud_friends_family' 'fraud_id']
target_default [False True nan]
external_data_provider_credit_checks_last_year [ 0. nan 1.]
facebook_profile [True False nan]
last_borrowed_in_months [36. nan 60.]
external_data_provider_credit_checks_last_month [2 1 3 0]
n_defaulted_loans [ 0. 1. nan 2. 3. 5.]
real_state ['N5/CE7lSkAfB04hVFFwllw==' 'n+xK9CfX0bCn77lClTWviw=='
'nSpvDsIsslUaX6GE6m6eQA==' nan 'UX7AdFYgQh+VrVC5eIaU9w=='
'+qWF9pJpVGtTFn4vFjb/cg==']
'''
It can be noted that the only inconsistent values in the variables checked above are NaN, that is, missing values.
Now, I will check a statistical summary of these attributes, as it provides important data on the mean, median, maximum, and minimum values, standard deviation, as well as the quartile values.
# see statistical summary of numerical data
df_raw.describe().round(4)
Once again, in the count line, we confirm the presence of missing values in some attributes. Moreover, in an individual analysis, what stood out the most was:
- In last_amount_borrowed, we see that the minimum value is 1,000 reais, while the maximum reached 35,000 reais.
- In last_borrowed_in_months, as seen previously, we have only 2 unique values filled here, which might not make much sense for these rounded statistical values, but the explanation lies in this fact.
- In credit_limit, we may have outlier values; however, it is common for banks to grant higher credit to people with more resources, so there should not be many clients with a maximum credit of almost 450,000 reais, while the average is about 35,000 reais. It is also noted that the median, at approximately 25,000 reais, is even lower than the average, which shows that there is indeed a larger number of people with a limit below the 35,000-reais average, pulling the median down to 25,000. Furthermore, outliers can also be perceived through the value of the 3rd quartile, which confirms that 75% of the dataset has a credit limit of up to 47,000 reais.
- In income, we can confirm the presence of clients with large differences in income. We can see this directly from the standard deviation and the maximum and minimum values. That is, we have a deviation of 52,000, the minimum value is 4,800 reais, and the maximum is 5 billion, which distorts the average of 716,000; therefore, the median gives us a more appropriate value, which is 61,000 reais.
- In external_data_provider_credit_checks_last_2_year, we have only values equal to zero, which coincides with what was seen previously: this attribute has a single unique value.
- In external_data_provider_email_seen_before, the minimum value is -999 while the maximum value is 59. Therefore, these must be outliers and should be replaced so as not to interfere with the analysis.
- In external_data_provider_fraud_score, we have another point in favor of the hypothesis that it is the Serasa score, since its values range from 0 to 1000, just like Serasa's scale. It is worth noting that we have an average of 500 points, a median very close to this value, and the 3rd quartile at 747.
- In reported_income, the maximum value is infinite, which means that we have very high values for this attribute. As this can impair the analysis, it will be necessary to treat these data.
# view statistical summary of categorical data
df_raw.describe(include=['O'])
Regarding the categorical attributes, it is noted that:
- In target_default, the value False is present in about 35,000 of the nearly 42,000 entries, indicating an imbalance in the data.
- Almost half of the dataset has an email registered with the Google domain (gmail.com).
Let’s take a closer look at the credit_limit and income attributes using boxplots.
# configure boxplot for 'credit_limit' and 'income'
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 4))
# 'credit_limit'
sns.boxplot(df_raw.credit_limit, orient='h', showmeans=True,
palette=['#004a8f'], ax=ax1)
ax1.set_title('Credit Limit', loc='left', fontsize=16,
color='#6d6e70', fontweight='bold', pad=20)
ax1.set_yticks([])
# 'income'
sns.boxplot(df_raw.income, orient='h', showmeans=True,
palette=['#004a8f'], ax=ax2)
ax2.set_title('Income', loc='left', fontsize=16,
color='#6d6e70', fontweight='bold', pad=20)
ax2.set_yticks([])
plt.tight_layout();
Once again, the graph above shows the presence of outliers. I see no need to adjust the credit limit data, as they are better distributed. However, in the income attribute, I see the need to remove the 10 highest values from the set, since the boxplot is quite distorted and only a few data points are that far from the rest.
I also want to check the distribution of email domains.
# plot the distribution graph of the 'email' attribute
fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x=df_raw.email)
plt.tight_layout();
We found two values with a filling error: hotmaill.com and gmaill.com, both with an extra letter l. This requires proper treatment as well.
Finally, I will check the balance of the data in target_default.
# representation of the amount of 'target_default' in percentage
print('Total (FALSE): {}'.format(df_raw.target_default.value_counts()[0]))
print('Total (TRUE): {}'.format(df_raw.target_default.value_counts()[1]))
print('-' * 30)
print('The default total represents {:.2f}% of the dataset.'.format(((df_raw.target_default.value_counts()[1]) * 100) / df_raw.shape[0]))
'''
Total (FALSE): 35080
Total (TRUE): 6661
------------------------------
The default total represents 14.80% of the dataset.
'''
As observed and expected, we have a lower value of default compared to the number of non-defaults. That is, most clients paid their debts, while 14.8% of clients defaulted. Because of this, there is a certain imbalance in the data, which should be considered in the data preprocessing phase.
Let’s visualize this proportion in a bar chart.
# plot bar chart
## define axes
x = ['Non-Default', 'Default']
y = df_raw.target_default.value_counts()
## configure bar colors
bar_colors = ['#bdbdbd', '#004a8f']
## plot chart
fig, ax = plt.subplots(figsize=(6, 5))
ax.bar(x, y, color=bar_colors)
### title
ax.text(-0.5, 41000, 'Proportion of Customers', fontsize=20, color='#004a8f',
fontweight='bold')
### subtitle
ax.text(-0.5, 39000, 'Number of customers that were and were not defaulted',
fontsize=8, color='#6d6e70')
### non-default ratio
ax.text(-0.10, 35300, '85.2%', fontsize=12, color="#6d6e70")
### default ratio
ax.text(0.9, 7000, '14.8%', fontsize=14, color="#004a8f", fontweight='bold')
### set borders
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.show()
This confirms the imbalance in the data, and therefore, specific treatment is necessary. The reason is that if a machine-learning model were created with the data in its raw state, it would negatively impact the prediction results since the model would be very good at predicting non-defaults but very poor at predicting defaults. As a consequence, many actual default cases would be considered non-defaults, which are called False Negatives. In other words, the model predicts that a client is not a default when they are! This would go against the objective of this study and the reason for constructing the predictive model we are developing.
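As a toy illustration of this point (the numbers below are made up and only mirror the roughly 85/15 proportion above), a "model" that always predicts non-default would look very accurate while catching no defaults at all:
# toy illustration: 85% non-defaults, 15% defaults, and a "model" that always predicts non-default
from sklearn.metrics import accuracy_score, recall_score
y_true = [0] * 85 + [1] * 15
y_pred = [0] * 100
print('Accuracy: {:.2f}'.format(accuracy_score(y_true, y_pred)))  # 0.85 -- looks good
print('Recall:   {:.2f}'.format(recall_score(y_true, y_pred)))    # 0.00 -- misses every default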
7. Data Cleaning and Processing
In the previous section, we observed that the dataset has some issues that need to be addressed. This stage is dedicated to resolving these problems. They are:
- Remove attributes
- Remove missing data in target_default
- Correct data type errors in email
- Handle missing data
- Change the data type of variables
- Remove outliers in income
In addition to these procedures, it will also be necessary to standardize the attributes and balance the dataset. However, this will be carried out in the following section.
Before making changes to the dataset, I will copy it so that the alterations from this point forward are made on this replica. This way, the original dataset is kept intact.
# make a copy of the dataset
df_clean = df_raw.copy()
7.1. Remove attributes
I’ll start by removing the attributes: ids, external_data_provider_credit_checks_last_2_year, channel, profile_phone_number, application_time_applied, and target_fraud.
# remove attributes
df_clean.drop(['ids', 'external_data_provider_credit_checks_last_2_year',
'channel', 'profile_phone_number', 'application_time_applied',
'target_fraud'], axis=1, inplace=True)
# check changes
df_clean.head()
7.2. Remove missing data in target_default
We have noted the presence of missing data in the target_default attribute, and since it is our target variable, it is necessary to remove the missing data so that we can proceed with the modeling of the predictive algorithm.
# remove missing data from 'target_default'
df_clean.dropna(subset=['target_default'], axis=0, inplace=True)
# verify removal by printing the amount of missing data
print(df_clean.target_default.isnull().sum())
'''
0
'''
7.3. Correct data type errors in email
Two typographical errors were found in the email attribute fields (hotmaill.com and gmaill.com). I will replace these values by removing the extra 'l' that both cases have.
# fix type error
df_clean.email.replace('gmaill.com', 'gmail.com', inplace=True)
df_clean.email.replace('hotmaill.com', 'hotmail.com', inplace=True)
# verify
# plot the distribution graph of the 'email' attribute
fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(x=df_clean.email)
plt.tight_layout();
By plotting the graph again, we can confirm that the data has been corrected and check its distribution.
7.4. Handle missing data
In the facebook_profile attribute, we will replace the missing data with False values, based on the assumption that if the value does not exist, it is because the client does not have a Facebook account.
# replace missing values with 'False'
df_clean.facebook_profile.fillna(False, inplace=True)
# verify
print(df_clean.facebook_profile.isnull().sum())
'''
0
'''
For the attributes last_amount_borrowed and last_borrowed_in_months, we infer that if a person has never taken out a loan, it would be natural for the value to not exist, nor the number of months since that last loan, as it never occurred. The same reasoning can be applied to n_issues. Therefore, the null values in these cases will be replaced by zero.
# replace missing values with zero
df_clean.last_amount_borrowed.fillna(0, inplace=True)
df_clean.last_borrowed_in_months.fillna(0, inplace=True)
df_clean.n_issues.fillna(0, inplace=True)
# verify
print(df_clean[['last_amount_borrowed', 'last_borrowed_in_months', 'n_issues']].isnull().sum())
'''
last_amount_borrowed 0
last_borrowed_in_months 0
n_issues 0
dtype: int64
'''
The remaining variables that present null data (ok_since, external_data_provider_credit_checks_last_year, credit_limit, marketing_channel, job_name, external_data_provider_email_seen_before, lat_lon, user_agent, n_bankruptcies, n_defaulted_loans, reason) will be treated according to their type. That is, numerical variables will have their values replaced by the median, and categorical variables will be filled with the most frequent value.
It should be noted that, as seen, the variable reported_income has infinite values, which will be replaced by null values. Similarly, in external_data_provider_email_seen_before there are values of -999 which will also be exchanged for null values. This should be done before the treatment of missing variables, so I will start by doing this.
# handle 'inf' data in 'reported_income' for NaN
df_clean.reported_income = df_clean.reported_income.replace(np.inf, np.nan)
# handle data -999 in 'external_data_provider_email_seen_before' for NaN
df_clean.loc[df_clean.external_data_provider_email_seen_before == -999,
'external_data_provider_email_seen_before'] = np.nan
# verify
df_clean[['reported_income', 'external_data_provider_email_seen_before']].describe()
Above, we see that the inf values in reported_income and the -999 values, which would show up as the minimum value in external_data_provider_email_seen_before, are no longer present.
We can proceed with the second part of the treatment of numerical and categorical data.
# create variables with numeric attributes and another with categorical ones
numeric = df_clean.select_dtypes(exclude='object').columns
categorical = df_clean.select_dtypes(include='object').columns
# numeric variables: replace with the median
subs = SimpleImputer(missing_values=np.nan, strategy='median')
subs = subs.fit(df_clean.loc[:, numeric])
df_clean.loc[:, numeric] = subs.transform(df_clean.loc[:, numeric])
# categorical variables: replace with the most frequent value
subs = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
subs = subs.fit(df_clean.loc[:, categorical])
df_clean.loc[:, categorical] = subs.transform(df_clean.loc[:, categorical])
Finally, we can check whether the objective of no longer having missing data in the dataset was met with the code below.
# check missing data
df_clean.isnull().sum()
7.5. Change the data type of variables
The target_default and facebook_profile variables are of string type but must be converted to boolean type.
# convert 'target_default' and 'facebook_profile' to boolean
df_clean.target_default = df_clean.target_default.astype(bool)
df_clean.facebook_profile = df_clean.facebook_profile.astype(bool)
# verify
print(df_clean[['target_default', 'facebook_profile']].info())
7.6. Remove outliers in income
Let’s remove the 10 highest income values from the dataset. To do this, we first identify these values and then proceed with the removal. We verify the result by plotting the variable’s boxplot again.
# identify the 10 highest values in 'income'
top_10_incomes = df_clean.nlargest(10, 'income')
# remove the 10 highest values in 'income' from the dataset
df_clean = df_clean.drop(top_10_incomes.index)
# verify
# set boxplot to 'income'
fig, ax = plt.subplots(figsize=(10, 3))
sns.boxplot(df_clean.income, orient='h', showmeans=True,
palette=['#004a8f'])
ax.set_title('Income', loc='left', fontsize=16,
color='#6d6e70', fontweight='bold', pad=20)
ax.set_yticks([])
plt.tight_layout();
Now we can have a better visualization of the boxplot for this attribute.
8. Data Preparation
In this stage, we will process the data so that it can be better utilized by machine learning algorithms, thereby generating a more accurate model for credit risk prediction.
This will include:
- Handle categorical variables
- Split the target class from the independent classes
- Training and test sets
- val_model function
- Baseline model
- Data standardization and balancing
I’ll start by making a copy of the df_clean set to distinguish it from the processing that will be done in this section.
# make a copy of the dataset
df_proc = df_clean.copy()
8.1. Handle categorical variables
Next, we will handle the categorical variables. For variables of string type (object or bool), we will use Label Encoding, which will convert categorical data into numbers.
# extract the categorical attributes
cat_cols = df_proc.select_dtypes(['object', 'bool']).columns
# apply LabelEconder to categorical attributes
for col in cat_cols:
df_proc[col+'_encoded'] = LabelEncoder().fit_transform(df_proc[col])
df_proc.drop(col, axis=1, inplace=True)
# check changes
df_proc.head()
# check changes
df_proc.info()
With this change, the treated attributes are now identified by the _encoded suffix added to their names, and all of them now hold integer numeric data.
8.2. Split the target class from the independent classes
We will segregate the data of the target class, that is, the variable target_default_encoded (formerly target_default, now processed with Label Encoding), which is the one we will predict. This ensures that the data does not contain information that could help the algorithm identify default cases. To do this, we first shuffle the records (in case they have some connection that we cannot identify, this relationship is broken so as not to be detected by the model), and then we separate the independent variables into X and the target variable (target_default_encoded) into a variable y.
# shuffle the data
df_shuffled = df_proc.reindex(np.random.permutation(df_proc.index))
# split target class from independent classes
X = df_shuffled.drop('target_default_encoded', axis=1)
y = df_shuffled['target_default_encoded']
# check variable size
print('The independent variables are in X: {} records, {} attributes'.format(X.shape[0], X.shape[1]))
print('The target variable "target_default" is in y:{} records.'.format(y.shape[0]))
'''
The independent variables are in X: 41731 records, 36 attributes
The target variable "target_default" is in y:41731 records.
'''
8.3. Training and test sets
To have a generic model, that is, one that best handles real-world data, it is necessary to divide the dataset into a training set, with which the model will learn, and another, called the test set, which will serve to evaluate the performance of the created model.
This division must occur randomly to avoid biased samples. Moreover, it must be done before balancing the data, so that the test set keeps the original class distribution.
The division of data into Training and Testing has some additional configurations:
- The size was defined as 70:30, which means that the Training set will contain 70% of the total data, while the Testing set will have 30% of the total data set;
- The Training and Testing sets will contain the same classes, proportionally, that is, the same relative proportion of default and non-default records;
- The randomization of the data was activated to mix the records and thus ensure random data in the Training and Testing sets;
- And, a seed value was indicated that will provide the reproduction of the code without changes in the result.
Finally, it should be reinforced that with this division we will only use the test data in the final stage of this project. The intention is to achieve an evaluation closer to what would be obtained with real data.
# split training and testing data
## stratify= y (to divide so that the classes have the same proportion)
## random_state so that the result is replicable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
stratify=y, shuffle=True,
random_state=123)
# check set sizes
print('The training set has {} records'.format(X_train.shape[0]))
print('The test set has {} records.'.format(X_test.shape[0]))
'''
The training set has 29211 records
The test set has 12520 records.
'''
8.4. val_model function
To begin this step, I will first build a function called val_model that constructs a pipeline (workflow) to standardize the data with StandardScaler (since we have outliers in the dataset), classify them (according to the algorithm passed to the function through the clf parameter), and evaluate the model through cross-validation, taking into account the Recall value obtained. The final result of this function is the average Recall value found in the cross-validation.
# build model evaluation function
def val_model(X, y, clf, quite=False):
"""
Performs cross-validation with training data for a given model.
# Arguments
X: Data Frame, contains the independent variables.
y: Series, vector containing the target variable.
clf:scikit-learn classifier model.
quite: bool, indicating whether the function should print the results or not.
# Returns
float, average of cross-validation scores.
"""
# convert variables to arrays
X = np.array(X)
y = np.array(y)
# create pipeline
## 1. standardize data with StandardScaler
## 2. classify the data
pipeline = make_pipeline(StandardScaler(), clf)
# model evaluation by cross-validation
## according to the Recall value
scores = cross_val_score(pipeline, X, y, scoring='recall')
# show average Recall value and standard deviation of the model
if quite == False:
print("Recall: {:.4f} (+/- {:.2f})".format(scores.mean(), scores.std()))
# return the average of the Recall values obtained in cross-validation
return scores.mean()
8.5. Baseline model
To begin this stage of building and evaluating machine learning models, I will construct a baseline model that will serve as a benchmark for evaluating subsequent models. In other words, with the baseline model, we can use it to compare the created models and assess whether there has been an improvement or not.
Moreover, this baseline model will not include hyperparameter tuning, feature engineering, nor will the data be balanced. This is to have the most basic baseline value possible to also check how much it is possible to improve the result.
Using the val_model function, we will generate the baseline model. This will be done based on the Random Forest classifier and, as mentioned, no additional parameters will be set.
# instantiate base model
rf = RandomForestClassifier()
# evaluate model performance with the 'val_model' function
score_baseline = val_model(X_train, y_train, rf)
'''
Recall: 0.0290 (+/- 0.00)
'''
The result shows that with the Random Forest model it is possible to obtain a Recall of 0.0290, and there was no significant standard deviation in the models obtained during cross-validation.
With this, we now have a baseline metric to compare the next models that will be developed and evaluate their respective performances compared to this raw model, which has no adjustments.
With the information on the Recall of the baseline model, i.e., our evaluation metric for the other models, we can move on to the next steps which are: to balance the training set and to check how different classifiers perform in building a model for our problem.
8.6. Data standardization and balancing
As observed, the dataset is imbalanced, and it is necessary to rebalance the data to avoid creating a model with low performance in identifying defaults, as well as to prevent overfitting. This way, we create a good machine-learning model free of bias.
The method used in this case is Undersampling, which reduces the majority class by randomly excluding these data. Thus, the characteristics of the minority class, which in this case are the default data, are preserved. And these are the most important data for solving our problem.
Also, it is necessary to standardize the data so that they are on the same scale. For this, we use the StandardScaler.
These techniques are applied only to the training set so that the characteristics of the test set are not misconfigured.
# instantiate standardization model
scaler = StandardScaler().fit(X_train)
# apply standardization to training data
## only in the independent variables
X_train = scaler.transform(X_train)
# instantiate undersampling model
## random_state so that the result is replicable
rus = RandomUnderSampler(random_state=123)
# apply undersampling to training data
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
# check class balance
print(pd.Series(y_train_rus).value_counts())
'''
0 4663
1 4663
Name: target_default_encoded, dtype: int64
'''
After applying the Undersampling method, we see that both classes now have the same proportion. That is, for 0 (or non-default), we have 4663 records, and for 1 (or default), the same value.
9. Performance Assessment Metrics
To evaluate the performance of the models, the metric that will be used is the Recall value. This is because when dealing with imbalanced data, as is the case with the dataset under study, even though it has been treated and balanced, accuracy is not a good evaluation metric. The reason for this is that one can obtain a very high accuracy result, but the detection of defaults could yield a very low result, which does not lead us to our objective.
Recall is the metric that provides the best measure for the specific problem under study. This is because, in the case of defaults, False Negatives are more harmful to a company than False Positives. In other words, it is preferable for the model to err by indicating that a customer is a default when in reality they are not, rather than failing to identify a customer as a default when they actually are, which would bring losses to the business.
With this in mind, a high Recall rate is sought.
Recall looks at all the actual defaults and aims to answer: how many of them does the model get right? The result will be a value between 0 and 1, with values closer to 1 being better, as they indicate a low rate of False Negatives.
Its calculation is given by Recall = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.
Considering the purpose of this study, another evaluation metric for classification models that can be used is the AUC — Area Under the Curve. It derives from the ROC — Receiver Operating Characteristic — and indicates how well the model can distinguish between two things. In our example, between a default and non-default customer. The value given by AUC ranges from 0 to 1, with values closer to 1 indicating a better model.
Finally, the confusion matrix compares the predicted values with the actual values, showing the model’s errors and correct predictions. Its output has 4 different values, each corresponding to:
Model Correct Predictions
- True Positive: It’s default and the model classifies it as default.
- True Negative: It’s not default and the model classifies it as not default.
Model Errors
- False Positive: It’s not default, but the model classifies it as default.
- False Negative: It’s default, but the model classifies it as not default.
In summary, to evaluate the performance of the models, we will primarily focus on Recall, followed by the confusion matrix and AUC.
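As a minimal illustration of these metrics (the arrays below are hypothetical; the actual models are evaluated later with classification_report, roc_auc_score, and scikit-plot's confusion matrix):
# minimal illustration with hypothetical predictions (1 = default, 0 = non-default)
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print('Recall: {:.2f}'.format(recall_score(y_true, y_pred)))   # TP / (TP + FN)
print('AUC:    {:.2f}'.format(roc_auc_score(y_true, y_pred)))  # with probabilities, pass them instead of hard labels
print(confusion_matrix(y_true, y_pred))                        # rows = actual class, columns = predicted class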
10. Creating Machine-Learning Models
To create and evaluate various machine learning models, we will use the previously created val_model function. The goal is to identify the models that perform best so they can be compared with XGBoost after hyperparameter tuning of that algorithm.
The models that will be created and evaluated in this phase are:
- Random Forest
- Decision Tree
- Stochastic Gradient Descent
- SVC
- Logistic Regression
- XGBoost
- Light GBM
As a reminder, the val_model function will: standardize the data, apply the classifier, perform cross-validation, and return the average Recall value found.
# instantiate the models
rf = RandomForestClassifier()
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()
sgdc = SGDClassifier()
svc = SVC()
lr = LogisticRegression()
xgb = XGBClassifier()
lgbm = LGBMClassifier()
# create lists to store:
## the classifier model
model = []
## the value of the Recall
recall = []
# create loop to cycle through classification models
for clf in (rf, knn, dt, sgdc, svc, lr, xgb, lgbm):
# identify the classifier
model.append(clf.__class__.__name__)
# apply 'val_model' function and store the obtained Recall value
recall.append(val_model(X_train_rus, y_train_rus, clf, quite=True))
# save the Recall result obtained in each classification model in a variable
results = pd.DataFrame(data=recall, index=model, columns=['Recall'])
# show the models based on the Recall value obtained, from highest to lowest
results.sort_values(by='Recall', ascending=False)
It can be seen that the best models, at the top of the table, were LGBMClassifier, XGBClassifier, and RandomForestClassifier, while KNeighborsClassifier was the worst-performing algorithm.
11. Why XGBoost?
XGBoost belongs to the family of supervised ensemble methods built on decision trees. Its name stands for Extreme Gradient Boosting, and it has gained popularity among practitioners due to the high precision and accuracy of the models it produces.
This is partly attributed to the extensive range of hyperparameters that can be tuned, enhancing the model’s performance. As a result, XGBoost can be applied to various problem types, including classification, regression, and anomaly detection, across different industries and sectors.
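To give an idea of that range, the sketch below instantiates XGBClassifier with some of its most commonly tuned hyperparameters; the values are arbitrary placeholders, not the ones found in the next section.
# illustrative only: placeholder values, not the tuned hyperparameters found in the next section
from xgboost import XGBClassifier
xgb_example = XGBClassifier(
    n_estimators=100,      # number of boosted trees
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    max_depth=6,           # maximum depth of each tree
    min_child_weight=1,    # minimum sum of instance weight needed in a child node
    gamma=0.0,             # minimum loss reduction required to make a split
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    random_state=123)
# it is then trained like any scikit-learn classifier, e.g. xgb_example.fit(X_train_rus, y_train_rus)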
12. XGBoost: Hyperparameter Optimization
Given the problem we are trying to solve (see General Objective), as well as the performance of the algorithms in the previous section, XGBoost has the potential to greatly improve its results through adjustments to its hyperparameters. Compared to most of the algorithms above, it has a greater number of hyperparameters, and therefore a higher probability of enhancing its performance through tuning.
Due to the number of hyperparameters that need to be adjusted, it is necessary to find the best value for each of them in separate processes. To start, the learning rate was set at 0.1. The recommendation is that it be between 0.05 and 0.3, according to the problem to be solved.
Let’s search for the best parameter for n_estimators, which is the number of decision trees that should be contained in the algorithm. And we will use a seed value to be able to reproduce the results.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
xgb = XGBClassifier(learning_rate=0.1, random_state=seed)
# define dictionary to find out the ideal amount of trees
# in a range from 0 to 500 with an increment of 50
param_grid = {'n_estimators':range(0,500,50)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: the hyperparameter grid defined above
## scoring: evaluation by Recall
## n_jobs=-1 for parallel search (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'n_estimators'
grid_result = grid_search.fit(X_train_rus, y_train_rus)
# print best parameter found for 'n_estimators'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6625 for {'n_estimators': 100}
'''
The best result was found with 100 decision trees, and with just this hyperparameter, we have already surpassed the LGBMClassifier, which was in first place on the list with a Recall of 0.656231.
With this parameter, as the increment defined was 50, we can refine the search in a smaller range and with smaller steps:
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
xgb = XGBClassifier(learning_rate=0.1, random_state=seed)
# define dictionary to find out the ideal amount of trees
# in a range from 25 to 125 with an increment of 5
param_grid = {'n_estimators':range(25,125,5)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: the hyperparameter grid defined above
## scoring: evaluation by Recall
## n_jobs=-1 for parallel search (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'n_estimators'
grid_result = grid_search.fit(X_train_rus, y_train_rus)
# print best parameter found for 'n_estimators'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6633 for {'n_estimators': 85}
'''
With this new search, the best value found for the number of trees was 85.
With the best parameter for n_estimators set, we can define it in the classifier assignment and proceed to determine the best values for max_depth, which determines the depth of the decision trees, and for min_child_weight, which is the minimum weight of a node required to create a new node.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
## n_estimators: 85
xgb = XGBClassifier(learning_rate=0.1, n_estimators=85, random_state=seed)
# set dictionary to find out:
## the ideal depth in a range of 1 to 7 in increments of 1
## the minimum weight of the node in a range from 1 to 4 in increments of 1
param_grid = {'max_depth':range(1,8,1),
'min_child_weight':range(1,5,1)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: the hyperparameter grid defined above
## scoring: evaluation by Recall
## n_jobs=-1 for parallel search (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'max_depth' and 'min_child_weight'
grid_result = grid_search.fit(X_train_rus, y_train_rus)
# print best parameters found for 'max_depth' and 'min_child_weight'
print("Best results: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best results: 0.6633 for {'max_depth': 6, 'min_child_weight': 1}
'''
The value obtained for the depth (max_depth) was 6, and for min_child_weight it was 1. We have two more parameters to adjust!
We can insert the new parameters found and move on to find the next one: gamma. This parameter defines the complexity of the trees in the model.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
## n_estimators: 85
## max_depth: 6
## min_child_weight: 1
xgb = XGBClassifier(learning_rate=0.1, n_estimators=85, max_depth=6, min_child_weight=1, random_state=seed)
# set dictionary to find out:
## the ideal gamma value in a range of 0.0 to 0.4
param_grid = {'gamma':[i/10.0 for i in range(0,5)]}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: the hyperparameter grid defined above
## scoring: evaluation by Recall
## n_jobs=-1 for parallel search (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'gamma'
grid_result = grid_search.fit(X_train_rus, y_train_rus)
# print best parameter found for 'gamma'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6640 for {'gamma': 0.1}
'''
We continue inserting these values into the classification model assignment and now look for the best value for the learning rate (learning_rate).
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the seed
## n_estimators: 85
## max_depth: 6
## min_child_weight: 1
## gamma = 0.1
xgb = XGBClassifier(n_estimators=85, max_depth=6, min_child_weight=1, gamma=0.1, random_state=seed)
# set dictionary to discover optimal learning rate
param_grid = {'learning_rate':[0.01, 0.05, 0.1, 0.2]}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'learning_rate'
grid_result = grid_search.fit(X_train_rus, y_train_rus)
# print best parameter found for 'learning_rate'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6640 for {'learning_rate': 0.1}
'''
Finally, with the hyperparameters defined, we can train the resulting model and apply it to the test set to check how it would perform on unseen data.
# set a seed for reproducibility
seed = 123
# instantiate the final XGBoost model with the best hyperparameters found
xgb = XGBClassifier(learning_rate=0.1 , n_estimators=85, max_depth=6, min_child_weight=1, gamma=0.1, random_state=seed)
# train the model with training data
xgb.fit(X_train_rus, y_train_rus)
# standardize test data
X_test = scaler.transform(X_test)
# make predictions with test data
y_pred = xgb.predict(X_test)
With the obtained results, I will generate a report containing evaluation metrics and the AUC value. Additionally, I will plot the regular confusion matrix, the normalized confusion matrix, and the AUC curve.
# print assessment metrics report
print('Evaluation Metrics Report'.center(65) + ('\n') + ('-' * 65))
print(classification_report(y_test, y_pred, digits=4) + ('\n') + ('-' * 15))
# print AUC
print('AUC: {:.4f} \n'.format(roc_auc_score(y_test, y_pred)) + ('-' * 65))
# plot graphs
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))
# normalized confusion matrix
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True,
title='Normalized Confusion Matrix',
text_fontsize='large', ax=ax[0])
# confusion matrix
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title='Confusion Matrix',
text_fontsize='large', ax=ax[1])
# AUC
y_prob = xgb.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_prob, title='AUC', cmap='brg', text_fontsize='small', plot_macro=False, plot_micro=False, ax=ax[2])
### print AUC value
auc = (roc_auc_score(y_test, y_pred) * 100).round(2)
ax[2].text(0.3, 0.6, auc, fontsize=14, color='#004a8f', fontweight='bold')
plt.show()
'''
Evaluation Metrics Report
-----------------------------------------------------------------
precision recall f1-score support
0 0.9121 0.6808 0.7796 10522
1 0.2803 0.6547 0.3925 1998
accuracy 0.6766 12520
macro avg 0.5962 0.6677 0.5861 12520
weighted avg 0.8113 0.6766 0.7179 12520
---------------
AUC: 0.6677
-----------------------------------------------------------------
'''
13. Feature Engineering
The feature engineering process involves creating new variables from existing data to improve the performance of a machine learning model.
I left it for this stage so that we can compare the results obtained with and without feature engineering, since this procedure requires domain knowledge on the part of the analyst, as well as time to evaluate the best strategies to apply.
At the end of the Data Cleaning and Processing stage we had the df_clean dataset. Let's go back to its structure:
# check dataset size
df_clean.shape
'''
(41731, 37)
'''
# check first dataset entries
df_clean.head(3)
Remember that this dataset has already had some attributes removed, treatment of missing data, modification of variables, correction of type errors, removal of outliers, etc., which can be checked in the Data Cleaning and Processing stage.
At this stage, we will:
- Remove variables that we did not obtain enough information to interpret
- Create new variables
Finally, the entire data preparation stage will be done to proceed with the creation of machine learning models.
To start, I will make a copy of this dataset to work on it.
# make copy of the dataset
df_fe = df_clean.copy()
13.1. Remove variables
The variables listed below are those whose meaning we could not infer with enough precision; they would require further study, so this time they will be removed from the dataset.
real_state
zip
reason
job_name
external_data_provider_first_name
profile_tags
state
user_agent
# remove attributes
df_fe.drop(['real_state', 'zip', 'reason', 'job_name', 'profile_tags', 'state',
'external_data_provider_first_name', 'user_agent'], axis=1, inplace=True)
# check for changes
df_fe.head(3)
13.2. Create new variables
I will create the following new variables:
1. lat_lon holds latitude and longitude, separated by a comma. I will split this information into two attributes and reduce the number of decimal places to two, since at that precision we are working with distances of about 1 km (Gizmodo); a quick check of this is shown right after the list.
2. shipping_state holds country and state, separated by a hyphen. I will split this information and check the number of distinct values: for the country, if there is only one value the attribute can be removed; for the state, to confirm there are no typing errors.
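A quick sanity check on the rounding choice mentioned in item 1: one degree of latitude corresponds to roughly 111 km, so keeping two decimal places (steps of 0.01 degree) preserves a resolution of about 1.1 km. The snippet below is only an illustrative back-of-the-envelope calculation, not part of the original pipeline.
# approximate length of one degree of latitude, in km
km_per_degree = 111.32
# resolution kept when rounding coordinates to 2 decimal places (0.01 degree)
print(round(0.01 * km_per_degree, 2))
'''
1.11
'''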
13.2.1. Extract information from lat_lon
To extract the information from lat_lon, I will use the str.extract function and place each value into a new attribute, lat and lon. Finally, I will check the type of the new variables.
# extract information from 'lat_lon'
df_fe[['lat', 'lon']] = df_fe['lat_lon'].str.extract(r'\((.*), (.*)\)')
# check type of new attributes
print(df_fe[['lat', 'lon']].dtypes)
'''
lat object
lon object
dtype: object
'''
The new attributes are of string type, so we need to convert them to float. I will also take the opportunity to reduce the number of decimal places to just 2.
# change type from 'lat' and 'lon' to float
df_fe['lat'] = (df_fe['lat'].astype(float)).round(2)
df_fe['lon'] = (df_fe['lon'].astype(float)).round(2)
# check type of new attributes
print(df_fe[['lat', 'lon']].dtypes)
'''
lat float64
lon float64
dtype: object
'''
# check dataset
df_fe.head(3)
Everything is fine, so we just remove the lat_lon attribute so that the information is not duplicated in the dataset.
# remove attribute 'lat_lon'
df_fe.drop(['lat_lon'], axis=1, inplace=True)
# check changes
df_fe.head(3)
13.2.2. Extract information from shipping_state
To extract the country and state information, I will use the str.split function and save the results in two new variables: country and state. At the end, I will check the type of these new attributes.
# extract information from 'shipping_state'
df_fe[['country', 'state']] = df_fe['shipping_state'].str.split('-', expand=True)
# check type of new attributes
print(df_fe[['country', 'state']].dtypes)
'''
country object
state object
dtype: object
'''
# check dataset
df_fe.head(3)
Let’s remove the shipping_state attribute and check the unique values for country and state. If country holds only a single value, we can discard that attribute too, because it adds no information for building the machine-learning model.
# remove attribute 'shipping_state'
df_fe.drop(['shipping_state'], axis=1, inplace=True)
# check unique input values for 'country' and 'state'
df_fe[['country', 'state']].nunique()
'''
country 1
state 25
dtype: int64
'''
As we have only a single value for country, we can exclude it from the dataset without losing information. Additionally, let’s check the values in state.
# remove 'country' attribute
df_fe.drop(['country'], axis=1, inplace=True)
# check values in 'state'
df_fe.state.unique()
'''
array(['MT', 'RS', 'RR', 'RN', 'SP', 'AC', 'MS', 'PE', 'AM', 'CE', 'SE',
'AP', 'MA', 'BA', 'TO', 'RO', 'SC', 'GO', 'PR', 'MG', 'ES', 'DF',
'PA', 'PB', 'AL'], dtype=object)
'''
We can see above that all entries are properly filled in and correspond to the Brazilian states and the Federal District.
# check changes
df_fe.head(3)
13.3. Data Preparation
The data preparation involves the following steps, as we did previously:
- Handle categorical variables
- Split the target class from the independent classes
- Training and test set
- Baseline model
- Data standardization and balancing
And, once again, I will start by copying the dataset with which we concluded the feature engineering stage: df_fe.
# make copy of the dataset
df_fe_proc = df_fe.copy()
13.3.1. Handle categorical variables
Here, I will change the treatment of the variables a bit from what was done previously, to improve the model’s performance. Now, only the boolean variables (bool) will go through the Label Encoding process. The categorical variables (string) will go through the process known as dummy variables, in which each category becomes a new column that takes the value 0 or 1 to indicate the absence or presence of that category.
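To illustrate the difference between the two treatments, here is a minimal, self-contained sketch. The toy column names are invented for the example and are not taken from the dataset.
# toy data frame with one boolean and one categorical column (hypothetical names)
import pandas as pd
from sklearn.preprocessing import LabelEncoder
toy = pd.DataFrame({'has_profile': [True, False, True],
                    'grade': ['A', 'B', 'A']})
# boolean -> a single 0/1 column via Label Encoding
toy['has_profile_encoded'] = LabelEncoder().fit_transform(toy['has_profile'])
toy.drop('has_profile', axis=1, inplace=True)
# categorical -> one indicator column per category via dummy variables
toy = pd.get_dummies(toy, columns=['grade'])
# result: 'has_profile_encoded' plus the indicator columns 'grade_A' and 'grade_B'
print(toy)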
# identify boolean attributes
bol_var = df_fe_proc.select_dtypes(['bool']).columns
# apply LabelEncoder to boolean attributes
for col in bol_var:
    df_fe_proc[col+'_encoded'] = LabelEncoder().fit_transform(df_fe_proc[col])
    df_fe_proc.drop(col, axis=1, inplace=True)
## identify categorical attributes
cat_var = df_fe_proc.select_dtypes(['object']).columns
# apply dummy variables
df_fe_proc = pd.get_dummies(df_fe_proc, columns=cat_var)
# check changes in the data frame
df_fe_proc.head()
Remember that, with this change, the boolean attributes are now identified by the suffix _encoded in their names, the categorical attributes were expanded into one dummy column per category, and all of them are now numeric.
Let’s check the size of the treated and processed data frame.
# check data frame size
df_fe_proc.shape
'''
(41731, 105)
'''
Note that we went from 37 to 105 attributes.
13.3.2. Split the target class from the independent classes
Let’s separate the target class from the independent classes:
# shuffle the data
df_fe_shuffled = df_fe_proc.reindex(np.random.permutation(df_fe_proc.index))
# separate target class from independent classes
X_fe = df_fe_shuffled.drop('target_default_encoded', axis=1)
y_fe = df_fe_shuffled['target_default_encoded']
# check variables size
print('The independent variables are in X: {} records, {} attributes.'.format(X_fe.shape[0], X_fe.shape[1]))
print('Target variable "target_default" is in y: {} records.'.format(y_fe.shape[0]))
'''
The independent variables are in X: 41731 records, 104 attributes.
Target variable "target_default" is in y: 41731 records.
'''
13.3.3. Training and test set
We divide the set into training and testing data.
# split training and testing data
## stratify=y_fe (so that both splits keep the same class proportion)
## random_state so that the result is replicable
X_fe_train, X_fe_test, y_fe_train, y_fe_test = train_test_split(X_fe, y_fe, test_size=0.3,
stratify=y_fe, shuffle=True,
random_state=123)
# check set sizes
print('The training set has {} records.'.format(X_fe_train.shape[0]))
print('The test set has {} records.'.format(X_fe_test.shape[0]))
'''
The training set has 29211 records.
The test set has 12520 records.
'''
13.3.4. Baseline model
As was done previously, I will generate the base model again with Random Forest, to compare the performance of the other algorithms.
# create baseline and check performance
fe_rf = RandomForestClassifier()
# evaluate model performance with the 'val_model' function
fe_score_baseline = val_model(X_fe_train, y_fe_train, fe_rf)
'''
Recall: 0.0513 (+/- 0.00)
'''
A significant improvement can already be seen between the baseline models: without feature engineering the Recall was 0.0290, and now it has nearly doubled, reaching 0.0513!
13.3.5. Data standardization and balancing
Let’s move on to standardizing and balancing the data so that we can train the other models.
# instantiate standardization model
scaler = StandardScaler().fit(X_fe_train)
# apply standardization to training data
## only in independent variables
X_fe_train = scaler.transform(X_fe_train)
# instantiate undersampling model
## random_state so that the result is replicable
fe_rus = RandomUnderSampler(random_state=43)
# apply undersampling to training data
X_fe_train_rus, y_fe_train_rus = fe_rus.fit_resample(X_fe_train, y_fe_train)
# check class balance
print(pd.Series(y_fe_train_rus).value_counts())
'''
0 4663
1 4663
Name: target_default_encoded, dtype: int64
'''
13.4. Creating Machine-Learning Models
I will generate several models again with different classifiers, evaluated by the Recall value. The goal is to compare these results with the previous ones, obtained without feature engineering, and to compare the classifiers among themselves so that we can proceed with the development of the XGBoost model and its hyperparameter optimization.
# instantiate the models
rf_fe = RandomForestClassifier()
knn_fe = KNeighborsClassifier()
dt_fe = DecisionTreeClassifier()
sgdc_fe = SGDClassifier()
svc_fe = SVC()
lr_fe = LogisticRegression()
xgb_fe = XGBClassifier()
lgbm_fe = LGBMClassifier()
# create lists to store:
## the classifier model
fe_model = []
## the value of the Recall
fe_recall = []
# create loop to cycle through classification models
for fe_clf in (rf_fe, knn_fe, dt_fe, sgdc_fe, svc_fe, lr_fe, xgb_fe, lgbm_fe):
    # identify the classifier
    fe_model.append(fe_clf.__class__.__name__)
    # apply 'val_model' function and store the obtained Recall value
    fe_recall.append(val_model(X_fe_train_rus, y_fe_train_rus, fe_clf, quite=True))
# save the Recall result obtained in each classification model in a variable
fe_results = pd.DataFrame(data=fe_recall, index=fe_model, columns=['Recall'])
# show the models based on the Recall value obtained, from highest to lowest
fe_results.sort_values(by='Recall', ascending=False)
Let’s compare:
Note that the performance improved in only half of the trained algorithms; however, where there was an improvement, it was larger than the loss seen in the models that got worse. It is worth noting that XGBoost had a drop of 0.0056, the largest difference among the models that performed worse. Still, in the hyperparameter optimization that follows, we will be able to check whether feature engineering actually contributes anything to the results.
14. XGBoost: Feature Engineering + Hyperparameter Optimization
Here, we will repeat the process of optimizing the hyperparameters of the XGBoost algorithm to find the values that produce the best performances.
I will start by searching for the best value for n_estimators, which is the number of decision trees in the ensemble.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
xgb = XGBClassifier(learning_rate=0.1, random_state=seed)
# define dictionary to find out the ideal amount of trees
# in a range from 0 to 450 in steps of 50
param_grid = {'n_estimators':range(0,500,50)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'n_estimators'
grid_result = grid_search.fit(X_fe_train_rus, y_fe_train_rus)
# print best parameter found for 'n_estimators'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6537 for {'n_estimators': 50}
'''
The best result was found with 50 decision trees. Since the increment used was 50, we can refine the search around this value with smaller steps:
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
xgb = XGBClassifier(learning_rate=0.1, random_state=seed)
# define dictionary to find out the ideal amount of trees
# in a range from 25 to 70 in steps of 5
param_grid = {'n_estimators':range(25,75,5)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'n_estimators'
grid_result = grid_search.fit(X_fe_train_rus, y_fe_train_rus)
# print best parameter found for 'n_estimators'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6620 for {'n_estimators': 25}
'''
With this new search, the best value found for the number of trees was 25. With this adjustment alone, we improved the algorithm’s performance from 0.6427 to 0.6620, a result that would already guarantee second place among the baseline models!
We still have four more hyperparameters to adjust, so let’s proceed.
With the best parameter for n_estimators set, we can fix it in the classifier and proceed to determine the best values for max_depth, which controls the maximum depth of the decision trees, and min_child_weight, which is the minimum sum of instance weights required in a child node for a further split to be made.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
## n_estimators: 25
xgb = XGBClassifier(learning_rate=0.1, n_estimators=25, random_state=seed)
# set dictionary to find out:
## the ideal depth in a range of 1 to 7 in increments of 1
## the minimum weight of the node in a range from 1 to 4 in increments of 1
param_grid = {'max_depth':range(1,8,1),
'min_child_weight':range(1,5,1)}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'max_depth' and 'min_child_weight'
grid_result = grid_search.fit(X_fe_train_rus, y_fe_train_rus)
# print best parameters found for 'max_depth' and 'min_child_weight'
print("Best results: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best results: 0.6637 for {'max_depth': 6, 'min_child_weight': 2}
'''
We can enter the new parameters found and move on to the next one: gamma. This parameter sets the minimum loss reduction required to make a further split, so it controls how conservative (and how complex) the trees in the model can be.
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the learning rate to 0.1 and set the seed
## n_estimators: 25
## max_depth: 6
## min_child_weight: 2
xgb = XGBClassifier(learning_rate=0.1, n_estimators=25, max_depth=6, min_child_weight=2, random_state=seed)
# set dictionary to find out:
## the ideal gamma in a range from 0.0 to 0.4 in increments of 0.1
param_grid = {'gamma':[i/10.0 for i in range(0,5)]}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'gamma'
grid_result = grid_search.fit(X_fe_train_rus, y_fe_train_rus)
# print best parameter found for 'gamma'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6637 for {'gamma': 0.0}
'''
Once again, the best value found was the minimum of the range, so we insert it into the classifier and look for the best value for the learning rate (learning_rate).
# set a seed for reproducibility
seed = 123
# instantiate the XGBoost model
## set the seed
## n_estimators: 25
## max_depth: 6
## min_child_weight: 2
## gamma: 0.0
xgb = XGBClassifier(n_estimators=25, max_depth=6, min_child_weight=2, gamma=0.0, random_state=seed)
# set dictionary to discover optimal learning rate
param_grid = {'learning_rate':[0.01, 0.05, 0.1, 0.2]}
# set up cross validation with 10 stratified folds
# shuffle=True to shuffle the data before splitting and setting the seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# configure the grid search with the XGBoost classifier
## param_grid: grid of parameters to test
## scoring: evaluate the combinations by Recall
## n_jobs=-1 to run the search in parallel (using all available cores)
## cv: cross-validation strategy
grid_search = GridSearchCV(xgb, param_grid, scoring="recall", n_jobs=-1, cv=kfold)
# perform best parameter search for 'learning_rate'
grid_result = grid_search.fit(X_fe_train_rus, y_fe_train_rus)
# print best parameter found for 'learning_rate'
print("Best result: {:.4f} for {}".format(grid_result.best_score_, grid_result.best_params_))
'''
Best result: 0.6663 for {'learning_rate': 0.05}
'''
With the hyperparameter values adjusted, I will train the final model and run it on the test set.
# set a seed for reproducibility
seed = 123
# instantiate the final XGBoost model with the best values found
xgb = XGBClassifier(learning_rate=0.05, n_estimators=25, max_depth=6, min_child_weight=2, gamma=0.0, random_state=seed)
# train the model with training data
xgb.fit(X_fe_train_rus, y_fe_train_rus)
# standardize test data
X_fe_test = scaler.transform(X_fe_test)
# make predictions with test data
y_fe_pred = xgb.predict(X_fe_test)
Finally, I will print a report with the evaluation metrics and the AUC value. I will also plot the regular confusion matrix, the normalized confusion matrix, and the AUC curve.
# print assessment metrics report
print('Evaluation Metrics Report'.center(65) + ('\n') + ('-' * 65))
print(classification_report(y_fe_test, y_fe_pred, digits=4) + ('\n') + ('-' * 15))
# print AUC
print('AUC: {:.4f} \n'.format(roc_auc_score(y_fe_test, y_fe_pred)) + ('-' * 65))
# plot graphs
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))
# normalized confusion matrix
skplt.metrics.plot_confusion_matrix(y_fe_test, y_fe_pred, normalize=True,
title='Normalized Confusion Matrix',
text_fontsize='large', ax=ax[0])
# confusion matrix
skplt.metrics.plot_confusion_matrix(y_fe_test, y_fe_pred,
title='Confusion Matrix',
text_fontsize='large', ax=ax[1])
# AUC
y_fe_prob = xgb.predict_proba(X_fe_test)
skplt.metrics.plot_roc(y_fe_test, y_fe_prob, title='AUC', cmap='brg', text_fontsize='small', plot_macro=False, plot_micro=False, ax=ax[2])
### print AUC value
auc = (roc_auc_score(y_fe_test, y_fe_pred) * 100).round(2)
ax[2].text(0.3, 0.6, auc, fontsize=14, color='#004a8f', fontweight='bold')
plt.show()
'''
Evaluation Metrics Report
-----------------------------------------------------------------
precision recall f1-score support
0 0.9174 0.6600 0.7677 10522
1 0.2774 0.6872 0.3952 1998
accuracy 0.6644 12520
macro avg 0.5974 0.6736 0.5815 12520
weighted avg 0.8153 0.6644 0.7083 12520
---------------
AUC: 0.6736
-----------------------------------------------------------------
'''
15. XGBoost Model Comparison
In this section, we will plot the results of the predictive models generated with XGBoost, both without and with feature engineering, side by side for easier comparison.
15.1. Confusion Matrix
The confusion matrix provides us with four different values (a quick way to extract these counts with scikit-learn is sketched right after the list). Each of them corresponds to:
Model Correct Predictions
- True Positive: the customer is in default and the model classifies them as default.
- True Negative: the customer is not in default and the model classifies them as not default.
Model Errors
- False Positive: the customer is not in default, but the model flags them as default.
- False Negative: the customer is in default, but the model classifies them as not default.
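As a quick check, these four counts can also be read directly from scikit-learn’s confusion_matrix. A minimal sketch using the predictions of the model without feature engineering (the same applies to y_fe_test and y_fe_pred):
# unpack the four counts of the binary confusion matrix
## scikit-learn returns rows as actual classes and columns as predicted classes
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TP: {} | TN: {} | FP: {} | FN: {}'.format(tp, tn, fp, fn))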
I will plot the results generated by the confusion matrix, in both models created, side-by-side, so that it is easier to compare them.
# plot normalized confusion matrix
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))
# XGBoost
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True,
title='XGBoost',
text_fontsize='large', ax=ax[0])
# XGBoost with Feature Engineering
skplt.metrics.plot_confusion_matrix(y_fe_test, y_fe_pred, normalize=True,
title='with Feature Engineering',
text_fontsize='large', ax=ax[1])
plt.show()
Another interesting way to visualize the confusion matrix is with the raw counts; this way we know how many customers were flagged as default, how many the model got right, and how many cases it got wrong.
# plot normalized confusion matrix
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))
# XGBoost
skplt.metrics.plot_confusion_matrix(y_test, y_pred,
title='XGBoost',
text_fontsize='large', ax=ax[0])
# XGBoost with Feature Engineering
skplt.metrics.plot_confusion_matrix(y_fe_test, y_fe_pred,
title='with Feature Engineering',
text_fontsize='large', ax=ax[1])
plt.show()
From a total of 12,520 customers analyzed, the XGBoost model without feature engineering correctly predicted default in 1,308 cases, while the model with feature engineering got it right 1,373 times.
So, analyzing the results of both models, the XGBoost algorithm with feature engineering is slightly superior, in terms of true positives, to the model built without it.
However, if we look at the overall hits, that is, the True Positives plus the True Negatives, the model with feature engineering is slightly inferior to the model without it. This happens because the model with feature engineering makes more False Positive errors, pointing out that a client is in default when in fact they are not. This can also be a problem, since the company would stop lending money, and earning a profit, to a client who would have been a good payer.
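One simple way to verify this is to compare the overall accuracy of the two models, which corresponds to (TP + TN) divided by the total number of customers; according to the reports above, it is 0.6766 without feature engineering against 0.6644 with it. A minimal sketch of that check:
# compare overall hits (TP + TN) / total for both models
from sklearn.metrics import accuracy_score
print('XGBoost: {:.4f}'.format(accuracy_score(y_test, y_pred)))
print('with Feature Engineering: {:.4f}'.format(accuracy_score(y_fe_test, y_fe_pred)))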
15.2. AUC
Below, I bring the AUC curves plotted side-by-side, together with the values found, so that it is easier to compare this measurement.
# plot AUC
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
## XGBoost
skplt.metrics.plot_roc(y_test, y_prob, title='XGBoost',
cmap='brg', text_fontsize='small', plot_macro=False,
plot_micro=False, ax=ax[0])
### print AUC value for XGBoost
auc = (roc_auc_score(y_test, y_pred) * 100).round(2)
ax[0].text(0.1, 0.8, auc, fontsize=14, color='#004a8f', fontweight='bold')
## XGBoost with Feature Engineering
skplt.metrics.plot_roc(y_fe_test, y_fe_prob, title='with Feature Engineering',
cmap='brg', text_fontsize='small', plot_macro=False,
plot_micro=False, ax=ax[1])
### print AUC value for XGBoost with Feature Engineering
auc_fe = (roc_auc_score(y_fe_test, y_fe_pred) * 100).round(2)
ax[1].text(0.15, 0.7, auc_fe, fontsize=14, color='#004a8f', fontweight='bold')
plt.show()
The AUC value for the XGBoost algorithm without applying feature engineering is 66.77%, that is, it is lower than the value of 67.36% given to the model in which feature engineering was used.
15.3. Recall
Finally, the Recall values for each model. Remember that this is the metric best suited to our problem, since Recall = TP / (TP + FN): the higher its value, the better the model is at identifying defaults.
# print recall results
print('RECALL'.center(30) + '\n' + ('-' * 30))
## XGBoost
print('\t\t XGBoost: {:.2f}%'.format((recall_score(y_test, y_pred)) * 100))
# with Feature Engineering
print('with Feature Engineering: {:.2f}%'.format((recall_score(y_fe_test, y_fe_pred)) * 100))
'''
RECALL
------------------------------
XGBoost: 65.47%
with Feature Engineering: 68.72%
'''
The Recall value is slightly higher in the XGBoost model that uses feature engineering.
16. Hypothesis Test
Despite numerically observing that the Recall of the model with feature engineering was 3.25 percentage points higher than that of the model without it, we can perform a statistical test to verify whether this better performance is significant or not.
For this, I will use the two-proportion z-test, which fits our case: it determines whether the difference between the proportions of two independent samples is statistically significant. I will consider the following hypotheses:
- Null Hypothesis (H0) ▶ p-value > 0.05: there is no significant difference in the Recall value between the models with and without feature engineering.
- Alternative Hypothesis (H1) ▶ p-value <= 0.05: reject the null hypothesis, that is, there is a difference between the two models evaluated.
Note
The significance level is a threshold chosen for each case to decide whether to accept or reject the null hypothesis; the p-value obtained from the test is compared against it. The smaller the p-value, the stronger the evidence against the null hypothesis. For example, if the p-value obtained were 0.04, we could reject the null hypothesis and argue that the model with feature engineering is superior with a confidence level of 96%.
Therefore, if the p-value is above the threshold, we cannot reject the null hypothesis, but this is not an absolute truth. It only means that we have not found sufficient evidence to reject it.
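For reference, the statistic computed by the function below is the standard pooled two-proportion z-test: z = (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2)), where p1 and p2 are the two proportions, n1 and n2 the sample sizes (here n1 = n2 = n), and p = (x1 + x2) / (n1 + n2) is the pooled proportion of successes.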
# set variables with the values needed for the statistical test
## model recall value without feature engineering
recall_model = 0.6547
## model recall value with feature engineering
recall_model_fe = 0.6872
## number of customers who were evaluated
n_clients = X_test.shape[0]
# number of hits for each model
model_success = int(recall_model * n_clients)
model_fe_success = int(recall_model_fe * n_clients)
# create function to calculate z test for two proportions
def z_test(success_a, success_b, n):
    # proportions of each sample
    p1 = success_a / n
    p2 = success_b / n
    # pooled proportion of successes
    p = (success_a + success_b) / (n + n)
    # compute the z statistic using the pooled proportion
    z_stat = (p1 - p2) / ((p * (1 - p) * (1/n + 1/n))**0.5)
    # calculate the two-tailed p-value
    p_val = (1 - norm.cdf(abs(z_stat))) * 2
    return z_stat, p_val
# apply the function to machine-learning models
z_stat, p_val = z_test(model_success, model_fe_success, n_clients)
print(z_stat, p_val)
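As a sanity check, the same test can be reproduced with statsmodels, assuming the library is available in the environment; the result should match the function above up to rounding.
# cross-check the manual calculation with statsmodels
from statsmodels.stats.proportion import proportions_ztest
## successes and sample sizes of the two models
count = [model_success, model_fe_success]
nobs = [n_clients, n_clients]
z_stat_sm, p_val_sm = proportions_ztest(count, nobs)
print(z_stat_sm, p_val_sm)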
The p-value of the statistical test performed was 4.41e-08, well below the significance level of 0.05. With this, we can reject the null hypothesis and accept that there is a difference between the two machine-learning models created. Therefore, we can say, with statistical support, that the model with feature engineering is superior to the model without it.
17. Conclusion
This study aimed to create a machine-learning model that predicts whether a new client may become delinquent or, in technical terms of the financial area, default. For this, an exploratory data analysis was carried out, where the structure and content of the data set were understood. At this time, the imbalance of the target class, the presence of outliers, and missing data, as well as typing errors, were detected.
The data treatment involved the removal of some attributes and the missing data in the target variable. Also, at this stage, there was the treatment of typing errors, the change of variable type, the removal of outliers, and the treatment of missing data.
After that, the preparation of the data to be inserted into machine learning models began. At this stage, Label Encoding was used to treat categorical variables. Also, the data were separated into two sets X and y, respectively the target class and the independent classes, so that a training set and a test set could be created.
Next, a model validation function was created, consisting of a workflow, known by the term pipeline, to standardize the data using the Standard Scaler, apply the classifier chosen in the parameter, and perform a cross-validation evaluating the model by the value of Recall. With this, it was possible to create a base model, with Random Forest, whose evaluation value was 0.0290.
It is worth mentioning that the performance evaluation metric of the models chosen was Recall because it is better suited to the study problem. That is, cases of false negatives are more harmful to the company than cases of false positives.
With the baseline value in hand, the data were standardized and balanced with the Standard Scaler and Undersampling, as a way to improve model performance. From this, 7 more models were created; for comparison, the Random Forest, which previously scored 0.0290, improved to 0.6344. Here, the best model was the LGBMClassifier, with 0.6562, and XGBoost came second with 0.6483.
XGBoost was chosen to continue this project because it is an algorithm with a large number of hyperparameters to be adjusted, which has made it an excellent choice among data professionals for problems in various areas. For this reason, the next step was to search for the best parameters for our specific problem, which improved the Recall from 0.6483 to 0.6640.
To try to further improve the model’s performance, feature engineering was carried out, which consisted of removing some more variables from the data set and creating 4 new attributes. Once again, the data were prepared and, this time, the treatment of categorical variables was also improved with the use of Label Encoding and dummy variables.
When running the base model, we had an improvement from 0.0290 to 0.0513. After applying standardization and balancing the data, we recreated the 7 models for comparison and obtained an improvement in half of the models. However, the improvement was more significant than in the cases where the model worsened.
We continued with XGBoost, which had a Recall value of 0.6427, and after optimizing the hyperparameters, we reached 0.6663. The best value found so far.
Finally, we compared the XGBoost models created, with and without feature engineering, applied to the test set. In the confusion matrix, the immediate observation is that when it comes to true positives, the model with feature engineering is superior to the model without. However, a closer look will notice that the model without feature engineering has more overall hits, which is the sum of True Negatives and True Positives.
When we look at the AUC, the model with feature engineering is superior, presenting a result of 67.36 against 66.77 for the model without the engineering. This is repeated even more expressively in terms of the Recall value, which was 0.6547 in the model without engineering and 0.6872 with feature engineering.
To prove the superiority of the model with feature engineering, a z-statistical test was performed, which resulted in a p-value of 4.41e-08. With this, it was possible to reject the null hypothesis and accept that one model performs better than the other.
It should also be noted that the model with feature engineering ended up with far fewer trees than the model without it, despite having more variables: the dataset with engineering grew from 37 to 105 attributes, yet its model uses only 25 trees against 85. This directly affects the speed at which the model runs and generates results, which is essential for Nubank’s type of business.
Finally, I would like to add some suggestions that may contribute to this study:
- A correlation analysis between the variables could be performed to find those that best relate to default and test the identification of these cases with these variables.
- The Recall evaluation metric was used here; however, another interesting metric to consider would be the F1-Score, since False Positives can also harm the business: they mean not lending to a customer who would actually have been a good payer but was mistakenly flagged by the model as likely to default.
Get to know more about this study
This study is available on Google Colab and on GitHub. Just click on the images below to be redirected.