Regression Analysis Part I — Predicting Credit Score, Whether or Not Customer has Defaulted (Linear Regression Models)

Anaswar Jayakumar
36 min read · Feb 15, 2024

Overview

This project analyzes a dataset of 84 features derived from the financial transactions and current financial standing of 1000 customers. The variable CUST_ID is a unique customer identifier, while CREDIT_SCORE and DEFAULT are the two key target variables: CREDIT_SCORE is a numerical target representing the customer’s credit score (integer), and DEFAULT is a binary target indicating whether the customer has defaulted (1) or not (0). The remaining variables are explanatory (independent) variables, which fall into one of four categories:

  • Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
  • Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
  • Total Expenditure
  • Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)

Python was the language of choice for this project, although R could certainly have been used as well. I personally find Python better suited than R here because the regression analysis portion relies on machine learning techniques that I find easier to implement in Python. The data was obtained from Kaggle, a website that hosts data science competitions and datasets. The CSV file used for this project is available at: https://www.kaggle.com/datasets/conorsully1/credit-score

Objective

The objective of this project is twofold: predict a customer’s credit score and predict the likelihood of a customer defaulting, in both cases using the 84 features derived from the financial transactions and current financial standing of 1000 customers. To that end, linear regression models will be implemented to predict a customer’s credit score, while logistic regression models will be implemented to predict the likelihood of a customer defaulting. This project will entail analyzing the following aspects of the dataset:

  • Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
  • Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
  • Total Expenditure
  • Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)

Review of Data Sources

The data used for this project (credit_score.csv) was downloaded from Kaggle, and the pandas library in Python was used to load it into the dataframe credit_score_data (Credit Score Dataset). The dataframe contains 1000 rows × 87 columns. Two of the columns are object variables (‘CUST_ID’, ‘CAT_GAMBLING’), while the remaining columns are all numerical variables.
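As a rough illustration of this loading step, the sketch below reads a CSV with pandas and lists the non-numeric columns. The three rows are made-up stand-in values and only a handful of the 87 columns are shown; with the real file, the in-memory buffer would be replaced by the path to credit_score.csv.

```python
import io
import pandas as pd

# Tiny made-up stand-in for credit_score.csv (values are illustrative only).
# With the real download, pass the file path to pd.read_csv instead.
sample_csv = io.StringIO(
    "CUST_ID,INCOME,SAVINGS,DEBT,CAT_GAMBLING,CREDIT_SCORE,DEFAULT\n"
    "C001,50000,20000,100000,No,600,0\n"
    "C002,120000,400000,300000,High,520,1\n"
    "C003,80000,150000,0,Low,650,0\n"
)

credit_score_data = pd.read_csv(sample_csv)

# (rows, columns); the article reports 1000 x 87 for the full dataset
print(credit_score_data.shape)

# Columns that were not parsed as numbers (text identifiers and categories)
non_numeric_columns = credit_score_data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

On the full dataset, the same check should surface the two text columns mentioned above.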

To prepare the dataset for analysis and model evaluation, the columns were first renamed, and a mapping was then implemented to convert the column CATGAMBLING from a categorical column to a numerical column. The mapping assigns a unique value to each category, and CATGAMBLING has three categories: none, low, and high. A value of 0 was assigned to the category High, a value of 1 to the category No, and a value of 2 to the category Low. Once the mapping was implemented and CATGAMBLING was converted to a numerical variable, the data preparation stage was complete. The next step was to perform exploratory data analysis.
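The mapping described above can be sketched with pandas' Series.map. The four-row dataframe is a made-up stand-in, and the exact label spellings ("High", "No", "Low") are an assumption based on the category names used in this section.

```python
import pandas as pd

# Made-up stand-in rows; label spellings are assumed from the text above
df = pd.DataFrame({"CATGAMBLING": ["No", "High", "Low", "No"]})

# Codes as described in the article: High -> 0, No -> 1, Low -> 2
gambling_codes = {"High": 0, "No": 1, "Low": 2}
df["CATGAMBLING"] = df["CATGAMBLING"].map(gambling_codes)

# Any label missing from the dictionary would become NaN, so check for that
assert not df["CATGAMBLING"].isna().any()

print(df["CATGAMBLING"].tolist())  # [1, 0, 2, 1]
```

Note that these codes do not follow the order of gambling intensity, so a linear model will read them as a numeric scale; this is worth keeping in mind when interpreting the coefficient of this variable.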

Exploratory Data Analysis (EDA)

EDA was the next step of this project, with the goal of getting a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis. For the purposes of this article, I will focus on the histograms and the correlation analysis, since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables while correlation analysis was instrumental in determining the predictor variables that will ultimately be used to predict a customer’s credit score as well as predict the likelihood of a customer defaulting. In particular, the EDA focused on the following aspects of the customer financial transactions and current financial standing dataset:

  • Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
  • Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
  • Total Expenditure
  • Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)
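To make the role of correlation analysis in choosing predictors more concrete, here is a minimal sketch that ranks features by the absolute value of their correlation with the target. The data is synthetic (generated so that one feature drives the score and the other is pure noise), and the column name NOISE is hypothetical; this is not the article's actual selection procedure, only one common way to do it.

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature, one noise feature (not the real dataset)
rng = np.random.default_rng(0)
n = 200
debt_income_ratio = rng.exponential(scale=5.0, size=n)
noise_feature = rng.normal(size=n)
credit_score = 700 - 10 * debt_income_ratio + rng.normal(scale=5.0, size=n)

df = pd.DataFrame(
    {
        "DEBTINCOME(R)": debt_income_ratio,
        "NOISE": noise_feature,  # hypothetical column, for contrast only
        "CREDIT_SCORE": credit_score,
    }
)

# Pearson correlation of every feature with the target
corr_with_target = df.corr()["CREDIT_SCORE"].drop("CREDIT_SCORE")

# Rank by strength of the linear relationship, regardless of sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Features near the top of such a ranking are natural candidates for a linear regression model, while features with correlations near zero contribute little linear signal on their own.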

Histograms

Core Variables:

The variables ‘INCOME’, ‘SAVINGS’, ‘DEBT’, ‘SAVINGSINCOME(R)’, ‘DEBTINCOME(R)’, and ‘DEBTSAVINGS(R)’ all seem to mostly resemble a positively (right) skewed distribution. The means of the variables ‘INCOME’, ‘SAVINGS’, ‘DEBT’, ‘SAVINGSINCOME(R)’, ‘DEBTINCOME(R)’, and ‘DEBTSAVINGS(R)’ are 121,610.019000, 4.131896e+05, 7.907180e+05, 4.063477, 6.068449, and 5.867252 respectively, while the standard deviations are 113,716.699591, 4.429160e+05, 9.817904e+05, 3.968097, 5.847878, and 16.788356 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable INCOME, on average customers have between approximately $0 and $200,000 in total income during the last 12 months but with a long tail of customers that have a higher income during the last 12 months.
  • In the distribution of the variable SAVINGS, on average customers have between approximately $0 and $700,000 in total savings during the last 12 months but with a long tail of customers that have more savings during the last 12 months.
  • In the distribution of the variable DEBT, on average customers have between approximately $0 and $600,000 in total debt during the last 12 months but with a long tail of customers that have more debt during the last 12 months.
  • In the distribution of the variable SAVINGSINCOME(R), on average a customer’s savings to income ratio is approximately between 0 and 5 but with a long tail of customers that have a higher savings to income ratio
  • In the distribution of the variable DEBTINCOME(R), on average a customer’s debt to income ratio is approximately between 0 and 10 but with a long tail of customers that have a higher debt to income ratio
  • In the distribution of the variable DEBTSAVINGS(R), on average a customer’s debt to savings ratio is approximately between 0 and 6 but with a long tail of customers that have a higher debt to savings ratio
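The reading of these histograms as right skewed can be cross-checked numerically. The sketch below uses a synthetic lognormal sample as a stand-in for a column such as INCOME (the parameters are arbitrary assumptions, not fitted to the dataset) and reports the mean, median, standard deviation, and sample skewness, plus the bin counts that a 10-bin histogram would show.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for a column such as INCOME (not the real data)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=11.3, sigma=0.8, size=1000), name="INCOME")

summary = {
    "mean": income.mean(),
    "median": income.median(),
    "std": income.std(),
    "skew": income.skew(),  # a value above 0 indicates a longer right tail
}
print(summary)

# Counts per bin of a 10-bin histogram; for right-skewed data
# the leftmost bins hold far more observations than the rightmost ones
counts, bin_edges = np.histogram(income, bins=10)
print(counts)
```

A mean noticeably above the median together with a positive skewness is the numerical counterpart of the long right tail described in the bullets above.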

Transaction Groups:

Group 1 — Clothing

The variables ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, and ‘CLOTHINGDEBT(R)’ all seem to mostly resemble a positively (right) skewed distribution while the variable CLOTHING(R) seems to mostly resemble a normal distribution. The mean of the variables ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHING(R)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, and ‘CLOTHINGDEBT(R)’ are 6822.401000, 3466.320000, 0.454848, 0.055557, 0.048057, and 0.030536 respectively while the standard deviations are 7486.225932, 5118.942977, 0.236036, 0.037568, 0.097712, and 0.084469 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable CLOTHING(T12), on average a customer’s total clothing expenditure in the last 12 months is approximately between 0 and $7,000 but with a long tail of customers who have a higher total clothing expenditure in the last 12 months
  • In the distribution of the variable CLOTHING(T6), on average a customer’s total clothing expenditure in the last 6 months is approximately between 0 and $5,000 but with a long tail of customers who have a higher total clothing expenditure in the last 6 months
  • In the distribution of the variable CLOTHINGINCOME(R), on average the ratio of total clothing expenditure to income for a given customer during the past 12 months is approximately 0.05 but with a long tail of customers who have a higher ratio of total clothing expenditure to income
  • In the distribution of the variable CLOTHINGSAVINGS(R), on average the ratio of total clothing expenditure to savings for a given customer during the past 12 months is approximately 0.04 but with a long tail of customers who have a higher ratio of total clothing expenditure to savings
  • In the distribution of the variable CLOTHINGDEBT(R), on average the ratio of total clothing expenditure to debt for a given customer during the past 12 months is approximately 0.03 but with a long tail of customers who have a higher ratio of total clothing expenditure to debt
  • In the distribution of the variable CLOTHING(R), on average the ratio of total clothing expenditure in the last 6 months to total clothing expenditure in the last 12 months is approximately 0.5 but with some customers who have a lower ratio of total clothing expenditure in the last 6 months to total clothing expenditure in the last 12 months and some customers who have a higher ratio of total clothing expenditure in the last 6 months to total clothing expenditure in the last 12 months
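For reference, the ratio variables in each transaction group can be reproduced from the totals as described in the bullets above. The sketch below does this for clothing on three made-up customers. Treating a zero 12-month total as a missing ratio is my own assumption; how the dataset itself handles that case is not stated here.

```python
import numpy as np
import pandas as pd

# Three made-up customers (values are illustrative only)
df = pd.DataFrame(
    {
        "INCOME": [100000.0, 40000.0, 50000.0],
        "CLOTHING(T12)": [6000.0, 2000.0, 0.0],
        "CLOTHING(T6)": [3000.0, 1500.0, 0.0],
    }
)

# Share of the 12-month clothing spend that occurred in the last 6 months.
# Assumption: a zero 12-month total yields NaN instead of a division by zero.
t12_nonzero = df["CLOTHING(T12)"].replace(0, np.nan)
df["CLOTHING(R)"] = df["CLOTHING(T6)"] / t12_nonzero

# 12-month clothing spend relative to income
df["CLOTHINGINCOME(R)"] = df["CLOTHING(T12)"] / df["INCOME"]

print(df[["CLOTHING(R)", "CLOTHINGINCOME(R)"]])
```

A CLOTHING(R) value of 0.5 means spending was spread evenly across the two halves of the year, which is why the (R) variables of most groups cluster around 0.5.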

Group 2 — Education

The variables ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, and ‘EDUCATIONDEBT(R)’ all seem to mostly resemble a positively (right) skewed distribution while the variable EDUCATION(R) seems to mostly resemble a normal distribution. The means of the variables ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, and ‘EDUCATIONDEBT(R)’ are 3,604.26000, 1,811.460000, 0.502418, 0.038695, 0.054301, and 0.011843 respectively, while the standard deviations are 7065.70035, 3551.440702, 0.001910, 0.074037, 0.242352, and 0.059634 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable EDUCATION(T12), on average a customer’s total education expenditure in the last 12 months is approximately between 0 and $3,700 but with a long tail of customers who have a higher total education expenditure in the last 12 months
  • In the distribution of the variable EDUCATION(T6), on average a customer’s total education expenditure in the last 6 months is approximately between 0 and $2,500 but with a long tail of customers who have a higher total education expenditure in the last 6 months
  • In the distribution of the variable EDUCATIONINCOME(R), on average the ratio of total education expenditure to income for a given customer during the past 12 months is approximately between 0 and 0.04 but with a long tail of customers who have a higher ratio of total education expenditure to income
  • In the distribution of the variable EDUCATIONSAVINGS(R), on average the ratio of total education expenditure to savings for a given customer during the past 12 months is approximately between 0 and 0.05 but with a long tail of customers who have a higher ratio of total education expenditure to savings
  • In the distribution of the variable EDUCATIONDEBT(R), on average the ratio of total education expenditure to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total education expenditure to debt
  • In the distribution of the variable EDUCATION(R), on average the ratio of total education expenditure in the last 6 months to total education expenditure in the last 12 months is approximately between 0.5 and 0.5025 but with some customers who have a lower ratio of total education expenditure in the last 6 months to total education expenditure in the last 12 months and some customers who have a higher ratio of total education expenditure in the last 6 months to total education expenditure in the last 12 months

Group 3 — Entertainment

The variables ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENTINCOME(R), ENTERTAINMENTSAVINGS(R), and ENTERTAINMENTDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while for the variable ENTERTAINMENT(R) the distribution seems to be a bit unclear. The means of the variables ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENT(R), ENTERTAINMENTINCOME(R), ENTERTAINMENTSAVINGS(R), and ENTERTAINMENTDEBT(R) are 14,261.255000, 7945.307000, 0.546432, 0.167514, 0.219330, and 0.094456 respectively, while the standard deviations are 12388.187688, 7374.463757, 0.062581, 0.136778, 0.577628, and 0.237148 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable ENTERTAINMENT(T12), on average a customer’s total expenditure on entertainment is approximately between $0 and $15,000 in the last 12 months but with a long tail of customers who have a higher total expenditure on entertainment in the last 12 months
  • In the distribution of the variable ENTERTAINMENT(T6), on average a customer’s total expenditure on entertainment is approximately between $0 and $10,000 in the last 6 months but with a long tail of customers who have a higher total expenditure on entertainment in the last 6 months
  • In the distribution of the variable ENTERTAINMENTINCOME(R), on average the ratio of total expenditure on entertainment to income for a given customer during the past 12 months is approximately 0.17 but with a long tail of customers who have a higher ratio of total expenditure on entertainment to income
  • In the distribution of the variable ENTERTAINMENTSAVINGS(R), on average the ratio of total expenditure on entertainment to savings for a given customer during the past 12 months is approximately 0.22 but with a long tail of customers who have a higher ratio of total expenditure on entertainment to savings
  • In the distribution of the variable ENTERTAINMENTDEBT(R), on average the ratio of total expenditure on entertainment to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure on entertainment to debt
  • In the distribution of the variable ENTERTAINMENT(R), the ratio of entertainment expenditure over the past 6 months to entertainment expenditure over the past 12 months, the shape is less clear. From a ratio of 0.55 upward, the distribution seems to be mostly normal: among customers with a ratio of at least 0.55, most have a ratio of approximately 0.65, with some customers below and some customers above that value. That being said, outliers are present in the distribution. In particular, a significant number of customers have a ratio of approximately 0.5.
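One way to check a pattern like this without relying on the eye is to measure how many values sit at the spike and then summarize the rest separately. The sketch below does so on synthetic data built to imitate the shape described (a cluster at exactly 0.5 plus a bell around 0.65); the proportions and the tolerance of 0.005 are arbitrary assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic imitation of the described shape (not the real ENTERTAINMENT(R) column)
rng = np.random.default_rng(1)
spike = np.full(300, 0.5)                        # customers with a ratio of exactly 0.5
bell = rng.normal(loc=0.65, scale=0.04, size=700)  # roughly normal part
ratio = pd.Series(np.concatenate([spike, bell]), name="ENTERTAINMENT(R)")

# Flag values within an (assumed) tolerance of 0.5
near_half = np.isclose(ratio, 0.5, atol=0.005)
share_near_half = near_half.mean()

# Summarize the remaining values on their own
rest = ratio[~near_half]
print(f"share near 0.5: {share_near_half:.2f}")
print(f"median of the rest: {rest.median():.3f}")
```

If a large share of customers sits at the spike, the overall mean and standard deviation blend two different groups, which would explain why the histogram is hard to read as a single distribution.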

Group 4 — Fines

The variables FINES(T12), FINES(T6), FINESINCOME(R), FINESSAVINGS(R), and FINESDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable FINES(R) seems to mostly resemble a normal distribution. The mean of the variables FINES(T12), FINES(T6), FINES(R), FINESINCOME(R), FINESSAVINGS(R), and FINESDEBT(R) are 26.504000, 14.460000, 0.760875, 0.000291, 0.000400, and 0.000098 respectively while the standard deviations are 136.171755, 88.509176, 0.286043, 0.001390, 0.003359, and 0.000783 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable FINES(T12), on average a customer’s total expenditure spent on fines is approximately between $0 and $27 in the last 12 months but with a long tail of customers who have a higher total expenditure spent on fines in the last 12 months
  • In the distribution of the variable FINES(T6), on average a customer’s total expenditure spent on fines is approximately between $0 and $15 in the last 6 months but with a long tail of customers who have a higher total expenditure spent on fines in the last 6 months
  • In the distribution of the variable FINESINCOME(R), on average the ratio of total expenditure spent on fines to income for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on fines to income
  • In the distribution of the variable FINESSAVINGS(R), on average the ratio of total expenditure spent on fines to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on fines to savings
  • In the distribution of the variable FINESDEBT(R), on average the ratio of total expenditure spent on fines to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on fines to debt
  • In the distribution of the variable FINES(R), on average a consumer has a ratio of total expenditure spent on fines between the past 6 and 12 months (expenditure spent on fines over the past 6 months / expenditure spent on fines over the past 12 months) of approximately between 0.6 and 0.8 but with some customers who have a lower ratio of total expenditure spent on fines between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on fines between the past 6 and 12 months

Group 5 — Gambling

The variables GAMBLING(T12), GAMBLING(T6), GAMBLINGINCOME(R), GAMBLINGSAVINGS(R), and GAMBLINGDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable GAMBLING(R) seems to mostly resemble a normal distribution. The mean of the variables GAMBLING(T12), GAMBLING(T6), GAMBLING(R), GAMBLINGINCOME(R), GAMBLINGSAVINGS(R), and GAMBLINGDEBT(R) are 2433.58700, 1241.625000, 0.515578, 0.018471, 0.018333, and 0.006730 respectively while the standard deviations are 5007.15757, 2570.476975, 0.052257, 0.032843, 0.063437, and 0.017247 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable GAMBLING(T12), on average a customer’s total expenditure on gambling over the past 12 months is approximately between $0 and $2,500 but with a long tail of customers who have a higher total expenditure on gambling.
  • In the distribution of the variable GAMBLING(T6), on average a customer’s total expenditure on gambling over the past 6 months is approximately between $0 and $1,250 but with a long tail of customers who have a higher total expenditure on gambling.
  • In the distribution of the variable GAMBLINGINCOME(R), on average the ratio of total expenditure spent on gambling to income for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on gambling to income
  • In the distribution of the variable GAMBLINGSAVINGS(R), on average the ratio of total expenditure spent on gambling to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on gambling to savings
  • In the distribution of the variable GAMBLINGDEBT(R), on average the ratio of total expenditure spent on gambling to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on gambling to debt
  • In the distribution of the variable GAMBLING(R), on average a consumer has a ratio of total expenditure spent on gambling between the past 6 and 12 months (expenditure spent on gambling over the past 6 months / expenditure spent on gambling over the past 12 months) of approximately 0.5 but with some customers who have a lower ratio of total expenditure spent on gambling between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on gambling between the past 6 and 12 months

Group 6 — Groceries

The variables GROCERIES(T12), GROCERIES(T6), GROCERIESINCOME(R), GROCERIESSAVINGS(R), and GROCERIESDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable GROCERIES(R) seems to mostly resemble a normal distribution. The mean of the variables GROCERIES(T12), GROCERIES(T6), GROCERIES(R), GROCERIESINCOME(R), GROCERIESSAVINGS(R), and GROCERIESDEBT(R) are 18027.602000, 9327.58900, 0.509240, 0.156475, 0.112795, and 0.096222 respectively while the standard deviations are 19207.309541, 10313.29888, 0.057203, 0.088929, 0.227134, and 0.222194 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable GROCERIES(T12), on average a customer’s total expenditure on groceries over the past 12 months is approximately between $0 and $20,000 but with a long tail of customers who have a higher total expenditure on groceries.
  • In the distribution of the variable GROCERIES(T6), on average a customer’s total expenditure on groceries over the past 6 months is approximately between $0 and $10,000 but with a long tail of customers who have a higher total expenditure on groceries.
  • In the distribution of the variable GROCERIESINCOME(R), on average the ratio of total expenditure spent on groceries to income for a given customer during the past 12 months is approximately between 0.10 and 0.20 but with a long tail of customers who have a higher ratio of total expenditure spent on groceries to income
  • In the distribution of the variable GROCERIESSAVINGS(R), on average the ratio of total expenditure spent on groceries to savings for a given customer during the past 12 months is approximately between 0 and 0.10 but with a long tail of customers who have a higher ratio of total expenditure spent on groceries to savings
  • In the distribution of the variable GROCERIESDEBT(R), on average the ratio of total expenditure spent on groceries to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on groceries to debt
  • In the distribution of the variable GROCERIES(R), on average a consumer has a ratio of total expenditure spent on groceries between the past 6 and 12 months (expenditure spent on groceries over the past 6 months / expenditure spent on groceries over the past 12 months) of approximately 0.5 but with some customers who have a lower ratio of total expenditure spent on groceries between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on groceries between the past 6 and 12 months

Group 7 — Health

The variables HEALTH(T12), HEALTH(T6), HEALTHINCOME(R), HEALTHSAVINGS(R), and HEALTHDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable HEALTH(R) seems to mostly resemble a normal distribution. The mean of the variables HEALTH(T12), HEALTH(T6), HEALTH(R), HEALTHINCOME(R), HEALTHSAVINGS(R), and HEALTHDEBT(R) are 5379.650000, 2657.589000, 0.465418, 0.052300, 0.024068, and 0.050076 respectively while the standard deviations are 5316.800612, 3386.038271, 0.212398, 0.045199, 0.044281, and 0.141497 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable HEALTH(T12), on average a customer’s total expenditure on healthcare over the past 12 months is approximately between $0 and $10,000 but with a long tail of customers who have a higher total expenditure on healthcare.
  • In the distribution of the variable HEALTH(T6), on average a customer’s total expenditure on healthcare over the past 6 months is approximately between $0 and $5,000 but with a long tail of customers who have a higher total expenditure on healthcare.
  • In the distribution of the variable HEALTHINCOME(R), on average the ratio of total expenditure spent on healthcare to income for a given customer during the past 12 months is approximately 0.05 but with a long tail of customers who have a higher ratio of total expenditure spent on healthcare to income
  • In the distribution of the variable HEALTHSAVINGS(R), on average the ratio of total expenditure spent on healthcare to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on healthcare to savings
  • In the distribution of the variable HEALTHDEBT(R), on average the ratio of total expenditure spent on healthcare to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on healthcare to debt
  • In the distribution of the variable HEALTH(R), on average a consumer has a ratio of total expenditure spent on healthcare between the past 6 and 12 months (expenditure spent on healthcare over the past 6 months/expenditure spent on healthcare over the past 12 months) of approximately between 0.4 and 0.6 but with some customers who have a lower ratio of total expenditure spent on healthcare between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on healthcare between the past 6 and 12 months

Group 8 — Housing

The variables HOUSING(T12), HOUSING(T6), HOUSINGINCOME(R), HOUSINGSAVINGS(R), and HOUSINGDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable HOUSING(R) seems to mostly resemble a normal distribution. The mean of the variables HOUSING(T12), HOUSING(T6), HOUSING(R), HOUSINGINCOME(R), HOUSINGSAVINGS(R), and HOUSINGDEBT(R) are 11146.711000, 5641.330000, 0.506100, 0.092608, 0.056445, and 0.073870 respectively while the standard deviations are 17892.426895, 9055.316562, 0.000115, 0.110893, 0.140885, and 0.236318 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable HOUSING(T12), on average a customer’s total expenditure on housing over the past 12 months is approximately between $0 and $11,000 but with a long tail of customers who have a higher total expenditure on housing.
  • In the distribution of the variable HOUSING(T6), on average a customer’s total expenditure on housing over the past 6 months is approximately between $0 and $6,000 but with a long tail of customers who have a higher total expenditure on housing.
  • In the distribution of the variable HOUSINGINCOME(R), on average the ratio of total expenditure spent on housing to income for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on housing to income
  • In the distribution of the variable HOUSINGSAVINGS(R), on average the ratio of total expenditure spent on housing to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on housing to savings
  • In the distribution of the variable HOUSINGDEBT(R), on average the ratio of total expenditure spent on housing to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on housing to debt
  • In the distribution of the variable HOUSING(R), on average a consumer has a ratio of total expenditure spent on housing between the past 6 and 12 months (expenditure spent on housing over the past 6 months/expenditure spent on housing over the past 12 months) of approximately between 0.506 and 0.50625 but with some customers who have a lower ratio of total expenditure spent on housing between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on housing between the past 6 and 12 months

Group 9 — Tax

The variables TAX(T12), TAX(T6), TAXINCOME(R), TAXSAVINGS(R), and TAXDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable TAX(R) seems to mostly resemble a normal distribution. The mean of the variables TAX(T12), TAX(T6), TAX(R), TAXINCOME(R), TAXSAVINGS(R), and TAXDEBT(R) are 4110.759000, 2072.056000, 0.502023, 0.025089, 0.014386, and 0.009917 respectively while the standard deviations are 4642.793109, 2355.615151, 0.036988, 0.019101, 0.021287, and 0.022221 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable TAX(T12), on average a customer’s total expenditure on taxes over the past 12 months is approximately between $0 and $5,000 but with a long tail of customers who have a higher total expenditure on taxes.
  • In the distribution of the variable TAX(T6), on average a customer’s total expenditure on taxes over the past 6 months is approximately between $0 and $2,000 but with a long tail of customers who have a higher total expenditure on taxes.
  • In the distribution of the variable TAXINCOME(R), on average the ratio of total expenditure spent on taxes to income for a given customer during the past 12 months is approximately between 0 and 0.03 but with a long tail of customers who have a higher ratio of total expenditure spent on taxes to income
  • In the distribution of the variable TAXSAVINGS(R), on average the ratio of total expenditure spent on taxes to savings for a given customer during the past 12 months is approximately between 0 and 0.015 but with a long tail of customers who have a higher ratio of total expenditure spent on taxes to savings
  • In the distribution of the variable TAXDEBT(R), on average the ratio of total expenditure spent on taxes to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on taxes to debt
  • In the distribution of the variable TAX(R), on average a consumer has a ratio of total expenditure spent on taxes between the past 6 and 12 months (expenditure spent on taxes over the past 6 months/expenditure spent on taxes over the past 12 months) of approximately 0.5 but with some customers who have a lower ratio of total expenditure spent on taxes between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on taxes between the past 6 and 12 months

Group 10 — Travel

The variables TRAVEL(T12), TRAVEL(T6), TRAVELSAVINGS(R), and TRAVELDEBT(R) all seem to mostly resemble a positively (right) skewed distribution, while the variable TRAVEL(R) seems to mostly resemble a normal distribution. Moreover, the distribution of the variable TRAVELINCOME(R) seems to be a bit unclear. The means of the variables TRAVEL(T12), TRAVEL(T6), TRAVEL(R), TRAVELINCOME(R), TRAVELSAVINGS(R), and TRAVELDEBT(R) are 31762.33000, 16675.370000, 0.488455, 0.282834, 0.331022, and 0.182340 respectively, while the standard deviations are 35822.00011, 22305.675848, 0.179431, 0.196954, 0.787567, and 0.494932 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable TRAVEL(T12), on average a customer’s total expenditure on travel over the past 12 months is approximately between $0 and $32,000 but with a long tail of customers who have a higher total expenditure on travel.
  • In the distribution of the variable TRAVEL(T6), on average a customer’s total expenditure on travel over the past 6 months is approximately between $0 and $20,000 but with a long tail of customers who have a higher total expenditure on travel.
  • In the distribution of the variable TRAVELSAVINGS(R), on average the ratio of total expenditure spent on travel to savings for a given customer during the past 12 months is approximately between 0 and 0.5 but with a long tail of customers who have a higher ratio of total expenditure spent on travel to savings
  • In the distribution of the variable TRAVELDEBT(R), on average the ratio of total expenditure spent on travel to debt for a given customer during the past 12 months is approximately between 0 and 0.2 but with a long tail of customers who have a higher ratio of total expenditure spent on travel to debt
  • In the distribution of the variable TRAVEL(R), on average a consumer has a ratio of total expenditure spent on travel between the past 6 and 12 months (expenditure spent on travel over the past 6 months/expenditure spent on travel over the past 12 months) of approximately between 0.4 and 0.6 but with some customers who have a lower ratio of total expenditure spent on travel between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on travel between the past 6 and 12 months
  • In the distribution of the variable TRAVELINCOME(R), the distribution seems to be mostly normal albeit with a significant number of outliers, specifically a significant number of customers have a ratio of total expenditure spent on travel to income of 0 during the past 12 months. Starting from a ratio of total expenditure spent on travel to income of 0.2, on average most consumers have a ratio of total expenditure spent on travel to income that is approximately 0.5 but with some consumers that have a lower ratio of total expenditure spent on travel to income and some customers who have a higher ratio of total expenditure spent on travel to income

Group 11 — Utilities

The variables UTILITIES(T12), UTILITIES(T6), UTILITIESSAVINGS(R), and UTILITIESDEBT(R) all seem to mostly resemble a positively (right) skewed distribution, while the variable UTILITIES(R) seems to mostly resemble a normal distribution. Moreover, the distribution of the variable UTILITIESINCOME(R) seems to be a bit unclear. The means of the variables UTILITIES(T12), UTILITIES(T6), UTILITIES(R), UTILITIESINCOME(R), UTILITIESSAVINGS(R), and UTILITIESDEBT(R) are 6755.765000, 3394.665000, 0.502487, 0.054655, 0.033229, and 0.036638 respectively, while the standard deviations are 6313.708805, 3172.759148, 0.001859, 0.025812, 0.052815, and 0.080381 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable UTILITIES(T12), on average a customer’s total expenditure on utilities over the past 12 months is approximately between $0 and $7,000 but with a long tail of customers who have a higher total expenditure on utilities.
  • In the distribution of the variable UTILITIES(T6), on average a customer’s total expenditure on utilities over the past 6 months is approximately between $0 and $4,000 but with a long tail of customers who have a higher total expenditure on utilities.
  • In the distribution of the variable UTILITIESSAVINGS(R), on average the ratio of total expenditure spent on utilities to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on utilities to savings
  • In the distribution of the variable UTILITIESDEBT(R), on average the ratio of total expenditure spent on utilities to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on utilities to debt
  • In the distribution of the variable UTILITIES(R), on average a consumer has a ratio of total expenditure spent on utilities between the past 6 and 12 months (expenditure spent on utilities over the past 6 months/expenditure spent on utilities over the past 12 months) of approximately between 0.5 and 0.505 but with some customers who have a lower ratio of total expenditure spent on utilities between the past 6 and 12 months and some customers who have a higher ratio of total expenditure spent on utilities between the past 6 and 12 months
  • In the distribution of the variable UTILITIESINCOME(R), the distribution seems to be mostly normal albeit with a significant number of outliers, specifically a significant number of customers have a ratio of total expenditure spent on utilities to income of 0 during the past 12 months. Starting from a ratio of total expenditure spent on utilities to income of 0.02, on average most customers have a ratio of total expenditure spent on utilities to income that is approximately between 0.04 and 0.06 but with some customers who have a lower ratio of total expenditure spent on utilities to income and some customers who have a higher ratio of total expenditure spent on utilities to income

Total Expenditure

The variables EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITUREINCOME(R), EXPENDITURESAVINGS(R), and EXPENDITUREDEBT(R) all seem to mostly resemble a positively (right) skewed distribution, while the variable EXPENDITURE(R) mostly seems to resemble a normal distribution. The means of the variables EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITURE(R), EXPENDITUREINCOME(R), EXPENDITURESAVINGS(R), and EXPENDITUREDEBT(R) are 104330.824000, 54247.771000, 0.512560, 0.943607, 0.913340, and 0.605276 respectively, while the standard deviations are 89250.193047, 49853.939283, 0.079740, 0.168989, 1.625278, and 1.299382 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable EXPENDITURE(T12), on average a customer’s combined expenditure across all eleven categories (Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Taxes, Fines) over the past 12 months is approximately between $0 and $100,000 but with a long tail of customers who have a higher combined expenditure across all eleven categories
  • In the distribution of the variable EXPENDITURE(T6), on average a customer’s combined expenditure across all eleven categories (Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Taxes, Fines) over the past 6 months is approximately between $0 and $50,000 but with a long tail of customers who have a higher combined expenditure across all eleven categories
  • In the distribution of the variable EXPENDITURE(R), on average the ratio of a customer’s combined expenditure over the past 6 months to their combined expenditure over the past 12 months (combined expenditure over the past 6 months/combined expenditure over the past 12 months) is approximately 0.5 but with some customers who have a lower ratio of combined expenditure in the last 6 months to combined expenditure in the last 12 months and some customers who have a higher ratio of combined expenditure in the last 6 months to combined expenditure in the last 12 months
  • In the distribution of EXPENDITUREINCOME(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their income is approximately 1 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to income
  • In the distribution of EXPENDITURESAVINGS(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their savings is approximately between 0 and 1 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to savings
  • In the distribution of EXPENDITUREDEBT(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their debt is approximately between 0 and 0.5 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to debt
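
To see why the (R) variables that compare the past 6 months with the past 12 months tend to center on 0.5, a small worked example may help. This is only a sketch: the function name and the monthly figures are hypothetical, and the dataset itself provides the 6-month and 12-month totals rather than monthly values.

```python
def six_to_twelve_ratio(monthly_spend):
    """Return spend in the most recent 6 months divided by spend in all 12.

    `monthly_spend` lists 12 monthly totals, oldest first (hypothetical input).
    """
    if len(monthly_spend) != 12:
        raise ValueError("expected 12 monthly totals")
    total_12 = sum(monthly_spend)
    if total_12 == 0:
        return float("nan")  # ratio is undefined when nothing was spent
    return sum(monthly_spend[-6:]) / total_12

steady = [500] * 12                  # same spend every month
rising = [300] * 6 + [700] * 6       # spend increased in the last 6 months

print(six_to_twelve_ratio(steady))   # 0.5
print(six_to_twelve_ratio(rising))   # 0.7
```

A customer who spends evenly across the year lands at exactly 0.5, while values above or below 0.5 indicate that spending rose or fell in the most recent half of the year.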

Categorical Variables:

The variable CATGAMBLING seems to mostly resemble a normal distribution, while for the variables CATDEBT, CATCREDITCARD, CATMORTGAGE, CATSAVINGSACCOUNT, and CATDEPENDENTS the distributions seem to be a bit unclear. The means of the variables CATGAMBLING, CATDEBT, CATCREDITCARD, CATMORTGAGE, CATSAVINGSACCOUNT, and CATDEPENDENTS are 0.852000, 0.944000, 0.236000, 0.173000, 0.993000, and 0.173000 respectively, while the standard deviations are 0.598711, 0.230037, 0.424835, 0.378437, 0.083414, and 0.378437 respectively. The distributions of the variables imply the following:

  • In the distribution of the variable CATGAMBLING, the majority of customers fall under category 1 (don’t participate in gambling) while some customers fall under gambling category 0 (High, frequently participate in gambling) as well as gambling category 2 (Low, occasionally participate in gambling)
  • In the distribution of the variable CATDEBT, the majority of customers have debt while very few customers have no debt
  • In the distribution of the variable CATCREDITCARD, the majority of customers don’t have a credit card while few customers do have a credit card
  • In the distribution of the variable CATMORTGAGE, the majority of customers are not currently paying off a mortgage while few customers are continuing to pay off a mortgage
  • In the distribution of the variable CATSAVINGSACCOUNT, almost all customers have a savings account of some kind
  • In the distribution of the variable CATDEPENDENTS, the majority of customers claim zero dependents while few customers claim at least one dependent
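
For the 0/1 indicator variables, the mean is simply the share of customers in category 1, which is why the reported means read directly as proportions. The sketch below, assuming pandas is available, builds a hypothetical indicator with 944 ones out of 1000 to match the reported CATDEBT mean; it does not load the actual dataset.

```python
import pandas as pd

# Hypothetical 0/1 indicator built to match the reported mean of CATDEBT:
# 944 customers with debt (1) and 56 without (0).
cat_debt = pd.Series([1] * 944 + [0] * 56, name="CATDEBT")

share_with_debt = cat_debt.mean()            # mean of a 0/1 flag = share of 1s
counts = cat_debt.value_counts()             # raw counts per category
sample_std = cat_debt.std()                  # pandas uses ddof=1 by default

print(counts.to_dict())
print(round(share_with_debt, 3), round(sample_std, 6))
```

With these counts the mean is 0.944 and the sample standard deviation rounds to 0.230037, in line with the figures reported above, so the standard deviation of a binary flag carries no information beyond the proportion itself.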

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the dependent (response) variable, CREDITSCORE which is a numerical target variable representing the customer’s credit score (integer). The correlation matrices will also be crucial in determining which variables of interest best predict the customer’s credit score. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It’s also worth noting that variables with a correlation greater than 0.3 or less than -0.3 are treated as suitable variables for predicting the customer’s credit score, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship. While correlation values are certainly not a hard and fast rule for choosing the independent variables that best predict the customer’s credit score, they serve as a guideline for choosing appropriate predictor variables.
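
The 0.3 rule of thumb described above can be sketched as a small filter on the correlation matrix. This is an illustration under stated assumptions, not the code used for this project: the helper name `select_by_correlation` is hypothetical, the data is synthetic, and Pearson correlation (the pandas default) is assumed.

```python
import numpy as np
import pandas as pd

def select_by_correlation(df, target, threshold=0.3):
    """Return predictors whose Pearson correlation with `target` exceeds
    `threshold` in absolute value, strongest first."""
    corr = df.corr()[target].drop(target)
    keep = corr[corr.abs() > threshold]
    return keep.reindex(keep.abs().sort_values(ascending=False).index)

# Synthetic stand-in data: the score depends on the debt ratio, not on income.
rng = np.random.default_rng(42)
n = 1000
debt_income = rng.uniform(0.0, 2.0, size=n)
income = rng.normal(60000, 15000, size=n)
score = 700 - 100 * debt_income + rng.normal(0, 30, size=n)

df = pd.DataFrame({"INCOME": income, "DEBTINCOME(R)": debt_income,
                   "CREDITSCORE": score})
selected = select_by_correlation(df, "CREDITSCORE")
print(selected.round(2))
```

Filtering on the absolute value keeps strongly negative predictors as well as strongly positive ones, which matters here because the debt-related variables are negatively related to the credit score.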

Predictor Variables — Income, Savings, Debt, Ratio of Savings to Income, Ratio of Debt to Income, Ratio of Debt to Savings

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘INCOME’, ‘SAVINGS’, ‘DEBT’, ‘SAVINGSINCOME(R)’, ‘DEBTINCOME(R)’, ‘DEBTSAVINGS(R)’. Based on the correlation values, there seems to be a moderate negative relationship between the dependent variable and the independent variable DEBT, a very strong negative relationship between the dependent variable and the independent variable DEBTINCOME(R), and a strong negative relationship between the dependent variable and the independent variable DEBTSAVINGS(R).

Taking a closer look at the remaining predictor variables, the correlation between the dependent variable CREDITSCORE and the independent variable SAVINGSINCOME(R) seems to indicate a weak positive relationship, while the correlation between the dependent variable and the independent variable INCOME seems to suggest a negligible relationship. Likewise, the correlation between the dependent variable and the independent variable SAVINGS seems to suggest a negligible relationship as well.

In conclusion, based on the correlation values, the variables DEBT, DEBTINCOME(R), and DEBTSAVINGS(R) are good predictor variables of the dependent variable CREDITSCORE, while the variables INCOME, SAVINGS, and SAVINGSINCOME(R) aren’t good predictor variables of the dependent variable.
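
The verbal labels used throughout this section (negligible, weak, moderate, strong, very strong) can be made explicit with a small helper. Only the 0.3 cutoff for "moderate" comes from the text above; the remaining cutoffs (0.1, 0.5, 0.7) and the function name are assumptions chosen for illustration, and the sample values in the loop are not the correlations from this dataset.

```python
def describe_correlation(r):
    """Map a Pearson correlation to a verbal label.

    Only the 0.3 'moderate' cutoff comes from the text; the other cutoffs
    (0.1, 0.5, 0.7) are assumed for illustration.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    size = abs(r)
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        strength = "weak"
    elif size < 0.5:
        strength = "moderate"
    elif size < 0.7:
        strength = "strong"
    else:
        strength = "very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Made-up correlation values, one per label.
for r in (0.05, 0.2, -0.35, -0.6, -0.8):
    print(r, "->", describe_correlation(r))
```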

Predictor Variables — Transaction Groups

Group 1 — Clothing

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHING(R)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, ‘CLOTHINGDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable CLOTHINGDEBT(R) indicates a weak positive relationship, while the correlation values suggest a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: CLOTHING(T12), CLOTHING(T6), CLOTHING(R), CLOTHINGINCOME(R), and CLOTHINGSAVINGS(R).

In conclusion, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 2 — Education

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, ‘EDUCATIONDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable EDUCATIONINCOME(R) indicates a moderate negative relationship while the correlation between the dependent variable and the independent variable EDUCATIONSAVINGS(R) indicates a weak negative relationship.

On the other hand, the correlation values indicate a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, and ‘EDUCATIONDEBT(R)’.

In conclusion, based on the correlation values, the variable EDUCATIONINCOME(R) is a good predictor variable of the dependent variable CREDITSCORE, while the variables EDUCATIONSAVINGS(R), EDUCATION(T12), EDUCATION(T6), EDUCATION(R), and EDUCATIONDEBT(R) aren’t good predictor variables of the dependent variable.

Group 3 — Entertainment

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘ENTERTAINMENT(T12)’, ‘ENTERTAINMENT(T6)’, ‘ENTERTAINMENT(R)’, ‘ENTERTAINMENTINCOME(R)’, ‘ENTERTAINMENTSAVINGS(R)’, ‘ENTERTAINMENTDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable ENTERTAINMENTDEBT(R) indicates a weak positive relationship.

On the other hand, the correlation values indicate a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENT(R), ENTERTAINMENTINCOME(R), and ENTERTAINMENTSAVINGS(R).

In conclusion, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 4 — Fines

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘FINES(T12)’, ‘FINES(T6)’, ‘FINES(R)’, ‘FINESINCOME(R)’, ‘FINESSAVINGS(R)’, ‘FINESDEBT(R)’. Based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable. In particular, the correlation values indicate a negligible relationship between the dependent variable and each of the independent variables.

Group 5 — Gambling

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘GAMBLING(T12)’, ‘GAMBLING(T6)’, ‘GAMBLING(R)’, ‘GAMBLINGINCOME(R)’, ‘GAMBLINGSAVINGS(R)’, ‘GAMBLINGDEBT(R)’. Based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable. In particular, the correlation values indicate a negligible relationship between the dependent variable and each of the independent variables.

Group 6 — Groceries

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘GROCERIES(T12)’, ‘GROCERIES(T6)’, ‘GROCERIES(R)’, ‘GROCERIESINCOME(R)’, ‘GROCERIESSAVINGS(R)’, ‘GROCERIESDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable GROCERIESDEBT(R) indicates a weak positive relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: GROCERIES(T12), GROCERIES(T6), GROCERIES(R), GROCERIESINCOME(R), and GROCERIESSAVINGS(R).

Therefore, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 7 — Health

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘HEALTH(T12)’, ‘HEALTH(T6)’, ‘HEALTH(R)’, ‘HEALTHINCOME(R)’, ‘HEALTHSAVINGS(R)’, ‘HEALTHDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable HEALTHDEBT(R) indicates a weak positive relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: HEALTH(T12), HEALTH(T6), HEALTH(R), HEALTHINCOME(R), and HEALTHSAVINGS(R).

Therefore, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 8 — Housing

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘HOUSING(T12)’, ‘HOUSING(T6)’, ‘HOUSING(R)’, ‘HOUSINGINCOME(R)’, ‘HOUSINGSAVINGS(R)’, ‘HOUSINGDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable HOUSINGDEBT(R) indicates a weak positive relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: HOUSING(T12), HOUSING(T6), HOUSING(R), HOUSINGINCOME(R), and HOUSINGSAVINGS(R).

Therefore, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 9 — Taxes

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘TAX(T12)’, ‘TAX(T6)’, ‘TAX(R)’, ‘TAXINCOME(R)’, ‘TAXSAVINGS(R)’, ‘TAXDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable TAXDEBT(R) indicates a weak positive relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: TAX(T12), TAX(T6), TAX(R), TAXINCOME(R), and TAXSAVINGS(R).

Therefore, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 10 — Travel

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘TRAVEL(T12)’, ‘TRAVEL(T6)’, ‘TRAVEL(R)’, ‘TRAVELINCOME(R)’, ‘TRAVELSAVINGS(R)’, ‘TRAVELDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable TRAVELDEBT(R) indicates a weak positive relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: TRAVEL(T12), TRAVEL(T6), TRAVEL(R), TRAVELINCOME(R), and TRAVELSAVINGS(R).

Therefore, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 11 — Utilities

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘UTILITIES(T12)’, ‘UTILITIES(T6)’, ‘UTILITIES(R)’, ‘UTILITIESINCOME(R)’, ‘UTILITIESSAVINGS(R)’, ‘UTILITIESDEBT(R)’. Based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable. In particular, the correlation values indicate a negligible relationship between the dependent variable and each of the independent variables.

Predictor Variables — Total Expenditure

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘EXPENDITURE(T12)’, ‘EXPENDITURE(T6)’, ‘EXPENDITURE(R)’, ‘EXPENDITUREINCOME(R)’, ‘EXPENDITURESAVINGS(R)’, ‘EXPENDITUREDEBT(R)’. The correlation between the dependent variable and the independent variable EXPENDITUREDEBT(R) indicates a moderate positive relationship, while the correlations between the dependent variable and the following independent variables all indicate a negligible relationship: EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITURE(R), EXPENDITUREINCOME(R), and EXPENDITURESAVINGS(R).

Therefore, based on the correlation values, the variable EXPENDITUREDEBT(R) is a good predictor variable of the dependent variable CREDITSCORE, while the variables EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITURE(R), EXPENDITUREINCOME(R), and EXPENDITURESAVINGS(R) aren’t good predictor variables of the dependent variable.

Predictor Variables — Categorical Variables

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘CATGAMBLING’, ‘CATDEBT’, ‘CATCREDITCARD’, ‘CATMORTGAGE’, ‘CATSAVINGSACCOUNT’. The correlation between the dependent variable and the independent variable CATCREDITCARD indicates a moderate negative relationship, while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: CATGAMBLING, CATDEBT, CATMORTGAGE, and CATSAVINGSACCOUNT.

Therefore, based on the correlation values, the variable CATCREDITCARD is a good predictor variable of the dependent variable CREDITSCORE, while the variables CATGAMBLING, CATDEBT, CATMORTGAGE, and CATSAVINGSACCOUNT aren’t good predictor variables of the dependent variable.

Regression Analysis

Now that the EDA portion has been completed, the last step is to perform a regression analysis in order to determine which model best predicts CREDITSCORE, the dependent (response) variable. As part of the regression analysis, a total of five models were created to predict the customer’s credit score. It’s also worth noting that a train-test split methodology was implemented, given that the customer credit card dataset has 1000 rows and 87 columns.

  • Model 1 — Income, Savings, Debt, Ratio of Savings to Income, Ratio of Debt to Income, Ratio of Debt to Savings
  • Model 2 — Transaction Groups
  • Model 3 — Total Expenditures
  • Model 4 — Categorical Variables
  • Model 5 — Full Model (Combine Models 1, 2, 3, 4)
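
The workflow of holding out a test set and fitting one linear model per feature group can be sketched as follows. This is not the code used for this project: it assumes an 80/20 split, fits ordinary least squares with NumPy’s `np.linalg.lstsq` in place of whatever library was actually used, and runs on synthetic data with two hypothetical feature groups (`group_a`, `group_b`).

```python
import numpy as np

def train_test_split_rows(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]

def fit_ols(X, y):
    """Least-squares coefficients with an intercept as the first entry."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic stand-in data with two hypothetical feature groups.
rng = np.random.default_rng(1)
n = 1000
groups = {
    "group_a": rng.normal(size=(n, 3)),  # drives the target below
    "group_b": rng.normal(size=(n, 2)),  # unrelated to the target
}
y = 600 + groups["group_a"] @ np.array([-40.0, 10.0, 5.0]) + rng.normal(0, 20, n)

# Fit on the training rows only, then score each model on the held-out rows.
train_idx, test_idx = train_test_split_rows(n)
rmse_by_group = {}
for name, X in groups.items():
    coef = fit_ols(X[train_idx], y[train_idx])
    errors = y[test_idx] - predict(coef, X[test_idx])
    rmse_by_group[name] = float(np.sqrt(np.mean(errors ** 2)))
print(rmse_by_group)
```

Using the same train and test rows for every model keeps the comparison between feature groups fair, since each model is scored on identical held-out customers.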

In order to evaluate model performance, the linear regression metrics R-Squared and RMSE will be used. R-Squared represents the model’s goodness of fit, i.e., the proportion of the variance in the dependent variable that the model explains, while RMSE (root mean squared error) represents the typical distance between the values predicted by the model and the actual values, or in other words the typical size of the residuals (the distances between the data points and the regression line of best fit). Model performance will be evaluated with respect to the test set.

In summary, both RMSE and R-Squared measure a linear regression model’s goodness of fit. Ideally, the model that has the highest R-Squared and the lowest RMSE is the model of choice for predicting the customer’s credit score.
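
Both metrics are short formulas, and computing them by hand on a tiny example makes the definitions concrete. The sketch below assumes NumPy and the standard definitions (RMSE as the square root of the mean squared residual, R-Squared as one minus the ratio of residual to total sum of squares); the four data points are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def r_squared(y_true, y_pred):
    """1 - SSE/SST: share of the variance around the mean that is explained."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - sse / sst)

# Tiny made-up example: residuals are 1, 0, -1, 0.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
print(rmse(actual, predicted))       # sqrt(2 / 4), about 0.707
print(r_squared(actual, predicted))  # 1 - 2 / 20 = 0.9
```

RMSE is expressed in the units of the target (credit score points here), whereas R-Squared is unitless, which is why the two are usually reported together.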

With respect to the test RMSE and R-Squared values, the best performing models on paper are models 2 and 5. Model 2 has a test RMSE of 0 and a test R-Squared of 100, and model 5 has a test RMSE of 0 and a test R-Squared of 100 as well (R-Squared is reported here on a 0 to 100 scale, so 100 corresponds to a proportion of 1). An RMSE of 0 implies that the predicted values perfectly match the actual values in the dataset. In other words, there is no error between the predicted and actual values; they are identical. This is an ideal scenario but is rare and may indicate overfitting or a flaw in the evaluation process, for example if the model is tested on the same data it was trained on or if some predictors leak information about the target. In practical terms, achieving an RMSE of exactly 0 is typically not possible due to noise and variability in real-world data.

Moreover, an R-Squared value of 100 implies that the model perfectly predicts the variability of the response data around its mean. In other words, the model explains all of the variability in the dependent variable using the independent variables. This scenario is quite rare in practice and might indicate overfitting, where the model fits the training data so closely that it does not generalize well to new, unseen data. Therefore, while models 2 and 5 look ideal, such models are simply not realistic given that their respective test RMSE and R-Squared values are virtually 0 and 100, and their results could therefore be misleading.
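
One way a linear model can reach a test RMSE of essentially 0 is when the target is an exact linear function of the predictors, so that there is no noise left to explain. Whether that is what happened with models 2 and 5 is not established here; the sketch below only illustrates the mechanism on synthetic data, using a hypothetical helper `holdout_rmse` and NumPy least squares.

```python
import numpy as np

def holdout_rmse(X, y, n_train=800):
    """Fit least squares on the first n_train rows; return RMSE on the rest."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design[:n_train], y[:n_train], rcond=None)
    errors = y[n_train:] - design[n_train:] @ coef
    return float(np.sqrt(np.mean(errors ** 2)))

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 4))
weights = np.array([25.0, -10.0, 5.0, 2.0])

y_exact = 600 + X @ weights                  # target is a pure function of X
y_noisy = y_exact + rng.normal(0, 30, 1000)  # same signal plus noise

print(holdout_rmse(X, y_exact))   # essentially 0: nothing left to explain
print(holdout_rmse(X, y_noisy))   # close to the noise level of 30
```

If a real model behaves like the first case, it is worth checking whether any predictor was derived from the target or whether the target was itself constructed from the predictors.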

Further evaluating the test RMSE and R-Squared values, model 1 seems to be a much more realistic model, as indicated by a test RMSE of 31.73 and a test R-Squared of 75. A test RMSE of 31.73 is still fairly good and indicates a strong model fit, while a test R-Squared of 75 (a proportion of 0.75) means that the model explains about three quarters of the variance in the credit score, which indicates strong model performance as well. On the other hand, models 3 and 4 are not the models of choice given their high test RMSE values and low test R-Squared values when compared to model 1.
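
The two figures reported for model 1 can be related to each other with a little arithmetic. Assuming both were computed on the same test set with the usual definition R-Squared = 1 - MSE / variance of the target, the sketch below backs out the spread of the credit score that those numbers imply. The function name is hypothetical and the result is an implied value, not a statistic taken from the dataset.

```python
import math

def implied_target_std(rmse, r_squared):
    """Spread of the target implied by R^2 = 1 - RMSE^2 / variance.

    Assumes both metrics were computed on the same data with the usual
    definition of R^2, and that r_squared is a proportion below 1.
    """
    if not r_squared < 1.0:
        raise ValueError("r_squared must be below 1")
    return rmse / math.sqrt(1.0 - r_squared)

baseline = implied_target_std(rmse=31.73, r_squared=0.75)
print(round(baseline, 2))          # 63.46
print(round(31.73 / baseline, 2))  # 0.5
```

Under these assumptions, a model that always predicted the mean credit score would have an RMSE of roughly 63 points, so model 1 cuts the typical error about in half.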

