Regression Analysis Part I — Predicting Credit Score, Whether or Not Customer has Defaulted (Linear Regression Models)

36 min read · Feb 15, 2024

Overview

This project analyzes a dataset of 84 features derived from the financial transactions and current financial standing of 1000 customers. The variable CUST_ID is a unique customer identifier, while CREDIT_SCORE and DEFAULT are the two key target variables: CREDIT_SCORE is a numerical target representing the customer’s credit score (integer), and DEFAULT is a binary target indicating whether the customer has defaulted (1) or not (0). The remaining variables are explanatory (independent) variables, which fall into one of four categories:

Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
Total Expenditure
Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)

Python was the language of choice for this project, although R could certainly have been used as well. I personally find Python better suited than R here, since the regression analysis portion of this assignment involves machine learning techniques that I find easier to apply in Python. The data was obtained from Kaggle, a website that hosts various data science competitions. The CSV file used for this project is available at: https://www.kaggle.com/datasets/conorsully1/credit-score

Objective

The objective of this project is twofold: to predict a customer’s credit score and to predict the likelihood of a customer defaulting, both using the 84 features derived from the financial transactions and current financial standing of 1000 customers. To achieve this, linear regression models will be used to predict a customer’s credit score, while logistic regression models will be used to predict the likelihood of a customer defaulting. This project analyzes the following aspects of the dataset:

Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
Total Expenditure
Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)

Review of Data Sources

The data used for this assignment (credit_score.csv) was obtained from Kaggle, and the pandas library in Python was used to load it into the dataframe credit_score_data (Credit Score Dataset). The dataframe contains 1000 rows × 87 columns. Two of the columns (‘CUST_ID’ and ‘CAT_GAMBLING’) are object variables, while the remaining columns are all numerical variables.
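As a rough illustration of this step, the sketch below loads a small, made-up stand-in for credit_score.csv (three invented rows and only a handful of the 87 columns) and lists the non-numeric columns. The names n_rows, n_cols, and non_numeric_columns are hypothetical choices for this example, and this is not necessarily the exact code used in the project.

```python
import io

import pandas as pd

# Made-up stand-in for credit_score.csv: invented rows, only a few columns.
csv_text = """CUST_ID,INCOME,SAVINGS,DEBT,CAT_GAMBLING,CREDIT_SCORE,DEFAULT
C001,33000,0,530000,High,440,1
C002,77000,91000,315000,No,625,0
C003,31000,21000,534000,Low,470,1
"""

# With the real file this would be pd.read_csv("credit_score.csv").
credit_score_data = pd.read_csv(io.StringIO(csv_text))

# Shape of the dataframe (rows, columns).
n_rows, n_cols = credit_score_data.shape

# Columns that are not numeric (the identifier and the text category here).
non_numeric_columns = (
    credit_score_data.select_dtypes(exclude="number").columns.tolist()
)

print(n_rows, n_cols)
print(non_numeric_columns)
```

On the real dataset, the same two calls would be expected to report 1000 rows × 87 columns and the two non-numeric columns described above.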

To prepare the dataset for analysis and model evaluation, the columns were first renamed, and a mapping was then applied to convert the column CATGAMBLING from a categorical column to a numerical column. The mapping assigns a unique value to each category, and CATGAMBLING has three categories: none, low, and high. A value of 0 was assigned to the category High, a value of 1 to the category No, and a value of 2 to the category Low. Once the mapping was applied and CATGAMBLING was converted to a numerical variable, the data preparation stage was complete. The next step was to perform exploratory data analysis.
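The mapping described above can be sketched as follows. This is a minimal example on a tiny made-up column, assuming the category labels are spelled "High", "No", and "Low" as in the text; the dictionary name gambling_codes is hypothetical.

```python
import pandas as pd

# Tiny made-up example column (the real column has 1000 values).
df = pd.DataFrame({"CATGAMBLING": ["No", "High", "Low", "No"]})

# Codes as described in the text: High -> 0, No -> 1, Low -> 2.
gambling_codes = {"High": 0, "No": 1, "Low": 2}

# Replace each label with its numeric code.
df["CATGAMBLING"] = df["CATGAMBLING"].map(gambling_codes)

print(df["CATGAMBLING"].tolist())
```

Note that these codes do not follow the natural order of the categories (none, low, high), so a linear model will treat them as arbitrary numeric levels.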

Exploratory Data Analysis (EDA)

EDA was the next step of this project, the goal being to get a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis. In this article, I focus on the histograms and the correlation analysis, since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables while correlation analysis was instrumental in determining the predictor variables that will ultimately be used to predict a customer’s credit score as well as predict the likelihood of a customer defaulting. In particular, the EDA focused on the following aspects of the customer financial transactions and current financial standing dataset:

Core Variables: Income, Savings, Debt, R_SAVINGS_INCOME (Ratio of savings to income), R_DEBT_INCOME (Ratio of debt to income), and R_DEBT_SAVINGS (Ratio of debt to savings)
Transaction Groups: Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Tax, Fines
Total Expenditure
Categorical Variables: CAT_GAMBLING (none, low, high), CAT_DEBT (1 if the customer has debt; 0 otherwise), CAT_CREDIT_CARD (1 if the customer has a credit card; 0 otherwise), CAT_MORTGAGE (1 if the customer has a mortgage; 0 otherwise), CAT_SAVINGS_ACCOUNT (1 if the customer has a savings account; 0 otherwise), CAT_DEPENDENTS (1 if the customer has any dependents; 0 otherwise)
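To make the correlation analysis step more concrete, here is a minimal sketch that ranks candidate predictors by the absolute value of their correlation with CREDIT_SCORE. The data is synthetic and the relationship between the columns is invented purely for illustration; only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: CREDIT_SCORE is constructed to fall as R_DEBT_INCOME rises,
# while INCOME is generated independently of the score (an invented setup).
debt_ratio = rng.gamma(shape=2.0, scale=3.0, size=n)
toy = pd.DataFrame({
    "INCOME": rng.lognormal(mean=11.0, sigma=0.5, size=n),
    "R_DEBT_INCOME": debt_ratio,
    "CREDIT_SCORE": 700 - 10 * debt_ratio + rng.normal(0, 5, size=n),
})

# Pearson correlation of every predictor with the target,
# sorted so the strongest (positive or negative) comes first.
corr_with_target = (
    toy.corr()["CREDIT_SCORE"]
    .drop("CREDIT_SCORE")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_target)
```

Ranking by absolute value keeps strongly negative predictors near the top of the list instead of at the bottom.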

Histograms

Core Variables:

The variables ‘INCOME’, ‘SAVINGS’, ‘DEBT’, ‘SAVINGSINCOME(R)’, ‘DEBTINCOME(R)’, and ‘DEBTSAVINGS(R)’ all appear to be positively (right) skewed. The means of these variables are 121,610.019, 413,189.6, 790,718.0, 4.063477, 6.068449, and 5.867252 respectively, while the standard deviations are 113,716.699591, 442,916.0, 981,790.4, 3.968097, 5.847878, and 16.788356 respectively. The distributions of the variables imply the following:

For INCOME, customers typically have between $0 and approximately $200,000 in total income during the last 12 months, with a long tail of customers who have a higher income.
For SAVINGS, customers typically have between $0 and approximately $700,000 in total savings during the last 12 months, with a long tail of customers who have more savings.
For DEBT, customers typically have between $0 and approximately $600,000 in total debt during the last 12 months, with a long tail of customers who have more debt.

For SAVINGSINCOME(R), a customer’s savings to income ratio is typically between 0 and 5, with a long tail of customers who have a higher ratio.
For DEBTINCOME(R), a customer’s debt to income ratio is typically between 0 and 10, with a long tail of customers who have a higher ratio.
For DEBTSAVINGS(R), a customer’s debt to savings ratio is typically between 0 and 6, with a long tail of customers who have a higher ratio.
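A quick numeric check can back up the visual impression of right skew. The sketch below computes the mean, standard deviation, and sample skewness per column on synthetic, lognormally distributed stand-ins for INCOME and SAVINGS; the distribution parameters are assumptions chosen only to produce a long right tail, not estimates from the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed stand-ins for two of the core variables.
toy = pd.DataFrame({
    "INCOME": rng.lognormal(mean=11.3, sigma=0.8, size=1000),
    "SAVINGS": rng.lognormal(mean=12.3, sigma=1.0, size=1000),
})

# One row per variable with its mean, standard deviation, and skewness.
summary = toy.agg(["mean", "std", "skew"]).T

# In a right-skewed distribution the long tail pulls the mean above the median.
mean_above_median = toy.mean() > toy.median()

print(summary)
print(mean_above_median)
```

A positive skewness value, together with a mean above the median, is consistent with the long right tails described above.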

Transaction Groups:

Group 1 — Clothing

The variables ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, and ‘CLOTHINGDEBT(R)’ all appear to be positively (right) skewed, while the variable CLOTHING(R) appears to be approximately normal. The means of the variables ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHING(R)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, and ‘CLOTHINGDEBT(R)’ are 6822.401000, 3466.320000, 0.454848, 0.055557, 0.048057, and 0.030536 respectively, while the standard deviations are 7486.225932, 5118.942977, 0.236036, 0.037568, 0.097712, and 0.084469 respectively. The distributions of the variables imply the following:

For CLOTHING(T12), a customer’s total clothing expenditure in the last 12 months is typically between $0 and $7,000, with a long tail of customers who spend more.
For CLOTHING(T6), a customer’s total clothing expenditure in the last 6 months is typically between $0 and $5,000, with a long tail of customers who spend more.

For CLOTHINGINCOME(R), the ratio of total clothing expenditure to income during the past 12 months is typically around 0.05, with a long tail of customers who have a higher ratio.
For CLOTHINGSAVINGS(R), the ratio of total clothing expenditure to savings during the past 12 months is typically around 0.04, with a long tail of customers who have a higher ratio.
For CLOTHINGDEBT(R), the ratio of total clothing expenditure to debt during the past 12 months is typically around 0.03, with a long tail of customers who have a higher ratio.
For CLOTHING(R), the ratio of total clothing expenditure in the last 6 months to total clothing expenditure in the last 12 months is typically around 0.5, with some customers below and some customers above that value.
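Since the (R) variable of each transaction group is described as the last-6-months total divided by the last-12-months total, a small worked example may help with reading these ratios. The three customers below are made up, and the dataset already ships with this column precomputed, so this only illustrates the definition.

```python
import pandas as pd

# Three made-up customers with 12-month and 6-month clothing totals.
toy = pd.DataFrame({
    "CLOTHING(T12)": [6000.0, 4000.0, 8000.0],
    "CLOTHING(T6)": [3000.0, 1000.0, 6000.0],
})

# Share of the 12-month spend that occurred in the most recent 6 months.
toy["CLOTHING(R)"] = toy["CLOTHING(T6)"] / toy["CLOTHING(T12)"]

print(toy["CLOTHING(R)"].tolist())
```

A ratio of 0.5 means spending was spread evenly across the two halves of the year, a value below 0.5 means more was spent in the earlier 6 months, and a value above 0.5 means more was spent in the most recent 6 months.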

Group 2 — Education

The variables ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, and ‘EDUCATIONDEBT(R)’ all appear to be positively (right) skewed, while the variable EDUCATION(R) appears to be approximately normal. The means of the variables ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, and ‘EDUCATIONDEBT(R)’ are 3,604.26000, 1,811.460000, 0.502418, 0.038695, 0.054301, and 0.011843 respectively, while the standard deviations are 7065.70035, 3551.440702, 0.001910, 0.074037, 0.242352, and 0.059634 respectively. The distributions of the variables imply the following:

For EDUCATION(T12), a customer’s total education expenditure in the last 12 months is typically between $0 and $3,700, with a long tail of customers who spend more.
For EDUCATION(T6), a customer’s total education expenditure in the last 6 months is typically between $0 and $2,500, with a long tail of customers who spend more.

For EDUCATIONINCOME(R), the ratio of total education expenditure to income during the past 12 months is typically between 0 and 0.04, with a long tail of customers who have a higher ratio.
For EDUCATIONSAVINGS(R), the ratio of total education expenditure to savings during the past 12 months is typically between 0 and 0.05, with a long tail of customers who have a higher ratio.
For EDUCATIONDEBT(R), the ratio of total education expenditure to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For EDUCATION(R), the ratio of total education expenditure in the last 6 months to total education expenditure in the last 12 months is typically between 0.5 and 0.5025, with some customers below and some customers above that range.

Group 3— Entertainment

The variables ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENTINCOME(R), ENTERTAINMENTSAVINGS(R), and ENTERTAINMENTDEBT(R) all appear to be positively (right) skewed, while the distribution of the variable ENTERTAINMENT(R) is less clear. The means of the variables ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENT(R), ENTERTAINMENTINCOME(R), ENTERTAINMENTSAVINGS(R), and ENTERTAINMENTDEBT(R) are 14,261.255000, 7945.307000, 0.546432, 0.167514, 0.219330, and 0.094456 respectively, while the standard deviations are 12388.187688, 7374.463757, 0.062581, 0.136778, 0.577628, and 0.237148 respectively. The distributions of the variables imply the following:

For ENTERTAINMENT(T12), a customer’s total expenditure on entertainment in the last 12 months is typically between $0 and $15,000, with a long tail of customers who spend more.
For ENTERTAINMENT(T6), a customer’s total expenditure on entertainment in the last 6 months is typically between $0 and $10,000, with a long tail of customers who spend more.

For ENTERTAINMENTINCOME(R), the ratio of total expenditure on entertainment to income during the past 12 months is typically around 0.17, with a long tail of customers who have a higher ratio of entertainment expenditure to income.
For ENTERTAINMENTSAVINGS(R), the ratio of total expenditure on entertainment to savings during the past 12 months is typically around 0.22, with a long tail of customers who have a higher ratio of entertainment expenditure to savings.
For ENTERTAINMENTDEBT(R), the ratio of total expenditure on entertainment to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio of entertainment expenditure to debt.
For ENTERTAINMENT(R), the distribution is less clear. This variable is the ratio of entertainment expenditure over the past 6 months to entertainment expenditure over the past 12 months. From a ratio of 0.55 upward, the distribution appears mostly normal: among customers with a ratio of at least 0.55, most have a ratio of approximately 0.65, with some customers below and some customers above that value. That being said, outliers are present in the distribution. In particular, a significant number of customers have a ratio of approximately 0.5.
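One way to quantify a pattern like this is to measure the share of customers sitting near 0.5 and then summarize the rest separately. The sketch below does this on a synthetic mixture; the 30/70 split, the 0.03 spread, and the 0.01 tolerance are all invented for illustration and are not taken from the real ENTERTAINMENT(R) column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic mixture: a block of customers at exactly 0.5 plus a roughly
# normal bulk centred on 0.65 (all proportions here are made up).
at_half = np.full(300, 0.5)
bulk = rng.normal(loc=0.65, scale=0.03, size=700)
ratio = pd.Series(np.concatenate([at_half, bulk]), name="ENTERTAINMENT(R)")

# Flag values within a small tolerance of 0.5.
near_half = (ratio - 0.5).abs() < 0.01

# Share of customers in the cluster, and the mean of everyone else.
share_near_half = near_half.mean()
mean_of_rest = ratio[~near_half].mean()

print(round(share_near_half, 3), round(mean_of_rest, 3))
```

Summarizing the two groups separately avoids a single overall mean that describes neither the cluster near 0.5 nor the bulk around 0.65.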

Group 4 — Fines

The variables FINES(T12), FINES(T6), FINESINCOME(R), FINESSAVINGS(R), and FINESDEBT(R) all appear to be positively (right) skewed, while the variable FINES(R) appears to be approximately normal. The means of the variables FINES(T12), FINES(T6), FINES(R), FINESINCOME(R), FINESSAVINGS(R), and FINESDEBT(R) are 26.504000, 14.460000, 0.760875, 0.000291, 0.000400, and 0.000098 respectively, while the standard deviations are 136.171755, 88.509176, 0.286043, 0.001390, 0.003359, and 0.000783 respectively. The distributions of the variables imply the following:

For FINES(T12), a customer’s total expenditure on fines in the last 12 months is typically between $0 and $27, with a long tail of customers who spend more on fines.
For FINES(T6), a customer’s total expenditure on fines in the last 6 months is typically between $0 and $15, with a long tail of customers who spend more on fines.

For FINESINCOME(R), the ratio of total expenditure on fines to income during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For FINESSAVINGS(R), the ratio of total expenditure on fines to savings during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For FINESDEBT(R), the ratio of total expenditure on fines to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For FINES(R), the ratio of expenditure on fines over the past 6 months to expenditure on fines over the past 12 months is typically between 0.6 and 0.8, with some customers below and some customers above that range.

Group 5 — Gambling

The variables GAMBLING(T12), GAMBLING(T6), GAMBLINGINCOME(R), GAMBLINGSAVINGS(R), and GAMBLINGDEBT(R) all appear to be positively (right) skewed, while the variable GAMBLING(R) appears to be approximately normal. The means of the variables GAMBLING(T12), GAMBLING(T6), GAMBLING(R), GAMBLINGINCOME(R), GAMBLINGSAVINGS(R), and GAMBLINGDEBT(R) are 2433.58700, 1241.625000, 0.515578, 0.018471, 0.018333, and 0.006730 respectively, while the standard deviations are 5007.15757, 2570.476975, 0.052257, 0.032843, 0.063437, and 0.017247 respectively. The distributions of the variables imply the following:

For GAMBLING(T12), a customer’s total expenditure on gambling over the past 12 months is typically between $0 and $2,500, with a long tail of customers who spend more on gambling.
For GAMBLING(T6), a customer’s total expenditure on gambling over the past 6 months is typically between $0 and $1,250, with a long tail of customers who spend more on gambling.

For GAMBLINGINCOME(R), the ratio of total expenditure on gambling to income during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For GAMBLINGSAVINGS(R), the ratio of total expenditure on gambling to savings during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For GAMBLINGDEBT(R), the ratio of total expenditure on gambling to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For GAMBLING(R), the ratio of expenditure on gambling over the past 6 months to expenditure on gambling over the past 12 months is typically around 0.5, with some customers below and some customers above that value.

Group 6 — Groceries

The variables GROCERIES(T12), GROCERIES(T6), GROCERIESINCOME(R), GROCERIESSAVINGS(R), and GROCERIESDEBT(R) all appear to be positively (right) skewed, while the variable GROCERIES(R) appears to be approximately normal. The means of the variables GROCERIES(T12), GROCERIES(T6), GROCERIES(R), GROCERIESINCOME(R), GROCERIESSAVINGS(R), and GROCERIESDEBT(R) are 18027.602000, 9327.58900, 0.509240, 0.156475, 0.112795, and 0.096222 respectively, while the standard deviations are 19207.309541, 10313.29888, 0.057203, 0.088929, 0.227134, and 0.222194 respectively. The distributions of the variables imply the following:

For GROCERIES(T12), a customer’s total expenditure on groceries over the past 12 months is typically between $0 and $20,000, with a long tail of customers who spend more on groceries.
For GROCERIES(T6), a customer’s total expenditure on groceries over the past 6 months is typically between $0 and $10,000, with a long tail of customers who spend more on groceries.

For GROCERIESINCOME(R), the ratio of total expenditure on groceries to income during the past 12 months is typically between 0.10 and 0.20, with a long tail of customers who have a higher ratio.
For GROCERIESSAVINGS(R), the ratio of total expenditure on groceries to savings during the past 12 months is typically between 0 and 0.10, with a long tail of customers who have a higher ratio.
For GROCERIESDEBT(R), the ratio of total expenditure on groceries to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For GROCERIES(R), the ratio of expenditure on groceries over the past 6 months to expenditure on groceries over the past 12 months is typically around 0.5, with some customers below and some customers above that value.

Group 7 — Health

The variables HEALTH(T12), HEALTH(T6), HEALTHINCOME(R), HEALTHSAVINGS(R), and HEALTHDEBT(R) all appear to be positively (right) skewed, while the variable HEALTH(R) appears to be approximately normal. The means of the variables HEALTH(T12), HEALTH(T6), HEALTH(R), HEALTHINCOME(R), HEALTHSAVINGS(R), and HEALTHDEBT(R) are 5379.650000, 2657.589000, 0.465418, 0.052300, 0.024068, and 0.050076 respectively, while the standard deviations are 5316.800612, 3386.038271, 0.212398, 0.045199, 0.044281, and 0.141497 respectively. The distributions of the variables imply the following:

For HEALTH(T12), a customer’s total expenditure on healthcare over the past 12 months is typically between $0 and $10,000, with a long tail of customers who spend more on healthcare.
For HEALTH(T6), a customer’s total expenditure on healthcare over the past 6 months is typically between $0 and $5,000, with a long tail of customers who spend more on healthcare.

For HEALTHINCOME(R), the ratio of total expenditure on healthcare to income during the past 12 months is typically around 0.05, with a long tail of customers who have a higher ratio.
For HEALTHSAVINGS(R), the ratio of total expenditure on healthcare to savings during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For HEALTHDEBT(R), the ratio of total expenditure on healthcare to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For HEALTH(R), the ratio of expenditure on healthcare over the past 6 months to expenditure on healthcare over the past 12 months is typically between 0.4 and 0.6, with some customers below and some customers above that range.

Group 8 — Housing

The variables HOUSING(T12), HOUSING(T6), HOUSINGINCOME(R), HOUSINGSAVINGS(R), and HOUSINGDEBT(R) all appear to be positively (right) skewed, while the variable HOUSING(R) appears to be approximately normal. The means of the variables HOUSING(T12), HOUSING(T6), HOUSING(R), HOUSINGINCOME(R), HOUSINGSAVINGS(R), and HOUSINGDEBT(R) are 11146.711000, 5641.330000, 0.506100, 0.092608, 0.056445, and 0.073870 respectively, while the standard deviations are 17892.426895, 9055.316562, 0.000115, 0.110893, 0.140885, and 0.236318 respectively. The distributions of the variables imply the following:

For HOUSING(T12), a customer’s total expenditure on housing over the past 12 months is typically between $0 and $11,000, with a long tail of customers who spend more on housing.
For HOUSING(T6), a customer’s total expenditure on housing over the past 6 months is typically between $0 and $6,000, with a long tail of customers who spend more on housing.

For HOUSINGINCOME(R), the ratio of total expenditure on housing to income during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For HOUSINGSAVINGS(R), the ratio of total expenditure on housing to savings during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For HOUSINGDEBT(R), the ratio of total expenditure on housing to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For HOUSING(R), the ratio of expenditure on housing over the past 6 months to expenditure on housing over the past 12 months is typically between 0.506 and 0.50625, with some customers below and some customers above that range.

Group 9 — Tax

The variables TAX(T12), TAX(T6), TAXINCOME(R), TAXSAVINGS(R), and TAXDEBT(R) all appear to be positively (right) skewed, while the variable TAX(R) appears to be approximately normal. The means of the variables TAX(T12), TAX(T6), TAX(R), TAXINCOME(R), TAXSAVINGS(R), and TAXDEBT(R) are 4110.759000, 2072.056000, 0.502023, 0.025089, 0.014386, and 0.009917 respectively, while the standard deviations are 4642.793109, 2355.615151, 0.036988, 0.019101, 0.021287, and 0.022221 respectively. The distributions of the variables imply the following:

For TAX(T12), a customer’s total expenditure on taxes over the past 12 months is typically between $0 and $5,000, with a long tail of customers who spend more on taxes.
For TAX(T6), a customer’s total expenditure on taxes over the past 6 months is typically between $0 and $2,000, with a long tail of customers who spend more on taxes.

For TAXINCOME(R), the ratio of total expenditure on taxes to income during the past 12 months is typically between 0 and 0.03, with a long tail of customers who have a higher ratio.
For TAXSAVINGS(R), the ratio of total expenditure on taxes to savings during the past 12 months is typically between 0 and 0.015, with a long tail of customers who have a higher ratio.
For TAXDEBT(R), the ratio of total expenditure on taxes to debt during the past 12 months is typically close to 0, with a long tail of customers who have a higher ratio.
For TAX(R), the ratio of expenditure on taxes over the past 6 months to expenditure on taxes over the past 12 months is typically around 0.5, with some customers below and some customers above that value.

Group 10 — Travel

The variables TRAVEL(T12), TRAVEL(T6), TRAVELSAVINGS(R), and TRAVELDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable TRAVEL(R) seems to mostly resemble a normal distribution. Moreover, the distribution of the variable TRAVELINCOME(R) seems to be a bit unclear. The means of the variables TRAVEL(T12), TRAVEL(T6), TRAVEL(R), TRAVELINCOME(R), TRAVELSAVINGS(R), and TRAVELDEBT(R) are 31762.33000, 16675.370000, 0.488455, 0.282834, 0.331022, and 0.182340 respectively while the standard deviations are 35822.00011, 22305.675848, 0.179431, 0.196954, 0.787567, and 0.494932 respectively. The distributions of the variables imply the following:

In the distribution of the variable TRAVEL(T12), on average a customer’s total expenditure on travel over the past 12 months is approximately between $0 and $32,000 but with a long tail of customers who have a higher total expenditure on travel.
In the distribution of the variable TRAVEL(T6), on average a customer’s total expenditure on travel over the past 6 months is approximately between $0 and $20,000 but with a long tail of customers who have a higher total expenditure on travel.

In the distribution of the variable TRAVELSAVINGS(R), on average the ratio of total expenditure spent on travel to savings for a given customer during the past 12 months is approximately between 0 and 0.5 but with a long tail of customers who have a higher ratio of total expenditure spent on travel to savings.
In the distribution of the variable TRAVELDEBT(R), on average the ratio of total expenditure spent on travel to debt for a given customer during the past 12 months is approximately between 0 and 0.2 but with a long tail of customers who have a higher ratio of total expenditure spent on travel to debt.
In the distribution of the variable TRAVEL(R), on average a customer has a ratio of travel expenditure over the past 6 months to travel expenditure over the past 12 months (expenditure spent on travel over the past 6 months/expenditure spent on travel over the past 12 months) of approximately between 0.4 and 0.6, with some customers below that range and some customers above it.
In the distribution of the variable TRAVELINCOME(R), the distribution seems to be mostly normal albeit with a significant number of outliers; specifically, a significant number of customers have a ratio of total expenditure spent on travel to income of 0 during the past 12 months. Starting from a ratio of total expenditure spent on travel to income of 0.2, most customers have a ratio that is approximately 0.5, with some customers below that value and some customers above it.

Group 11 — Utilities

The variables UTILITIES(T12), UTILITIES(T6), UTILITIESSAVINGS(R), and UTILITIESDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable UTILITIES(R) seems to mostly resemble a normal distribution. Moreover, the distribution of the variable UTILITIESINCOME(R) seems to be a bit unclear. The means of the variables UTILITIES(T12), UTILITIES(T6), UTILITIES(R), UTILITIESINCOME(R), UTILITIESSAVINGS(R), and UTILITIESDEBT(R) are 6755.765000, 3394.665000, 0.502487, 0.054655, 0.033229, and 0.036638 respectively while the standard deviations are 6313.708805, 3172.759148, 0.001859, 0.025812, 0.052815, and 0.080381 respectively. The distributions of the variables imply the following:

In the distribution of the variable UTILITIES(T12), on average a customer’s total expenditure on utilities over the past 12 months is approximately between $0 and $7,000 but with a long tail of customers who have a higher total expenditure on utilities.
In the distribution of the variable UTILITIES(T6), on average a customer’s total expenditure on utilities over the past 6 months is approximately between $0 and $4,000 but with a long tail of customers who have a higher total expenditure on utilities.

In the distribution of the variable UTILITIESSAVINGS(R), on average the ratio of total expenditure spent on utilities to savings for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on utilities to savings.
In the distribution of the variable UTILITIESDEBT(R), on average the ratio of total expenditure spent on utilities to debt for a given customer during the past 12 months is approximately 0 but with a long tail of customers who have a higher ratio of total expenditure spent on utilities to debt.
In the distribution of the variable UTILITIES(R), on average a customer has a ratio of utilities expenditure over the past 6 months to utilities expenditure over the past 12 months (expenditure spent on utilities over the past 6 months/expenditure spent on utilities over the past 12 months) of approximately between 0.5 and 0.505, with some customers below that range and some customers above it.
In the distribution of the variable UTILITIESINCOME(R), the distribution seems to be mostly normal albeit with a significant number of outliers; specifically, a significant number of customers have a ratio of total expenditure spent on utilities to income of 0 during the past 12 months. Starting from a ratio of total expenditure spent on utilities to income of 0.02, most customers have a ratio that is approximately between 0.04 and 0.06, with some customers below that range and some customers above it.
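When a ratio has a visible spike at 0, as described for TRAVELINCOME(R) and UTILITIESINCOME(R), it can help to report the share of zeros separately and then summarize only the non-zero part. The following is a minimal sketch on synthetic data; the 15% zero share, the 0.05 center, and the column label are all assumptions made for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Assumed shape: a bell-shaped bulk near 0.05, floored at 0.02 ...
ratio = rng.normal(loc=0.05, scale=0.01, size=1000).clip(min=0.02)
# ... plus an assumed block of customers with a ratio of exactly 0.
ratio[:150] = 0.0

s = pd.Series(ratio, name="UTILITIESINCOME(R)")

zero_share = (s == 0).mean()  # fraction of customers with a ratio of 0
nonzero = s[s > 0]            # the bulk of the distribution

print(f"share of zeros: {zero_share:.3f}")
print(nonzero.describe())
```

Splitting the series this way keeps the zeros from pulling the mean down and makes the shape of the remaining values easier to judge.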

Total Expenditure

The variables EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITUREINCOME(R), EXPENDITURESAVINGS(R), and EXPENDITUREDEBT(R) all seem to mostly resemble a positively (right) skewed distribution while the variable EXPENDITURE(R) mostly seems to resemble a normal distribution. The means of the variables EXPENDITURE(T12), EXPENDITURE(T6), EXPENDITURE(R), EXPENDITUREINCOME(R), EXPENDITURESAVINGS(R), and EXPENDITUREDEBT(R) are 104330.824000, 54247.771000, 0.512560, 0.943607, 0.913340, and 0.605276 respectively while the standard deviations are 89250.193047, 49853.939283, 0.079740, 0.168989, 1.625278, and 1.299382 respectively. The distributions of the variables imply the following:

In the distribution of the variable EXPENDITURE(T12), on average a customer’s combined expenditure across all eleven categories (Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Taxes, Fines) over the past 12 months is approximately between $0 and $100,000 but with a long tail of customers who have a higher combined expenditure across all eleven categories.
In the distribution of the variable EXPENDITURE(T6), on average a customer’s combined expenditure across all eleven categories (Groceries, Clothing, Housing, Education, Health, Travel, Entertainment, Gambling, Utilities, Taxes, Fines) over the past 6 months is approximately between $0 and $50,000 but with a long tail of customers who have a higher combined expenditure across all eleven categories.

In the distribution of the variable EXPENDITURE(R), on average the ratio of a customer’s combined expenditure over the past 6 months to their combined expenditure over the past 12 months (combined expenditure over the past 6 months/combined expenditure over the past 12 months) is approximately 0.5, with some customers below that ratio and some customers above it.
In the distribution of EXPENDITUREINCOME(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their income is approximately 1 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to income.
In the distribution of EXPENDITURESAVINGS(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their savings is approximately between 0 and 1 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to savings.
In the distribution of EXPENDITUREDEBT(R), on average the ratio of a customer’s combined expenditure over the past 12 months to their debt is approximately between 0 and 0.5 but with a long tail of customers who have a higher ratio of combined expenditure over the past 12 months to debt.

Categorical Variables:

The variable CATGAMBLING seems to mostly resemble a normal distribution while for the variables CATDEBT, CATCREDITCARD, CATMORTGAGE, CATSAVINGSACCOUNT, and CATDEPENDENTS the distributions seem to be a bit unclear. The means of the variables CATGAMBLING, CATDEBT, CATCREDITCARD, CATMORTGAGE, CATSAVINGSACCOUNT, and CATDEPENDENTS are 0.852000, 0.944000, 0.236000, 0.173000, 0.993000, and 0.173000 respectively while the standard deviations are 0.598711, 0.230037, 0.424835, 0.378437, 0.083414, and 0.378437 respectively. The distributions of the variables imply the following:

In the distribution of the variable CATGAMBLING, the majority of customers fall under category 1 (don’t participate in gambling) while some customers fall under gambling category 0 (High, frequently participate in gambling) as well as gambling category 2 (Low, occasionally participate in gambling).
In the distribution of the variable CATDEBT, the majority of customers have debt while very few customers have no debt.
In the distribution of the variable CATCREDITCARD, the majority of customers don’t have a credit card while few customers do have one.

In the distribution of the variable CATMORTGAGE, the majority of customers don’t have a mortgage while few customers do have one.
In the distribution of the variable CATSAVINGSACCOUNT, almost all customers have a savings account of some kind.
In the distribution of the variable CATDEPENDENTS, the majority of customers have no dependents while few customers have at least one dependent.
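For the 0/1 variables, the mean is simply the share of customers coded 1, which is why a mean of 0.944 for CATDEBT reads as "most customers have debt". A frequency table makes this explicit. The sketch below uses a small toy frame (an assumption, not the Kaggle data), with column labels following the names used in this write-up.

```python
import pandas as pd

# Toy data (assumption); coding follows the dataset description, 1 = has the attribute.
df = pd.DataFrame({
    "CATDEBT": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "CATMORTGAGE": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Share of customers in each category, per variable.
for col in df.columns:
    shares = df[col].value_counts(normalize=True).sort_index()
    print(col, shares.to_dict())

# For a 0/1 column, the mean equals the share of ones.
debt_share = df["CATDEBT"].mean()
print(debt_share)
```

Category shares are usually easier to interpret for these variables than the mean and standard deviation alone.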

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the dependent (response) variable CREDITSCORE, which is a numerical target variable representing the customer’s credit score (integer). The correlation matrices will also be crucial in determining which variables of interest best predict the customer’s credit score. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It’s also worth noting that variables with a correlation either greater than 0.3 or less than -0.3 are treated as suitable variables for predicting the customer’s credit score, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship. While using the correlation values of the independent variables is certainly not a hard and fast rule for choosing the independent variables that best predict the customer’s credit score, correlation values serve as a guideline for choosing suitable and appropriate predictor variables.
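The 0.3 rule of thumb can be sketched in a few lines of pandas: compute each variable’s Pearson correlation with CREDITSCORE and keep those whose absolute value exceeds the threshold. The data below is synthetic and the NOISE column is hypothetical; only the threshold and the idea of screening by correlation come from the text above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in data (assumption): one informative and one uninformative feature.
debt_income = rng.uniform(0, 10, n)
noise_feature = rng.normal(size=n)
credit_score = 700 - 20 * debt_income + rng.normal(0, 30, n)

df = pd.DataFrame({
    "CREDITSCORE": credit_score,
    "DEBTINCOME(R)": debt_income,
    "NOISE": noise_feature,  # hypothetical column with no real relationship
})

# Pearson correlation of every feature with the target.
corr = df.corr()["CREDITSCORE"].drop("CREDITSCORE")

# Keep features whose correlation is above 0.3 or below -0.3.
selected = corr[corr.abs() > 0.3].index.tolist()

print(corr.round(3))
print(selected)
```

Pearson correlation only captures linear association, so a variable that fails this screen can still be useful through a nonlinear relationship or in combination with other variables.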

Predictor Variables — Income, Savings, Debt, Ratio of Savings to Income, Ratio of Debt to Income, Ratio of Debt to Savings

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘INCOME’, ‘SAVINGS’, ‘DEBT’, ‘SAVINGSINCOME(R)’, ‘DEBTINCOME(R)’, ‘DEBTSAVINGS(R)’. Based on the correlation values, there seems to be a moderate negative relationship between the dependent variable and the independent variable DEBT, a very strong negative relationship between the dependent variable and the independent variable DEBTINCOME(R), and a strong negative relationship between the dependent variable and the independent variable DEBTSAVINGS(R).

Taking a closer look at the remaining predictor variables, the correlation between the dependent variable CREDITSCORE and the independent variable SAVINGSINCOME(R) seems to indicate a weak positive relationship while the correlation between the dependent variable and the independent variable INCOME seems to suggest a negligible relationship. Likewise, the correlation between the dependent variable and the independent variable SAVINGS seems to suggest a negligible relationship as well.

In conclusion, based on the correlation values, the variables DEBT, DEBTINCOME(R), and DEBTSAVINGS(R) are good predictor variables of the dependent variable CREDITSCORE while the variables INCOME, SAVINGS, and SAVINGSINCOME(R) aren’t good predictor variables of the dependent variable.
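The verbal labels used in this section (negligible, weak, moderate, strong, very strong) can be made reproducible with a small helper. Only the 0.3 boundary for "moderate" is stated in this write-up; the other cutoffs in the sketch below (0.1, 0.5, and 0.7) are assumptions chosen for illustration, and `describe_strength` is a hypothetical helper name.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to a verbal label.

    Only the 0.3 cutoff comes from the text; 0.1, 0.5, and 0.7 are assumed.
    """
    a = abs(r)
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        label = "weak"
    elif a < 0.5:
        label = "moderate"
    elif a < 0.7:
        label = "strong"
    else:
        label = "very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


# Illustrative values only, not correlations from the dataset.
for value in (-0.8, -0.35, 0.15, 0.02):
    print(value, describe_strength(value))
```

Fixing the cutoffs in one place keeps the labels consistent across all transaction groups.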

Predictor Variables — Transaction Groups

Group 1 — Clothing

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘CLOTHING(T12)’, ‘CLOTHING(T6)’, ‘CLOTHING(R)’, ‘CLOTHINGINCOME(R)’, ‘CLOTHINGSAVINGS(R)’, ‘CLOTHINGDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable CLOTHINGDEBT(R) indicates a weak positive relationship while the correlation values suggest a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: CLOTHING(T12), CLOTHING(T6), CLOTHING(R), CLOTHINGINCOME(R), and CLOTHINGSAVINGS(R).

In conclusion, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 2 — Education

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, ‘EDUCATIONINCOME(R)’, ‘EDUCATIONSAVINGS(R)’, ‘EDUCATIONDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable EDUCATIONINCOME(R) indicates a moderate negative relationship while the correlation between the dependent variable and the independent variable EDUCATIONSAVINGS(R) indicates a weak negative relationship.

On the other hand, the correlation values indicate a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: ‘EDUCATION(T12)’, ‘EDUCATION(T6)’, ‘EDUCATION(R)’, and ‘EDUCATIONDEBT(R)’.

In conclusion, based on the correlation values, the variable EDUCATIONINCOME(R) is a good predictor variable of the dependent variable CREDITSCORE while the variables EDUCATIONSAVINGS(R), EDUCATION(T12), EDUCATION(T6), EDUCATION(R), and EDUCATIONDEBT(R) aren’t good predictor variables of the dependent variable.

Group 3 — Entertainment

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘ENTERTAINMENT(T12)’, ‘ENTERTAINMENT(T6)’, ‘ENTERTAINMENT(R)’, ‘ENTERTAINMENTINCOME(R)’, ‘ENTERTAINMENTSAVINGS(R)’, ‘ENTERTAINMENTDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable ENTERTAINMENTDEBT(R) indicates a weak positive relationship.

On the other hand, the correlation values indicate a negligible relationship between the dependent variable CREDITSCORE and the following independent variables: ENTERTAINMENT(T12), ENTERTAINMENT(T6), ENTERTAINMENT(R), ENTERTAINMENTINCOME(R), and ENTERTAINMENTSAVINGS(R).

In conclusion, based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable.

Group 4 — Fines

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘FINES(T12)’, ‘FINES(T6)’, ‘FINES(R)’, ‘FINESINCOME(R)’, ‘FINESSAVINGS(R)’, ‘FINESDEBT(R)’. Based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable. In particular, the correlation values indicate a negligible relationship between the dependent variable and each of the independent variables.

Group 5 — Gambling

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘GAMBLING(T12)’, ‘GAMBLING(T6)’, ‘GAMBLING(R)’, ‘GAMBLINGINCOME(R)’, ‘GAMBLINGSAVINGS(R)’, ‘GAMBLINGDEBT(R)’. Based on the correlation values, none of the variables seem to be good predictor variables of the dependent variable. In particular, the correlation values indicate a negligible relationship between the dependent variable and each of the independent variables.

Group 6 — Groceries

The correlation between the dependent variable CREDITSCORE and the following independent variables was determined: ‘GROCERIES(T12)’, ‘GROCERIES(T6)’, ‘GROCERIES(R)’, ‘GROCERIESINCOME(R)’, ‘GROCERIESSAVINGS(R)’, ‘GROCERIESDEBT(R)’. In particular, the correlation between the dependent variable and the independent variable GROCERIESDEBT(R) indicates a weak positive relationship while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: GROCERIES(T12), GROCERIES(T6), GROCERIES(R), GROCERIESINCOME(R), and GROCERIESSAVINGS(R).