Credit Card Defaulters — Exploratory Data Analysis and Frequent Pattern Mining

Khawaja Junaid
25 min read · May 13, 2022


Defaulter/non-Defaulter Dataset
A dataset of credit card holders labeled to identify them as defaulters/non-defaulters (loan paid or not).

Crushing loans for defaulters

Are there certain patterns or features in the data that can help a bank scrutinize who it offers loans to?

The Dataset

Before we start to work on the data formally, it is essential to know what our data consists of and what the attributes in the data represent. If we lack knowledge about what our attributes mean and what our data represents, we can make erroneous assumptions and, ultimately, produce an incorrect analysis. The attributes in the data set consisted of:

  • TARGET
    people labeled 1 — Defaulters
    people labeled 0 — Non-Defaulters
  • CONTRACT_TYPE
    Identification if the loan is cash or revolving
  • GENDER
    Male
    Female
  • FLAG_OWN_CAR
    Flag if the client owns a car
  • FLAG_OWN_REALTY
    Flag if the client owns a house or a flat
  • CNT_CHILDREN
    Number of children the client has
  • INCOME_TOTAL
    Income of the client
  • EDUCATION_TYPE
    Level of highest education the client has achieved
  • FAMILY_STATUS
    Family status of the client
  • HOUSING_TYPE
    (What is the housing situation of the client)
    Rent
    Living with Parents
  • OCCUPATION_TYPE
    What kind of occupation does the client have
  • FAM_MEMBERS
    How many family members does the client have
  • REGION_RATING_CLIENT
    Rating of the region where the client lives
  • ORGANIZATION_TYPE
    Type of organization where the client works

Exploratory Data Analysis

After we have had a good look at the data and understand it, the next step is to get our hands dirty and delve deeper into the data itself. Exploratory data analysis is a crucial task for any data mining or data science project. EDA consists of preprocessing, visualizations, and correlation analysis; in essence, exploring the data to find meaningful information in the raw data itself.

Preprocessing

The preprocessing phase is where the data is enhanced and prepared for the later stages, where, for example, various algorithms can be run on the data and results can be drawn from it.

Identifying and Removing duplicate data

Duplicate values in the data may arise from errors while importing or exporting the data or even at the time when data entries are being made.

We dealt with the duplicate values by dropping all the duplicates and keeping 1 instance — the original data value — in the dataset.

Using the .duplicated() function, we get a boolean Series that marks each row as a duplicate (True) or not (False). Counting its values shows how many duplicate rows are present in the original data frame.

bool_series = df.duplicated()
pd.DataFrame(bool_series.value_counts()).rename(columns={0: "Count"})

The duplicate values were then dropped using the following function.

# keep the first occurrence of each duplicated row and drop the rest
df.drop_duplicates(keep='first', inplace=True)

Dealing with Missing Values

Raw data can often come with missing values. Missing values need to be dealt with, as they can reduce the statistical power of the analysis, cause the loss of important information, or even introduce bias.

Missing values may often hold important information, more than some of us think. Missing values can be further divided into different categories:

  • Missing Completely at Random — MCAR
  • Missing at Random — MAR
  • Missing not at Random — MNAR

For the purpose of the project we assumed that all of the values in the data were MCAR.

Missing values can be dropped completely; however, this contributes further to the loss of information. For example, dropping an entire row because there was a missing value in 1 out of 15 attributes means we also lose the data in the other 14 attributes that contained values.

We first of all identified missing values that were present in the entire data frame — using a boolean frame of our data.

# boolean frame: True where a value is missing
missingValues = df.isnull()

From the boolean frame we extracted the names of the attributes that contained missing values.

The following attributes were found to contain missing values.

The missing values were replaced by filling them with the mode (the most frequent value) of each affected column.

Because we have a large amount of data and the number of missing values is comparatively small, filling them with the mode makes sense, as it introduces little additional variance.
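
The fill step itself can be sketched as follows; this is a minimal sketch that assumes df is the working data frame and loops over whichever columns actually contain missing values:

# mode-fill sketch: fill each column that has missing values with that
# column's most frequent value (its mode)
for col in df.columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])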

Dealing with Corrupted Values

When going through the unique values for each column, we found that the gender column has null values with a different representation, so we had to deal with that case separately. Null values, usually represented by NA, were shown as XNA in the CODE_GENDER column.

df['CODE_GENDER'] = df['CODE_GENDER'].replace(['XNA'], df['CODE_GENDER'].mode()[0])

Data Visualizations

Numerical Variables Summary Stats

df.describe()

Summary Stats for Defaulters

df_defaulter = df[df['TARGET'] == 1]
df_defaulter[['CNT_CHILDREN', 'INCOME_TOTAL', 'CREDIT', 'FAM_MEMBERS', 'REGION_RATING_CLIENT']].describe()

Summary Stats for Non-Defaulters

df_nondefaulter = df[df['TARGET'] == 0]
df_nondefaulter[['CNT_CHILDREN', 'INCOME_TOTAL', 'CREDIT', 'FAM_MEMBERS', 'REGION_RATING_CLIENT']].describe()

Basic Visualizations

Most of the data visualizations consisted of viewing the empirical distributions within the data. We used bar graphs to represent the counts of the various categories and to identify which categories in the dataset followed a similar trend.
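
As an illustration of the kind of bar graph used in this part, the sketch below counts the target variable; it assumes df is the preprocessed data frame:

import matplotlib.pyplot as plt
import seaborn as sns

# bar graph of category counts for the target variable
# (0 = non-defaulter, 1 = defaulter)
sns.countplot(x='TARGET', data=df)
plt.title('Defaulters vs Non-Defaulters')
plt.show()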

Target Variable — Defaulter/Non Defaulter

Gender

As shown by the results, there is a gender imbalance in the data: the number of females in the dataset is approximately double the number of males.

Family Status

Contract Type

As seen in the graph above, there are two types of loan facilities: cash loans and revolving loans.

  • A revolving loan can be defined as: A committed loan facility allowing a borrower to borrow (up to a limit), repay and re-borrow loans.
  • A cash basis loan is one in which interest is recorded as earned when payment is collected.

We see that there is a significant difference between the count of the types of loans.

Cars

Realty

Cars Vs Realty

Using the data frames displayed above for realty owned and cars owned, we see that about 200,000 people own a house or a flat, but of those 200,000 only about 100,000 also own a car, i.e. only 50% of the people who own a house also own a car.
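
A rough sketch of how these counts can be read off the data, assuming the ownership flags are still stored as 'Y'/'N' strings at this point:

# house/flat owners, and how many of them also own a car
owns_realty = df[df['FLAG_OWN_REALTY'] == 'Y']
print(len(owns_realty))                            # roughly 200,000 realty owners
print((owns_realty['FLAG_OWN_CAR'] == 'Y').sum())  # roughly half of them own a car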

Type of Loan

A Simple Box Plot

The summary stats above can be better visualized with box plots for the two categories, further split by gender. The box plots also help us identify possible outliers in the data.

There were extreme outliers for the defaulter category which are not being shown in the plot

The plot above shows that Target = 1 (defaulters) have a lower mean income compared to the non-defaulters, which fits with them being unable to pay back their loans. It can also be seen that female defaulters have a lower mean income in both categories.

For the other attributes we see that the non-defaulters:

  1. Have a lower region rating, which means the defaulters live in higher-rated regions where the cost of living would also be higher.
  2. Have a lower mean count of children according to the summary statistics, meaning less money has to be spent on other family members.

Advanced Visualizations

In the initial portion of the EDA we used a simple set of graphs to see the structure and composition of our data. In the part below, we use visualizations that showcase more meaningful analysis and relations in the dataset.

Target Grouped by Gender

From the graph above we can see that the number of female defaulters is very large compared to the number of male defaulters. One major reason for this is that there are twice as many females as males in the original dataset; a gender imbalance exists in the data. One way to deal with this would be to randomly select an equal number of males and females from the dataset and then perform the analysis on that sample.

Target Based on Credit Loan Amount

The Credit Loan Amount attribute is a continuous numerical variable. To visualize the counts, we decided to split the data based on the mean value of the attribute, which gave us two classes:

  1. Data with Credit_Amount less than the mean credit amount
  2. Data with Credit_Amount more than the mean credit amount

Having split the data we can now make two plots.

  1. The people who are defaulters/non-defaulters who have credit amount less than the mean credit amount
  2. The people who are defaulters/non-defaulters who have credit amount more than the mean credit amount

less = targetCredit[targetCredit['CREDIT'] < meanAmountCreditLoan]

more = targetCredit[targetCredit['CREDIT'] >= meanAmountCreditLoan]

The data was split into two parts based on whether the credit loan amount was less than or greater than the mean credit amount. The mean was just one of the metrics that could have been used as the pivot for the split. The data could also have been plotted based on the mode amount; however, since the Credit variable is continuous rather than discrete, using the mode would have meant a very large number of bar plots.
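
The two count plots described above can be produced along the following lines; this is a sketch that assumes targetCredit also carries the TARGET column:

import matplotlib.pyplot as plt
import seaborn as sns

# one count plot per credit group: below the mean vs at or above the mean
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(x='TARGET', data=less, ax=axes[0])
axes[0].set_title('Credit amount below the mean')
sns.countplot(x='TARGET', data=more, ax=axes[1])
axes[1].set_title('Credit amount at or above the mean')
plt.show()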

From the above results:

  • For credit loan < 604,781: number of defaulters = 16,119
  • For credit loan >= 604,781: number of defaulters = 8,554

It is surprising to note that, for the larger credit loan amounts, the number of defaulters dropped to approximately half of those who took out a smaller loan. However, it still cannot be concluded that people with higher credit loans are less likely to become defaulters.

Target Based on Occupation Type

From the graph above we can see that Laborers are the most frequent occupation type in the data set and IT staff the least common. Accordingly, laborers account for the largest share of defaulters in the entire data set. One reason laborers appear so often among defaulters may be that they generally earn less than other professions. The code below compares the mean income of the Laborers with that of the rest of the data set, excluding the Laborers. It is also important to note that there is no profession in the data set that does not appear in the defaulters list.
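
A sketch of that comparison, assuming OCCUPATION_TYPE still holds the raw occupation strings and INCOME_TOTAL is numeric:

# mean income of Laborers vs the rest of the data set
laborers = df[df['OCCUPATION_TYPE'] == 'Laborers']
others = df[df['OCCUPATION_TYPE'] != 'Laborers']
print('Mean income, Laborers:', laborers['INCOME_TOTAL'].mean())
print('Mean income, everyone else:', others['INCOME_TOTAL'].mean())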

Target Grouped by Loan Type

From the data:

  • The percentage of cash loans ≈ 90.7%
  • The percentage of revolving loans ≈ 9.3%
  • The percentage of defaulters among cash loans ≈ 8.7%
  • The percentage of defaulters among revolving loans ≈ 5.8%

From the figures above we can see that there is a greater share of defaulters among clients who take cash loans than among those who take revolving loans. It is also important to note that a significantly higher percentage of clients opted for cash loans instead of revolving loans.
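
A sketch of how these shares can be computed, assuming NAME_CONTRACT_TYPE still holds the raw loan-type strings and TARGET is the 0/1 label:

# share of each loan type, and the defaulter rate within each loan type
contract_share = df['NAME_CONTRACT_TYPE'].value_counts(normalize=True)
default_rate = df.groupby('NAME_CONTRACT_TYPE')['TARGET'].mean()
print(contract_share)
print(default_rate)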

Encoding the Categorical Data

Creating Mappings

We created mappings to translate the categorical string variables to assigned numeric labels. A few of those mappings are shown below.

map_contract = {'Cash loans': 0, 'Revolving loans': 1}
map_gender = {'F': 0, 'M': 1}
map_car = {'N': 0, 'Y': 1}
map_realty = {'N': 0, 'Y': 1}

Mapping Labels onto the data

try:
    copy.NAME_CONTRACT_TYPE = copy.NAME_CONTRACT_TYPE.map(map_contract)
    copy.CODE_GENDER = copy.CODE_GENDER.map(map_gender)
    copy.FLAG_OWN_CAR = copy.FLAG_OWN_CAR.map(map_car)
    copy.FLAG_OWN_REALTY = copy.FLAG_OWN_REALTY.map(map_realty)
    copy.NAME_INCOME_TYPE = copy.NAME_INCOME_TYPE.map(map_income)
    copy.EDUCATION_TYPE = copy.EDUCATION_TYPE.map(map_education)
    copy.FAMILY_STATUS = copy.FAMILY_STATUS.map(map_family_status)
    copy.HOUSING_TYPE = copy.HOUSING_TYPE.map(map_housing_type)
    copy.OCCUPATION_TYPE = copy.OCCUPATION_TYPE.map(map_occupation_type)
    copy.ORGANIZATION_TYPE = copy.ORGANIZATION_TYPE.map(map_organization_type)
except:
    pass

Attribute Dependency

We used the chi-square test to assess attribute dependency. The results of which are given below.

from scipy.stats import chi2_contingency

for category in categories:
    grp1 = list(copy[category])
    grp2 = list(copy['CREDIT'])

    # Assumption (H0) is that there is no relationship between the variables
    s = [grp1, grp2]
    stat, p, dof, expected = chi2_contingency(s)

    # interpret the p-value
    alpha = 0.05
    print("p value is " + str(p))
    if p <= alpha:
        print('CREDIT and {} are Dependent (reject H0)'.format(category))
    else:
        print('CREDIT and {} are Independent (H0 holds true)'.format(category))
    print('')

for category in categories:
    grp1 = list(copy[category])
    grp2 = list(copy['INCOME_TOTAL'])

    # Assumption (H0) is that there is no relationship between the variables
    s = [grp1, grp2]
    stat, p, dof, expected = chi2_contingency(s)

    # interpret the p-value
    alpha = 0.05
    print("p value is ", p)
    if p <= alpha:
        print('INCOME_TOTAL and {} are Dependent (reject H0)'.format(category))
    else:
        print('INCOME_TOTAL and {} are Independent (H0 holds true)'.format(category))
    print('')

Results of Chi-Square Test

Using the chi-square test, we can check for dependency between attributes. The results above show whether there is a dependency between the categorical variables and the numerical variables (INCOME_TOTAL, CREDIT). Since the p-values we obtained for most of the attributes are very small, smaller even than 0.05, these attributes are dependent (we reject the null hypothesis, which states that there is no relation between the attributes).

The Correlation and Covariance Matrix

import seaborn as sns

corr = df.corr()

ax = sns.heatmap(corr, annot=True, cmap="YlGnBu", linewidths=1)

The correlation matrix highlights the variables that change along with the target variable, i.e. where a correlation exists. For example:

  1. A positive but weak correlation with the region client rating: people living in higher-rated areas default more often, perhaps because of the higher expenses there.
  2. A positive but weak correlation with the children count: more children suggests more places the money has to go instead of paying back the loan.
  3. A negative correlation with income total and credit, which suggests that the lower the income, the more the target leans towards 1, i.e. defaulter.
  4. A negative correlation with education, which means more highly educated people are less likely to be defaulters.

cov = copy.cov()

The covariance matrix can be used to determine the direction of the (linear) relationship between two variables/attributes.

  1. If there is a direct relationship between the variables, they increase or decrease together, and their coefficient is positive.
  2. If the relationship between them is inverse, i.e. one variable decreases while the other increases, then the coefficient is negative.

As seen from the covariance matrix above, each entry in the table is the coefficient for the relationship between a pair of attributes (xi, xj). A positive value means the two attributes tend to move together; a negative value means one tends to increase while the other decreases.
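
As a toy illustration of this sign convention (synthetic numbers, not the project data):

import numpy as np

# two variables that rise together give a positive covariance;
# two that move in opposite directions give a negative one
x = np.array([1, 2, 3, 4, 5])
print(np.cov(x, 2 * x)[0, 1])   # positive
print(np.cov(x, -x)[0, 1])      # negative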

Frequent Pattern Mining

“Mining” patterns from the data

Feature Selection and Engineering

From the results of phase 1, we see that not all the variables have a strong correlation with the target variable. The strength of the relation can be termed as strong, weak, or non-existent. To deal with variables that have a weak or no correlation with the target variable, we can drop some of the columns that we deem do not have any impact on the target variable based on our findings from the Exploratory Data Analysis of the dataset.

However, since we are looking for frequent patterns in the dataset, we might eventually end up losing meaningful data, because there is a possibility that although attributes may apparently not be related to the target variable, some of those attributes may be working together to have some kind of relationship with the target variable.

Encoding Numerical Data

For us to run frequent pattern analysis algorithms such as FP Growth and Apriori we need to drop the Continuous Numerical variables as they can not form a part of the frequent item sets.

From our dataset, the attributes that were Numerical-Continuous were INCOME_TOTAL and CREDIT. We can choose to drop the numerical columns using the code below:

copy.drop(['INCOME_TOTAL', 'CREDIT'], axis=1, inplace=True)

Another approach is to encode them. We have two columns in our data that are continuous numerical variables; to feed these columns into the algorithm we need to encode them, and encoding every distinct value of a continuous variable is practically impossible. A solution to this conundrum is to use a central-tendency value such as the mean of the attribute, i.e. assign 0 to the values below the mean and assign 1 to the values greater than or equal to the mean.

meanIncomeTotal = copy['INCOME_TOTAL'].mean()
copy['INCOME_TOTAL'] = copy['INCOME_TOTAL'].apply(lambda x: 0 if x < meanIncomeTotal else 1)
meanCredit = copy['CREDIT'].mean()
copy['CREDIT'] = copy['CREDIT'].apply(lambda x: 0 if x < meanCredit else 1)

One Hot Encoding

For the dataset to be run through the FPA algorithms, we need to encode the data.

For encoding we used One-Hot-Encoding. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encode = pd.DataFrame(encoder.fit_transform(copy).toarray())
encode.columns = encoder.get_feature_names_out()

FP Growth Algorithm Implementation

Choice of Algorithm

For the frequent pattern analysis we chose the FP-Growth algorithm, as it is faster and more efficient than the alternatives: it does not require candidate generation like the Apriori algorithm, which, given the large amount of data, would crash the notebook at runtime and was therefore unusable for us.

Separating defaulters and non-defaulters data

defaulters_df = encode[encode['TARGET_1'] == 1]
nondefaulters_df = encode[encode['TARGET_0'] == 1]

Running the FP Growth Algorithm

from mlxtend.frequent_patterns import fpgrowth

fpGrowth_defaulters = fpgrowth(defaulters_df, min_support=0.2, use_colnames=True)

fpGrowth_nondefaulters = fpgrowth(nondefaulters_df, min_support=0.2, use_colnames=True)

Appending the Length of the itemset column to the results

fpGrowth_defaulters['Length'] = fpGrowth_defaulters['itemsets'].apply(lambda x: len(x))
fpGrowth_nondefaulters['Length'] = fpGrowth_nondefaulters['itemsets'].apply(lambda x: len(x))

Results of Frequent Pattern Mining

The FP-Growth algorithm gave us results that further confirmed our findings from the previous phases. We used a minimum support of 0.2 to get meaningful frequent itemsets for our targets, where Target_0 means the person is not a defaulter and Target_1 means the person is a defaulter.

Defaulters:

The frequent items mined for our defaulter led us to insights that could help classify certain individuals at the time they demand a loan, the pointers the bank could use can be taken from the frequent items mined.

{'CNT_CHILDREN_0', 'FAMILY_STATUS_0', 'NAME_CONTRACT_TYPE_0', 'TARGET_1', 'EDUCATION_TYPE_1', 'FAM_MEMBERS_2.0', 'HOUSING_TYPE_0'}

Our findings for the longest itemset mined show that the typical defaulter has a low level of education, lives in rented housing, has few family members, and contracts cash loans instead of revolving loans.

The association rules further confirm our association for defaulters, 'NAME_CONTRACT_TYPE_0' (cash loans) → 'TARGET_1' (defaulter): this rule suggests that a person taking a cash loan is very likely to end up among the defaulters; it is a strong rule with a confidence value of 1.

Non-defaulters:

The frequent patterns mined for non-defaulters showed some differentiating items in the itemset:

{'CNT_CHILDREN_0', 'FAMILY_STATUS_0', 'REGION_RATING_CLIENT_2', 'NAME_CONTRACT_TYPE_0', 'TARGET_0', 'FAM_MEMBERS_2.0', 'HOUSING_TYPE_0'}

We see that for the highest-support, longest itemsets, the education level is higher and the region rating is better, meaning these people live in a better region and have a decently sized family.

The association rules for non defaulters show that people with cash loans and living on rent are non defaulters.

Both sets of mined patterns show that there are some distinctive features the bank can ask about, including housing type, regional rating, education level, and the contract the client wishes to pursue. Certain combinations of these choices can then indicate whether the person could be a potential defaulter, and more scrutiny can be applied to ensure the loan is not given until the applicant meets certain criteria.

Association Rules

from mlxtend.frequent_patterns import association_rules

association_rules(fpGrowth_defaulters, metric="confidence", min_threshold=0.7)

association_rules(fpGrowth_nondefaulters, metric="confidence", min_threshold=0.7)

Antecedents and Consequences

An association rule has two parts:

  1. an antecedent (if)
  2. a consequent (then)

An antecedent is an item found within the data.

A consequent is an item found in combination with the antecedent.

  1. The support tells us the fraction of records in which the items of a rule occur together.
  2. The confidence tells us how often the rule holds, i.e. how often the consequent appears when the antecedent is present.
  3. The lift tells us the strength of the association relative to what would be expected if the items were independent.

Based on the above stated metrics we can choose which rules to keep and which rules to discard. The rules can then be used to analyze what features were more prevalent in the people who were found to be defaulters and based on this data we can classify data as defaulter/non-defaulter. Other metrics such as support and confidence can also be used to test the strength of these associations.

Using these sets of association rules we can determine, for example, that if a person is positive for attribute_x, then they are likely to be identified as a defaulter.
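
For reference, the three metrics can be written down directly over the one-hot encoded frame. The helper below is only a sketch (mlxtend computes these internally); antecedent and consequent are assumed to be lists of column names from encode:

# rough definitions of support, confidence and lift for a rule A -> C
# over a one-hot encoded frame whose columns hold 0/1 values
def rule_metrics(encode, antecedent, consequent):
    support_a = encode[antecedent].all(axis=1).mean()                # P(A)
    support_ac = encode[antecedent + consequent].all(axis=1).mean()  # P(A and C)
    support_c = encode[consequent].all(axis=1).mean()                # P(C)
    confidence = support_ac / support_a                              # P(C | A)
    lift = confidence / support_c                                    # strength vs independence
    return support_ac, confidence, lift

# e.g. rule_metrics(encode, ['NAME_CONTRACT_TYPE_0'], ['TARGET_1'])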

Clustering

Clusters in Data

The Elbow Method

The elbow method is a heuristic used in determining the number of clusters in a data set. We used this technique to identify the optimal number of clusters for our cluster analysis

Plotting Number of clusters against WCSS
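
A minimal sketch of the elbow computation, assuming X is the same encoded feature matrix used for K-means below:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# within-cluster sum of squares (inertia) for k = 1..10 clusters
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()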

Optimal number of clusters came out to be 4.

K-means Clustering

Normalizing and Fitting datavalues

Setting n_clusters=4, as the optimal value of cluster came out 4.

from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

kmeans = KMeans(n_clusters=4, init='k-means++')
# a Normalizer is used to normalize the data values
normalizer = Normalizer()
pipeline = make_pipeline(normalizer, kmeans)
pipeline.fit(X)
# assign a cluster label to each data value
clusters = pipeline.predict(X)

Getting data values that belong to specific cluster:

points1 = np.array([X[j] for j in range(len(X)) if clusters[j] == 1])
df1 = pd.DataFrame(points1, columns=my_df.columns)
pd.DataFrame(df1['TARGET'].value_counts())

Data Values that belong to cluster 0:

Data Values that belong to cluster 1:

Data Values that belong to cluster 2:

Data Values that belong to cluster 3:

Results of Clustering:

As we know, k-means clustering groups similar observations together. From the results above, we can see how the data values are spread over the 4 different clusters. We can also observe that most of the values belong to cluster 1 in our case, and the people that belong to cluster 1 are mostly non-defaulters (0).

We used k-means clustering; in k-means, we compute each point's distance to every cluster centroid (updated after each iteration) and place the point in the closest cluster. In our dataset, the numerical variables INCOME_TOTAL and CREDIT play a huge role in the clustering, which is why an almost identical pattern appears in each cluster.

Cluster 1 defaulters EDUCATION_TYPE and HOUSING_TYPE:

Cluster 2 Defaulter EDUCATION_TYPE and HOUSING_TYPE:

From the clusters above we can see a pattern: the defaulter clusters mostly have low education levels along with poor housing facilities, as mentioned in the earlier part.

Outlier Detection Plots

Box Plots

Scatter Plot

Income vs Credit Total
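
A minimal sketch of how such plots can be produced, assuming the numeric columns INCOME_TOTAL and CREDIT in the preprocessed frame df:

import matplotlib.pyplot as plt
import seaborn as sns

# box plot of income, plus an income-vs-credit scatter to eyeball outliers
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(y=df['INCOME_TOTAL'], ax=axes[0])
axes[0].set_title('INCOME_TOTAL')
axes[1].scatter(df['INCOME_TOTAL'], df['CREDIT'], s=2)
axes[1].set_xlabel('INCOME_TOTAL')
axes[1].set_ylabel('CREDIT')
plt.show()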

Dropping Outliers

from scipy import stats

drop = drop[(np.abs(stats.zscore(drop['INCOME_TOTAL'])) < 3)]
drop = drop[(np.abs(stats.zscore(drop['CREDIT'])) < 3)]
drop = drop[(np.abs(stats.zscore(drop['FAM_MEMBERS'])) < 3)]
drop = drop[(np.abs(stats.zscore(drop['CNT_CHILDREN'])) < 3)]

Any point whose Z-score lies beyond +/- 3 is considered an outlier. From the empirical rule for the normal distribution, 99.7% of the data lies within the (mean - 3 s.d.) to (mean + 3 s.d.) range, which is why 3 is used as a general rule of thumb.

Dropping outliers has caused the data to become concentrated in one area, and the data points that were far off have been removed.

Analysis

Outliers can reduce the normality of the data, because they are not normally distributed. They can also bias estimates, reduce the power of statistical tests, and hurt the classification/prediction power of an algorithm. They may also cause misleadingly strong associations to appear in the frequent pattern analysis phase.

Outliers may come from incorrect data entry. They may also occur naturally in the data set; for example, an income that is far larger than the rest of the data may be there because the client works multiple jobs or runs a high-return business, not because of a data-entry error. In such a case the outlier should not be dropped, as it provides relevant information about the data set that may be important when analyzing the results.

We can see from the graphs above that the number of outliers present in the data is very low compared to the overall size of the data, i.e. 291,858 rows. For such a large size, the outliers may have little impact on the summary statistics, but they will still disrupt the normality of the data.

From the visualizations above, there are outliers present in the FAM_MEMBERS and CNT_CHILDREN variables. It is also important to note that the two variables are positively correlated with each other, as our correlation analysis suggested: a larger number of children translates to a larger family, hence a greater number of family members.

  • Total Number of Attributes = 16
  • Number of Categorical Attributes = 12
  • Number of Numeric Attributes = 4

75% of the attributes are discrete and categorical in nature. For categorical variables it is not possible, nor meaningful, to perform an outlier analysis. For example, take the GENDER attribute, which has the values [male, female] that were later mapped to [0, 1]. A data point can only be an instance of male or female; no value in this set can be considered different from the rest of the data. Plotting a categorical variable gives distinct regions, and all points of the data lie within those regions. Hence the outlier analysis phase does not apply to such attributes.

How can the banks use this information to protect themselves from defaulters?

Helping banks crack down on defaulters

Banks can use the analysis above and the results of the FP-Growth algorithm, together with the association rules, to determine which combinations of attributes give a client a higher probability of being a potential defaulter. Again, the bank cannot deem any of its clients a defaulter with certainty until proven. However, this data provides information about which features are more common among defaulters, and for those clients the bank can initiate a (stricter) screening process.

The bank can change the parameters to be as strict as it wants, run the frequent pattern mining algorithm again, and set its own cut-offs. After that, the client is assigned to a cluster, which helps the bank identify whether the client is a potential defaulter or non-defaulter.

Banks need to use the above mentioned methodology with a grain of salt.

A better approach would be to gather as much data as can reasonably be handled, run our models on it, and then evaluate the models in terms of accuracy.

Even the outliers identified above do not necessarily mean that a person is likely to be a defaulter; for example, it is perfectly normal to have a large family or a six-figure yearly income.

One approach to minimize the bank's losses, based on the information we have, is to set a maximum loan amount for people who are more likely to default.

Before giving out a loan, the bank should definitely consider looking at various attributes such as education, region rating, and realty ownership, and take preventive measures based on them before issuing the loan. For example, if someone owns a house, the bank can issue the loan against their house ownership documents.

Of course, to save time, the bank cannot review each person individually to decide whether they should be given a loan. Instead, we can go through the data returned by the frequent pattern analysis and see which features were prevalent among defaulters; when an applicant arrives with that specific set of features, the bank can carry out a more thorough screening.

In addition, as the bank's data continues to grow, the results returned by our suggested method will become more and more accurate, though always with a slight degree of uncertainty. We may never be able to say that our model or its predictions are 100% accurate; however, these suggestions will definitely help the bank cut down its losses by identifying some of the possible defaulters.
