Machine Learning Steps Explained Using Credit Card Approval Dataset

DEEPIKA DIVVALA
Published in Analytics Vidhya
Jan 5, 2020


In this article I explain how I worked on the credit card approval dataset to build a machine learning model that predicts whether a credit card application should be approved or not.

Define Objective

The initial step in any machine learning project is to define the problem statement. Here our aim is to build a model that predicts whether a credit card should be approved or not, based on details given by the applicant.

Data Gathering

The objective is clear now, so it's time to collect data related to the problem statement. For credit card approval, details like annual income, work experience and many more features that are well known to banks are collected. I collected the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/
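If you prefer not to download the file manually, the raw data can also be read straight from the UCI repository. The sketch below assumes the file in the linked directory is named crx.data, that it has no header row, and that missing values are marked with '?'; the columns are named A1 to A16 to match the attribute list further down.

# A minimal sketch of loading the raw UCI file directly (assumptions noted above).
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
columns = ['A' + str(i) for i in range(1, 17)]   # A1 .. A16
data = pd.read_csv(url, header=None, names=columns)

print(data.shape)   # expected per the dataset description: (690, 16)
print(data.head())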

Data Cleaning

It is a well-known fact that real-world data is rarely clean; it almost always has anomalies like missing values, inconsistencies and redundant records. A good machine learning model can only be built on a good dataset, so before training the model make sure the dataset is free from all these errors. Observing our credit card dataset, it has 16 columns, of which the last column is the one that has to be predicted. A few details of the dataset are:

  1. Title: Credit Approval
  2. Relevant Information: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
    This dataset is interesting because there is a good mix of
    attributes — continuous, nominal with small numbers of
    values, and nominal with larger numbers of values. There
    are also a few missing values.
  3. Number of Instances: 690
  4. Number of Attributes: 15 + class attribute
  5. Attribute Information:

A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +,- (class attribute)

  6. Missing Attribute Values:
37 cases (5%) have one or more missing values. The missing
values from particular attributes are:

A1: 12
A2: 12
A4: 6
A5: 6
A6: 9
A7: 9
A14: 13

  7. Class Distribution:

+: 307 (44.5%)
-: 383 (55.5%)

These are a few of the details available about the dataset.

There are 690 instances in the dataset; 37 of them have missing values, and there are 67 missing values in total. These are the things we have to take care of in this section.
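These counts can be verified directly in pandas once the data frame is loaded (as done below with pd.read_csv); missing values in this dataset are encoded as '?':

# count rows with at least one missing value and the total number of missing cells
missing_mask = (data == '?')

rows_with_missing = missing_mask.any(axis=1).sum()
total_missing = missing_mask.sum().sum()

print('Rows with missing values:', rows_with_missing)   # 37 per the dataset description
print('Total missing values:', total_missing)           # 67 per the dataset description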

Different ways to deal with missing values:

  1. Delete the rows that have missing values.
  2. If the variable is continuous, replace the missing values with the mean or median of that attribute.
  3. If the variable is categorical, replace the missing value with the most common value of that variable.
  4. For categorical variables, replace the missing value with each possible value of the variable (creating one row per value) and then perform clustering to analyse the resulting data.

The approach I used is the 4th one, which I consider the best because there is no loss of data; every value gets an appropriate result.

Initially read the dataset using pandas

import pandas as pd
data = pd.read_csv("creditcard.csv")

In this dataset a missing value is represented as '?', so check for the rows that contain it. I took all these rows into another data frame. If a continuous variable was missing, I replaced it with the mean of that variable; if a categorical variable was missing, I created new rows by substituting each possible value of that categorical variable. Since this can create duplicate rows, I deleted the duplicates from the data frame. A sketch of this imputation step is shown below.
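The following sketch illustrates the idea under a few assumptions: the columns listed as continuous in the dataset description (A2, A3, A8, A11, A14, A15) are converted to numbers, and a row with a missing categorical value is expanded into one row per observed category of that column. It is a simplified version of the procedure described above, not the exact code I used.

import numpy as np
import pandas as pd

# per the dataset description, these attributes are continuous; the rest (except A16) are categorical
continuous_cols = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
categorical_cols = [c for c in data.columns if c not in continuous_cols + ['A16']]

# treat '?' as NaN and make the continuous columns numeric
data = data.replace('?', np.nan)
data[continuous_cols] = data[continuous_cols].apply(pd.to_numeric)

# rows with at least one missing value go into their own frame
missing_rows = data[data.isna().any(axis=1)].copy()

# continuous columns: fill with the column mean
missing_rows[continuous_cols] = missing_rows[continuous_cols].fillna(data[continuous_cols].mean())

def expand_categorical(df, cat_cols, reference):
    # replace each missing categorical value with one copy of the row per observed category
    for col in cat_cols:
        known = df[df[col].notna()]
        unknown = df[df[col].isna()]
        copies = []
        for value in reference[col].dropna().unique():
            filled = unknown.copy()
            filled[col] = value
            copies.append(filled)
        df = pd.concat([known] + copies, ignore_index=True)
    return df

# expand the categorical gaps and drop any duplicate rows this creates
imputed_rows = expand_categorical(missing_rows, categorical_cols, data).drop_duplicates()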

The rows in which the missing values were replaced may not have the correct output (class), so I treated them as unlabelled data and performed clustering with two clusters. I used K-Means clustering to create the clusters. Fortunately, the data forms well-defined clusters without overlapping.

Apply K-Means clustering to analyse the data:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# fit K-Means with two clusters on df (the numerically encoded data frame) and attach the labels
kmeans = KMeans(n_clusters=2)
y = kmeans.fit_predict(df)
df['Cluster'] = y

# run PCA on the data and reduce it to 2 dimensions for plotting
reduced_data = PCA(n_components=2).fit_transform(df)
results = pd.DataFrame(reduced_data, columns=['pca1', 'pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=y, data=results)
plt.title('K-Means Clustering with 2 dimensions')
plt.show()

Now predict the outputs (class labels) for the rows in which the missing values were updated.
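A rough sketch of this step, assuming a hypothetical frame imputed_features that contains the imputed rows restricted to exactly the columns the K-Means model was fitted on; the resulting cluster label is then used as the class label for each such row:

# assign each imputed row to one of the two clusters learned above
# (imputed_features is assumed to have the same columns the model was fitted on)
imputed_labels = kmeans.predict(imputed_features)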

Exploratory Data Analysis

This is a very important step in machine learning in which you have to become a detective: observe patterns in the data and, based on them, work out which algorithm is suitable for the data at hand.

#importing packages
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
plt.style.use('ggplot')

First, check the rows and columns present in the updated dataset (the dataset without missing values):

#shape
print('This data frame has {} rows and {} columns.'.format(data.shape[0], data.shape[1]))

Now look at a sample of the data to get a feel for what it looks like:

data.sample(5)

To learn about each attribute and its datatype:

data.info()

Now plot the number of instances that are approved versus not approved:

# count of each class in the target column A16
counts = data['A16'].value_counts()

plt.figure(figsize=(8,6))
sns.barplot(x=counts.index, y=counts)
plt.title('Count of Approved vs. Not-Approved Credit Cards')
plt.ylabel('Count')
plt.xlabel('Class (+: approved, -: not approved)')
plt.show()

From this plot we can see that the two classes are imbalanced; per the class distribution above, roughly 44.5% of the cards are approved and 55.5% are not.

Now observe the correlation between the continuous variables.

#heatmap
corr = data.corr()
plt.figure(figsize=(12,10))
heat = sns.heatmap(data=corr)
plt.title('Heatmap of Correlation')

If a variable is categorical we cannot work on it directly, because most algorithms only understand numerical data. So the categorical variables have to be converted to numerical ones.

This increases the width of the dataset, since a new column is created for each value of each categorical variable; the 16 columns become 47 columns. A sketch of this encoding is shown below.
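One common way to do this conversion is one-hot encoding with pandas. The sketch below assumes the cleaned data frame (with the imputed rows merged back in) is called data, that the categorical columns are the non-continuous attributes from the dataset description, and that the target A16 is mapped to 0/1, as the later code expects.

# one-hot encode the categorical attributes; each category becomes its own 0/1 column
categorical_cols = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
X = pd.get_dummies(data, columns=categorical_cols)

# map the class attribute to numbers: 1 = approved (+), 0 = not approved (-)
X['A16'] = X['A16'].map({'+': 1, '-': 0})

print(X.shape)   # as noted above, the 16 original columns become 47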

Now split the data into training and testing sets, and create a balanced subsample from the training set.

# splitting the data into train and test
# (manual train/test split using numpy's random.rand; X is the one-hot encoded data frame)
mask = np.random.rand(len(X)) < 0.9
train = X[mask]
test = X[~mask]
print('Train Shape: {}\nTest Shape: {}'.format(train.shape, test.shape))

# build a balanced subsample: keep all not-approved rows and sample an equal number of approved rows
no_of_notapprov = train.A16.value_counts()[0]
not_approv = train[train['A16'] == 0]
approv = train[train['A16'] == 1]
selected = approv.sample(no_of_notapprov)
subsample = pd.concat([selected, not_approv])
len(subsample)

Now look at the data in the subsample.

#shuffling our data set
subsample = subsample.sample(frac=1).reset_index(drop=True)
subsample.head(10)

The subsample is created to make the two class distributions equal: it contains the same number of Approved and Not-Approved instances. Now we can perform the analysis.

new_counts = subsample.A16.value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x=new_counts.index, y=new_counts)
plt.title('Count of Approved vs. Not-Approved Credit Cards In Subsample')
plt.ylabel('Count')
plt.xlabel('Class (0: Not-Approved, 1: Approved)')

Now perform t-SNE to reduce the dimensionality of the data; here it is reduced to 2 dimensions.

from sklearn.manifold import TSNE
import matplotlib.patches as mpatches

# separate features and target
x = X.drop('A16', axis=1)
y = X['A16']

# reduce the features to 2 dimensions with t-SNE
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(x.values)

f, ax = plt.subplots(figsize=(24,16))
blue_patch = mpatches.Patch(color='#0A0AFF', label='Approved')
red_patch = mpatches.Patch(color='#AF0000', label='Not Approved')
ax.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 1), cmap='coolwarm', label='Approved', linewidths=2)
ax.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 0), cmap='coolwarm', label='Not Approved', linewidths=2)
ax.set_title('t-SNE', fontsize=14)
ax.grid(True)
ax.legend(handles=[blue_patch, red_patch])
plt.show()

Building the model

This step goes together with cross-validation, which evaluates the accuracy of different algorithms and selects the best among them. First we have to identify which type of problem this is: it is clearly a classification problem, since we have to predict which class an instance belongs to (Approved or Not Approved). So we evaluate the performance of classification algorithms like Logistic Regression, SVM, Decision Tree, Random Forest and k-Nearest Neighbours, as sketched below.
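A minimal sketch of comparing these classifiers with cross-validation, assuming X_train and y_train are the training features and labels prepared above (preprocessing such as scaling is left out for brevity):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (linear)': SVC(kernel='linear'),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'k-Nearest Neighbours': KNeighborsClassifier(),
}

# 5-fold cross-validation accuracy for each candidate algorithm
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print('{}: {:.3f} (+/- {:.3f})'.format(name, scores.mean(), scores.std()))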

Since the data here is linearly separable, SVM gives the best accuracy, so we train the model with the Support Vector Machine algorithm.

from sklearn.preprocessing import StandardScaler, scale
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# fit a linear-kernel SVM on the training features and labels
# (X_train and y_train are the features and class column from the prepared training data)
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

Predicting the output

Now that model building is complete, it's time to predict the results and check the accuracy of the model on the testing dataset.

y_pred = svclassifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

Finally, evaluate the model's performance using the confusion matrix and classification report. If the model is not performing well, we can try different methods such as tuning the parameters of the algorithm, increasing the size of the dataset, or using cross-validation.
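For example, parameter tuning can be done with a grid search over the SVM's hyperparameters; this is just an illustrative sketch, not part of the original workflow:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# try a few values of the regularisation strength C and the kernel type
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best cross-validated accuracy: {:.3f}'.format(grid.best_score_))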
