SVM — Credit Card — Start to Finished

A Complete Colab Notebook Using the Default of Credit Card Clients Data Set from UCI — AISeries — Episode #05

J3
Jungletronics
11 min read · May 25, 2021


In this post, we are going to understand the Support Vector Machine algorithm using scikit-learn.

Here is the Colab notebook: link.

What is a Support Vector Machine?

A simple way to classify observations into two classes is to draw a linear boundary between them:

Fig 1. Linearly separable data

However, even if the data is perfectly separable in this way, there are many possible linear boundaries that could be used.

Fig 2. Which line is the best?

Which is the best?

One approach is to use some kind of statistical distribution fit.

However, this means that even points far from the boundary have an influence on where the boundary is located.

Intuitively it seems like a better approach for this kind of problem would be to put the boundary as far as possible from any of the observations.

This is the basic idea behind support vector machine classification.

If the data is actually linearly separable, this results in a simple optimization problem: choose the coefficients of the linear boundary to maximize the margin, that is, the distance between the boundary and the nearest observations, subject to the constraint that all observations must be on the correct side of the boundary.
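In standard SVM notation (my sketch, not taken from the original post), this hard-margin problem can be written as:

\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 \quad\text{subject to}\quad y_i\,(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 \ \text{ for all } i,

where each observation is $(\mathbf{x}_i, y_i)$ with label $y_i \in \{-1, +1\}$. The margin width is $2/\lVert\mathbf{w}\rVert$, so minimizing $\lVert\mathbf{w}\rVert$ maximizes the margin, and the constraint keeps every observation on the correct side.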

Note that in the end the optimal solution is determined only by the observations nearest to the boundary.

Fig 3. Support vectors

These observations are referred to as the support vectors.

Real, noisy data may not be linearly separable, that is, there may be no linear boundary that correctly classifies every observation. In this case, we can modify the optimization problem so that it still maximizes the margin, but with a penalty term for misclassified observations.

Note that observations are correctly classified only if they lie on the correct side of the margin, so the penalty term prevents a solution that cheats by having a huge margin.

The SVM solution, then, is the one that gives the best possible separation between the classes, that is, the widest margin without unnecessary misclassifications.
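In the same notation, the soft-margin version adds one slack variable $\xi_i$ per observation and a penalty weight $C$, the same $C$ hyperparameter we will tune later with GridSearchCV; a sketch:

\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \xi_i \quad\text{subject to}\quad y_i\,(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.

A large $C$ punishes misclassifications heavily (narrower margin), while a small $C$ tolerates some of them in exchange for a wider margin.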

Linear boundaries between classes are not appropriate for all problems.

Fig 4. Not all data is separable by line…

However, SVMs can still be used on nonlinear classification problems by transforming the variables into a space where the classes are linearly separable; the linear boundary in that space is equivalent to a nonlinear boundary in the original space.

Fig 5. SVM gets another dimension to separate the classes :)
Gif 1. 2D to 3D. Voilà!
Gif 2. Go to GeoGebra to get this app :) link :)
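To make the kernel idea concrete before we touch the credit card data, here is a minimal, hypothetical sketch (not part of the original notebook) comparing a linear kernel and an RBF kernel on scikit-learn's make_circles toy data, which looks a lot like Fig 4; all variable names below are my own:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: impossible to split with a straight line.
Xc, yc = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=42)

linear_svm = SVC(kernel='linear').fit(Xc_train, yc_train)
rbf_svm = SVC(kernel='rbf').fit(Xc_train, yc_train)

print('Linear kernel accuracy:', linear_svm.score(Xc_test, yc_test))  # close to chance
print('RBF kernel accuracy:', rbf_svm.score(Xc_test, yc_test))        # close to 1.0

The RBF kernel implicitly performs the kind of lift shown in Gif 1: it maps the data into a space where a linear boundary works.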

Fine!

Now let’s do some code.

01#Step — Open your Google Colab and type this:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

02#step — Now, let’s import our Credit Card Database.

I will use the default of credit card clients Data Set from University of California, Irvine (UCI) Machine Learning Repository.

The attribute information is:


This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12 - X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18 - X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.

To import the database, type:

df = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls', header=1)
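Optionally (my addition, not in the original notebook), a quick sanity check on what was loaded; the UCI file should have 30,000 rows and 25 columns (ID, the 23 explanatory variables, and the default column):

df.shape
# expected: (30000, 25)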

03#Step — Some information about the data:

df.info()

04#step — Getting acquainted with the data:

df.head(10)

05#step — Let’s rename the last column to DEFAULT:

df.rename({'default payment next month': 'DEFAULT'}, axis='columns', inplace=True)
df.head()

What does default mean? 1) Failure to do something required by duty or law: neglect. 2) Archaic: fault. 3) Economics: a failure to pay financial debts ("was in default on her loan"; "mortgage defaults"). 4) Law: failure to appear at the required time in a legal proceeding ("The defendant is in default").

06#step — Remove the ID column because it is not informative:

df.drop('ID', axis=1, inplace=True)
df.head()

07#step — Dealing w/ Missing Data:

First, let's see what sort of data is in each column:

df.dtypes

We see that every column is int64, which is good since it tells us that they don’t mix letters and numbers :) There are no NA values. Let’s make sure that each column contains acceptable values (please see descriptions above).

First, the sex column:

Gender (1 = male; 2 = female):

print(df['SEX'].unique())
[2 1]

Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)

Values 0, 5 and 6 are undocumented. It is possible that 0 stands for missing data and that 5 and 6 are categories not mentioned in the description. That is just a guess :)

print(df['EDUCATION'].unique())
[2 1 3 5 4 6 0]

Marital status (1 = married; 2 = single; 3 = others)

print(df['MARRIAGE'].unique())
[1 2 3 0]

Like EDUCATION, MARRIAGE contains 0, which we are guessing represents the missing data.

len(df.loc[(df['EDUCATION'] == 0) | (df['MARRIAGE'] == 0)])
68
df_no_missing = df.loc[(df['EDUCATION'] != 0) & (df['MARRIAGE'] != 0)]
len(df_no_missing)
29932

That is 30000–68 = 29932.

Just Doing the Math:

Missing data represents a fraction of roughly 0.0023, or less than 0.23% of the database :)

(1 - ((30000-68)/30000))*100
0.2266666666666639

08#step — Now, let's use a heatmap from seaborn:

import seaborn as sns
plt.figure(figsize=(8,6))
sns.heatmap(df_no_missing, yticklabels=False, cbar=False, robust=True, cmap='viridis')
Fig 2. Half of the database is useful; the rest is not :/

THE FIRST HALF OF THE DATABASE IS NOW READY TO GO! (columns 1 to 10 ;) Confirming that there are no missing values in the two columns above:

print(df_no_missing['EDUCATION'].unique())
print(df_no_missing['MARRIAGE'].unique())
[2 1 3 5 4 6]
[1 2 3]

09#step — Downsampling the database:

Let’s remind ourselves how many customers are in the dataset:

len(df_no_missing)
29932

Why downsample? SVM training does not scale well to large datasets, so it works better on a reasonably sized sample. Let's take 1,000 observations from each category.
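Before splitting, a quick check (my own addition, not in the original notebook) of how the two classes are distributed; in this dataset the non-defaulters heavily outnumber the defaulters, which is another reason to sample an equal number from each class:

df_no_missing['DEFAULT'].value_counts()  # counts of non-defaulters (0) vs. defaulters (1)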

We start by splitting the database in two: one for people who pay their credit card debt regularly (no default), and the other for people whose debt goes unpaid (default).

default payment (Yes = 1, No = 0)

First for no-Default:)

from sklearn.utils import resample
df_no_default = df_no_missing[df_no_missing['DEFAULT']==0]
df_no_default_downsampled = resample(df_no_default, replace=False, n_samples=1000, random_state=42)
len(df_no_default_downsampled)
1000

Now for Default:/

from sklearn.utils import resample
df_default = df_no_missing[df_no_missing['DEFAULT']==1]
df_default_downsampled = resample(df_default, replace=False, n_samples=1000, random_state=42)
len(df_default_downsampled)
1000

Now merging the databases back:

df_downsample = pd.concat([df_no_default_downsampled, df_default_downsampled])
len(df_downsample)
2000

10#step — Preparing the Data for the Train/Test Split:

X = df_downsample.drop('DEFAULT', axis=1).copy()
# Alternatively: X = df_downsample.iloc[:, :-1].copy()
X.head()
y = df_downsample['DEFAULT'].copy()
y.head()

11#step — One-Hot Encoding (categorical data):

X_encoded = pd.get_dummies(X, columns=['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0',
                                       'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])
X_encoded.head()

12#step — Centering & Scaling the data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)
from sklearn.preprocessing import scale
X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)
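A side note: scale() standardizes the train and test sets independently. A common alternative, sketched below with the same variable names (my addition, not what the original notebook does), is to fit a StandardScaler on the training set only and reuse its statistics on the test set:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # apply those same statistics to the test set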

13#step — Building a Preliminary SVM (fit the model to the data):

from sklearn.svm import SVC
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=42, shrinking=True, tol=0.001,
    verbose=False)

14#step — Confusion Matrix:

from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(clf_svm, X_test_scaled, y_test, values_format='d',
                      display_labels=['Did Not Default', 'Defaulted'])
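Note: plot_confusion_matrix was removed in scikit-learn 1.2. If your Colab runtime ships a newer version, an equivalent sketch with the same arguments uses ConfusionMatrixDisplay:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf_svm, X_test_scaled, y_test,
                                      values_format='d',
                                      display_labels=['Did Not Default', 'Defaulted'])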

Analyzing Confusion Matrix:

From the 233 + 69 = 302 people that Did Not Default, 69 were misclassified (about 23%).

From the 125 + 173 = 298 people that Defaulted, 125 were misclassified (about 42%).
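If you prefer to compute those per-class error rates from the matrix itself rather than reading them off the plot, a small sketch (my addition):

cm = confusion_matrix(y_test, clf_svm.predict(X_test_scaled))
per_class_error = 1 - np.diag(cm) / cm.sum(axis=1)  # row-wise: [Did Not Default, Defaulted]
print(per_class_error)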

That is not acceptable!

Let's fix it! (or at least try :/)

15#step — Cross-Validation & GridSearch — Optimization Techniques:

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.5, 0.1, 1, 10, 100, 1000],
              'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
# We include C=1 and gamma='scale' because these are the default values;
# rbf = radial basis function, because it typically gives the best performance.
# See: Radial basis function
# https://en.wikipedia.org/wiki/Radial_basis_function
optimal_params = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=3)

Now fit it to the scaled data:

optimal_params.fit(X_train_scaled, y_train)
print(optimal_params.best_params_)
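Besides best_params_, GridSearchCV also exposes the mean cross-validated accuracy of the winning combination, which is handy for later comparison (a one-line sketch, my addition):

print(optimal_params.best_score_)  # mean 5-fold CV accuracy of the best parameter combination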

The ideal value for C is 0.5, which means we will use regularization, and the ideal value for gamma is 0.01.

16#step — Making Predictions:

clf_svm = SVC(random_state=42, C=1, gamma=0.01)
clf_svm.fit(X_train_scaled, y_train)
SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
    max_iter=-1, probability=False, random_state=42, shrinking=True, tol=0.001,
    verbose=False)

17#step — Confusion Matrix Again:

plot_confusion_matrix(clf_svm, X_test_scaled, y_test, values_format='d', display_labels=['Did Not Default', 'Defaulted'])

Analyzing Confusion Matrix:

From the 236 + 66 = 302 people that Did Not Default, 66 were misclassified (about 22%).

From the 127 + 171 = 298 people that Defaulted, 127 were misclassified (42.6%).

A bit worse :/

Why?

Let’s try to plot the graph of boundary decision regions.

This is a very complicated step, and every time you run it, it can yield slightly different results :/

Anyway, let's get started! Dimensionality reduction is the key here…

len(df_downsample.columns)
24

This would require a 24-dimensional graph, which is impossible to draw… so how do we overcome this problem?

The answer:

PCA (Principal Component Analysis)

from sklearn.decomposition import PCA
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)
per_var = np.round(pca.explained_variance_ratio_*100, decimals=1)
labels = [str(x) for x in range(1, len(per_var)+1)]
plt.bar(x=range(1, len(per_var)+1), height=per_var)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Components')
plt.title('Scree Plot')
plt.show()
Fig 3. Every time you run it, it can yield different results :/ This is not helping… let's draw the graph anyway.
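A quick check (my addition) of how much variance the first two principal components actually capture; if this total is small, any 2-D decision surface built on them will inevitably look noisy:

print(per_var[:2], per_var[:2].sum())  # % of variance explained by PC1 and PC2, and their total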

18#step — Retraining the Data:

Let's retrain the SVM, fitting it to the data transformed by PCA.

pc1 contains the x-axis coordinates of the data after PCA.

pc2 contains the y-axis coordinates of the data after PCA.

train_pc1_coords = X_train_pca[:, 0]
train_pc2_coords = X_train_pca[:, 1]
# Centering & Scaling
pca_train_scaled = scale(np.column_stack((train_pc1_coords, train_pc2_coords)))
param_grid = {'C': [0.5, 0.1, 1, 10, 100, 1000],
              'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
# We include C=1 and gamma='scale' because these are the default values;
# rbf = radial basis function, because it typically gives the best performance.
# See: https://en.wikipedia.org/wiki/Radial_basis_function
optimal_params = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=3)
optimal_params.fit(pca_train_scaled, y_train)
print(optimal_params.best_params_)

19#step — Decision Boundary Region:

import matplotlib.colors as colors
clf_svm = SVC(random_state=42, C=1000, gamma=0.001)
clf_svm.fit(pca_train_scaled, y_train)
X_test_pca = pca.transform(X_train_scaled)
test_pc1_coords = X_test_pca[:, 0]
test_pc2_coords = X_test_pca[:, 1]
x_min = test_pc1_coords.min() - 1
x_max = test_pc1_coords.max() + 1
y_min = test_pc2_coords.min() - 1
y_max = test_pc2_coords.max() + 1
xx, yy = np.meshgrid(np.arange(start=x_min, stop=x_max, step=0.1),
                     np.arange(start=y_min, stop=y_max, step=0.1))
Z = clf_svm.predict(np.column_stack((xx.ravel(), yy.ravel())))
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots(figsize=(10,10))
ax.contourf(xx, yy, Z, alpha=0.1)
cmap = colors.ListedColormap(['#e41a1c', '#4daf4a'])
scatter = ax.scatter(test_pc1_coords, test_pc2_coords, c=y_train, cmap=cmap,
                     s=100, edgecolors='k', alpha=0.7)
legend = ax.legend(scatter.legend_elements()[0], scatter.legend_elements()[1],
                   loc='upper right')
legend.get_texts()[0].set_text('No Defaults')
legend.get_texts()[1].set_text('Yes Defaults')
ax.set_ylabel('PC2')
ax.set_xlabel('PC1')
ax.set_title('Decision Surface Using the PCA Transformed/Projected Features')
# plt.savefig('svm_defaults.png')
plt.show()
Fig 4. The data is incredibly noisy… but that is all we've got so far… Why is there so much noise in the dataset?

20#step — That’s it:

print("We studied: 'default of credit card clients Data Set' from UCI.\nThat's it! I hope this helps!\nThank you!")We studied: 'default of credit card clients Data Set' from UCI. That's it! I hope this helps! Thank you!

If you find this post helpful, please click the applause button and subscribe to the page for more articles like this one.

Until next time!

I wish you an excellent day!

Download The File For This Project

29_credit_card_svm.ipynb

Credits & References

Based on: Support Vector Machines in Python from Start to Finish. https://youtu.be/8A7L0GsBiLQ by StatQuest with Josh Starmer

default of credit card clients Data Set — Download: Data Folder, Data Set Description

FAQ:

01# What is the decision boundary in a linear classification model?
A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous. If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable.

02# What does it mean for a classification model to be linear?
In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics.

03# Categorical data. What is it?
In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. (Wikipedia)

04# Decision boundaries. What are they?
Decision boundaries are not always clear cut. That is, the transition from one class in the feature space to another is not discontinuous, but gradual. This effect is common in fuzzy logic based classification algorithms, where membership in one class or another is ambiguous.

Related Posts

00#Episode — AISeries — ML — Machine Learning Intro — What Is It and How It Evolves Over Time?

01#Episode — AISeries — Huawei ML FAQ — How do I get an HCIA certificate?

02#Episode — AISeries — Huawei ML FAQ Again — More annotation from Huawei Mock Exam

03#Episode — AISeries — AI In Graphics — Getting Intuition About Complex Math & More

04#Episode — AISeries — Huawei ML FAQ — Advanced — Even More annotation from Huawei Mock Exam

05#Episode — AISeries — SVM — Credit Card — Start to Finished—A Complete Colab Notebook Using the Default of Credit Card Clients Data Set from UCI (this one)

06#Episode — AISeries — SVM — Breast Cancer — Start to Finished— A Complete Colab Notebook Using the Default of Credit Card Clients Data Set from UCI

07#Episode — AISeries — SVM — Cupcakes or Muffins? — Start To Finished — Based on Alice Zhao post
