Comparison of Machine Learning Classification Models for Credit Card Default Data

Vijaya Beeravalli
20 min read · Sep 30, 2018


Naveen Krishna, Data Scientist, AltairQ; Vijaya Beeravalli, Data Science Tutor, Monash University, Melbourne; Ravindranath Pandian, Founder Director, AltairQ

1. Introduction to Machine Learning

Machine learning is a data analytics technique that teaches computers to do what comes naturally to humans and animals: learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model.

The algorithms adaptively improve their performance as the number of samples available for learning increases.

2. How Machine Learning Works

Machine learning uses two types of techniques:

· Supervised learning, which trains a model on known input and output data so that it can predict future outputs.

· Unsupervised learning, which finds hidden patterns or intrinsic structures in input data.

Figure 1: Machine learning techniques include both unsupervised and supervised learning.

3. Machine Learning Classifiers

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modelling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). Classification belongs to the category of supervised learning, where the input data is provided with a target label. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing. [1]

4. The Business Context

A Taiwan-based credit card issuer wants to better predict the likelihood of default for its customers, as well as identify the key drivers that determine this likelihood. This would inform the issuer’s decisions on who to give a credit card to and what credit limit to provide. It would also help the issuer have a better understanding of their current and potential customers, which would inform their future strategy, including their planning of offering targeted credit products to their customers.

5. Our Objective

We wish to compare ten Machine Learning Algorithms, namely

1. Binary Logistic Regression (BLR)

2. KNN

3. SVM

4. Neural Network

5. Naive Bayes

6. Decision Tree

7. Random Forest

8. LDA

9. Gradient Boosting

10. XGB

on the “Default of Credit Card Clients in Taiwan in 2005” dataset from the UCI Machine Learning Repository, comprising 25 attributes and 30,000 instances. The dataset is used to analyze how well each algorithm predicts credit card defaulters, based on the various performance measures obtained from the fitted models.

6. Data Structure and Description

The dataset is obtained from the UCI Machine Learning Repository credit card defaulter [2]. It is a newly published dataset (obtained in 2015). The attribute details in the dataset are given in Table 1.

This research aimed at the case of customers’ default payments in Taiwan and compares the predictive accuracy of the probability of default among ten machine learning models. [3]

Table 1. Description of the attributes in the dataset

7. Data View

Figure 2: Data View

8. Machine Learning Algorithms for Classification

1. Binary Logistic Regression (BLR)

R Package: stats, Function: glm() [19]

It is the go-to method for binary classification problems (problems with two class values). It is a multiple regression with an outcome (dependent) variable that is a categorical dichotomy and explanatory variables (x) that can be either continuous or categorical. The model outputs a probability estimate for each instance.

To predict which class an observation belongs to, a threshold can be set. Based upon this threshold, the obtained estimated probability is classified into classes.

Figure 3: BLR

The decision boundary can be linear or non-linear. The polynomial order can be increased to obtain a more complex decision boundary.
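As a minimal sketch (assuming the prepared training and test data are stored in data frames named train_df and test_df with a 0/1 outcome column named default, which are illustrative names rather than the study's exact code), a BLR model can be fitted with stats::glm() as follows:

```r
# Minimal sketch of BLR with stats::glm(); train_df/test_df and the outcome
# column `default` are illustrative names, not the study's exact code.
blr_fit <- glm(default ~ ., data = train_df, family = binomial)

# Predicted probabilities of default for new customers
probs <- predict(blr_fit, newdata = test_df, type = "response")

# Classify with a 0.5 threshold (the threshold can be tuned)
pred_class <- ifelse(probs > 0.5, 1, 0)
```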

Application:

Logistic regression applies a non-linear (logistic) function to a linear combination of the features, and with polynomial terms it can handle both linearly and non-linearly separable problems.

It can be used for predicting the probability of an event.

Strengths:

· Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting.

· Logistic models can be updated easily with new data using stochastic gradient descent.

Weaknesses:

· Logistic regression tends to underperform when there are multiple or non-linear decision boundaries.

· They are not flexible enough to naturally capture more complex relationships.

2. KNN

R Package: caret [21], Function: train(form, data, method = “knn”, ...) [20]

K-Nearest-Neighbours (KNN) is a non-parametric method used for classification. It is a lazy learning algorithm where all computation is deferred until classification. It is one of the simplest classification algorithms and easy to implement. There is no explicit training phase and the algorithm does not perform any generalizations of the training data.

Being non-parametric, the algorithm does not make any assumptions about the underlying data distribution. The parameter K is selected based on the data. The algorithm requires a distance metric to define proximity between any two data points, for example Euclidean, Mahalanobis, or Hamming distance.

Algorithm:

The KNN classification is performed using the following four steps:

· Compute the distance metric between the test data point and all the labelled data points.

· Order the labelled data points in the increasing order of this distance metric.

· Select the top k labelled data points and look at the class labels

· Find the class label that the majority of these k labelled data points have and assign it to the test data point.

This algorithm is memory-intensive and performs poorly for high-dimensional data. It requires a meaningful distance function to calculate similarity.
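A hedged sketch of fitting KNN through caret's train() is shown below; the 5-fold cross-validation, the scaling step, and the data frame and column names are illustrative assumptions, not the exact settings used in this study.

```r
library(caret)

# Sketch only: train_df/test_df are assumed names and `default` is a factor.
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)

knn_fit <- train(default ~ ., data = train_df,
                 method = "knn",
                 preProcess = c("center", "scale"),  # KNN is distance-based, so scale
                 tuneLength = 10,                    # try 10 candidate values of k
                 trControl = ctrl)

knn_pred <- predict(knn_fit, newdata = test_df)
```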

Application:

k-NN is often used in search applications where you are looking for “similar” items; that is, when your task is some form of “find items similar to this one”. You’d call this a k-NN search.

Strengths:

· Simple to implement

· Flexible to feature/distance choices

· Naturally handles multi-class cases

· Can do well in practice with enough representative data

Weaknesses:

· Large search problem to find nearest neighbours

· Storage of data

· Requires a meaningful distance function

3. SVM

R Package: caret [21], Function: train(form, data, method = “svmLinear”, ...) [20]

Vapnik & Chervonenkis originally invented the support vector machine. At that time, the algorithm was in its early stages: drawing hyperplanes was only possible for linear classifiers. Later, in 1992, Boser, Guyon & Vapnik suggested a way to build non-linear classifiers using the kernel trick [6]. Cortes & Vapnik published their “Support-Vector Networks” paper in 1995 [7]. Since then, the SVM classifier has been treated as one of the dominant classification algorithms.

Support vector machines (SVM) use a mechanism called kernels, which essentially calculate the distance between two observations. The SVM algorithm then finds a decision boundary that maximizes the distance between the closest members of separate classes. [4]

In the linear classifier model, the data points are expected to be separated by an apparent gap. It predicts a straight hyperplane dividing the two classes. The primary focus while drawing the hyperplane is on maximizing the distance from the hyperplane to the nearest data point of either class. The resulting hyperplane is called the maximum-margin hyperplane. [8]
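Below is an illustrative sketch of fitting a linear SVM with caret; the cost grid and the data frame and column names are assumptions, not the exact tuning used here.

```r
library(caret)

# Illustrative sketch; the cost grid and data frame names are assumptions.
svm_fit <- train(default ~ ., data = train_df,
                 method = "svmLinear",
                 preProcess = c("center", "scale"),
                 tuneGrid = expand.grid(C = c(0.1, 1, 10)),  # cost parameter
                 trControl = trainControl(method = "cv", number = 5))

svm_pred <- predict(svm_fit, newdata = test_df)
```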

Application:

The aim of using SVM is to correctly classify unseen data.

Face detection — SVMs classify parts of the image as face and non-face and create a square boundary around the face.

Text and hypertext categorization — SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories, scoring each document and comparing the score with a threshold value.

Classification of images — Use of SVMs provides better search accuracy for image classification. It provides better accuracy in comparison to the traditional query-based searching techniques.

Strengths:

· SVM’s can model non-linear decision boundaries, and there are many kernels to choose from.

· They are also robust against overfitting, especially in high-dimensional spaces; they impose no distributional requirement and do not suffer from multicollinearity.

Weaknesses:

· However, SVMs are memory-intensive, trickier to tune due to the importance of picking the right kernel, and don’t scale well to larger datasets.

· Currently in the industry, random forests are usually preferred over SVMs.

4. Neural Network

R Package: neuralnet, Function: neuralnet() [22]

A Neural Network (NN), also called an Artificial Neural Network (ANN), is named after its artificial representation of the working of the human nervous system.

The general structure of a neural network looks like:

Figure 4: Artificial neuron and the structure of the feed forward artificial neural network. [10]

This figure depicts a typical neural network with the working of a single neuron explained separately. Let’s understand this.

The input to each neuron is like the dendrites. Just like in the human nervous system, a neuron (artificial though!) collates all the inputs and performs an operation on them. Lastly, it transmits the output to all other neurons (of the next layer) to which it is connected. A neural network is divided into three types of layers: [11]

Input Layer: The training observations are fed through these neurons

Hidden Layers: These are the intermediate layers between input and output which help the Neural Network learn the complicated relationships involved in data.

Output Layer: The final output is extracted from the previous two layers.

Components details:

1. Neurons. A neural network is a graph of neurons. A neuron has inputs and outputs. Similarly, a neural network has inputs and outputs. The inputs and outputs of a neural network are represented by input neurons and output neurons. Input neurons have no predecessor neurons but do have an output. Similarly, an output neuron has no successor neuron but does have inputs.

2. Connections and Weights. A neural network consists of connections, each connection transferring the output of a neuron to the input of another neuron. Each connection is assigned a weight.

3. Propagation Function. The propagation function computes the input of a neuron from the outputs of predecessor neurons. The propagation function is leveraged during the forward propagation stage of training.

4. Error Feedback. The network learns from the error, repeating the forward and backward passes until the error is reduced.

5. Learning Rule. The learning rule is a function that alters the weights of the connections. This serves to produce a favoured output for a given input for the neural network. The learning rule is leveraged during the backward propagation stage of training.
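A minimal sketch with the neuralnet package is shown below; the hidden-layer sizes, the subset of predictor columns (LIMIT_BAL, AGE, PAY_0, ...), and the assumption that the data have already been scaled are illustrative choices only.

```r
library(neuralnet)

# Sketch, assuming predictors have been scaled to [0, 1] and the column names
# match the prepared data frames `train_scaled` / `test_scaled` (assumptions).
set.seed(123)
nn_fit <- neuralnet(
  default ~ LIMIT_BAL + AGE + PAY_0 + BILL_AMT1 + PAY_AMT1,
  data = train_scaled,
  hidden = c(5, 3),        # two hidden layers: 5 and 3 neurons
  err.fct = "ce",          # cross-entropy error for 0/1 classification
  linear.output = FALSE    # apply the activation function at the output layer
)

# Predicted default probabilities for the test set
test_x   <- test_scaled[, c("LIMIT_BAL", "AGE", "PAY_0", "BILL_AMT1", "PAY_AMT1")]
nn_probs <- compute(nn_fit, test_x)$net.result
nn_pred  <- ifelse(nn_probs > 0.5, 1, 0)
```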

Application:

ANNs, due to some of its wonderful properties have many applications:

1. Image Processing and Character recognition: Given ANNs ability to take in a lot of inputs, process them to infer hidden as well as complex, non-linear relationships, ANNs are playing a big role in image and character recognition. Character recognition like handwriting has a lot of applications in fraud detection (e.g. bank fraud) and even national security assessments.

2. Forecasting: Forecasting is required extensively in everyday business decisions (e.g. sales, the financial allocation between products, capacity utilization), in economic and monetary policy, in finance and the stock market.

Strengths

· Neural networks are good for unstructured data such as images, audio, and text; they do not perform as well on structured data sets.

Weaknesses

· It is very complex to apply.

· It is not as easy as building a model using scikit-learn/caret.

· Training time is too high.

· Requires high computational power.

· Probably the best-known disadvantage of neural networks is their “black box” nature, meaning that you don’t know how and why your NN came up with a certain output [12].

5. Naive Bayes

R Package: naivebayes, Function: naive_bayes() [23]

Naive Bayes (NB) is a very simple algorithm based on conditional probability and counting.

It’s called “naive” because its core assumption of conditional independence (i.e. all input features are independent of one another) rarely holds true in the real world.

It classifies given instances (objects/data) into predefined classes (groups), assuming there is no interdependency among the features (class-conditional independence).

In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes’ rule.
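A minimal sketch with the naivebayes package is shown below; the data frame and column names are assumptions, and the laplace argument anticipates the zero-frequency issue discussed under Weaknesses.

```r
library(naivebayes)

# Sketch; `default` must be a factor, and train_df/test_df are assumed names.
nb_fit  <- naive_bayes(default ~ ., data = train_df)
nb_pred <- predict(nb_fit, newdata = test_df)

# Laplace smoothing guards against the zero-frequency problem
nb_fit_smooth <- naive_bayes(default ~ ., data = train_df, laplace = 1)
```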

Application:

Categorizing news, email spam detection, face recognition, sentiment analysis, medical diagnosis, digit recognition and weather prediction are just a few of the popular use cases of Naive Bayes algorithm.

Strengths:

· Easy and fast to predict the class of test data set.

· Also performs well in multi-class prediction.

· When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression and you need less training data.

· It performs well in case of categorical input variables compared to numerical variable(s). For numerical variable, the normal distribution is assumed (bell curve, which is a strong assumption).

Weaknesses:

· If the categorical variable has a category (in test data set), which was not observed in training data set, the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.

6. Decision Tree

R Package: rpart, Function: rpart() [24]

Decision Tree algorithms are referred to as CART: Classification and Regression Tree. Decision Trees are broadly used supervised models for classification and regression tasks. A decision tree can be used to visually and explicitly represent decisions and decision making.

One big advantage of decision trees is that the classifier generated is highly interpretable. The tree is constructed by repeated splitting feature space into two subsets. It can handle a mix of both discrete and continuous inputs.

Decision trees implicitly perform variable screening or feature selection.

Decision trees are relatively robust to outliers, scale well to large data sets, and can be modified to handle missing features. They provide an estimate of the misclassification rate for a query sample.

They are invariant under all monotone transformations of individual ordered variables. Feature importance is clear, and relations can be viewed easily. This methodology is commonly known as learning a decision tree from data; when the target is categorical, the resulting tree is called a classification tree.
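A short illustrative sketch of growing a classification tree with rpart is given below; the control settings and data frame names are assumptions, not the exact configuration used in this study.

```r
library(rpart)

# Sketch of a classification tree; control settings are illustrative.
tree_fit <- rpart(default ~ ., data = train_df,
                  method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 20))

printcp(tree_fit)   # cross-validated error for each complexity parameter
tree_pred <- predict(tree_fit, newdata = test_df, type = "class")
```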

Application:

Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable.

Strengths:

· Simple to understand, interpret, and visualize.

· Decision trees implicitly perform variable screening or feature selection.

· The internal workings can be observed, which makes it possible to reproduce work.

· Can handle both numerical and categorical data. Performs well on large datasets and is extremely fast.

Weaknesses:

· Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. [27]

· Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods such as bagging and boosting.

· Decision-tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting the decision tree.

7. Random Forest

R Package: randomForest, Function: randomForest() [25]

Random Forest is a supervised learning algorithm. As the name indicates, it creates a forest and makes it somehow random.

The “forest”, it builds, is an ensemble of Decision Trees, most of the time trained with the “bagging” method.

To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

The general idea of the bagging method is that a combination of learning models increases the overall result.

Random Forest is a flexible, easy to use machine learning algorithm that produces, even without hyper-parameter tuning, a great result most of the time.

It is also one of the most used algorithms, because of its simplicity and the fact that it can be used for both classification and regression tasks.

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Random Forest adds additional randomness to the model while growing the trees.

Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. In other words, random forest bootstraps the data used to construct each tree and also samples the candidate explanatory variables at each split, so that the partition is not always made on the same important variable. This results in a wide diversity that generally produces a better model.

The main limitation of Random Forest is that many trees can make the algorithm slow and ineffective for real-time predictions.

Application:

· Banking- Random Forest algorithm is used to find loyal customers, which means customers who can take out plenty of loans and pay interest to the bank properly, and fraud customers, which means customers who have bad records like failure to pay back a loan on time or have dangerous actions.

· Medicine — Random Forest algorithm can be used to both identify the correct combination of components in medicine and to identify diseases by analyzing the patient’s medical records.

Strengths:

· Reduction in over-fitting: By averaging several trees, there is a significantly lower risk of over-fitting.

· Less variance: By using multiple trees, you reduce the chance of stumbling across a classifier that doesn’t perform well because of the relationship between the train and test data.

Weaknesses:

· It is complex. It’s hard to visualize the model or understand why it predicted something.

· It’s more difficult to implement.

· It’s more computationally expensive.

8. Linear Discriminant Analysis

R Package: MASS, Function: lda() [26]

Linear Discriminant Analysis (LDA) is a very common technique for supervised classification problems; it takes the class labels into consideration. The goal of LDA is to project the features from a higher-dimensional space onto a lower-dimensional space.

LDA is closely related to the analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements.

However, ANOVA uses categorical variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable.

Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.

In our study, LDA is used to classify credit card customers as likely defaulters or non-defaulters.
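A minimal sketch of fitting LDA with MASS::lda() is shown below; the data frame and column names are assumptions, not the study's exact code.

```r
library(MASS)

# Sketch; train_df/test_df and the outcome column `default` are assumed names.
lda_fit <- lda(default ~ ., data = train_df)

lda_out <- predict(lda_fit, newdata = test_df)
head(lda_out$class)       # predicted class labels
head(lda_out$posterior)   # posterior probability for each class
```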

Application:

Identification:

It identifies customers who are likely to buy a certain product in a store. Discriminant analysis helps us select the features that best describe a buying customer.

Decision Making:

Diagnosis of illness to identify the disease a patient has is an important application of LDA.

Strengths:

· LDA is more useful when we have more than two response classes because it also provides low-dimensional views of the data.

· The main advantage of the LDA is the existence of an explicit solution and its computational convenience which is not the case for more advanced classification techniques such as SVM or neural networks.

Weaknesses:

· Requires the predictors to be normally distributed and is not well suited to variables with few categories.

· It relies on multivariate normal distributions to compute class boundaries and confidence intervals, and it suffers from multicollinearity.

9. Gradient Boosting Algorithm

R Package: caret [21], Function: train(form, data, method = “gbm”, ...) [20]

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees (Wikipedia definition)

The logic behind gradient boosting is simple and can be understood intuitively, without mathematical notation. The intuition is to repeatedly leverage the patterns in the residuals, strengthening a model with weak predictions to make it better.

Once we reach a stage that residuals do not have any pattern that could be modeled, we can stop modeling residuals (otherwise it might lead to over-fitting).

Summary:

· We first model data with simple models and analyse data for errors.

· These errors signify data points that are difficult to fit by a simple model.

· Then for later models, we particularly focus on those hard to fit data to get them right.

· In the end, we combine all the predictors by giving some weights to each predictor.
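An illustrative sketch of fitting a gradient boosting model through caret is shown below; the tuning grid and data frame names are assumptions, not the grid actually used in this comparison.

```r
library(caret)

# Sketch; the tuning grid values are illustrative.
gbm_grid <- expand.grid(n.trees = c(100, 300),
                        interaction.depth = c(2, 4),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

gbm_fit <- train(default ~ ., data = train_df,
                 method = "gbm",
                 tuneGrid = gbm_grid,
                 trControl = trainControl(method = "cv", number = 5),
                 verbose = FALSE)

gbm_pred <- predict(gbm_fit, newdata = test_df)
```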

Application:

· High-accuracy pattern recognition: tasks such as speech and motion recognition can be handled efficiently with boosted temporal models such as HMMs.

· Extraction of relevant information from large amounts of data: this general-purpose problem has been solved efficiently with boosted ensemble models in the web-page ranking area.

Strengths:

· GBTs build trees one at a time, where each new tree helps to correct errors made by the previously trained tree.

· With each tree added, the model becomes even more expressive. GBDTs will usually perform better than Random Forest and they have more hyper-parameters to tune.

Weaknesses:

· GBDTs are prone to over-fitting and are harder to get right; careful tuning is needed to keep over-fitting in check.

· GBDT training generally takes longer because trees are built sequentially.

10. XGBoost: A Scalable Tree Boosting System

R Package: caret [21], Function: train(form, data, method = “xgbTree”, ...) [20]

XGBoost is one of the implementations of Gradient Boosting concept, but what makes XGBoost unique is that it uses “a more regularized model formalization to control over-fitting, which gives it better performance,” according to the author of the algorithm, Tianqi Chen. Therefore, it helps to reduce over-fitting.

XGBoost can be used for supervised learning tasks such as Regression, Classification, and Ranking.

It is built on the principles of gradient boosting framework and designed to “push the extreme of the computation limits of machines to provide a scalable, portable and accurate library.”

XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split.

It “automatically does parallel computation on a single machine which could be more than 10 times faster than existing gradient boosting packages.”

Histogram-based algorithm splits all the data points for a feature into discrete bins and uses these bins to find the split value of histogram.

XGBoost cannot handle categorical features by itself; it only accepts numerical values.

Therefore, one has to perform various encodings like label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost.

XGBoost is able to perform automatic feature selection and capture high-order interactions without breaking down.

XGBoost also includes an extra randomization parameter, i.e. column sub-sampling, which helps to reduce the correlation between individual trees even further.
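A hedged sketch of fitting XGBoost through caret's xgbTree method is shown below; the tuning grid and data frame names are assumptions. Note that caret's formula interface dummy-encodes categorical predictors, which satisfies XGBoost's numeric-only requirement.

```r
library(caret)

# Sketch; grid values are illustrative, not the grid actually used here.
xgb_grid <- expand.grid(nrounds = c(100, 300),
                        max_depth = c(3, 6),
                        eta = 0.1,
                        gamma = 0,
                        colsample_bytree = 0.8,   # column sub-sampling
                        min_child_weight = 1,
                        subsample = 0.8)

xgb_fit <- train(default ~ ., data = train_df,
                 method = "xgbTree",
                 tuneGrid = xgb_grid,
                 trControl = trainControl(method = "cv", number = 5))

xgb_pred <- predict(xgb_fit, newdata = test_df)
```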

Application:

· XGBoost has wide application area and applies on numerous classification techniques viz. feature selection, feature extraction, and multi-class categorization.

· The applications of boosting include medical area, text classification, page ranking and business and so on.

Strengths:

· Are easily interpretable, are relatively fast to construct.

· Can naturally deal with both continuous and categorical data, can naturally deal with missing data and are robust to outliers in the inputs, are invariant under monotone transformations of the inputs.

· Perform implicit variable selection, can capture non-linear relationships in the data

· Can capture high-order interactions between inputs, they also scale well to large datasets.

Weaknesses:

· Tends to select predictors with a higher number of distinct values.

· Can overfit when faced with predictors with many categories.

· Are unstable and have high variance, have difficulty capturing additive structure, tend to have limited predictive performance.

9. Confusion Matrices of Models

A confusion matrix is a summary of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.

Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.
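As an illustration, caret's confusionMatrix() computes the matrix and the derived metrics in one call; pred_class and the outcome column default are assumed names carried over from the earlier sketches.

```r
library(caret)

# Sketch; predictions and reference labels must be factors with the same
# levels, and "1" (default) is treated as the positive class here.
cm <- confusionMatrix(data = factor(pred_class, levels = c(0, 1)),
                      reference = factor(test_df$default, levels = c(0, 1)),
                      positive = "1")

cm$table     # counts of correct and incorrect predictions per class
cm$byClass   # sensitivity, specificity, balanced accuracy, F1, ...
```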

Below table shows the confusion matrices for all the ten Machine learning algorithms used for comparison in our study.

Figure 5: Confusion Matrix comparison table of ten models.

10. Comparison of Model Performances

The graph below shows the time taken for each of the Machine Learning models in seconds.

Figure 6: Comparison of Tuning Time

The graph below shows the comparison of Type 1 Error

Figure 7: Comparison of Type I Error.

Type 1 error: Also known as a “false positive”: the error of rejecting a null hypothesis when it is actually true.

Figure 8: Comparison of Type II Error.

Type 2 error: Also known as a “false negative”: the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature.

Accuracy: Accuracy, meaning the ability to get the classification correct, follows a simple and obvious relationship:

Accuracy = 1 − Error Rate

The below graph compares the accuracy of different Machine Learning Models.

Figure 9: Comparison of Accuracy.

Sensitivity: Also known as Recall, it is the capability of the model to predict the positive results. Useful as it may be, it is still not adequate to describe the error behaviour of a model.

Figure 10: Comparison of Sensitivity (Recall).

Specificity: It is concerned with an ability to detect negative results. It matters more when classifying the 0’s correctly is more important than classifying the 1’s.

In the table below, Decision Tree has low specificity.

Figure 11: Comparison of Specificity.

Balanced Accuracy: Calculated as the average of the proportion correct in each class individually.

Figure 12: Comparison of Balanced Accuracy.

F1 Score: The F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score.
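All of the metrics above can be computed directly from the four cells of a confusion matrix. The counts in the sketch below are purely hypothetical and are used only to show the arithmetic.

```r
# Hypothetical cell counts from a confusion matrix (illustration only)
TP <- 950; FP <- 300; TN <- 5800; FN <- 450

accuracy     <- (TP + TN) / (TP + TN + FP + FN)
type1_error  <- FP / (FP + TN)            # false positive rate
type2_error  <- FN / (FN + TP)            # false negative rate
sensitivity  <- TP / (TP + FN)            # recall
specificity  <- TN / (TN + FP)
balanced_acc <- (sensitivity + specificity) / 2
precision    <- TP / (TP + FP)
f1_score     <- 2 * precision * sensitivity / (precision + sensitivity)
```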

Figure 13: Comparison of F1 Score.

11. Prediction:

Input Data:

The input data contains three scenarios.

1) All explanatory variables contain its minimum value.

2) All explanatory variables contain its maximum value.

3) All explanatory variables contain its mean value.

Figure 14: Input data for Prediction

For a given input data, we would like to predict credit card defaulters.
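A sketch of how such scenario rows can be built and scored is shown below; rf_fit, train_df, and the outcome column default are names carried over from the earlier sketches, not the study's exact code.

```r
# Sketch: build the three scenario rows (minimum, maximum and mean of every
# explanatory variable) and score them with a fitted model.
predictors <- setdiff(names(train_df), "default")

scenarios <- data.frame(rbind(
  min_values  = sapply(train_df[predictors], min),
  max_values  = sapply(train_df[predictors], max),
  mean_values = sapply(train_df[predictors], mean)
))

predict(rf_fit, newdata = scenarios, type = "prob")  # probability of default
```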

12. Model selection

After comparing all the models (Logistic Regression, Decision Tree, Random Forest, KNN, SVM Linear, SVM Radial, Gradient Boosting, Extreme Gradient Boosting, Neural Network, Linear Discriminant Analysis and Naïve Bayes), we find that their accuracies are broadly similar. The best model was therefore chosen based on the minimum value of the Type II error.

The model would help the issuer gain a better understanding of their current and potential customers, which would inform their future strategy, including their planning of targeted credit products for their customers.

This would also inform the issuer’s decisions on who to give a credit card to and what credit limit to provide.

13. References

[1] https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623

[2] Dataset from the University of California, Irvine, available in their online repository: http://archive.ics.uci.edu/ml/index.html

[3] https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#

[4] https://elitedatascience.com/machine-learning-algorithms

[5] Daniel T. Larose, Chantal D. Larose, “Data Mining and Predictive Analytics”, Wiley, 2015.

[6] B. E. Boser, I. M. Guyon, V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers”, 1992.

[7] Corinna Cortes and Vladimir Vapnik, “Support-Vector Networks”, 1995.

[8] https://dataaspirant.com/2017/01/13/support-vector-machine-algorithm/

[9] https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

[10] Dejan Tanikić and Vladimir Despotovic, “Artificial Intelligence Techniques for Modelling of Temperature in the Metal Cutting Process”, 2012.

[11] https://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/

[12] https://towardsdatascience.com/hype-disadvantages-of-neural-networks-6af04904ba5b

[13] https://blog.exploratory.io/introduction-to-extreme-gradient-boosting-in-exploratory-7bbec554ac7

[14] https://en.wikipedia.org/wiki/Linear_discriminant_analysis

[15] https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd

[16] https://www.mathworks.com/discovery/machine-learning.html

[17] https://en.wikipedia.org/wiki/F1_score

[18] https://towardsdatascience.com/introduction-to-neural-networks-advantages-and-applications-96851bd1a207

[19] https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/glm

[20] https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/train

[21] http://topepo.github.io/caret/index.html

[22] https://www.rdocumentation.org/packages/neuralnet/versions/1.33/topics/neuralnet

[23] https://www.rdocumentation.org/packages/naivebayes/versions/0.9.2/topics/naive_bayes

[24] https://www.rdocumentation.org/packages/rpart/versions/4.1-13/topics/rpart

[25] https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest

[26] https://www.rdocumentation.org/packages/MASS/versions/7.3-50/topics/lda
