6 Individual Machine Learning Algorithms for a Classification Model — Case Study: Credit Risk Analysis

Risdan Kristori

--

Girl putting her toys in groups by Yinfan Huang

Classification is one of the most frequently used types of supervised machine learning. A classification model predicts the correct label for a given input. Examples of classification problems include detecting email spam, image recognition, churn prediction, credit risk analysis, and even identifying medical conditions (for instance, the presence or absence of cancer).

Many classification approaches are available today, such as ensemble learning and deep learning. In this article, however, I will discuss individual classification models, often called ‘weak learners’. These models will be used to build a classifier that determines the status of someone’s credit.

Six classification models will be discussed here, each with its own algorithm:

  1. Logistic Regression
  2. Decision Tree
  3. Support Vector Machine
  4. Linear Discriminant Analysis
  5. K-Nearest Neighbor
  6. Naive Bayes

LOGISTIC REGRESSION

Logistic regression works by fitting a line or curve to a set of data points. The line or curve represents the relationship between the independent variables (the features that you are using to predict the outcome) and the dependent variable (the outcome that you are trying to predict).

Sigmoid Function

The logistic function is used to transform the output of the line or curve into a probability. It is a sigmoid function, which means it has a characteristic S-shape; this shape allows the logistic regression model to output probabilities between 0 and 1. This video clearly explains the mathematics behind the model.
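To make the idea concrete, here is a minimal sketch (the weights, bias, and input below are made up purely for illustration) of how the sigmoid turns a linear combination of features into a probability:

import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical weights, bias, and a single observation with two features
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b    # the linear part (the "line")
p = sigmoid(z)          # probability of the positive class
print(round(p, 3))      # ~0.769, so we would predict class 1 at a 0.5 threshold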

Here are some of the advantages of using logistic regression:

  • It is a simple and easy-to-understand algorithm.
  • It can be used to solve a variety of problems, including binary classification and multi-class classification.
  • It can be used to provide probabilities for the predicted outcome.
  • It is relatively efficient to train and deploy.

Here are some of the disadvantages of using logistic regression:

  • It can be sensitive to outliers.
  • It can be computationally expensive to train for large datasets.
  • It may not be able to capture complex relationships between the independent variables and the dependent variable.

Overall, logistic regression is a powerful machine-learning algorithm that can be used to solve a variety of problems. It is a good choice for beginners because it is relatively simple to understand and implement.

DECISION TREE

A decision tree classifier works by creating a tree-like structure, where each node in the tree represents a decision rule. The branches of the tree represent the possible outcomes of each decision rule, and the leaf nodes of the tree represent the final classifications.

Decision Tree in Determining the Risk of Heart Attack
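As a small illustration (the data below are invented and unrelated to the credit dataset used later), scikit-learn's DecisionTreeClassifier can print the decision rules it learns:

from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: [age, income] -> loan approved (1) or rejected (0)
X = [[25, 20000], [40, 80000], [35, 30000], [50, 120000], [23, 15000], [45, 90000]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each node in the printed tree is one decision rule; the leaves are the classes
print(export_text(tree, feature_names=['age', 'income']))
print(tree.predict([[30, 70000]]))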

Here are some of the advantages of using decision tree classifiers:

  • They are easy to understand and interpret.
  • They are relatively efficient to train and deploy.
  • They can be used to solve a variety of classification problems.

Here are some of the disadvantages of using decision tree classifiers:

  • They can be sensitive to overfitting.
  • They can be unstable, meaning that small changes in the data can lead to large changes in the model.

SUPPORT VECTOR MACHINE

Support vector machine (SVM) works by finding the hyperplane that best separates the two classes of data. The hyperplane is a line or a plane that divides the data into two regions, with each region containing all the data points of one class.

The SVM classifier works by finding the hyperplane that has the maximum margin between the two classes. The margin is the distance between the hyperplane and the closest data points of each class. The larger the margin, the better the SVM classifier will perform.

Support Vector Machine Illustration
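Here is a minimal sketch of the idea, using two made-up, linearly separable clusters; the support vectors it reports are the points closest to the separating hyperplane:

import numpy as np
from sklearn.svm import SVC

# Two invented clusters of points
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear').fit(X, y)

# The support vectors define the maximum-margin hyperplane
print(svm.support_vectors_)
print(svm.predict([[2, 2], [5, 4]]))    # expected: [0 1]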

Here are some of the advantages of using SVM classifiers:

  • They are very accurate, especially for small datasets.
  • They are relatively robust to noise and outliers.
  • They can be used to solve both classification and regression problems.

Here are some of the disadvantages of using SVM classifiers:

  • They can be computationally expensive to train, especially for large datasets.
  • They can be sensitive to the choice of hyperparameters.
  • They can be difficult to interpret.

LINEAR DISCRIMINANT ANALYSIS

Linear discriminant analysis (LDA) works by finding a linear combination of features that maximizes the separation between the classes. LDA is a parametric method, which means that it makes assumptions about the distribution of the data. In the case of LDA, the assumption is that the data from each class follows a multivariate normal distribution.

Linear Discriminant Analysis Illustration
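A minimal sketch with invented, roughly Gaussian clusters shows LDA learning that linear combination and using it to classify new points:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two made-up clusters (LDA assumes each class is normally distributed)
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [6.0, 7.0], [6.5, 6.8], [7.0, 7.2]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)

# The coefficients are the learned linear combination of the features
print(lda.coef_, lda.intercept_)
print(lda.predict([[2.0, 2.5], [6.2, 6.9]]))    # expected: [0 1]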

Here are some of the advantages of using LDA classifiers:

  • They are relatively simple to understand and implement.
  • They are relatively efficient to train and deploy.
  • They can be used to solve a variety of classification problems.

Here are some of the disadvantages of using LDA classifiers:

  • They can be sensitive to outliers.
  • They may not be able to capture complex relationships between the features and the classes.
  • They are not as accurate as some other machine learning algorithms, such as SVM classifiers.

K-NEAREST NEIGHBOR

K-nearest neighbors (KNN) works by finding the k data points most similar to a new data point and then predicting its label from the labels of those k nearest neighbors.

KNN Illustration by DataCamp

The value of k is a hyperparameter that must be chosen by the user, and it affects the accuracy of the KNN classifier. A larger k makes the classifier more robust to noise but can smooth over genuine local patterns, while a smaller k follows the local structure more closely but is more sensitive to noise.
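The toy example below (with one deliberately mislabeled point; all values are invented) shows how the choice of k can flip a prediction:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up points; [3, 3] is labeled 1 amid a cluster of 0s to act as noise
X = np.array([[1, 1], [2, 2], [3, 3], [2, 1], [8, 8], [9, 9], [8, 9]])
y = np.array([0, 0, 1, 0, 1, 1, 1])

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k=1 follows the single noisy neighbor (predicts 1); k=3 outvotes it (predicts 0)
    print(k, knn.predict([[3.0, 2.5]]))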

Here are some of the advantages of using KNN classifiers:

  • They are relatively simple to understand and implement.
  • They can be used to solve a variety of classification and regression problems.
  • With a suitably chosen k, they are fairly robust to overfitting.

Here are some of the disadvantages of using KNN classifiers:

  • They can be computationally expensive, especially for large datasets.
  • They can be sensitive to noise.
  • They may not be as accurate as some other machine learning algorithms, such as SVM classifiers.

NAIVE BAYES

Naive Bayes is a simple but powerful machine-learning algorithm that can be used for classification tasks. It is based on Bayes’ theorem, which is a mathematical formula that describes the probability of an event occurring given the probability of other events occurring.

Naive Bayes assumes that the features of a data point are independent of each other. This means that the presence of a particular feature does not affect the probability of another feature being present. This assumption is often not true in real-world data, but it makes Naive Bayes very fast and easy to train.

Bayes’ Theorem
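As a quick illustration of the theorem itself (all the probabilities below are invented), here is how Bayes’ rule would score a spam email containing the word “free”:

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.2                # prior probability that any email is spam
p_free_given_spam = 0.6     # probability of the word "free" appearing in spam
p_free_given_ham = 0.05     # probability of "free" appearing in non-spam

p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))    # 0.75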

Here are some of the advantages of using Naive Bayes classifiers:

  • They are relatively simple to understand and implement.
  • They are very fast to train and predict.
  • They can be used to solve a variety of classification problems.

Here are some of the disadvantages of using Naive Bayes classifiers:

  • The assumption of independence between features may not be true in real-world data.
  • They may not be as accurate as some other machine learning algorithms, such as SVM classifiers.

Comparing Individual Classification Models Performance: Credit Risk Analysis

Next, we will test all of these models on the task of determining a person’s credit status. In this comparison, we will not handle class imbalance or do any hyperparameter tuning; all models will use their default parameters. Of course, you can do all of that to improve your models’ performance, but let’s save it for another topic.

Load The Data

Data was taken from the following Kaggle link.

# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the data
df = pd.read_csv('credit_risk.csv')
df.head()
df.info()

The data consists of 32,581 entries and 12 columns. There are missing values in the Emp_length and Rate columns. For column descriptions, you can check the Kaggle link.

Data Cleaning

Let's do a little bit of data analysis.

df.describe(include='number').T

# Summarize each column: dtype, missing values, unique values, and two sample values
listItem = []
for col in df.columns:
    listItem.append([col, df[col].dtype, df[col].isna().sum(),
                     round((df[col].isna().sum() / len(df[col])) * 100, 2),
                     df[col].nunique(), list(df[col].drop_duplicates().sample(2).values)])

dfDesc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                      data=listItem)
dfDesc

Do you see something unusual? There are three things here:

  1. The maximum Age is 144, which is impossible.
  2. The maximum Emp_length is 123, which is also impossible.
  3. There are missing values in Emp_length (895 entries) and Rate (3,116 entries).

Let's assume these are input errors, so we will remove them before training the model. First, let's check the data where the creditor's age is more than 60 years old.

df[df.Age >60]

Since there are only 70 such entries, we will restrict the model to creditors aged 60 or younger (only about 0.2% of the data is lost).

df = df[df.Age <= 60]

Let's check the data with high Emp_length values.
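For example, a simple filter (the threshold here is just an assumption for inspection) surfaces the suspicious rows:

# Inspect implausibly long employment lengths (threshold chosen for illustration)
df[df.Emp_length > 40]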

There are two entries where Emp_length is 123 years; we will drop them (along with any other implausibly long employment lengths), while keeping the missing values so they can be imputed later.

# Keep plausible employment lengths; preserve NaN rows for imputation
df = df[(df.Emp_length < 40) | (df.Emp_length.isna())]

Preprocessing

# Importing libraries for preprocessing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Drop the Id column
df = df.drop(['Id'], axis=1)

# Define columns for preprocessing
impute_cols = ['Emp_length', 'Rate']
encode_cols = df.select_dtypes(exclude='number').columns.to_list()
scale_cols = ['Age', 'Income', 'Amount', 'Percent_income', 'Cred_length']

# Impute missing values, then scale
imputer_knn = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaling', RobustScaler())
])

Preprocessing = ColumnTransformer(
    transformers=[
        ('imputer', imputer_knn, impute_cols),
        ('encoder', OneHotEncoder(handle_unknown='ignore'), encode_cols),
        ('scaling', RobustScaler(), scale_cols)
    ]
)

The missing values are imputed using the KNN imputer from scikit-learn, and all numerical columns are scaled with RobustScaler to reduce the influence of outliers.
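For intuition, here is a tiny sketch (with invented income values) of what RobustScaler does: it centers on the median and divides by the interquartile range, so a single extreme outlier does not distort the scale of the other values:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up incomes with one extreme outlier
income = np.array([[20_000], [25_000], [30_000], [35_000], [1_000_000]])

print(RobustScaler().fit_transform(income).ravel())
# -> [-1., -0.5, 0., 0.5, 97.]  The outlier stays large but the rest keep a sensible scale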

Modelling

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Model selection utilities
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from tqdm import tqdm

# Split x and y for the model
y = df['Status']
x = df.drop(['Status'], axis=1)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=24, stratify=y)

# The models
models = [
    LogisticRegression(),            # Logistic Regression
    DecisionTreeClassifier(),        # Decision Tree
    SVC(),                           # Support Vector Machine
    LinearDiscriminantAnalysis(),    # Linear Discriminant Analysis
    KNeighborsClassifier(),          # K-Nearest Neighbor
    GaussianNB()                     # Naive Bayes
]

# Lists for the score metrics
accuracy_mean = []
accuracy_std = []

# Calculate each model's cross-validated accuracy
for model in tqdm(models):

    # Cross-validation splitter
    crossval = KFold(n_splits=5, shuffle=True, random_state=24)

    # Create a pipeline of preprocessing and model
    estimator = Pipeline([
        ('preprocessing', Preprocessing),
        ('model', model)
    ])

    # Calculate the accuracy from cross-validation
    accuracy = cross_val_score(
        estimator,
        X_train,
        y_train,
        cv=crossval,
        scoring='accuracy',
        error_score='raise'
    )

    accuracy_mean.append(accuracy.mean())
    accuracy_std.append(accuracy.std())

# Model evaluation summary
model_matrix = pd.DataFrame({
    'Model': ['LogisticRegression', 'DecisionTree', 'SVC', 'LinearDiscriminantAnalysis',
              'KNeighborsClassifier', 'GaussianNB'],
    'Accuracy Mean': accuracy_mean,
    'Accuracy Std': accuracy_std
})
model_matrix.sort_values(by='Accuracy Mean', ascending=False)

Based on the cross-validation accuracy, the Support Vector Machine has the highest score, followed by the Decision Tree and K-Nearest Neighbor. This does not mean the other models are bad; as mentioned earlier, every model here uses its default parameters, and hyperparameter tuning can still improve performance. The point is that for a classification problem we can compare these models to find the one that best suits the task, or use them as building blocks for a more complex model such as an ensemble. Thanks.
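As a possible next step (not part of the comparison above), the best-scoring model could be refit on the full training set and checked against the held-out test data. A minimal sketch, reusing the Preprocessing pipeline and train/test split defined earlier:

from sklearn.metrics import classification_report

# Refit the top cross-validated model (SVC in this run) on the training set
best_model = Pipeline([
    ('preprocessing', Preprocessing),
    ('model', SVC())
])
best_model.fit(X_train, y_train)

# Evaluate on the held-out test set
print(classification_report(y_test, best_model.predict(X_test)))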

--

Risdan Kristori

Writing about what I learned and what I did, especially about `Coding`, `Data` and `Machine Learning`.