Predicting Credit Card Approvals using Machine Learning


Build a machine learning model to predict if a credit card application will get approved.

Authors: Lopa Nayak, Aman Sangal


Project Description

Commercial banks receive a lot of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on the applicant’s credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Fortunately, this task can be automated with the power of machine learning, and pretty much every commercial bank does so nowadays. In this project, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

The dataset used in this project is the Credit Card Approval dataset from the UCI Machine Learning Repository.

Project Tasks

Import Python Libraries

Let’s import all of our primary packages into our Python environment. Throughout this article, we will provide code snippets from our Jupyter notebook. (For an easy-to-follow explanation of integrating a Jupyter notebook into a Medium article, check the reference here.)
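A minimal set of imports that covers everything used in this article might look like the following (the exact list is our sketch; it simply collects the pandas, NumPy, matplotlib, scikit-learn, and XGBoost pieces referenced in the sections below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier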

Loading Dataset

We start the project by loading the dataset into our Jupyter notebook. We load the dataset into a pandas dataframe named df. To keep the data confidential, the contributor of the dataset has anonymized the feature names, so we assign the letters A–P as column names. To take a quick look at the dataset, we print the first five rows using df.head().
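The loading step might look like this (a sketch: the file URL and the header handling are assumptions; the UCI file ships without a header row, so we supply the letter names ourselves):

# The UCI file has no header row, so we pass our own A-P column names
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
df = pd.read_csv(url, header=None, names=list('ABCDEFGHIJKLMNOP'))
df.head()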

Knowing the Data

To understand our data better, we use the handy pandas methods df.info() and df.describe(). Let’s first print the information about the dataset using df.info().

From the output, we get the following information about the data: the dataset has a total of 690 entries, i.e., approval or rejection records for 690 credit card applications, and 16 columns comprising 15 feature variables and one output variable. From the output’s Dtype column, we see several features with Dtype object (string or mixed). Machine learning (ML) algorithms require all feature variables to be of numeric data type; we discuss this issue in detail later in the analysis section. Let’s explore more details of our data using df.describe(). The output is shown below:

df.describe() prints the summary statistics of the columns with numeric dtype. Of these four features, two have a range of 0–28, one has a range of 2–67, and one has a range of 1017–100000. To ensure convergence of the cost-function minimization in machine learning models, all feature variables are scaled using feature scaling techniques; this is discussed in detail in the analysis section.

Handling Missing Values

Whether we like it or not, real-world data is messy. Data cleaning is a major part of every data science project, and any dataset may have missing values in multiple columns. Before we start analyzing the data and drawing conclusions, we need to understand which values are missing in our dataset. Missing values can be denoted by different conventions (?, NaN, …). In columns with int/float dtype, missing values are denoted by NaN. For columns with categorical data types, we print the unique values. We can see that the dataset has some missing values, and they are labeled with ‘?’.

  • To be consistent across the dataset, we replace all the missing values denoted by ? with NaN using df.replace:

df = df.replace('?', np.nan)

We use isnull().sum() to count the number of missing values in each column.

The table on the left shows each column name and its number of missing values. The bar plot on the right visualizes the columns with non-zero missing values.

A very small fraction (0.61%) of the values in our dataset is missing. There are several possible strategies for dealing with missing values; for a discussion, refer to articles 1 and 2. In our case, we use imputation to fix the missing values. For missing values in the categorical columns, we use pandas’ ffill method, which fills in the value from the previous row. For missing values in the numerical columns, we replace them with the mean of the non-missing values in that column. After these steps, the dataset contains no missing values.
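Put together, the cleaning step might look like this (a minimal sketch; splitting the columns by the dtype pandas inferred is our assumption, not necessarily the exact code used):

# Count missing values (NaN) per column
df.isnull().sum()

# Categorical columns: forward-fill with the value from the previous row.
# Note: columns that contained '?' were parsed as object dtype, so a truly
# numeric column among them may need an explicit astype(float) first.
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].ffill()

# Numerical columns: replace NaN with the column mean
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

df.isnull().sum().sum()   # 0 -> no missing values remain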

Analysis

The task of predicting whether a credit card application will be approved or rejected based on the values of the feature variables is a supervised machine learning classification task. We need to separate the dataset into features and a target variable. Following the popular convention, we call the dataframe of feature variables X and the target variable y. To implement the machine learning algorithms, we use the popular Python library scikit-learn.
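With the lettered columns above, the split is one line each (treating the last column, P, as the target is an assumption that follows from the A–P naming):

X = df.drop(columns='P')   # 15 feature variables
y = df['P']                # target: application approved or rejected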

Preprocessing the data

ML algorithms require all input variables to be of numeric type, i.e., if the data contains categorical values, we need to convert them to numbers before applying a machine learning algorithm. In this project, we use the ordinal encoding method to transform the categorical feature variables into numeric ones.
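A sketch of the encoding step using sklearn’s OrdinalEncoder (the specific encoder class is our choice; any ordinal/label encoding produces the same kind of numeric output):

# Replace each categorical column's categories with integer codes 0, 1, 2, ...
cat_cols = X.select_dtypes(include='object').columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])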

Let’s visualize the target variable and have a look at how many approved and declined applications there are in our dataset.

Specifically, out of 690 instances, 383 (55.5%) applications were denied and 307 (44.5%) were approved. This tells us that our dataset has a roughly equal representation of both outcomes of our binary classifier.
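This balance can be checked directly from the target series (the ‘+’/‘-’ labels below are how the UCI file encodes approval and denial):

counts = y.value_counts()
print(counts)             # '-' (denied): 383, '+' (approved): 307
print(counts / len(y))    # roughly 55.5% vs 44.5%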

Using sklearn’s train_test_split, we split the feature (X) and target (y) dataframes into a training set (70%) and a testing set (30%). The training set is used for building the classification model, and the testing set is used for evaluating its performance.

xTrain, xTest, yTrain, yTest = train_test_split(X, y,
                                                test_size=0.30,
                                                random_state=2)

Machine Learning Classifiers

Now we have our dataset ready for building a machine learning-based classifier. There are several classification models that can be used for this task. In this analysis, we will build five different types of classification models, namely Logistic Regression, Decision Tree, Gradient Boost, XGBoost, and K-Nearest Neighbors (KNN). These are among the most popular models for solving classification problems. All of these models can be conveniently implemented using Python’s scikit-learn package, except for the XGBoost model, which is implemented using the XGBoost package.

a) Logistic Regression:

Before implementing logistic regression, we scale the feature variables of our dataset using sklearn’s MinMaxScaler. We train the Logistic Regression model with default parameters on the training dataset. The trained model is saved as logreg.

We evaluate the performance of our model on the test dataset. Our metric is classification accuracy, defined as the fraction of predictions that match the value of the target variable. For a more detailed evaluation of the model, we look at the confusion matrix. The values on the diagonal of the confusion matrix denote the fraction of correct rejection (first row, first entry) and correct approval (second row, second entry) predictions by our classification model. Our logistic regression model has a classification accuracy of 87.9%.
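The scaling, training, and evaluation steps might look like this (a sketch using standard scikit-learn calls; the default arguments are assumptions):

# Scale features to [0, 1]; fit the scaler on the training data only
scaler = MinMaxScaler()
xTrainS = scaler.fit_transform(xTrain)
xTestS = scaler.transform(xTest)

# Train logistic regression with default parameters
logreg = LogisticRegression()
logreg.fit(xTrainS, yTrain)

# Evaluate on the held-out test set
yPred = logreg.predict(xTestS)
print(accuracy_score(yTest, yPred))                       # ~0.879
print(confusion_matrix(yTest, yPred, normalize='true'))   # diagonal = correct fractions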

b) Decision Tree

The second model we try for our classification task is the Decision Tree model. We use sklearn’s DecisionTreeClassifier algorithm to build the model. We find the optimal value of the hyperparameter max_depth by varying it between 1 and 10 in steps of 1. The max_depth value decides the number of times a decision tree is allowed to split. In the plot of accuracy vs. depth for the train and test data, we see that for max_depth = 3 both train and test accuracy are the same. We choose this value for our model, as it avoids a model that is either overfitted or underfitted. The final test accuracy score of our decision tree model is 85.5%.
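The depth search can be written as a simple loop (a sketch; evaluating on the test split mirrors the plot described above, although cross-validation would be the more careful choice):

depths = range(1, 11)
trainAcc, testAcc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=2)
    tree.fit(xTrain, yTrain)
    trainAcc.append(tree.score(xTrain, yTrain))
    testAcc.append(tree.score(xTest, yTest))

plt.plot(depths, trainAcc, label='train')
plt.plot(depths, testAcc, label='test')
plt.xlabel('max_depth'); plt.ylabel('accuracy'); plt.legend()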

c) Gradient Boost

The third model we try for our classification task is Gradient Boost. For the gradient boost model, we use sklearn’s default hyperparameter values. It gives us accuracies of 98.1% and 87% on the train and test datasets, respectively.
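With default hyperparameters, the gradient boost step is only a few lines (a sketch; fixing random_state is our addition for reproducibility):

gb = GradientBoostingClassifier(random_state=2)
gb.fit(xTrain, yTrain)
print(gb.score(xTrain, yTrain))   # ~0.981 on train
print(gb.score(xTest, yTest))     # ~0.87 on test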

d) XGBoost

The fourth model we try for our classification task is XGBoost. We build this model using the XGBClassifier algorithm provided by the XGBoost package. Using the XGBoost model with default values for the hyperparameters, we obtain an accuracy of 87% on the test dataset.
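The XGBoost step is analogous through the package’s scikit-learn-style wrapper (a sketch; recent XGBoost versions require numeric class labels, so mapping the ‘+’/‘-’ labels to 1/0 first is our assumption):

# XGBoost expects numeric labels, so encode approved as 1, rejected as 0
yTrainNum = (yTrain == '+').astype(int)
yTestNum = (yTest == '+').astype(int)

xgb = XGBClassifier()
xgb.fit(xTrain, yTrainNum)
print(xgb.score(xTest, yTestNum))   # ~0.87 on test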

e) KNN

The fifth model we try for our classification task is K-Nearest Neighbors (KNN). We build the model using sklearn’s KNeighborsClassifier algorithm. We optimize the hyperparameter n_neighbors by iterating over values from n = 2 to n = 20 and comparing the accuracy scores. We select n_neighbors = 10, as it avoids both overfitting and underfitting. With 10 neighbors, the accuracy score on the test sample is 72%.
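The neighbor search mirrors the depth search above (a sketch; reusing the min-max-scaled features from the logistic regression step is our assumption, since KNN is distance-based and sensitive to feature scale):

for n in range(2, 21):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(xTrainS, yTrain)
    print(n, knn.score(xTrainS, yTrain), knn.score(xTestS, yTest))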

Conclusion

We have tried five different classification models for our credit card approval prediction task. The train and test accuracies of the models are summarized in the figure below. We obtained the best test accuracy (88%) from the Logistic Regression classifier. The small difference between its train and test accuracy scores indicates the absence of overfitting and underfitting.

Summary of train and test data accuracy for the different machine learning classifiers

Summary

  • We built a machine learning-based classifier that predicts whether a credit card application will be approved, based on the information provided in the application.
  • While building this credit card approval predictor, we learned about common preprocessing steps such as feature scaling, label encoding, and handling missing values.
  • We implemented five different machine learning models, optimized their hyperparameters, and evaluated performance using the accuracy score and by comparing the performance on train and test data.
  • We used Python’s machine learning libraries to implement the machine learning algorithms.
  • The full Python script can be found here on GitHub.

This brings us to the end of this article. We have tried to keep it simple and easy to understand. Appropriate references are provided throughout. Thank you for reading. Any comments or suggestions are highly appreciated.

Authors

[1] Lopamudra Nayak

[2] Aman Sangal
