A Data Scientist uses Artificial Intelligence to Determine if Someone is a Good Credit Risk
In my most recent post I combined two datasets to determine whether a debtor is a good customer. No guidelines were given on what makes someone a good customer, so I had to make my own assumptions. The previous post on this subject is:- A Data Scientist uses Artificial Intelligence to Determine the Type of Debtor a Person is | by Tracyrenee | CodeX | Apr, 2021 | Medium
In this post I have expanded the criteria used to check whether a customer is a good credit risk. Under normal circumstances a risk manager would tell the data scientist the criteria that determine whether a person is a good credit risk, but I had no such guidance and had to make my own assumptions about the characteristics a good credit risk would have. I therefore kept the criteria simple, expanding upon my earlier post by checking whether the client is a property owner and has a steady income with which to pay off any debt. When I added these two parameters, the share of customers rated a good credit risk fell from roughly 75% to 50%. This is to be expected: each extra condition a customer must also satisfy can only keep the share of good ratings the same or lower it. The link to the dataset for this post can be found at:- Credit Card Approval Prediction | Kaggle
The problem statement that accompanies this dataset, which I have amended, reads as follows:-
“Build a machine learning model to predict if an applicant is ‘good’ or ‘bad’ credit risk, different from other tasks, the definition of ‘good’ or ‘bad’ is not given. You should use some technique, such as vintage analysis to construct a label. Also, unbalance data problem is a big problem in this task.”
The program for this post was written in a Jupyter Notebook on the Kaggle website. Because I had previously written a program using this dataset, I simply copied the original notebook and edited it to include the new criteria. There are therefore now two Jupyter Notebooks on the credit risk dataset, both of which I have made public so anyone can view them.
Once the notebook was created, I imported the libraries I would need to run the program, namely pandas, numpy, matplotlib and os:-
I then read the two CSV files that the program needs into dataframes:-
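A minimal sketch of this step follows. I am assuming the two files are the dataset's application_record.csv and credit_record.csv, and the columns shown are only a small assumed subset. In-memory text stands in for the files so the snippet runs anywhere; in the notebook the file paths would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-ins for the two CSV files (assumed names: application_record.csv
# and credit_record.csv). The rows below are invented toy data.
application_csv = io.StringIO(
    "ID,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_INCOME_TYPE,OCCUPATION_TYPE\n"
    "1,Y,N,Working,Managers\n"
    "2,N,N,Pensioner,\n"
)
credit_csv = io.StringIO(
    "ID,MONTHS_BALANCE,STATUS\n"
    "1,0,C\n"
    "1,-1,0\n"
)

# read_csv accepts a path or any file-like object
application = pd.read_csv(application_csv)
credit = pd.read_csv(credit_csv)
print(application.shape, credit.shape)
```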
I merged the two datasets together to form one dataframe:-
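The merge might look something like the sketch below. The join key `ID` and the left join are my assumptions: nulls appearing later in MONTHS_BALANCE and STATUS suggest that applicants without a credit history were kept. The rows are toy data.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; ID is the assumed join key.
application = pd.DataFrame({
    "ID": [1, 2, 3],
    "OCCUPATION_TYPE": ["Managers", None, "Drivers"],
})
credit = pd.DataFrame({
    "ID": [1, 1, 2],
    "MONTHS_BALANCE": [0, -1, 0],
    "STATUS": ["C", "0", "X"],
})

# how="left" keeps every applicant, even one with no credit record (ID 3),
# whose MONTHS_BALANCE and STATUS then come out as nulls.
df = application.merge(credit, on="ID", how="left")
print(df)
```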
I then checked for any null values in the new dataframe:-
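One common way to run this check is `isnull().sum()`, sketched here on a toy dataframe (the rows are invented; only the column names come from the post):

```python
import numpy as np
import pandas as pd

# Toy merged dataframe with a few deliberate gaps.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "OCCUPATION_TYPE": ["Managers", np.nan, "Drivers"],
    "MONTHS_BALANCE": [0.0, -1.0, np.nan],
    "STATUS": ["C", "0", np.nan],
})

# Count nulls per column and show only the columns that have any.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])
```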
There were missing values in three columns, so I imputed them. The null values in OCCUPATION_TYPE were filled with “Unknown”, and the null values in MONTHS_BALANCE and STATUS were filled with 0:-
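A sketch of the imputation on toy rows. One assumption of mine: STATUS is filled with the string "0" rather than the number 0, so that the column keeps a single type alongside its letter codes.

```python
import numpy as np
import pandas as pd

# Toy merged dataframe with gaps in the three affected columns.
df = pd.DataFrame({
    "OCCUPATION_TYPE": ["Managers", np.nan, "Drivers"],
    "MONTHS_BALANCE": [0.0, -1.0, np.nan],
    "STATUS": ["C", "0", np.nan],
})

df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")
df["MONTHS_BALANCE"] = df["MONTHS_BALANCE"].fillna(0)
# "0" as text (assumption) so STATUS stays a column of string codes
df["STATUS"] = df["STATUS"].fillna("0")

print(df.isnull().sum())
```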
I created a dictionary for the status column and assigned -1 to C and -2 to X. I then mapped this dictionary to the STATUS column of the dataframe:-
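The mapping can be sketched as below. Only the C and X entries come from this post; I am assuming that the remaining STATUS codes are the digits 0 to 5 and that each keeps its numeric value, since the original dictionary is not shown here.

```python
import pandas as pd

# Toy STATUS column mixing letter codes and digit codes.
df = pd.DataFrame({"STATUS": ["C", "X", "0", "1", "5", "0"]})

# C -> -1 and X -> -2 as described; digit codes map to themselves (assumed).
status_map = {"C": -1, "X": -2}
status_map.update({str(n): n for n in range(6)})

# astype(str) guards against a numeric 0 left behind by imputation.
df["STATUS"] = df["STATUS"].astype(str).map(status_map)
print(df["STATUS"].tolist())
```

Making the paid-off (C) and no-loan (X) months negative is what lets a later "STATUS is less than 1" test treat them as good months.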
I created a new column, INCOME, which was filled with 0’s.
I then used a for loop to step through every row of the dataframe. If the column NAME_INCOME equals “Pensioner” or “Student” then the value equals 0, otherwise it equals 1. The value is then appended to an empty list, income, that had been created for this purpose. When the loop had passed through all of the rows, the list, income, was placed into the column, INCOME:-
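The loop described above can be sketched as follows, on toy rows. The column name `NAME_INCOME_TYPE` is my assumption for the income-type column that this post calls NAME_INCOME. A vectorised one-liner that gives the same result is included for comparison.

```python
import pandas as pd

# Toy rows; NAME_INCOME_TYPE is an assumed column name.
df = pd.DataFrame(
    {"NAME_INCOME_TYPE": ["Working", "Pensioner", "Student", "State servant"]}
)

df["INCOME"] = 0  # start with a column of zeros
income = []
for i in range(len(df)):
    # pensioners and students are treated as having no steady income
    if df["NAME_INCOME_TYPE"].iloc[i] in ("Pensioner", "Student"):
        income.append(0)
    else:
        income.append(1)
df["INCOME"] = income

# Vectorised equivalent, much faster on a large dataframe.
income_vec = (~df["NAME_INCOME_TYPE"].isin(["Pensioner", "Student"])).astype(int)
print(df)
```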
I then created a column, PROPERTY, and used the same methodology as the previously created column. The only difference is that if the customer owned a car or owned property then the value would equal 1, otherwise it would equal 0:-
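A sketch of the PROPERTY column on toy rows. The column names `FLAG_OWN_CAR` and `FLAG_OWN_REALTY` and their "Y"/"N" values are my assumptions about how ownership is recorded in the dataset.

```python
import pandas as pd

# Toy rows; the flag column names and Y/N coding are assumptions.
df = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "N", "Y"],
    "FLAG_OWN_REALTY": ["N", "Y", "N", "Y"],
})

prop = []
for car, realty in zip(df["FLAG_OWN_CAR"], df["FLAG_OWN_REALTY"]):
    # 1 if the customer owns a car or property, otherwise 0
    prop.append(1 if car == "Y" or realty == "Y" else 0)
df["PROPERTY"] = prop
print(df)
```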
And lastly, I created the column, LABEL, which holds either a 0 or a 1 to identify whether a client is a good credit risk. This column was built using the same methodology as the two columns I previously created. The only difference is that if STATUS is less than 1 and INCOME equals 1 and PROPERTY equals 1 then the value equals 1, otherwise it equals 0. When all of the rows in the dataframe had been looped through, the list, label, was placed in the column, LABEL:-
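The labelling rule can be sketched like this, with toy values standing in for the three columns built earlier:

```python
import pandas as pd

# Toy rows covering one good customer and one failure of each condition.
df = pd.DataFrame({
    "STATUS": [-1, 0, 2, -2],
    "INCOME": [1, 1, 1, 0],
    "PROPERTY": [1, 0, 1, 1],
})

label = []
for status, income, prop in zip(df["STATUS"], df["INCOME"], df["PROPERTY"]):
    # good credit risk only when all three conditions hold at once
    if status < 1 and income == 1 and prop == 1:
        label.append(1)
    else:
        label.append(0)
df["LABEL"] = label
print(df)
```

Because the three conditions are joined with "and", failing any one of them is enough to give a row a label of 0.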
I then analysed the labels produced with the additional conditions, and found that the share of customers labelled a good credit risk had fallen from 74% in the previous post to 49% in this post:-
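One way to get such percentages is `value_counts(normalize=True)`, sketched here on a toy LABEL column. The figures it prints describe the toy rows only, not the dataset.

```python
import pandas as pd

# Toy LABEL column: one good credit risk out of four rows.
df = pd.DataFrame({"LABEL": [1, 0, 0, 0]})

# normalize=True returns fractions; multiply by 100 for percentages.
share = df["LABEL"].value_counts(normalize=True) * 100
print(share)
```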
As can be seen from the figures above, the more conditions that are placed on a customer’s ability to repay a debt, the fewer customers qualify as a good credit risk, because a customer must satisfy every condition at once.
The code for this program can be found in its entirety in my personal Kaggle account, the link being here:- Quick credit risk — Vintage Analysis | Kaggle