Capstone Project: Give Me Some Credit
Banks play an essential role in today’s economy because they decide who receives financial loans, and these decisions can make or break a business.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether a loan should be granted. My capstone project aims to predict whether an individual will experience financial distress in the next two years.
Dataset
The data set used for this capstone project is taken from Kaggle and is made up of 150,000 rows and 11 columns. It combines customer and behavioural data. The data set contains missing values, so some cleaning was required: I filled the missing values in each column with that column’s median.
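A minimal sketch of this median imputation with pandas (the toy frame below is illustrative; the column names MonthlyIncome and NumberOfDependents are assumed from the Kaggle data set, where they contain missing values):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle data.
df = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3000.0, np.nan, 4000.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0, 3.0],
})

# Fill each column's missing values with that column's own median.
df = df.fillna(df.median())

print(df.isna().sum().sum())  # no missing values remain
```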

The data set is also imbalanced: customers who experienced financial distress make up only about 7% of the records. This class imbalance had to be addressed for more accurate predictions in the modelling phase, and I handled it by both upsampling and downsampling my training data set.
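One common way to implement this resampling is with `sklearn.utils.resample`; the sketch below is an illustration under that assumption, using the Kaggle target column `SeriousDlqin2yrs` and a synthetic feature column:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training frame: roughly 7% of rows belong to the minority (distressed) class.
rng = np.random.RandomState(42)
train = pd.DataFrame({
    "feature": rng.randn(1000),
    "SeriousDlqin2yrs": (rng.rand(1000) < 0.07).astype(int),
})

majority = train[train["SeriousDlqin2yrs"] == 0]
minority = train[train["SeriousDlqin2yrs"] == 1]

# Upsampling: draw minority rows with replacement until the classes are balanced.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw a majority subset the same size as the minority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])
```

Either balanced frame can then replace the original training set before model fitting; the test set is left untouched so evaluation reflects the true class distribution.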
Models applied
A few models were applied to predict the probability that an individual will face financial distress in the next two years, ranging from quick and simple to slow and complex. I also used GridSearchCV to fine-tune the hyper-parameters of each model for more accurate predictions. Below are the models applied for this capstone project.
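As an illustration of the tuning step, here is a sketch of GridSearchCV with a Random Forest on synthetic data with a similar ~7% positive rate; the parameter grid is hypothetical, as the post does not list the grids actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data (~7% positives).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.93, 0.07], random_state=42)

# Hypothetical grid; each model in the project would get its own grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}

# Exhaustively cross-validate every parameter combination.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)

print(search.best_params_)  # best combination found by cross-validation
```

`search.best_estimator_` is then refit on the full training data and can be evaluated on the held-out test set like any other fitted model.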

Findings
The two metrics used to assess the models are the F1 and AUC scores. The F1 score is the harmonic mean of precision and recall; its best value is 1 (perfect precision and recall) and its worst is 0.
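A small worked example of this relationship, checking scikit-learn's F1 against the harmonic-mean formula on toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions: 3 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)   # 3 / (3 + 1) = 0.75
r = recall_score(y_true, y_pred)      # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)         # harmonic mean: 2*p*r / (p + r)

print(p, r, f1)
```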

Our second metric is the AUC (Area Under the Curve) score, a metric for binary classification that measures the area under the ROC curve, which plots the true positive rate against the false positive rate across all classification thresholds. As with the F1 score, a model with an AUC score of 1 is considered a perfect classifier.
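A minimal example of computing AUC with scikit-learn; note that it takes predicted probabilities (or scores), not hard class labels:

```python
from sklearn.metrics import roc_auc_score

# AUC is computed from scores, e.g. model.predict_proba(X)[:, 1].
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: 3 of the 4 (negative, positive) pairs are ranked correctly
```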

Judging by the average F1 and AUC scores of each model, the models generally performed well with either upsampling or downsampling of the training data set.
Conclusion
In conclusion, taking into consideration both computational speed and the results achieved, I recommend the Random Forest Classifier as the final model for this capstone project, as it achieved an F1 score of 0.84 and an AUC score of 0.85.
To view my code on GitHub, please click here.