end-to-end ML project implementation

Loan Defaulting Tendency Prediction — End-to-End ML implementation

A case study on the Home Credit Default Risk dataset — part 3 of 3

Narasimha Shenoy
14 min read · Mar 23, 2022
(by author)

Preface and setting the expectations
This end-to-end project is the first plunge into the ocean of ML taken by me, a mechanical engineering grad who has worked in the manufacturing and energy industry for around ten years.

Owing to its length, I have spread the article over three parts —

1. Introduction, dataset familiarization & Performance Metric selection (click here to read)

2. EDA, Feature Engineering and Machine Learning Modelling (click here to read)

3. ML Model deployment (this article)

This is the concluding part of the series, in which we perform advanced feature engineering, try out a few complex/SOTA models and deploy the best ML model.

I hear and I forget. I see and I remember. I do and I understand — Confucius

In keeping with the profound quote above, the best way to go through this series is to read the articles with my Google Colab notebooks open, so that you get a ‘hands-on’ experience.
I have created a GitHub repository of all my Colab notebooks in a phase-wise manner which can be found here.
Owing to this, the articles include only minimal, illustrative code. The comments in the Colab notebooks will help you correlate the full code, its output and the conclusions drawn in this article.

With the intent clear, and assuming that you have gone through the earlier parts of this series, let’s finally get to the meat of the matter.

This part contains the following —

# ADVANCED FEATURE ENGINEERING & MODELING

  • Prerequisites
  • Creation of advanced features
  • Testing the utility of the advanced features
  • Advanced modeling using the advanced features
  • Concluding Summary

# MODEL DEPLOYMENT

  • Gist of the deployment process
  • The detailed process of deployment
  • Highlights of the deployed app
  • Important information, resources & links

ADVANCED FEATURE ENGINEERING & MODELING

Prerequisites
I assume that you have gone through my previous posts. This phase uses the processed best dataset and the best model, and I attempt to improve model performance by augmenting the dataset with ‘advanced features’.
Please refer to the notebook titled Phase-4_Phase-4_Advanced Modeling & Feature engineering.ipynb in the GitHub repo linked earlier for the Python libraries used, code snippets and the complete documentation. This is a direct link to my Colab notebook.

Creation of advanced features
As the name suggests, advanced features are usually relatively abstract in nature. The internet abounds with literature on such features, much of it steeped in advanced mathematical concepts. To keep our model simple, I have considered the following features —

Principal Component Analysis [PCA] as a feature
PCA is performed with 5 components, i.e., the PCA output has 5 features. A data frame is formed from these 5 features and visualized.

Visualization of the 5-component PCA features based on class (by author)
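Below is a minimal sketch of how such PCA features and the class-wise pair plot could be produced with sklearn and seaborn; `X_train` and `y_train` are placeholder names for the selected-feature matrix and the target, not the exact variables from my notebook.

```python
# Hedged sketch: 5-component PCA on scaled training features, visualized by class.
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X_train)          # PCA is scale-sensitive
pca = PCA(n_components=5, random_state=42)
pca_feats = pca.fit_transform(X_scaled)                      # shape: (n_samples, 5)

pca_df = pd.DataFrame(pca_feats, columns=[f"PCA_{i+1}" for i in range(5)])
pca_df["TARGET"] = y_train.values                            # class label used as hue

sns.pairplot(pca_df, hue="TARGET", plot_kws={"s": 5, "alpha": 0.4})
```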

From the pair plots (scatter plots of two features at a time), the spread of wrongly predicted points is observed to be smaller than the spread of correctly predicted points. Even so, there is no visually distinct separability between the classes based on these errors.

Visualization of the 5-component PCA features based on prediction correctness (by author)

Plotting the class-wise prediction errors also reveals no new insight towards error reduction, at least visually.
However, given the abstract nature of these features, I decided to use the PCA features anyway and check for a possible performance gain.
For simplicity, I chose a 2-component PCA. The scatter plot again shows that the spread of wrongly predicted points is smaller than that of correctly predicted points, the same insight as derived from the 5-component PCA.

Visualization of the 2-component PCA features based on prediction correctness (by author)

Subsequently, the 2 PCA components are appended to our training dataset.

Errors in predictions as a feature
This is a tricky feature to work with. Basically, I used the error information from the model’s predictions on the train dataset. If it turns out to be useful, we’ll figure out a way to generate such a feature for ‘unseen’ data as well (which is quite a task); otherwise, we’ll drop it (easier😅).

Implementation of this logic is as below —

  • LightGBM with tuned parameters (the best model) is trained on the dataset with selected features (the best dataset).
  • A new column is added indicating the prediction and its confidence. For example, Correct_High means the data point is correctly predicted with high confidence. Similarly, data points are labelled Correct_Low, Wrong_High and Wrong_Low.
  • This new column is derived from the Score column, which gives the probability of the predicted Label. A cut-off of 0.75 separates low from high confidence: a Score of 0.75 or less counts as low confidence, anything above as high. Thus a correctly predicted point with a Score above 0.75 is labelled Correct_High.
  • The remaining points are classified similarly. The resulting test data frame is named predict_test. Predictions are also made on the train data and the confidence columns are added; this data frame is named train_predict. This nomenclature will become clear on referring to the Colab notebook.
The ‘error’ based features augmenting our training dataset (by author)
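A rough sketch of this confidence labelling is shown below; `best_lgbm`, `X_train` and `y_train` are placeholder names, and only the 0.75 cut-off is taken from the actual implementation.

```python
# Hedged sketch: tag each training row as Correct/Wrong x High/Low confidence.
import numpy as np

proba = best_lgbm.predict_proba(X_train)[:, 1]        # P(default) for each row
pred = (proba >= 0.5).astype(int)                     # hard prediction
score = np.where(pred == 1, proba, 1 - proba)         # probability of the *predicted* label

train_predict = X_train.copy()
train_predict["Label"] = pred
train_predict["Score"] = score

correct = np.where(pred == y_train.values, "Correct", "Wrong")
confidence = np.where(score > 0.75, "High", "Low")    # score <= 0.75 -> low confidence
train_predict["Confidence_label"] = [f"{c}_{h}" for c, h in zip(correct, confidence)]  # e.g. Correct_High
```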

LDA as a new feature
Linear Discriminant Analysis seeks to best separate (or discriminate) the samples in the training dataset by their class value. Specifically, the model seeks to find a linear combination of input variables that achieves the maximum separation for samples between classes and the minimum separation of samples within each class.
This, coupled with our augmented dataset, has the potential to improve performance.
Thus, LDA is performed on the resulting dataset and a new feature column named LDA is added to both the train and test data.
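A minimal sketch of this step, assuming `train_aug` / `test_aug` are the augmented frames, `feature_cols` the list of model features and `y_train` the target (all placeholder names):

```python
# Hedged sketch: add a single LDA discriminant as a feature.
# For a binary target, LDA yields at most one component.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
train_aug["LDA"] = lda.fit_transform(train_aug[feature_cols], y_train).ravel()  # fit on train only
test_aug["LDA"] = lda.transform(test_aug[feature_cols]).ravel()                 # reuse the fitted transformer
```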

Testing the utility of the advanced features
The LightGBM classifier (as it is our best model) is trained on the augmented train dataset.
Predictions are made on the test data and corresponding accuracy, AUC and confusion matrix are obtained.

Results of the model prediction using the advanced features (by author)
Performance of the lGBM classifier on augmented dataset (by author)
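For reference, here is a minimal sketch of how these metrics can be obtained with sklearn; variable names such as `model`, `X_test_aug` and `y_test` are placeholders.

```python
# Hedged sketch: accuracy, AUC and confusion matrix for the augmented-data model.
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

y_prob = model.predict_proba(X_test_aug)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC     :", roc_auc_score(y_test, y_prob))     # AUC is computed from probabilities
print(confusion_matrix(y_test, y_pred))

# Repeating the same on the train split exposes the over-fitting gap noted below.
```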

Following are the observations when compared to the best model —

  • Over-fitting is noticed, with a substantial difference in accuracy & AUC between the predictions on the train and test data.
  • Although AUC has increased compared to the best model, accuracy has decreased.

Based on the above observations, it is concluded that the new LightGBM model is not better than the best model.

Advanced modeling using the advanced features
Just to be sure, and also to try an alternate approach to using our advanced features, we’ll try out a few ‘advanced’ models and see how they match up against our best model on the best dataset.

Stacking based model
A stacked model is trained using PyCaret, with AdaBoost, Decision Tree and LightGBM as the base estimators.

The stacking configuration (by author)
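The stacking step in PyCaret roughly looks like the sketch below (PyCaret 2.x style); `train_df` and `test_df` are placeholder dataframes, and the meta-model shown is PyCaret’s default rather than a confirmed detail of my configuration.

```python
# Hedged sketch of the PyCaret stacking flow.
from pycaret.classification import setup, create_model, stack_models, predict_model

clf = setup(data=train_df, target="TARGET", session_id=42)

ada  = create_model("ada")         # AdaBoost
dt   = create_model("dt")          # Decision Tree
lgbm = create_model("lightgbm")    # LightGBM

stacked = stack_models(estimator_list=[ada, dt, lgbm])   # meta-model defaults to LogisticRegression
preds = predict_model(stacked, data=test_df)             # appends Label / Score columns
```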

Predictions are made on the test data and the corresponding accuracy, AUC and confusion matrix are obtained.

Performance of the stacking classifier on augmented dataset (by author)

Following are the observations when compared to our best model —

  • Overfitting is observed, with a substantial difference in accuracy & AUC between the predictions on the train and test data.
  • Accuracy and AUC are lower in comparison to the best model.

Based on the above observations, it is concluded that the stacking-based model is not better than the best model.

Neural Network based model
A neural network model is created using Keras.

Representation of a Neural network (source)
The NN configuration I chose for modeling on augmented dataset (by author)
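For orientation, here is a minimal Keras sketch of this kind of feed-forward binary classifier; the layer sizes and training settings are illustrative placeholders, not the exact configuration shown in the figure.

```python
# Hedged sketch: a small feed-forward binary classifier in Keras.
from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train_aug.shape[1]               # placeholder: augmented feature matrix

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # probability of default
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])

model.fit(X_train_aug, y_train, validation_split=0.2, epochs=20, batch_size=512)
```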

Predictions are made on the test data and the corresponding accuracy, AUC and confusion matrix are obtained.

NN-based model performance on augmented dataset (by author)

Following are the observations when compared to our best model —

  • Overfitting is observed, although it is not major.
  • Accuracy and AUC have decreased compared to the best model.

Based on the above observations, it is concluded that the NN-based model is not better than the best model.

As a sanity check, and to gauge the efficacy of a neural network on the data at hand, the network is also trained on the original data (the dataset with selected features).

The NN configuration I chose for modeling on best dataset (by author)
NN-based model performance on best dataset (by author)

Following are the observations when compared to our best model —

  • Overfitting is not observed.
  • Accuracy and AUC have decreased only slightly compared to the best model.
  • The number of applicants who should have been rejected for a loan but are predicted as eligible is substantial, which is not good.

Based on the above observations, it is concluded that the tuned LightGBM model remains the best option.

Concluding Summary

  • After evaluating a myriad of models with varying datasets and feature permutations, the tuned LightGBM emerges as the best performer.
  • Even the deep learning models fared relatively poorly in comparison to the tuned LightGBM model.
  • This phase concludes the model training step; the finalized best model is deployed in the next phase.

MODEL DEPLOYMENT

Gist of the deployment process
The LightGBM model is the best model, and the PyCaret library was used for the whole process, from data ingestion to prediction, in the Colab notebooks.
For the deployment phase, however, the entire data pipeline and model were recreated using sklearn; the reason is explained later in this article.

Fastest and easiest deployment options out there (by author)

I tried out quite a few deployment options and finally zeroed in on Streamlit (owing to its simplicity and intuitive support for HTML tags in the front end) + Heroku (easy to use and quick to deploy).

What follows is a summary of the steps I used to deploy the best model, what worked for me and, especially, what did not. I have not written it as a step-by-step deployment guide, as most of the options have very detailed tutorials and, as usual, the internet abounds with walkthroughs (I’ll list a few of the shortest, most to-the-point ones).
What I have recorded are the issues I faced, which may stem from the specific combination I chose and for which debugging was not straightforward.
Also, after trying out the most popular deployment ecosystems, I do have a preferred option; my reasons are included so that you can go for it directly if our thoughts resonate.
So, let’s begin.

PyCaret pipeline & its quirks
PyCaret is an amazing package, and I developed most of my modeling in this framework. Continuing with it, the best model & pipeline developed in the PyCaret framework were first deployed using FastAPI on Heroku.

The PyCaret model is substantially large (approximately 180 MB). Though I could load the pickled model in Colab and get defaulting-tendency predictions for the entire test dataset, deployment on Heroku threw an error.
The error also persisted upon using Streamlit.

A possible cause might be the PyCaret pipeline or the use of Git LFS to push the large files to the Git repo.
To test whether Git LFS was the compatibility issue, the model was retrained on 50% and then 25% of the original training data to reduce the file sizes.
However, there was no improvement.

Thus, presuming that PyCaret has compatibility issues in deployment, and considering the time constraint, I decided to recreate the entire model and data pipeline in sklearn.

Recreating the model & data pipeline in sklearn
As I could not resolve the errors encountered, I recreated the whole data processing and modeling pipeline in sklearn. The Colab notebook documents the entire process for your reference and understanding.
The pickled model and pipelines were comparatively very light, which eliminated the need to use LFS to push these files to the GitHub repo.
One thing worth noting is the amount of coding involved in using sklearn compared to PyCaret.
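To give a feel for what that recreated pipeline might look like, here is a hedged sketch; the column lists, hyperparameters and file name are placeholders, not the exact configuration from my repo.

```python
# Hedged sketch: a lightweight sklearn preprocessing + LightGBM pipeline, pickled with joblib.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from lightgbm import LGBMClassifier

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([("prep", preprocess),
                 ("model", LGBMClassifier(n_estimators=500, learning_rate=0.05))])
pipe.fit(X_train, y_train)

joblib.dump(pipe, "pipeline.pkl")   # far smaller than the ~180 MB PyCaret artefact
```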

Importantly, the deployment on Heroku was now successful, reinforcing the suspicion that PyCaret or LFS had compatibility issues.

The detailed process of deployment
Before listing the steps involved in this iterative process, it is important to list the various deployment strategies I tried out, as this has a bearing on the whole cycle.

I tried out the following deployment options —

  • FastAPI + Heroku
  • FastAPI + AWS
  • Streamlit + Heroku
  • Streamlit + Azure
  • Streamlit + AWS

Initially, the model in the PyCaret framework was deployed using FastAPI on Heroku.
For this, all the code files and datasets, along with the auxiliary files, need to be pushed to the GitHub repository.
Owing to the large file sizes of the datasets and of PyCaret’s model and pipeline, pushing to Git through the CLI or desktop client was not possible, so Git LFS [Large File Storage] was used to push these large files to the repo. The size restrictions for GitHub web upload and CLI/desktop pushes are 25 MB and 100 MB respectively, whereas the files were around 300 MB.
The combination of LFS and PyCaret apparently has compatibility issues with either Streamlit or Heroku itself, as the deployed app crashed with an H10 error.

The dreaded H10 error while trying to deploy on Heroku (by author)

Searching the internet for this error in our environment context pointed to compatibility issues between LFS and possibly PyCaret.
Owing to the time constraint, detailed debugging was skipped in favor of reducing the dependency on LFS & PyCaret.
Thus, in order to create a lightweight model and overall pipeline, sklearn was used to build the whole framework from scratch.
The resulting pickled files were very lightweight, which eliminated the need for LFS.
By implementing the data flow pipeline in sklearn, the PyCaret library was dropped too.
With the data pipeline and the model fixed, the first deployment combination tried was FastAPI + Heroku.

FastAPI runs on uvicorn, an ASGI server, which makes it pretty fast, and it was very easy to set up. The auto-generated UI comes from OpenAPI (previously Swagger) and is pretty bare-bones. A more customized UI may have been possible using Jinja templates, but the time constraint compelled me to try out other options.
FastAPI is very easy to work with and, for my use case, its documentation is fairly elaborate. I deployed a few toy models and they were extremely fast and behaved as intended. However, the OpenAPI/Swagger UI was too plain for me and I could not use Jinja effectively to customize it, which made me try out Streamlit.
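For context, a toy FastAPI endpoint of the kind described above might look like this; the file layout, endpoint path and field names are illustrative assumptions, not my deployed code.

```python
# Hedged sketch: score an uploaded CSV with the pickled sklearn pipeline via FastAPI.
import io

import joblib
import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Loan default prediction")
pipe = joblib.load("pipeline.pkl")                      # placeholder path

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(io.BytesIO(await file.read()))
    proba = pipe.predict_proba(df)[:, 1]
    return {"default_probability": proba.round(4).tolist()}

# Run locally with:  uvicorn main:app --reload
```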

While using Streamlit, I tried out Azure, initially with the PyCaret pipeline. Setting up Azure, though fairly easy, takes a lot of time, both for provisioning the box and after connecting the Git repo.
The PyCaret model did not deploy successfully on Azure, throwing a memory-exceeded error, and the overall rebuild time made me look at AWS.
FastAPI + AWS and Streamlit + AWS were first tried with the PyCaret model, and deployment failed due to RAM consumption. The sklearn model was then deployed successfully on AWS using Streamlit.
The biggest advantage of AWS is that large files can easily be pushed to the remote box using file-transfer programs like WinSCP.

AWS has elaborate documentation and online resources, which make setting up EC2 instances quite easy for people with coding experience.
Deployment on AWS, though well documented (especially on external platforms), is relatively involved. I needed two additional pieces of software, PuTTY/PuTTYgen and WinSCP, for SSHing into the box and transferring files respectively.
The code’s RAM usage also needs to be well optimized, as a deployment that worked on Heroku failed on AWS due to RAM consumption. Thus, I kept the AWS implementation on the back burner and considered Heroku primary.

A summary of the pros & cons of each option I used while deploying my model

Table is compiled based on my experience & bias while deploying the PyCaret &/or sklearn (by author)

Finalized deployment platform
After trying out the combinations of platforms and services mentioned above, I opted for Streamlit + Heroku as the primary deployment method for the following reasons —

  • Streamlit allowed me to customize the app UI far more easily than FastAPI, at least to the extent I needed.
  • Heroku required the least amount of time for iterative deployment, and with the entire repository on GitHub, I could modify, build & redeploy from anywhere.

Basic architecture of the app system
The user interacts with the app from any browser on their local PC, via the Streamlit front end, by uploading a query CSV file containing the applicant data.

Simplified view of the model system architecture (by author)

The model hosted on the remote Heroku box computes the predictions and sends back the results, which are displayed in the user’s browser and can also be downloaded.
When the GitHub repo is linked to the Heroku app for the first time, the files are pulled into the Heroku box.

Highlights of the deployed app

App engagement

  • The app accepts the Home Credit applicant details in a CSV file, in the same format as the test dataset.
  • A downloadable template is provided for the user to enter data into.
  • Individual form fields are not provided owing to the large number of fields, which would result in an unpleasant UX.
  • The model’s predictions are displayed on screen as an interactive dataframe and offered as a downloadable CSV appended to the original query set.

Importantly, the following error handling is implemented —

  • The uploaded CSV is checked for correctness against the feature names required by the template and, in case of a mismatch, a message stating the same is displayed.
  • When a new, unseen categorical value is encountered in the query data, it is ignored; this is implemented by setting the encoder’s ‘handle_unknown’ parameter to ‘ignore’, which skips the unseen category values and proceeds.
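A condensed sketch of this upload-validate-predict flow in Streamlit is shown below; the file names, widget labels and column handling are illustrative assumptions, and the pickled pipeline is assumed to contain an encoder fitted with handle_unknown='ignore'.

```python
# Hedged sketch: Streamlit front end that validates the uploaded CSV and scores it.
import joblib
import pandas as pd
import streamlit as st

pipe = joblib.load("pipeline.pkl")                            # placeholder: sklearn pipeline
expected_cols = pd.read_csv("template.csv", nrows=0).columns  # placeholder: template header

st.title("Loan Defaulting Tendency Prediction")
uploaded = st.file_uploader("Upload applicant data (CSV)", type="csv")

if uploaded is not None:
    query = pd.read_csv(uploaded)
    missing = set(expected_cols) - set(query.columns)
    if missing:
        st.error(f"Uploaded file does not match the template; missing columns: {sorted(missing)}")
    else:
        query["Default_probability"] = pipe.predict_proba(query[list(expected_cols)])[:, 1]
        st.dataframe(query)                                   # interactive results table
        st.download_button("Download predictions",
                           query.to_csv(index=False).encode(),
                           file_name="predictions.csv")
```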

Scalability, Throughput, Latency and real-world case

  • The app was fed the raw Home Credit test dataset, consisting of around 50k applicant records with a file size of approximately 26 MB.
  • After the upload (which took around 5-30 seconds depending on internet connectivity), the app does the entire data processing and predicts the defaulting tendency for all applicants in less than 40 seconds.
  • In the real-world scenario, latency is not a strict requirement, so this is acceptable.
  • As for throughput, since the app can be run several times per day or even per application, throughput volume is not a limiting factor.

Visible limitations and scope for improvement/innovation

  • In the current implementation, the query data is converted to a Pandas dataframe and compared with the existing Bureau & previous Home Credit application data (also Pandas dataframes). Swapping this for a SQL database could be explored for more optimized and scalable performance, which ‘might’ improve system latency or reduce the memory requirement.
  • Going ahead, the PyCaret issue could also be debugged so that the low-code tool can be used as well.

The App Interface (finally!)

I have attached screen-grabs of the deployed app on my desktop.

App home screen interface (by author)
App home screen with prediction for a set (by author)
App interface with prediction & sidebar for introduction (by author)

Important information, resources & links

List of Python libraries used for this project (by author)

The GitHub repository for all the Colab notebooks along with a clear README for navigating through the same is here.

The GitHub repository for the deployed app with complete documentation & clear README for using the app is here.

The dataset used for training the model is Kaggle’s Home Credit Default Risk dataset, available here.

Finally, the app is deployed through Streamlit framework and hosted on Heroku here.

With this, we have reached the end of this article & the series on end-to-end implementation.

It was an exhilarating experience for me to document my journey, and I will be glad to know if this series made your own end-to-end implementation journey a bit easier 😁.

Thanks for sticking till the end and I hope to see you around.😀

Bouquets💐 & brickbats🧱 may be directed to me here.

