Can You Solve This Data Science Problem?
Machine learning model for forecasting loan status
Two years ago, I had a screening interview with a financial company that uses data science and analytics to predict the creditworthiness of its customers, that is, how likely they are to repay a loan in full. As part of the interview process, I was assigned a take-home challenge problem. The project description and instructions are given below.
The dataset for this problem can be downloaded from this GitHub repository.
The dataset looks simple but is tricky to work with: it has 50,000 rows and 2 columns, and many missing values. The problem itself is not straightforward either. You have to examine the dataset critically and then decide what model to use. The problem was to be solved within a week, and a formal project report and an R script or Jupyter notebook file had to be submitted.
At the time of writing, I don’t know the correct solution to this problem or what type of model would be best suited to it.
I would like to challenge you to try to solve this problem yourself and let me know what your solution is.
Model for forecasting loan status
Instructions: In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:
- First, the borrower receives the funds. This event is called the origination.
- The borrower then makes regular repayments until one of the following happens:
(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.
(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.
In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:
- The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
- For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.
We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.
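One way to read the two columns is as survival data: a loan with a blank charge-off entry has not necessarily survived, it simply has not been observed for its full term yet (it is right-censored). The sketch below is not the solution discussed later in this article. It is a minimal illustration of a Kaplan-Meier estimate under that reading. The column headers come from the problem statement; the sample rows, the function names, and the 1,095-day term are assumptions made for the example.

```python
import csv
import io

TERM_DAYS = 3 * 365  # assumption: the 3-year term is treated as 1,095 days


def load_loans(csv_file):
    """Return one (duration, charged_off) pair per loan."""
    loans = []
    for row in csv.DictReader(csv_file):
        age = int(float(row["days since origination"]))
        chargeoff = (row["days from origination to charge-off"] or "").strip()
        if chargeoff:
            # Charge-off was observed: the duration is the charge-off day
            loans.append((int(float(chargeoff)), True))
        else:
            # No charge-off yet: observed only up to the loan's age (censored)
            loans.append((min(age, TERM_DAYS), False))
    return loans


def km_chargeoff_fraction(loans, horizon=TERM_DAYS):
    """Kaplan-Meier estimate of the probability of charge-off by `horizon`."""
    events, exits = {}, {}
    for duration, charged_off in loans:
        exits[duration] = exits.get(duration, 0) + 1
        if charged_off:
            events[duration] = events.get(duration, 0) + 1

    at_risk = len(loans)
    survival = 1.0
    for day in sorted(exits):
        if day > horizon:
            break
        d = events.get(day, 0)
        if d:
            # Fraction of loans still at risk that survive this day
            survival *= 1.0 - d / at_risk
        at_risk -= exits[day]  # charged-off and censored loans leave the risk set
    return 1.0 - survival


# Made-up sample rows, only to show the expected file layout
SAMPLE = """days since origination,days from origination to charge-off
100,
200,50
300,150
400,
"""

loans = load_loans(io.StringIO(SAMPLE))
naive = sum(1 for _, charged_off in loans if charged_off) / len(loans)
print("observed so far:", naive)
print("Kaplan-Meier   :", km_chargeoff_fraction(loans))
```

The fraction already charged off is only a lower bound, because young loans can still charge off later. Note also that a Kaplan-Meier curve stops at the age of the oldest loan in the data; if no loan has been observed for the full 3 years, reaching the end of the term requires an extra, explicitly stated assumption about how charge-offs continue.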
My Version of the Solution
This particular problem does not have a unique solution. I attempted a solution using probabilistic modeling based on Monte Carlo simulation. The question to answer was: what fraction of these loans will have charged off by the time all of their 3-year terms are finished?
My model estimated the fraction of loans that will have charged off after 3 years to be 14.8% ± 0.2% (95% confidence interval).
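To make the idea of a simulation-based interval concrete, here is a minimal sketch of one common resampling scheme, the percentile bootstrap. It is not the simulation I actually ran, and it does not reproduce the number above. The function names and the data are made up for illustration, and a plain observed fraction stands in for whatever estimator your model uses.

```python
import random


def bootstrap_interval(sample, estimator, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for `estimator` evaluated on `sample`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    # Re-estimate on many resamples drawn with replacement
    estimates = sorted(
        estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    low = estimates[int(tail * n_resamples)]
    high = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return low, high


def observed_fraction(flags):
    """Share of loans flagged as charged off (stand-in estimator)."""
    return sum(flags) / len(flags)


# Made-up data: 150 charged-off loans out of 1,000
flags = [True] * 150 + [False] * 850
low, high = bootstrap_interval(flags, observed_fraction)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval only reflects sampling noise in the loans at hand. It says nothing about whether the modeling assumptions behind the estimator are right, which is usually the larger source of disagreement between solutions.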
I gave this problem to two friends who are aspiring data scientists. They obtained 6.8% and 70%, respectively, as the percentage of loans that will have charged off by the end of the 3-year term. The three estimates (6.8%, 14.8%, and 70%) therefore differ widely.
To view the dataset and my attempted solution, please see the following link:
Please email your comments and a version of your solution to email@example.com
For questions and inquiries, please email me: firstname.lastname@example.org