Can You Solve This Data Science Problem?

Machine learning model for forecasting loan status

Benjamin Obi Tayo Ph.D.
Nov 21 · 3 min read
Photo by Karla Hernandez on Unsplash

Two years ago, I had a screening interview with a financial company that uses data science and analytics to predict the creditworthiness of its customers, that is, how likely they are to repay a loan in full. As part of the interview process, I was assigned a take-home challenge problem. The project description and instructions are given below.

The dataset for this problem can be downloaded from this GitHub repository.

The dataset here is complex (50,000 rows and 2 columns, with lots of missing values), and the problem is not straightforward. You have to examine the dataset critically and then decide what model to use. The problem was to be solved within a week. The assignment also required a formal project report and an R script or Jupyter notebook file.

At the time of writing, I don’t know the correct solution to this problem or what type of model is best suited to it.

I would like to challenge you to try to solve this problem yourself and let me know what your solution is.

Instructions: In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:

  • First, the borrower receives the funds. This event is called the origination.
  • The borrower then makes regular repayments until one of the following happens:

(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.

(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.

In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:

  • The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
  • For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.
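As a starting point, the two columns described above can be parsed into (days observed, days to charge-off) pairs, with a blank second field treated as "no charge-off observed yet". The sketch below is only an illustration: the inline sample rows are hypothetical stand-ins for the real CSV, and the helper name load_loans is made up here. The fraction it prints counts only charge-offs that have already happened, so it is at best a lower bound on the answer the problem asks for.

```python
import csv
import io

# Hypothetical sample in the layout the instructions describe.
# A blank second field means no charge-off had been observed when the data was collected.
SAMPLE_CSV = """days since origination,days from origination to charge-off
109,
679,
723,
509,
254,
730,327
391,
420,68
650,
505,421
"""

def load_loans(text):
    """Return a list of (days_observed, days_to_chargeoff_or_None) tuples."""
    loans = []
    for row in csv.DictReader(io.StringIO(text)):
        observed = int(row["days since origination"])
        raw = (row["days from origination to charge-off"] or "").strip()
        chargeoff = int(raw) if raw else None
        loans.append((observed, chargeoff))
    return loans

loans = load_loans(SAMPLE_CSV)

# Loans with a recorded charge-off day have already charged off.
n_charged_off = sum(1 for _, chargeoff in loans if chargeoff is not None)
observed_fraction = n_charged_off / len(loans)

print(f"{n_charged_off} of {len(loans)} loans charged off so far "
      f"({observed_fraction:.1%})")
```

To run this against the real file, the text passed to load_loans would come from reading the downloaded CSV instead of SAMPLE_CSV.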

We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.

This particular problem does not have a unique solution. I attempted a solution using probabilistic modeling based on Monte Carlo simulation to answer the question: what fraction of these loans will have charged off by the time all of their 3-year terms are finished?

My model produced a 95% confidence interval of 14.8% ± 0.2% for the fraction of loans that will have charged off by the end of their 3-year terms.
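For readers looking for a baseline to compare against, one standard way to handle loans that are still being repaid (right-censored observations) is the Kaplan-Meier estimator. The sketch below is not the Monte Carlo model mentioned above. It is a minimal standard-library illustration on made-up data, and the function name kaplan_meier_chargeoff is hypothetical. It assumes all loans share the same charge-off behavior and that the data collection date is unrelated to a loan's risk.

```python
from collections import Counter

def kaplan_meier_chargeoff(loans, horizon_days):
    """Estimate P(charge-off by horizon_days) from right-censored loans.

    loans: iterable of (days_observed, days_to_chargeoff_or_None) tuples.
    """
    events = Counter()  # charge-offs per day
    exits = Counter()   # loans leaving the risk set per day (charge-off or censoring)
    for observed, chargeoff in loans:
        if chargeoff is not None:
            events[chargeoff] += 1
            exits[chargeoff] += 1
        else:
            # Still repaying at collection time: censored at days_observed.
            exits[observed] += 1

    at_risk = sum(exits.values())
    survival = 1.0
    for day in sorted(exits):
        if day > horizon_days:
            break
        if events[day]:
            # Multiply by the share of at-risk loans that survive this day.
            survival *= 1.0 - events[day] / at_risk
        at_risk -= exits[day]
    return 1.0 - survival

# Made-up toy data: two charge-offs (days 50 and 150), three loans still active.
toy_loans = [(100, 50), (120, None), (300, 150), (400, None), (400, None)]

naive = sum(1 for _, c in toy_loans if c is not None) / len(toy_loans)
km = kaplan_meier_chargeoff(toy_loans, horizon_days=3 * 365)

print(f"naive fraction: {naive:.3f}, Kaplan-Meier estimate: {km:.3f}")
```

The Kaplan-Meier curve stays flat after the last observed charge-off, so if no loan in the data has been followed for the full 3 years, the value at 1,095 days still understates the final fraction. Going further requires an explicit assumption about how charge-off risk behaves beyond the observed window, which is one reason different modeling choices can give very different answers.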

I gave this problem to two friends who are aspiring data scientists. They obtained 6.8% and 70%, respectively, as the percentage of loans that will have charged off by the end of the 3-year loan term. The three models therefore disagree widely in their predictions.

To view the dataset and my attempted solution, please see the following link:

https://github.com/bot13956/Monte_Carlo_Simulation_Loan_Status

Please email me comments and a version of your solution to the following email address: benjaminobi@gmail.com

How Much Math do I need in Data Science?

Data Science Curriculum

5 Best Degrees for Getting into Data Science

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

A Data Science Portfolio is More Valuable than a Resume

For questions and inquiries, please email me: benjaminobi@gmail.com

The Startup

Medium's largest active publication, followed by +733K people. Follow to join our community.

Benjamin Obi Tayo Ph.D.

Written by

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Predictive Analytics, Materials Sciences, Biophysics
