A New Comprehensive Case Study for Data Science Classrooms

With published data from peer-to-peer lenders, data science educators craft new modules for students and teachers

To aid data science educators at the undergraduate and graduate levels, Foster Provost, Professor of Information Systems and Data Science; Maxime C. Cohen and Kevin Jiao of NYU Stern’s Department of Information, Operations, and Management Sciences; and C. Daniel Guetta of Columbia Business School have prepared new case study materials for use in data science classes. Their case study involves real data made freely available by the peer-to-peer lending platform LendingClub. The task seems straightforward: help a hypothetical investor build a portfolio that earns the highest possible return while respecting risk tolerance, budget constraints, and diversification requirements. But this requires complex design and analysis; even calculating the return is not straightforward.

Peer-to-peer lending is the practice of lending money to individuals or small businesses through online services that match anonymous lenders with borrowers. This type of lending usually offers higher returns than traditional investments, but lenders face the risk of the borrower defaulting. Interest rates are set by an intermediary platform based on credit scores and annual incomes, and the platform charges a fee.

LendingClub and Prosper, the two largest peer-to-peer lending platforms in the U.S., provide potential investors free access to historical data. The data sets include “comprehensive information on all loans issued between 2007 and the third quarter of 2017 (a new updated data set is made available every quarter).” The data set used in this case study contains more than 750,000 loan listings with a total value exceeding $10.7 billion.

With this educational case study, Provost and collaborators aim to meet a need for comprehensive case studies in the classroom. Students download the LendingClub dataset to work with real data and build predictive models. Investors should split the data into two parts: one to make decisions and a second to evaluate those decisions. Unanticipated complexities can arise, however, such as varying loan lifespans, loans being paid off early, or varying lengths of time before default.
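The two-part split described above can be sketched in a few lines of Python. This is only an illustration, not the case study's own code: the function name `split_loans`, the `decision_fraction` parameter, the random shuffle, and the toy loan records are all assumptions made for the example.

```python
import random


def split_loans(loans, decision_fraction=0.7, seed=0):
    """Shuffle loans and split them into a decision set and an evaluation set.

    Hypothetical helper: the random shuffle and the 70/30 default are
    assumptions for illustration, not the case study's procedure.
    """
    if not 0.0 < decision_fraction < 1.0:
        raise ValueError("decision_fraction must be strictly between 0 and 1")
    shuffled = list(loans)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * decision_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records standing in for real loan listings (made-up fields and values).
loans = [{"loan_id": i, "defaulted": i % 5 == 0} for i in range(10)]
decision_set, evaluation_set = split_loans(loans)
```

Models and investment rules would be built only on `decision_set`, with `evaluation_set` held back to check how those decisions would have performed.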

The researchers caution that there is a difference between building a good predictive model and applying one — they stress the importance of additional in-depth analysis. Provost emphasizes, “Unlike what one often sees in data science examples, we do not stop by evaluating the predictive performance of the models. We take the next step of evaluating how well the models will actually work at solving the ultimate task, and how one might actually use them to optimize decision making.”

To this end, the researchers structure their case study as six modules that can be taught together or separately. Broadly, the modules cover identifying objectives, data ingestion and cleaning, data exploration, building predictive models for default, devising investment strategies, and optimization.

The case study highlights the particular problem of leakage, “a situation in which a model is built using data that will not be available at the time the model will be used to make a prediction.” For example, an investor might use a model that correlates payment amount with default or early repayment, but payment amount won’t be available when investing in future loans.
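One common guard against leakage is to keep only the fields that are known when the investment decision is made. The sketch below illustrates that idea in Python. The field names (`loan_amount`, `interest_rate`, `annual_income`, `total_payment`) and the helper `drop_leaky_fields` are hypothetical and are not taken from the LendingClub data or the case study materials.

```python
# Fields assumed, for illustration only, to be known when a loan is listed.
KNOWN_AT_ISSUE = frozenset({"loan_amount", "interest_rate", "annual_income"})


def drop_leaky_fields(record, allowed=KNOWN_AT_ISSUE):
    """Return a copy of the record restricted to fields available at decision time."""
    return {name: value for name, value in record.items() if name in allowed}


# A made-up loan record; "total_payment" is only observed after the loan is issued.
loan = {
    "loan_amount": 10000,
    "interest_rate": 0.12,
    "annual_income": 55000,
    "total_payment": 3200.0,
}
features = drop_leaky_fields(loan)
```

Using an explicit allow-list, rather than a list of fields to exclude, means any new column is left out of the model until someone confirms it is available at investment time.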

Provost and collaborators hope their case study — which is now online — will spur similar approaches to comprehensive, interactive data science education in university classrooms.

By Paul Oliver