A New Comprehensive Case Study for Data Science Classrooms

With published data from peer-to-peer lenders, data science educators craft new modules for students and teachers

To aid data science educators at the undergraduate and graduate levels, Foster Provost, Professor of Information Systems and Data Science; Maxime C. Cohen and Kevin Jiao of NYU Stern’s Department of Information, Operations, and Management Sciences; and C. Daniel Guetta of Columbia Business School have prepared new case study materials for use in data science classes. Their case study uses real data made freely available by the peer-to-peer lending platform LendingClub. The task seems straightforward: help a hypothetical investor build a portfolio that earns the highest possible return while respecting risk tolerance, budget constraints, and diversification requirements. But doing so requires careful design and analysis; even calculating the return is not straightforward.

Peer-to-peer lending is the practice of lending money to individuals or small businesses through online services that match anonymous lenders with borrowers. This type of lending usually offers higher returns than traditional investments, but lenders face the risk of the borrower defaulting. Interest rates are set by an intermediary platform based on credit scores and annual incomes, and the platform charges a fee.

LendingClub and Prosper, the two largest peer-to-peer lending platforms in the U.S., provide potential investors free access to historical data. The data sets include “comprehensive information on all loans issued between 2007 and the third quarter of 2017 (a new updated data set is made available every quarter).” The data set used in this case study contains more than 750,000 loan listings with a total value exceeding $10.7 billion.

With this educational case study, Provost and collaborators aim to answer a need for comprehensive case studies in the classroom. Students download the LendingClub dataset to interact with real data and build predictive models. Investors should split the data into two parts: one part used to make decisions, and a second part used to evaluate those decisions. Unanticipated complexities arise, however, such as varying loan lifespans, loans being paid off early, and varying lengths of time before default.
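As a rough illustration of the two-part split described above, here is a minimal Python sketch. It is not taken from the case study materials: the function name `split_loans`, the 30% evaluation fraction, and the toy record fields are all assumptions made for the example, and the real LendingClub columns may be named differently.

```python
import random


def split_loans(loans, eval_fraction=0.3, seed=0):
    """Shuffle loans reproducibly and return (decision_set, evaluation_set)."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(loans)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_fraction))
    evaluation_set = shuffled[:n_eval]  # held back to judge the decisions
    decision_set = shuffled[n_eval:]    # used to build models and pick loans
    return decision_set, evaluation_set


# Hypothetical toy records standing in for historical loan listings.
loans = [{"loan_id": i, "amount": 1000 * (i % 5 + 1)} for i in range(10)]
decision_set, evaluation_set = split_loans(loans, eval_fraction=0.3, seed=42)
print(len(decision_set), len(evaluation_set))
```

A random split is only one option. Because loans are issued over time, an instructor might instead split by issue date so that decisions are always evaluated on later loans.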

The researchers caution that building a good predictive model is different from applying one, and they stress the importance of additional in-depth analysis. Provost emphasizes, “Unlike what one often sees in data science examples, we do not stop by evaluating the predictive performance of the models. We take the next step of evaluating how well the models will actually work at solving the ultimate task, and how one might actually use them to optimize decision making.”

To this end, the researchers structure their case study as six modules that can be taught together or separately. Broadly, the modules cover identifying objectives, data ingestion and cleaning, data exploration, building predictive models for default, devising investment strategies, and optimization.

The case study highlights the particular problem of leakage, “a situation in which a model is built using data that will not be available at the time the model will be used to make a prediction.” For example, an investor might use a model that correlates payment amount with default or early repayment, but payment amount won’t be available when investing in future loans.
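One simple guard against the leakage described above is to keep an explicit list of the fields an investor would actually know when choosing a loan, and to drop everything else before modeling. The sketch below is illustrative only: the column names and the choice of which fields count as "known at investment time" are assumptions, not the case study's actual schema.

```python
# Hypothetical column names; which fields are known up front is an assumption.
KNOWN_AT_INVESTMENT_TIME = {
    "loan_amount",
    "interest_rate",
    "annual_income",
    "credit_score",
}


def remove_leaky_features(record, allowed=KNOWN_AT_INVESTMENT_TIME):
    """Keep only the fields that would be available when choosing a loan."""
    return {name: value for name, value in record.items() if name in allowed}


def leaked_fields(record, allowed=KNOWN_AT_INVESTMENT_TIME, target="defaulted"):
    """List fields that are neither allowed features nor the prediction target."""
    return sorted(
        name for name in record if name not in allowed and name != target
    )


historical_loan = {
    "loan_amount": 12000,
    "interest_rate": 0.13,
    "annual_income": 58000,
    "credit_score": 690,
    "payment_amount": 4300,  # only known after the loan has run: leaks the outcome
    "defaulted": True,       # the label a default model would try to predict
}

features = remove_leaky_features(historical_loan)
print(leaked_fields(historical_loan))
```

An allow-list is used here rather than a block-list so that a newly added column is excluded by default until someone confirms it is available at investment time.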

Provost and collaborators hope their case study — which is now online — will spur similar approaches to comprehensive, interactive data science education in university classrooms.

By Paul Oliver

Center for Data Science

This is the official research blog of the NYU Center for Data Science (CDS). Established in 2013, CDS is a leading data science training and research facility, offering an M.S. in Data Science and, as of 2017, one of the nation’s first Ph.D. programs in Data Science.
