PROSPER LOAN DATA: EXPLORATORY AND EXPLANATORY DATA VISUALIZATION

Samueloyedele
8 min readNov 3, 2023

--

Introduction

This is the dataset about loans record with various loan variables. The dataset contains 113,937 loans with 81 variables including loan amount, loan status, term, prosper score, monthly loan payment, stated monthly income, income range, term and many others.

The dataset was provided by Udacity as part of Data Analytics Nanodegree program. The dataset was performed using Python libraries (pandas, matplotlib, numpy, seaborn).

The dataset can be found in the here

Here is the link to the data dictionary

Objective

The main objective is to explore the features / variables that are best for predicting loan amounts for each borrower in the dataset.

Data Wrangling

In this section, the dataset was gathered, accessed it to inspect any issues or outliers in the data and clean the data for better exploratory visualization.

Gather

Firstly, imported the necessary libraries (numpy, pandas, matplotlib, seaborn) and loaded the dataset to a dataframe (loans) using Pandas library.

A subset dataset of 19 variables was created from the loan dataframe. Some of the variables are: Term, LoanStatus, BorrowerRate, ProsperScore, IncomeRange, IsBorrowerHomeowner,DebtToIncomeRatio, StatedMonthlyIncome, LoanOriginalAmount, MonthlyLoanPayment.

Assess

In this section, the dataset was assessed to inspect any quality or tidiness issues in the data using programmatic assessments.

  • Use Pandas shape function to get the dimension (row and column) of the data
  • Use Pandas infofunction to get the columns information and their data ypes.
  • Use Pandas isnull and sum to get sum of null / missing values in each column.

From the above result, they are null / missing values in BorrowerState, ProsperRating (Alpha), ProsperScore, Occupation, EmploymentStatus, EmploymentStatusDuration and DebtToIncomeRatio columns.

  • Use Pandas dupliacted and sum functions to get the sum of duplicated values in the data.
  • Use Pandas value_counts function to count the values in the ProsperScore column.

Quality Issues

  • Missing values in BorrowerState, ProsperRating (Alpha), ProsperScore, Occupation, EmploymentStatus, EmploymentStatusDuration, DebtToIncomeRatio
  • Replace value (True - Homeowner, False - No Homeowner) in IsBorrowerHomeowner
  • Replace the values of `ListingCategory (numeric)
  • Convert StatedMonthlyIncome to nearest whole number
  • Extract columns (month, year) from LoanOriginalDate
  • Convert the data types of Employment Status, ProsperScore, ListingCategory (numeric), IsBorrowerHomeowner
  • Rename some unclean variables
  • Convert Loan Status, Prosper Score, Prosper Score Alpha, Loan Original Month, Year, Income Range into ordered categorical type

Clean

After assessment, the dataset quality issues are cleaned following this step: Define, Code and Test.

Some of the issues cleaned:

Missing values in BorrowerState, ProsperRating (Alpha), ProsperScore, Occupation, EmploymentStatus, EmploymentStatusDuration, DebtToIncomeRatio

  • filter for rows without null / missing values in ProsperRating (Alpha), ProsperScore, EmploymentStatusDuration columns using Pandas notnull function.
  • Fill the null / missing values in Occupation columns with ‘Other’ values for simplicity using Pandas fillna function.

Replace value (True — Homeowner, False — No Homeowner) in IsBorrowerHomeowner

  • Replace the values (True, False) with (Homeowner, Not Homeowner) using Pandas replace function.

Extract columns (month, year) from LoanOriginalDate

  • Firstly, convert the LoanOriginalDate column to datatime data type using Pandas to_datetime function
  • Extract year from the LoanOriginalDate column using .year accessor from Pandas .dt accessor.
  • Extract the month from the LoanOriginalDate column using .month_name() accessor with .str[:3] function to get only the first three characters of the month name.

Overview of the Clean Loan Data

Structure of the cleaned dataset

They are 77,543 loans in the dataset with 20 variables. Most variables are numeric in nature. but variables; Loan Status, Prosper Score, Prosper Score Alpha, Loan Original Month, Year, Income Range are categorical ordered variables with this following level:

Ranking from (low) — (high) :

LoanStatus: Chargedoff, Completed, Defaulted, FinalPaymentInProgress, Past Due (1–15 days), Past Due (16–30 days), Past Due (31–60 days), Past Due (61–90 days), Past Due (91–120 days), Past Due (>120 days)

ProsperScore: 1,2,3,4,5,6,7,8,9,10,11

ProsperRatingAlpha: HR, E, D, C, B, A, AA

LoanOriginationMonth: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec

LoanOriginationYear: 2009, 2010, 2011, 2012, 2013, 2014

IncomeRange: Not employed, 1−24,999, 25,000–49,999, 50,000−74,999, 75,000–99,999, $100,000+

Exploratory Visualization

In this section, the cleaned dataset was visualized to explore the relationship between the numerical and categorical variables.

Univariate Exploration

Firstly, Univariate Visualizations are visualized to investigate the distribution of individual variables:

Main variable of interest: LoanOriginalAmount — to explore the frequency of loan amount given.

From the distribution above, the most frequency of loan amounts are $5000, $10000 and $15000.

The other interest (predictor) variables

  • IncomeRange — A bar plot using seaborn countplot

From the distribution above, it shows that borrowers with no incomes are disqualified from collecting loans.

  • DebtToIncomeRatio — A histogram plot on a smaller bin size using matplotlib plot.

From the distribution above, borrowers with low debt to income ratio less than 0.36 are given more loans.

  • StatedMonthlyincome — A histogram plot on a larger bin size using matplotlib plot. During the plotting, they were outliers in the data. The outliers are removed using quantile function to produce this visualization:

Plot with outliers:

After removing outliers:

From the distribution above, high stated monthly income from 8,500 tend to have low count of loans. Maybe it’s due to the loan amounts given out? Will investigate more using bivariate visualization.

Bivariate Exploration

Secondly, Bivariate Visualizations are visualized to investigate the relationship between two variables.

  • Correlation between numerical variables: using seaborn heatmap plot to visualized their relationship.

From the above map, there is a strong relationship between LoanOriginalAmount and MonthlyLoanPayment (0.9) which implies the higher the loan amount, the higher the monthly loan payment. There is a weak postive relationship between LoanOrginalAmount and StatedMonthlyIncome (0.414)

  • The relationship loan original amount and monthly loan payment
  • How does the income range affect the loan amount?

From the visualization above, income range has an effect on the loan original amount, borrowers with high income range tend to get high loan amounts.

  • Does high monthly income guarantee high loan amount?

From the visualization above, borrowers with high monthly income tend to get high loans.

  • Does DebtToIncomeRatio has an impact on the loan amount?

From the visualization above, borrowers with low debt to income ratio (< 0.36) get high loan amounts.

Multivariate Exploration

Thirdly, Multivariate Visualizations are visualized to investigate the relationship between three or more variables.

  • How does categorical variable affect the loan amount and monthly loan payment?

Explanation: Faceted heat maps are created for categorical variables (LoanOriginationYear, ProsperScore, IsBorrowerHomeOwner) to visualized their relationship with loan original amount and monthly loan payment.

The relationship between Loan Original Amount and Month Loan Payment via LoanOriginationYear:

From the visualization above, there is an increase in loan original amount and monthly loan payment across the year.

  • The debt ratio income loans across the loan origination year and prosper score.

Explanation: The debt ratio income variable is filtered to within (0.1–1) ratio for better visualization.

  • The relationship between loan original amount and monthly loan payment via term.

Summary of Findings

  • The loan original amount distribution is trimodal, with one peak at 5000, second peak at 10000, and third peak at 15000.
  • In Stated Monthly Income, there was outliers in the data which was filtered out. After removing the outliers, they are high spikes in frequency at 4000, 4500, 5000 stated monthly income.
  • In Employment Status, employed people tend to get the most loans compared to other categories.
  • More loans are given in 36 term (3years). Over 60% of loans given to borrower are home owner. most of the loans are still on current status. 2013 is the year with the most frequency of loans.
  • High loan amounts are issued out when there’s low debt income ratio (< 0.36). The lower the debt to income ratio, the higher the loan original amount.
  • Most loans are collected based on debt consolidation which is more than other listing category. This implies that most borrowers collect loans to pay up their debts.
  • There is a strong relationship between the loan original amount and income range. The higher the income ranges the higher the loan amount given.
  • There is a strong positive correlation between loan original amount and monthly loan payment. The higher the loan amount, the higher the monthly loan payment.

Conclusion

This project contains two files: The first file exploration which contain the structure of the project and exploratory visualizations. The second file slide_deck for explanatory visualizations which convey and communicate the findings (explorations).

From this project, I have been able learn about quantitative and qualitative data and how to use them for exploratory and explanatory visualizations. Generally, it has strengthened my data visualization skills.

You can check out the full project on my Github repository and other project on my website portfolio.

Recommendations

· The loan term should be extended to prevent overdue of loan payments for borrowers that depend on debt consolidation.

· More loans should be given to borrowers that depend on business to help borrowers grow their businesses.

· Self-employed borrowers with feasible business should be given more loans. Most self-employed people make more money than employed people.

· An additional interest rate should be implemented on overdue loans which will help reduce past due payments.

Thank you for taking your time to read it, kindly comment your suggestion on this project. I will also appreciate a like/follow from you.

--

--