Understanding LendingClub with Databricks

Shreshth Yash
Jul 8, 2020 · 7 min read


Introduction to Databricks

Databricks is a managed platform for running Apache Spark, which means you do not have to learn complex cluster management concepts or perform tedious maintenance tasks to take advantage of Spark. Databricks also provides a host of features to help its users be more productive with Spark. It's a point-and-click platform for those who prefer a user interface, such as data scientists or data analysts. Follow this link for a gentle introduction to the Databricks and Apache Spark environment.

Some of the amazing things Spark offers:

· Fast processing of data

· Use Python and its libraries

· If you are good with SQL or Scala, you can use those alongside Python as well

We'll cover both Python and SQL, along with the typical functions Databricks offers for analysis.

Simple steps to set up the Databricks environment for Lending Club analysis:

1. Fire up a cluster

2. Upload the data

3. Initiate a notebook

How does Lending Club operate?

Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds to potential borrowers and earn profits depending on the risk they take (the credit score of the borrowers). Lending Club acts as a third party between investors and borrowers. Borrowers' profiles are shared with the investors, who decide whether to invest after checking the risk involved. Once the borrower receives the loan, it is repaid in monthly installments at the interest rate they agreed on.

Dataset

The data set is taken from Kaggle

For now, we've taken one quarter of the data to perform data cleaning and further analysis. We start by setting up a Spark DataFrame that reads the CSV file.
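The original notebook cell isn't reproduced here, but a minimal sketch could look like the following (the DBFS path and file name are placeholders for wherever the uploaded CSV landed):

```python
# Read the uploaded LendingClub CSV into a Spark DataFrame.
# The path below is a placeholder for the actual upload location in DBFS.
df = spark.read.csv(
    "/FileStore/tables/lending_club_q1.csv",
    header=True,        # first row contains column names
    inferSchema=True    # let Spark guess the datatypes
)
```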

Let's check the schema of the DataFrame to analyze the datatypes of all the features.
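In PySpark this is a one-liner:

```python
# Print column names and inferred datatypes
df.printSchema()
```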

To start using SQL queries, we first need to create a table. We can then use SQL in the same easy way, just by adding '%sql' at the start of the cell.
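A sketch of that setup, assuming we register the DataFrame as a temporary view named "loans" (the view name is our choice, not from the original notebook):

```python
# Register the DataFrame as a temporary view so SQL cells can query it
df.createOrReplaceTempView("loans")
```

Then, in a new cell:

```sql
%sql
-- The %sql magic switches the cell language to SQL
SELECT * FROM loans LIMIT 10
```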

What our important features mean (data dictionary)

· term — The number of payments on the loan.

· home_ownership — The homeownership status provided by the borrower: RENT, OWN, MORTGAGE, OTHER.

· grade — LC assigned loan grade

· purpose — A category provided by the borrower for the loan request.

· int_rate — Interest rate on the loan

· addr_state — The state provided by the borrower in the loan application

· loan_status — Current status of the loan

· application_type — loan is an individual application or a joint application

· loan_amnt — The listed amount of the loan applied for by the borrower.

· emp_length — Employment length in years.

· annual_inc — The self-reported annual income provided by the borrower.

· dti — Debt-to-income ratio: the borrower's total monthly debt payments on the total debt obligations, divided by self-reported monthly income.

· dti_joint — The same ratio computed using the co-borrowers' total monthly payments on the total debt obligations

· delinq_2yrs — The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

· revol_util — Revolving line utilization rate

· total_acc — The total number of credit lines currently in the borrower's credit file

· num_tl_90g_dpd_24m — Number of accounts 90 or more days past due.

Collecting all the important features in another data frame
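A sketch of that selection, using the snake_case column names from the Kaggle schema (the exact list depends on which features you keep):

```python
# Keep only the features from the data dictionary above
features = ["term", "home_ownership", "grade", "purpose", "int_rate",
            "addr_state", "loan_status", "application_type", "loan_amnt",
            "emp_length", "annual_inc", "dti", "dti_joint", "delinq_2yrs",
            "revol_util", "total_acc", "num_tl_90g_dpd_24m"]
loan_df = df.select(features)
```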

The describe function gives the count, mean, standard deviation, min, and max of the features selected from the data frame.
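For example, on a few numeric features:

```python
# Summary statistics (count, mean, stddev, min, max)
loan_df.describe("loan_amnt", "annual_inc", "dti").show()
```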

Data Cleaning

It is the most tedious part of the process and takes almost 75% of the time needed to generate a good analysis.

Cleaning the employment length and term columns of the borrower using regular expressions and putting the results into separate columns, as shown below.
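A sketch of that cleaning step, assuming term values look like "36 months" and emp_length values like "10+ years" (the cleaned column names are placeholders):

```python
from pyspark.sql.functions import regexp_replace, regexp_extract, col

# Strip the "months" suffix from term, and pull the digits out of emp_length
loan_df = (loan_df
    .withColumn("term_cleaned", regexp_replace(col("term"), "months", ""))
    .withColumn("emp_length_cleaned", regexp_extract(col("emp_length"), r"\d+", 0)))
```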

Comparing the cleaned columns with the original columns
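For example:

```python
# Side-by-side view of the raw and cleaned columns
loan_df.select("term", "term_cleaned", "emp_length", "emp_length_cleaned").show(5)
```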

We saw above that the data types of the majority of columns are string, so we convert some of them to double using the cast function.
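A sketch, using int_rate and revol_util as examples (if these still carry a "%" suffix in your extract, strip it with regexp_replace first):

```python
from pyspark.sql.functions import col

# Cast string columns holding numeric values to double
loan_df = (loan_df
    .withColumn("int_rate", col("int_rate").cast("double"))
    .withColumn("revol_util", col("revol_util").cast("double")))
```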

Checking our target column, "loan_status".
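A quick frequency count shows what we're working with:

```python
# Distribution of the raw loan_status values
loan_df.groupBy("loan_status").count().orderBy("count", ascending=False).show()
```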

Here we can see that statuses like Late (31–120 days), In Grace Period, Late (16–30 days), and Charged Off suggest that the customer's loan is going bad. We'll merge these into a single flag by creating a new column, "bad_loan". Loans whose status is Current or Fully Paid get the value "No", and all the rest get "Yes".
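A sketch of that flag, assuming the status strings match "Current" and "Fully Paid" exactly:

```python
from pyspark.sql.functions import when, col

# Flag anything that is not Current or Fully Paid as a bad loan
loan_df = loan_df.withColumn(
    "bad_loan",
    when(col("loan_status").isin("Current", "Fully Paid"), "No").otherwise("Yes"))
```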

That gives us the final bad_loan column. There is a lot more data cleaning done, and multiple columns are dropped after cleaning; the whole notebook can be found in my GitHub repo.

Exploratory Data Analysis

Let's analyze bad_loan with respect to the grade column using the crosstab function, which is very similar to a pivot table.
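In PySpark:

```python
# Contingency table of grade vs. bad_loan, similar to a pivot table
loan_df.crosstab("grade", "bad_loan").show()
```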

It can be seen that as the grade moves from A toward G, the count of bad loans also increases.

Next is to check what purposes borrowers use the loans for.
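For example:

```python
from pyspark.sql.functions import col

# Loan counts by purpose, most common first
loan_df.groupBy("purpose").count().orderBy(col("count").desc()).show()
```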

It can be clearly seen above that the majority of loans are used to cover the borrower's other debts: debt consolidation and credit card bills are paid off through these loans, so that the borrower's credit score does not get impacted.

Databricks gives an option to visualize data at the same time, so we'll check the total loan amount given to borrowers based on grades.
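In SQL, assuming the cleaned DataFrame has been registered (or re-registered) as the hypothetical "loans" view from earlier, the query behind such a chart might look like this; Databricks renders the result as a bar chart with one click:

```sql
%sql
SELECT grade, SUM(loan_amnt) AS total_loan_amount
FROM loans
GROUP BY grade
ORDER BY grade
```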

Checking the state-wise total loan amount
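A sketch of the query, using the same hypothetical view; Databricks' built-in map visualization works well with the two-letter state codes:

```sql
%sql
SELECT addr_state, SUM(loan_amnt) AS total_loan_amount
FROM loans
GROUP BY addr_state
```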

Here we can see that Florida, California, and Texas are among the states where most of the loans are taken. Iowa has no loans at all.

Now let’s check which states have the maximum bad loans
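For example, filtering on the bad_loan flag we created:

```sql
%sql
SELECT addr_state, COUNT(*) AS bad_loan_count
FROM loans
WHERE bad_loan = 'Yes'
GROUP BY addr_state
ORDER BY bad_loan_count DESC
```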

It can be seen that California has the maximum number of defaulters, followed by Florida and New York.

Looking at the graph below, grades A and B have fewer defaulters compared to grades F and G.

Loan amount distribution

Let's check how much loan is usually taken by customers. Here is a histogram of loan_amnt to see the distribution.
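Databricks can draw this directly from display(), but a Matplotlib version (converting a single column to pandas first) might look like:

```python
import matplotlib.pyplot as plt

# Pull just the loan amounts to the driver and plot a histogram
loan_amounts = loan_df.select("loan_amnt").toPandas()
loan_amounts["loan_amnt"].hist(bins=50)
plt.xlabel("Loan amount")
plt.ylabel("Count")
plt.title("Distribution of loan amounts")
plt.show()
```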

The majority of loan amounts given to borrowers are in the range of 10k to 15k. We can also see that the distribution is right-skewed.

To check for outliers, we can use a box plot and get more insight into the data.
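A Seaborn sketch of that box plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of loan amount split by the bad_loan flag
pdf = loan_df.select("loan_amnt", "bad_loan").toPandas()
sns.boxplot(x="bad_loan", y="loan_amnt", data=pdf)
plt.show()
```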

Looking at the box plot, it can be said that when the loan is good, the median is around 13k, but when the loan is bad, the median loan amount is 23k. So, while we can infer that a higher loan amount can lead to a bad loan, there are many other factors that should also be taken into account.

Now let's check the interest rate with respect to bad loans.
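For example:

```python
from pyspark.sql.functions import avg

# Average interest rate for good vs. bad loans
loan_df.groupBy("bad_loan").agg(avg("int_rate").alias("avg_int_rate")).show()
```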

For the bad loans, interest rates are pretty much on the higher side.

We can also look at the type of homeownership of the customer with respect to whether the loan is bad or not. This can tell us about the behavior and stability of the customers.
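A crosstab gives a quick view:

```python
# Homeownership status vs. the bad_loan flag
loan_df.crosstab("home_ownership", "bad_loan").show()
```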

To conclude,

The Databricks platform is really fast for large datasets and offers rich features for exploratory data analysis with multiple graph options. Support for multiple languages like Python, SQL, and Scala is very handy, letting you work in whichever you are most comfortable with. Spark DataFrames are very similar to pandas DataFrames, and visualization with Python libraries like Matplotlib and Seaborn can also be used to perform EDA with great graphs.

Thanks for your time. You can post your feedback in the comments, reach me directly at shreshthyash@gmail.com, or connect with me on LinkedIn.
