Predicting mortgage approvals: Data analysis and prediction with Azure ML Studio

Part 1

Published in Analytics Vidhya · 8 min read · Mar 14, 2020

This is part 1 of “Predicting mortgage approvals”. In part 2, I present the classification model built in Azure Machine Learning Studio.

A simple and quick EDA in Python for beginner data scientists

This post is an introductory example of an EDA (Exploratory Data Analysis). Our goal is to explore and analyze the data. We will then design a model to predict whether a mortgage application is accepted or denied, based on the given dataset, which is adapted from data published by the Federal Financial Institutions Examination Council (FFIEC). The model will be described in the next post.

The data contains variables about applicants, loan characteristics and amounts, location and population, among others. We explored and analyzed the data, looking for insights and relationships between variables. We then transformed it, where necessary, into a form with more predictive power and built a machine learning model to predict our target variable. Some conclusions and ideas:

Presence of outliers (or data errors) in many numerical values.
Loans for home purchase are more likely to be accepted than loans for home refinancing.
Loan acceptance is not affected by the loan type or the property type, whether it is a conventional loan or a government-guaranteed one.
Most applicants are white, not Hispanic and male, and they are accepted more often than rejected. Applications from Black or Hispanic applicants, or from women, are rejected slightly more often.
Lender is a categorical variable with too many distinct values; a new variable derived from each lender's acceptance ratio is a much better option for prediction.
Applications with a loan amount below 100,000 are more likely to be rejected. When it is higher than 150,000, they are more likely to be accepted.
Applicants with incomes above 75,000 are more likely to be accepted.
Applicants from locations where more than 60% of the population belongs to a minority tend to be rejected slightly more often. Even in areas with a low percentage of minorities, applicants with relatively high incomes can be rejected.

Data Exploration

Our first step is a brief exploration of the provided dataset to get some quick insights. First of all, we list all columns, data types, the number of rows, and so on.

Output of the describe command on the dataset: columns of the dataset
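
As a minimal sketch of this first look, assuming the data is loaded into a pandas DataFrame: the tiny frame below is a made-up stand-in that reuses a few column names from the article (applicant_income and the boolean type of co_applicant are assumptions), not the real 500,000-row dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: a few columns, made-up values.
df = pd.DataFrame({
    "loan_type": [1, 1, 2, 1],
    "loan_amount": [70.0, 178.0, 163.0, 155.0],
    "applicant_income": [24.0, 57.0, np.nan, 105.0],  # assumed column name
    "co_applicant": [False, False, True, False],      # assumed to be the boolean column
    "accepted": [1, 0, 1, 1],
})

n_rows, n_cols = df.shape   # number of records and number of columns
dtypes = df.dtypes          # data type of every column
missing = df.isna().sum()   # count of missing values per column

df.info()                   # prints columns, non-null counts and dtypes
summary = df.describe()     # count, mean, std, min, quartiles, max (numeric columns)
```

On the real data, the same calls would be run on the loaded file instead of the hand-built frame.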

So, we have 500,000 data records with 23 columns. Most of the columns are numerical, some have missing values, and one variable is boolean. Comparing the data with the problem description, we can actually define two groups of feature variables, plus the label:

Categorical (stored as numbers, but not quantitative values): loan_type, property_type, occupancy, pre-approval, msa_md, state_code, county_code, applicant_ethnicity, applicant_race, applicant_sex, lender, co_applicant
Numerical: loan_amount, population, minority_population_pct, ffiecmedian_family_income, number_of_owner-occupied_units, number_of_1_to_4_family_units, …
Label: accepted, our target variable

Next, we inspect a brief descriptive summary of the dataset, showing its main statistics. For the categorical variables:

This brief data analysis gives us some useful information:

Many categorical variables take the same value for almost every row: property_type, occupancy and preapproval.
The preapproval value is “Not applicable” for almost every row.
Applicant race, ethnicity and sex each take mainly two values; we will dive deeper later.
Location information is incomplete in many rows, even though location should be an important variable.
There are a lot of outliers (or data errors) in most of the numerical features, especially loan amount and applicant income. The loan_amount mean is about 200, but its maximum is higher than 100,000 and its standard deviation is almost 600. It looks like there are some wrong values in the data.
In general, the dataset seems to be balanced between accepted and not accepted applications.
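
The checks behind these observations can be sketched as follows. This is only an illustration on made-up values, assuming pandas; the column names follow the article, but the codes and counts are invented.

```python
import pandas as pd

# Made-up sample; column names follow the article, values are illustrative only.
df = pd.DataFrame({
    "property_type": [1, 1, 1, 1, 1, 2],
    "occupancy":     [1, 1, 1, 1, 2, 1],
    "preapproval":   [3, 3, 3, 3, 3, 1],
    "accepted":      [1, 0, 1, 0, 1, 0],
})

categorical = ["property_type", "occupancy", "preapproval"]

# Treat the numeric codes as categories so describe() reports
# count, number of unique values, most frequent value (top) and its frequency.
cat_summary = df[categorical].astype("category").describe()

# Share of rows taken by the most frequent value of each column:
# a value close to 1 means the column is almost constant.
top_share = df[categorical].apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Class balance of the label: share of accepted vs. not accepted.
label_balance = df["accepted"].value_counts(normalize=True)
```

A column whose top_share is close to 1 carries little information on its own, which is the situation described for property_type, occupancy and preapproval.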

Let's dive into each kind of data we have, searching for more useful information.

Exploring categorical data

First, a basic boxplot shows how the values are spread across their range:

We use boxplots to see how values are spread across their range. Many of the figures show that most rows share a single value, so we should explore whether the distribution of the label is affected by any of these features. Let's plot some richer graphics than the previous ones:

Now some ideas emerge:

Loans for home purchase (loan purpose = 1) are more likely to be accepted than loans for home refinancing (loan purpose = 3).
Loan acceptance is not affected by the loan type or the property type. Whether it is a conventional loan or a government-guaranteed one, they all have the same chances. However, most applications are for conventional loans (loan type = 1) and one-to-four-family properties (property type = 1).
Applications for the owner's principal dwelling (occupancy = 1) are the most frequent.
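
The comparison behind these points is an acceptance rate per category value. A minimal sketch, assuming pandas and a 0/1 label; the sample values are made up and only the column names and codes come from the article:

```python
import pandas as pd

# Made-up sample: loan_purpose 1 = home purchase, 3 = refinancing (codes from the article).
df = pd.DataFrame({
    "loan_purpose": [1, 1, 1, 3, 3, 3, 1, 3],
    "accepted":     [1, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 label per group is the acceptance rate;
# the group size shows how many applications support that rate.
rates = df.groupby("loan_purpose")["accepted"].agg(
    acceptance_rate="mean",
    applications="size",
)

# The same information as shares per row: columns are the label values 0 and 1.
shares = pd.crosstab(df["loan_purpose"], df["accepted"], normalize="index")
```

Keeping the group size next to the rate helps avoid reading too much into categories with very few rows.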

For features related to the applicants we perform the same analysis:

Most applicants are white (5), not Latino (2) and male (1), and they are accepted more often than rejected.
However, applications from Black (3) or Hispanic (1) applicants, or from women (2), are rejected slightly more often. The difference is so small that we cannot confirm it as a discriminating factor.
Applicants who do not provide this kind of information tend not to be accepted, so it seems to be relevant to the lender.

The lender variable is a special feature with a lot of distinct values, as shown in the next figure:

When we define only 10 bins, a roughly linear increasing trend appears. When the number of bins is increased, the trend flattens, as expected, but some peaks are revealed. We need to transform this data to extract any useful information. We consider a lender's acceptance ratio a good piece of information, so we calculate it for every lender. We also need to reduce the number of categories: 6.1 thousand is not acceptable for a categorical variable. We therefore decided to define levels of acceptance ratio: level 0 for 0%, level 1 for 0–12.5%, level 2 for 12.5–25%, and so on.

The acceptance ratio for most lenders is between 0.5 and 0.9, with a median of approximately 0.7, so lenders tend to accept more loans than they reject.

The number of levels is not fixed at this point, but 8 levels looks like a good option.
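
The lender transformation described above could look roughly like this. It is a sketch under assumptions, not the exact code used for the article: the sample data is made up, and treating the upper bound of each level as inclusive is my own choice.

```python
import numpy as np
import pandas as pd

# Made-up sample with three hypothetical lender ids.
df = pd.DataFrame({
    "lender":   [10, 10, 10, 10, 20, 20, 30, 30, 30, 30],
    "accepted": [1,  1,  1,  0,  0,  0,  1,  0,  0,  0],
})

N_LEVELS = 8  # 8 levels of width 12.5%, plus level 0 for a ratio of exactly 0

# Acceptance ratio per lender: mean of the 0/1 label.
lender_ratio = df.groupby("lender")["accepted"].mean()

# Level 0 for 0%, level 1 for (0, 12.5%], level 2 for (12.5%, 25%], and so on.
# Assumption: the upper bound of each level is inclusive.
lender_level = np.ceil(lender_ratio * N_LEVELS).astype(int)

# Attach the level to every application as a new low-cardinality feature.
df["lender_level"] = df["lender"].map(lender_level)
```

Because the ratio is derived from the label, in a modeling pipeline it would have to be computed on the training split only and then mapped onto validation and test rows.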

Location Features

There are a lot of records with a missing value (-1) in these columns and many of them are not accepted.
Some records have missing values (-1) in every location column; we probably cannot infer much about them.

The state variable takes values between 0 and 52, but there are no records with the value 51, so its missing values could be replaced with 51. Maybe some kind of error occurred during data processing, or the values are simply missing. Applicants are evenly distributed across all MSA MD values, so state is probably the best feature to represent the location.
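
A small sketch of these location checks and of the proposed replacement, assuming pandas and -1 as the missing-value marker mentioned above; the rows are made up:

```python
import pandas as pd

# Made-up sample; -1 marks a missing value, as in the article.
df = pd.DataFrame({
    "msa_md":      [-1, 120, -1, 45],
    "state_code":  [-1, 37, 12, -1],
    "county_code": [-1, 246, -1, 88],
    "accepted":    [0, 1, 0, 1],
})

location_cols = ["msa_md", "state_code", "county_code"]
is_missing = df[location_cols].eq(-1)

missing_per_column = is_missing.sum()   # how many -1 values each column has
all_missing = is_missing.all(axis=1)    # rows with no location information at all

# Share of not accepted applications among rows with any missing location value.
rejected_share_when_missing = 1 - df.loc[is_missing.any(axis=1), "accepted"].mean()

# The article notes that code 51 is unused, so it can stand in for "missing".
code_51_unused = 51 not in set(df["state_code"])
if code_51_unused:
    df["state_code"] = df["state_code"].replace(-1, 51)
```

Checking that 51 is really unused before the replacement avoids silently merging missing rows into an existing state.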

Exploring numerical variables

The next group of features to analyze is the numerical ones. We plot some graphs to show their distributions:

We can barely see the interquartile range on the boxplot, so many outliers or errors are present. There is no significant difference between the distributions of accepted and not accepted applications, even among the outliers.
Applications with a loan amount below 50,000 are likely to be rejected, and those between 50,000 and 100,000 are somewhat likely to be rejected. When the loan amount is higher than 150,000, applications are more likely to be accepted.
When applicant income is less than 75,000, applications are more likely to be rejected.
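
Both observations can be approximated numerically. The sketch below flags outliers with the usual 1.5 × IQR boxplot rule and computes the acceptance rate per loan amount band. It assumes pandas, made-up values, and that loan_amount is expressed in thousands (which the summary statistics earlier suggest, but which is my assumption).

```python
import numpy as np
import pandas as pd

# Made-up sample; loan_amount assumed to be in thousands, with one extreme value.
df = pd.DataFrame({
    "loan_amount": [30, 45, 80, 95, 120, 160, 220, 400, 101000],
    "accepted":    [0,  0,  0,  1,  1,   1,   1,   1,   0],
})

# Boxplot rule: points beyond 1.5 * IQR from the quartiles count as outliers.
q1, q3 = df["loan_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["loan_amount"] < q1 - 1.5 * iqr) | (df["loan_amount"] > q3 + 1.5 * iqr)

# Bands matching the thresholds discussed in the text (right edge inclusive).
bins = [0, 50, 100, 150, np.inf]
labels = ["<50", "50-100", "100-150", ">150"]
df["amount_band"] = pd.cut(df["loan_amount"], bins=bins, labels=labels)

# Acceptance rate per band: mean of the 0/1 label.
band_rate = df.groupby("amount_band", observed=True)["accepted"].mean()
```

The same pattern applies to applicant income with a single threshold at 75.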

We repeat the same process with the other numerical features:

In locations where the percentage of minority population is low, applicants tend to be accepted. When more than 60% of the population belongs to a minority, the number of rejected applicants tends to be higher.
As we saw previously with applicant income, the lower the median income, the less likely the loan is to be accepted. Below $55,000, the ratio of denied loans is higher.

Two features show a normal distribution, but we cannot identify anything remarkable about them.

Analyzing relation between numerical variables

At this point we have some knowledge about the features and how the acceptance ratio depends on them. The next step is to analyze how these features are related to each other. A scatter matrix is our tool to get that insight:

Scatter Plot Matrix for all numerical features

The relationships of loan amount and applicant income with each of the remaining variables allow us to distinguish accepted from not accepted loans.
There is a strong linear correlation between population, number of 1–4 family units and number of owner-occupied units.
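
A correlation matrix is a compact numerical companion to the scatter matrix. The sketch below uses synthetic data built so that the two housing counts follow population; the coefficients and noise levels are invented for illustration and are not estimates from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: housing counts are constructed as noisy multiples of population.
rng = np.random.default_rng(0)
population = rng.integers(1000, 9000, size=200).astype(float)
df = pd.DataFrame({
    "population": population,
    "number_of_owner-occupied_units": 0.25 * population + rng.normal(0, 50, 200),
    "number_of_1_to_4_family_units": 0.35 * population + rng.normal(0, 80, 200),
    "minority_population_pct": rng.uniform(0, 100, 200),
})

# Pearson correlation between every pair of numerical features.
corr = df.corr()

# Keep each pair once (upper triangle) and list the strongly correlated ones.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
strong_pairs = pairs[pairs.abs() > 0.9]
```

With matplotlib installed, pd.plotting.scatter_matrix(df) draws the corresponding scatter plot matrix. Strongly correlated features such as these carry largely redundant information for a model.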

We would like to focus on some features that seem to reveal remarkable ideas:

Applications with a high loan amount and low applicant income are very likely to be rejected.
There are no applications, accepted or not accepted, with a high loan amount in locations where the percentage of minority population is very high.
In areas with a high percentage of minorities, even applicants with relatively high incomes are rejected.

The next step will be to design and build a predictive model based on the features just analyzed. In the next post, we will describe a model designed in Azure Machine Learning Studio.

Then we can build a predictive model using Azure Machine Learning Studio. Click the link to access it.