Predicting mortgage approvals: Data analysis and prediction with Azure ML Studio (Part 2)

Eduardo Muñoz · Analytics Vidhya · Mar 26, 2020

This is part 2 of “Predicting mortgage approvals”. You can find part 1 here.

A description of a machine learning model in Azure Machine Learning Studio

The next step is to develop a predictive model that allows us to determine, with an acceptable degree of confidence, whether a loan application will be accepted. This is a two-class classification problem: accepted or not accepted. We will start with a simple approach and then refine the model while avoiding unnecessary complexity.

Based on the analysis already carried out, we can make some decisions:

  • The row_id variable should be discarded because it does not provide any information.
  • The categorical variables loan type and accuracy should be removed.
  • The variables loan_property and approval do not seem to provide much value, although they are not completely disposable.
  • The variable lender has too many distinct values (categories), so it will be replaced by its acceptance-ratio level.
  • Among the variables related to the race, ethnicity and sex of the applicant, those that are not predictive, such as ethnicity or sex, should be discarded.
  • Some of the numerical variables mentioned in previous sections look like good candidates: loan_amount, applicant_income, minority_population_pct or ffiecmedian_family_income.
  • Regarding the three variables related to the location of the applicants, we will initially include only the state, but other combinations will be analyzed.
Photo by Rudy and Peter Skitterians on Pixabay

Treatment of the variable lender

The variable lender, as one would expect, carries important information for the model, but it has a very high number of distinct values, which become categories, and that could be a source of overfitting. To address this problem we propose to transform it into a new variable that defines ranges of a lender's acceptance ratio. Therefore, for each lender we will calculate its acceptance ratio as the number of accepted applications divided by the total number of applications processed by that lender.

Once these values are calculated, we will group the lenders into acceptance-ratio ranges: 0-20%, 21-40%, and so on. In this way we reduce the number of categories of the variable to only 5 or 6 values and also capture some useful information about how readily a lender grants a loan.
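As an illustration only, a minimal pandas sketch of this transformation, assuming a lender column and a binary accepted label (column and file names are assumptions, not the exact dataset schema):

```python
import pandas as pd

# Assumed column names: 'lender' (lender id) and 'accepted' (1 = accepted, 0 = not)
df = pd.read_csv("train_values_with_labels.csv")  # assumed file name

# Acceptance ratio per lender: accepted applications / total applications processed
acceptance_ratio = df.groupby("lender")["accepted"].mean()

# Group the ratios into 5 bands: 0-20%, 21-40%, 41-60%, 61-80%, 81-100%
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
levels = [1, 2, 3, 4, 5]
lender_level = pd.cut(acceptance_ratio, bins=bins, labels=levels, include_lowest=True)

# Replace the raw lender id with its acceptance-ratio level
df["lender_level"] = df["lender"].map(lender_level)
df = df.drop(columns=["lender"])
```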

Dealing with location variables

We are referring to the variables state, county and msa_md (metropolitan statistical area or metropolitan division), all of which define the location of the applicant. For all of these features the number of distinct values is very high. We also have a lot of records where one or more of these variables take the value -1, which indicates that the actual value is not known or was not recorded; in other words, they are missing values.

We can also say that:

  • For the state field we have values from 0 to 52, but the value 51 never appears, which may lead us to consider that the value -1 in this variable is really the value 51.
  • For the msa_md field the same thing happens: we have values from 0 to 408 but there is no record with the value 338, so again -1 could be the real value of those records.
  • In the case of county, there are several values that never appear (85, ...), so we cannot make a simplification like the one described for the previous fields.

Finally, the initial approach will be to include only the variable state in the predictive model and, based on the results, consider adding the msa_md field later.
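If we accept the recoding hypothesis above, it is a one-line fix per field; a sketch, assuming the columns are named state and msa_md:

```python
# Treat -1 in state as the missing code 51, and -1 in msa_md as 338,
# following the hypothesis above (column names are assumptions).
df.loc[df["state"] == -1, "state"] = 51
df.loc[df["msa_md"] == -1, "msa_md"] = 338
```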

Dealing with outliers, data errors and missing values

Again, we have null values in multiple numerical variables, and that can penalize the performance of the model we are going to develop. There are around 70,000-80,000 records with missing values, and they often appear in several variables of the same record at the same time. To approach this problem, we will try to fill in the null values of these variables by looking for the value that best corresponds to them based on other variables of the same record:

  • Applicant income: we will take the median value of the records belonging to the same state, county, msa_md, race, ethnicity and sex as the record with the null value. If it does not exist, we will search for that median value based on state, county, msa_md, race and ethnicity, and so on. Finally, if no median value exists, we will take the mean value of that variable for the state of the record.
  • Minority population pct: when a record with a null value is found, we will search for the median value of the records with the same state, county and msa_md. If it does not exist, it will be searched for among the records with the same state and county, or finally among the records with the same state.
  • For the remaining variables, the same approach as for minority_population_pct will be used.
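A rough sketch of this hierarchical median imputation, again with assumed column names; the author implements it with Python scripts inside Studio, so this is only an approximation of the logic described above:

```python
import pandas as pd

def fill_hierarchical_median(df, target, group_cols):
    """Fill nulls in `target` with the median of progressively coarser groups,
    dropping the last grouping key at each step."""
    for i in range(len(group_cols), 0, -1):
        group_median = df.groupby(group_cols[:i])[target].transform("median")
        df[target] = df[target].fillna(group_median)
    return df

# Illustrative use for applicant_income; the text's final fallback is the state mean
df = fill_hierarchical_median(
    df, "applicant_income",
    ["state", "county", "msa_md", "applicant_race", "applicant_ethnicity", "applicant_sex"],
)
df["applicant_income"] = df["applicant_income"].fillna(
    df.groupby("state")["applicant_income"].transform("mean")
)
```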

Regarding outliers or possible data errors, some variables are seriously affected, such as loan_amount, applicant_income, ffiecmedian_family_income, etc. For the variables that are finally included in the predictive model, we will try two methods:

  • Records with values above the 98th percentile will be removed from the training process.
  • Records with values above the upper limit defined by the 1.5 * IQR rule (Q3 + 1.5 * IQR) will have those values replaced by that limit.
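As a sketch of both options on a single variable (loan_amount is used here as an example; the same applies to the other affected variables, and df is a pandas DataFrame as in the previous snippets):

```python
# Option 1: drop records above the 98th percentile of the variable
p98 = df["loan_amount"].quantile(0.98)
df_trimmed = df[df["loan_amount"] <= p98]

# Option 2: cap values above the upper limit of the 1.5 * IQR rule
q1, q3 = df["loan_amount"].quantile([0.25, 0.75])
upper_limit = q3 + 1.5 * (q3 - q1)
df["loan_amount"] = df["loan_amount"].clip(upper=upper_limit)
```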
https://docs.microsoft.com/en-us/archive/blogs/azureedu/how-can-i-get-started-with-azure-machine-learning

Building our predictive model

To build our model we will use the Azure Machine Learning Studio cloud tool and some Python scripts to apply data transformations and manipulations. We will do it in a set of stages: training data preparation, predictive model design and training, test data preparation, and scoring the model on the test data.

Data preparation

For this stage we design an Azure ML experiment where every data column is transformed based on the transformations previously defined:

  • transforming the lender variable into an acceptance-ratio level
  • filling missing values for the numerical variables
Multiple Python scripts for data preparation (Photo by author)

This experiment produces a dataset that will be used to train our model, and some other datasets used to apply the same transformations to the test data.
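For reference, each of those Python boxes in Studio (classic) wraps its logic in an azureml_main entry point; a minimal skeleton is sketched below (the second input dataset and its column names are assumptions used only for illustration):

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1: the loan applications
    # dataframe2: e.g. per-lender acceptance-ratio levels from another module (assumed)
    df = dataframe1.copy()

    if dataframe2 is not None:
        levels = dataframe2.set_index("lender")["lender_level"]
        df["lender_level"] = df["lender"].map(levels)

    # The module expects a sequence containing the output DataFrame
    return df,
```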

Building our model

Our next experiment will receive the transformed training dataset and apply a two-class classification algorithm to obtain a model for scoring the test data. We tried the main classification algorithms: logistic regression, decision trees (and variations) and neural networks. We evaluated every algorithm with the same data and multiple parameter settings, and our best option was the Boosted Decision Tree algorithm.

Azure ML Studio components of our model (Photo by author)

We performed the following steps:

  • Clipping values above some threshold or percentile to deal with outliers
  • Removing some records with missing values
  • Z-score normalization for some variables (loan_amount, applicant_income, ffiecmedian_family_income and population)
  • Min-max normalization for minority_population_pct and tract_to_msa_md_income_pct
  • Making categorical variables for loan_purpose, level, applicant_race, etc.
  • Selecting the columns to use in the training process
  • Splitting the dataset into training and test sets (75%-25%)
  • Training the Boosted Decision Tree
  • Scoring and evaluating the results

After many experiments, the columns selected were: loan_amount, loan_purpose, applicant_income, applicant_race, state, minority_population_pct, tract_to_msa_md_income_pct, ffiecmedian_family_income, ethnicity and the lender level of acceptance ratio.
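The article builds this pipeline with Studio modules; purely as an illustration of the same steps outside Studio, a rough scikit-learn equivalent might look like the sketch below (GradientBoostingClassifier stands in for Studio's Two-Class Boosted Decision Tree, and the accepted label column is an assumption):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

zscore_cols = ["loan_amount", "applicant_income", "ffiecmedian_family_income"]
minmax_cols = ["minority_population_pct", "tract_to_msa_md_income_pct"]
categorical_cols = ["loan_purpose", "applicant_race", "ethnicity", "state", "lender_level"]

preprocess = ColumnTransformer([
    ("zscore", StandardScaler(), zscore_cols),    # Z-score normalization
    ("minmax", MinMaxScaler(), minmax_cols),      # Min-max normalization
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("boosted_tree", GradientBoostingClassifier()),  # stand-in for the Studio module
])

# 75%-25% split as in the experiment; 'accepted' is the assumed label column
features = zscore_cols + minmax_cols + categorical_cols
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["accepted"], test_size=0.25, random_state=42
)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
```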

The training results were:

Preparing the test data

Now we need to apply to our test dataset the same (or similar) transformations we applied to the training data, as we saw previously: transforming the lender variable and filling missing values for some numerical variables (applicant_income, minority_population_pct, ffiecmedian_family_income, tract_to_msa_md_income_pct).

Photo by author

The result of this process is a new dataset prepared to be used in our predictive model.

Scoring the model on the test dataset

Finally, we can apply our model to the test data and score every record as accepted or not accepted. This experiment is:

Photo by author
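Continuing the scikit-learn stand-in from the earlier sketch, scoring the prepared test data would be a single predict call (test_df is the assumed output of the test preparation experiment):

```python
# Score every test record as accepted (1) or not accepted (0)
test_predictions = model.predict(test_df[features])
test_df["accepted"] = test_predictions
```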

Conclusions

After a long time spent analyzing the data and building the model, we can conclude that data preparation has been the most powerful tool for achieving acceptable performance. Regarding the data analysis, we have already mentioned many findings: the applicant's race and ethnicity are relevant; applicants with high incomes, as well as those located in areas with a low percentage of minority population, are more likely to be accepted for a loan; and most of the applications relate to the same type of property and loan.
