Finding A Place Mortgage World

Published in

Web Mining [IS688, Spring 2021]

5 min read · Apr 7, 2021

At some point in your life, you may find that living at home or in an apartment is no longer the best fit. You may decide it is time to purchase a home, whether to start a family or to make an investment rather than keep paying rent. I have hit that stage of life. My girlfriend and I are part of the adventurous group of people who have decided that now is the perfect time to buy our first home! Yes, these are very exciting and stressful times, as we are in possibly the largest supply shortage of homes ever. As many of you may know, there is currently a supply and demand problem with homes, and there are a few reasons for the shortage. The biggest is that a lot of millennials are looking to buy their first home, i.e., me! The next is that the cost of materials has been rising steadily, and COVID-19 has made this much worse. The lack of materials has led to fewer new homes being built.

Knowing all of this, and the fact that the cost of living has greatly outpaced income, at least here in New Jersey, we still wish to purchase a home. I often wonder how we stack up financially against other home buyers and which group of homebuyers we fall into. Luckily, there is some data out there to help give me an idea.

The Data

There is a dataset that contains plenty of variables (~20) and over 500 thousand records of mortgages on single-family homes. The data spans 1999 to 2018, so it does not yet reflect the effects of COVID-19 and the boom of home buyers. With a login to the Freddie Mac page it is possible to obtain more information, but for now we will use the 1999 to 2018 data.

The dataset contains credit score, mortgage insurance percentage, loan to value, debt to income ratio, loan term, delinquency, the lender, first payment, maturity date, metropolitan statistical area, original interest rate, prepayment penalty flag, original UPB (unpaid principal balance), prepaid, number of borrowers, and a few other variables that we will not focus on, such as the servicer of the loan, the seller, and the loan sequence number.

The main features we will be working with are credit score, mortgage insurance percentage, debt to income ratio, interest rate, loan to value, and original UPB. I have chosen these variables because they match the items I know from my own home buying experience, so I can place myself among them; some of the other variables have yet to be part of my process and hopefully will not be. But we will definitely take a look at delinquency and see which groups seem to be more delinquent than others.

Stats

Mean Credit Score: 709
Mean Mortgage Insurance Percentage: 6.95
Mean Debt to Income Ratio: 31.9
Mean Original Combined Loan to Value: 76.05
Mean Loan Term: 359.85

To start seeing the groups in the data, I will apply k-means clustering. K-means partitions the data by assigning each data point to the nearest mean, or centroid. Determining how many centroids the dataset should have takes some iteration. I will use the silhouette score to help determine the best number of clusters. For each data point, the silhouette score compares the average distance to the other points in its own cluster with the average distance to the points in the nearest neighboring cluster, which measures how well the points fit the clusters they were assigned to. The score ranges from -1 to 1, and the higher the score, the better that value of K represents the data. One of the main issues is that we need to iterate through numerous K values and take the silhouette score of each to determine what K should actually be. Since the dataset has 500,000 instances, this could take a while. To speed this up, we will take a random sample of 25K data points and try cluster counts from 2 to 11.
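The search over K described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the article: it uses synthetic two-group data from `make_blobs` in place of the Freddie Mac features, a sample of 500 points instead of 25K so it runs quickly, and arbitrary `random_state` and `n_init` values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the mortgage features (NOT the Freddie Mac data):
# two well-separated groups in a 2-D feature space.
X_full, _ = make_blobs(
    n_samples=5000, centers=[[0, 0], [20, 20]], cluster_std=1.0, random_state=0
)

# Random sample to keep the pairwise-distance cost of the silhouette score low.
# The article samples 25K rows; 500 is an assumption to keep this sketch fast.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X_full), size=500, replace=False)
X = X_full[sample_idx]

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 12):  # K = 2..11
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is the preferred cluster count.
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be the sampled feature columns rather than synthetic blobs; whether to standardize the features first is a separate choice that the article does not discuss.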

Applying K-Means and Silhouette Score

The silhouette score indicates that 2 is the best number of clusters for the dataset. Now let's try to interpret our clusters. A loan becomes delinquent when the borrower is late on a mortgage payment, which is a serious issue. So this is a good place to start our analysis and see whether the clusters give us any insight into a grouping of delinquent loans.

Chart showing the number of delinquent and non-delinquent accounts in each K-means cluster.

As you can see, both clusters have about the same number of delinquent accounts, so this doesn't tell us much yet. If we look at the average mortgage insurance percentage for each cluster, we see a large discrepancy.

0: 0.132
1: 25.12

Average Mortgage Insurance Percentage by Cluster ID
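The per-cluster comparisons above (delinquency counts and average mortgage insurance percentage) could be produced along these lines with pandas. The rows, values, and column names below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; values and column names are hypothetical.
df = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1, 1, 1],
        "delinquent": [False, True, False, False, True, False],
        "mortgage_insurance_pct": [0.0, 0.0, 1.0, 25.0, 30.0, 20.0],
    }
)

# Delinquent vs. non-delinquent loan counts per cluster.
counts = pd.crosstab(df["cluster"], df["delinquent"])

# Average mortgage insurance percentage per cluster.
mi_by_cluster = df.groupby("cluster")["mortgage_insurance_pct"].mean()

print(counts)
print(mi_by_cluster)
```

The same `groupby` pattern extends to credit score and debt to income ratio by swapping the column name.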

We can now see how these clusters break apart: cluster 0 has a very low mortgage insurance percentage. If we continue, we will see that cluster 0 also has a higher average credit score (718 vs. 684) and a lower debt to income ratio (31 vs. 34). So we can surmise that cluster 0 may be in a better financial position to purchase a home than cluster 1.

To help confirm our clusters, we can apply another clustering method called agglomerative clustering. Instead of K-means' centroid approach, this method builds clusters from the bottom up: it uses a distance metric such as Euclidean distance to measure how far apart the points are and repeatedly merges the closest clusters. With agglomerative clustering, we do not need to specify the number of clusters, which is really nice if we are unsure.
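As a rough sketch of how this looks with scikit-learn's `AgglomerativeClustering`, again on synthetic data rather than the mortgage features: the estimator defaults to `n_clusters=2`, so letting the data decide the count means passing `n_clusters=None` together with a `distance_threshold`. The threshold of 100.0 below is an arbitrary value chosen for this toy data, not a recommendation.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in data (NOT the Freddie Mac data): two separated groups.
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [20, 20]], cluster_std=1.0, random_state=0
)

# Option A: ask for a fixed number of clusters (2 is also the default).
fixed = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)

# Option B: do not fix the count; stop merging once the linkage distance
# between the closest pair of clusters exceeds the threshold.
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=100.0, linkage="ward"
).fit(X)

print(fixed.n_clusters_, by_distance.n_clusters_)
```

With Option B the number of clusters found depends on the threshold, so it is worth trying a few values on real data.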

After applying agglomerative clustering, we actually end up with two clusters, the same result we got using K-means and the silhouette score. Agglomerative clustering also gives very similar results to K-means across the board.

Original debt to income ratio of Cluster 0: 30.89, Cluster 1: 34.22
Credit score of Cluster 0: 719, Cluster 1: 685

Agglomerative clustering confirms much of the K-means results, giving me a little more confidence in the groupings.

So now I can put in my own information and see where I fall among other home buyers and, potentially, whether I fall into a group that is often delinquent.
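One possible way to place a new buyer is to reuse the fitted K-means model, since `KMeans` has a `predict` method that assigns a point to its nearest centroid (scikit-learn's `AgglomerativeClustering` does not offer `predict`). The sketch below uses made-up feature rows and a hypothetical buyer; none of the numbers come from the dataset or from my own application.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy rows, made up for illustration:
# [credit score, debt to income ratio, mortgage insurance %]
X = np.array(
    [
        [720, 30, 0],
        [735, 28, 0],
        [710, 32, 0],
        [680, 35, 25],
        [690, 34, 30],
        [675, 36, 25],
    ],
    dtype=float,
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hypothetical buyer described with the same three features.
me = np.array([[725, 29, 0]], dtype=float)
my_cluster = model.predict(me)[0]

print(my_cluster)
```

The cluster IDs themselves are arbitrary, so the useful comparison is which existing loans share the predicted cluster.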

Limitations

This dataset is missing some information that I feel would be critical to clarify the groups and see where a person may fall. The person's income, the cost of the property, and the mortgaged amount would all help determine better groupings. More recent data would also be very interesting to investigate because the housing market has gone up so drastically over the past year.

Resources

sklearn.cluster.AgglomerativeClustering - scikit-learn 0.24.1 documentation

Agglomerative Clustering Recursively merges the pair of clusters that minimally increases a given linkage distance…

scikit-learn.org

Single Family Loan-Level Dataset

As part of a larger effort to increase transparency, Freddie Mac is making available loan-level credit performance data…

Written by AndrewD5