What Is My Home Worth? What Every Homeowner Should Know

Felipe Mahlmeister
fmeister23-en
Sep 5, 2019

Buying or selling a home has never been easy, especially now that there are so many options and so many variations between houses.

It’s really important to know how much your house is worth: if you price it too high, it could wind up sitting on the market, and if you price it below market value, you will end up losing money.

So, what’s the way out of this situation?

Your best bet is to know your home’s worth, and list your home close to the market value.

On the internet, plenty of sites can estimate it for you based on your location: ZipRealty, Realtor, Redfin, and many others.

The main objective of this article is to go deeper into these questions:

  • What are the aspects that have the most influence on the value of a home?
  • Can we have a trustworthy estimate of a home’s price?
  • How could we estimate almost any house’s price and transform this study into a product?

To answer these questions we are going to explore the Ames Housing dataset, which contains almost 3,000 home sales described by 79 explanatory variables directly related to property sales in Ames, Iowa. Some examples of the questions these variables answer:

When was it built? How big is the lot? How many square feet of living space are in the dwelling? Is the basement finished? How many bathrooms are there?

Part I: What are the aspects that have the most influence on the value of a home?

Let’s suppose we’re about to buy a new house and the real estate agent is comparing some houses of the same neighborhood. What are the main components of homes that we should keep an eye on because of their large impact on the worth of a home?

To get this answer we will start with some practical steps:

  • Correlation matrix (heatmap)
  • SalePrice correlation matrix (zoomed heatmap)
  • Scatter plots between the most correlated variables

Correlation matrix (heatmap)

This is a great way to get a quick overview of our data and its relationships.
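As a rough illustration of how such a correlation matrix can be produced, here is a minimal sketch with pandas. The data frame is a small synthetic stand-in (the numbers are made up, not taken from the Ames dataset), and `plot_heatmap` is a hypothetical helper name; it assumes seaborn and matplotlib are installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Ames data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "GarageArea": rng.normal(450, 120, n),
    "YearBuilt": rng.integers(1900, 2010, n),
})
df["SalePrice"] = (
    100 * df["GrLivArea"] + 50 * df["GarageArea"] + rng.normal(0, 20000, n)
)

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()


def plot_heatmap(corr_matrix):
    """Draw a correlation matrix as a heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap="coolwarm", square=True, ax=ax)
    return ax
```

Calling `plot_heatmap(corr)` draws the figure. With the real data you would load the Kaggle training file into `df` instead of generating it.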

At first sight, there are two red squares that get my attention. The first one refers to the ‘TotalBsmtSF’ and ‘1stFlrSF’ variables, and the second one refers to the ‘GarageX’ variables. Both cases show how strong the correlation between these variables is. In fact, it is so strong that it can indicate multicollinearity. If we think about these variables, we can conclude that they carry almost the same information, so multicollinearity really does occur. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.
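The same check can be done numerically instead of by eye. The sketch below lists feature pairs whose absolute correlation exceeds a cutoff. The helper name `collinear_pairs`, the 0.8 threshold, and the synthetic data are all assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def collinear_pairs(df, target, threshold=0.8):
    """Return (feature_a, feature_b, r) for feature pairs with |r| > threshold."""
    corr = df.drop(columns=[target]).select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], r))
    # Strongest pairs first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic stand-in data: the first floor roughly tracks the basement footprint.
rng = np.random.default_rng(1)
n = 300
bsmt = rng.normal(1000, 300, n)
garage_area = rng.normal(450, 150, n)
houses = pd.DataFrame({
    "TotalBsmtSF": bsmt,
    "1stFlrSF": bsmt + rng.normal(0, 100, n),
    "GarageArea": garage_area,
    "GarageCars": np.round(garage_area / 250),
    "YearBuilt": rng.integers(1900, 2010, n),
    "SalePrice": 100 * bsmt + rng.normal(0, 20000, n),
})

pairs = collinear_pairs(houses, target="SalePrice")
```

Each reported pair is a candidate for dropping one of its two variables.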

Another thing that got my attention was the ‘SalePrice’ correlations. We can see ‘TotalSF’, ‘GrLivArea’, ‘TotalBsmtSF’, and ‘OverallQual’ with significant relationships with ‘SalePrice’, but we can also see many other variables that should be taken into account. That’s what we will do next.

SalePrice correlation matrix (zoomed heatmap)

According to the heatmap, these are the variables most correlated with ‘SalePrice’. My thoughts on this:

  • ‘OverallQual’, ‘TotalSF’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’
  • ‘TotalSF’ and ‘GrLivArea’ seem to be multicollinear. We will keep ‘OverallQual’ since its correlation with ‘SalePrice’ is higher
  • ‘GarageCars’ and ‘GarageArea’ are also some of the most strongly correlated variables. However, the number of cars that fit into the garage is a consequence of the garage area. Therefore, we just need one of these variables in our analysis (we can keep ‘GarageCars’ since its correlation with ‘SalePrice’ is higher)
  • ‘TotalBsmtSF’ and ‘1stFlrSF’ also seem to be multicollinear. We will keep ‘TotalBsmtSF’
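A zoomed heatmap of this kind boils down to ranking the features by their correlation with ‘SalePrice’ and keeping the top few. Here is a minimal sketch under assumptions: `top_correlated` is a hypothetical helper, ‘MiscNoise’ is an invented column, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target="SalePrice", k=10):
    """Return the k features most correlated (in absolute value) with target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(k)


# Synthetic stand-in data (values are invented for illustration).
rng = np.random.default_rng(2)
n = 300
sales = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.normal(1500, 400, n),
    "MiscNoise": rng.normal(0, 1, n),  # unrelated to the price
})
sales["SalePrice"] = (
    20000 * sales["OverallQual"]
    + 80 * sales["GrLivArea"]
    + rng.normal(0, 15000, n)
)

top = top_correlated(sales, k=2)

# The "zoomed" matrix only covers the target and its top-k features.
zoom = sales[["SalePrice", *top.index]].corr()
```

The `zoom` matrix can then be drawn with any heatmap function, for example seaborn's `heatmap` with `annot=True` to print the coefficients in each cell.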

Let’s proceed to the scatter plots.

Scatter plots between the most correlated variables

Although we already know some of the main figures, this mega scatter plot gives us a reasonable idea of the relationships between the variables.

In this figure, we can see what looks like an exponential relationship between ‘OverallQual’ and ‘SalePrice’. It makes total sense that home prices increase as the overall quality of the house gets better, but it’s a surprise to me that this relationship appears to be exponential! So, keep your eyes open when you’re comparing similar houses with different overall quality.

The plot of ‘SalePrice’ against ‘YearBuilt’ can also make us think. At the bottom of the ‘dots cloud’, we see what almost appears to be a shy exponential function. We can also see this same tendency at the upper limit of the ‘dots cloud’. Also, notice how the set of dots for the most recent years tends to stay above this limit (the prices are increasing faster now).
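One hedged way to test the “exponential” impression is to compare a straight-line fit on the raw price with a straight-line fit on the logarithm of the price: if the relationship is closer to exponential, the log fit should explain more of the variance. The sketch below uses synthetic data generated with a made-up growth rate, and `r_squared` is a hypothetical helper.

```python
import numpy as np


def r_squared(x, y):
    """R^2 of a least-squares straight line fitted to y as a function of x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


# Synthetic stand-in: price grows exponentially with quality, with noise.
rng = np.random.default_rng(3)
n = 500
overall_qual = rng.integers(1, 11, n).astype(float)
sale_price = 40000 * np.exp(0.3 * overall_qual) * np.exp(rng.normal(0, 0.1, n))

r2_raw = r_squared(overall_qual, sale_price)          # line through raw prices
r2_log = r_squared(overall_qual, np.log(sale_price))  # line through log prices
```

On real data the gap between the two values may be much smaller, so this is a sanity check rather than proof of an exponential law.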

So, what are the main components of homes that we should keep an eye on?

  1. You should first take a look at the overall quality of the materials and finish of the house. This is the number one factor contributing to the variation in sale price
  2. As common sense suggests, the living area or total square footage of the house is also a main component with a big influence on the house price
  3. How many cars can fit in the garage and the total number of bathrooms should also be watched

These are the main factors that you should keep your eyes on when buying or selling a house.

Part II: Can we have a trustworthy estimate of a home’s price?

Now we know the main factors that impact the value of a home, but we haven’t yet solved our main problem: correctly estimating the value of our property.
Is it possible to build a reliable machine learning model to do this for us?

Using machine learning techniques, we selected several models and ensembled them to get a better result. I don’t think it would be very interesting to show a lot of code here, so if you want to see it in its entirety, you can check it out on my GitHub page.

All of the models individually achieved scores between 0.11 and 0.14 (an error metric, so lower is better), but when their predictions were blended, the score dropped to about 0.08! That’s because those models are actually overfitting to a certain degree. They’re very good at predicting a subset of houses, and they fail at predicting the rest of the dataset. When their predictions are blended, those models complement each other.

The ensemble model earned a top 27% position in the Kaggle competition, with an overall score of 0.12024!
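To see why blending can lower the error, here is a minimal sketch that is not the ensemble from the GitHub repository. It assumes the scores are the Kaggle competition metric (root mean squared error between the logarithms of predicted and actual prices), simulates three models whose errors are independent, and blends them with a plain average. The helper `rmsle`, the error levels, and the prices are all invented for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared error between log(predicted) and log(actual) prices."""
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Synthetic "true" prices and three simulated models with independent errors.
rng = np.random.default_rng(4)
n = 1000
y_true = rng.uniform(80000, 400000, n)
pred_a = y_true * np.exp(rng.normal(0, 0.12, n))
pred_b = y_true * np.exp(rng.normal(0, 0.13, n))
pred_c = y_true * np.exp(rng.normal(0, 0.14, n))

# Simple blend: the unweighted average of the three predictions.
blend = (pred_a + pred_b + pred_c) / 3

individual_scores = [rmsle(y_true, p) for p in (pred_a, pred_b, pred_c)]
blend_score = rmsle(y_true, blend)
```

The improvement depends on the models making different mistakes. If their errors are strongly correlated, averaging helps far less than in this idealized setup.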

If you live in Ames, Iowa, you can easily use this model to estimate the selling price of your home. But as this is not the case for most people, this brings us to the next part of this article.

Part III: How could we estimate almost any house’s price and transform this study into a product?

Everything we’ve done so far is based on data collected in Ames, Iowa. If we want to generalize this study to almost any house in the world, we first need to collect housing data from other places and check whether the model’s score changes significantly. If it doesn’t, our model isn’t overfitted to this dataset and has captured the essence of “how to estimate a house’s price” rather than “how to estimate the price of a house in Ames”.

If our model gets a good result, we need to collect the same attributes for every location we commit to covering.

We could create a site where users enter their ZIP Code and the main attributes of the house, such as the overall quality of the materials and finish, the total square footage, how many cars fit in the garage, and the total number of bathrooms, and get an estimate in return.

This isn’t the main objective of this study, so I’ll leave this task to you.

I hope you learned something from this study. If you have any questions, feel free to reach out to me on social media.
