Creating a Predictive Model for Home Prices that Anyone Can Use
My team and I were given the task of creating a predictive model for home sale prices in the Seattle, Washington area, more specifically King County. I personally wanted to take a deeper dive and analyze how home condition, the year the home was built, and zip code affected home prices. I built a multiple linear regression model and turned it into a price predictor geared specifically around those three features. There are many ways to construct a predictive model, some more complex than others, but I found this method to be pretty simple, and it also lets you create a user-friendly interface that anyone can use simply by plugging in different feature values.
Let's first talk about exactly what linear regression is. Linear regression is a modeling technique used to estimate the strength and direction of the relationship between two (or more) variables. You have a dependent variable, also called the target variable, and independent variables, also called features or predictors. Simple linear regression uses one independent variable and one dependent variable to estimate the relationship between the two, while multiple linear regression uses several independent variables and one dependent variable. Regression is a parametric technique, which means it learns parameters (coefficients) from the data, and it is often considered a foundational concept in machine learning. Some assumptions that must be satisfied when fitting a linear regression model are:
- Linearity
- No multicollinearity (the independent variables should not be highly correlated with each other)
- Normality (of the residuals)
- Homoscedasticity
Once all of these assumptions have been checked and satisfied, you can move on to predictive modeling.
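To make "learning parameters from data" concrete, here's a minimal sketch of a simple linear regression with scikit-learn. The data here is made up for illustration (it is not the King County data set); the points roughly follow y = 2x + 1, so the fitted slope and intercept should land near 2 and 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustrative data, roughly following y = 2x + 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0])

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one slope per feature, plus an intercept
print(model.coef_)       # slope, close to 2
print(model.intercept_)  # intercept, close to 1
```

This is exactly the pattern used later in the post, just with one feature instead of three.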
I wanted to give you a short intro to linear regression before we move on to the actual modeling. Next, I also want to briefly show you some visualizations created from this data set.
After cleaning my data set and getting it ready for analysis, I created a few visuals to see how condition, the year the home was built, and zip code affected home price before moving on to my regression modeling.
I created a really cool geomap of the Seattle, Washington area comparing price and home condition rating using Plotly and Mapbox! I was very excited about it. It's a little tricky the first time you create one because of the learning curve of creating and importing a Mapbox token, but after you've made one, it's actually pretty easy. I am going to write another post on exactly how to make this geomap, because I could not find a clear step-by-step guide anywhere and it seriously took me almost a week to figure out!
Next, I created a scatter plot to see how year built, condition, and price correlate with each other. This visual shows that homes built between 1900 and 1980 have the highest condition ratings and are also holding their value very well. I wanted to know why, so I did a little outside research on Seattle home prices: there is a shortage of homes in King County, and homeowners are keeping their properties up because the county is growing at such a rapid rate. Seattle is the 5th most expensive city to live in in the United States, and with a shortage of roughly 1 home per 1,060 people, home prices are surging.
I also wanted to know if there was a specific zip code that had the highest-rated condition homes compared to price. The two zip codes with the most 5-rated condition homes were 98125 and 98101. These areas are downtown Seattle waterfront neighborhoods, most notably around Pike Place Market.
After I completed my visualizations, I moved on to modeling the data. I want to keep this post focused on the predictive model specifically, but if you are interested in that process, please check out my full notebook on my GitHub.
Once my exploratory modeling was complete, I started building my predictive model for home prices. My first step was to set up the linear regression model and use reg.fit to train it. The traditional equation for a linear regression model is:
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + … + β̂ₙxₙ
But I decided to write the equation a little more simply for my model, in slope-intercept style: y = m₁x₁ + m₂x₂ + m₃x₃ + b. Plugging in my model's specific variables for home prices, with the features condition, year built, and zip code, this becomes:
price = m1 * condition + m2 * zipcode + m3 * yr_built + b
This will be the blueprint for my model, and now it's time to translate it into Python.
import numpy as np
import pandas as pd
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(housing_df_out[['condition', 'zipcode', 'yr_built']], housing_df_out.price)
Once we have fit our linear regression line, we next pull out the coefficients, which give us the m1, m2, and m3 values.
reg.coef_
This next step solves for the intercept value, which we can plug in for b.
reg.intercept_
Once you have solved for all the variables in the equation, you can plug in values for condition, zip code, and year built to get a predicted home price. You now have the freedom to swap in whichever features from your data set you please, and you can use whatever values you like to solve for your predicted price. Remember, whatever order you listed your features in when calling reg.fit is the order in which you need to supply their values. For my model, I listed my features in the order ['condition', 'zipcode', 'yr_built'], so I also have to plug in my values in that same order for the model to work correctly. I have two examples below:
This predictor solved for a home price of $436,008 for a home with a condition rating of 3, in zip code 98003, built in 1995.
reg.predict([[3,98003,1995]])
This predictor solved for a home price of $498,653 for a home with a condition rating of 5, in zip code 98101, built in 1986.
reg.predict([[5,98101,1986]])
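Under the hood, reg.predict is just evaluating the blueprint equation with the learned coefficients and intercept. Here's a sketch showing the two agree; note the training data below is made up to stand in for the real housing_df_out columns, so the prices are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for (condition, zipcode, yr_built) and price;
# these numbers are illustrative, not from the King County data set
X = np.array([
    [3, 98003, 1995],
    [5, 98101, 1986],
    [4, 98125, 1972],
    [2, 98003, 2001],
])
y = np.array([436000, 498000, 455000, 410000])

reg = LinearRegression()
reg.fit(X, y)

m1, m2, m3 = reg.coef_
b = reg.intercept_

# Evaluate the blueprint by hand: price = m1*condition + m2*zipcode + m3*yr_built + b
manual = m1 * 3 + m2 * 98003 + m3 * 1995 + b
predicted = reg.predict([[3, 98003, 1995]])[0]
print(np.isclose(manual, predicted))  # True
```

This is a nice sanity check that the coefficients and intercept really are all there is to the model.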
I think this is very cool, and once it is set up correctly, you can plug in different values to generate predictions. It's also very user friendly for non-technical users: at this point, they are just plugging in numbers and getting an outcome, with no complicated code needed.
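One way to make it even friendlier (my own sketch, not part of the original notebook) is to wrap the trained model in a small helper function, so a user only ever types three numbers. Again, the training data here is a made-up stand-in for housing_df_out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on toy data standing in for the real housing_df_out columns
X = np.array([
    [3, 98003, 1995],
    [5, 98101, 1986],
    [4, 98125, 1972],
    [2, 98003, 2001],
])
y = np.array([436000, 498000, 455000, 410000])
reg = LinearRegression().fit(X, y)

def predict_price(condition, zipcode, yr_built):
    """Return a predicted price; arguments must match the training column order."""
    return reg.predict([[condition, zipcode, yr_built]])[0]

# A user just calls the function with three plain numbers
print(round(predict_price(5, 98101, 1986)))
```

The function name and signature are hypothetical, but the pattern generalizes: whatever feature order you used in reg.fit becomes the argument order of your helper.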
I had a lot of fun creating this predictive model! I love making things as simple and efficient as possible, and this was definitely right up my alley. Try it out and let me know if it worked for you!