Predicting King County House Prices with Multiple Linear Regression

Avonlea Fisher
Published in Analytics Vidhya · Aug 29, 2020

Introduction

The King County House Dataset contains a wealth of information about the price, size, location, condition and various other features of houses in Washington’s King County. In this article, I’ll present how I built a multiple linear regression model in Python to predict house prices.

Here is a list of the modules I used in this analysis. Many, but not all, appear in the code embedded below.
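
Based on the steps that follow, the imports look roughly like this (the exact list may differ from the original):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
```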

To get started, I read the data into a pandas dataframe and ran df.info() to learn about the shape, columns and datatypes.
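
In code (the file name is an assumption; the dataset is commonly distributed as kc_house_data.csv):

```python
# Read the King County house data and inspect its shape, columns and dtypes
df = pd.read_csv('kc_house_data.csv')  # file name assumed
df.info()
```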

We can see that it’s a large dataset, containing more than 21 thousand entries and 20 columns. Almost all of the columns contain numeric data, which is convenient for linear regression!

Missing Values

Next, I looked at the proportion of missing values in each column.
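
A quick way to do this (a sketch, not necessarily the original code):

```python
# Fraction of missing values in each column, largest first
df.isna().mean().sort_values(ascending=False)
```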

I then looked at the value counts for each of the three variables with missing values. I decided to drop ‘yr_renovated’ due to its high proportion of missing values, fill the ‘waterfront’ NAs with 0 (which matches the majority of values in the column and indicates ‘no waterfront’), and fill the small number of missing ‘view’ values with the column’s mean.
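
A sketch of those three steps:

```python
df = df.drop(columns=['yr_renovated'])               # too many missing values
df['waterfront'] = df['waterfront'].fillna(0)        # 0 = no waterfront
df['view'] = df['view'].fillna(df['view'].mean())    # only a handful of NAs
```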

Outliers

To exclude buildings that likely aren’t single-family homes, I checked for outliers in the bedrooms and bathrooms columns using the following code format:
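
A sketch of that format (the original code may differ slightly):

```python
# Flag values more than 3 standard deviations from the mean and drop those rows
for col in ['bathrooms', 'bedrooms']:
    z = (df[col] - df[col].mean()) / df[col].std()
    print(col, (z.abs() > 3).sum())   # number of outliers in this column
    df = df[z.abs() <= 3]
```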

The outlier count was 187 for bathrooms, and 47 (excluding overlaps) for bedrooms. The above code treats any value that’s more than 3 standard deviations away from the mean as an outlier, and drops all the rows containing outliers.

One-Hot Encoding

There’s one variable in the dataset that we would expect to be highly related with price, but which doesn’t have a clear linear relationship: condition. The value descriptions for this column state that it is coded on a 1–5 linear scale, with 1 being ‘poor’ and 5 being ‘very good.’

Interestingly, the below scatterplot shows that ‘average’ houses tend to perform the best in terms of price:

This may be due to the fact that the condition values are relative to age and grade (which refers to the design/quality of construction rather than the utility/functionality of the building). A new, excellently designed, very expensive building could be given an ‘average’ condition rating if some functional repairs are needed. As we can see in the scatterplot below, average condition houses also tend to have the highest grade rating.

There might be a clearer linear relationship between price and specific condition values, which we can explore more effectively by one-hot encoding the variable. The below code creates a dummy variable for each condition value, drops the first value to avoid the dummy variable trap, drops the original column, and joins the new variables to the data frame.
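
Something along these lines does the job (a sketch using pandas.get_dummies):

```python
# Create a dummy variable for each condition rating, dropping the first
condition_dummies = pd.get_dummies(df['condition'], prefix='condition', drop_first=True)
df = df.drop(columns=['condition']).join(condition_dummies)
```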

Now we can visualize the relationship between each condition rating and house price on scatterplots. A value of 1 indicates that the house has the condition rating that corresponds to each column, while a value of 0 means that it does not:

Now there are linear relationships for each column, with condition ratings 2 and 3 having the most pronounced relationships.

Multiple Linear Regression Assumptions

There are four assumptions that must be checked as part of the multiple linear regression analysis process:

  • No multicollinearity
  • Linear relationship between explanatory and response variables
  • Homoscedasticity of error terms
  • Normal distribution of model residuals

We’ll dig deeper into each of these assumptions as we check them.

Multicollinearity

Using seaborn, I created a heatmap of correlations between each variable and all the others. The absolute value of the correlations was calculated because only the strength, not the direction, of a linear relationship matters for satisfying this assumption.
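
In code, that looks something like this (figure size and color palette are arbitrary choices):

```python
# Heatmap of absolute pairwise correlations between numeric columns
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes('number').corr().abs(), cmap='Blues')
plt.show()
```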

As we might expect, many of the variables related to the size of homes (e.g. sqft, number of rooms) are highly correlated with each other. Generally, we don’t want to include any two x variables whose correlation exceeds .80 in the same model. We do, however, want to include variables that are correlated with the y variable. Below, I create and display two small data frames that narrow down the correlations we are most interested in.
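
One way to build those two views (a sketch; the 0.80 threshold follows the rule of thumb above):

```python
corr = df.select_dtypes('number').corr().abs()

# Pairs of explanatory variables whose correlation exceeds 0.80
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
print(upper.stack().loc[lambda s: s > 0.80].sort_values(ascending=False))

# Variables most strongly correlated with price
print(corr['price'].drop('price').sort_values(ascending=False).head(10))
```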

Although sqft_living and sqft_above are both highly correlated with price, only one of them can be included in a multiple regression model because they are also highly correlated with each other. Fortunately, the other two variables whose correlation violates this assumption (condition ratings 3 and 4) have a relatively weak correlation with price.

Linear Relationships

Scatterplots are a simple way to check if the relationships between explanatory and response variables are linear. After creating objects for the variables that have the strongest correlation with price, I made the following scatterplots.
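
A sketch of those plots, with the set of columns inferred from the variables discussed below:

```python
top_features = ['sqft_living', 'sqft_above', 'sqft_living15', 'grade',
                'bathrooms', 'view', 'bedrooms', 'floors']
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for ax, col in zip(axes.flatten(), top_features):
    ax.scatter(df[col], df['price'], alpha=0.2)
    ax.set_xlabel(col)
    ax.set_ylabel('price')
plt.tight_layout()
plt.show()
```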

The ‘view’, ‘floors’, and ‘bedrooms’ variables do not have a clear linear relationship with house price. Sqft_living has a stronger linear relationship with price than sqft_above, so it will be used in the multiple regression model; because the two are highly correlated with each other, sqft_above will be excluded.

Building a Multiple Regression Model

The next two assumptions, normality and homoscedasticity, require us to first create a regression model, because they refer to a model’s residuals rather than its features. In linear regression, residuals are the vertical distances between actual values and those estimated by the regression line. Before creating a multiple regression model, I completed a simple linear regression analysis for bathrooms, grade, sqft_living, and sqft_living15. During this process, I discovered that the residuals for the sqft variables didn’t fully satisfy normality and homoscedasticity. To address this, I used np.log() to transform price and the sqft values to their natural logarithms.
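
The transformation itself is short (shown here for the variables that end up in the final model):

```python
# Natural-log transform price and the sqft variables
df['price'] = np.log(df['price'])
df['sqft_living'] = np.log(df['sqft_living'])
df['sqft_living15'] = np.log(df['sqft_living15'])
```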

For the multiple regression model, I included the four variables that had the highest correlations with price without violating the multicollinearity assumption:
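
A sketch of that model using statsmodels, with the formula reflecting the four predictors named in the conclusion:

```python
# Fit an OLS model with the four chosen predictors and print the summary table
model = ols('price ~ sqft_living + grade + bathrooms + sqft_living15', data=df).fit()
print(model.summary())
```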

The r-squared value, 0.535, indicates that the model can account for about 53% of the variability of price around its mean. The null hypothesis for multiple regression is that there is no relationship between the chosen explanatory variables and the response variable. All of the p-values round to 0, which means we can reject the null hypothesis. Now we can confirm that the model satisfies the assumptions of normality and homoscedasticity.

Normality of Residuals

Quantile-quantile plots are one way to check the normality assumption. If the residuals are normally distributed, their points will mostly fall along a straight line. Below is a Q-Q plot of the model’s residuals.
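
statsmodels has a convenient helper for this:

```python
# Q-Q plot of the model's residuals against a standard normal distribution
sm.qqplot(model.resid, line='45', fit=True)
plt.show()
```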

Since almost all of the data points fall along a straight line in this QQ-plot, we can consider the normality assumption satisfied.

Homoscedasticity

The homoscedasticity assumption states that for any value of x, the variance of the residuals is roughly the same. To visualize this, I made a scatterplot with the model’s residuals on the y axis and fitted values on the x-axis. For the homoscedasticity assumption to be satisfied, the shape of the points should be roughly symmetrical across a line at y=0.
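
A sketch of that plot:

```python
# Residuals vs. fitted values; a symmetric band around y = 0 indicates homoscedasticity
plt.scatter(model.fittedvalues, model.resid, alpha=0.2)
plt.axhline(y=0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```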

We can see that the points form a roughly symmetrical, blob-like shape that is consistent across the x-axis, so the model satisfies the last of the four assumptions: homoscedasticity.

Model Validation

The final step in evaluating the quality of the model is cross-validation, which gives us an idea of how the model would perform with new data for the same variables. I used sklearn’s train_test_split function to split the data into two subsets: one that the model will be trained on, and another that it will be tested on. By default, the function takes 75% of the data as the training subset and the other 25% as its test subset.

Mean Squared Error

The code below creates train and test subsets for the x and y variables, uses the x subsets to predict new y values, and then calculates the distance between these predictions and the actual y values. Finally, the mean_squared_error function calculates the MSE for both subsets.
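
Putting those steps together (a sketch: variable names and the random seed are arbitrary, and scikit-learn's LinearRegression stands in for the refit):

```python
from sklearn.linear_model import LinearRegression

# Train/test split (75%/25% by default), refit, and compare errors
X = df[['sqft_living', 'grade', 'bathrooms', 'sqft_living15']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, linreg.predict(X_train))
test_mse = mean_squared_error(y_test, linreg.predict(X_test))
print('Train MSE:', train_mse)
print('Test MSE:', test_mse)
```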

The MSEs for the train and test subsets are similar, which suggests that the model will perform similarly on different data.

Conclusions

Together, square footage, grade, number of bathrooms, and the size of neighbors’ homes function as the best predictors of a house’s price in King County. The model does have some limitations: given that some of the variables needed to be log-transformed to satisfy regression assumptions, any new data used with the model would have to undergo similar preprocessing. Additionally, given regional differences in housing prices, the model’s applicability to data from other counties may be limited.

So, if you’re looking for housing that won’t break the bank, it may be wise to skimp on the square footage and share a bathroom with three other people. But aren’t most of us urban dwellers already doing that?

Full code available on GitHub
