Lessons from building models to predict housing prices.

Zach Green
Sep 4, 2018 · 3 min read

In one of my recent projects, I explored the process of building a linear regression model to predict housing prices in Ames, Iowa. The approach I took was to be as exhaustive as possible: test all possible combinations of the features contained in the data, along with their interaction terms.
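As a rough illustration of what that brute-force idea looks like in code, here is a minimal sketch. It assumes a training file I'm calling ames_train.csv, a handful of illustrative column names with no missing values, and scikit-learn's cross-validation; none of these details are taken from the actual project.

```python
# Minimal sketch of exhaustive subset search with cross-validated linear
# regression. Column names are illustrative stand-ins, assumed to be clean.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("ames_train.csv")          # assumed file name
candidates = ["Gr Liv Area", "Overall Qual", "Year Built", "Garage Area"]
y = df["SalePrice"]

results = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        X = df[list(subset)]
        score = cross_val_score(LinearRegression(), X, y, cv=5).mean()
        results.append((score, subset))

# Rank the subsets by mean cross-validated R^2 and show the top few
for score, subset in sorted(results, reverse=True)[:5]:
    print(round(score, 3), subset)
```

With only four candidate features this finishes instantly; the trouble described below starts when the candidate list grows.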

One of the pros of this approach is that it limits the potential for my own bias to creep in and lets the data determine which predictors matter for the model. It is rigorous: every possible model gets tested, so no stone is left unturned.

One of the cons of this approach is that it can become very computationally expensive. As the number of features increases, the number of possible models increases exponentially. At a certain point, the amount of computation becomes infeasible.
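A quick back-of-the-envelope calculation shows how fast this blows up. With p candidate predictors there are 2^p - 1 non-empty subsets to score, before interaction terms are even considered; the numbers below are purely illustrative, not the actual Ames feature count.

```python
# Count of non-empty feature subsets for a few hypothetical feature counts
for p in (10, 20, 40, 80):
    print(f"{p} features -> {2**p - 1:,} possible subsets")
```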

How can the phenomenon of “regression to the mean” affect the results?

If you test all possible models and pick out the ones that score the highest, you are bound to see some regression to the mean when you make predictions in the real world. A portion of the “best” models will simply happen to fit the test set better than they normally would, and when the model is deployed and making predictions, its performance is likely to regress to what it truly should be.

Let's say we even run a hypothesis test at the 95% confidence level on each model we try, and assume for the sake of argument that no model can predict prices better than chance (that the null model is the true model). Out of the millions of possible models tested, you would still expect to see thousands of models that score significantly better than chance, simply by chance. Now, assuming the null model is not the “true” model and there is some relationship between the predictors and the target variable, you would still expect some combinations of features to score significantly better than they should, just due to chance. Thus, when you pick the best model out of all possible models, you are likely to see its performance decrease (hopefully only slightly) when generalizing to new data, due to the “regression to the mean” phenomenon.
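A small simulation makes the multiple-comparisons point concrete. Below is a hedged sketch, not part of the original project: even when the predictor is pure noise (the null model is true), fitting many single-feature regressions at a 5% significance level still flags roughly 5% of them as “significant.”

```python
# Simulate many regressions where the target is unrelated to the predictor,
# and count how many pass a 5% significance test purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_models, n_obs = 10_000, 200

false_positives = 0
for _ in range(n_models):
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)          # target has no real relationship to x
    res = stats.linregress(x, y)
    if res.pvalue < 0.05:
        false_positives += 1

print(false_positives)                   # roughly 500 of the 10,000 noise models
```

Scale that up to millions of candidate models and the “thousands of significant models by chance alone” figure above follows directly.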

One of the feature selection methods I used in this project was the recursive feature elimination function. This was useful for generating lists of important features to investigate further, but in this context it must be used with caution. Many of the models generated with this method were not mathematically valid because they contained certain interaction terms between the predictors without also including the main effects, i.e. the “parent” predictors, in the model.
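Here is a sketch of that caution, assuming scikit-learn's RFE is the function in question (the file and column names are again illustrative). RFE ranks the expanded features, including interaction terms, and nothing stops it from keeping an interaction like “Gr Liv Area × Overall Qual” while dropping one of its parents, so the selected set has to be checked by hand.

```python
# RFE over a design matrix that includes pairwise interaction terms.
# Selected features may include an interaction without its main effects.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("ames_train.csv")                         # assumed file name
base_cols = ["Gr Liv Area", "Overall Qual", "Year Built"]  # illustrative columns
y = df["SalePrice"]

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X = poly.fit_transform(df[base_cols])
names = poly.get_feature_names_out(base_cols)

selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
kept = [name for name, keep in zip(names, selector.support_) if keep]
print(kept)   # an interaction term can appear here without both of its parents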

Some things that were helpful for me in this project:

When starting this project, I wrote down, as comments in my code notebook, an outline of how I wanted to approach the problem and the steps I would need to take. I then filled in the code between the comments as I worked on the project. This helped me wrap my mind around everything that needed to be done and broke the problem down into smaller sub-steps.
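For a sense of what that looks like, the outline below is roughly the kind of comment skeleton described above; the headings are illustrative, not the actual ones from the project.

```python
# 1. Load the Ames training data and inspect dtypes / missing values
# 2. Clean the data and impute missing values
# 3. Engineer features and interaction terms
# 4. Feature selection (exhaustive search, recursive feature elimination)
# 5. Fit and cross-validate candidate linear regression models
# 6. Evaluate the final model on held-out data and interpret coefficients
```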

After doing this project, I began keeping more self-made “cheat sheets” in my notes, with useful blocks of code I may need to use again one day. These are really handy to have tucked away so that I can drop them into new projects and adapt the code to whatever new problem I'm facing.