Adventures in Predictor Selection
The art and the science of figuring out which variables best predict home value.
By Alec Morgan & Kayli Leung
Data science is weird. As a data scientist you are the intermediary between an intractably massive pile of data on one side and a team of business decision makers on the other. Your allies in this task are AI algorithms: narrow in their comprehension of the world, but superhumanly fast. Today's AI is best thought of as a set of tools, and like any tool its usefulness depends on the skill with which you use it.
Imagine that, just as my project partner Kayli Leung and I did, you want to predict the prices of homes in a certain area such as King County. You’re coming armed with a veritable wealth of data — over 20,000 records of real home sales in the county from 2014 to 2015 — and each record includes 21 fields of information including price, square feet, # of bedrooms, # of bathrooms, condition, and over a dozen more.
Were we only predicting with one variable, it would be elementary to implement a simple OLS (Ordinary Least Squares) linear regression algorithm with which to predict price. However, we can create far more accurate predictions by analyzing the correlations of an entire set of variables with our target variable (which is price), which leaves only the question of which ones will give us the most accurate model. This approach is called multivariable linear regression.
Well, clearly all of these variables will have at least some relationship with the price of the house, so why not just use all of them? The answer is two-fold: overfitting and collinearity.
Overfitting is essentially what happens when your AI has a sheltered upbringing: by training on a small set of data, not unlike gleaning its entire worldview from a tiny slice of culture, it develops a very narrow and naive idea of what the world is like. In more practical terms, if you train your model on a lot of highly specific data, it will make highly specific (and highly accurate) predictions about that data. But it won’t generalize well. Throw it something it hasn’t seen before, and it will naively assume it will be like what it has already seen. We have to be very careful about this.
Collinearity is a closely related problem that makes overfitting more likely. The ‘co’ part means together and the ‘linearity’ part refers to a linear relationship: your predictors are correlating with each other in addition to what you’re actually trying to predict. This indicates that your variables are probably counting the same fact about your data multiple times.
For example, price tends to correlate strongly with # of bedrooms and square footage, but # of bedrooms and square footage also tend to correlate strongly with each other. By including # of bedrooms and square footage in the same model then, you’re more-or-less counting the size of the house twice. This artificially skews the numbers and overemphasizes the size of the house in your predictions. Note that these predictions will actually be accurate — at least, with the data your model was trained on. The issue is simply that your model will not generalize well to new data wherein your collinear variables have proportionally different relationships with each other.
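To make the bedrooms and square footage example concrete, here is a minimal sketch on synthetic data (not the King County records). It measures how strongly two predictors overlap using their correlation and the variance inflation factor (VIF), a common collinearity diagnostic that is not mentioned above and is brought in here purely as an illustration. All numbers and the helper name `vif` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: bedrooms mostly tracks the size of the house.
sqft = rng.normal(2000, 500, n)
bedrooms = sqft / 600 + rng.normal(0, 0.3, n)

def vif(target, others):
    """Variance inflation factor: 1 / (1 - R^2), where R^2 comes from
    regressing one predictor on the remaining predictors."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

corr = np.corrcoef(sqft, bedrooms)[0, 1]
vif_bedrooms = vif(bedrooms, [sqft])

print(f"correlation between predictors: {corr:.2f}")
print(f"VIF for bedrooms given sqft:    {vif_bedrooms:.1f}")
```

A VIF close to 1 means a predictor carries information the others do not; the larger it gets, the more that predictor is repeating what the model already knows.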
Clearly then we need to apply more care when choosing our predictors, but how? We can start with this.
This heatmap shows the correlations, both positive and negative, between all our different variables. We want as little correlation between our predictors as possible, i.e. to keep only the lightest-colored cells. So for starters let’s just remove the worst culprits of collinearity, especially if they don’t even correlate well with price in the first place.
Much better. You’ll notice however that we still have some clusters of red, particularly in the intersection of price, bedrooms, bathrooms, and sqft_living. These are unfortunately also our best predictors of price. We removed many variables with weaker price predicting capability and stronger collinearity, but we need at least some strong predictors of price, so this is as far as we dare go.
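The numbers behind a heatmap like this come from a correlation matrix. The sketch below, on synthetic data, computes one with pandas and lists predictor pairs whose correlation exceeds a cutoff. The column `sqft_above`, the 0.75 threshold, and the helper `collinear_pairs` are illustrative assumptions rather than the values we actually used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
sqft_living = rng.normal(2000, 600, n)

# Synthetic stand-in for the housing table.
df = pd.DataFrame({
    "price": 200 * sqft_living + rng.normal(0, 80_000, n),
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 100, n),
    "condition": rng.integers(1, 6, n).astype(float),
})

corr = df.corr()  # pairwise Pearson correlations

def collinear_pairs(corr, target="price", threshold=0.75):
    """Return predictor pairs whose absolute correlation exceeds threshold."""
    predictors = corr.drop(index=target, columns=target)
    cols = list(predictors.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = float(predictors.loc[a, b])
            if abs(value) > threshold:
                pairs.append((a, b, round(value, 2)))
    return pairs

pairs = collinear_pairs(corr)
print(pairs)
print(corr["price"].drop("price").sort_values(ascending=False))
```

From each flagged pair, a reasonable rule of thumb is to drop whichever member correlates less with price.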
As we now move into training and testing our first models, we’ll be making two noteworthy tweaks to our data that will increase the accuracy of our predictions significantly. The first is to subset our data to the bottom 90 percent of house prices, i.e. everything at or below the 90th percentile.
The top 10 percent includes many price outliers which would inevitably skew our model’s predictions, so we make a judgement call here: better to predict most home prices with better accuracy than all home prices with skewed accuracy.
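A minimal sketch of this subsetting step, assuming the data lives in a pandas DataFrame with a `price` column. The prices here are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic right-skewed prices standing in for the real sales data.
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=1000)})

cutoff = df["price"].quantile(0.90)    # price at the 90th percentile
subset = df[df["price"] <= cutoff]     # keep everything at or below it

print(f"cutoff: {cutoff:,.0f}")
print(f"kept {len(subset)} of {len(df)} rows")
```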
The second change is to predict the logarithm of price instead of price itself. The reason for this comes down to our algorithm of choice: linear regression. Linear regression works best when its errors are roughly normally distributed around a straight line, and a heavily skewed target variable makes that hard to achieve. Price does not have a fairly normal distribution.
But the logarithm of price does.
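The effect of the log transform can be checked numerically with skewness, which is near zero for a symmetric distribution. This is a sketch on synthetic prices, and `skewness` is a small helper written for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic right-skewed prices standing in for the real sales data.
price = rng.lognormal(mean=13, sigma=0.5, size=5000)

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

log_price = np.log(price)
skew_raw = skewness(price)
skew_log = skewness(log_price)

print(f"skewness of price:      {skew_raw:.2f}")
print(f"skewness of log(price): {skew_log:.2f}")

# A model trained on log(price) predicts on the log scale,
# so predictions are mapped back to dollars with np.exp.
recovered = np.exp(log_price)
```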
Back to the question at hand: how to choose our predictors? One strategy is to build simple linear regression models with each predictor individually and then pick the top n performers. We initially built our own functions from scratch to automatically build and test simple models for us.
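A sketch of what such a from-scratch helper might look like: fit a one-predictor model per column, score each on held-out data, and keep the top n. This is not our original code; the function name `rank_single_predictors`, the synthetic data, and the choice of R² as the score are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 1000

# Synthetic predictors and a synthetic log-price target.
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 600, n),
    "bathrooms": rng.normal(2, 0.7, n),
    "condition": rng.integers(1, 6, n).astype(float),
})
y = 12 + 0.0004 * X["sqft_living"] + 0.2 * X["bathrooms"] + rng.normal(0, 0.2, n)

def rank_single_predictors(X, y, top_n=2, seed=0):
    """Fit one simple regression per column and rank columns by test-set R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = {}
    for col in X.columns:
        model = LinearRegression().fit(X_train[[col]], y_train)
        scores[col] = model.score(X_test[[col]], y_test)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

top = rank_single_predictors(X, y)
print(top)
```

One weakness of this approach is that it scores each predictor in isolation, so it cannot tell when two top performers are really the same signal twice.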
However, we then discovered that sklearn contains an entire feature selection module that handles this kind of task even better, including a tool called RFE (recursive feature elimination). Ah, the joys of Python!
Simply pass in a linear regression object and declare that you want to use n features to make your predictions, and voila: RFE tells you exactly which ones to use. It does so by fitting the model on all the features, dropping the weakest one, and repeating until only n remain. This is no trivial task, since the collinearity issue means that adding a new predictor can reduce the usefulness of the previous ones. For now we can just be grateful that this particular trail has already been blazed by those brave souls that were optimizing their linear regression models before us.
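A minimal sketch of that call using `sklearn.feature_selection.RFE`, on synthetic data. The column names (including `grade` and the two noise columns) and the choice of 3 features are illustrative assumptions, not our final configuration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 1000

# Five synthetic predictors on a common scale; only three drive the target.
X = pd.DataFrame(
    rng.normal(size=(n, 5)),
    columns=["sqft_living", "bathrooms", "grade", "noise_a", "noise_b"],
)
y = (0.5 * X["sqft_living"] + 0.3 * X["bathrooms"] + 0.4 * X["grade"]
     + rng.normal(0, 0.1, n))

# Ask RFE to keep the 3 strongest features for a linear regression.
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

selected = list(X.columns[selector.support_])  # boolean mask of kept features
print(selected)
print(dict(zip(X.columns, selector.ranking_)))  # 1 = kept, higher = dropped earlier
```

With a linear model, RFE judges features by the size of their coefficients, so it is usually worth putting the predictors on comparable scales before running it.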
Our final model’s accuracy looks like this.
By plotting the quantiles of our model’s error values against those of a standard normal distribution (a Q-Q plot), we’re able to get a direct visual of what is in reality a 9-dimensional model (because we used 9 predictors to get these results). At most points it’s remarkably accurate, although it drifts somewhat at the low end and high end of real estate prices.
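The numbers behind a Q-Q plot can be computed with `scipy.stats.probplot`, which pairs the sorted residuals with the quantiles of a normal distribution. This sketch uses synthetic residuals as a stand-in for the real ones (actual minus predicted log price) and skips the drawing itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Stand-in for y_test - predictions from a fitted model.
residuals = rng.normal(0, 0.25, 2000)

# Returns the points of the Q-Q plot plus a least-squares line through them.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"slope: {slope:.3f}, intercept: {intercept:.3f}, r: {r:.4f}")
```

An `r` close to 1 means the points hug a straight line, i.e. the errors look normal. Points peeling away from the line at either end correspond to the drift at the extremes of the price range.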
We had one week to build this project. Given more time, how could it be improved further? We have several ideas.
Zipcodes have some correlation with price, but higher zipcode numbers don’t necessarily correlate with higher price. By converting zipcodes into dummy variables we can cause a modelable relationship to emerge, giving us one more useful predictor. Similarly, we used latitude and longitude to engineer a distance from downtown Seattle feature. However, there are several employers in the greater Seattle area that drive significantly higher real estate prices in their respective locales, and so by engineering features for distance from each of these, accuracy might be improved significantly. We might also seek out more data to train our model on, or transform the sale dates into a more usable format so that we might use them to model the seasonality of house sales. All this and more would be possible with just slightly more time.
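Two of these ideas can be sketched briefly: dummy variables for zipcodes and a distance-from-downtown feature. The column names `zipcode`, `lat`, and `long`, the approximate downtown coordinates, the sample rows, and the use of the haversine formula are all assumptions made for illustration; our own distance calculation may have differed.

```python
import numpy as np
import pandas as pd

# Approximate coordinates for downtown Seattle (assumed for this sketch).
SEATTLE_LAT, SEATTLE_LONG = 47.6062, -122.3321

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on Earth."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Three made-up rows standing in for home sales.
df = pd.DataFrame({
    "zipcode": [98101, 98004, 98101],
    "lat": [47.6062, 47.6101, 47.7000],
    "long": [-122.3321, -122.2015, -122.3300],
})

# Distance feature: the same function could be reused for any other landmark.
df["dist_downtown_km"] = haversine_km(df["lat"], df["long"], SEATTLE_LAT, SEATTLE_LONG)

# One 0/1 column per zipcode; drop_first avoids a perfectly collinear set.
df = pd.get_dummies(df, columns=["zipcode"], prefix="zip", drop_first=True)

print(df)
```

Dropping one dummy column matters here for the same reason as before: a full set of zipcode dummies always sums to 1, which is collinearity in its purest form.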
In conclusion, predictor selection is somewhere in between an art and a science. Figuring out how to balance bias and variance is partially a matter of using established tools and methods, but also simply of using good judgement and intuition. I hope that while reading this you’ve learned something new and gained some useful insight into the feature selection process!