Predicting Home Prices in Ames, IA with Machine Learning
Project Overview
The objective of this project is to build a regression model on the Ames Housing Dataset that predicts the sale price of a house in Ames, IA.
Ames is a city in central Iowa approximately 30 miles (48 km) north of Des Moines. It is best known as the home of Iowa State University (ISU), with leading Agriculture, Design, Engineering, and Veterinary Medicine colleges. In 2017, Ames had a population of 66,498.
Dataset
The dataset contains 79 feature columns, many of which overlap in describing the 2,051 properties; six columns describe the garage alone. I combined and/or eliminated a number of features to reduce overfitting and collinearity in the model. The features cover the size and quality of interior and exterior elements of each home, along with detailed information on the foundation, garage, basement, roof, siding, age, and maintenance. Once the dataset was loaded into Python, I was able to evaluate it in greater detail.
Data Cleaning
The dataset required a significant amount of cleaning: twenty-six features contained NaN values, and missing data was removed from all features used in the model. Some features had more than 100 rows of missing data; using them would have required removing the same rows from every other feature. Given the large number of available features, those with excessively high NaN counts were simply excluded.
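The cleaning step above can be sketched with pandas. The column names and the missing-value threshold here are illustrative, not the exact ones used in the project:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Ames data; 'Pool QC' plays the role
# of a feature with too many missing values (hypothetical example).
df = pd.DataFrame({
    "Overall Qual": [5, 7, 6, 8],
    "Pool QC": [np.nan, np.nan, np.nan, "Gd"],   # mostly missing
    "Gr Liv Area": [1200, 1800, 1500, 2100],
})

# Drop any feature whose missing-value count exceeds a threshold,
# rather than dropping the affected rows from every other feature.
max_missing = 2
keep = [c for c in df.columns if df[c].isna().sum() <= max_missing]
cleaned = df[keep]
```

This keeps the row count intact for the features that survive, which matters when many features would otherwise lose the same rows.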
Columns holding object (text) data had to be converted to numeric values for each level of the category. For example, six new sub-categorical columns with numeric values were created for the six different types of roofing material used on the homes.
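This conversion is the standard dummy-variable encoding; a minimal sketch with `pd.get_dummies`, using a made-up sample of the roofing column:

```python
import pandas as pd

# Hypothetical sample of the roofing-material column; the real
# dataset distinguishes six materials.
roofs = pd.DataFrame({"Roof Matl": ["CompShg", "Tar&Grv", "CompShg", "WdShngl"]})

# One new 0/1 column per material present in the sample.
dummies = pd.get_dummies(roofs["Roof Matl"], prefix="Roof")
```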
Features like ‘Age of home’ can be misleading if complementary features like ‘Year of remodeling’ are not included as well to account for the replacement of depreciated materials. Both of these features were included in the model.
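Deriving both age features is straightforward; a sketch with illustrative column names (the Ames dataset uses similar year columns, but the exact names here are assumptions):

```python
import pandas as pd

# Illustrative year columns; the project derives similar age features.
df = pd.DataFrame({
    "Year Built": [1950, 2005],
    "Year Remod/Add": [1998, 2005],
    "Yr Sold": [2010, 2010],
})

# Age alone would treat both homes as their build year implies;
# years since remodel captures replaced, newer materials.
df["Years Since Built"] = df["Yr Sold"] - df["Year Built"]
df["Years Since Remodel"] = df["Yr Sold"] - df["Year Remod/Add"]
```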
Feature Engineering
To reduce the number of features and create more general property features, I engineered a handful of new ones.
Features with overlapping data, like half baths and full baths, were combined into a single column. Non-numeric features deemed significant were broken into multiple sub-categories and given numeric values. The next step in weighing each feature's relative value was to sort the features by their correlation with price. The graphic below shows the top seven features correlated with the pricing variable:
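Both steps, combining the bath columns and ranking by correlation, can be sketched as follows. The data is made up; the column names mirror the Ames dataset:

```python
import pandas as pd

# Toy data; values are invented for illustration.
df = pd.DataFrame({
    "Full Bath": [1, 2, 2, 3],
    "Half Bath": [1, 0, 1, 1],
    "Overall Qual": [4, 6, 7, 9],
    "SalePrice": [110_000, 180_000, 220_000, 340_000],
})

# Count a half bath as 0.5 of a full bath, then drop the originals.
df["Total Baths"] = df["Full Bath"] + 0.5 * df["Half Bath"]
df = df.drop(columns=["Full Bath", "Half Bath"])

# Rank the remaining features by their correlation with sale price.
corr = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
```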
Exploratory Data Analysis
As shown above, the ‘Overall Quality’ feature has the greatest correlation with the price of houses. Numerous other features address the material quality of specific interior and exterior areas of the homes (kitchen, garage, roofing material, exterior veneer, and basement). Given this, only a few specific quality features were added, such as roof type and kitchen quality, as both have a relatively large effect on housing resale values. As an adjuster (and having purchased two homes), I know that kitchen quality is closely tied to resale value. Roofing materials also vary hugely in price; a tile roof, for example, costs five times as much as a typical composition shingle roof.
In analyzing the data, it quickly became clear that collinearity would be an issue: five of the top seven correlating features relate to square footage. The third- and fourth-ranked features, ‘Garage Cars’ and ‘Garage SF’, are almost synonymous, since the more cars a garage holds, the larger its square footage.
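A quick pairwise correlation check makes this kind of redundancy visible. Here is a sketch on synthetic data built so that garage square footage tracks car capacity, roughly as in the real dataset:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
cars = rng.integers(1, 4, size=50)                  # 1 to 3 car garages
# Square footage invented to track capacity closely, with small noise.
garage_sf = cars * 250 + rng.normal(0, 20, size=50)

df = pd.DataFrame({"Garage Cars": cars, "Garage SF": garage_sf})

# A pairwise correlation near 1 flags the two features as redundant;
# keeping both would inject collinearity into a linear model.
r = df["Garage Cars"].corr(df["Garage SF"])
```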
One solution to collinearity implemented here was to consolidate similar features where possible, such as the number of full baths and half baths.
As mentioned above, age is an important variable but remodeling should be considered as the condition and value of the homes vary depending on the amount of upkeep they have had.
Model Selection
Given my background in adjusting, I decided to use my knowledge of construction (and home resale value) to build the model manually: develop an initial model, then add features until reaching a satisfactory cross-validation or R² score. After the initial model, I made two adjustments, and all three models were evaluated with both linear regression and Ridge. Lasso (commonly used to aggressively remove features and reduce overfitting) was not used, since I was selecting features by hand.
After a preliminary data analysis evaluating feature correlations with price, and judgments made to avoid extreme collinearity, the following features were chosen for the initial model: Overall Quality, Total Rooms Above Ground, Garage Cars, Years Since Built, Total Baths, Total Basement Square Feet, Year Remodeled/Addition, Open Porch Square Feet, and Finished Basement Square Feet.
Because kitchen resale value is one of the most significant factors in home resale value, I decided to add this feature after creating dummy variables for the categorical item ‘Kitchen Quality’. Once this feature was added and evaluated, the same was done for the ‘Roof Quality’ feature.
Modeling and Evaluation
A linear regression model was built with the initial features; fitting and running it returned a cross-validation score of 0.78. The same data run through a Ridge model returned an R² value of 0.76 at alpha = 1.
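The evaluation pattern for both models can be sketched with scikit-learn. The data here is synthetic; the real model used the nine Ames features listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the nine chosen features and sale price.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(0, 0.5, size=200)

# Mean cross-validated R^2 for plain linear regression
# (cross_val_score defaults to R^2 for regressors).
cv_score = cross_val_score(LinearRegression(), X, y, cv=5).mean()

# Ridge at alpha=1, as in the write-up, scored on the same data.
ridge = Ridge(alpha=1).fit(X, y)
r2 = ridge.score(X, y)
```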
Next, I created dummy variables for the categorical ‘Kitchen Quality’ feature. Fit and run in a linear regression model, the data returned a cross-validation score of 0.77; a Ridge model on the same features returned an R² value of 0.84 at alpha = 1. Because that Ridge score was the highest of the models so far, I’ve included the graph below comparing the predicted and true values. The model under-predicts the values of higher-end homes.
The final model added the ‘Roof Material Quality’ feature after I converted its object data into multiple dummy variables with numeric values. Its linear regression cross-validation score was 0.82, the highest of the three models, though it also under-predicts prices of higher-end homes. The same features returned an R² value of 0.76 when run with a Ridge model. A high percentage of homes have composition shingle roofs, which may explain the lower Ridge score; and while some roofing materials are far more expensive than others, buyers usually focus less on roofing than on kitchen quality. This model returned the highest score when submitted to the Kaggle competition. The chart below compares the predicted and actual values with the roofing variables added as features.
Conclusions
In summary, the model predicts home values with surprising accuracy, given that features were added manually based on correlation rankings and past experience with housing materials of varying quality. This model could assist home builders or real estate agents in determining appropriate sale and purchase prices of homes in the Ames, IA area.
In the future, I’d like to take advantage of polynomial features, SelectKBest (SKB), and recursive feature elimination (RFE) to generate even more accurate models. I would also like to optimize algorithm hyperparameters using GridSearchCV.
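That hyperparameter search might look like the sketch below: a grid search over the Ridge regularization strength, scored by cross-validated R². The data and alpha grid are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data in place of the Ames features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.3, size=100)

# Search over the Ridge alpha by 5-fold cross-validated R^2,
# instead of fixing alpha = 1 as in the models above.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```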