House Prices Prediction With Regression
Overview
A home is probably the most expensive purchase a person will make in his or her lifetime, and the most important constraint when buying one is the value of the property. A good model that predicts property values can therefore help users shortlist and buy their dream home.
Also, with increasing population and urbanization, such a model can help the various stakeholders involved in the planning, construction and expansion of towns and cities.
Problem Statement
Here I try to build a model on the Ames Housing dataset to predict the house price using regression techniques. You can find the code for the implementation here.
We will follow these steps to build our model:
1. Exploratory data analysis of the data
2. Data cleaning — Missing data
3. Categorical Features — Encoding and Dummies
4. Numerical Features — Normality, Skew and Kurtosis
5. Baseline models pipelines
6. Grid Search for best parameters for top three baseline models
7. Stacked models combining top three models
Metrics
I evaluate the models on the Root Mean Squared Error between the logarithm of the predicted value and the logarithm of the observed sale price (RMSLE). Taking logs ensures that errors in predicting expensive houses and cheap houses affect the result equally.
This measurement is useful when there is a wide range in the target variable and you do not necessarily want to penalize large errors when the predicted and target values are themselves high. It is also effective when we care about percentage errors rather than absolute errors.
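As a quick illustration, a minimal RMSLE implementation in Python might look like this (assuming predicted and observed prices are plain NumPy arrays of raw dollar values):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log-prices, i.e. RMSLE on raw prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A 10% miss on a cheap house and a 10% miss on an expensive house
# contribute the same amount to the metric.
print(rmsle(np.array([100_000.0, 500_000.0]),
            np.array([110_000.0, 550_000.0])))
```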
Data Exploration and Data Visualization
Target Feature
SalePrice is the variable we need to predict. Let's look at its distribution.
We can see that the distribution:
- Deviates from the normal distribution.
- Has appreciable positive skewness.
- Shows peakedness (high kurtosis).
Since the distribution is skewed, we have to transform it so that it more closely follows a normal distribution.
We can use different transformations to achieve this, such as:
1. Log Transformation
2. Box-Cox Transformation (requires estimation of the lambda parameter)
3. Square Root Transformation
Here we go with the log transformation to normalize the skewed target (and, later, the skewed features).
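A minimal sketch of this step, assuming the training data is loaded into a pandas DataFrame called train:

```python
import numpy as np

# log1p = log(1 + x); predictions on this scale are mapped back to
# dollar prices later with np.expm1.
train["SalePrice"] = np.log1p(train["SalePrice"])
```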
We can see that the log transformation has pretty much normalized the distribution.
Missing Data
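A rough sketch of a typical imputation strategy for this dataset is shown below. The column choices and fill rules are illustrative assumptions based on the Ames data dictionary (applied here to a combined train/test frame called all_data), not necessarily the exact rules used in the linked code:

```python
# In many Ames columns, NA means "feature absent" rather than "value unknown".
none_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageType"]
for col in none_cols:
    all_data[col] = all_data[col].fillna("None")

# LotFrontage: fill with the median frontage of houses in the same neighborhood.
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Numeric basement/garage/veneer areas: a missing value simply means none exists.
for col in ["TotalBsmtSF", "GarageArea", "MasVnrArea"]:
    all_data[col] = all_data[col].fillna(0)
```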
Numerical Features
Below are the skewed numerical features with absolute skew greater than 0.75. We will apply a log transformation to all of these features to normalize them, as sketched after the list.
['MSSubClass', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtHalfBath', 'KitchenAbvGr', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']
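A minimal sketch of how these features can be selected and transformed, assuming the combined train/test features live in a DataFrame called all_data:

```python
import numpy as np
from scipy.stats import skew

# Measure skewness of every numeric column and log-transform the highly skewed ones.
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewness = all_data[numeric_feats].apply(lambda s: skew(s.dropna()))
skewed_feats = skewness[skewness.abs() > 0.75].index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
```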
Categorical Features
Following are the categorical features from our data set.
['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageQual', 'GarageCond', 'PavedDrive', '3SsnPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
We use pandas get_dummies to one-hot encode the categorical features. After one-hot encoding there are a total of 554 features, including the Id and SalePrice columns. We drop the Id feature as it is of no use to us.
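A minimal sketch of the encoding step (again assuming the combined frame is called all_data):

```python
import pandas as pd

# Drop the Id column and one-hot encode everything pandas sees as categorical.
all_data = all_data.drop("Id", axis=1)
all_data = pd.get_dummies(all_data)
print(all_data.shape)  # a few hundred columns after encoding
```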
Model Implementation
Let's build baseline model pipelines for the following regressors and see their performance:
- Lasso
- ElasticNet
- LassoLarsIC
- KernelRidge
- GradientBoostingRegressor
- XGBRegressor
- LGBMRegressor
- RandomForestRegressor
We use MinMaxScaler as the feature scaler in the pipelines. As stated at the beginning, our metric is the Root Mean Square Log Error (RMSLE). We use 5-fold KFold cross-validation to validate our models. The RMSLE scores for the baseline models are shown below:
GBoost 0.133
XGB 0.135
LGB 0.138
KRR 0.153
Random_Forest 0.153
LaLasso 0.155
Lasso 0.399
Enet 0.399
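As an illustration of how these baseline pipelines are built and scored, here is a minimal sketch for the gradient boosting model (assuming the encoded features are in X_train and the log-transformed target in y_train; the random seeds are arbitrary):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Scale the features, then fit a gradient boosting regressor.
gboost_pipe = make_pipeline(MinMaxScaler(), GradientBoostingRegressor(random_state=42))

# Because y_train is already log(SalePrice), plain RMSE on it is the RMSLE.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = np.sqrt(-cross_val_score(gboost_pipe, X_train, y_train,
                                  scoring="neg_mean_squared_error", cv=kf))
print(f"GBoost baseline RMSLE: {scores.mean():.3f}")
```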
Refinement
We refine the top three baseline models (GBoost, XGB and LGB) using GridSearchCV. The RMSLE scores for these models after using the tuned parameters from the grid search are shown below:
GBoost 0.132
XGB 0.130
LGB 0.134
We can see a further decrease in the RMSLE for each model. Note that because the metric is computed on log prices, even a small reduction in RMSLE translates into a multiplicative (percentage) reduction in the error on real prices.
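A hedged sketch of the grid search for one of these models follows; the parameter grid below is illustrative, not the exact grid used:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [1000, 3000],
    "learning_rate": [0.01, 0.05],
    "max_depth": [3, 4],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # equivalent to RMSLE on the log-price target
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```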
Final Model Evaluation and Validation
We build our final model by stacking the top three tuned models and taking the average of their predictions as the final value.
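A minimal sketch of this averaging ensemble, in the spirit of Serigne's averaged-models class (the tuned_* names below are placeholders for the grid-searched pipelines):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class AveragingModels(BaseEstimator, RegressorMixin):
    """Fit several regressors on the same data and average their predictions."""

    def __init__(self, models):
        self.models = models

    def fit(self, X, y):
        self.fitted_models_ = [clone(m).fit(X, y) for m in self.models]
        return self

    def predict(self, X):
        preds = np.column_stack([m.predict(X) for m in self.fitted_models_])
        return preds.mean(axis=1)

# Average the three tuned models; predictions stay on the log-price scale.
stacked = AveragingModels([tuned_gboost, tuned_xgb, tuned_lgb])
```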
We obtain an RMSLE score of 0.127 for our stacked model.
Results and Justification
With the stacked model we achieve an even better metric score. As noted above, because the metric is RMSLE, this reduction translates into a multiplicative improvement in the error on real prices.
With the above models we have achieved the following goals:
- Successfully analyzed, cleaned and scaled our dataset.
- Built baseline models to investigate our metric of concern, i.e., the Root Mean Square Log Error (RMSLE).
- Used grid search and model stacking to build a new model (RMSLE 0.127), achieving an error reduction of 0.006 compared to the best baseline model (GBoost, RMSLE 0.133).
Reflections and Improvement
Our model gives us a pretty good score based on our metric.
However, there is more scope for fine tuning and improvement of the model:
- We have used a log transformation to normalize the skewed features in our dataset. Sometimes a Box-Cox transformation works better for normalization; however, it requires estimating its lambda parameter as an input, as sketched after this list. Better-normalized data will increase the scope for error reduction and improvement of our RMSLE metric.
- We have done a grid search over some important, basic parameters and found that using the tuned parameters in our pipelines decreases the error. However, given more computing power, a more exhaustive grid search could further fine-tune the model and reduce the error.
- We can use different stacking strategies, such as weighted stacking, and add more tuned baseline models to our stack to reduce the error further.
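As an example of the first point, a hedged sketch of a Box-Cox alternative to the log transform, using scipy's boxcox1p with a per-feature lambda estimate (the shift by 1 keeps the inputs strictly positive; skewed_feats and all_data are the names assumed in the earlier sketches):

```python
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Replace log1p with a Box-Cox transform whose lambda is estimated per feature.
for feat in skewed_feats:
    lam = boxcox_normmax(all_data[feat] + 1)
    all_data[feat] = boxcox1p(all_data[feat], lam)
```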
Finally, we saw that stacking and combining models can significantly reduce our error and improve our final predictions.
Acknowledgements
A big shout out to all the awesome Kaggle kernels out there for this dataset, especially the one from Serigne.
Thank You. Happy Coding !!!