[Week 3 - House Price Prediction]
Team Members: Harun Özbay, Halis Taha Şahin, Cihat Duman
This week, we reviewed some machine learning methods and applied them to the training data.
As mentioned in the previous post, we had already reviewed and preprocessed the training data to make it ready for several regression methods: linear regression, random forests, and gradient boosting.
First, we kept all 80 features, including those containing None values. Here are the root-mean-square errors (RMSE) between the logarithm of the predicted sale price and the logarithm of the actual sale price:
Linear Regression: 0.171
Random Forest: 0.134
Gradient Boosting: 0.134
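The evaluation loop above can be sketched roughly as follows. The tiny synthetic frame stands in for the Kaggle training set, and the model settings (scikit-learn defaults) are assumptions, not our exact configuration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy stand-in for the real training data (illustrative column names)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "GrLivArea": rng.integers(500, 3000, n).astype(float),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], n),
})
df["SalePrice"] = (50000 + 50 * df["GrLivArea"]
                   + 20000 * df["OverallQual"] + rng.normal(0, 10000, n))

X = pd.get_dummies(df.drop(columns=["SalePrice"]))  # one-hot the categoricals
y = np.log(df["SalePrice"])                         # scored on log prices

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(random_state=0)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=0)),
]:
    model.fit(X_tr, y_tr)
    # RMSE between log-predictions and log-targets, as in the scores above
    scores[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: {scores[name]:.3f}")
```

Taking the logarithm before scoring means cheap and expensive houses contribute comparably to the error, which is why the competition metric is defined this way.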
Then, to observe the effect, we omitted the features shown below, which have negative correlations with the sale price:
MiscVal -0.020021
OverallCond -0.036868
YrSold -0.037263
LowQualFinSF -0.037963
MSSubClass -0.073959
KitchenAbvGr -0.147548
EnclosedPorch -0.149050
And here are the results:
Linear Regression: 0.174
Random Forest: 0.135
Gradient Boosting: 0.135
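The feature-dropping step can be sketched like this: compute each numeric feature's Pearson correlation with the sale price and drop the ones that come out negative. Again the data frame here is a toy stand-in, not the real training set:

```python
import numpy as np
import pandas as pd

# Toy frame with one positively and one negatively correlated feature
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "EnclosedPorch": rng.integers(0, 300, n).astype(float),
})
df["SalePrice"] = (150000 + 20000 * df["OverallQual"]
                   - 200 * df["EnclosedPorch"] + rng.normal(0, 5000, n))

# Pearson correlation of every feature with the sale price
corr = df.corr()["SalePrice"].drop("SalePrice")
negative = corr[corr < 0].index.tolist()
print("Dropping:", negative)

df_reduced = df.drop(columns=negative)
```

Note that a weak negative correlation does not necessarily make a feature useless to a tree-based model, which may explain why dropping these columns barely changed the random forest and boosting scores.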
After that, we came up with the idea of filling some of the numerical features (e.g. LotFrontage), which we had previously filled with zeros, with the median of the non-null values of other samples sharing the same value of some non-numerical feature (e.g. Neighbourhood). The RMSE results are below:
LotFrontage-Neighbourhood:
Linear Regression: 0.171
Random Forest: 0.136
Gradient Boosting: 0.136
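The imputation idea can be sketched in a few lines of pandas: group by the categorical feature and fill each group's missing values with that group's median. The five-row frame is purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy example: two neighborhoods, two missing LotFrontage values
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan],
})

# Fill each missing value with the median of its own neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
# The NAmes gap becomes 70.0 (median of 60 and 80); the OldTown gap becomes 50.0
```

This is usually a better guess than a blanket zero, since lot frontage tends to be similar for houses in the same neighborhood.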
The results after considering one more relation between a numerical and a non-numerical feature:
LotFrontage-Neighbourhood + GarageYearBuilt-Neighbourhood:
Linear Regression: 0.172
Random Forest: 0.134
Gradient Boosting: 0.134
We are planning to make better predictions by finding more robust relations between features. We will also add more regression methods, such as support vector machines and k-nearest neighbors (KNN).