Multiple Regression and Recursive Feature Elimination (RFE)

Fernando Aguilar
Apr 15, 2019 · 7 min read


Introduction

The most difficult part of most projects is arriving at a model with significant and efficient predictive power. If you are following the OSEMiN framework, the modeling stage is the culmination of hard work and long hours spent obtaining, scrubbing, and exploring the data. During these first stages, it is common to have dealt with null values and outliers, and to have transformed and scaled your data.

Hopefully, by the time you have finished your EDA, you already have a rough idea of the features that would work best for your model. I recently finished a project to build a multivariate model to predict house sale prices using the King County dataset. After the initial EDA, with your candidate features in mind, the next step is to test for multicollinearity.

Test for multicollinearity

It is no surprise that, after looking into the data, one might think of home square footage, location, and the number of bedrooms and bathrooms as very important features for predicting a home's sale value. However, what happens if these features are correlated with each other? Then we face multicollinearity, which makes it difficult to assess the relative importance of each independent variable in explaining the variation in the dependent variable. A correlation above 0.75 is commonly treated as high. I proceeded to create a correlation heatmap to look for correlations not only among the independent variables but also between the target and each candidate feature for the final model.

Correlation heat map of all candidate features and the target variable 'price'
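A minimal sketch of how such a heatmap can be produced with pandas and seaborn; the DataFrame name and the file name are assumptions.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed file name for the King County housing data
df = pd.read_csv('kc_house_data.csv')

# Correlation of every numeric column, including the target 'price'
corr = df.corr()

plt.figure(figsize=(12, 10))
# Cap the color scale at +/-0.75 so highly correlated pairs stand out
sns.heatmap(corr, vmin=-0.75, vmax=0.75, cmap='coolwarm')
plt.title('Correlation Heat Map')
plt.show()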

I set the minimum and maximum values of the color scale to -0.75 and 0.75, respectively, to make high correlations easier to spot. From the heatmap, we can confirm that bathrooms, bedrooms, and house square footage are among the features most correlated with the target variable 'price'. However, they are also highly correlated with one another. As a result, the partial regression coefficients may not be estimated precisely, and their standard errors are likely to be inflated. Therefore, based on the EDA and the heatmap, it is time to choose the best features for the model.

For my model I chose to keep the following variables: sqft_living, lat, yr_renovated, grade, and zipcode. They capture home square footage, location, and the condition of the house, which in this particular case coincide with the features homebuyers seek most.

Simple Linear Regression

After carefully choosing the independent variables for the model, it is important to test how well each variable performs as a single-feature model. This helps rule out any variable for which we fail to reject the null hypothesis (a p-value above 0.05) or that explains very little of the variation in the target (a very low R² value).

I will first test each of the continuous variables, sqft_living and lat. These variables were already transformed (log and cube root transformations, respectively) and min-max scaled during the EDA process. I ran a single-feature regression model for each and printed out the R², intercept, slope, and p-value. The results were:

OLS testing for each continuous random variable.
Selling price~sqft_living
------------------------------
['sqft_living', 0.4504851147267033, 0.08962350080510796, 0.6358764251985838, 0.0]
Selling price~lat
------------------------------
['lat', 0.21112253075290077, 0.2318977649531629, 0.23819279894216772, 0.0]
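For reference, a minimal sketch of a single-feature OLS loop that could produce output in this format ([feature, R², intercept, slope, p-value]), assuming statsmodels and a DataFrame df holding the transformed, scaled columns:

import statsmodels.api as sm

# df is assumed to hold the transformed and min-max scaled data
for feature in ['sqft_living', 'lat']:
    X = sm.add_constant(df[[feature]])   # add an intercept term
    model = sm.OLS(df['price'], X).fit()
    print(f'Selling price~{feature}')
    print('-' * 30)
    # [feature, R², intercept, slope, p-value of the slope]
    print([feature, model.rsquared, model.params['const'],
           model.params[feature], model.pvalues[feature]])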

From the results, it is clear that both continuous variables have relatively high R² values and slopes, and both have p-values below 0.05, so they are significant predictors of the target variable 'price'. Furthermore, since the two variables have very low correlation with each other, including both of them should increase the model's R² without adding much noise. Hence, I will keep both variables moving forward to the multiple-feature model.

Next, I will test each of the categorical variables: zipcode, yr_renovated, and grade. These variables were already one-hot encoded during EDA, with dummy variables created for each category. The results of the OLS are as follows:

Print OLS Results for each Categorical Variable
Model Results: price ~ yr_renovated
Model Results: price ~ zipcode
Model Results: price ~ grade
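A sketch of how these categorical single-feature models might be run with statsmodels' formula API; using C() to dummy-code each variable on the fly is an assumption that stands in for the dummy columns created during EDA.

import statsmodels.formula.api as smf

# df is assumed to hold the scaled data with the original categorical columns
for feature in ['yr_renovated', 'zipcode', 'grade']:
    model = smf.ols(f'price ~ C({feature})', data=df).fit()
    print(f'Model Results: price ~ {feature}')
    print('R_squared:', round(model.rsquared, 3),
          '| overall p-value:', model.f_pvalue)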

From the single-feature OLS results above, yr_renovated had the lowest R² value, so I will drop it from the model moving forward. Both grade and zipcode have high R² values and overall p-values below 0.05. Although some individual dummy variables have high p-values, this may be an issue of dimensionality that I will deal with using RFE. To start, let's build and test the first multivariate model including sqft_living, lat, zipcode, and grade (with all of their dummy variables).

Multiple Linear Regression

For this stage, I will run a multiple-feature model using a train-test split with a test size of 25%. To test the fit of the model, I will print out its mean absolute error, compare the RMSEs between the training and testing data, and compare the mean predicted and actual selling prices.
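A sketch of this step with scikit-learn, assuming X is the prepared feature matrix (sqft_living, lat, and the zipcode and grade dummies), y is the scaled price, and the random_state is arbitrary:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print('R_squared Score:', r2_score(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error test:',
      np.sqrt(mean_squared_error(y_test, y_pred)))
print('Root Mean Squared Error train:',
      np.sqrt(mean_squared_error(y_train, linreg.predict(X_train))))
print('Mean Predicted Selling Price:', y_pred.mean())
print('Mean Selling Price:', y_test.mean())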

Linear Regression with Train-Test Split
R_squared Score: 0.8509063821070757
Mean Absolute Error: 0.03194101659219547
Root Mean Squared Error test: 0.04379282741339869
Root Mean Squared Error train: 0.04442036205533277
Mean Predicted Selling Price: 0.3852218144758737
Mean Selling Price: 0.3855008038714294

This model seems to be a great fit. The R² score is very high at around 85%, the difference between the RMSEs is very small, and the mean prices are very close to each other. Although it looks like a very good model, it is best to test it further using cross-validation.
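The cross_validation helper called here and later in the post is not shown; a sketch of what it might look like with scikit-learn's cross_val_score:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cross_validation(X, y, folds=5):
    """Print the mean and per-fold R² scores of a linear regression."""
    scores = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=folds)
    print('R_squared Mean Score:', scores.mean())
    print(scores)

cross_validation(X, y)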

5-fold Cross-Validation
R_squared Mean Score: -6.44414675982732e+18
[-3.22207338e+19 8.47272164e-01 8.42818063e-01 8.52439254e-01
8.53224924e-01]

The result from the 5-fold cross-validation is terrible, with a negative mean score: one of the folds produced an R² far below zero, which means that on that fold the model fit worse than a horizontal line predicting the mean. Moreover, as noted before, there may be an issue of dimensionality, and many of the dummy variables probably contain very little data. Hence, I will create a new model using feature selection to remove the features that are adding noise to the initial model.

Recursive Feature Elimination (RFE)
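The RFE code itself is not reproduced in the post; a minimal sketch with scikit-learn follows. Selecting 40 features is inferred from the 40 predictors (Df Model) in the OLS summary further down.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest feature until 40 remain
selector = RFE(LinearRegression(), n_features_to_select=40)
selector.fit(X, y)

# Keep only the columns RFE retained
X_RFE = X[X.columns[selector.support_]]

X_RFE is then passed through the same train-test split and cross-validation steps as before.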

Linear Regression using RFE
R_squared Score: 0.8270503350015767
Mean Absolute Error: 0.03482593985773259
Root Mean Squared Error test: 0.04705841133932033
Root Mean Squared Error train: 0.04787779907667826
Mean Predicted Selling Price: 0.38887905753150637
Mean Selling Price: 0.38777279205303655

Although the R² score dropped to around 83%, that is not a big change, and the train and test RMSEs remain close to each other, as do the mean predicted and actual selling prices. Let's also cross-validate this new model.

cross_validation(X_RFE,y)
R_squared Mean Score: 0.8244926988540975
[0.82332196 0.82025237 0.81959714 0.82832553 0.8309665 ]

The mean R² obtained from the 5-fold cross-validation is a positive 82%, so this model does a much better job than the first one. I will now look for any features with a p-value > 0.05 and delete them from the model.
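The OLS_reg helper called below is not shown in the post; a sketch under the assumption that it wraps statsmodels' formula API and uses the previously defined target y:

import pandas as pd
import statsmodels.formula.api as smf

def OLS_reg(X):
    """Fit an OLS model of price on the given features and return its summary."""
    data = pd.concat([y.rename('price'), X], axis=1)  # y assumed defined earlier
    formula = 'price ~ ' + ' + '.join(X.columns)
    model = smf.ols(formula, data=data).fit()
    # Any feature with a p-value above 0.05 would be a candidate for removal
    print(model.pvalues[model.pvalues > 0.05])
    return model.summary()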

OLS_reg(X_RFE)

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.827
Model: OLS Adj. R-squared: 0.827
Method: Least Squares F-statistic: 2530.
Date: Fri, 12 Apr 2019 Prob (F-statistic): 0.00
Time: 18:31:48 Log-Likelihood: 34429.
No. Observations: 21191 AIC: -6.878e+04
Df Residuals: 21150 BIC: -6.845e+04
Df Model: 40
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 0.1126 0.003 39.820 0.000 0.107 0.118
sqft_living 0.4063 0.004 99.094 0.000 0.398 0.414
lat 0.1544 0.002 90.932 0.000 0.151 0.158
zipcode_98004 0.1599 0.003 58.060 0.000 0.155 0.165
zipcode_98005 0.0745 0.004 19.923 0.000 0.067 0.082
zipcode_98006 0.0723 0.002 32.472 0.000 0.068 0.077
zipcode_98008 0.0627 0.003 21.771 0.000 0.057 0.068
zipcode_98010 0.0390 0.005 7.384 0.000 0.029 0.049
zipcode_98019 -0.0515 0.004 -13.597 0.000 -0.059 -0.044
zipcode_98027 0.0474 0.002 19.153 0.000 0.043 0.052
zipcode_98032 -0.0335 0.004 -7.767 0.000 -0.042 -0.025
zipcode_98033 0.0686 0.002 28.782 0.000 0.064 0.073
zipcode_98039 0.1963 0.007 28.737 0.000 0.183 0.210
zipcode_98040 0.1329 0.003 45.566 0.000 0.127 0.139
zipcode_98052 0.0303 0.002 14.467 0.000 0.026 0.034
zipcode_98070 0.0778 0.005 14.681 0.000 0.067 0.088
zipcode_98074 0.0277 0.002 11.688 0.000 0.023 0.032
zipcode_98075 0.0479 0.003 18.317 0.000 0.043 0.053
zipcode_98102 0.1059 0.005 22.415 0.000 0.097 0.115
zipcode_98103 0.0710 0.002 34.689 0.000 0.067 0.075
zipcode_98105 0.1073 0.003 33.458 0.000 0.101 0.114
zipcode_98107 0.0779 0.003 26.008 0.000 0.072 0.084
zipcode_98109 0.1205 0.005 26.163 0.000 0.111 0.130
zipcode_98112 0.1311 0.003 44.108 0.000 0.125 0.137
zipcode_98115 0.0704 0.002 33.930 0.000 0.066 0.074
zipcode_98116 0.0892 0.003 33.417 0.000 0.084 0.094
zipcode_98117 0.0676 0.002 31.771 0.000 0.063 0.072
zipcode_98119 0.1155 0.004 32.381 0.000 0.109 0.123
zipcode_98122 0.0834 0.003 29.198 0.000 0.078 0.089
zipcode_98136 0.0847 0.003 28.433 0.000 0.079 0.091
zipcode_98144 0.0644 0.003 24.582 0.000 0.059 0.070
zipcode_98168 -0.0452 0.003 -15.199 0.000 -0.051 -0.039
zipcode_98199 0.0908 0.003 33.121 0.000 0.085 0.096
grade_4 -0.0759 0.009 -8.076 0.000 -0.094 -0.057
grade_5 -0.0842 0.004 -23.836 0.000 -0.091 -0.077
grade_6 -0.0727 0.002 -40.790 0.000 -0.076 -0.069
grade_7 -0.0646 0.001 -53.186 0.000 -0.067 -0.062
grade_8 -0.0424 0.001 -38.568 0.000 -0.045 -0.040
grade_11 0.0546 0.003 20.764 0.000 0.049 0.060
grade_12 0.1054 0.005 19.271 0.000 0.095 0.116
grade_13 0.1593 0.013 11.925 0.000 0.133 0.186
==============================================================================
Omnibus: 1634.980 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5790.556
Skew: 0.348 Prob(JB): 0.00
Kurtosis: 5.464 Cond. No. 56.9
==============================================================================

Conclusion

By using RFE, we managed to keep both continuous variables and drop most of the dummy variables that were creating noise in the model. Although R² decreased from 85% to 83%, the model is more robust, with a positive cross-validation result. Also, no remaining feature has a p-value > 0.05, so for each one I can reject the null hypothesis that it is not significant in predicting price.


Fernando Aguilar

Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at PennState, and Flatiron Data Science Bootcamp Graduate.