Multiple Regression and Recursive Feature Elimination (RFE)

Fernando Aguilar
Apr 15, 2019 · 7 min read


Introduction

The most difficult part of most projects is arriving at a model with significant and efficient predictive power. If you are following the OSEMiN framework, the modeling stage is the culmination of hard work and long hours spent obtaining, scrubbing, and exploring the data. During these first stages, it is common to have dealt with null values and outliers, and to have transformed and scaled your data.

Hopefully, by the time you have finished your EDA, you already have a rough idea of the features that would work best for your model. I recently finished a project to build a multivariate model to predict house sale prices using the King County dataset. After the initial EDA, with your candidate features in mind, the next step is to test for multicollinearity.

Test for multicollinearity

It is no surprise that, after looking into the data, one might think of home square footage, location, and the number of bedrooms and bathrooms as very important features for predicting a home's sale value. However, what happens if these features are correlated with each other? Then we face multicollinearity, which makes it difficult to assess the relative importance of each independent variable in explaining the variation in the dependent variable. A correlation above 0.75 is commonly treated as high. I proceeded to create a correlation heatmap to look for correlations not only among the independent variables but also between the target and each candidate feature for the final model.

Correlation heat map of all candidate features and the target variable 'price'
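A minimal sketch of how such a heatmap can be produced with pandas and seaborn; the DataFrame name and the file name are assumptions.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed file name for the King County housing data
df = pd.read_csv('kc_house_data.csv')

# Correlation of every numeric column, including the target 'price'
corr = df.corr()

plt.figure(figsize=(12, 10))
# Cap the color scale at +/-0.75 so highly correlated pairs stand out
sns.heatmap(corr, vmin=-0.75, vmax=0.75, cmap='coolwarm')
plt.title('Correlation Heat Map')
plt.show()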

I set the minimum and maximum values of the color scale to -0.75 and 0.75, respectively, to make high correlations easier to spot. From the heatmap, we can confirm that bathrooms, bedrooms, and house square footage are among the features most correlated with the target variable 'price'. However, they are also highly correlated with one another. As a result, the partial regression coefficients may not be estimated precisely, and their standard errors are likely to be inflated. Therefore, based on the EDA and the heatmap, it is time to choose the best features for the model.

For my model I chose to keep the following variables: sqft_living, lat, yr_renovated, grade, and zipcode. They capture home square footage, location, and the condition of the house, which in this particular case coincide with the features homebuyers seek most.

Simple Linear Regression

After carefully choosing the independent variables for the model, it is important to test how well each variable performs as a single-feature model. This helps rule out any variable for which we fail to reject the null hypothesis (a p-value above 0.05) or that explains very little of the variation in the target (a very low R² value).

I will first test each of the continuous variables, sqft_living and lat. These variables were already transformed (log and cube root transformations, respectively) and min-max scaled during the EDA process. I ran a single-feature regression model for each and printed out the R², intercept, slope, and p-value. The results were:

OLS testing for each continuous random variable.
Selling price~sqft_living
------------------------------
['sqft_living', 0.4504851147267033, 0.08962350080510796, 0.6358764251985838, 0.0]
Selling price~lat
------------------------------
['lat', 0.21112253075290077, 0.2318977649531629, 0.23819279894216772, 0.0]
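For reference, a minimal sketch of a single-feature OLS loop that could produce output in this format ([feature, R², intercept, slope, p-value]), assuming statsmodels and a DataFrame df holding the transformed, scaled columns:

import statsmodels.api as sm

# df is assumed to hold the transformed and min-max scaled data
for feature in ['sqft_living', 'lat']:
    X = sm.add_constant(df[[feature]])   # add an intercept term
    model = sm.OLS(df['price'], X).fit()
    print(f'Selling price~{feature}')
    print('-' * 30)
    # [feature, R², intercept, slope, p-value of the slope]
    print([feature, model.rsquared, model.params['const'],
           model.params[feature], model.pvalues[feature]])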

From the results, it is clear that both continuous variables have relatively high R² values and slopes, and both have p-values below 0.05, so they are significant predictors of the target variable 'price'. Furthermore, since the two variables have very low correlation with each other, including both of them should increase the model's R² without adding much noise. Hence, I will keep both variables moving forward to the multiple-feature model.

Next, I will test each of the categorical variables: zipcode, yr_renovated, and grade. These variables were already one-hot encoded during EDA, with dummy variables created for each category. The results of the OLS are as follows:

Print OLS Results for each Categorical Variable
Model Results: price ~ yr_renovated
Model Results: price ~ zipcode
Model Results: price ~ grade
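A sketch of how these categorical single-feature models might be run with statsmodels' formula API; using C() to dummy-code each variable on the fly is an assumption that stands in for the dummy columns created during EDA.

import statsmodels.formula.api as smf

# df is assumed to hold the scaled data with the original categorical columns
for feature in ['yr_renovated', 'zipcode', 'grade']:
    model = smf.ols(f'price ~ C({feature})', data=df).fit()
    print(f'Model Results: price ~ {feature}')
    print('R_squared:', round(model.rsquared, 3),
          '| overall p-value:', model.f_pvalue)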

From the single-feature OLS results above, yr_renovated had the lowest R² value, so I will drop it from the model moving forward. Both grade and zipcode have high R² values and overall p-values below 0.05. Although some individual dummy variables have high p-values, this may be an issue of dimensionality that I will deal with using RFE. To start, let's build and test the first multivariate model including sqft_living, lat, zipcode, and grade (with all of their dummy variables).

Multiple Linear Regression

For this stage, I will run a multiple-feature model using a train-test split with a test size of 25%. To test the fit of the model, I will print out its mean absolute error, compare the RMSEs between the training and testing data, and compare the mean predicted and actual selling prices.
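A sketch of this step with scikit-learn, assuming X is the prepared feature matrix (sqft_living, lat, and the zipcode and grade dummies), y is the scaled price, and the random_state is arbitrary:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print('R_squared Score:', r2_score(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error test:',
      np.sqrt(mean_squared_error(y_test, y_pred)))
print('Root Mean Squared Error train:',
      np.sqrt(mean_squared_error(y_train, linreg.predict(X_train))))
print('Mean Predicted Selling Price:', y_pred.mean())
print('Mean Selling Price:', y_test.mean())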

Linear Regression with Train-Test Split
R_squared Score: 0.8509063821070757
Mean Absolute Error: 0.03194101659219547
Root Mean Squared Error test: 0.04379282741339869
Root Mean Squared Error train: 0.04442036205533277
Mean Predicted Selling Price: 0.3852218144758737
Mean Selling Price: 0.3855008038714294

This model seems to be a great fit. The R² score is very high at around 85%, the difference between the RMSEs is very small, and the mean prices are very close to each other. Although it looks like a very good model, it is best to test it further using cross-validation.
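The cross_validation helper called here and later in the post is not shown; a sketch of what it might look like with scikit-learn's cross_val_score:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cross_validation(X, y, folds=5):
    """Print the mean and per-fold R² scores of a linear regression."""
    scores = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=folds)
    print('R_squared Mean Score:', scores.mean())
    print(scores)

cross_validation(X, y)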

5-fold Cross-Validation
R_squared Mean Score: -6.44414675982732e+18
[-3.22207338e+19 8.47272164e-01 8.42818063e-01 8.52439254e-01
8.53224924e-01]

The result from the 5-fold cross-validation is terrible, with a negative mean score: one of the folds produced an R² far below zero, which means that on that fold the model fit worse than a horizontal line predicting the mean. Moreover, as noted before, there may be an issue of dimensionality, and many of the dummy variables probably contain very little data. Hence, I will create a new model using feature selection to remove the features that are adding noise to the initial model.

Recursive Feature Elimination (RFE)
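The RFE code itself is not reproduced in the post; a minimal sketch with scikit-learn follows. Selecting 40 features is inferred from the 40 predictors (Df Model) in the OLS summary further down.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest feature until 40 remain
selector = RFE(LinearRegression(), n_features_to_select=40)
selector.fit(X, y)

# Keep only the columns RFE retained
X_RFE = X[X.columns[selector.support_]]

X_RFE is then passed through the same train-test split and cross-validation steps as before.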

Linear Regression using RFE
R_squared Score: 0.8270503350015767
Mean Absolute Error: 0.03482593985773259
Root Mean Squared Error test: 0.04705841133932033
Root Mean Squared Error train: 0.04787779907667826
Mean Predicted Selling Price: 0.38887905753150637
Mean Selling Price: 0.38777279205303655

Although the R² score dropped to around 83%, that is not a big change, and the train and test RMSEs remain close to each other, as do the mean predicted and actual selling prices. Let's also cross-validate this new model.

cross_validation(X_RFE,y)
R_squared Mean Score: 0.8244926988540975
[0.82332196 0.82025237 0.81959714 0.82832553 0.8309665 ]

The mean R² obtained from the 5-fold cross-validation is a positive 82%, so this model does a much better job than the first one. I will now look for any features with a p-value > 0.05 and delete them from the model.
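The OLS_reg helper called below is not shown in the post; a sketch under the assumption that it wraps statsmodels' formula API and uses the previously defined target y:

import pandas as pd
import statsmodels.formula.api as smf

def OLS_reg(X):
    """Fit an OLS model of price on the given features and return its summary."""
    data = pd.concat([y.rename('price'), X], axis=1)  # y assumed defined earlier
    formula = 'price ~ ' + ' + '.join(X.columns)
    model = smf.ols(formula, data=data).fit()
    # Any feature with a p-value above 0.05 would be a candidate for removal
    print(model.pvalues[model.pvalues > 0.05])
    return model.summary()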

OLS_reg(X_RFE)

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.827
Model: OLS Adj. R-squared: 0.827
Method: Least Squares F-statistic: 2530.
Date: Fri, 12 Apr 2019 Prob (F-statistic): 0.00
Time: 18:31:48 Log-Likelihood: 34429.
No. Observations: 21191 AIC: -6.878e+04
Df Residuals: 21150 BIC: -6.845e+04
Df Model: 40
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 0.1126 0.003 39.820 0.000 0.107 0.118
sqft_living 0.4063 0.004 99.094 0.000 0.398 0.414
lat 0.1544 0.002 90.932 0.000 0.151 0.158
zipcode_98004 0.1599 0.003 58.060 0.000 0.155 0.165
zipcode_98005 0.0745 0.004 19.923 0.000 0.067 0.082
zipcode_98006 0.0723 0.002 32.472 0.000 0.068 0.077
zipcode_98008 0.0627 0.003 21.771 0.000 0.057 0.068
zipcode_98010 0.0390 0.005 7.384 0.000 0.029 0.049
zipcode_98019 -0.0515 0.004 -13.597 0.000 -0.059 -0.044
zipcode_98027 0.0474 0.002 19.153 0.000 0.043 0.052
zipcode_98032 -0.0335 0.004 -7.767 0.000 -0.042 -0.025
zipcode_98033 0.0686 0.002 28.782 0.000 0.064 0.073
zipcode_98039 0.1963 0.007 28.737 0.000 0.183 0.210
zipcode_98040 0.1329 0.003 45.566 0.000 0.127 0.139
zipcode_98052 0.0303 0.002 14.467 0.000 0.026 0.034
zipcode_98070 0.0778 0.005 14.681 0.000 0.067 0.088
zipcode_98074 0.0277 0.002 11.688 0.000 0.023 0.032
zipcode_98075 0.0479 0.003 18.317 0.000 0.043 0.053
zipcode_98102 0.1059 0.005 22.415 0.000 0.097 0.115
zipcode_98103 0.0710 0.002 34.689 0.000 0.067 0.075
zipcode_98105 0.1073 0.003 33.458 0.000 0.101 0.114
zipcode_98107 0.0779 0.003 26.008 0.000 0.072 0.084
zipcode_98109 0.1205 0.005 26.163 0.000 0.111 0.130
zipcode_98112 0.1311 0.003 44.108 0.000 0.125 0.137
zipcode_98115 0.0704 0.002 33.930 0.000 0.066 0.074
zipcode_98116 0.0892 0.003 33.417 0.000 0.084 0.094
zipcode_98117 0.0676 0.002 31.771 0.000 0.063 0.072
zipcode_98119 0.1155 0.004 32.381 0.000 0.109 0.123
zipcode_98122 0.0834 0.003 29.198 0.000 0.078 0.089
zipcode_98136 0.0847 0.003 28.433 0.000 0.079 0.091
zipcode_98144 0.0644 0.003 24.582 0.000 0.059 0.070
zipcode_98168 -0.0452 0.003 -15.199 0.000 -0.051 -0.039
zipcode_98199 0.0908 0.003 33.121 0.000 0.085 0.096
grade_4 -0.0759 0.009 -8.076 0.000 -0.094 -0.057
grade_5 -0.0842 0.004 -23.836 0.000 -0.091 -0.077
grade_6 -0.0727 0.002 -40.790 0.000 -0.076 -0.069
grade_7 -0.0646 0.001 -53.186 0.000 -0.067 -0.062
grade_8 -0.0424 0.001 -38.568 0.000 -0.045 -0.040
grade_11 0.0546 0.003 20.764 0.000 0.049 0.060
grade_12 0.1054 0.005 19.271 0.000 0.095 0.116
grade_13 0.1593 0.013 11.925 0.000 0.133 0.186
==============================================================================
Omnibus: 1634.980 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5790.556
Skew: 0.348 Prob(JB): 0.00
Kurtosis: 5.464 Cond. No. 56.9
==============================================================================

Conclusion

By using RFE, we managed to keep both continuous variables and drop most of the dummy variables that were creating noise in the model. Although R² decreased from 85% to 83%, the model is more robust, with a positive cross-validation result. Also, no remaining feature has a p-value > 0.05, so for each one I can reject the null hypothesis that it is not significant in predicting price.


Fernando Aguilar

Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at PennState, and Flatiron Data Science Bootcamp Graduate.