Explained: Housing Prices Feature Engineering and Ridge Regression.

Devarsh Raval · Published in The Startup · Sep 2, 2020 · 9 min read

This blog is based on the notebook I used to submit predictions for the Kaggle In-Class Housing Prices competition. My submission ranked 293 on the leaderboard, although the focus of this blog is not how to get a high score but to help beginners develop intuition for machine learning regression techniques and feature engineering. I am a final-year mechanical engineering student; I started learning Python in January 2020 with no intention of learning data science, so any suggestions for improvement in the comments will be appreciated.

The data set contains houses in Ames, Iowa. (Photo by chuttersnap on Unsplash)

Introduction

The scope of this blog covers data pre-processing, feature engineering, multivariate analysis using Lasso regression, and predicting sale prices with cross-validated Ridge regression.

The original notebook also contains code for predicting prices with a blended model of Ridge, Support Vector Regression and XGBoost regression; including it here would make this blog verbose. To keep things succinct, code blocks are pasted as images where necessary.

The notebook is uploaded on the competition dashboard if the full code is needed for reference: https://www.kaggle.com/devarshraval/top-2-feature-selection-ridge-regression

This data set is of particular interest to novices like me because of its vast range of features. There are a total of 79 features (excluding Id) that are said to explain the sale price of a house, which tests the industriousness of the learner in handling so many features.

1. Exploratory data analysis

The first step in any data science project should be getting to know the data variables and the target. Depending on one's interests, this stage can make one quite pensive or, on the other hand, feel enervated. Exploring the data gives an understanding of the problem and ideas for feature engineering.

It is recommended to read the feature description text file to understand the feature names and to categorize the variables into groups, such as space-related features, basement features, amenities, year built and remodelled, garage features, and so on, based on subjective views.

This categorization, and the relative importance of the features, becomes more intuitive after a few commands in the notebook, such as:

corrmat=df_train.corr()
corrmat['SalePrice'].sort_values(ascending=False).head(10)

SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897

Naturally, the overall quality of the house is strongly correlated with the price. However, it is not defined how this measure was calculated. Other closely related features are space measures (GrLivArea, TotalBsmtSF, GarageArea) and the age of the house (YearBuilt).

The correlations give a univariate idea of the important features. Thus, we can plot all of these features and address outliers and multicollinearity. Outliers can skew regression models sharply, and multicollinearity can undermine the importance of a feature in our model.

1.1 Multi-collinearity

f, ax = plt.subplots(figsize=(10,12))
sns.heatmap(corrmat,mask=corrmat<0.75,linewidth=0.5,cmap="Blues", square=True)
The Seaborn heatmap shows features with correlation above 75%

Thus, highly inter-correlated variables are:

  1. GarageYrBlt and YearBuilt
  2. TotRmsAbvGrd and GrLivArea
  3. 1stFlrSF and TotalBsmtSF
  4. GarageArea and GarageCars

GarageYrBlt, TotRmsAbvGrd, GarageCars can be deleted as they give the same information as other features.

1.2. Outliers

First, examine the plots of each feature versus the target one by one and identify observations that do not follow the trend. Only these observations are outliers, not the ones that follow the trend but have a greater or smaller value.
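A quick way to spot such points is a row of scatter plots of the top correlated features against SalePrice. This is only a minimal sketch, assuming df_train is loaded as above; the original notebook shows these as separate figures:

import matplotlib.pyplot as plt

# Scatter each of the strongly correlated features against the target
top_feats = ['GrLivArea', 'TotalBsmtSF', 'LotFrontage', 'YearBuilt', 'OverallQual']
fig, axes = plt.subplots(1, len(top_feats), figsize=(22, 4))
for ax, feat in zip(axes, top_feats):
    ax.scatter(df_train[feat], df_train['SalePrice'], s=10, alpha=0.5)
    ax.set_xlabel(feat)
axes[0].set_ylabel('SalePrice')
plt.show()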

#Outliers in Fig 1 
df_train.sort_values(by = 'GrLivArea', ascending = False)[:2]
#Fig 2(Clockwise)
df_train.sort_values(by='TotalBsmtSF',ascending=False)[:2]
#fig 3
df_train.drop(df_train[df_train['LotFrontage']>200].index,inplace=True)
#Fig 4
df_train.drop(df_train[(df_train.YearBuilt < 1900) & (df_train.SalePrice > 200000)].index,inplace=True)
# Fig 5
df_train.drop(df_train[(df_train.OverallQual==4) & (df_train.SalePrice>200000)].index,inplace=True)

One might be tempted to delete the observations with LotArea above 100k. On second thought, however, it is possible these were simply houses on large lots but of only average overall quality.

df_train[df_train['LotArea']>100000]['OverallQual']

Out[]:
249    6
313    7
335    5
706    7

Thus, we see that despite having a larger lot area, these houses have an overall quality of only 5 to 7, and the average price for that quality is a little over 200k, as seen in the box plot. These houses may therefore not be harmful outliers for the model.

We will see that the correlations for these variables increase after removing the outliers.

2. Feature Engineering

Concatenate the training and testing sets to create new features and fill in missing values, so that the number and scale of the features remain the same in both sets.
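A minimal sketch of that concatenation, assuming df_train and df_test are the raw competition data frames:

import pandas as pd

# Stack train and test so every transformation is applied to both sets
train_features = df_train.drop(['Id', 'SalePrice'], axis=1)
test_features = df_test.drop(['Id'], axis=1)
all_features = pd.concat([train_features, test_features]).reset_index(drop=True)
print(all_features.shape)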

2.1. Missing Data

Percentage Missing Values for each feature

Thus, some features have too many missing values. These features can make our model overfit on the few values that are present.

First, fill all the missing values so that new features can be created; later, delete features where 97% of the values are identical.

Some features are listed as numerical data whereas they should be string data; these are location-related features such as MSZoning, MSSubClass, YearBuilt, and MoSold. For other important features like LotFrontage, GarageArea, and MSZoning, fill in the median or mode values. As a last step, fill all remaining null values with 'None' for string features and 0 for numeric features.

# Collect string columns and fill their missing values with 'None'
objects = []
for i in all_features.columns:
    if all_features[i].dtype == object:
        objects.append(i)
all_features.update(all_features[objects].fillna('None'))

# Collect numeric columns and fill their missing values with 0
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in all_features.columns:
    if all_features[i].dtype in numeric_dtypes:
        numeric.append(i)
all_features.update(all_features[numeric].fillna(0))
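For the type conversions and median/mode fills mentioned above (which run before the generic fill shown in the block above), a rough sketch could look like this; the exact column choices in the notebook may differ:

# Numeric codes that really behave as categories
for col in ['MSSubClass', 'YrSold', 'MoSold']:
    all_features[col] = all_features[col].astype(str)

# Fill a few important features with their median or mode
all_features['LotFrontage'] = all_features['LotFrontage'].fillna(all_features['LotFrontage'].median())
all_features['GarageArea'] = all_features['GarageArea'].fillna(all_features['GarageArea'].median())
all_features['MSZoning'] = all_features['MSZoning'].fillna(all_features['MSZoning'].mode()[0])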

2.2. New features

Create features deemed relevant by a subjective analysis of the data, such as:

1) Years since remodelling of the house.

2) Total surface area = total basement SF + 1st floor SF + 2nd floor SF.

3) Bathrooms = full bathrooms + 0.5 * half bathrooms.

4) Average room size = ground living area / (number of rooms and bathrooms).

5) Bedroom-bathroom tradeoff = bedrooms * bathrooms.

6) Porch area = sum of the porch features.

7) Type of porch = a categorical feature.

8) The newness of the house, captured by its age while taking renovation into account.

Some of the features were created in this way:
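Since the original code block is an image, here is a hedged sketch of how a few of these features might be built. The derived feature names are illustrative, apart from Years_Since_Remod and Age, which are referenced later in the notebook:

# Combined space, bathroom, porch and age features
all_features['Total_SF'] = (all_features['TotalBsmtSF'] + all_features['1stFlrSF']
                            + all_features['2ndFlrSF'])
all_features['Total_Bathrooms'] = all_features['FullBath'] + 0.5 * all_features['HalfBath']
all_features['Porch_Area'] = (all_features['OpenPorchSF'] + all_features['EnclosedPorch']
                              + all_features['3SsnPorch'] + all_features['ScreenPorch'])
all_features['Years_Since_Remod'] = (all_features['YrSold'].astype(int)
                                     - all_features['YearRemodAdd'].astype(int))
all_features['Age'] = all_features['YrSold'].astype(int) - all_features['YearBuilt'].astype(int)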

2.3. Mapping important categorical features

Basically, this is manually encoding the categorical features. Dummy variables or one-hot encoding create a sparse matrix and do not capture the importance or ordering of a feature the way we want. For example, the following map for the Neighborhood feature encodes the categories according to their median sale price, so that more expensive areas get higher values. You can verify this by trying both methods: the manual mapping scores significantly higher than one-hot encoding.

Box plots for categorical features to facilitate mapping or encoding the categories as numbers.
# Mapping neighborhood unique values according to the shades of the box-plot
neigh_map={'None': 0,'MeadowV':1,'IDOTRR':1,'BrDale':1,
'OldTown':2,'Edwards':2,'BrkSide':2,
'Sawyer':3,'Blueste':3,'SWISU':3,'NAmes':3,
'NPkVill':4,'Mitchel':4,'SawyerW':4,
'Gilbert':5,'NWAmes':5,'Blmngtn':5,
'CollgCr':6,'ClearCr':6,'Crawfor':6,
'Somerst':8,'Veenker':8,'Timber':8,
'StoneBr':10,'NoRidge':10,'NridgHt':10 }
all_features['Neighborhood'] = all_features['Neighborhood'].map(neigh_map)
# Quality maps for external and basement
bsm_map = {'None': 0, 'Po': 1, 'Fa': 4, 'TA': 9, 'Gd': 16, 'Ex': 25}
#ordinal_map = {'Ex': 10,'Gd': 8, 'TA': 6, 'Fa': 5, 'Po': 2, 'NA':0}
ord_col = ['ExterQual','ExterCond','BsmtQual', 'BsmtCond','HeatingQC',
'KitchenQual','GarageQual','GarageCond', 'FireplaceQu']
for col in ord_col:
    all_features[col] = all_features[col].map(bsm_map)
all_features.shape

Impute negative values obtained from Years_Since_Remod in the test set as zero.

# Rows where the house was sold before the recorded remodel year
all_features[all_features['YrSold'].astype(int) < all_features['YearRemodAdd'].astype(int)]

all_features.at[2284,'Years_Since_Remod']=0
all_features.at[2538,'Years_Since_Remod']=0
all_features.at[2538,'Age']=0

2.4. Drop features

Finally, drop features based on multi-collinearity, missing values, weak correlation with the target, and nearly identical values.
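A minimal sketch of that clean-up; the named drops follow the collinear features identified in section 1.1, while the exact list in the notebook may be longer:

# Drop collinear columns, then columns where ~97% of the values are identical
to_drop = ['GarageYrBlt', 'TotRmsAbvGrd', 'GarageCars']
all_features = all_features.drop(columns=to_drop, errors='ignore')

quasi_constant = [col for col in all_features.columns
                  if all_features[col].value_counts(normalize=True).iloc[0] > 0.97]
all_features = all_features.drop(columns=quasi_constant)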

2.5. Feature Scaling

A Box-Cox(1+x) transformation is used to rescale the highly skewed features.
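A sketch of that transformation with scipy; the skew threshold of 0.5 is a common choice and may differ from the notebook's exact value:

from scipy.stats import skew, boxcox_normmax
from scipy.special import boxcox1p

# Apply a box-cox(1+x) transform to the highly skewed numeric columns
numeric_cols = all_features.select_dtypes(include=['int64', 'float64']).columns
skews = all_features[numeric_cols].apply(lambda s: skew(s)).sort_values(ascending=False)
for col in skews[skews > 0.5].index:
    all_features[col] = boxcox1p(all_features[col], boxcox_normmax(all_features[col] + 1))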

Target Scaling

# The target is log(1+x)-transformed to reduce skew
target = np.log1p(df_train['SalePrice']).reset_index(drop=True)

2.6. Feature transformation: logarithmic and quadratic relationships

Square and log the numeric features to explore different degrees of relationship with the target.

Save all the feature lists, trial each one's validation score, and choose the best list.

Note: Use the numeric-column list of the training data set from before the categorical variables were encoded.

Cubed and square-root features were also tried but did not give significant improvement and led to over-fitting when all features were included.
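As a rough sketch of how the logged and squared copies might be added, using the numeric-column list built in section 2.1 as the note above suggests (the derived column names are illustrative):

import numpy as np

# Add log(1+x) and squared copies of the numeric columns that survived the earlier drops
num_cols = [c for c in numeric if c in all_features.columns]
loged_features = all_features.copy()
for col in num_cols:
    loged_features['log_' + col] = np.log1p(loged_features[col].clip(lower=0))

log_sq_cols = loged_features.copy()
for col in num_cols:
    log_sq_cols['sq_' + col] = log_sq_cols[col] ** 2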

#Convert remaining categorical features by one hot encoding
all_features=pd.get_dummies(all_features).reset_index(drop=True)
all_features1=pd.get_dummies(all_features1).reset_index(drop=True)
loged_features=pd.get_dummies(loged_features).reset_index(drop=True)
log_sq_cols=pd.get_dummies(log_sq_cols).reset_index(drop=True)
# Re-split the training and testing data
def get_splits(all_features, target):
    df = pd.concat([all_features, target], axis=1)
    X_train = df.iloc[:len(target), :]
    X_test = all_features.iloc[len(target):, :]
    return X_train, X_test

# Split for the dataset containing log and squared features;
# get_valid (defined in the notebook, not shown here) splits off a validation set
# and returns the list of feature columns
X_train1, X_test1 = get_splits(log_sq_cols, target)
train1, valid1, feature_col1 = get_valid(X_train1, 'SalePrice')

2.7. Multi-variate analysis using Lasso Regression

Lasso regression, or L1-penalty regularization, is used for feature selection: as the alpha coefficient is increased, undesired features are given a weight of zero.

A wide range of alphas is used in cross-validation and the best model is selected. The corresponding feature weights are extracted using the .coef_ attribute of LassoCV.

The output is the list of importance weights given to the features in the best CV model.
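A hedged sketch of that selection step with scikit-learn; the alpha grid and the scaler are illustrative choices rather than the notebook's exact settings:

import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Cross-validate Lasso over a range of alphas and rank features by |coefficient|
X = X_train1.drop('SalePrice', axis=1)
y = X_train1['SalePrice']
lasso = make_pipeline(RobustScaler(),
                      LassoCV(alphas=[5e-5, 1e-4, 5e-4, 1e-3, 5e-3], cv=5, max_iter=10000))
lasso.fit(X, y)
lasso_coefs = pd.Series(lasso.named_steps['lassocv'].coef_, index=X.columns)
print(lasso_coefs.abs().sort_values(ascending=False).head(20))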

3. Ridge Regression

To predict the sale price of the training-set houses, ridge regression is used on the features obtained from the Lasso model.

A ridge model with cross-validation over the L2 alpha is fitted using all the features in the log_sq_cols list. One may fit ridge models on each of the lists obtained above and check that the log_sq_cols list gives the lowest CV score.

Ridge regression with cross-validation is then performed for various numbers of features, using the sorted list from the Lasso model. The model is observed to generalize best when 270 features are included. A more extensive search over the number of features can be performed, but the result is the same.

Ridge regression: using all 298 features, the CV score is 0.1116, while it is 0.1111 with 270 features.
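A sketch of the cross-validated ridge fit described above; the alpha grid happens to include 21, the value selected in the next section, but the grid and scaler are illustrative assumptions:

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Cross-validated ridge on the log/squared feature set; CV score is the RMSE of log prices
X = X_train1.drop('SalePrice', axis=1)   # same split as in the Lasso sketch above
y = X_train1['SalePrice']
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=np.arange(10, 31), cv=5))
cv_rmse = np.sqrt(-cross_val_score(ridge, X, y, scoring='neg_mean_squared_error', cv=5))
print(cv_rmse.mean())

# Fit on all training rows and convert the test predictions back from the log scale
ridge.fit(X, y)
test_preds = np.expm1(ridge.predict(X_test1))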

4. Inferences

The ridge regression model with 270 features and an alpha of 21 is the best regression model for the given training set and gives the lowest test-set error. In this section we will try to find which features are generally important for predicting sale price, and which ones this model has chosen that resulted in the improved score.

Feature importance is calculated using permutation importance from the eli5 library. This method shuffles one feature at a time, keeping the other features unchanged, and measures how much the model's score drops.
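A sketch of that computation with eli5, following the pattern from the Kaggle explainability course referenced below, and assuming the fitted ridge pipeline from the sketch above plus the validation frame valid1 and feature list feature_col1 produced by get_valid:

import eli5
from eli5.sklearn import PermutationImportance

# Shuffle one feature at a time on the validation set and record the drop in score
perm = PermutationImportance(ridge, random_state=1).fit(valid1[feature_col1], valid1['SalePrice'])
eli5.show_weights(perm, feature_names=list(feature_col1), top=20)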

First, the importances given by a model using the default features provided in the train set:

Thus, we see that a naive model using all the provided features gives importance to space features like the basement area, ground living area and 2nd-floor surface area. Neighborhood and year-built features are also present.

Next, the importances given by the ridge model built above with 270 features:

One can observe that the ridge model gives the time-related features more weight than the space features. One reason may be that the naive model had far fewer time features than space features, and that the time features have lower absolute values and hence a lower impact.

To observe the impact of changes in the features on the target, we plot a summary plot of the SHAP values.
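A sketch of that plot using shap's LinearExplainer on a plain ridge fit with an alpha of 21 (the value reported above), reusing X and y from the ridge sketch; the exact setup in the notebook may differ:

import shap
from sklearn.linear_model import Ridge

# Explain a ridge fit on the selected features and summarize the SHAP values
lin_model = Ridge(alpha=21).fit(X, y)
explainer = shap.LinearExplainer(lin_model, X)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, max_display=20)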

Naturally, the sale price is greatly impacted by a house having a large 2nd-floor area, which also indicates that the house has a large ground-floor area. Moreover, it is a squared numeric feature with high absolute values.

To observe the impact of other features on the output, we can drop the space features or use only the particular features of interest.

5. Conclusion

That’s the end of the exercise. The prices have been predicted to within roughly 12,000 dollars for a median price of 162,000. The important features have been explored, and insights can be drawn on how to modify and sell houses at a higher price.

Certainly, one can use other models to obtain the SHAP values and explore feature importance; however, that requires the TensorFlow package or Python 3.8. Besides, the other models do not perform as well as Ridge on their own and were only used to increase the submission score through blending.

Note: Using data leakage is not recommended for gaining real practical experience, yet many notebooks found on the competition dashboard use it between the modelling cells without mentioning it. It should be noted that this notebook does not use data leakage, and neither do any of the given references.

References for the notebook:

For understanding of features and introduction:

1. ‘Comprehensive Data Exploration with Python’, by Pedro Marcelino (February 2017)

For models used and scaling features:

2. ‘How I made top 0.3% on a Kaggle competition’, by Lavanya Shukla

For hyper-tuning of models:

3. ‘Data Science Workflow TOP 2% (with Tuning)’, by aqx

For shap values and permutation importance:

4. ‘Machine Learning Explainability’, by Dan Becker (Kaggle mini course)
