Modeling house prices with Elastic Net, Lasso, Ridge…

Published in

Modeling House Prices

Mar 28, 2020

The Dataset

This dataset has many variables that could affect the market price of a house, along with the sale price of each property. It is easy to find on kaggle.com.

The properties are located in Ames, Iowa.

It has 1465 rows and 81 columns; each row holds the information for one house.

Now it's time to look for missing values in our data:

There are 19 columns with NaNs, and 4 columns with more than 75% NaNs: Alley, PoolQC, Fence, MiscFeature.

Given that proportion of missing values, I decided to drop these columns.
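A check like this can be sketched in a few lines of pandas. The toy DataFrame below is a made-up stand-in for the real data (only the column names come from the dataset), and the 75% threshold is the one mentioned above:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "Alley": [np.nan, np.nan, np.nan, np.nan, "Grvl"],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, largest first
nan_share = train.isna().mean().sort_values(ascending=False)
print(nan_share[nan_share > 0])

# Drop every column whose share of NaNs exceeds the threshold
threshold = 0.75
to_drop = nan_share[nan_share > threshold].index.tolist()
train = train.drop(columns=to_drop)
print("Dropped:", sorted(to_drop))
```

On the real data, the same two steps would list the columns with NaNs and remove the ones above the threshold.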

Explore the Data

The response variable, SalePrice, holds the house prices of the dataset. Let's plot and analyze it:

The average price is $180,921
The maximum price is $755,000
The minimum price is $34,900
The standard deviation is $79,443, which is really high for that average price. It means there's a lot of dispersion in our data, due to the heterogeneity of the houses.
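One way to put a number on "really high" is the coefficient of variation, the standard deviation divided by the mean. That measure is not part of the original analysis; the small sketch below just plugs in the figures reported above:

```python
# Figures reported above for SalePrice (in dollars)
mean_price = 180_921
std_price = 79_443

# Coefficient of variation: dispersion relative to the average price
cv = std_price / mean_price
print(f"Coefficient of variation: {cv:.1%}")
```

The standard deviation comes out at roughly 44% of the mean, which fits the picture of a very mixed set of houses.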

Correlation between features

The correlation matrix lets us see the correlations between all the variables in the dataset.

The features show high correlations with each other. Let's look at those with a correlation higher than 0.6 with the response variable, SalePrice:

The features related to size, like GrLivArea and TotalBsmtSF, seem to be highly correlated with the price and with each other.
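A filter like the 0.6 cut-off can be sketched as follows. The data here is synthetic (random numbers that only borrow the column names GrLivArea, MoSold and SalePrice), so the output says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one feature tied to the price, one unrelated feature
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
train = pd.DataFrame({
    "GrLivArea": area,
    "MoSold": rng.integers(1, 13, n).astype(float),
    "SalePrice": 100 * area + rng.normal(0, 10_000, n),
})

# Correlation matrix of the numeric columns
corr = train.select_dtypes("number").corr()

# Keep the features whose absolute correlation with SalePrice exceeds 0.6
target_corr = corr["SalePrice"].drop("SalePrice")
strong = target_corr[target_corr.abs() > 0.6].sort_values(ascending=False)
print(strong)
```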

Let's create a variable that sums them, and then delete the individual ones.

train["Total_house_area"] = train["1stFlrSF"] + train["2ndFlrSF"] + train["TotalBsmtSF"] + train["GarageArea"]

…and the correlations (>0.6) with SalePrice:

Now the new variable Total_house_area is the variable most correlated with SalePrice!
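For completeness, here is a small sketch of the "sum, then delete" step on a toy DataFrame (three made-up houses). Note one assumption: `sum(axis=1)` treats a missing area as 0, while adding the columns with `+` would give NaN for that row:

```python
import pandas as pd

# Three made-up houses, using the column names from the dataset
train = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "TotalBsmtSF": [856, 1262, 920],
    "GarageArea": [548, 460, 608],
    "SalePrice": [208500, 181500, 223500],
})

area_cols = ["1stFlrSF", "2ndFlrSF", "TotalBsmtSF", "GarageArea"]

# Sum the size-related columns into one feature...
train["Total_house_area"] = train[area_cols].sum(axis=1)

# ...and drop the individual ones
train = train.drop(columns=area_cols)
print(train)
```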

Let's look at the scatter matrix to see these 5 variables in more detail, along with their relations to each other and to SalePrice:

A priori, it seems that OverallQual and Total_house_area play a huge role in SalePrice. GarageCars, YearBuilt and FullBath are also important, as they are a sort of proxy for wealth.

Feature Engineering

Divide the data into X and the response variable y.
Split into train and test sets.
Dummy-code the categorical features.
Fill NaN with the mean of the column (numerical features) or the mode (categorical features).

import pandas as pd

# X and y: drop rows without a sale price, then separate target and features
train = train.dropna(subset=['SalePrice'])
y = train['SalePrice']
X = train.drop(columns='SalePrice')

# Fill NaN in numerical columns with the column mean
zmean = X.select_dtypes(include=['float64', 'int64']).apply(lambda x: x.fillna(x.mean()), axis=0)
zmean.head()

# Fill NaN in categorical columns with the most frequent value
# (mode() returns a Series, so we take its first entry)
fill_mode = lambda col: col.fillna(col.mode().iloc[0])
an = X.select_dtypes(include='object').apply(fill_mode, axis=0)
an.head()

# Dummy-code the categorical features
dummies = pd.get_dummies(an)

At this point the data is ready to be split into train and test sets, and then we can model SalePrice as a function of the rest of the features. Let's use a 70/30 split ratio for that.
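The split itself could look like the sketch below, using scikit-learn's `train_test_split`. The small `zmean`, `an` and `y` objects are made-up stand-ins for the prepared data, and `X_prepared` and `random_state=42` are names and values chosen here for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the prepared pieces (made-up values, 10 houses)
zmean = pd.DataFrame({
    "Total_house_area": [3114, 2984, 3314, 3031, 4179, 2638, 4016, 3681, 3194, 2359],
    "OverallQual": [7, 6, 7, 7, 8, 5, 8, 7, 7, 5],
})
an = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl", "Pave", "Pave",
                              "Grvl", "Pave", "Pave", "Pave", "Grvl"]})
y = pd.Series([208500, 181500, 223500, 140000, 250000,
               143000, 307000, 200000, 129900, 118000], name="SalePrice")

# Join the numerical features with the dummy-coded categorical ones
dummies = pd.get_dummies(an)
X_prepared = pd.concat([zmean, dummies], axis=1)

# 70/30 split; random_state only makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_prepared, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```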

Using a linear model, let's select the optimal number of features. As the graph shows, the optimal number of features is 113, using a linear model and R2 as the metric.
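There are several ways to run this kind of search. One option, which is an assumption here and not necessarily the procedure behind the graph, is recursive feature elimination with cross-validation (`RFECV` in scikit-learn), scored with R2. The sketch uses synthetic data from `make_regression`:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Drop one feature at a time and keep the subset with the best cross-validated R2
selector = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring="r2")
selector.fit(X_train, y_train)

print("Optimal number of features:", selector.n_features_)
print("Test R2 with the selected features:", selector.score(X_test, y_test))
```

`selector.support_` is a boolean mask of the selected columns, which can be used to filter the train and test sets before fitting other models.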

With these 113 features and the train and test sets, let's fit three more models in order to answer:

What’s the best linear model for this problem?
What are the most important features to predict the house prices?
Are the variables most correlated with the price also the most important in the models?

Let's fit 3 models using GridSearchCV in order to find the best parameters:

Lasso.
Ridge.
Elastic net.
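As a rough sketch of this step, the three models can be tuned in a loop with `GridSearchCV`. The data is synthetic, and the parameter grids below are assumptions picked for illustration, not the grids used for the results in this post:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared housing features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Illustrative grids (assumed values)
alphas = [0.01, 0.1, 1.0, 10.0]
searches = {
    "Lasso": (Lasso(max_iter=10_000), {"alpha": alphas}),
    "Ridge": (Ridge(), {"alpha": alphas}),
    "ElasticNet": (ElasticNet(max_iter=10_000),
                   {"alpha": alphas, "l1_ratio": [0.2, 0.5, 0.8]}),
}

results = {}
for name, (model, grid) in searches.items():
    # 5-fold cross-validated search, scored with R2
    search = GridSearchCV(model, grid, scoring="r2", cv=5)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "test_r2": search.score(X_test, y_test),  # R2 of the refitted best model
    }

for name, res in results.items():
    print(name, res["best_params"], round(res["test_r2"], 3))
```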

The best model based on the test R2:

…is the Elastic Net regression, which has the best test R2 score.

What about the most important variables in these models?

The variables most correlated with SalePrice are often used in the four models, but they are not in the top 4 positions in any of them. That's why the models offer not only a prediction but also a good tool for understanding the business.
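One common way to rank variables in a linear model, assumed here since the post does not spell out how importance was measured, is to standardize the features and sort the coefficients by absolute value. The sketch below does that for a Lasso on synthetic data that only borrows a few column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: price driven by area and quality, not by month sold
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Total_house_area": rng.normal(3000, 800, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "MoSold": rng.integers(1, 13, n).astype(float),
})
y = 60 * X["Total_house_area"] + 8000 * X["OverallQual"] + rng.normal(0, 5000, n)

# Standardize so that coefficient sizes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=100.0).fit(X_scaled, y)

# Rank features by the absolute value of their coefficient
ranking = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
print(ranking)
```

Comparing a ranking like this with the correlation list is one way to see where the two views agree and where they differ.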