Predicting House Prices Using Multi-parametric Linear Regression

Manu Mulaveesala
10 min read · Jan 13, 2022


One of the first machine learning techniques that budding data scientists learn is prediction with linear regression, and predicting house values from real estate data is a classic regression exercise. To make things a little less dry, let's imagine a real-life scenario where we can actually apply our machine learning prowess to serve a real client. Imagine the following statement of work (SOW) comes across your desk:

Client Use Case

You are a data scientist who has been tasked by an up-and-coming real-estate company with predicting sale prices for housing data. They have decided to first put you to the test: they will evaluate how well your model predicts prices on an example dataset with a portion of the data held back (a true blind) for a final test of your model. This way they can feel confident that your models have a good chance of predicting the unknown data they are gathering. Success will be measured by your ability to deliver more than one model for the task at hand, along with a comparative analysis of the regression models you use (multiple linear regression, Ridge, and Lasso). The primary stakeholder is the hiring manager at ADVORE Realty, who is looking to hire you; the secondary stakeholders are the existing technical members of the ADVORE team, who will be listening in and providing technical feedback.

The dataset used in this project can be found here.

EDA and Data Cleaning

Initial Imports

The main data cleaning steps involved imputing missing values with the median of each column and intentionally excluding columns with a high percentage of nulls from the training set. In the exploratory data analysis, we looked at how each feature correlates with sale price, along with scatterplots of the top features against sale price; this helped us find outliers to investigate for further cleaning. Finally, we unpacked the categorical "neighborhood" feature by generating dummy columns for use in the model.

## IMPORT LIBRARIES
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## IMPORT DATA
df = pd.read_csv('./datasets/train.csv')  # Read in CSV
df.columns = [col.replace(' ', '_').lower() for col in df.columns]  # Standardize column names

## DATA QUALITY CHECKS
nan_cols = [col for col in df.columns if df[col].isnull().any()]  # Find columns with nulls
percent_null = df[nan_cols].isna().sum().sort_values(ascending=False) / len(df) * 100  # As a percentage of rows per column
percent_null

The code above outputs the percentage of null values per column, in descending order (in other words, these are the columns to consider removing).

OUTPUT: 

pool_qc 99.561190
misc_feature 96.830814
alley 93.174061
fence 80.497318
fireplace_qu 48.756704
lot_frontage 16.089712
garage_yr_blt 5.558264
garage_cond 5.558264
garage_qual 5.558264
garage_finish 5.558264
garage_type 5.509508
bsmt_exposure 2.827889
...
bsmtfin_sf_1 0.048757
dtype: float64
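One way to act on this list is to drop the worst offenders and median-impute the remaining numeric columns. Here is a minimal sketch; the 40% cutoff is an assumed example value, not something fixed by the project:

## Drop columns with a high share of nulls (threshold is an assumed example value)
high_null_cols = percent_null[percent_null > 40].index
df = df.drop(columns=high_null_cols)

## Median-impute the remaining numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())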

Let’s take a quick look at a histogram of our target variable, which can help us determine whether there is some skew in our data.

Histogram for our target variable, sale price. We can clearly note a skew in this distribution.

As we can see from the distribution above, the data has a positive (right) skew: there is a long tail toward the higher sale prices and a short tail at the lower end. This makes sense because sale prices are bounded below by zero; it is quite rare to hear of a piece of real estate selling for less than $0, even if it is completely ransacked! Therefore, we may want to take this into consideration for any results that we generate.
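A quick way to quantify this (a small sketch, not part of the original workflow) is to compute the skewness directly and compare it with the skewness of the log-transformed prices:

## Positive values indicate right skew; the log transform should pull it toward 0
print(f"Skew of saleprice:        {df['saleprice'].skew():.2f}")
print(f"Skew of log(saleprice+1): {np.log1p(df['saleprice']).skew():.2f}")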

LOCATION, LOCATION, LOCATION!

In the following code, we can test the age-old adage that location is king in real estate by comparing average sale prices across the neighborhoods in our data to see whether a pattern exists.

df.neighborhood.value_counts()
plt.figure(figsize=(13, 7))
# Show the mean sale price of each neighborhood
df.groupby('neighborhood')['saleprice'].mean().sort_values(ascending=False).plot.barh()
plt.axvline(df['saleprice'].mean(), color='k')  # Black = mean
plt.axvline(x=df['saleprice'].mean() + 2 * df['saleprice'].std(), color='r')  # Red = 2 STD
plt.axvline(x=df['saleprice'].mean() - 2 * df['saleprice'].std(), color='r')
plt.axvline(x=df['saleprice'].mean() + 1 * df['saleprice'].std(), color='y')  # Yellow = 1 STD
plt.axvline(x=df['saleprice'].mean() - 1 * df['saleprice'].std(), color='y')
plt.title("Average Sale Price of Various Neighborhoods", fontsize=20)
plt.xlabel("Average Sale Price", fontsize=18)
plt.ylabel("Neighborhood", fontsize=18)
plt.tight_layout()

Notice in the graph above that most of the neighborhoods fall within 2 standard deviations of the mean sale price for all the neighborhoods.
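If you want that same information as numbers rather than reference lines, a small sketch like the following (not part of the original notebook) lists any neighborhoods whose average sale price falls outside the 2-standard-deviation band:

neighborhood_means = df.groupby('neighborhood')['saleprice'].mean()
lower = df['saleprice'].mean() - 2 * df['saleprice'].std()
upper = df['saleprice'].mean() + 2 * df['saleprice'].std()
print(neighborhood_means[(neighborhood_means < lower) | (neighborhood_means > upper)])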

Feature Selection

One of the important things to understand for a linear regression model, especially one like this with multiple features that may affect the housing price, is how strongly the target variable correlates with each feature. Checking these linear correlations gives an initial preview of potentially valuable features. Often, one of the most difficult tasks when working with a large dataset containing many types of features is knowing which features to keep and which to let go.

One of the more convenient ways to evaluate your feature set (the columns you are considering including in a linear regression model with your target variable) is to create a heat-map of linear correlation scores. This heat-map is adjusted from the traditional n x n square grid that plots each column against every other column: instead, we create a heat-map that only compares each column with the target variable, and we order the columns by descending linear correlation with sale price. The code to accomplish this is below:

plt.figure(figsize=(5, 10))
# Use a heatmap to discover the (numeric) variables most highly correlated with saleprice
sns.heatmap(df.corr()[['saleprice']].sort_values('saleprice', ascending=False),
            vmin=-1,
            vmax=1,
            cmap='coolwarm',
            annot=True);
plt.title("Correlation Heatmap for Sale Price for Feature Selection", fontsize=18);

Note: Red (warm) indicates a positive correlation with sale price, while blue (cool) indicates a negative correlation. The magnitude of the number represents the strength of the correlation in either direction.

We can see here the positive correlations in red/orange and the negative correlations in blue relative to our target variable, sale price. Each score measures how well the given column correlates with the target column.

From the above heat-map, it becomes quite simple to determine that overall quality, above-grade living area, garage area, garage cars, and square footage are among the most highly correlated features.
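If you prefer to read the same information off as numbers rather than colors, a one-liner like this (a small sketch, not in the original post) does the trick:

## Top 10 numeric features by linear correlation with sale price
print(df.corr()['saleprice'].sort_values(ascending=False).head(10))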

This correlation matrix is a nice part of the story; however, it is not the full picture.

Remember that only numeric features can be included in the heat-map, so it leaves out the critical "neighborhood" feature and how it relates to our linear regression model. Therefore, we may want to include the neighborhoods as individual binary parameters called dummy variables: a dummy variable is 1 or 0 for each row depending on whether that data point belongs to that neighborhood. These dummy variables are added as additional columns to our feature set.

def make_dummy_neighborhoods(df):
    # Re-read the training data to compute correlations on a clean copy
    df_train = pd.read_csv('./datasets/train.csv')
    df_train.columns = [col.replace(' ', '_').lower() for col in df_train.columns]

    # Automatically grab the 10 features most correlated with sale price (including saleprice itself)
    top_correlations = list(df_train.corr()[['saleprice']].sort_values('saleprice', ascending=False).head(10).index)
    df_subset = df.loc[:, top_correlations]
    df_subset.fillna(df_subset.median(), inplace=True)
    df_subset['neighborhood'] = df['neighborhood']

    # One-hot encode the neighborhood column (drop the first level to avoid perfect collinearity)
    df_dummies = pd.get_dummies(df_subset, columns=['neighborhood'], drop_first=True)

    features = list(df_dummies.columns)[1:]  # Everything except saleprice
    X = df_dummies[features]
    y = df_dummies['saleprice']
    return X, y

X, y = make_dummy_neighborhoods(df)

Machine Learning Modeling

To build our machine learning models, we will consider three approaches: standard multiple linear regression (MLR), Lasso regression, and Ridge regression. You can read more about the differences between Lasso and Ridge regression here. In short, both are regularization methods: they add an intentional bias, in the form of a penalty on the size of the coefficients, which reduces the overfitting that can occur with standard linear regression. In addition, Lasso gives us a useful form of built-in feature selection, since it can shrink the coefficients of less important features all the way to zero.
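For intuition, here is a minimal sketch of the two penalized objectives as toy functions (illustrative only, not the project's code; the variable names are placeholders):

import numpy as np

def ridge_loss(y, X, beta, alpha):
    # Squared error plus an L2 penalty on the coefficients
    return np.sum((y - X @ beta) ** 2) + alpha * np.sum(beta ** 2)

def lasso_loss(y, X, beta, alpha):
    # Squared error plus an L1 penalty on the coefficients
    # (scikit-learn scales the error term slightly differently, but the idea is the same)
    return np.sum((y - X @ beta) ** 2) + alpha * np.sum(np.abs(beta))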

Let’s go ahead and define some simple functions to set up our regression models:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler


def get_mlr_regression(X, y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    ols = LinearRegression()

    # For reference, grab the 20 features most correlated with sale price
    # (the model below is fit on the X and y passed into this function)
    features = list(df.corr()[['saleprice']].sort_values('saleprice', ascending=False).head(20).index)
    df_subset = df.loc[:, features]
    df_subset.fillna(df_subset.median(), inplace=True)
    features = list(df_subset.columns)[1:]

    ols.fit(X_train, y_train)
    score_ols_train = ols.score(X_train, y_train)
    score_ols_test = ols.score(X_test, y_test)

    print(" Regression ".center(18, "="))
    print(f'The R Squared Train score is {score_ols_train}')
    print(f'The R Squared Test score is {score_ols_test}')

    return ols, X_train, X_test, y_train, y_test

ols, X_train_lr, X_test_lr, y_train_lr, y_test_lr = get_mlr_regression(X, y)
OUTPUT:

=== Regression ===
The R Squared Train score is 0.8572810382033677
The R Squared Test score is 0.8661359592361862

Note that the feature set above draws on the top 20 features most correlated with sale price in our data. We can see that our R-squared metric is relatively decent for both training and test, but it is far from perfect.
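Since R-squared is unitless, it can also help to report the error in dollars. Here is a small sketch using the mean_squared_error we imported above (not part of the original write-up):

## Root mean squared error on the held-out test split, in dollars
preds = ols.predict(X_test_lr)
rmse = mean_squared_error(y_test_lr, preds) ** 0.5
print(f'Test RMSE: ${rmse:,.0f}')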

from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso, LassoCV


def get_split_scale(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    sc = StandardScaler()
    Z_train = sc.fit_transform(X_train)  # Fit to the training means/stds, then transform
    Z_test = sc.transform(X_test)        # Apply the same transform to the test data
    return Z_train, y_train, Z_test, y_test

Z_train, y_train, Z_test, y_test = get_split_scale(X, y)


def get_ridge_fit(X, y):
    Z_train, y_train, Z_test, y_test = get_split_scale(X, y)

    alphas = np.logspace(0, 5, 100)
    ridge_cv = RidgeCV(alphas=alphas, cv=5)
    ridge_cv.fit(Z_train, y_train)

    return ridge_cv


def get_lasso_fit(X, y):
    Z_train, y_train, Z_test, y_test = get_split_scale(X, y)

    l_alphas = np.logspace(-3, 0, 100)
    lasso = LassoCV(alphas=l_alphas, cv=5)
    lasso.fit(Z_train, y_train)

    return lasso


ols, X_train_lr, X_test_lr, y_train_lr, y_test_lr = get_mlr_regression(X, y)
ridge_cv = get_ridge_fit(X, y)  # Fit the cross-validated Ridge model
lasso = get_lasso_fit(X, y)     # Fit the cross-validated Lasso model

print(" Ridge ".center(18, "="))
print(f'The RIDGE Train score is: {ridge_cv.score(Z_train, y_train)}')
print(f'The RIDGE Test score is: {ridge_cv.score(Z_test, y_test)}')
print(" Lasso ".center(18, "="))
print(f'The LASSO Train score is: {lasso.score(Z_train, y_train)}')
print(f'The LASSO Test score is: {lasso.score(Z_test, y_test)}')

OUTPUT:
=== Regression ===
The R Squared Train score is 0.8572810382033677
The R Squared Test score is 0.8661359592361862
===== Ridge ======
The RIDGE Train score is: 0.8569573806546568
The RIDGE Test score is: 0.8658378543256252
===== Lasso ======
The LASSO Train score is: 0.8572807863410608
The LASSO Test score is: 0.8661268619811613

Comparing the test scores, we can see that we have reached a relative ceiling on improvement with these models: Ridge and Lasso perform almost identically to the plain linear regression here. All is not in vain, however, because we have developed a methodology that can be repeated on new data where overfitting is a real concern. Note also that the scores above already include some tweaking of our initial models.

Finally, let’s see what the added bonus of doing Lasso on our data can provide for us:

lasso_coefs = pd.Series(lasso.coef_, index=list(X.columns))

plt.figure(figsize=(8, 10))
# Plot the 15 largest nonzero coefficient magnitudes (sorted so the biggest bar is on top)
abs(lasso_coefs[lasso_coefs != 0]).sort_values().tail(15).plot.barh()
plt.title('Nonzero Lasso Coefficients in Relation to Feature Set', fontsize=18);
plt.xlabel('Lasso Coefficient', fontsize=16);
plt.ylabel('Feature Name', fontsize=16);
plt.tight_layout()

The bar chart above is a handy snippet for comparing the features that matter most to our model. We can see some strong contenders from our original heat-map, but also certain neighborhoods that are highly relevant to the model. This can be useful for our client, ADVORE Realty, in understanding how the machine learning model behaves across different neighborhoods.

Discussions and Conclusion

The goal of this project was to provide our "client," the ADVORE team, with an in-depth look at how certain home features (attributes) lead to increased sale price. In the early data exploration, we noticed connections between sale price and many features that match the common sense of the real-estate industry, such as "Greater Living Area," "Overall Quality," "Year Built," "Basement Sq. Ft.," "Garage Area," and "1st Floor Square Footage," along with specific neighborhoods that could be important to price in this particular dataset, such as "ClearCr," "CollgCr," "BrDale," and "Blueste."

Future improvements to this project would involve a log transformation of the target variable, since we saw a skewed distribution in the original histogram of sale prices. If we reran our models using the log-transformed sale price, we might see better results.
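A minimal sketch of what that could look like with the MLR model (illustrative only; it reuses the train/test split from earlier and converts predictions back to dollars with expm1):

## Fit on the log-transformed target, then convert predictions back to dollars
ols_log = LinearRegression()
ols_log.fit(X_train_lr, np.log1p(y_train_lr))

log_preds = ols_log.predict(X_test_lr)
dollar_preds = np.expm1(log_preds)  # Invert the log1p transform
rmse_log = mean_squared_error(y_test_lr, dollar_preds) ** 0.5
print(f'Test RMSE with log-transformed target: ${rmse_log:,.0f}')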

In conclusion, this project was an interesting exercise in data exploration and data science. In the world of real estate, an accurate model can be a powerful tool for estimating the potential value of homes, whether for appraisals or for new homes about to go on the market. That can be a powerful asset for a real-estate team willing to take part in the next generation of technology in the industry; companies like Zillow, Redfin, and Rex have already transformed the business by incorporating machine learning models into their approach to the home-buying and home-selling process.

I hope this was a useful exercise to move a bit deeper than the traditional linear regression approach for the classic case of predicting housing prices.

If you haven’t already, play around with this dataset and explore these methods for yourself. There is no better way to learn than getting your hands dirty!

Feel free to leave a comment below if you have any questions about the code or the approach taken, or if you have any comments about improvements that could be made here. Thank you for reading.

Article written by:

Manu Mulaveesala
