Developing a Parsimonious Model for Predicting Housing Prices: Methods and Approaches

Sahel Eskandar
6 min read · Mar 16, 2023


Developing an accurate predictive model for housing prices is a complex task that often involves analyzing a large number of predictor variables. At the same time, it is usually desirable to create a model that is as parsimonious as possible: one that uses as few predictor variables as needed to achieve the desired level of prediction accuracy. In this article, we will discuss several methods that can be used to develop a parsimonious model for predicting housing prices.

“Parsimonious” is an adjective that describes something that is simple or minimalistic, yet still effective or sufficient for its intended purpose. In the context of modeling or statistics, a parsimonious model is one that achieves a desired level of accuracy or prediction with as few predictor variables as possible. The idea is to avoid overfitting the model by including unnecessary or redundant variables, while still maintaining enough information to make accurate predictions. A parsimonious model is often preferred over a more complex one, as it is easier to interpret and less likely to suffer from issues such as multicollinearity, overfitting, or lack of generalization.

Lasso Regression

One of the most common methods for creating a parsimonious model is Lasso Regression. Lasso (short for “Least Absolute Shrinkage and Selection Operator”) is a type of linear regression model that performs both variable selection and regularization by adding an L1 penalty term to the cost function. The strength of the L1 penalty is controlled by a hyperparameter called “alpha”.
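Concretely, in scikit-learn's parameterization, Lasso minimizes the ordinary least-squares loss plus an L1 penalty on the coefficients:

\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert

Because the penalty uses absolute values rather than squares, the optimum can set some coefficients exactly to zero, which is what gives Lasso its variable-selection behavior.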

When alpha is set to 0, Lasso becomes equivalent to ordinary least squares regression, and all variables are included in the model. As alpha increases, the penalty term becomes stronger, leading to a more pronounced shrinkage effect on the coefficients. This causes some of the coefficients to shrink towards zero, effectively removing some variables from the model, which in turn helps to prevent overfitting.

As alpha continues to increase, more and more variables are excluded from the model until eventually all coefficients are set to zero, and the model becomes a simple intercept-only model. The optimal value of alpha that balances the trade-off between model complexity and predictive performance can be determined through cross-validation or other tuning methods. Lasso regression can be particularly useful in situations where there are many potential predictor variables, and the goal is to identify which ones are most important.

The following sample code is an example of how Lasso Regression can be implemented in Python. We first load the necessary libraries and data, then split the data into training and testing sets, fit a Lasso regression model to the training data using an alpha value of 0.1, and evaluate the performance of the model on the test data using the R-squared score.

# Load necessary libraries and data
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

data = pd.read_csv('housing_data.csv')
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the Lasso regression model to the training data
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Evaluate the performance of the model on the test data
print("R-squared score:", lasso.score(X_test, y_test))

Principal Component Regression (PCR)

Another approach that can be used to create a parsimonious model is Principal Component Regression (PCR). PCR is a technique that involves transforming the original predictor variables into a new set of orthogonal variables, known as principal components. These principal components are then used as predictors in a regression model. PCR can be particularly useful when there are many potential predictor variables that are highly correlated with each other. In such situations, a small number of principal components can capture most of the variance in the data, greatly reducing its dimensionality; note, however, that because each component is a combination of all the original variables, PCR reduces the number of model inputs rather than selecting individual predictors.

# Load necessary libraries and data
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

data = pd.read_csv('housing_data.csv')
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline for PCR
pca = PCA(n_components=5)
linear_regression = LinearRegression()
pipeline = Pipeline([('pca', pca), ('linear_regression', linear_regression)])

# Fit the PCR model to the training data
pipeline.fit(X_train, y_train)

# Evaluate the performance of the model on the test data
print("R-squared score:", pipeline.score(X_test, y_test))

As in the Lasso example, we first load the necessary libraries and data, then split the data into training and testing sets. We then define a pipeline for PCR: a PCA step that reduces the data to 5 principal components, followed by a linear regression step that fits a model to the transformed data. Finally, we fit the pipeline to the training data and evaluate the performance of the model on the test data using the R-squared score.
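Two practical notes on this pipeline: PCA is sensitive to the scale of the features, and the choice of 5 components is arbitrary. Below is a minimal sketch, continuing from the code above, that standardizes the features first and tunes the number of components by cross-validation (the candidate values are assumptions and should not exceed the number of available features):

# Standardize features before PCA so no single scale dominates
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('linear_regression', LinearRegression())
])

# Treat the number of components as a hyperparameter to tune
param_grid = {'pca__n_components': [2, 5, 10, 15]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best n_components:", search.best_params_['pca__n_components'])
print("R-squared score:", search.score(X_test, y_test))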

Stepwise Regression

Stepwise regression is another method that can be used to create a parsimonious model. Stepwise regression is an iterative process that involves adding or removing predictor variables one at a time based on their statistical significance. The process continues until a final model is reached that includes only the most important predictors. Stepwise regression can be particularly useful when there are many potential predictor variables and the goal is to identify which ones are most important.

statsmodels does not provide a built-in stepwise option, so here’s an example that implements forward stepwise selection in Python on top of its OLS model:

# Load necessary libraries and data
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv('housing_data.csv')
X = data.drop('Price', axis=1)
y = data['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Forward stepwise selection: at each step, add the candidate predictor
# with the lowest p-value, stopping when no candidate is significant
def forward_stepwise(X, y, threshold=0.05):
    selected = []
    remaining = list(X.columns)
    while remaining:
        pvalues = {}
        for candidate in remaining:
            ols = sm.OLS(y, sm.add_constant(X[selected + [candidate]])).fit()
            pvalues[candidate] = ols.pvalues[candidate]
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] >= threshold:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

selected_features = forward_stepwise(X_train, y_train)
model = sm.OLS(y_train, sm.add_constant(X_train[selected_features])).fit()

# Evaluate the performance of the model on the test data
y_pred = model.predict(sm.add_constant(X_test[selected_features]))
print("R-squared score:", r2_score(y_test, y_pred))

In this code, we first load the necessary libraries and data, then split the data into training and testing sets. The forward_stepwise function starts from an empty model and, at each iteration, adds the candidate predictor with the lowest p-value, stopping once no remaining candidate is significant at the chosen threshold. We then refit an OLS model on the selected features, adding an intercept term with the sm.add_constant function, and evaluate its performance on the test data by computing the R-squared score with r2_score on the held-out predictions (the model.rsquared attribute would only report in-sample fit). Note that this is just one possible implementation of stepwise regression in Python; backward elimination or criteria such as AIC and BIC are common variations, and scikit-learn offers an implementation of its own.
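As one such alternative, scikit-learn provides SequentialFeatureSelector, which performs forward or backward selection based on cross-validated model performance rather than p-values. Here is a minimal sketch reusing the training split from above (by default it keeps roughly half of the features):

# Greedy forward selection scored by cross-validated performance
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(LinearRegression(), direction='forward', cv=5)
sfs.fit(X_train, y_train)

selected = list(X_train.columns[sfs.get_support()])
print("Selected features:", selected)

# Refit on the selected features and score on the test set
final_model = LinearRegression().fit(X_train[selected], y_train)
print("R-squared score:", final_model.score(X_test[selected], y_test))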

When using any of these methods, it is important to keep in mind that the goal is not just to create a parsimonious model, but also to create a model that accurately predicts housing prices. Therefore, it is important to use appropriate metrics to evaluate the performance of the model. One commonly used metric is the mean squared error (MSE), which measures the average squared difference between the predicted and actual housing prices. Other metrics such as the R-squared value and the root mean squared error (RMSE), which is expressed in the same units as the prices themselves, can also be used to evaluate the performance of the model.
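Continuing from the Lasso example above, these metrics can be computed with scikit-learn (a sketch, assuming the fitted lasso model and the test split are still in scope):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = lasso.predict(X_test)

# MSE penalizes large errors quadratically; RMSE is in price units
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, R-squared: {r2:.3f}")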

In conclusion, creating a parsimonious model for predicting housing prices can be a challenging task. However, there are several methods that can be used to identify the most important predictor variables and create a model that achieves the desired level of prediction accuracy. Lasso regression, PCR, and stepwise regression are all potential options that can be used to develop a parsimonious model. It is important to evaluate the performance of the model using appropriate metrics to ensure that it accurately predicts housing prices.

👏 Don’t forget to give this article some claps and share it with your network to support my work! Feel free to follow my Medium profile for more insightful content on machine learning and data science. Thank you for being so supportive! 🚀
