Predicting Housing Prices in the U.S.

Prince Okpoziakpo
INST414: Data Science Techniques
5 min read · May 15, 2023

Introduction

In this post, we will explore the use of linear regression machine learning techniques to predict housing prices in 12 states across the U.S. Predicting housing prices is an important task for a variety of stakeholders, including real estate agents, homeowners, and policymakers. By understanding the factors that influence housing prices, we can make more informed decisions about buying and selling homes, as well as develop policies to promote affordable housing.

In the following sections, we will describe the sources and tools used in our analysis, present our findings, discuss the limitations of our predictions, and provide recommendations for future work.

Sources and Tools

The primary tools used for this project were the kaggle API, the pandas data analytics library, and the scikit-learn machine learning library:

  • The kaggle API was used to retrieve the realtor-data.csv dataset.
  • The pandas library is a popular data analysis library and was used to store and manipulate the data in tabular format.
  • The scikit-learn library is an industry-standard machine learning library, and it was used to train and evaluate the regression models.

The dataset used for this project is the ‘USA Real Estate Dataset’, acquired from Kaggle using their Python library, kaggle. This dataset was chosen because it is updated monthly and is sourced from www.realtor.com, a company that provides a “comprehensive list of for-sale properties, as well as information and tools to make informed real estate decisions.”

import kaggle
import pandas as pd

# Download the dataset file from Kaggle into the data/ directory
kaggle.api.dataset_download_file(
    'ahmedshahriarsakib/usa-real-estate-dataset',
    'realtor-data.csv',
    path='data'
)

# Load the dataset into a pandas dataframe
realtor_df = pd.read_csv("data/realtor-data.csv")
print(f"Rows/Columns: {realtor_df.shape}", '\n')
print(f"Variables: {realtor_df.columns}")
Summary of Variables in the `realtor-data.csv` file

Analysis and Results

In this study, we are trying to predict a numerical variable, price, of a house, given the remaining variables in the realtor-data.csv dataset. This makes it a regression problem, which is a supervised learning problem. The linear regression model selected for this problem was Non-Negative Least Squares, as the value we’re trying to predict is naturally positive.
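
As a minimal sketch of this choice (toy data below, not the project’s dataset), scikit-learn exposes non-negative least squares through the positive=True flag of LinearRegression, which constrains every fitted coefficient to be non-negative:

# Minimal sketch of non-negative least squares in scikit-learn
# (toy data for illustration, not the realtor dataset)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.random((100, 3))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.05, size=100)

nnls = LinearRegression(positive=True).fit(X_toy, y_toy)
print(nnls.coef_)  # every fitted coefficient is constrained to be >= 0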

The data is naturally clustered by state, so to take advantage of this, a separate linear regressor was trained and tested for each state cluster. The metric used to evaluate each model was the coefficient of determination, R², where

  • R² = 1 − u/v,
  • u = Σ (y_true − y_pred)², the residual sum of squares, and
  • v = Σ (y_true − mean(y_true))², the total sum of squares.

The best possible value of the coefficient of determination is 1.0. The metric can be arbitrarily negative, because a model can perform arbitrarily worse than simply predicting the mean of the target.
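
As a quick sanity check (toy numbers below, not drawn from the dataset), the same quantity can be computed by hand and compared against scikit-learn’s r2_score, which is what an estimator’s score method reports for regression:

# Computing R^2 by hand and with scikit-learn (toy numbers for illustration)
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250_000, 310_000, 180_000, 420_000])
y_pred = np.array([240_000, 330_000, 200_000, 400_000])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
print(1 - u / v)                           # matches r2_score(y_true, y_pred)
print(r2_score(y_true, y_pred))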

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as tts

realtor_by_state = realtor_df.groupby("state")  # cluster the data by state
states = realtor_by_state.groups.keys()         # get the states in the dataset
cluster_data = dict()                           # dictionary to store the cluster data

for state in states:
    this_state = cluster_data[state] = dict()
    X = realtor_by_state.get_group(state).drop(columns=["price", "state", "city"])
    y = realtor_by_state.get_group(state)["price"]
    X_train, X_test, y_train, y_test = tts(X, y, test_size=.2, random_state=0)

    # Store the training data
    train = this_state["train"] = dict()
    train["data"] = X_train
    train["target"] = y_train

    # Store the test data
    test = this_state["test"] = dict()
    test["data"] = X_test
    test["target"] = y_test

    # Train this cluster's model (non-negative least squares)
    estimator = this_state["estimator"] = LinearRegression(positive=True)
    estimator.fit(train["data"], train["target"])

    # Evaluate the model and keep its predictions for later error analysis
    this_state["y_pred"] = estimator.predict(test["data"])
    this_state["score"] = estimator.score(test["data"], test["target"])

# Store and print the performance results for each cluster's regressor
results_df = pd.DataFrame(columns=["State", "Training Data", "Testing Data", "Model Performance (R^2)"])
for i, state in enumerate(states):
    c = cluster_data[state]
    data = [
        state,
        c["train"]["data"].shape,
        c["test"]["data"].shape,
        round(c["score"], 2)
    ]
    results_df.loc[i] = data
results_df
Metadata for Cluster Regressors

Since this is a regression problem, there is no definitive right or wrong prediction; some predictions are simply closer to the actual value than others. To define mislabeled data, we consider any prediction that differs from the actual value by more than 10%. The Connecticut model was analyzed: with an R² score of -0.91, it did not perform well. By our metric, the predictions were off for ~90% of the data. A possible reason is that the data is highly sparse, so a linear model is not sufficient for this problem.

# Evaluate the performance for 5 wrongly labeled data points
# (the groupby key is the state name as it appears in the dataset)
connecticut = pd.DataFrame({
    "y_true": cluster_data["Connecticut"]["test"]["target"].to_numpy(),
    "y_pred": cluster_data["Connecticut"]["y_pred"],
})
connecticut.loc[
    abs((connecticut.y_true - connecticut.y_pred) / connecticut.y_true) > .1
].head(5)
Wrongly Regressed Values from Connecticut Cluster

Limitations

The analysis performed in this document is limited by the data used. While the dataset is updated monthly and comes from a reliable source, several factors that influence housing prices are not included in it. For example, the quality of a home is not captured in the dataset, yet it can significantly impact the price. Likewise, external factors like the state of the economy, consumer confidence, and interest rates are not included but can impact housing prices.

Conclusion

This document explores the use of linear regression machine learning to predict housing prices in 12 states across the U.S. The dataset used for this project, the ‘USA Real Estate Dataset’, was acquired from Kaggle using their Python library. The primary tools used were the kaggle API, the pandas data analytics library, and the scikit-learn machine learning library. The limitations of the analysis and predictions are discussed, and suggestions for future research to improve the accuracy of predictions are provided.

Future research can focus on expanding the dataset to include more variables that impact housing prices, as well as more states and more recent data, to improve the accuracy of the predictions.
