Housing Price Prediction

jimmy
May 26, 2024

Predicting House Prices using Linear Regression

In the realm of real estate, predicting house prices accurately is invaluable for both buyers and sellers. Machine learning models, particularly linear regression, offer a powerful tool to make such predictions based on various features of the property. In this blog post, we’ll walk through the implementation of a linear regression model to predict house prices using Python and the scikit-learn library.

Understanding the Dataset

We begin by loading our dataset. The dataset contains information about various houses including their area, number of bedrooms, bathrooms, stories, whether they are on the main road, and the price. We use the Pandas library to read the dataset:

import pandas as pd

df = pd.read_csv('Housing.csv')

We then explore the dataset, examining its dimensions and the number of entries for each feature. Additionally, we drop several features that we leave out of this analysis:

columns_to_drop = ['guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus']
df.drop(columns=columns_to_drop, inplace=True)
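
The exploration step mentioned above is not shown in code, so here is a minimal sketch of what it might look like. It assumes a tiny made-up stand-in for Housing.csv (the `sample` frame and its values are invented for illustration) and uses `shape`, `count()`, and `value_counts()` to look at the dimensions and the number of entries per feature:

```python
import pandas as pd

# Hypothetical stand-in for Housing.csv: these rows are invented for illustration.
sample = pd.DataFrame({
    "price": [13300000, 12250000, 9100000, 4200000],
    "area": [7420, 8960, 6000, 3000],
    "bedrooms": [4, 4, 3, 2],
    "bathrooms": [2, 4, 1, 1],
    "stories": [3, 4, 2, 1],
    "mainroad": ["yes", "yes", "yes", "no"],
})

n_rows, n_cols = sample.shape                          # dimensions: (rows, columns)
non_null_counts = sample.count()                       # non-missing entries per column
mainroad_counts = sample["mainroad"].value_counts()    # rows per category

print(n_rows, n_cols)
print(non_null_counts)
print(mainroad_counts)
```

On the real dataset, the same calls on `df` would show whether any column has missing values before we move on to preprocessing.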

Preprocessing the Data

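Before feeding the data into our model, we need to preprocess it. This involves encoding categorical variables and scaling numerical features. We one-hot encode the categorical ‘mainroad’ feature with OneHotEncoder, dropping the first category so that it becomes a single 0/1 column. We scale the ‘area’ feature with StandardScaler, configured here with with_mean=False so that values are divided by their standard deviation without being centered. The remaining numeric columns pass through unchanged, and the preprocessing is chained with the regressor in a single Pipeline: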

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale 'area' and one-hot encode 'mainroad'; other columns pass through unchanged
column_transformer = ColumnTransformer(
    transformers=[
        ('scale_area', StandardScaler(with_mean=False), ['area']),
        ('onehot', OneHotEncoder(drop='first'), ['mainroad'])
    ],
    remainder='passthrough'
)

# Chain preprocessing and the regressor so both are fitted together
pipeline = Pipeline([
    ('preprocessor', column_transformer),
    ('regressor', LinearRegression())
])

Training the Model

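With the pipeline defined, we split the data into training and testing sets, holding out 20% of the rows for testing. Fitting the pipeline on the training data applies the preprocessing steps and then trains the linear regression model: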

from sklearn.model_selection import train_test_split

X = df[['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)

Evaluating the Model

Finally, we evaluate the performance of our model using various metrics such as Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, and R-squared:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict prices for the held-out test set
y_pred = pipeline.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
# Take the square root directly; the squared=False argument is deprecated in recent scikit-learn releases
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)
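
For readers who want to see what these metrics compute, here is a small sketch that calculates each one by hand with NumPy and compares it with scikit-learn. The `y_true` and `y_hat` arrays are made-up numbers for illustration, not results from the housing model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual values and predictions, for illustration only.
y_true = np.array([300.0, 450.0, 500.0, 650.0])
y_hat = np.array([320.0, 430.0, 540.0, 610.0])

errors = y_true - y_hat

mse_manual = np.mean(errors ** 2)        # average squared error
rmse_manual = np.sqrt(mse_manual)        # same units as the target
mae_manual = np.mean(np.abs(errors))     # average absolute error

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, mae_manual, r2_manual)

# The hand-written versions agree with scikit-learn's implementations.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because RMSE and MAE are in the same units as the price, they are usually easier to interpret than MSE, while R-squared gives a unit-free sense of how much of the variance in price the model explains.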

Conclusion

In this blog post, we implemented a linear regression model to predict house prices based on a handful of property features. We preprocessed the data, trained the model inside a single pipeline, and evaluated its predictions on a held-out test set. There’s always room for improvement, such as reintroducing some of the dropped features, exploring different algorithms, or tuning hyperparameters. Nonetheless, this serves as a foundational example of applying machine learning to real-world problems in the domain of real estate.
