Guide to Building a Real Estate Price Prediction Model Using ML Algorithms

Binisha Banjara
16 min read · Nov 2, 2023

Introduction

In the ever-evolving landscape of real estate, the ability to accurately predict house prices is a valuable skill for homeowners, buyers, and investors alike. Traditional pricing methods often fall short of capturing the complex dynamics that influence property values. Machine Learning is a powerful tool that can revolutionize the way we approach this challenge.

The real estate market is ever changing and can be affected by various factors like location, property features, and economic conditions. Building these models may seem intimidating, but we’ll guide you step by step. By the end, you’ll know how to gather and prepare real estate data, select the right model, and evaluate its performance. We’ll also talk about how to deploy our model. Whether you’re curious about your local housing market or want to use data in real estate, come with us on this journey to explore the world of real estate price prediction.


Learning Objectives:

  • Learn how to collect and preprocess real estate data for machine learning.
  • Explore various machine learning models for predicting real estate prices.
  • Discover how to validate models and ensure their reliability.
  • Fine-tune models for accurate predictions through hyperparameter tuning.
  • Gain practical insights into using machine learning for real estate price forecasting.

In this concise guide, we will explore the process of building a real estate price prediction model using data sourced from Kaggle, a popular platform for data science and machine learning enthusiasts.

The dataset we’ll be working with provides a rich source of information encompassing various property features, historical price data, and more. By applying data-driven techniques and machine learning, we’ll walk you through the essential steps to create a predictive model capable of estimating property values effectively.

Data Collection and Preprocessing

In this section, we will outline the crucial steps involved in data collection and preprocessing. As mentioned earlier, we will be using a dataset from Kaggle containing housing data for Bengaluru, India. To start, let’s import all the necessary libraries. You can install the Anaconda distribution, which comes with the full data science stack, and run the commands in a Jupyter Notebook.

#Importing libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (20, 10)

Since we are using supervised learning, we need a tagged (labeled) dataset. The dataset is in CSV format; let’s load it as ‘df’. It has 13,320 rows and 9 columns, which is a decent size for training our model.

df = pd.read_csv('House_Data.csv')
df.head()
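
To confirm the dimensions mentioned above, you can also inspect the frame’s shape directly; this quick sanity check is an addition, not part of the original listing.

#Quick sanity check on the dataset dimensions
df.shape   #expected: (13320, 9)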

One crucial step is identifying which data columns are essential for our analysis. Not all columns in our dataset are equally informative, and including unnecessary ones can lead to model complexity and reduced accuracy. We therefore consider the relevance of each column to our target variable, which in this case is the ‘price’ of houses. Features directly related to property characteristics, such as ‘location’, ‘total_sqft’, ‘bath’, and ‘size’, are likely to be essential for prediction.

After also weighing redundancy and irrelevance, we can make an informed decision about which columns to keep and which to drop. This process of feature selection not only enhances the performance of our house price prediction model but also streamlines our analysis, making it more efficient and effective.

#Dropping insignificant columns
df1 = df.drop(['area_type','availability','society','balcony'], axis='columns')
df1.head()

Clean data is crucial for accurate predictions. Here’s how to handle null values in your dataset:

  • Identify columns with missing values.
  • Decide how to address missing data, either by imputing with median/mode values or removing rows.
  • Use fillna() to replace null values with the median, mode, or other appropriate values. Alternatively, use dropna() to remove rows with missing data.

The following code shows, for each column, the number of rows in which its value is null.

#Handling NA values
df1.isnull().sum()

Since the dataset is quite big (13,320 rows) and the number of rows with null values is small in comparison, you can simply drop those rows. Alternatively, you can fill the null values with the mean/median or mode of the respective column.

#Dropping NA rows
df2 = df1.dropna()
df2.isnull().sum()
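
If you preferred imputation over dropping rows, a minimal sketch could look like the following. It assumes you fill the numeric ‘bath’ column with its median and the categorical ‘size’ and ‘location’ columns with their mode; the exact columns and statistics are a choice, not something prescribed by the original walkthrough.

#Alternative sketch: impute missing values instead of dropping rows
df2_imputed = df1.copy()
df2_imputed['bath'] = df2_imputed['bath'].fillna(df2_imputed['bath'].median())
df2_imputed['size'] = df2_imputed['size'].fillna(df2_imputed['size'].mode()[0])
df2_imputed['location'] = df2_imputed['location'].fillna(df2_imputed['location'].mode()[0])
df2_imputed.isnull().sum()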

As we examine the dataset, we notice that the ‘size’ column exhibits inconsistencies, with some entries using ‘BHK’ notation, while others use ‘bedroom’. To ensure uniformity, we can employ a pandas series function that provides a list of unique values.

df2['size'].unique()

To address the ‘size’ column’s variations effectively, we will create a new column named ‘BHK’. We can achieve this by tokenizing the ‘size’ column using space as the delimiter and extracting the first token as the ‘BHK’ value.

df2['bhk'] = df2['size'].apply(lambda x: int(x.split(' ')[0]))
df2.head()

Standardizing ‘Total_sqft’ Data:

In the ‘total_sqft’ column, there are two issues: ranges and non-standard characters. To address this, you can take the following steps:

  • Correct the range values by calculating their averages. For example, if an entry is in the format ‘1000–1200 sqft,’ convert it to ‘1100 sqft’ for consistency.
  • Also handle non-standard entries, such as those with unusual characters or units, and convert them into float values to ensure uniformity (a quick way to spot such entries is sketched below).
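
Before writing the converter, it helps to see which entries are not plain numbers. The short sketch below (the helper is_float is our own, not part of the original code) filters rows whose ‘total_sqft’ value cannot be parsed as a float:

#Sketch: inspect the non-numeric total_sqft entries
def is_float(x):
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True

df2[~df2['total_sqft'].apply(is_float)].head(10)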
#Handling non-uniformities of the total_sqft column
def convert_sqft_to_num(x):
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except:
        return None

To preserve the integrity of the original data, create a new DataFrame. Within this new DataFrame, apply the mentioned conversions to the ‘total_sqft’ column.

df3 = df2.copy()
df3['total_sqft'] = df3['total_sqft'].apply(convert_sqft_to_num)
df3.head()

Feature Engineering:

In the dynamic real estate market, the price per square foot is a key metric, so we add it to our dataset. Since the ‘price’ column is expressed in lakhs (1 lakh = 100,000 rupees), we multiply by 100,000 to convert the price to rupees before dividing by the total square footage.

df4 = df3.copy()
df4['price_per_sqft'] = df4['price']*100000 / df4['total_sqft']

In our dataset, ‘location’ serves as a categorical feature. However, when dealing with numerous unique locations, the high dimensionality of this feature can become problematic. To address this issue and improve model performance, we adopt a feature engineering strategy:

  • We identify ‘location’ values with a frequency count of less than 10. These are the locations that occur relatively infrequently in our dataset.
  • We group all such less frequent ‘location’ values into a single category labeled ‘Other.’

By aggregating these less common locations into ‘Other,’ we effectively reduce the dimensionality of the ‘location’ feature, making it more manageable for our modeling process.

This feature engineering approach helps prevent the curse of dimensionality, streamlines our dataset, and enhances the predictive power of our model. It will also help in outlier detection and removal in later stages.

df4.location = df4.location.apply(lambda x: x.strip())
location_stats = df4.groupby('location')['location'].agg('count').sort_values(ascending=False)
location_stats_less_than_10 = location_stats[location_stats<=10]
df4.location = df4.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(df4.location.unique())

We are now left with 242 unique locations.

Outlier Removal:

To ensure the accuracy and reliability of our analysis, we need to address outliers — data points that deviate significantly from the norm. In our case, we are specifically concerned with outliers in the ‘total_sqft’ per bedroom and extreme price values.

  • ‘Total_sqft’ per Bedroom Threshold: A common rule of thumb in the real estate industry is that approximately 300 square feet per bedroom is typical. To maintain consistency, we remove entries where the square footage per bedroom falls below this threshold, as they likely represent outliers or data-entry errors.
#Outlier detection and removal
df4[df4.total_sqft/df4.bhk<300]
df5 = df4[~(df4.total_sqft/df4.bhk<300)]
  • Extreme Price Values: Extreme prices can skew our model’s predictions. To mitigate this, we use the mean and standard deviation of the price per square foot within each location. Entries whose price per square foot lies more than one standard deviation from their location’s mean are treated as outliers and removed from the dataset.
#Function to remove extreme price-per-sqft cases, location by location
def remove_pps_outliers(df):
    df_out = pd.DataFrame()  # Initialize an empty DataFrame
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) &
                           (subdf.price_per_sqft <= (m + st))]
        # Concatenate with ignore_index=True
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

df6 = remove_pps_outliers(df5)

Now, let’s create two distinct data frames, each representing apartments in the same location. One data frame will consist of 2-bedroom apartments, while the other will consist of 3-bedroom apartments. Our goal is to examine whether the value of 2-bedroom apartments tends to be greater than that of 3-bedroom apartments with the same square footage. We will plot a scatter plot to visualize the datasets.

def plot_scatter(df, location):
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    plt.scatter(bhk2.total_sqft, bhk2.price, color='blue', label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft, bhk3.price, color='green', marker='+', label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakhs)")
    plt.title(location)
    plt.legend()

plot_scatter(df6, "Kothanur")

In the scatter plot above, we observe an intriguing phenomenon: for some data points, the price of a 2 BHK property is higher than that of a 3 BHK property with the same square footage. To address this anomaly, we employ a systematic approach:

1. Statistical Analysis: For each location, we compute the mean, standard deviation, and count of the price per square foot for each BHK category. These numbers help us understand what is typical for that group of properties.

2. Outlier Filtering: We then clean up the data by flagging apartments whose price per square foot is lower than the average price per square foot of apartments with one fewer bedroom in the same location.

3. Rationale: Normally, a 2-bedroom apartment with the same amount of space as a 1-bedroom apartment should cost more per square foot. So, if we find cases where the price does not match this expectation, we consider them unusual and prune them from our dataset.

#Function to remove n BHK apartments whose price per sqft is less than the mean
#price per sqft of (n-1) BHK apartments in the same location
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices,
                    bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

df7 = remove_bhk_outliers(df6)
plot_scatter(df7, "Kothanur")

In the plot above, we can observe that our dataset is largely cleaned. However, a few abnormalities persist, which is a typical scenario in data analysis.

Exploratory Data Analysis

In our journey to build a comprehensive real estate price prediction system, one of the pivotal steps is gaining insights from our dataset. To achieve this, we turn to the power of data visualization using Matplotlib.

Our initial visualizations focus on two key aspects:

  • Price per Square Foot Analysis: Using Matplotlib, we create visualizations that allow us to understand the distribution of house prices per square foot. This insight helps us grasp the range and variations in property values, forming a crucial foundation for our predictive model.
import matplotlib
plt.hist(df7.price_per_sqft, rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")
  • Bathroom-to-Bedroom Ratio: Another intriguing aspect we explore is the number of bathrooms relative to the number of bedrooms. As a rule of thumb, a home with two or more bathrooms than bedrooms is unusual, so such instances are flagged for removal from our dataset.
plt.hist(df7.bath, rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")
#Removing bathroom outliers: keep only rows where bathrooms < bedrooms + 2

df8 = df7[df7.bath < df7.bhk + 2]

From an initial dataset of 13,320 rows and 9 columns, our rigorous data cleaning, preprocessing, and outlier removal have streamlined the dataset significantly. We now stand at a more refined dataset size of (7251, 7).

Model Building

Machine learning models require numeric data for their calculations. However, our ‘location’ column contains categorical text data, which needs to be transformed into a suitable format for model input.

To bridge this gap between text data and numeric compatibility, we employ a technique known as one-hot encoding, also called “dummies.” In this technique, each unique ‘location’ category is transformed into a new binary column, where a ‘1’ signifies the presence of that category and ‘0’ indicates absence.

But we need to be careful: the full set of dummy columns is redundant (they always sum to one), a situation known as the dummy variable trap. To avoid it, we drop one dummy column and add the rest to our data. Since the location text is now encoded numerically, we can also get rid of the original ‘location’ column.
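
One note before running the encoding: the code below references df9, which the article does not define explicitly. A reasonable assumption, consistent with the feature set used later (total_sqft, bath, and bhk plus the location dummies), is that df9 is df8 with the helper columns ‘size’ and ‘price_per_sqft’ dropped, since ‘size’ is now redundant with ‘bhk’ and ‘price_per_sqft’ was only needed for outlier removal.

#Assumed intermediate step: drop helper columns that are no longer needed
df9 = df8.drop(['size', 'price_per_sqft'], axis='columns')
df9.head()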

#Convert location column into numeric using one hot encoding
dummies = pd.get_dummies(df9.location)
df10 = pd.concat([df9, dummies.drop('other' ,axis='columns')],axis='columns')
df11 = df10.drop('location', axis='columns')
df11.head()

Model Training:

To prepare our data for training, we need to define our independent variables (X) and the dependent variable (y).

Independent Variables (X): These are the features used to make predictions. We exclude the ‘price’ column since it’s the dependent variable we want to predict.

Dependent Variable (y): This is what we aim to predict — real estate prices.

X = df11.drop('price',axis='columns')
y = df11.price

Train-Test Split:

When developing a machine learning model, it’s essential to evaluate how well it performs on unseen data. This is where the train-test split comes into play:

We partition our dataset into two distinct subsets:

Training Set: This portion is used to train our model, allowing it to learn patterns and relationships within the data.

Testing Set: The testing set is held back and remains untouched during model training. It serves as an independent dataset to assess the model’s predictive performance. In our case we set the test size to 0.2, which means 20 percent of the dataset becomes the test sample and 80 percent is used for training the model.

Evaluating the model on the testing set helps us detect overfitting. If a model performs exceptionally well on the training data but poorly on the testing data, it may be overfitting: fitting too closely to the training data and failing to generalize. We can also compare the performance of different models on the same testing set to choose the one that best meets our prediction requirements.

#Creating a model to train the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X , y ,test_size = 0.2, random_state=10)

Linear Regression is a widely used and well-understood machine learning algorithm. It assumes a linear relationship between the input features and the target variable, making it easy to understand and explain. By training a Linear Regression model initially, we establish a baseline performance level.

Model Evaluation:

After training the Linear Regression model, we evaluate its performance using the `model.score` function. This function computes the coefficient of determination (R-squared), which quantifies how well the model fits the data. A higher R-squared value indicates a better fit.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test , y_test)

In our case, we get a score of 0.8452277, which signifies that our Linear Regression model can explain approximately 84.52% of the variance in real estate prices. In simpler terms, it is capturing the underlying patterns in the data with a high degree of accuracy.
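
As a quick check against the overfitting concern raised earlier, you can also compare the training and testing scores side by side; this small addition is not part of the original walkthrough:

#Compare training and testing R-squared to check for overfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train R^2: {train_score:.4f}  Test R^2: {test_score:.4f}")
#A large gap between the two would suggest overfitting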

Next, we take our model evaluation up a notch with K-Fold Cross-Validation. We use the ShuffleSplit strategy to ensure each fold represents our data fairly, giving us a more reliable assessment of our model’s performance.

#K fold cross validation

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

#Randomize the sample so each of the fold has equal distribution of data samples
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)

Looking at the output, we see that even across the five cross-validation folds, linear regression scores mostly above 80 percent.
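
If you prefer a single summary number over five individual fold scores, you can average them (a small addition on top of the original code):

#Average the five fold scores into one summary metric
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(cv_scores.mean())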

We will explore various models and their configurations to identify the optimal approach for prediction. To achieve this, we leverage the powerful Grid Search CV provided by scikit-learn.

Grid Search CV is a valuable tool that automates the process of running models with different regressors and parameters, allowing us to pinpoint the best-performing combination. In our case, we focus on Lasso Regression and Decision Trees.

Within the function, we specify the algorithms and their associated parameters.

#Trying various algorithms and hyperparameter tuning using GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def best_model(X, y):
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {
                #'normalize' was removed from LinearRegression in recent scikit-learn,
                #so we tune fit_intercept instead
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }

    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

Grid Search CV then provides us with not only the best model score but also the optimal parameters; this process is known as hyperparameter tuning. We call best_model(X, y) to compare all of these, and it gives us the following output:

This tells us that, for our dataset, linear regression is the best model, with a score above 80 percent.

Model Deployment:

Now, we’re at the stage where we create a practical function. This function takes the location, square footage, number of bathrooms, and BHK as input and, in return, provides an estimated real estate price in lakhs.

def predict_price(location, sqft, bath, bhk):
    loc_index = np.where(X.columns == location)[0][0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1
    return model.predict([x])[0]

predict_price('Indira Nagar', 1000, 3, 3)
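
To make this usable outside the notebook, the trained model and the feature column order typically need to be saved so that a web server or application can load them later. Here is a minimal sketch, assuming you export the model with pickle and the column names as JSON; the file names are our own choice, not from the original article.

#Sketch: export the trained model and the feature columns for deployment
import pickle
import json

with open('bengaluru_home_prices_model.pickle', 'wb') as f:
    pickle.dump(model, f)

columns = {'data_columns': [col.lower() for col in X.columns]}
with open('columns.json', 'w') as f:
    f.write(json.dumps(columns))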

Importance of ML in Real Estate:

1. Enhanced Accuracy: Machine learning analyzes large volumes of data to make highly accurate predictions about property prices, which helps both buyers and sellers. For buyers, this means better insights into fair market values, helping them make informed decisions. Sellers, on the other hand, can price their properties more competitively to attract potential buyers, ultimately reducing listing times and minimizing losses.

2. Data-Driven Decisions: Real estate professionals can leverage ML algorithms to gain deep insights into market trends and property values. By examining historical and real-time data, they can make more informed decisions, such as when to invest, sell, or renovate.

3. Efficient Valuations: Property valuation is a critical aspect of the real estate industry. ML algorithms excel at automating this process, making it quicker and more accurate. This streamlines the transaction process, reduces delays, and ensures that buyers and sellers have a clear understanding of a property’s fair market value.

4. Personalized Recommendations: By analyzing user preferences, budgets, and past behaviors, these systems suggest properties that closely match their criteria. This not only simplifies the search process but also enhances the overall user experience.

Applications of ML in Real Estate:

1. Price Prediction: Machine Learning can predict property prices by analyzing a variety of factors, such as location, property size, historical sales data, and more. Based on these data, it can produce an estimated price that offers insight into a property’s market worth.

2. Property Valuation: Machine Learning automates and refines the valuation process by analyzing extensive data sets and variables to determine a property’s true worth.

3. Fraud Detection: Machine Learning can be employed to detect irregularities and anomalies in real estate deals. By analyzing data patterns, it can flag suspicious transactions, helping to identify potential fraud and ensuring a safer environment for real estate transactions.

4. Investment Planning: ML algorithms can analyze various factors, including property performance data, market trends, and economic indicators. This analysis provides valuable insights into which properties are likely to yield the best returns. Investors can use this information to strategize and optimize their real estate portfolios, maximizing their potential for profit.

Real-world Considerations

A. Handling Outliers and Anomalies : Sometimes, there are unusual data points in our data frame which are called outliers. Anomalies are just strange or irregular data. When we’re working with real estate data, we need to decide what to do with these outliers and anomalies. Do we keep them because they provide valuable insights, or do we remove them to make our predictions more accurate? It’s all about finding the right balance.

B. Dealing with Dynamic Market Conditions : Real estate is like a moving target; it changes all the time. Market conditions can shift due to various factors. When we’re making predictions, we have to consider how well our models work in this ever-changing environment.

C. Ethical Considerations in Real Estate Prediction : When making predictions in real estate, we need to be fair and honest, following ethical rules. This means treating people and properties the right way and not causing harm.

Conclusion

Our journey through the intricacies of real estate price prediction has been both enlightening and rewarding. We embarked on this expedition with the goal of crafting a system that could demystify the complexities of property valuation using machine learning algorithms. Along the way, we navigated through data collection and preprocessing, uncovered hidden insights during exploratory data analysis, and carefully selected the most promising features. We embraced model selection and validation techniques, ensuring that our predictions are not just accurate but also robust and reliable.

Key Takeaways

  • The quality and preparation of data are paramount. Proper data collection and preprocessing lay the groundwork for accurate predictions.
  • Exploring a range of machine learning algorithms, from Linear Regression to more advanced models, allows us to identify the best model for real estate price forecasting.
  • Rigorous validation techniques, like K-Fold Cross-Validation, ensure that our models are robust and capable of making accurate predictions.
  • Fine-tuning models through techniques like Grid Search CV optimizes their performance, helping us extract the most accurate predictions.
