House Prices Prediction using Regression Model and Web Scraping

Harun Erbay
İstanbul Data Science Academy
Dec 26, 2021

Introduction

The second project of the Data Science Bootcamp, organized in partnership with Istanbul Data Science Academy and Hepsiburada, has been successfully completed. In this project, we built a machine learning model that predicts the prices of houses for sale in Kadıköy, Istanbul. We collected the data from this site through web scraping.

Problem Statement

The goal of the project is to understand the relationship between house features and price, and how these features can be used to predict the price of a house.

Objective

  • Scrape house listings with Python and BeautifulSoup
  • Predict house prices

Methodology

  • Data Collection with Web Scraping
  • Data Cleaning and Reorganizing
  • Exploratory Data Analysis (EDA)
  • Linear Regression
  • Conclusions

1. Data Collection with Web Scraping

Step 1:

Our first step is to import the libraries we will need to collect the data and build our model.

Step 2:

Next, we write a function that returns the HTML code of the URL passed to it.
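The original function is not reproduced in the text, so here is a minimal sketch of what such a helper might look like. It assumes BeautifulSoup (`bs4`) is installed and uses the standard library's `urllib` for the download; the names `parse_html` and `get_soup` and the User-Agent header are illustrative choices, not the author's code.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def parse_html(html):
    """Turn raw HTML (str or bytes) into a searchable BeautifulSoup tree."""
    return BeautifulSoup(html, "html.parser")


def get_soup(url, timeout=10.0):
    """Download a page and return its parsed HTML (hypothetical helper)."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read()
    return parse_html(html)


# Parsing can be tried offline on a small HTML string.
demo_soup = parse_html("<html><body><h1>Test</h1></body></html>")
demo_title = demo_soup.h1.get_text()
```

Keeping the parsing step separate from the download makes it possible to test the rest of the scraper on saved HTML without hitting the site.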

Step 3:

Then we need a function that lets us reach every results page.
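How the pages are addressed depends on the site, and the text does not show it. As a sketch, assuming the results are paginated through a `page` query parameter (an assumption), the page URLs could be generated like this. The base address below is a placeholder, not the real listing URL.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def build_page_urls(base_url, last_page, param="page"):
    """Return one URL per results page, from page 1 to last_page."""
    parts = urlparse(base_url)
    # Keep any existing query parameters except the page parameter itself.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    urls = []
    for page in range(1, last_page + 1):
        page_query = urlencode(query + [(param, page)])
        urls.append(urlunparse(parts._replace(query=page_query)))
    return urls


# Placeholder address used only for illustration.
page_urls = build_page_urls("https://example.com/listings?sort=new", last_page=3)
```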

Step 4:

In the next step, we collect the links on every page so that we can open each house listing.
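A sketch of the link collection, assuming each listing is an `<a>` tag with a recognizable CSS class. The class name `listing-link` and the sample HTML are invented for illustration; the real site's markup will differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_listing_links(html, base_url, css_class="listing-link"):
    """Collect absolute, de-duplicated listing URLs from one results page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", class_=css_class):
        href = anchor.get("href")
        if not href:
            continue
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in links:
            links.append(absolute)
    return links


# Made-up results page with one duplicate and one unrelated link.
sample_html = """
<ul>
  <li><a class="listing-link" href="/ilan/1">Flat 1</a></li>
  <li><a class="listing-link" href="/ilan/2">Flat 2</a></li>
  <li><a class="listing-link" href="/ilan/1">Flat 1 again</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
listing_links = extract_listing_links(sample_html, "https://example.com")
```

Running this function over every results page and concatenating the lists gives the full set of listing URLs to visit.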

Step 5:

Finally, we pull the features from each listing page, load them into columns, and create our data frame.
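The extraction code is not shown in the text. A minimal sketch of the idea, assuming each listing page exposes its features as label/value pairs: the `feature`, `label`, and `value` class names and the sample pages are invented for illustration, and only the column names “MAHALLE” and “FİYAT” come from the article.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_listing(html):
    """Read label/value pairs from one listing page into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    # Assumed markup: each feature is an <li> holding a label and a value <span>.
    for item in soup.find_all("li", class_="feature"):
        label = item.find("span", class_="label")
        value = item.find("span", class_="value")
        if label is not None and value is not None:
            record[label.get_text(strip=True)] = value.get_text(strip=True)
    return record


def make_page(neighborhood, price):
    """Build a made-up listing page for the demonstration."""
    return (
        '<ul><li class="feature"><span class="label">MAHALLE</span>'
        f'<span class="value">{neighborhood}</span></li>'
        '<li class="feature"><span class="label">FİYAT</span>'
        f'<span class="value">{price}</span></li></ul>'
    )


sample_pages = [make_page("Caddebostan", "1000"), make_page("Moda", "800")]

# One dict per listing becomes one row; the labels become the columns.
df = pd.DataFrame([parse_listing(page) for page in sample_pages])
```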

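# "ANSI" is only recognized as an encoding on Windows; UTF-8 with BOM keeps Turkish characters portable.
df.to_csv("zingat_house_price_prediction.csv", encoding="utf-8-sig")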

2. Data Cleaning and Reorganizing

Info about data
Number of null values in the data

The data set has some problems that we need to deal with before we get to the regression part. We filled the null values with the column median and converted the object-type columns to integer type. To improve the model, we dropped the “ODA-SALON SAYISI” (number of rooms and living rooms) column. We then created new columns by applying get_dummies to the “MAHALLE” (neighborhood) column.

Median function for null values
Object types to integer types
Applying get_dummies
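The three cleaning steps can be sketched on a toy frame as follows. The rows and the `NET_M2` column are made up for illustration; only “MAHALLE” is a column name taken from the article, and the exact text format of the scraped values is an assumption.

```python
import pandas as pd

# Made-up rows: sizes arrive as text and one value is missing.
df = pd.DataFrame({
    "NET_M2": ["120 m²", "95 m²", None, "150 m²"],
    "MAHALLE": ["Caddebostan", "Fenerbahçe", "Caddebostan", "Moda"],
})

# Object -> numeric: keep the first run of digits, then convert.
digits = df["NET_M2"].str.extract(r"(\d+)", expand=False)
df["NET_M2"] = pd.to_numeric(digits, errors="coerce")

# Fill missing values with the column median, then cast to integer.
df["NET_M2"] = df["NET_M2"].fillna(df["NET_M2"].median()).astype("int64")

# One-hot encode the neighborhood column into MAHALLE_* columns.
new_df = pd.get_dummies(df, columns=["MAHALLE"])
```

Filling with the median before the integer cast matters, because a column that still contains missing values cannot be converted to a plain integer type.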

3. Exploratory Data Analysis

After cleaning and reorganizing the data, the distribution of the target variable is as follows:

House prices distribution

To build a better machine learning model, we grouped some features and visualized them. You can find the detailed code here.

  • 1. Top 10 neighborhoods with the highest average house prices.
  • 2. Distribution of the number of rooms and halls.
  • 3. Distribution of the number of bathrooms.
  • 4. Correlation matrix of house features.
  • 5. Correlation matrix of the neighborhood where the house is located.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
sns.pairplot(DF, aspect=1.5)
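The grouping behind a chart such as the top 10 neighborhoods by average price can be sketched with a pandas `groupby`. The prices below are made up and far smaller than real ones; only the column names “MAHALLE” and “FİYAT” come from the article.

```python
import pandas as pd

# Made-up prices for illustration only.
df = pd.DataFrame({
    "MAHALLE": ["Caddebostan", "Moda", "Caddebostan", "Fenerbahçe", "Moda"],
    "FİYAT": [300, 100, 500, 600, 200],
})

# Average price per neighborhood, highest first, limited to ten rows.
top_neighborhoods = (
    df.groupby("MAHALLE")["FİYAT"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
```

The resulting Series can be passed to a bar plot directly, for example with `top_neighborhoods.plot(kind="bar")`.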

4. Linear Regression

First, we import the packages we are going to use.

import statsmodels.api as sm
import statsmodels.formula.api as smf
from pandas import DataFrame, Series
from sklearn.linear_model import Lasso, LinearRegression, Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

We used the data set to which we had applied get_dummies (new_df). We then split the data and defined which columns are the feature variables and which is the target variable. The test set is 20% of the data. We used the R² score and MSE as evaluation metrics and evaluated the models with a train-validation-test split.
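A minimal sketch of such a train-validation-test workflow with scikit-learn. Synthetic data stands in for `new_df`, and the 25% validation share and the `random_state` values are assumptions, since the text only states the 20% test size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Split the remainder into train and validation sets (assumed 75/25).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

val_r2 = r2_score(y_val, model.predict(X_val))
test_r2 = r2_score(y_test, model.predict(X_test))
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

Because MSE is expressed in squared price units, values in the order of 1e12 are expected when prices are in the millions; the R² score is easier to compare across models.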

MSE Value:  FİYAT    5.757277e+12
dtype: float64
R^2 Score Value (Test): 0.5801046663202142
R^2 Score Value (Validation): 0.5690857021978828

MSE Value:  FİYAT    6.274632e+12
dtype: float64
R^2 Score Value (Test): 0.5438016679245479
R^2 Score Value (Validation): 0.5303632838955692

R^2 Score Value (Test): 0.5793058427962506
R^2 Score Value (Validation): 0.563142565405035

R^2 Score Value (Test): 0.5837754090161392
R^2 Score Value (Validation): 0.5620382643849704
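The outputs above are not labeled by model in the text. The imports suggest that plain, Ridge, and Lasso regressions were compared, so a comparison of that kind might be sketched as follows. The synthetic data, the `alpha` values, and the use of a scaling pipeline are all assumptions, not the author's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which two of the five features carry no signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=300)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
}

scores = {}
for name, estimator in models.items():
    # Scale features first so the penalty treats all coefficients alike.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_train, y_train)
    scores[name] = r2_score(y_val, pipeline.predict(X_val))
```

Comparing validation scores from one dictionary keeps the model choice separate from the test set, which is then used only once for the final model.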

5. Conclusions

According to the analysis, the features that most positively affect the price of a house in Kadıköy are:

  • The size of the house in square meters.
  • The number of rooms and living rooms in the house.
  • A location in Caddebostan or Fenerbahçe.

Linear Regression and Lasso Regression turned out to be the best regression methods.

What can we do to improve the prediction model?

  • Collecting house data from a larger area.
  • Working with more features.
  • Doing more extensive feature engineering.

Thank you for taking the time to read my article. You can visit my LinkedIn and GitHub accounts for more detailed information.
