Prediction of house prices in South Jakarta using the Decision Tree Regressor method

Anggi Setyawan Riyadi
3 min read · Oct 18, 2022


Advances in technology and information keep producing new innovations in business. One technique we can use is data mining, which extracts useful information from sales data. In this study, the author applies the Decision Tree Regressor method to house sales in South Jakarta in order to predict house prices from a fixed set of house attributes: land area, building area, number of bedrooms, number of bathrooms, and whether the house has a garage. The goal is to analyze a house price prediction system for the South Jakarta area. The data comes from rumah123.com (see Data Description).

Data Description

Context

The House Price Dataset lists house prices in the South Jakarta area. The data was collected from the home sales website rumah123.com.

Contents:

The South Jakarta house price dataset has 7 columns and 1001 rows. The columns are:

  • HARGA : house price.
  • LT : land area.
  • LB : building area (multi-storey or not).
  • JKT : number of bedrooms.
  • JKM : number of bathrooms.
  • GRS : garage -> present / not present (ada / tidak ada)
  • KOTA : city name.

Let’s code!

Import libraries and read the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import seaborn as sns
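plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
sns.set_style('darkgrid')
%matplotlib inline
# The CSV uses ';' as its field separator
df = pd.read_csv('https://raw.githubusercontent.com/Anggiboy/RegressionProject/main/HARGA%20RUMAH%20JAKSEL2.csv', sep=';')
df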

Exploratory Data Analysis

Show the number of rows and columns

print(df.shape)
df

List the column names

print(df.columns)
df

Checking for Missing Data

print(df.isnull().sum())
df

Data Normalization

# Scale the price by a fixed constant so the target values stay small
df['HARGA'] = df['HARGA'] / 250000000000
df
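Dividing by a fixed constant only rescales the target, so the original prices can be recovered by multiplying by the same constant. Here is a minimal sketch of that round trip. The prices are made-up illustrative values, not rows from the dataset, and the names `toy` and `SCALE` are only for this example.

```python
import pandas as pd

SCALE = 250_000_000_000  # same constant the article divides HARGA by

# Made-up prices for illustration only (not taken from the dataset)
toy = pd.DataFrame({'HARGA': [2_500_000_000, 5_000_000_000, 12_500_000_000]})

toy['HARGA_scaled'] = toy['HARGA'] / SCALE            # scale down
toy['HARGA_restored'] = toy['HARGA_scaled'] * SCALE   # scale back up

print(toy)
```

Every predicted value later has to be multiplied by the same constant before it can be read as a price.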

Correlation

corrmatrix = df.corr(numeric_only=True)  # skip text columns such as KOTA and GRS
sns.heatmap(corrmatrix, annot=True)
plt.savefig('corelasi.png')
plt.show()
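At this point the DataFrame still contains text columns. As far as I know, recent pandas releases (2.0 and later) raise an error when `corr()` meets non-numeric columns unless `numeric_only=True` is passed. The sketch below shows that behaviour on a tiny made-up frame. The values, including the `'JAKSEL'` label, are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration; 'KOTA' stands in for a text column
toy = pd.DataFrame({
    'HARGA': [1.0, 2.0, 3.0, 4.0],
    'LT': [100, 200, 300, 400],
    'KOTA': ['JAKSEL'] * 4,
})

# Only numeric columns take part in the correlation matrix
corr = toy.corr(numeric_only=True)
print(corr)
```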

Pre-Processing

Analyzing numeric variables

numerik = [var for var in df.columns if df[var].dtype != 'O']
print('There are {} numeric variables'.format(len(numerik)))
print('The numeric variables are:', numerik)
df[numerik].head(10)

Analyzing categorical variables

kategorik = [var for var in df.columns if df[var].dtype == 'O']
print('There are {} categorical variables'.format(len(kategorik)))
print('The categorical variables are:', kategorik)
df[kategorik].head(10)

Drop Data

df = df.drop(['KOTA'], axis=1)
df

Label Encoding

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Turn the garage labels (present / not present) into integer codes
df['GRS'] = label_encoder.fit_transform(df['GRS'])
df
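After encoding, it helps to know which integer stands for which garage label, because the same code has to be typed in when predicting later. `LabelEncoder` sorts the classes and uses each position as the code. Here is a small sketch. The label strings `'ADA'` and `'TIDAK ADA'` are an assumption about how the CSV spells them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical garage labels; the exact strings in the CSV are an assumption
grs = pd.Series(['ADA', 'TIDAK ADA', 'ADA', 'ADA'])

encoder = LabelEncoder()
codes = encoder.fit_transform(grs)

# classes_ is sorted, so the index of each class is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'ADA': 0, 'TIDAK ADA': 1}
print(codes)    # [0 1 0 0]
```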

Decision Tree Regression model

Choose independent and dependent variables

X=df[['LT','LB','JKT','JKM','GRS']].values.reshape(-1,5)
y=df['HARGA'].values.reshape(-1,1)

Split Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
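With 1001 rows and `test_size = 0.2`, the split does not come out as a round number. The sketch below uses placeholder arrays of zeros with the same shape as the data (an assumption based on the dataset description) to show the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the article's data: 1001 rows, 5 features
X_demo = np.zeros((1001, 5))
y_demo = np.zeros((1001, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (800, 5) (201, 5)
```

The test set is rounded up to 201 rows and the remaining 800 rows go to training.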

Fitting Decision Tree Regression to the training data

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
# Fit on the training split only, so the test split stays unseen for evaluation
regressor.fit(X_train, y_train)
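A decision tree with no depth limit can memorize the rows it was fitted on, so a score measured on those same rows says little about new houses. The sketch below compares an unlimited tree with a shallow one on synthetic data (not the house data). The `max_depth=3` setting is an arbitrary choice for illustration, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship (not the house data)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare R^2 on rows the model has seen with rows it has not
for name, model in [('unlimited depth', full_tree), ('max_depth=3', small_tree)]:
    print(name,
          '| train R2:', round(model.score(X_tr, y_tr), 3),
          '| test R2:', round(model.score(X_te, y_te), 3))
```

The gap between the train and test scores is the thing to watch; the test score is the more honest one.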

Regressor score

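from sklearn.model_selection import cross_val_score
# cross_val_score returns one R^2 score per fold
# (cross_val_predict would return predictions, not scores)
scores = cross_val_score(regressor, X, y)
print(scores)
print(np.mean(scores))
# R^2 on the test split
regressor.score(X_test, y_test)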
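The two cross-validation helpers in scikit-learn are easy to mix up: `cross_val_score` returns one score per fold, while `cross_val_predict` returns one out-of-fold prediction per row. Here is a short sketch on synthetic data (not the house data). The model settings are arbitrary.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] + rng.normal(0, 1.0, size=200)

model = DecisionTreeRegressor(max_depth=4, random_state=0)

fold_scores = cross_val_score(model, X_syn, y_syn, cv=5)   # one R^2 per fold
oof_pred = cross_val_predict(model, X_syn, y_syn, cv=5)    # one prediction per row

print('fold scores:', fold_scores, 'mean:', fold_scores.mean())
print('R^2 of out-of-fold predictions:', r2_score(y_syn, oof_pred))
```

Averaging the output of `cross_val_predict` gives the mean predicted value, not a model score.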

Show the predicted and actual values

y_pred = regressor.predict(X_test)
y_pred_reshaped = np.ravel(y_pred)  # flatten without hard-coding the test-set size
y_test_reshaped = np.ravel(y_test)
result = pd.DataFrame({'prediksi':y_pred_reshaped,'aktual':y_test_reshaped}).astype(float)
result*250000000000

Show Error Value

from sklearn import metrics
print('Mean Absolute Error',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Absolute Percentage Error',metrics.mean_absolute_percentage_error(y_test,y_pred))
print('Mean Squared Error',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
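To make the four error values less of a black box, here they are computed by hand with NumPy on three made-up numbers and checked against scikit-learn. The arrays are illustrative only.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values for illustration
y_true = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0])

mae = np.mean(np.abs(y_true - y_hat))               # mean absolute error
mape = np.mean(np.abs((y_true - y_hat) / y_true))   # a fraction, not a percent
mse = np.mean((y_true - y_hat) ** 2)                # mean squared error
rmse = np.sqrt(mse)                                 # same unit as the target

print(mae, mape, mse, rmse)
print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat)))
print(np.isclose(mape, metrics.mean_absolute_percentage_error(y_true, y_hat)))
```

Note that scikit-learn reports MAPE as a fraction, so 0.25 means 25%. Because the target was divided by 250000000000, MAE and RMSE are on that scaled target and need to be multiplied by the same constant to be read as prices.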

Plotting

regressor_contoh = DecisionTreeRegressor()
regressor_contoh.fit(df[['LT']].values,df['HARGA'].values)
plt.figure(figsize=(15,10))
X_grid = np.arange(min(df['LT'].values), max(df['LT'].values))
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(df['LT'].values,df['HARGA'].values,color='red')
plt.plot(X_grid, regressor_contoh.predict(X_grid),color='blue')
plt.title('Decision Tree Regression Model')
plt.xlabel('LT')
plt.ylabel('HARGA')
plt.savefig('picture.png')
plt.show()

Evaluation

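# input() returns strings, so convert each value to a number
LT = float(input('LT ='))
LB = float(input('LB ='))
JKT = float(input('JKT='))
JKM = float(input('JKM='))
GRS = float(input('GRS='))  # encoded garage value from the label-encoding step
val = regressor.predict(np.array([LT, LB, JKT, JKM, GRS]).reshape(-1, 5))
val_new = np.ravel(val) * 250000000000  # convert back to the original price scale
print('Prediction:')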
pd.DataFrame({'LT': LT,'LB':LB,'JKT':JKT,'JKM':JKM,'GRS':GRS,'Prediksi':val_new})

Here you enter the land area, building area, number of bedrooms, number of bathrooms, and the garage code to see the predicted house price.
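The same prediction step can be wrapped in a small function so it can be reused without typing values by hand. This is only a sketch: `predict_price` is a hypothetical helper, and the stand-in model is trained on three made-up rows so that the example runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

SCALE = 250_000_000_000  # same constant the article divides HARGA by


def predict_price(model, lt, lb, jkt, jkm, grs):
    """Return the predicted price on the original scale (hypothetical helper)."""
    # One row with the features in the training order: LT, LB, JKT, JKM, GRS
    row = np.array([[lt, lb, jkt, jkm, grs]], dtype=float)
    return float(np.ravel(model.predict(row))[0]) * SCALE


# Stand-in model fitted on made-up rows (not the real dataset)
X_toy = np.array([[100, 80, 3, 2, 0],
                  [200, 150, 4, 3, 0],
                  [300, 250, 5, 4, 1]], dtype=float)
y_toy = np.array([0.01, 0.02, 0.04])  # prices already divided by SCALE
toy_model = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

print(predict_price(toy_model, 200, 150, 4, 3, 0))
```

With the real data, the fitted `regressor` would be passed in place of `toy_model`.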

THANK YOU!!!!!!
