Prediction of house prices in South Jakarta using the Decision Tree Regressor method
Today's technology and information give rise to new innovations in business. One technology we can use is data mining, which extracts useful information from sales data. This study applies data mining with the Decision Tree Regressor method to house sales in South Jakarta. The goal is to predict house prices in South Jakarta from a fixed set of house attributes such as land area, building area, number of bedrooms, and number of bathrooms. The data is taken from here.
Data Description
Context
The house price dataset lists house prices in the South Jakarta area. The data was collected from the home sales website rumah123.com.
Contents:
The South Jakarta house price dataset has 7 columns and 1001 rows. The columns are:
- HARGA : price of the house.
- LT : land area.
- LB : building area (single-story or multi-story).
- JKT : number of bedrooms.
- JKM : number of bathrooms.
- GRS : garage -> present/absent (ada/tidak ada)
- KOTA : city name.
Let's Code!
Import Library and read data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import seaborn as sns
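# Matplotlib 3.6+ renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')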
sns.set_style('darkgrid')
%matplotlib inline
df=pd.read_csv('https://raw.githubusercontent.com/Anggiboy/RegressionProject/main/HARGA%20RUMAH%20JAKSEL2.csv',sep=';')
df
Exploratory Data Analysis
Show the number of rows and columns
df.shape
df
Show the column names
df.columns
df
Checking Missing Data
df.isnull().sum()
df
Data Normalization
df['HARGA']=(df['HARGA']/250000000000)
df
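The division above rescales HARGA by a fixed constant so the numbers are easier to read; it is not a statistical normalization such as min-max or z-score scaling. A minimal sketch of the round trip, using made-up prices (the values in `prices` are assumptions for illustration, not rows from the dataset):

```python
# Constant used in the article to rescale HARGA.
SCALE = 250_000_000_000

# Hypothetical prices in rupiah, for illustration only.
prices = [2_500_000_000, 5_000_000_000, 12_500_000_000]

# Forward step: same operation as df['HARGA'] / 250000000000.
scaled = [p / SCALE for p in prices]

# Inverse step: multiply by the same constant to get rupiah back.
restored = [s * SCALE for s in scaled]

print(scaled)
print(restored)
```

Because the same constant is used in both directions, any prediction made on the scaled target can be multiplied by 250000000000 to read it in rupiah again.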
Correlation
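# numeric_only=True skips text columns such as GRS and KOTA (requires pandas 1.5+)
corrmatrix = df.corr(numeric_only=True)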
sns.heatmap(corrmatrix, annot=True)
plt.savefig('corelasi.png')
plt.show()
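Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `df.corr()`. To make that number less of a black box, here is a small from-scratch sketch. The function name `pearson` and the sample values are assumptions for illustration only:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical land areas and scaled prices, not taken from the dataset.
land_area = [100, 150, 200, 250, 300]
price = [1.0, 1.4, 2.1, 2.4, 3.1]

r = pearson(land_area, price)
print(round(r, 3))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.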
Pre-Processing
Analyzing numeric variables
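# Columns whose dtype is not object ('O') are treated as numeric
numerik = [var for var in df.columns if df[var].dtype!='O']
# Prints how many numeric variables there are, then their names
print('Ada {} variabel numerik'.format(len(numerik)))
print('Variabel numeriknya adalah :', numerik)
df[numerik].head(10)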
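Analyzing categorical variables
# Columns with object dtype ('O') are treated as categorical
kategorik = [var for var in df.columns if df[var].dtype=='O']
# Prints how many categorical variables there are, then their names
print('Ada {} variabel kategorik'.format(len(kategorik)))
print('Variabel kategoriknya yaitu :', kategorik)
df[kategorik].head(10)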
Drop Data
df = df.drop(['KOTA'], axis=1)
df
Label Encoding
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['GRS'] = label_encoder.fit_transform(df['GRS'])
df
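`LabelEncoder` assigns integers to categories in sorted order. The sketch below mimics that idea in plain Python so the mapping is visible. The function `label_encode` and the strings `"ADA"` and `"TIDAK ADA"` are assumptions for illustration; the exact spelling of the values in the CSV may differ, so check `label_encoder.classes_` to see the real mapping.

```python
def label_encode(values):
    """Map each distinct value to an integer, using sorted order of the values."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    codes = [mapping[v] for v in values]
    return codes, mapping

# Hypothetical garage values, not copied from the dataset.
garage = ["ADA", "TIDAK ADA", "ADA", "ADA"]

codes, mapping = label_encode(garage)
print(mapping)
print(codes)
```

With these assumed strings, "ADA" sorts first and becomes 0, and "TIDAK ADA" becomes 1. This matters later when a GRS value has to be typed in by hand for a prediction.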
Decision Tree Regression model
Choose independent and dependent variables
X=df[['LT','LB','JKT','JKM','GRS']].values.reshape(-1,5)
y=df['HARGA'].values.reshape(-1,1)
Split Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
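With 1001 rows and `test_size = 0.2`, the split does not come out as a round number. As far as I understand scikit-learn's behavior, the test size is rounded up and the remaining rows go to the training set. The helper `split_sizes` below is a hypothetical function that only reproduces that arithmetic; it does not shuffle or split any data.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size: test rounded up, rest for training."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(1001, 0.2)
print(n_train, n_test)
```

You can confirm the real sizes in the notebook with `X_train.shape` and `X_test.shape`.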
Fitting Decision Tree Regression to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
# Fit on the training split only, so the test split stays unseen for scoring
regressor.fit(X_train, y_train)
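To give a feel for what the tree does during fitting, here is a simplified sketch of how one split on one feature can be chosen: try every threshold between neighboring values and keep the one with the smallest total squared error. This is an illustration of the idea, not scikit-learn's implementation. The functions `sse` and `best_split` and the sample data are assumptions.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) of the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical land areas and scaled prices with two clear groups.
land_area = [60, 80, 100, 300, 350, 400]
price = [1.0, 1.2, 1.1, 5.0, 5.5, 5.2]

threshold, error = best_split(land_area, price)
print(threshold, round(error, 4))
```

A full tree repeats this search over all features and then again inside each resulting group, and each leaf predicts the mean price of the training rows that end up in it.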
Regressor score
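from sklearn.model_selection import cross_val_score
# cross_val_score returns one R² value per fold; cross_val_predict would return predictions instead
score = cross_val_score(regressor,X,y)
print(score)
print(np.mean(score))
# R² on the held-out test split
print(regressor.score(X_test,y_test))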
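For a regressor, `regressor.score(X_test, y_test)` returns the coefficient of determination, R². The sketch below computes R² by hand on made-up numbers so the meaning of the value is clear. The function name `r2_manual` and the sample values are assumptions for illustration.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted scaled prices.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r2_manual(actual, predicted)
print(round(r2, 4))
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean price, and negative values mean it does worse than that.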
Show the predicted and actual values
y_pred = regressor.predict(X_test)
# Flatten to 1-D without hard-coding the size of the test set
y_pred_reshaped=np.reshape(y_pred,-1)
y_test_reshaped=np.reshape(y_test,-1)
result = pd.DataFrame({'prediksi':y_pred_reshaped,'aktual':y_test_reshaped}).astype(float)
# Convert from the scaled unit back to rupiah
result*250000000000
Show Error Value
from sklearn import metrics
print('Mean Absolute Error',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Absolute Percentage Error',metrics.mean_absolute_percentage_error(y_test,y_pred))
print('Mean Squared Error',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
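The four error values are easier to interpret once you see how they are computed. Here is a plain-Python sketch on made-up numbers. The function `regression_errors` and the sample values are assumptions, and this version assumes no actual value is zero, which the MAPE division requires.

```python
import math

def regression_errors(y_true, y_pred):
    """Compute MAE, MAPE, MSE and RMSE for two equal-length lists."""
    n = len(y_true)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / n
    # MAPE as a fraction (0.175 means 17.5%), relative to each actual value.
    mape = sum(e / abs(t) for e, t in zip(abs_err, y_true)) / n
    mse = sum(e ** 2 for e in abs_err) / n
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": math.sqrt(mse)}

# Hypothetical actual and predicted scaled prices.
actual = [2.0, 4.0, 5.0, 10.0]
predicted = [2.5, 3.0, 5.0, 8.0]

errors = regression_errors(actual, predicted)
for name, value in errors.items():
    print(name, round(value, 4))
```

Note that scikit-learn's `mean_absolute_percentage_error` also returns a fraction rather than a percentage. Since HARGA was divided by 250000000000, MAE and RMSE are in that scaled unit too, and multiplying them by the same constant expresses them in rupiah.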
Plotting
regressor_contoh = DecisionTreeRegressor()
regressor_contoh.fit(df[['LT']].values,df['HARGA'].values)
plt.figure(figsize=(15,10))
X_grid = np.arange(min(df['LT'].values), max(df['LT'].values))
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(df['LT'].values,df['HARGA'].values,color='red')
plt.plot(X_grid, regressor_contoh.predict(X_grid),color='blue')
plt.title('Decision Tree Regression Model')
plt.xlabel('LT')
plt.ylabel('HARGA')
plt.savefig('picture.png')
plt.show()
Evaluation
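# input() returns strings, so convert each value to a number
LT=float(input('LT ='))
LB=float(input('LB ='))
JKT=float(input('JKT='))
JKM=float(input('JKM='))
# GRS must be the encoded value (0 or 1) produced by the label encoder
GRS=float(input('GRS='))
val = regressor.predict(np.array([LT,LB,JKT,JKM,GRS]).reshape(-1,5))
# Convert the prediction from the scaled unit back to rupiah
val_new=val*250000000000
print('Prediksi :')
pd.DataFrame({'LT': LT,'LB':LB,'JKT':JKT,'JKM':JKM,'GRS':GRS,'Prediksi':val_new})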
Here you enter the land area, the building area, the number of bedrooms, the number of bathrooms, and the encoded garage value to see the predicted house price.
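Typed input is easy to get wrong, so it can help to validate the values before passing them to the model. The helper below is a hypothetical sketch, not part of the original notebook: `parse_features` and `FEATURES` are names I made up, and the rule that values must be non-negative is my own assumption.

```python
# Feature order must match the columns used to train the model.
FEATURES = ["LT", "LB", "JKT", "JKM", "GRS"]

def parse_features(raw):
    """Turn a dict of raw strings into one numeric row in training order."""
    row = []
    for name in FEATURES:
        text = raw[name].strip()
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"{name} must be a number, got {text!r}") from None
        if value < 0:
            raise ValueError(f"{name} must not be negative, got {value}")
        row.append(value)
    # A list containing one row, i.e. the 2-D shape that predict() expects.
    return [row]

# Hypothetical answers a user might type in.
answers = {"LT": "120", "LB": " 90 ", "JKT": "3", "JKM": "2", "GRS": "1"}
rows = parse_features(answers)
print(rows)
```

The returned `rows` could then be passed to `regressor.predict(rows)`, and the result multiplied by 250000000000 to read it in rupiah.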
THANK YOU!!!!!!