[Week 2 — House Price Prediction]

Halis Taha Şahin
Published in bbm406f18, Dec 14, 2018

Team Members: Harun Özbay, Halis Taha Şahin, Cihat Duman

Hi there,

Last week we published a post about our data set and the machine learning methods we will use.

This week we cover data preprocessing, a step in every machine learning study. Data preprocessing can be defined as examining the features of the data and evaluating the effect of each feature on the result separately. Preparing data for a machine learning algorithm comprises the following steps:

  • Data Selection
  • Data Preprocessing: Formatting the data, Cleaning the data, Sampling the data
  • Data Transformation: Scaling, Aggregation
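As an illustration of the scaling step, here is a minimal sketch of standardization (zero mean, unit variance) with pandas. It is not taken from our project code: the `standardize` helper and the toy values are assumptions made up for this example, and only the column names come from our data set.

```python
import pandas as pd


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to zero mean and unit variance."""
    numeric = df.select_dtypes(include="number")
    # ddof=0 uses the population standard deviation.
    # Note: a constant column would produce NaN here (division by zero).
    return (numeric - numeric.mean()) / numeric.std(ddof=0)


# Toy values for illustration only, not rows from the real data set.
toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "GarageCars": [1.0, 2.0, 3.0],
})
scaled = standardize(toy)
print(scaled)
```

After scaling, features measured in very different units (square feet versus a car count) are on a comparable scale, which matters for models that are sensitive to feature magnitude.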

With these steps in mind, we turn to our own work. We got a sense of the problem by examining previous work and then carried out our own analysis.

First, we looked at the null elements in the data set, because some machine learning models cannot be used if the data set contains null elements.

Missing Values

As we mentioned last week, our data has 80 features, not counting the sale price. Let’s take a look at our data set. In this table, we can see the features that have the most missing data.
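A missing-value summary of this kind can be sketched with pandas as follows. This is a minimal example under our own assumptions: the `missing_value_table` helper is a hypothetical name, and the toy frame only borrows column names from the data set, not its actual values.

```python
import numpy as np
import pandas as pd


def missing_value_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    table = pd.DataFrame({
        "missing": counts,
        "percent": 100.0 * counts / len(df),
    })
    # Keep only the columns that actually have gaps.
    table = table[table["missing"] > 0]
    return table.sort_values("missing", ascending=False)


# Toy values for illustration only, not rows from the real data set.
toy = pd.DataFrame({
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "TotalBsmtSF": [856.0, 1262.0, np.nan, 756.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})
missing = missing_value_table(toy)
print(missing)
```

Columns without any missing values are dropped from the summary, so the table stays short even with 80 features.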

Correlation

After that, we looked at the correlation between the sale price and the features in our dataset. Here are the top 50% correlation features with the sale price.

The features most strongly correlated with the sale price are:

OverallQual     0.817185
GrLivArea       0.700927
GarageCars      0.680625
GarageArea      0.650888
TotalBsmtSF     0.612134
1stFlrSF        0.596981
FullBath        0.594771
YearBuilt       0.586570
YearRemodAdd    0.565608
GarageYrBlt     0.541073
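A ranking like this one can be computed with pandas roughly as follows. This is a sketch under assumptions: the `top_correlated` helper, the 0.5 threshold, and the toy values are ours for illustration, `MoSold` is used here as a hypothetical weakly related column, and the numbers it prints are not the ones listed above.

```python
import pandas as pd


def top_correlated(df: pd.DataFrame, target: str = "SalePrice",
                   threshold: float = 0.5) -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    keeping only features whose absolute correlation exceeds the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    strong = corr[corr.abs() > threshold]
    return strong.sort_values(ascending=False)


# Toy values for illustration only, not rows from the real data set.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1200, 1500, 1600, 2100],
    "MoSold": [3, 9, 1, 9, 3],
    "SalePrice": [100, 150, 200, 260, 330],
})
ranking = top_correlated(toy)
print(ranking)
```

Only numeric columns are considered, since the Pearson correlation is not defined for categorical features without an encoding step.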

We will base the rest of our work on these evaluations and take these results into account in our future studies.

Future Work

In the following weeks, we will fill in the missing elements in the data, guided by the analysis above. Then we will apply machine learning models to the processed data and compare the results. We also need to determine which features add noise to the result.
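We have not settled on a filling strategy yet. As one common option, the sketch below fills numeric columns with the median and non-numeric columns with the most frequent value. The `fill_missing` helper, the `Street` column, and the toy values are assumptions for illustration only, not our final method.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gaps filled: median for numeric columns,
    most frequent value for all other columns."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Toy values for illustration only, not rows from the real data set.
toy = pd.DataFrame({
    "GarageYrBlt": [2003.0, np.nan, 1998.0, 2001.0],
    "Street": ["Pave", None, "Pave", "Grvl"],
})
filled = fill_missing(toy)
print(filled)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed features; whether it suits a given column still has to be checked feature by feature.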

References

https://www.simplilearn.com/data-preprocessing-tutorial

https://www.kaggle.com/surya635/house-price-prediction
