Week 3 - Real Estate Price Estimation

Tolga Kesemen
bbm406f18
3 min read · Dec 16, 2018

Team Members: Kübra Hanköylü, Emre Dağıstan, Ufuk Baran Karakaya

Data Analysis

After examining the dataset, we decided to make some adjustments. We observed that some features have missing values and some have only one unique value. For simplicity, we agreed to drop those features from the dataset.

The features with missing values are listed below:

  • architecturalstyletypeid
  • basementsqft
  • buildingclasstypeid
  • decktypeid
  • finishedsquarefeet13
  • finishedsquarefeet6
  • poolsizesum
  • pooltypeid10
  • pooltypeid2
  • storytypeid
  • typeconstructiontypeid
  • yardbuildingsqft26
  • fireplaceflag
  • taxdelinquencyflag
  • taxdelinquencyyear

The features with only one unique value are listed below:

  • decktypeid
  • hashottuborspa
  • poolcnt
  • pooltypeid10
  • pooltypeid2
  • pooltypeid7
  • storytypeid
  • fireplaceflag
  • taxdelinquencyflag

All features can be found here.

Diving into CatBoost

As mentioned in the previous post, we decided to use the CatBoost machine learning algorithm, which is developed by Yandex. CatBoost applies gradient boosting to decision trees.

“Boost” comes from gradient boosting, since the library is built on a gradient boosting framework. Gradient boosting is a powerful machine learning technique that is widely applied to business problems such as fraud detection, recommendation systems, and forecasting. It can also give very good results with relatively little data, unlike deep learning models, which need to learn from large amounts of data.
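To make the idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared error: start from the mean, then each round fit a regression “stump” (a single threshold split) to the current residuals and add a scaled copy of it to the model. The toy data and helper names are made up for illustration; CatBoost's actual trees, loss handling, and categorical-feature treatment are far more sophisticated.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Repeatedly fit stumps to residuals; sum them with a learning rate."""
    base = sum(y) / len(y)
    stumps, preds = [], [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + lr * stump(xi) for pi, xi in zip(preds, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

# Toy 1-D regression: the target roughly doubles past x = 3.
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 20.5]
model = gradient_boost(x, y)
```

After 50 rounds the ensemble has learned the jump at x = 3, so `model(1)` lands near 10 and `model(5)` near 21; the learning rate trades off how fast residuals shrink against overfitting.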

Advantages of Classification and Regression Trees (CART)

  • Simple to understand, interpret, visualize.
  • Decision trees implicitly perform variable screening or feature selection.
  • Can handle both numerical and categorical data. Can also handle multi-output problems.
  • Decision trees require relatively little effort from users for data preparation.
  • Nonlinear relationships between parameters do not affect tree performance.

Disadvantages of CART

  • Decision-tree learners can create over-complex trees that do not generalize well beyond the training data. This is called overfitting.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
  • Greedy algorithms are not guaranteed to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly drawn with replacement.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
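The variance-reduction claim behind bagging can be demonstrated with a deliberately simplified sketch: each “model” here is just a mean estimator fit on a bootstrap sample (drawn with replacement), not a real tree, but averaging many of them visibly shrinks the spread of the estimates. The toy price data is made up for illustration:

```python
import random
import statistics

# Hypothetical toy data: a small sample of house prices (in thousands).
prices = [200, 220, 250, 180, 300, 210, 190, 260, 240, 230]

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement, as bagging does."""
    return [data[rng.randrange(len(data))] for _ in data]

def bagged_mean(data, n_models, rng):
    """Average n_models mean-estimators, each fit on its own bootstrap sample."""
    return statistics.mean(
        statistics.mean(bootstrap_sample(data, rng)) for _ in range(n_models)
    )

rng = random.Random(0)
# Spread of single bootstrap estimates vs. bagged (averaged) estimates:
single = [statistics.mean(bootstrap_sample(prices, rng)) for _ in range(200)]
bagged = [bagged_mean(prices, 25, rng) for _ in range(200)]
print(statistics.stdev(single) > statistics.stdev(bagged))  # → True
```

The same averaging effect is why random forests and boosted ensembles are more stable than any single tree grown on the same data.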
