Week 3 - Real Estate Price Estimation

Tolga Kesemen
bbm406f18
3 min read · Dec 16, 2018

Team Members: Kübra Hanköylü, Emre Dağıstan, Ufuk Baran Karakaya

Data Analysis

After examining the dataset, we decided to make some adjustments. We observed that some features have missing values and some have only one unique value. For simplicity, we agreed to drop those features from the dataset.

The features with missing values are listed below:

  • architecturalstyletypeid
  • basementsqft
  • buildingclasstypeid
  • decktypeid
  • finishedsquarefeet13
  • finishedsquarefeet6
  • poolsizesum
  • pooltypeid10
  • pooltypeid2
  • storytypeid
  • typeconstructiontypeid
  • yardbuildingsqft26
  • fireplaceflag
  • taxdelinquencyflag
  • taxdelinquencyyear

The features with only one unique value are listed below:

  • decktypeid
  • hashottuborspa
  • poolcnt
  • pooltypeid10
  • pooltypeid2
  • pooltypeid7
  • storytypeid
  • fireplaceflag
  • taxdelinquencyflag

All features can be found here.

Diving into CatBoost

As mentioned in the previous post, we decided to use the CatBoost machine learning algorithm, which is developed by Yandex. CatBoost applies gradient boosting to decision trees.

“Boost” comes from gradient boosting, since the library is built on a gradient boosting framework. Gradient boosting is a powerful machine learning technique that is widely applied to business problems such as fraud detection, recommendation systems, and forecasting. It can also give very good results with relatively little data, unlike deep learning models, which need to learn from large amounts of data.
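To make the idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared error: start from the mean, then each round fit a regression “stump” (a single threshold split) to the current residuals and add a scaled copy of it to the model. The toy data and helper names are made up for illustration; CatBoost's actual trees, loss handling, and categorical-feature treatment are far more sophisticated.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Repeatedly fit stumps to residuals; sum them with a learning rate."""
    base = sum(y) / len(y)
    stumps, preds = [], [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + lr * stump(xi) for pi, xi in zip(preds, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

# Toy 1-D regression: the target roughly doubles past x = 3.
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 20.5]
model = gradient_boost(x, y)
```

After 50 rounds the ensemble has learned the jump at x = 3, so `model(1)` lands near 10 and `model(5)` near 21; the learning rate trades off how fast residuals shrink against overfitting.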

Advantages of Classification and Regression Trees (CART)

  • Simple to understand, interpret, visualize.
  • Decision trees implicitly perform variable screening or feature selection.
  • Can handle both numerical and categorical data. Can also handle multi-output problems.
  • Decision trees require relatively little effort from users for data preparation.
  • Nonlinear relationships between parameters do not affect tree performance.

Disadvantages of CART

  • Decision-tree learners can create over-complex trees that do not generalize well beyond the training data. This is called overfitting.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
  • Greedy algorithms are not guaranteed to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly drawn with replacement.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
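The variance-reduction claim behind bagging can be demonstrated with a deliberately simplified sketch: each “model” here is just a mean estimator fit on a bootstrap sample (drawn with replacement), not a real tree, but averaging many of them visibly shrinks the spread of the estimates. The toy price data is made up for illustration:

```python
import random
import statistics

# Hypothetical toy data: a small sample of house prices (in thousands).
prices = [200, 220, 250, 180, 300, 210, 190, 260, 240, 230]

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement, as bagging does."""
    return [data[rng.randrange(len(data))] for _ in data]

def bagged_mean(data, n_models, rng):
    """Average n_models mean-estimators, each fit on its own bootstrap sample."""
    return statistics.mean(
        statistics.mean(bootstrap_sample(data, rng)) for _ in range(n_models)
    )

rng = random.Random(0)
# Spread of single bootstrap estimates vs. bagged (averaged) estimates:
single = [statistics.mean(bootstrap_sample(prices, rng)) for _ in range(200)]
bagged = [bagged_mean(prices, 25, rng) for _ in range(200)]
print(statistics.stdev(single) > statistics.stdev(bagged))  # → True
```

The same averaging effect is why random forests and boosted ensembles are more stable than any single tree grown on the same data.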
