[Week 2 - Real Estate Price Estimation]

Ufuk Baran Karakaya
Published in bbm406f18
Dec 9, 2018

Team Members: Kübra Hanköylü, Tolga Kesemen, Emre Dağıstan

Observing Example Projects

As the second step, we examined the example projects published on Kaggle's page. Almost all of them are based on the XGBoost model. Every participant reduces the number of features to use memory effectively and to obtain faster run times. These projects give us some criteria to focus on.

XGBoost Model (Extreme Gradient Boosting)

Nowadays, XGBoost is one of the most popular machine learning algorithms, regardless of the type of prediction task at hand: regression or classification.

How does XGBoost distinguish itself from other models?

  • XGBoost is quite popular because it is faster than other ensemble classifiers. Its algorithm is parallelizable on GPUs; large datasets in particular may cause run-time problems, but running on GPUs can handle them.
  • A wide range of settings: XGBoost has options for cross-validation, regularization, user-defined objective functions, missing-value handling, tree parameters, an API compatible with scikit-learn, etc. (a minimal usage sketch follows this list).
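
As a rough illustration of these settings, here is a minimal sketch of an XGBoost regressor together with its built-in cross-validation. It is not our actual pipeline: the DataFrame `train`, the target column "price" and all parameter values are placeholders chosen for illustration, and the features are assumed to be numeric.

```python
import xgboost as xgb

# Placeholder data: DataFrame `train` with numeric features and target "price"
X = train.drop(columns=["price"])
y = train["price"]

# A few of the settings mentioned above
params = {
    "learning_rate": 0.05,
    "max_depth": 6,        # tree-structure parameter
    "reg_lambda": 1.0,     # L2 regularization
    "tree_method": "hist", # switch to "gpu_hist" to train on the GPU
}

# scikit-learn compatible API
model = xgb.XGBRegressor(n_estimators=500, **params)
model.fit(X, y)

# Built-in cross-validation on the native DMatrix interface
dtrain = xgb.DMatrix(X, label=y)
cv_results = xgb.cv(params, dtrain, num_boost_round=500,
                    nfold=5, metrics="rmse", seed=42)
print(cv_results.tail(1))
```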

LightGBM

Advantages of LightGBM:

  • Faster learning speed and higher efficiency: LightGBM uses a histogram-based algorithm, that is, it buckets continuous feature values into discrete bins, which speeds up the learning process.
  • Lower memory usage: replacing continuous values with discrete bins results in less memory usage.
  • Better accuracy than many other boosting algorithms: it builds much more complex trees by following a leaf-wise splitting approach rather than a level-wise one, which is a major factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.
  • Compatibility with large datasets: it works equally well with large datasets, with a significant reduction in training time compared to XGBoost.
  • Parallel training is supported (a usage sketch follows this list).
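
For comparison, a minimal LightGBM sketch under the same placeholder assumptions (DataFrame `train`, target "price", arbitrary parameter values); the leaf-wise growth is controlled through num_leaves and capped with max_depth, as noted above.

```python
import lightgbm as lgb

# Placeholder data, same assumptions as the XGBoost sketch
X = train.drop(columns=["price"])
y = train["price"]

model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,   # leaf-wise growth: complexity is controlled per leaf
    max_depth=8,     # cap the depth to limit the overfitting mentioned above
)
# Continuous features are bucketed into histogram bins internally,
# which is where the speed and memory advantages come from.
model.fit(X, y)
preds = model.predict(X)
```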

CatBoost

  • Performance: CatBoost provides better results and is competitive with any leading machine learning algorithm on the performance front.
  • Automatically handles categorical features: we can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and on combinations of categorical and numerical features.
  • Robust: it reduces the need for extensive hyper-parameter tuning and lowers the chance of overfitting, which leads to more generalized models. Still, CatBoost does have several parameters to tune, such as the number of trees, the learning rate, the regularization, the tree depth, the fold size, the bagging temperature and others.
  • Easy to use: you can run CatBoost from the command line or through user-friendly APIs for Python and R (a usage sketch follows this list).
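
A minimal CatBoost sketch under the same placeholder assumptions; the categorical column names here are purely illustrative. The point is that categorical columns are passed by name or index instead of being pre-encoded.

```python
from catboost import CatBoostRegressor, Pool

# Placeholder data, same assumptions as the earlier sketches
X = train.drop(columns=["price"])
y = train["price"]
cat_features = ["region", "building_type"]  # hypothetical categorical columns

# Categorical columns go in raw; CatBoost encodes them internally
train_pool = Pool(X, y, cat_features=cat_features)

model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.05,
    depth=6,               # tree depth
    l2_leaf_reg=3.0,       # regularization
    one_hot_max_size=10,   # low-cardinality categories get one-hot encoded
    verbose=False,
)
model.fit(train_pool)
preds = model.predict(Pool(X, cat_features=cat_features))
```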

XGBoost vs. LightGBM vs. CatBoost

The comparison table in the source linked below summarizes these models side by side; as it shows, CatBoost comes out as the strongest of the three.

To sum up, CatBoost offers the flexibility of specifying the indices of categorical columns so that they can be one-hot encoded. For all of these reasons, we have decided to work with CatBoost.

source: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db

Analysis of Features

source: https://www.kaggle.com/philippsp/exploratory-analysis-zillow

Since most of the features are sparse, we have decided to eliminate some of them to lower memory usage and achieve faster run times.
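
A minimal sketch of this kind of sparsity-based filtering, assuming a pandas DataFrame named `properties`; the 0.98 missing-ratio threshold is an illustrative choice, not our final cutoff.

```python
import pandas as pd

# Fraction of missing values per column
missing_ratio = properties.isnull().mean()

# Columns that are almost entirely empty (hypothetical 98% threshold)
sparse_cols = missing_ratio[missing_ratio > 0.98].index

# Keep only the denser features to save memory and training time
properties = properties.drop(columns=sparse_cols)
print(f"Dropped {len(sparse_cols)} sparse columns")
```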
