Understanding CatBoost Algorithm

Meet Raval · Published in Analytics Vidhya · Aug 17, 2020 · 4 min read

One of the best boosting algorithms


In this quick tutorial, we are going to discuss:

  • Origins of CatBoost.
  • When to use CatBoost (which type of data suits it).
  • How to implement CatBoost on any dataset.

CatBoost originated at Yandex, a Russian company. Released in 2017, it is one of the newest boosting algorithms out there. There were already many boosting algorithms like XGBoost, LightGBM etc., but CatBoost stands apart from them for several reasons.

CatBoost stands for Categorical Boosting because it is designed to work with categorical data natively: if you have categorical columns in your dataset, you can pass them to the model directly instead of one-hot encoding them first.

Here are some features of CatBoost which make it stand apart from all the other boosting algorithms.

  • High Quality without parameter tuning
  • Categorical Features support
  • The fast and scalable GPU version
  • Improved accuracy by reducing overfitting
  • Fast Predictions
  • Works well with less data

For the reasons mentioned above, CatBoost has been a favorite in recent Kaggle competitions. Now let’s answer another question: when should you use this boosting algorithm?

When to Use the CatBoost Algorithm?

There are two types of data out there: heterogeneous data and homogeneous data.

Heterogeneous data: any data with high variability of data types and formats. It can be ambiguous and low quality due to missing values and high data redundancy. Example: a dataset for predicting credit scores.

Homogeneous data: a dataset made up of things that are similar to each other, meaning the entire dataset is of the same data type and format. Examples: datasets of images, video, sound, or text.

CatBoost works well on heterogeneous data.

So, if you have a dataset of images, text, or sound, it is probably a good idea to use neural networks, as they are state of the art for these types of homogeneous data.

But if you have a classification problem with heterogeneous data, then CatBoost is a safe choice, because it tends to outperform the majority of boosting algorithms even on the first run, before any tuning.

Implementing CatBoost

We are going to apply CatBoost to the Human Activity Recognition dataset for multiclass classification. It is a UCI dataset in which we have to predict one of six activities (WALKING, WALKING UPSTAIRS, WALKING DOWNSTAIRS, SITTING, STANDING, and LAYING) based on data extracted from a smartphone’s sensors, such as the gyroscope and accelerometer.

To solve this problem, after importing the dataset I applied pre-processing using StandardScaler, and after that came the main part, where we apply the CatBoost classifier. Let’s go through the classifier code and understand every single line.
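The pre-processing step can be sketched as follows; synthetic arrays stand in for the HAR train/test splits so the example is self-contained:

```python
# Sketch of the pre-processing described above: standardize the sensor
# features before training. Random arrays stand in for the HAR splits.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(200, 10))
X_test = rng.normal(loc=3.0, scale=2.0, size=(50, 10))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

Note that the scaler is fitted on the training split only, so no information from the test set leaks into training.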

So, first of all, we have to import the CatBoostClassifier from the catboost package.

After importing the CatBoost library we create our model. Now let’s go through its parameters.

  • iterations: setting this to 1000 means CatBoost will build up to 1000 trees while minimizing the loss function.
  • loss_function: since we are classifying multiple classes, we have to specify ‘MultiClass’. In the case of binary classification it is okay not to mention the loss function; the algorithm will default to Logloss and perform binary classification.
  • bootstrap_type: this parameter affects the regularization and the speed of the algorithm when choosing splits during tree construction. Here we have chosen Bayesian, but it is okay not to specify this parameter.
  • eval_metric: since we are doing multiclass classification we have chosen ‘MultiClass’ as the eval_metric; for binary classification we don’t have to specify this parameter.
  • leaf_estimation_iterations: this parameter controls how the leaf values are calculated after the tree structure has been selected. We have taken 100, but it is also okay not to specify it.
  • random_strength: it specifies how much randomness to inject when scoring splits, i.e. how different we want the trees to be from each other. It is okay not to specify it.
  • depth: how deep we want each tree to be. I have specified 7 because it gave me the highest accuracy, but it is okay not to specify it and let CatBoost use its default value.
  • l2_leaf_reg: specifies the L2-regularization coefficient. We have taken 5, but it is not mandatory.
  • learning_rate: it is very important, but generally CatBoost’s default learning rate of 0.03 also works well.
  • bagging_temperature: controls the Bayesian bootstrap, which assigns random weights to objects when the Bayesian bootstrap type is used. Not mandatory to specify.
  • task_type: it is recommended to run CatBoost on a GPU (task_type='GPU') when one is available, because training on a CPU can become quite slow on large datasets.

After experimenting with the parameters and finding the best values, we fit the model and predict the output. CatBoost is very easy to work with, and it only needs a handful of parameters to tune, for example:

  • learning_rate + iterations
  • depth
  • l2_leaf_reg
  • bagging_temperature
  • random_strength

Then you will easily be able to get very good performance compared to other boosting algorithms.

With CatBoost you can run the model by specifying only the problem type (binary or multiclass classification) and still get a very good score without much overfitting.

So, this is how the algorithm works. To learn more about CatBoost, the official documentation and tutorials are good places to start.
