Working with categorical data: Catboost

Katerina · Published in What's your data? · Jun 19, 2018 · 4 min read

I introduced the issues with categorical data and machine learning with the intent of demonstrating catboost. After upgrading my OS, reinstalling Anaconda, and updating pip, I finally got catboost installed. Now my older laptop is part of the present.

What is catboost?

A machine doesn't understand categorical data, so categories such as Dutch, German, and Belgian have to be turned into numerical values. Simply assigning them different levels (0, 1, 2) doesn't work: with nominal variables you would be introducing a ranking that isn't there. While an order makes sense for t-shirt sizes, it doesn't for nationalities, colors, etc. A common solution is to transform all nominal values into binary dummy variables (one-hot encoding), but that can quickly explode your feature set.
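As a quick illustration (a toy sketch with a made-up nationality column, nothing catboost-specific), one-hot encoding turns a single nominal column into one binary column per category:

```python
import pandas as pd

# Toy data: one nominal feature with three levels.
df = pd.DataFrame({"nationality": ["Dutch", "German", "Belgian", "Dutch"]})

# One-hot encoding: one binary column per category.
dummies = pd.get_dummies(df["nationality"], prefix="nationality")
print(dummies)

# A feature with k categories becomes k columns; with several
# high-cardinality features this quickly blows up the feature set.
```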

How does catboost work?

For every category (e.g., Dutch, German, Belgian) of a nominal variable (e.g., nationality), catboost calculates a value, a so-called target-based statistic. This is done in a number of steps:

  1. We begin with one categorical feature (e.g., nationality). This is called x.
  2. In one randomly chosen row (the k-th row in the training data set), we replace one level of this categorical feature (the i-th level of x) with a number (e.g., Dutch with 5).
  3. This number (in our example 5) is based on the target variable (the one we want to predict), conditional on the category level. In other words, the number is the expected value of the outcome variable for that category.
  4. A splitting attribute is used to create two sets of the training data: one set with all categories (e.g., German, French, Indian, etc.) whose value is greater than the one computed in step 3, and another set with the categories whose value is smaller.

These are the steps. The number in step 3 isn't just a random value; it is a target-based statistic (TBS). It can be calculated in four different ways: greedy, holdout, leave-one-out, and ordered. Catboost uses the last one (ordered TBS).

The greedy TBS is the average of the target variable for a category, computed over the complete data set. Of course, this can be problematic: the statistic is computed from the same rows it is later used to predict, which leaks the target. The holdout method splits the original training data into two random data sets; one is used to calculate the target statistics and the other to test how accurate they are. Of course, this requires more data than the greedy approach. The leave-one-out approach is similar to the holdout approach, except that only one observation is left out: the observation for which the target statistic is being calculated.
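As a rough sketch of the greedy variant (my own toy example, not catboost's internal code), the statistic is just a per-category mean of the target, usually smoothed towards a prior:

```python
import pandas as pd

# Toy data: the nominal feature plus the target we want to predict.
df = pd.DataFrame({
    "nationality": ["Dutch", "German", "Dutch", "Belgian", "German", "Dutch"],
    "target":      [5, 3, 6, 2, 4, 5],
})

prior = df["target"].mean()  # global mean, used as a smoothing prior
a = 1.0                      # smoothing weight

# Greedy TBS: per-category mean of the target over the *whole* data set,
# smoothed towards the prior so rare categories don't get extreme values.
stats = df.groupby("nationality")["target"].agg(["sum", "count"])
tbs = (stats["sum"] + a * prior) / (stats["count"] + a)

df["nationality_tbs"] = df["nationality"].map(tbs)
print(df)
```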

When doing an ordered TBS, the assumption is that what happened in the past influences the future. This makes sense when analyzing clicks or films watched. For a categorical variable like nationality this assumption is a bit weird, but for the sake of calculating the TBS let's just go with it and introduce 'artificial time'. In catboost the rows are put into a random order, and only the rows preceding the focal row are used to calculate its target statistic. This is done several times, with different random permutations.
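A minimal sketch of that idea (again a toy version with a single permutation, not the library's implementation): shuffle the rows once, then compute each row's statistic from the earlier rows only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "nationality": ["Dutch", "German", "Dutch", "Belgian", "German", "Dutch"],
    "target":      [5, 3, 6, 2, 4, 5],
})

prior = df["target"].mean()
a = 1.0

# 'Artificial time': one random permutation of the rows.
order = rng.permutation(len(df))

tbs = np.empty(len(df))
sums, counts = {}, {}
for idx in order:
    cat = df.loc[idx, "nationality"]
    # Only rows that came earlier in the permutation contribute.
    tbs[idx] = (sums.get(cat, 0.0) + a * prior) / (counts.get(cat, 0) + a)
    sums[cat] = sums.get(cat, 0.0) + df.loc[idx, "target"]
    counts[cat] = counts.get(cat, 0) + 1

df["nationality_ordered_tbs"] = tbs
print(df)
```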

Prediction shift and ordered boosting

When predicting a variable, several models are often combined. This is called an ensemble method; stacking and boosting are two common examples. The goal is to increase the power of weak learners (weak models).

A problem that arises with weak learners is that they might not be much better than a random guess. What can happen if you put several of these just-better-than-a-random-guess models together is that they learn from each other in a bad way. This means they keep on making the same bad predictions. It's like they are stuck in a hole and can't get out (a local optimum).

Fitting a function with gradient descent (from Arvind Nataraja on Towards Data Science)

Boosting is a way to get out of this local hole (the blue area in the figure). This is important for getting better predictions, better in the sense that the predictions are also accurate for data that is not included in the training data.
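To make 'weak learners correcting each other' concrete, here is a minimal boosting sketch (my own illustration, using shallow sklearn trees as the weak learners, not catboost's algorithm), where each new tree is fit on the residuals of the ensemble so far:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy regression problem.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(50):
    residuals = y - prediction                  # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2)  # a weak learner
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)

print("training MSE:", np.mean((y - prediction) ** 2))
```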

Catboost uses an ordered boosting method to avoid a prediction shift. A prediction shift happens when the predictions for the variable we want to find out (e.g., GPA or house prices) move closer to the training data and further away from the true values.

The algorithm developed by Yandex and used in catboost tries to avoid a shift in residuals. Residuals are the difference between the observed and the predicted values. In the left figure this would be the difference between the green line and the red dots.

Catboost makes use of the same ordering trick for the gradient boosting as for computing the target statistics. Again, the rows preceding the focal observation are put in a random order, and a model trained on the previous data (the first i-1 observations) is used to compute the residual for the current observation (the i-th). Unfortunately, doing this naively is not practical: it would mean keeping a separate model for every position, which is too complex and takes up too much memory.
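As a toy sketch of why the naive version is expensive (my own illustration, with a small sklearn tree standing in for 'the model trained on the earlier rows'), every observation gets its residual from a model that has only seen the rows before it in the permutation, so you end up fitting one model per row:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data.
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

order = rng.permutation(len(y))   # 'artificial time' again
residuals = np.zeros(len(y))

for pos, i in enumerate(order):
    seen = order[:pos]            # rows that came earlier in the permutation
    if len(seen) == 0:
        residuals[i] = y[i]       # no earlier rows: predict 0 as a placeholder
        continue
    # Naive ordered boosting: a separate model per position,
    # trained only on the rows the current row is allowed to 'see'.
    model = DecisionTreeRegressor(max_depth=2).fit(X[seen], y[seen])
    residuals[i] = y[i] - model.predict(X[i:i + 1])[0]

# One model fit per row -- this is the cost the naive approach runs into.
print("mean absolute residual:", np.abs(residuals).mean())
```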

An alternative is a gradient boosting algorithm with oblivious decision trees. Oblivious means that the same splitting criterion is used for every node on the same level of the tree. Recall that a decision tree begins with a root. At this point a decision has to be made which leads to two branches, and these split further. There is lots more to learn about decision trees. Important to keep in mind is that splitting costs resources, and hence the cost of a split should be less than the information gained from it.

Decision Tree (from Xoriant Blog)
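To tie this together, here is a minimal usage sketch (my own example with made-up column names and illustrative parameter values, not a tuned model): catboost handles the target statistics and the ordered boosting internally, you only tell it which columns are categorical:

```python
import pandas as pd
from catboost import CatBoostRegressor

# Made-up training data: one categorical feature, one numeric feature.
df = pd.DataFrame({
    "nationality": ["Dutch", "German", "Belgian", "Dutch", "German", "Belgian"],
    "age":         [34, 28, 45, 51, 39, 23],
    "income":      [52, 48, 61, 70, 55, 40],  # target, in thousands
})

X = df[["nationality", "age"]]
y = df["income"]

model = CatBoostRegressor(
    iterations=100,           # number of trees
    depth=4,                  # oblivious trees of depth 4
    learning_rate=0.1,
    boosting_type="Ordered",  # the ordered boosting scheme discussed above
    verbose=False,
)

# cat_features tells catboost which columns to encode with target statistics.
model.fit(X, y, cat_features=["nationality"])
print(model.predict(X.head(3)))
```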

For an excellent tutorial on how to implement catboost (with a comparison to other algorithms), check out this post by Alvira Swalin on Towards Data Science.
