Working with categorical data
Coming from the social sciences, I’m used to working with data sets that have a couple of categorical variables. These are often experimental condition (control vs. intervention), gender (female vs. male), or nationality (Dutch vs. German vs. Belgian vs. other). But as part of the week 2 exercise in the Method Data Science Community, my team is tackling a Kaggle challenge with many categorical variables. As my laptop is still working on the one-hot encoding, a different solution is needed. My team member Chris Dinant already suggested Catboost as a way to work with the categorical data. This post delves into this method, explaining when it can be applied and how to apply it.
What’s the problem with categorical variables?
When social scientists work with categorical variables, they often use one of two solutions. First, an ANOVA or MANOVA is used. By using a factorial design, it is possible to make inferences about the differences between groups. Second, if the categorical variable has only two levels, it is included as a dummy variable in a regression. Nationality is often only included as a control variable. In that case it often doesn’t matter which nationality is positively or negatively related to the outcome variable; it’s just there to control for potential national and cultural tendencies.
When running machine learning algorithms, simply assigning numbers to categorical variables works if a category has only two levels. This is the case for gender (male/female), bought a product (yes/no), or attended a course (yes/no). When a category has several levels, as with nationality, assigning numbers to each level implies an order of the levels: one level of the category ranks lower than another. While this makes sense for ordinal variables (e.g., preference rankings of food items or educational degree), it is a wrong assumption for nominal variables such as color preference, nationality, or residential city.
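A small sketch of this pitfall in plain Python; the codes and variable names are made up for illustration:

```python
# Assigning integer codes to a NOMINAL variable invents an order.
nationality_codes = {"Dutch": 0, "German": 1, "Belgian": 2, "other": 3}
# A model that treats these codes as numbers would conclude that
# "other" (3) is greater than "Belgian" (2), and that the distance
# between Dutch and Belgian is twice that between Dutch and German.
# None of these comparisons mean anything for nationality.

# For a truly ORDINAL variable, the same trick is fine,
# because the order is part of the variable's meaning:
degree_codes = {"high school": 0, "bachelor": 1, "master": 2, "phd": 3}
higher = degree_codes["master"] > degree_codes["bachelor"]  # meaningful
```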
Transformation: One-hot encoding and getting dummies
To deal with categorical variables that have more than two levels, the usual solution is one-hot encoding. This takes every level of the category (e.g., Dutch, German, Belgian, and other), and turns it into a variable with two levels (yes/no). In our example the variable nationality would then be transformed into four variables (Dutch, German, Belgian, and other), each taking the value of 1 if the participant has that nationality and 0 otherwise.
Working directly with nominal variables: Catboost
Catboost was developed by researchers and engineers at Yandex for their own work on ranking tasks, forecasting, and making recommendations. In short, it’s a gradient boosting method on decision trees.
Gradient boosting is a method by which several not-so-good regression or classification models are combined to come up with a prediction. These models are called not-so-good because they are weak learners: each one is only slightly better than random guessing. By combining many weak predictions, each new model correcting the errors of the ones before it, an ensemble of models is created. (This is different from stacking, another ensemble technique, in which independently trained models are combined by a meta-model.)
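The idea can be sketched from scratch. The code below is a minimal, illustrative version of gradient boosting for regression with squared error, using depth-1 “stumps” as the weak learners; the function names and toy data are my own, not from any library:

```python
def fit_stump(x, residuals):
    """Find the threshold on a 1-D feature that best fits the residuals
    with two constant values (a depth-1 decision 'stump')."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, n_rounds=10, lr=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals
    and add a damped version of it to the running prediction."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 2.9]
pred = boost(x, y)
```

Each stump on its own is a weak model, but the sum of many stumps fits the data far better than the mean baseline does.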
More reading on gradient boosting:
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning: Introduction of the method and example
- Gradient Boosting from scratch: If you like math formulas, check out this post
A decision tree is a method and algorithm for regression and classification tasks in which the data is repeatedly split: at every step, a decision is made based on the value of one of the variables. A decision tree begins with the root, which holds all observations. It then fans out the further you go, with different values of the various variables being assigned to different branches, until the leaves hold the predicted values.
More readings on decision trees:
- Decision trees in machine learning: Excellent introduction into decision trees for machine learning.
Understanding now what gradient boosting is and what a decision tree is, we can make more sense of what catboost is supposed to do: it’s an algorithm that combines the predictive power of many decision trees.
Catboost and dealing with categorical features
In their paper, Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin describe how catboost deals with categorical features. The standard way is to compute some statistic, such as the average label value, for each value of the category. However, this creates problems if there is only one example for a category value: in that case, the encoded value of the category would be the same as the label value. For example, if in our example with nationalities there is only one Belgian student, and that student’s label value is 2, the category Belgian would be encoded as 2 — the encoding simply leaks the label. This can create problems of overfitting.
Overfitting happens when a model is very good at predicting the outcome variable in the training data set, but terrible at predicting the outcome variable in another data set. In this case, the model has been fitted too closely to the training data set.
To avoid this problem, the authors designed a solution which involves randomly changing the order of rows in the complete data set:
“We perform a random permutation of the data set and for each example we compute average label value for the example with the same category value placed before the given one in the permutation” (p. 2)
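The quoted idea can be sketched as follows. This is my own simplified reading, not catboost’s actual implementation: each row’s category is replaced by the average label of same-category rows that appear earlier in a random permutation, and the `prior` used for a category’s first occurrence is my own choice:

```python
import random

def ordered_target_stats(categories, labels, prior=0.5, seed=0):
    """Encode each category value by the average label of earlier rows
    (in a random permutation) that share the same category value."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)       # the random permutation
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        if c in counts:
            encoded[i] = sums[c] / counts[c]  # average of earlier labels
        else:
            encoded[i] = prior                # no earlier example seen
        sums[c] = sums.get(c, 0.0) + labels[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

nat = ["Dutch", "Dutch", "Belgian", "Dutch", "Belgian"]
y = [1, 0, 1, 1, 0]
enc = ordered_target_stats(nat, y)
```

Because a row never sees its own label, a category with a single example no longer copies the label outright, which is exactly the leak the authors are guarding against.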
In their paper they also describe how different features are combined to create a new feature. Think about it: the individual categorical and numerical data points together describe one observation, and the chance that two observations are exactly identical is slim. Hence, different categorical and numerical values could be combined to create a unique merged categorical variable which contains all the different individual choices. While this might sound easy, doing this for all potential combinations would be computationally intensive.
Another way to combine different features is to do a greedy search at every tree split. Catboost does this by combining all categorical and numerical values used in the current tree with all categorical values in the data set.
Greedy search is when you look for a solution without considering all the potential combinations and decision points. Instead you focus on the immediate surroundings, making locally optimal choices. This contrasts with exhaustive search, where all the potential combinations are considered.
Catboost offers tutorials. I would post a tutorial, but I have another issue that needs to be solved right now: jupyter notebook doesn’t start after installing anaconda. Why did I install anaconda? catboost didn’t want to install via pip. I was still using TLS 1.0.