
Working with categorical data

Katerina · What's your data? · Jun 17, 2018

Coming from the social sciences, I’m used to working with data sets that have a couple of categorical variables. These are often experimental condition (control vs. intervention), gender (female vs. male), or nationality (Dutch vs. German vs. Belgian vs. other). But as part of the week 2 exercise in the Method Data Science community, my team is tackling a Kaggle challenge with many categorical variables. As my laptop is still working on the one-hot encoding, a different solution is needed. My team member Chris Dinant already suggested Catboost as a way to work with the categorical data. This post delves into this method, explaining when it can be applied and how to apply it.

What’s the problem with categorical variables?

When social scientists work with categorical variables, they often use one of two solutions. First, an ANOVA or MANOVA is used; with a factorial design, it is possible to make inferences about the differences between groups. Second, if the categorical variable has only two levels, it is included as a dummy variable in a regression. Nationality is often only included as a control variable. In that case it usually doesn’t matter which nationality is positively or negatively related to the outcome variable; it is just there to control for potential national and cultural tendencies.

When running machine learning algorithms, simply assigning numbers to categorical variables works if a category has only two levels. This is the case for gender (male/female), bought a product (yes/no), attended a course (yes/no). When a category has several levels, as with nationality, assigning numbers to each level implies an order of the levels: one level of the category ranks lower than another. While this makes sense for ordinal variables (e.g., preference for food items or educational degree), it is a wrong assumption for nominal variables such as color preference, nationality, or residential city.
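Here is a minimal sketch of what that implicit ordering looks like (toy data made up for the nationality example; not code from the original post):

```python
import pandas as pd

# Toy data for the nationality example.
df = pd.DataFrame({"nationality": ["Dutch", "German", "Belgian", "other"]})

# Simply assigning a number to each level...
codes = {"Dutch": 0, "German": 1, "Belgian": 2, "other": 3}
df["nationality_code"] = df["nationality"].map(codes)
print(df)

# ...implicitly tells a model that other (3) > Belgian (2) > German (1) > Dutch (0),
# an ordering that has no meaning for a nominal variable.
```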

Transformation: One-hot encoding and getting dummies

To deal with categorical variables that have more than two levels, one solution is one-hot encoding. This takes every level of the category (e.g., Dutch, German, Belgian, and other) and turns it into a variable with two levels (yes/no). In our example, the variable nationality would be transformed into four variables (Dutch, German, Belgian, and other), each taking the value 1 if the participant has that nationality and 0 otherwise.

(The original post showed code and its output here: the data set df, the one-hot encoded array X, and the data set after using get_dummies.)
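Since that code is not shown here, the following is a rough reconstruction of the same steps with made-up toy data, using scikit-learn’s OneHotEncoder and pandas get_dummies:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the original data set df.
df = pd.DataFrame({
    "nationality": ["Dutch", "German", "Belgian", "other", "Dutch"],
    "score": [7.5, 6.0, 8.1, 5.5, 6.9],
})

# scikit-learn's OneHotEncoder returns an array X with one 0/1 column per level.
encoder = OneHotEncoder()
X = encoder.fit_transform(df[["nationality"]]).toarray()
print(encoder.categories_)
print(X)

# pandas get_dummies does the same transformation directly on the DataFrame.
dummies = pd.get_dummies(df, columns=["nationality"])
print(dummies)
```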

Working directly with nominal variables: Catboost

Catboost was developed by researchers and engineers at Yandex for their own work on ranking tasks, forecasting, and making recommendations. In short, it is a gradient boosting method on decision trees.

Gradient boosting is a method by which several not-so-good regression or classification models are combined to come up with a prediction. These models are called not-so-good because they are weak learners: each one is only slightly better than guessing, and each new one is fit to the errors the models before it still make. By combining many weak predictions, an ensemble of models is created.
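To make the idea concrete, here is a toy sketch of the boosting loop (my own illustration, not Catboost’s code): shallow trees are fit one after another, each to the errors the ensemble so far still makes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)
trees = []

for _ in range(50):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```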

More reading on gradient boosting:

A decision tree is a method and algorithm for regression and classification tasks in which observations are split up step by step according to their values on the different variables. A decision tree begins at the root with the complete data set and fans out the further you go, with values of the various variables being assigned to different branches, until the leaves at the end hold the predicted values.
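A quick way to see this branching structure is to fit a small tree and print its splits (illustration only, using scikit-learn’s built-in iris data set rather than the post’s data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small decision tree and print its splits, from the root down to
# the leaves that hold the predicted class.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

print(export_text(tree, feature_names=list(iris.feature_names)))
```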

More reading on decision trees:

Understanding what gradient boosting is and what a decision tree is, we can now make more sense of what catboost is supposed to do: it is an algorithm that combines the predictive power of different decision trees.

Catboost and dealing with categorical features

In their paper, Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin describe how catboost deals with categorical features. The standard way is to replace each category with a statistic computed from the label values of the examples in that category, such as the average label value. However, this creates problems if there is only one example for a category value: the numerical value of the category would then be the same as that example’s label. For example, if there is only one Belgian student in our nationality data and that student’s label is 2, the category Belgian would be encoded as exactly 2, the student’s own label. This can create problems of overfitting.

Overfitting happens when a model predicts the outcome variable very well in the training data set but is terrible at predicting the outcome variable in another data set. In this case, the model has been fit too closely to the training data set.

To avoid this problem, the authors designed a solution that involves randomly changing the order of rows in the complete data set:

We perform a random permutation of the data set and for each example we compute average label value for the example with the same category value placed before the given one in the permutation (p. 2)
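A rough sketch of that idea (my own simplification, not Catboost’s implementation; the prior value used for smoothing is an assumption): shuffle the rows, then encode each row’s category with the average label of the earlier rows that share that category, so a row never sees its own label.

```python
import pandas as pd

# Toy data for the nationality example.
df = pd.DataFrame({
    "nationality": ["Dutch", "Belgian", "Dutch", "German", "Dutch", "German"],
    "label":       [1,       0,         0,       1,        1,       0],
})

# Random permutation of the data set.
permuted = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

prior = 0.5                 # assumed smoothing value for categories not seen yet
sums, counts = {}, {}
encoded = []
for nationality, label in zip(permuted["nationality"], permuted["label"]):
    s, c = sums.get(nationality, 0.0), counts.get(nationality, 0)
    encoded.append((s + prior) / (c + 1))  # average label of earlier examples, with a prior
    sums[nationality] = s + label
    counts[nationality] = c + 1

permuted["nationality_encoded"] = encoded
print(permuted)
```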

In their paper they also describe how different features are combined to create new features. Think about it: every observation is described by its individual combination of categorical and numerical data points. The chance that two observations are exactly identical is slim. Hence, different categorical and numerical values could be combined to create a unique merged categorical variable which contains all the different individual choices. While this might sound easy, doing this for all potential combinations is computationally intensive.
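A toy illustration of such a merged feature (the column names are made up); Catboost builds combinations like this internally, and doing it by hand for every pair of columns quickly becomes expensive:

```python
import pandas as pd

df = pd.DataFrame({
    "nationality": ["Dutch", "German", "Belgian"],
    "favourite_color": ["red", "blue", "red"],
})

# Concatenating two categorical columns yields one merged categorical feature.
df["nationality_x_color"] = df["nationality"] + "_" + df["favourite_color"]
print(df)
```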

Another way to combine different features is to do a greedy search at every tree split. Catboost does this by combining the categorical features already used for splits in the current tree with all the categorical features in the data set.

Greedy search is when you look for a solution but do not consider all the potential combinations and decision points. Instead, you focus on the local surroundings and make the locally optimal choice at each step. This contrasts with exhaustive search, where all the potential combinations are considered.

Catboost offers tutorials. I would post a tutorial, but I have another issue that needs to be solved right now: jupyter notebook doesn’t start after installing anaconda. Why did I install anaconda? Catboost didn’t want to install via pip; I was still using TLS 1.0.
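In the meantime, here is a minimal sketch of how the catboost package is typically pointed at categorical columns (untested here; the data and column indices are made up): the cat_features argument tells the model which columns are categorical, so no one-hot encoding is needed beforehand.

```python
from catboost import CatBoostClassifier

# Made-up training data: nationality (categorical), age, attended a course.
train_data = [
    ["Dutch", 23, 1],
    ["German", 31, 0],
    ["Belgian", 28, 1],
    ["other", 35, 0],
]
train_labels = [1, 0, 1, 0]

model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(train_data, train_labels, cat_features=[0])  # column 0 is categorical

print(model.predict([["Dutch", 30, 1]]))
```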
