Types of Categorical Features (Never Forget Cyclic!)

Elevate your feature engineering by treating variables the way they deserve

5 min read · Jul 28, 2022

Also called qualitative features, categorical features contain values that are sorted into defined groups. Beyond this general definition, there are several subtypes, and each one must be treated differently in the feature engineering process.

Nominal

This is the classic: the type of categorical feature people think of first. This subset does not have an intrinsic order. Say you wrote a survey asking participants to pick their favorite food from the options “sandwich”, “salad”, “pasta”, and “soup”. There are four options, but they have no set order unless you deliberately define one and communicate it to the participants (most carb-heavy, maybe?).

To prepare the data for a machine learning model, a nominal feature is normally transformed using one-hot encoding. In our example, this expands our single feature into four separate ones (favorite_food becomes ff_sandwich, ff_salad, ff_pasta, ff_soup). Instead of holding the value “sandwich”, each of the four new features is given a 1 or a 0 depending on whether it matches the participant’s response.
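As a rough illustration of the expansion described above, here is a dependency-free sketch. The responses list and the one_hot helper are hypothetical names made up for this example; only the ff_ column names come from the survey example.

```python
# Hypothetical survey responses for the favorite_food question.
responses = ["sandwich", "salad", "pasta", "soup", "pasta"]
categories = ["sandwich", "salad", "pasta", "soup"]


def one_hot(value, categories, prefix="ff"):
    """Return one 0/1 entry per category for a single response."""
    return {f"{prefix}_{cat}": int(value == cat) for cat in categories}


# One dict of four 0/1 features per participant.
encoded = [one_hot(r, categories) for r in responses]

print(encoded[0])
# {'ff_sandwich': 1, 'ff_salad': 0, 'ff_pasta': 0, 'ff_soup': 0}
```

In practice you would more likely reach for a library routine such as pandas.get_dummies or scikit-learn's OneHotEncoder, which perform the same kind of expansion on whole columns.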

Dichotomous / Binary

The ole binary feature. This categorical feature has only two categories/options. Mistakenly considered a numerical feature very, very often.

An example here would be a light being “on” or “off”. Two options, no other possibility.

Two drawn lightbulbs, one turned off and one lit up. — 📷: Cite

The confusion comes in when, instead of recording the light as “on” or “off”, it is represented with the binary numbers 1 (on) and 0 (off). Since 1 and 0 are numbers, many people’s first reaction is to consider the feature numerical and thus treat it incorrectly when doing feature imputation.

However, as long as you keep in mind the feature is categorical, the easiest way to represent a binary feature in a model is to have the number 1 represent one value and the number 0 represent the other.
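To make the imputation point concrete, here is a small sketch. It assumes that the mistake in question is mean imputation, which is one common way a binary feature gets treated as numerical; the readings list is hypothetical data invented for the example.

```python
from statistics import mode

# Hypothetical light-switch readings; None marks a missing value.
readings = ["on", "off", "on", None, "on", "off"]

# Encode the two categories as 1 and 0, leaving the gap untouched.
mapping = {"on": 1, "off": 0}
encoded = [mapping[r] if r is not None else None for r in readings]

observed = [v for v in encoded if v is not None]

# Numerical thinking: the mean is 0.6, which is not a valid category.
mean_value = sum(observed) / len(observed)

# Categorical thinking: fill the gap with the most common category.
most_common = mode(observed)
imputed = [most_common if v is None else v for v in encoded]

print(mean_value)  # 0.6
print(imputed)     # [1, 0, 1, 1, 1, 0]
```

Filling with the most common category is only one option for a categorical feature; the point of the sketch is that whatever you fill in should be one of the two valid values.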

Ordinal

This subset represents categorical features with a defined order, but there is often no quantifiable measure of distance between the different categories.

For example, the categories “tolerable”, “liked”, “liked a lot” have a defined order of increasingly positive emotion. However, how much distance in emotion is there from “liked” to “liked a lot”?

Example of ordinal data listing the values: Didn’t like, Tolerable, Liked, Liked a Lot, Loved. — 📷: Cite

To use these values in models, a person’s gut reaction is usually, “Oh, since they have an order anyway, let’s assign them increasing numbers: 4 (loved), 3 (liked a lot), 2 (liked), 1 (tolerable), and 0 (didn’t like).” That way the order is maintained.

While this is sometimes the best method, the data developer first needs to justify that the categories are all an equal distance from each other for this logic to hold (i.e., “a lot” is the same distance from “a little” as “a little” is from “none”).
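The integer encoding and the assumption baked into it can be sketched in a few lines. The responses list is hypothetical, and the variable names are chosen only for illustration.

```python
# Ordered categories from least to most positive.
levels = ["didn't like", "tolerable", "liked", "liked a lot", "loved"]

# Integer encoding: each category's position in the ordered list (0 to 4).
rank = {level: i for i, level in enumerate(levels)}

# Hypothetical survey responses.
responses = ["liked", "loved", "tolerable", "liked a lot"]
encoded = [rank[r] for r in responses]
print(encoded)  # [2, 4, 1, 3]

# The encoding forces every step between neighbors to be the same size.
gap_top = rank["loved"] - rank["liked a lot"]
gap_bottom = rank["liked"] - rank["tolerable"]
print(gap_top == gap_bottom)  # True
```

The final comparison is always true by construction, which is exactly the issue: the model will see equal steps whether or not the underlying feelings are equally spaced.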

For more information on how to handle cleaning ordinal features, I found “Five Ways to Analyze Ordinal Variables” by Karen Grace-Martin very helpful!

Cyclic

For these features, there is a rank/order, but it loops. Think of a circle of values.

The most common cyclic variable you’ll encounter is time. Not seconds since the epoch, but features like time of day or month of the year.

Using a feature like hour of day is very common in modeling. For example, in an electricity usage model, the hour of the day is extremely important to how much energy is being used. But this feature is often incorrectly assumed to be numeric because the hour of a day is a number, when in fact there are 24 categories of hour.

If the number is left raw, with values 0–23, it is difficult for the model to learn that 23 is as close to 0 as it is to 22. Instead, to encode cyclic variables, we transform the feature so our model can readily pick up this relationship. This is done by mapping each value to its sin and cos values on a circle.

I know, I know, very few people want to dig up that 11th grade trigonometry after graduating school. Luckily, in this case, I won’t make you memorize the unit circle. We just need to understand why and when to use it.

The figure above shows a clock very similar to the one before, but now the (sin, cos) values are listed at various points along its edge. So, we can see that hour 14 is also at (-√3/2, 0.5), where sin is the x value and cos is the y value. Hour values begin at point (1, 0) instead of (0, 1).

By replacing the feature hour with two features, hour_sin and hour_cos, we tell the model the relative distance between different hours, avoiding the 23 o’clock problem.
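A quick way to convince yourself is to measure distances between hours once they sit on the circle. The sketch below uses only Python's math module; hour_to_point and the other names are hypothetical helpers, and it assumes a 24-hour day with hours numbered 0 to 23.

```python
import math


def hour_to_point(hour, period=24):
    """Map an hour to its (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


# Raw values put 23 and 0 as far apart as possible.
raw_gap = abs(23 - 0)

# On the circle, 23 is the same distance from 0 as it is from 22.
d_23_0 = math.dist(hour_to_point(23), hour_to_point(0))
d_23_22 = math.dist(hour_to_point(23), hour_to_point(22))

# Hours on opposite sides of the clock are much farther apart.
d_23_11 = math.dist(hour_to_point(23), hour_to_point(11))

print(raw_gap, d_23_0, d_23_22, d_23_11)
```

The two neighboring distances come out equal (up to floating-point rounding), while hours 23 and 11 sit on opposite sides of the circle, a full diameter apart.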

This works for hours combined with minutes and seconds as well, because all of these values have points on the clock. Other cyclic examples are months of the year and days of the week.

To get to the sin and cos values you need, first multiply the value by (2*pi/(max_value + 1)). Values are counted from 0, so for hour of day, max_value would be 23; for month of the year it would be 11 (January is 0); and for day of the week it would be 6.

Then, you can use a Python library like math or numpy to calculate the sin and cos. See example below from David Kaleko.
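Separately from the referenced example, here is a minimal sketch of those two steps written with Python's math module. The encode_cyclic helper and the variable names are hypothetical and chosen only for illustration; the scaling factor is the 2*pi/(max_value + 1) expression from above.

```python
import math


def encode_cyclic(value, max_value):
    """Return (sin, cos) for a cyclic value counted from 0 to max_value."""
    angle = value * (2 * math.pi / (max_value + 1))
    return math.sin(angle), math.cos(angle)


# Hour of day: values 0..23, so max_value is 23.
hour_sin, hour_cos = encode_cyclic(6, max_value=23)

# Month of year: January is 0, so max_value is 11.
month_sin, month_cos = encode_cyclic(0, max_value=11)

# Day of week: values 0..6, so max_value is 6.
dow_sin, dow_cos = encode_cyclic(6, max_value=6)

print(hour_sin, hour_cos)
print(month_sin, month_cos)
print(dow_sin, dow_cos)
```

With numpy, the same expression can be applied to a whole column at once by swapping math.sin and math.cos for np.sin and np.cos.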

Wrap Up

Now you’re armed with the knowledge needed to handle those categorical values correctly! Happy modeling!