Handling rare categorical values in Pandas 🐼

Transforming rare categorical values into “Other” with a single line of code

Maya Gilad
Feb 6 · 4 min read

When working with categorical features, a common approach is to use one-hot-encoding and create a binary feature for each possible category. For example:

Applying one-hot-encoding on a categorical feature
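
As a rough illustration on toy data, pandas can produce this kind of encoding with `get_dummies`. The `color` column and its values below are made up for the example and are not part of the dataset discussed in this post.

```python
import pandas as pd

# Toy data; the "color" column is hypothetical and only serves the example.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Create one binary (0/1) column per distinct category.
encoded = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(encoded)
```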

The Problem

Take New York City’s Airbnb Open Data as an example. It has 48K apartment records in 221 neighborhoods.

If you one-hot-encode on the neighbourhood feature, you will end up with a sparse data frame similar to this one:

Tree-based algorithms, such as XGBoost, can deal with a sparse matrix as long as we have enough data for each of the features. However, in this example, many of the binary features are almost always zero. These features have too few positive examples to carry any statistical weight, so models won’t be able to learn from them reliably. Moreover, having many features increases our model’s training time.
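
One way to see how sparse such columns can get is to look at the share of rows in which each binary feature equals 1. The sketch below uses invented neighbourhood labels and counts, not the real Airbnb data.

```python
import pandas as pd

# Invented data: one dominant neighbourhood and a few rare ones.
values = ["A"] * 90 + ["B"] * 8 + ["C"] + ["D"]
df = pd.DataFrame({"neighbourhood": values})

# One 0/1 column per neighbourhood.
encoded = pd.get_dummies(df["neighbourhood"], dtype=int)

# The mean of a 0/1 column is the fraction of rows where it equals 1.
share_of_ones = encoded.mean()

print(share_of_ones)
```

Columns such as `C` and `D` are 1 in a single row only, which is the situation described above.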

How can we solve this?

One possible solution is to map rare values to “Other”.

In our case, we want to group rare neighbourhoods, but before we begin, let’s get some statistics about our dataset:

1. 50% of the neighbourhoods appear in at most 32 apartments.
Neighborhood value count statistics

2. The ten most common neighbourhoods make up about 48% of our dataset and the least frequent of them represents only ~3% of our data.

Most frequent neighbourhoods
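
Statistics like these can be derived from the value counts of the column. The following is a minimal sketch on invented data; it assumes the column is named `neighbourhood`, and `top_n` is a hypothetical variable (it would be 10 for the figures quoted above).

```python
import pandas as pd

# Invented data standing in for the real listings.
df = pd.DataFrame(
    {"neighbourhood": ["Harlem"] * 5 + ["Chelsea"] * 3 + ["SoHo"] + ["Astoria"]}
)

# Number of rows per neighbourhood, most frequent first.
counts = df["neighbourhood"].value_counts()

# Summary of the counts; the "50%" entry is the median count.
stats = counts.describe()

# Share of the dataset covered by the top_n most common neighbourhoods.
top_n = 2
top_share = counts.head(top_n).sum() / len(df)

print(stats)
print(top_share)
```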

Given this dataset, determining whether a neighbourhood is rare or not can be a bit tricky. Hand-picking a reasonable number such as 1% will result in tagging ~30% of the neighbourhoods as “Other”.

Therefore, you should infer a threshold from the data. A possible way to do this is to limit the percentage of values that will be grouped together:

Inferring a threshold from data
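
One possible reading of this idea is sketched below; it is not necessarily the code used for this post. The function name `infer_threshold` and the parameter `max_grouped_share` are assumptions made for the example. The function returns a frequency threshold such that the categories strictly below it cover at most `max_grouped_share` of the rows.

```python
import pandas as pd


def infer_threshold(values: pd.Series, max_grouped_share: float) -> float:
    """Hypothetical helper: pick a frequency threshold from the data."""
    if not 0 <= max_grouped_share < 1:
        raise ValueError("max_grouped_share must be in [0, 1)")

    # Relative frequency of each category, rarest first.
    freqs = values.value_counts(normalize=True).sort_values()

    # Share of rows covered by each category plus all rarer ones.
    cumulative = freqs.cumsum()

    # Categories that cannot be grouped without exceeding the limit.
    not_groupable = freqs[cumulative > max_grouped_share]

    # Fallback only guards against floating-point rounding in the cumsum.
    if not_groupable.empty:
        return float(freqs.max())
    return float(not_groupable.iloc[0])


# Invented data: 100 rows spread over five categories.
values = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["D"] * 6 + ["E"] * 4)
threshold = infer_threshold(values, max_grouped_share=0.15)

print(threshold)
```

With this toy series, only `D` and `E` fall below the returned threshold, and together they account for 10% of the rows, which stays within the 15% limit.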

For simplicity, we will assume that the resulting threshold is 1%.

Using the “mask” and “map” functions, you can group all rare neighborhoods under a single value “Other”. If you don’t know these functions, no worries, I’ll expand on them individually in a bit.

You can replace rare values with “Other”, e.g.:
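
A minimal sketch of such a one-liner on invented data, assuming the column is named `neighbourhood` and the threshold is 1%:

```python
import pandas as pd

# Invented data: "SoHo" appears in 0.5% of the rows.
df = pd.DataFrame(
    {"neighbourhood": ["Harlem"] * 600 + ["Chelsea"] * 395 + ["SoHo"] * 5}
)
threshold = 0.01
col = df["neighbourhood"]

# Replace every value whose relative frequency is below the threshold.
grouped = col.mask(col.map(col.value_counts(normalize=True)) < threshold, "Other")

print(grouped.value_counts(normalize=True))
```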

Which produces the following series:

Top 10 frequencies, after transformation

Let’s break it into the smaller steps:

First, you calculate each neighbourhood’s frequency using the “value_counts” method. By default, “value_counts” returns absolute counts rather than relative frequencies, so you need to call it with normalize=True or divide the output by the size of the dataset.
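
As a small sketch on invented data (the `neighbourhood` column name is assumed), both ways of getting frequencies look like this:

```python
import pandas as pd

df = pd.DataFrame({"neighbourhood": ["Harlem"] * 6 + ["Chelsea"] * 3 + ["SoHo"]})

# Default behaviour: absolute number of rows per neighbourhood.
counts = df["neighbourhood"].value_counts()

# Option 1: let pandas normalize the counts into relative frequencies.
freqs = df["neighbourhood"].value_counts(normalize=True)

# Option 2: divide the counts by the size of the dataset.
freqs_by_division = counts / len(df)

print(counts)
print(freqs)
```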

Then, you can use “map” to replace each neighbourhood with its corresponding frequency:
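
A sketch of this step on invented data, again assuming a `neighbourhood` column. Passing a Series to `map` looks up each value in that Series’ index:

```python
import pandas as pd

df = pd.DataFrame({"neighbourhood": ["Harlem"] * 6 + ["Chelsea"] * 3 + ["SoHo"]})

# Relative frequency per neighbourhood, indexed by neighbourhood name.
freqs = df["neighbourhood"].value_counts(normalize=True)

# One frequency per row, aligned with the original column.
row_freqs = df["neighbourhood"].map(freqs)

print(row_freqs)
```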

Finally, you use “mask” to replace the neighbourhood value wherever the corresponding frequency is lower than your desired threshold (1% in this example):
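
Putting the last step together on invented data (column name and counts are assumptions for the example), `mask` replaces the values for which the condition is True and keeps the rest unchanged:

```python
import pandas as pd

# Invented data: "SoHo" appears in 1 of 200 rows (0.5%).
df = pd.DataFrame(
    {"neighbourhood": ["Harlem"] * 120 + ["Chelsea"] * 79 + ["SoHo"]}
)
threshold = 0.01

# Per-row frequency of the row's neighbourhood.
freqs = df["neighbourhood"].value_counts(normalize=True)
row_freqs = df["neighbourhood"].map(freqs)

# Replace values where the condition holds; keep the others as they are.
df["neighbourhood_grouped"] = df["neighbourhood"].mask(row_freqs < threshold, "Other")

print(df["neighbourhood_grouped"].value_counts())
```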

Wrapping up

There isn’t a single solution for handling rare values. In this post, we’ve demonstrated one simple way to tackle the problem. It might not be a good fit for your case, but it’s a one-liner that you can easily integrate into your code, then retrain your model and reevaluate its performance.

Hope you enjoyed the read, see you next time.

Gett Engineering

Code, stories, tips, thoughts, experimentations from the day-to-day work of our R&D team.

Thanks to Shai Mishali

Maya Gilad

Written by

An algorithm developer, puzzle solver and a geek
