Handling rare categorical values in Pandas 🐼

Transforming rare categorical values into “Other” with a single line of code

Maya Gilad
Feb 6, 2020 · 4 min read

When working with categorical features, a common approach is to use one-hot-encoding and create a binary feature for each possible category. For example:

Applying one-hot-encoding on a categorical feature
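
As a minimal sketch of what this looks like in Pandas, here is one-hot encoding with get_dummies. The "color" column and its values are toy data made up for this illustration.

```python
import pandas as pd

# Toy data, made up for illustration only
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per distinct category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
```

Each row has exactly one column set to 1, the one matching its original category.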

The Problem

Take New York City’s Airbnb Open Data as an example. It has 48K apartment records in 221 neighborhoods.

If you one-hot-encode the neighbourhood feature, you will end up with a sparse data frame that has one binary column per neighbourhood, 221 in total.
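
To get a feel for how sparse such a frame is, you can check how often each binary column is set. The sketch below uses a tiny stand-in series: the neighbourhood names are real, but the counts are made up and are not taken from the Airbnb dataset.

```python
import pandas as pd

# Toy stand-in for the neighbourhood column: two common values, three rare ones
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

one_hot = pd.get_dummies(neighbourhood, dtype=int)

# Share of rows in which each binary column equals 1, rarest first
share_of_ones = one_hot.mean().sort_values()
print(share_of_ones)
```

Columns at the top of this output are set in very few rows, which is exactly the situation described next.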


Tree-based algorithms, such as XGBoost, can deal with a sparse matrix as long as we have enough data for each of the features. However, in this example, many of the binary features are almost always zero. These features carry very little signal, so a model is unlikely to learn anything reliable from them. Moreover, having many features will increase our model’s training time.

How can we solve this?

One possible solution is to map rare values to “Other”.

In our case, we want to group rare neighbourhoods. Before we begin, let’s get some statistics about our dataset:

1. 50% of the neighbourhoods have 32 apartments or fewer.

Neighborhood value count statistics

2. The ten most common neighbourhoods make up about 48% of our dataset and the least frequent of them represents only ~3% of our data.

Most frequent neighbourhoods
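
Both statistics can be computed with value_counts. Here is a rough sketch on the same kind of toy series (made-up counts, so the numbers will not match the ones quoted above). The top_n variable is a name chosen for this example; the post looks at the top ten.

```python
import pandas as pd

# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

# 1. Distribution of apartments per neighbourhood; the "50%" row is the median
counts = neighbourhood.value_counts()
print(counts.describe())

# 2. Share of the dataset covered by the most common neighbourhoods
top_n = 2  # small because the toy data has only 5 neighbourhoods
top_share = neighbourhood.value_counts(normalize=True).head(top_n)
print(top_share)
print(top_share.sum())
```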

Given this dataset, determining whether a neighbourhood is rare or not can be a bit tricky. Since half of the neighbourhoods have 32 apartments or fewer, hand-picking a reasonable-looking number such as 1% will result in tagging ~30% of the records as “Other”.
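
Before committing to a hand-picked number, it helps to measure what it would actually group. A small sketch, again on toy data with made-up counts; the 10% threshold is chosen only because the toy series is tiny.

```python
import pandas as pd

# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

freq = neighbourhood.value_counts(normalize=True)

threshold = 0.10  # hand-picked for the toy data; the post discusses 1%
rare = freq[freq < threshold]

share_of_neighbourhoods = len(rare) / len(freq)  # distinct values that are rare
share_of_rows = rare.sum()                       # rows that would become "Other"
print(share_of_neighbourhoods, share_of_rows)
```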

Therefore, you should infer a threshold from the data. A possible way to do this is to limit the percentage of values that will be grouped together:

Inferring a threshold from data
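
One possible reading of this idea, as a sketch rather than a definitive implementation: cap the share of rows that may end up grouped, then pick the largest threshold that respects the cap. The function name infer_threshold and the parameter max_other_share are hypothetical names chosen for this example, and the data is a toy series with made-up counts.

```python
import pandas as pd


def infer_threshold(values: pd.Series, max_other_share: float) -> float:
    """Return a frequency threshold such that the values with a frequency
    strictly below it cover at most `max_other_share` of the rows."""
    # Frequencies sorted from rarest to most common
    freq = values.value_counts(normalize=True).sort_values()
    # Share of rows covered by each value together with all rarer ones
    cumulative = freq.cumsum()
    # Values that would push the grouped share over the cap must be kept
    kept = freq[cumulative > max_other_share]
    # The smallest kept frequency is the threshold; if nothing has to be
    # kept, every value may be grouped
    return float(kept.iloc[0]) if len(kept) else float("inf")


# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

threshold = infer_threshold(neighbourhood, max_other_share=0.25)
print(threshold)
```

When several values share the frequency at the cut-off, this sketch keeps all of them, so the grouped share can end up below the cap but never above it.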

For simplicity, we will assume that the resulting threshold is 1%.

Using the “mask” and “map” functions, you can group all rare neighborhoods under a single value “Other”. If you don’t know these functions, no worries, I’ll expand on them individually in a bit.

You can replace rare values with “Other”, e.g.:
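
Here is a sketch of what that single line could look like, pieced together from the mask, map and value_counts steps described in this post. The data frame is toy data with made-up counts, and the 10% threshold is used only because the toy data is tiny (the post assumes 1%).

```python
import pandas as pd

# Toy data frame (counts are made up)
df = pd.DataFrame({
    "neighbourhood": ["Williamsburg"] * 6 + ["Harlem"] * 4
    + ["Astoria", "Chelsea", "Flatbush"]
})

threshold = 0.10  # toy value; the post assumes 1%

# Replace every neighbourhood whose relative frequency is below the threshold
df["neighbourhood"] = df["neighbourhood"].mask(
    df["neighbourhood"].map(df["neighbourhood"].value_counts(normalize=True)) < threshold,
    "Other",
)

print(df["neighbourhood"].value_counts(normalize=True))
```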

Which produces the following series:

Top 10 frequencies, after transformation

Let’s break it into smaller steps:

First, you will calculate each neighbourhood’s frequency using the “value_counts” method. By default, “value_counts” returns the absolute count of each value rather than its relative frequency, so you will need to call it with normalize=True or divide the output by the size of the dataset.
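
A short sketch of this first step on a toy series (made-up counts), showing the default counts next to both ways of getting frequencies:

```python
import pandas as pd

# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

counts = neighbourhood.value_counts()              # absolute counts (default)
freq = neighbourhood.value_counts(normalize=True)  # relative frequencies
same = counts / len(neighbourhood)                 # equivalent manual division

print(counts)
print(freq)
```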


Then, you can use “map” to replace each neighbourhood with its corresponding frequency:
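
As a sketch on the same kind of toy series (made-up counts): when “map” is given a Series, it looks each value up in that Series’ index, so every row gets the frequency of its own neighbourhood.

```python
import pandas as pd

# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

freq = neighbourhood.value_counts(normalize=True)

# One frequency per row, aligned with the original series
row_freq = neighbourhood.map(freq)
print(row_freq)
```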


Finally, you use “mask” to replace the neighbourhood value wherever the corresponding frequency is lower than your desired threshold (1% in this example):
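
A sketch of this last step, once more on toy data with made-up counts. The 10% threshold is only there to fit the tiny toy series. “mask” keeps the original value where the condition is False and substitutes the replacement where it is True.

```python
import pandas as pd

# Toy stand-in for the neighbourhood column (counts are made up)
neighbourhood = pd.Series(
    ["Williamsburg"] * 6 + ["Harlem"] * 4 + ["Astoria", "Chelsea", "Flatbush"],
    name="neighbourhood",
)

threshold = 0.10  # toy value; the post uses 1%

# Per-row frequency, as computed in the previous step
row_freq = neighbourhood.map(neighbourhood.value_counts(normalize=True))

# Rows whose frequency is below the threshold become "Other"
grouped = neighbourhood.mask(row_freq < threshold, "Other")
print(grouped.value_counts())
```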


Wrapping up

There isn’t a single solution for handling rare values. In this post, we’ve chosen to demonstrate a simple way in which you can try to tackle this problem. It might not be a good fit for your problem, but it’s a one-liner that you can easily integrate into your code before retraining your model and re-evaluating its performance.

Hope you enjoyed the read, see you next time.

Gett Engineering
