Handling rare categorical values in Pandas 🐼
Transforming rare categorical values into “Other” with a single line of code
When working with categorical features, a common approach is to use one-hot encoding and create a binary feature for each possible category.
Take New York City’s Airbnb Open Data as an example. It has 48K apartment records in 221 neighborhoods.
If you one-hot-encode on the neighbourhood feature, you will end up with a sparse data frame similar to this one:
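A minimal sketch of that encoding, using a toy stand-in for the Airbnb data (the column and neighbourhood names here are just illustrative):

```python
import pandas as pd

# Toy stand-in for the Airbnb data -- the real dataset has ~48K rows
# and 221 distinct neighbourhoods
df = pd.DataFrame({
    "neighbourhood": ["Williamsburg", "Harlem", "Williamsburg",
                      "Midtown", "Harlem", "Bushwick"]
})

# One binary column per category; with 221 neighbourhoods,
# almost every entry in a row is zero
one_hot = pd.get_dummies(df["neighbourhood"], prefix="neighbourhood")
print(one_hot.shape)  # (6, 4): six rows, one column per distinct value
```

Each row has exactly one non-zero entry, which is what makes the full 221-column frame so sparse.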
Tree-based algorithms, such as XGBoost, can deal with a sparse matrix as long as we have enough data for each of the features. However, in this example, many of the binary features are almost always zero. These features carry negligible statistical signal, so models won’t be able to learn anything useful from them. Moreover, having many features will increase our model’s training time.
How can we solve this?
One possible solution is to map rare values to a single shared category such as “Other”.
In our case, we want to group rare neighbourhoods, but before we begin, let’s get some statistics about our dataset:
1. 50% of the neighbourhoods appear in at most 32 apartments.
2. The ten most common neighbourhoods make up about 48% of our dataset, and the least frequent of them represents only ~3% of our data.
Given this dataset, determining whether a neighbourhood is rare or not can be a bit tricky. Hand-picking a reasonable number such as 1% will result in tagging ~30% of the neighbourhoods as “Other”.
Therefore, you should infer a threshold from the data. A possible way to do this is to limit the percentage of values that will be grouped together:
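One way to sketch that idea, with a helper name and toy data of my own (the original post embedded its own snippet here): pick the largest frequency cutoff such that the categories at or below it together cover at most a chosen fraction of the rows.

```python
import pandas as pd

def infer_threshold(series: pd.Series, max_grouped_frac: float) -> float:
    """Largest frequency cutoff such that grouping every category with
    frequency <= the returned value covers at most `max_grouped_frac`
    of the rows. (Hypothetical helper, not from the original post.)"""
    freqs = series.value_counts(normalize=True).sort_values()
    covered = freqs.cumsum()  # fraction of rows grouped if we cut here
    eligible = freqs[covered <= max_grouped_frac]
    return float(eligible.max()) if not eligible.empty else 0.0

# Toy data: 100 rows, two genuinely rare values ("d" and "e")
s = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 4 + ["e"])
print(infer_threshold(s, max_grouped_frac=0.06))  # 0.04 -> groups "d" and "e"
```

With a 6% budget, the cutoff lands at 0.04: grouping “d” (4%) and “e” (1%) affects 5% of the rows, while also grouping “c” (15%) would blow past the budget.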
For simplicity, we will assume the resulting threshold is 1%.
Using Pandas’ “value_counts”, “map” and “mask” functions, you can group all rare neighbourhoods under a single value: “Other”. If you don’t know these functions, no worries, I’ll expand on them individually in a bit.
You can replace rare values with “Other” in a single expression, producing a series in which every rare neighbourhood has been grouped together.
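A sketch of such a one-liner, using toy data in place of the real Airbnb frame (the names and counts are illustrative):

```python
import pandas as pd

# Toy stand-in: 150 rows, with "Sea Gate" appearing only once (~0.7%)
s = pd.Series(["Williamsburg"] * 100 + ["Harlem"] * 49 + ["Sea Gate"],
              name="neighbourhood")

# The one-liner: replace every value whose relative frequency
# falls below the 1% threshold with "Other"
grouped = s.mask(s.map(s.value_counts(normalize=True)) < 0.01, "Other")

print(grouped.value_counts().to_dict())
# {'Williamsburg': 100, 'Harlem': 49, 'Other': 1}
```

The frequent neighbourhoods pass through untouched, while the lone “Sea Gate” record is folded into “Other”.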
Let’s break it into smaller steps:
First, you will calculate each neighbourhood’s frequency using the “value_counts” method. The default behavior of “value_counts” is to return absolute counts rather than relative frequencies, so you will need to call it with normalize=True or divide the output by the size of the dataset.
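For instance, on a small made-up series:

```python
import pandas as pd

s = pd.Series(["Harlem", "Harlem", "Harlem", "Midtown"])

# Default: absolute counts
print(s.value_counts().to_dict())                # {'Harlem': 3, 'Midtown': 1}

# normalize=True: relative frequencies
print(s.value_counts(normalize=True).to_dict())  # {'Harlem': 0.75, 'Midtown': 0.25}
```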
Then, you can use “map” to replace each neighbourhood with its corresponding frequency:
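Sketching that step on the same kind of toy series:

```python
import pandas as pd

s = pd.Series(["Harlem", "Harlem", "Harlem", "Midtown"])
freqs = s.value_counts(normalize=True)  # Harlem: 0.75, Midtown: 0.25

# Each element is looked up in `freqs`, so the result holds frequencies
print(s.map(freqs).tolist())  # [0.75, 0.75, 0.75, 0.25]
```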
Finally, you use “mask” to replace the neighbourhood value wherever the corresponding frequency is lower than your desired threshold (in this example, 1%):
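On toy data it looks like this (using a 30% cutoff so the four-row example actually contains a rare value; on the real dataset this would be the 1% threshold):

```python
import pandas as pd

s = pd.Series(["Harlem", "Harlem", "Harlem", "Midtown"])
freqs = s.map(s.value_counts(normalize=True))  # [0.75, 0.75, 0.75, 0.25]

# Wherever the condition holds, "mask" swaps in the replacement value
print(s.mask(freqs < 0.30, "Other").tolist())
# ['Harlem', 'Harlem', 'Harlem', 'Other']
```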
There isn’t a single solution for handling rare values. In this post, we’ve chosen to demonstrate a simple way in which you can try to tackle this problem. It might not be a good fit for your problem, but it’s a one-liner that you can easily integrate into your code, retrain your model with, and then reevaluate its performance.
Hope you enjoyed the read, see you next time.