Target Encoding with RAPIDS cuML: Do More with Your Categorical Data

Jiwei Liu
Published in RAPIDS AI
Sep 10, 2020

Have you ever been puzzled by how to encode categorical variables? Most machine learning libraries require a numeric representation of the data for training. If the columns of data contain strings or are categorical in nature, they need to be transformed into numeric form. The most popular encoding methods are LabelEncoder and OneHotEncoder. Unfortunately, neither of them is ideal. LabelEncoder arbitrarily maps each value in a column to an integer, which can be misinterpreted as an ordering by downstream classifiers. OneHotEncoder converts each category value into a new binary column (True/False). The downside is that it adds a large number of new columns to the dataset, slowing down the training pipeline; the resulting high dimensionality also makes it harder for downstream classifiers to fit.
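To make the contrast concrete, here is a plain-Python sketch (illustrative only, not library code) of what the two encoders produce for a toy column:

```python
# A toy categorical column.
colors = ["red", "green", "blue", "green"]

# LabelEncoder: an arbitrary integer per category, which implies a
# false ordering ("blue" < "green" < "red").
mapping = {c: i for i, c in enumerate(sorted(set(colors)))}
label_encoded = [mapping[c] for c in colors]
print(label_encoded)  # [2, 1, 0, 1]

# OneHotEncoder: one binary column per category, so the number of
# columns grows with the cardinality of the input.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

With tens of thousands of distinct values per column, the one-hot matrix above would have tens of thousands of columns, which is exactly the dimensionality problem described here.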

Recently, a new encoding method, Target Encoding, has emerged as both effective and efficient in many data science projects. For example, it is a major component of our RAPIDS team’s winning solution to the RecSys Challenge 2020. The idea is to calculate the mean of the target column “y” (the column to predict) for each distinct value in the categorical column “x”, and then replace each value in “x” with the associated mean. This is why it is also known as “mean target encoding”. If done naively, this approach leads to “target leakage”: information from the y values contaminates the x predictors, causing both overfitting and overconfidence in the results. Careful implementation can prevent this leakage, but the required bookkeeping makes most existing implementations complicated and cumbersome.

We are happy to announce that TargetEncoder has come to RAPIDS cuML 0.15 with a simple scikit-learn transformer-style API and a 100x speedup on a single GPU. In this blog, we discuss how to use the new TargetEncoder in cuML and show that it can boost both accuracy and performance significantly. The full notebook can be found in cuML’s repo.

A Case Study

We show the usage of cuML’s TargetEncoder on a real-world dataset. The Criteo 1TB benchmark is a well-known dataset for click-through rate prediction with high cardinality categorical columns. Encoding such columns is critical to achieving a high prediction accuracy. The original dataset includes 10 days of online advertising logs, 13 numerical columns and 26 categorical columns, and 1 Terabyte in total. The label, click_or_not, is binary. For the sake of simplicity, we only use the first three categorical columns as features.

The categorical columns are given in hash codes for anonymity. We sample 2 million rows each for the training and validation datasets.

Each column has a lot of unique values: tens of thousands! One-hot encoding these columns results in large, sparse data and slows down training significantly. We compare TargetEncoder against the simple cuML LabelEncoder, both of which produce dense encodings. The encoded features are used to train an XGBoost model with fixed hyperparameters, and the validation ROC AUC is used to compare the two encoding methods.

Simple pipeline to compare TargetEncoder against LabelEncoder

Basic Usage

The TargetEncoder follows the convention of the cuML/sklearn data transformer API, with methods such as fit, fit_transform, and transform, which makes it effortless to integrate into a cuML/sklearn ML pipeline. We use the TargetEncoder to process the categorical values in the Criteo data.

There are several observations worth noting:

  • The fit method requires two input arrays: x and y, where x is the column to encode and y is the target column.
  • The output of TargetEncoder is dense and meaningful. It returns one output column for each input column. The output value represents the mean target of each input value.
  • It is fast! Fitting 2 million rows and transforming 4 million rows take less than 1 second! We will show a detailed speedup analysis later.
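As a minimal sketch of this API on a toy cudf DataFrame (the import path follows the current cuML documentation; the exact code for the Criteo data is in the notebook):

```python
import cudf
from cuml.preprocessing import TargetEncoder

train = cudf.DataFrame({"cat": ["a", "b", "a", "b"], "label": [1, 0, 1, 1]})
valid = cudf.DataFrame({"cat": ["a", "b", "a", "c"]})

encoder = TargetEncoder()
# fit_transform takes both x and y: it learns the per-category target
# means and returns leakage-free encodings for the training rows.
train_encoded = encoder.fit_transform(train["cat"], train["label"])
# transform applies the learned means to new data; unseen values such
# as "c" are handled automatically.
valid_encoded = encoder.transform(valid["cat"])
```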

Advanced Features

The most common mistake when implementing TargetEncoder naively is target leakage. For example, the following code calculates the mean target values directly from the training data and uses them to encode both training and validation data.

Naive implementation. Note how it is different from our TargetEncoder’s output.

The problem is that the target label of a sample is directly used to create features for that sample in the training set, so the downstream classifier becomes overly confident about these target encodings. This leads to serious overfitting. Instead, our TargetEncoder implements a cross-calculation internally, as shown below.

We implement a cross-calculation method to prevent target leakage.

We split the data into n_folds folds and encode each fold using the target values of the remaining folds. This ensures that the target (y_i) for a given observation i is never used to encode the x_i for the same observation, so the link between x_i and y_i is not leaked. You can also select a different split_method based on the nature of the data. A larger n_folds usually improves the predictive power of the encoding at the cost of longer execution time. The execution time is linearly proportional to n_folds if each fold is computed sequentially in a for loop. To avoid such serial execution, we calculate the encodings of all folds in parallel and fully utilize the computing power of the GPU.
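The cross-calculation can be sketched in plain Python (an illustrative sequential re-implementation, not cuML's parallel GPU code; the interleaved fold assignment is one possible choice of split_method):

```python
def oof_target_encode(x, y, n_folds=2):
    """Out-of-fold target encoding: each fold is encoded using only the
    target values from the *other* folds."""
    n = len(x)
    folds = [i % n_folds for i in range(n)]  # interleaved fold assignment
    global_mean = sum(y) / n
    encoded = [0.0] * n
    for f in range(n_folds):
        # Accumulate sums and counts over all rows outside fold f.
        sums, counts = {}, {}
        for i in range(n):
            if folds[i] != f:
                sums[x[i]] = sums.get(x[i], 0) + y[i]
                counts[x[i]] = counts.get(x[i], 0) + 1
        # Encode the rows inside fold f, falling back to the global
        # mean for categories unseen outside the fold.
        for i in range(n):
            if folds[i] == f:
                encoded[i] = (sums[x[i]] / counts[x[i]]
                              if x[i] in counts else global_mean)
    return encoded

# Rows 0 and 2 are encoded from the labels of rows 1 and 3, and vice versa.
print(oof_target_encode(["a", "a", "a", "a"], [1, 0, 1, 0], n_folds=2))
# → [0.0, 1.0, 0.0, 1.0]
```

Note how no row's encoding depends on its own label, unlike the naive version above.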

In summary, cuML’s TargetEncoder offers the following advanced features.

  • Fast parallel cross-calculation of the target values where n_folds and split_method are configurable.
  • Multi-column joint encoding.
  • Smoothing of the encoding based on the count of the input value, controlled by hyperparameter smooth.
  • Unlike LabelEncoder, TargetEncoder is capable of encoding unknown values automatically, which means it can be used to transform streams of new data.

We provide a walkthrough notebook to go through these advanced features in detail.
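For example, count-based smoothing typically blends each category's mean with the global mean in proportion to how often the category was seen (this is the standard formulation; cuML's exact formula may differ in details):

```python
def smoothed_mean(cat_sum, cat_count, global_mean, smooth):
    """Pull the per-category mean toward the global mean; rare
    categories are pulled harder than frequent ones."""
    return (cat_sum + smooth * global_mean) / (cat_count + smooth)

# A category seen once with y=1, against a global mean of 0.25:
print(smoothed_mean(1, 1, 0.25, smooth=0))  # 1.0 (no smoothing)
print(smoothed_mean(1, 1, 0.25, smooth=3))  # (1 + 0.75) / 4 = 0.4375
```

This prevents rare categories, whose raw means are noisy, from dominating the encoding.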

Caveat: Split After Fit Leads to Overfitting

Unlike LabelEncoder or OneHotEncoder, TargetEncoder is a supervised encoding method, which requires the target variable in TargetEncoder.fit. While the cross-calculation avoids direct target leakage, indirect target leakage can still occur if the validation/test data’s target is seen during TargetEncoder.fit. The most common mistake is splitting the data into train and test after TargetEncoder.fit.

To avoid indirect target leakage, always fit the TargetEncoder after splitting the data.
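The difference is easy to see with a toy example in plain Python, where "fitting" just means computing a target mean:

```python
# Four rows of a single category; the last two will become the test set.
x = ["a", "a", "a", "a"]
y = [1, 1, 0, 0]

# WRONG: fit on everything, then split. The test labels (rows 2-3)
# influence the encoding that the training rows receive.
mean_all = sum(y) / len(y)  # 0.5, computed using test labels

# RIGHT: split first, then fit only on the training portion.
train_y, test_y = y[:2], y[2:]
mean_train = sum(train_y) / len(train_y)  # 1.0, test labels never seen
```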

Performance

We evaluate the performance of the RAPIDS cuML TargetEncoder in terms of validation AUC and speedup. We compare different encoding methods, where the encoded features are sent to XGBoost for training and validation. On the Criteo sample dataset (4 million rows, 3 columns), our TargetEncoder improves validation AUC from 0.64 to 0.7, a dramatic lift from simply changing the way categorical values are encoded! Thanks to the cross-calculation method, we avoid the overfitting seen with the naive implementation.

Encoded features are sent to XGBoost for training and validation

We measure the speedup using synthetic random data with 1 column and an increasing number of rows. We compare three implementations: a pandas-based implementation on CPU, a GPU implementation with a sequential for loop, and the fully parallel optimized GPU implementation (the cuML version). On a single GPU, cuML's TargetEncoder achieves up to a 100x speedup over the pandas-based implementation and a 4x speedup over the for-loop-based GPU implementation.

Conclusion and Future Work

We implemented a fast TargetEncoder on GPU with many built-in optimizations and advanced features. Please try RAPIDS cuML 0.15 and the detailed walkthrough notebook. The next milestone is to bring multi-GPU support, and you can find a prototype in our RecSys 2020 winning solution.
