Building a One Hot Encoding Layer with TensorFlow
How to create a Custom Neural Network layer to One Hot Encode categorical input features in TensorFlow
One Hot Encoding is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category. It’s easier to understand visually: in the example below, we One Hot Encode a color feature which consists of three categories (red, green, and blue).
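As a quick illustration of that mapping (using pandas’ get_dummies purely for demonstration; the actual approach discussed below is different):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue"])

# Each category becomes its own binary column,
# and each row has exactly one "hot" value
print(pd.get_dummies(colors))
```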
Scikit-learn offers the OneHotEncoder class out of the box to handle categorical inputs using One Hot Encoding. Simply create an instance of sklearn.preprocessing.OneHotEncoder, then fit the encoder on the input data (this is where the One Hot Encoder identifies the possible categories in the DataFrame and updates some internal state, allowing it to map each category to a unique binary feature), and finally call one_hot_encoder.transform() to One Hot Encode the input DataFrame. The great thing about the OneHotEncoder class is that, once it has been fit on the input features, you can continue to pass it new samples, and it will encode the categorical features consistently.
One Hot Encoding for TensorFlow Models
Recently, I was working with some categorical features that were being passed as inputs to a TensorFlow model, so I decided to try and find a “TensorFlow-native” way of One Hot Encoding.
After a lot of searching, I mostly came across two suggestions for how to do this:
Just use Scikit-learn’s OneHotEncoder
We already know it works, and One Hot Encoding is One Hot Encoding, right? So why bother doing it with TensorFlow?
While this is a valid suggestion that works well for simple examples and demonstrations, it can lead to some complications in scenarios where you plan to deploy your model as a service so that it can perform inference in a production environment.
To take a step back, one of the big benefits of using OneHotEncoder with Scikit-learn models is that you can include it, along with the model itself, as a step in a Scikit-learn Pipeline, essentially bundling One Hot Encoding (and potentially other preprocessing) logic and inference logic into a single deployable artifact.
So back to our TensorFlow scenario: if you were to use OneHotEncoder to preprocess input features for a TensorFlow model, you would have some additional complexity to deal with, because you would either have to:
- Duplicate the One Hot Encoding logic anywhere that the model is used for inference.
- Or, deploy both the fitted OneHotEncoder and the trained TensorFlow model as separate artifacts, and then ensure that they are used properly and kept in sync by all applications that use the model.
Use the tf.one_hot Operation
This was the other suggestion I came across. The tf.one_hot operation takes a list of category indices and a depth (for our purposes, essentially the number of unique categories) and outputs a One Hot Encoded Tensor.
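A minimal illustration of the operation (the indices and depth here are illustrative values):

```python
import tensorflow as tf

# Category indices for three samples, and depth = number of unique categories
encoded = tf.one_hot([0, 1, 2], depth=3)
print(encoded)
# index 0 -> [1, 0, 0], index 1 -> [0, 1, 0], index 2 -> [0, 0, 1]
```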
You’ll notice a few key differences, though, between OneHotEncoder and tf.one_hot in the example above.
- First, tf.one_hot is simply an operation, so we’ll need to create a Neural Network layer that uses this operation in order to include the One Hot Encoding logic with the actual model prediction logic.
- Second, instead of passing in the string categories (red, blue, green), we’re passing in a list of integers. This is because tf.one_hot does not accept the categories themselves, but instead accepts a list of indices for the One Hot Encoded features (notice that category index 0 maps to a 1x3 list where column 0 has value 1 and the others have value 0).
- Third, we have to pass in a unique category count (or depth). This value determines the number of columns in the resulting One Hot Encoded Tensor.
So, in order to include One Hot Encoding logic as part of a TensorFlow model, we’ll need to create a custom layer that converts string categories into category indices, determines the number of unique categories in our input data, then uses the tf.one_hot operation to One Hot Encode the categorical features. We’ll do all of this next.
Creating a Custom Layer
The first order of business is to convert string categories into integer indices in a way that is consistent (e.g. the string blue should always be converted to the same index).
Enter TextVectorization
The experimental TextVectorization layer can be used to standardize and tokenize sequences of strings, such as sentences, but for our use case, we’ll simply convert individual string categories into integer indices.
We specify output_sequence_length=1 when creating the layer because we only want a single integer index for each category passed into the layer. Calling the adapt() method fits the layer to the dataset, similar to calling fit() on the OneHotEncoder. After the layer has been fit, it maintains an internal vocabulary of unique categories and maps them consistently to integer indices. You can view the layer’s vocabulary by calling get_vocabulary() after it has been fit.
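A minimal sketch of that mapping (in recent TensorFlow releases the layer has graduated out of the experimental namespace to tf.keras.layers.TextVectorization; the color values are illustrative):

```python
import tensorflow as tf

colors = tf.constant([["red"], ["green"], ["blue"], ["green"]])

# output_sequence_length=1: one integer index per input category
vectorization = tf.keras.layers.TextVectorization(output_sequence_length=1)
vectorization.adapt(colors)  # fits the layer, building the vocabulary

print(vectorization.get_vocabulary())  # includes reserved '' and '[UNK]' tokens
print(vectorization(colors))           # one integer index per sample
```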
The OneHotEncodingLayer Class
Finally, we can create the class that will represent a One Hot Encoding Layer in a neural network.
The class inherits from PreprocessingLayer so that it inherits the base adapt() method. When this layer is initialized, a TextVectorization layer is also initialized, and when adapt() is called, the TextVectorization layer is fit on the input data and two class attributes are set:
- self.depth is the number of unique categories in the input data. This value is used when calling tf.one_hot to determine the number of resulting binary features.
- self.minimum is the minimum index output by the TextVectorization layer. This value is subtracted from the indices at runtime to ensure that the indices passed to tf.one_hot fall in the range [0, self.depth-1] (e.g. if the TextVectorization layer outputs values in the range [2, 4], we subtract 2 from each value so that the resulting indices fall in the range [0, 2]).
The get_config() method allows TensorFlow to save the state of the layer when the model is saved to disk. The values from the layer’s config will be passed to the layer’s __init__() method when the model is loaded into memory. Notice that we’re explicitly setting the vocabulary, depth, and minimum whenever these values are passed in.
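A sketch of the layer consistent with the description above (this version subclasses tf.keras.layers.Layer and defines adapt() itself, rather than the experimental PreprocessingLayer base, to stay portable across TensorFlow versions; details may differ from the original notebook):

```python
import tensorflow as tf

class OneHotEncodingLayer(tf.keras.layers.Layer):
    """One Hot Encodes string categories via TextVectorization + tf.one_hot."""

    def __init__(self, vocabulary=None, depth=None, minimum=None, **kwargs):
        super().__init__(**kwargs)
        self.vectorization = tf.keras.layers.TextVectorization(
            output_sequence_length=1)
        # When re-created from a saved config, restore the fitted state
        if vocabulary is not None:
            self.vectorization.set_vocabulary(vocabulary)
        self.depth = depth
        self.minimum = minimum

    def adapt(self, data):
        self.vectorization.adapt(data)
        # Skip the reserved padding ('') and OOV ('[UNK]') tokens
        vocab = [v for v in self.vectorization.get_vocabulary()
                 if v not in ("", "[UNK]")]
        self.depth = len(vocab)  # number of unique categories
        indices = self.vectorization(tf.constant([[v] for v in vocab]))
        self.minimum = int(tf.reduce_min(indices))  # smallest emitted index

    def call(self, inputs):
        indices = self.vectorization(inputs)       # shape (batch, 1)
        shifted = indices - self.minimum           # shift into [0, depth-1]
        one_hot = tf.one_hot(shifted, self.depth)  # shape (batch, 1, depth)
        return tf.reshape(one_hot, (-1, self.depth))

    def get_config(self):
        # Enough state to rebuild the fitted layer at load time
        return {"vocabulary": [v for v in self.vectorization.get_vocabulary()
                               if v not in ("", "[UNK]")],
                "depth": self.depth,
                "minimum": self.minimum}

colors = tf.constant([["red"], ["green"], ["blue"], ["green"]])
layer = OneHotEncodingLayer()
layer.adapt(colors)
print(layer(colors))  # rows of 3 binary features; the two 'green' rows match
```

Because get_config() returns the vocabulary, depth, and minimum, a model containing this layer can be saved and reloaded with its fitted state intact (pass the class via custom_objects when loading).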
Using the Custom Layer
Now we can try the new layer out in a simple Neural Network.
This simple network just accepts a categorical input, One Hot Encodes it, then concatenates the One Hot Encoded features with the numeric input feature. Notice I’ve added a numeric id column to the DataFrame to illustrate how to split categorical inputs from numeric inputs.
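A sketch of that network (the color and id values are illustrative; the OneHotEncodingLayer here is a condensed version of the layer described in the previous section, with the config methods omitted for brevity):

```python
import tensorflow as tf

# Condensed OneHotEncodingLayer (see the previous section for the full sketch)
class OneHotEncodingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.vectorization = tf.keras.layers.TextVectorization(
            output_sequence_length=1)
        self.depth = None
        self.minimum = None

    def adapt(self, data):
        self.vectorization.adapt(data)
        vocab = [v for v in self.vectorization.get_vocabulary()
                 if v not in ("", "[UNK]")]
        self.depth = len(vocab)
        self.minimum = int(tf.reduce_min(
            self.vectorization(tf.constant([[v] for v in vocab]))))

    def call(self, inputs):
        shifted = self.vectorization(inputs) - self.minimum
        return tf.reshape(tf.one_hot(shifted, self.depth), (-1, self.depth))

# Toy input data: a categorical 'color' column and a numeric 'id' column
color_values = tf.constant([["red"], ["green"], ["blue"]])
id_values = tf.constant([[1.0], [2.0], [3.0]])

# Separate inputs for the categorical and numeric features
categorical_input = tf.keras.Input(shape=(1,), dtype=tf.string)
numeric_input = tf.keras.Input(shape=(1,), dtype=tf.float32)

encoder = OneHotEncodingLayer()
encoder.adapt(color_values)

# One Hot Encode the categorical input, then concatenate with the numeric input
encoded = encoder(categorical_input)
concatenated = tf.keras.layers.Concatenate()([encoded, numeric_input])
model = tf.keras.Model(inputs=[categorical_input, numeric_input],
                       outputs=concatenated)

print(model([color_values, id_values]))  # 3 one-hot columns + the numeric id
```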
And that’s it! We now have a working Neural Network layer that can One Hot Encode categorical features! We can also save this model as a JSON config file, deploy it, and reload it into memory to perform inference. Notice that it One Hot Encodes the color category in the same way as before, so we know that the subsequent layers of our model will be provided the same features in the same order as they appeared during training.
You can find a notebook containing all of the code examples here: https://github.com/gnovack/tf-one-hot-encoder/blob/master/OneHotEncoderLayer.ipynb
Thanks for reading! Feel free to leave any questions or comments below.