Tensorflow 2.0 Tutorial on Categorical Features Embedding

A comprehensive guide to categorical features embedding

Oussama Errabia
Analytics Vidhya
7 min read · Aug 13, 2019


Introduction:

It is well known that data preparation may represent up to 80% of the time required to deliver a real-world ML product. Working with categorical features, in particular, can be tricky and time consuming, especially with high-cardinality data. When a feature has more than 1,000 categories and you need to build a model on top of it, you have to choose the best way to present that feature to the model. To name a few common approaches (all three are sketched in code just below):

  • Factorization, where each unique category is assigned a unique integer label.
  • One-hot encoding, which produces a vector whose length equals the number of categories in the data set: a data point in the i-th category gets a 1 in the i-th component and 0 elsewhere (this can leave you with very high-dimensional data).
  • Target encoding (a bit tricky, as it may cause over-fitting), which encodes each category with the mean of the target variable (and must be done in a cross-validation scheme).
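To make these three options concrete, here is a minimal pandas sketch (my own illustration on a toy "city" column, not part of this tutorial's pipeline):

import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
                   'target': [1, 0, 1, 0]})

# factorization: each unique category gets an integer label
df['city_factorized'] = pd.factorize(df['city'])[0]

# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# target encoding: each category replaced by the mean of the target
# (in practice this must be computed inside a cross-validation scheme)
df['city_target_enc'] = df.groupby('city')['target'].transform('mean')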

In this tutorial we will learn another very effective approach for dealing with categorical features (especially in the high-cardinality case), and for that we will be using TensorFlow 2.0; make sure to upgrade in order to follow along.

Data set: We will work with a real-world data set on census income, also known as the Adult data set, available in the UCI ML Repository, where the task is to predict whether a person's income is more than $50K/yr.

Categorical Features Embedding:

If you have worked on an NLP project before, you are most likely familiar with the word embedding. If not, let me explain it:

Embedding means to represent something as a vector, a projection.

It is as simple as that, but the question is: how do we get that vector? This is where deep neural networks come in handy.
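For instance, here is a toy illustration (my own, not from this tutorial) of what an embedding layer is in TensorFlow: just a lookup table of trainable vectors, one per category:

import tensorflow as tf

# a lookup table of 5 trainable 3-dimensional vectors, one per category
emb = tf.keras.layers.Embedding(input_dim=5, output_dim=3)

# the (random, not-yet-trained) 3-number vector representing category 2
print(emb(tf.constant([2])).numpy())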

So enough talk, let us speak code now:

1 — First we will load the data (we don't have to download it; just install the shap package and the data is accessible from it):

import shap

data, labels = shap.datasets.adult(display=True)

2 — Next, let us inspect our categorical features by running this line of code:

data.select_dtypes('category').columns

So, our categorical features are: ‘Workclass’, ‘Marital Status’, ‘Occupation’, ‘Relationship’, ‘Race’, ‘Sex’, ‘Country’.

2.1 — And just out of curiosity, let us also check our numerical features by running the following code:

data.select_dtypes('number').columns

And the output is: ‘Age’, ‘Education-Num’, ‘Capital Gain’, ‘Capital Loss’, ‘Hours per week’.

3 — Now, let us check the cardinality of the categorical features (how many unique values each one holds):

data[data.select_dtypes('category').columns].nunique().reset_index(name='cardinality')

(The output is a small table listing each categorical feature alongside its cardinality.)

So it seems we don't have very high-cardinality features (more than 100 categories), but we do have ‘Country’ with 42 categories and ‘Occupation’ with 15.

However, this tutorial is applicable whatever the number is.

4 — Now let us start building our model: As highlighted before, the task is to predict whether a person earns more or less than USD 50K based on a list of features, and for that we will build a neural net in TensorFlow 2.0 using both the categorical and the numerical features.

What we will do:

What we will try to do is build a multi-input neural net: one input for each categorical feature, while all the numerical features are fed through a single input. Let me explain further:

What I said above means this:

  • First we need to embed the categorical features (represent each unique value of a categorical feature by a vector).
  • For that we will define an embedding model for each categorical feature (an input layer plus an embedding layer).
  • As for the numeric features, we will feed them to our model from one last input layer, as we usually do in any regular deep learning network.

So in total we will have number_of_categorical_features + 1 models (number_of_categorical_features embedding models plus an identity model).

Once we have defined these models, and since we need a single model at the end, we will concatenate them into a single layer.

So again, enough talk, let us speak code:

1 — First, we will build a small function that factorizes our categorical features, since the deep net expects numbers and not strings:

from sklearn.preprocessing import LabelEncoder

def prepare_data_set(data_df):
    category_features = data_df.select_dtypes('category').columns
    numeric_features = data_df.select_dtypes('number').columns
    for col in category_features:
        # encode each categorical column's values as integers
        encoder = LabelEncoder()
        data_df[col] = encoder.fit_transform(data_df[col])
    return data_df, category_features, numeric_features

The function gets the categorical features and encodes them one by one into integers, then returns three things: (i) the encoded data, (ii) the list of categorical features, and (iii) the list of numerical features.
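For reference, here is how the function might be wired up (the article does not show this step; the names train, cat_features and num_features are assumed so that they match the later snippets):

# encode the categories in place and keep the two feature lists around
train, cat_features, num_features = prepare_data_set(data)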

2 — Now that we have our train data ready, along with the labels, let us build the architecture of our model:
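
The embedded code for this step does not render here, so below is a minimal sketch reconstructed from the step-by-step description that follows; the names models, inputs, cat_features and train are assumptions carried over from the surrounding snippets:

import tensorflow as tf

models = []   # outputs of the per-feature sub-models, to be concatenated later
inputs = []   # input layers, needed when we build the final tf.keras.Model

for col in cat_features:
    # how many unique values this categorical feature holds
    cardinality = int(train[col].nunique())
    # one input of shape 1 per categorical feature;
    # the name lets us feed it by dictionary key later
    cat_input = tf.keras.layers.Input(shape=(1,),
                                      name='input_' + col.replace(' ', '_'))
    # the embedding matrix: `cardinality` rows of 200-dimensional trainable vectors
    embedded = tf.keras.layers.Embedding(cardinality, 200,
                                         trainable=True)(cat_input)
    # flatten the (1, 200) output to a 1-D vector of length 200
    embedded = tf.keras.layers.Reshape(target_shape=(200,))(embedded)
    inputs.append(cat_input)
    models.append(embedded)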

Let us break down that piece of code above:

2.1 — As you can see, for each of our categorical features we define an input layer which accepts an input of shape 1 (as our input will be the value of the category, which is just a number).

2.2 — We give it a name so we can reliably send it the right data (a very practical thing to do; I recommend it).

2.3 — Then we define our embedding layer, which is basically a matrix with a certain number of rows and columns.

2.3.1 — The number of rows will be the cardinality of the categorical feature (how many unique values it holds).

2.3.2 — The number of columns will be the length of the vector that will represent those unique values (this is a parameter to be tuned). For this tutorial we choose 200 (a very common number to start with).

2.3.3 — Pay attention that we set the layer to trainable: since we have initialized it with just random numbers, which are of no value to us, we need it to keep updating during training (back-propagation).

2.3.4 — Finally we reshape the output to a single 1-D array, whose shape is the length of the embedding vector.

So this is how we define our number_of_categorical_features embedding models (in this case 7 categorical features, which means 7 embedding models).

2.4 — As for our numerical features, we will feed them as we usually do, from their own input layer, like this:

num_input = tf.keras.layers.Input(shape=(len(num_features),),
                                  name='input_number_features')
# append this model to the list of models
models.append(num_input)
# keep track of the input; we will feed all inputs to the final model later
inputs.append(num_input)

2.5 — Now that we have 8 models (7 embedding models and 1 identity model), all appended to a list called models, let us concatenate them into a single layer:

merge_models = tf.keras.layers.concatenate(models)

2.6 — Now we have one layer on top of which we can stack a list of fully connected layers:
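
This part of the embedded code does not render here either, so here is a reconstruction from the description below; the single sigmoid output with binary cross-entropy is an assumption that fits the boolean labels returned by shap (a 2-unit softmax with sparse categorical cross-entropy would be equivalent):

# two fully connected layers with 1000 units each on top of the merged models
x = tf.keras.layers.Dense(1000, activation='relu')(merge_models)
x = tf.keras.layers.Dense(1000, activation='relu')(x)

# prediction layer: the probability of a person earning more than USD 50K
pred = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=inputs, outputs=pred)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])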

As you can see, we have built 2 fully connected layers with 1000 units each on top of the merged models. After that we added the prediction layer, which returns the probability of a person earning more or less than USD 50K, and finally we compiled our model to minimize the cross-entropy using the adam optimizer, with accuracy as an evaluation metric.

2.7 — Now the last thing is to feed the data to our model.

Since we are using a multi-input neural network, it is best practice to feed your train data as a dictionary, where the keys are the names of the input layers and the values are what each layer expects to receive. So let us do that, so you can see what I mean:

input_dict = {
    'input_Workclass': train["Workclass"],
    'input_Marital_Status': train["Marital Status"],
    'input_Occupation': train["Occupation"],
    'input_Relationship': train["Relationship"],
    'input_Race': train["Race"],
    'input_Sex': train["Sex"],
    'input_Country': train["Country"],
    'input_number_features': train[num_features]
}

As you can see, this way we are 100% sure we are sending the right data to the right model.

2.8 — And that is it; let us fit our model:

model.fit(input_dict, labels, epochs=50, batch_size=64)
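
Once trained, the same dictionary format is used at inference time; for example (my own addition):

# predicted probability of income > $50K for each row in the dictionary
probs = model.predict(input_dict)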

And voilà, we have reached the end of this tutorial. I hope it was clear enough and practical.

PS: One last remark regarding training: neural nets are highly sensitive to features with large numeric ranges, like those in this data set, and the model won't learn anything if they are kept as is. You will have to standardize your numerical features (zero mean, unit variance) using the following code:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train[num_features] = scaler.fit_transform(train[num_features])

I hope you have enjoyed this tutorial; more is coming, so stay tuned.

Link to the full code repo: https://github.com/oussamaErra/tf-2-tutorial-categorical-features-embedding

If you found this tutorial useful and practical, make sure to follow me on Medium for more practical content on data science and AI.

And if you have any questions, you can contact me directly on LinkedIn or Gmail (listed below).

About Me

I am a Principal Data Scientist @ Clever Ecommerce Inc, where we help businesses create and manage their Google Ads campaigns with a powerful technology based on Artificial Intelligence.

You can reach out to me on LinkedIn or Gmail: errabia.oussama@gmail.com.
