Demystifying Neural Networks: Taming Categorical Features with Embeddings

Giving meaning to the meaningless

Dagang Wei

This article is part of the series Demystifying Neural Networks.

Introduction

In the vast and evolving landscape of machine learning, handling categorical data remains a pivotal challenge, especially when feeding it into neural networks. Traditional methods like one-hot encoding and target encoding have served as the go-to techniques, but they come with significant limitations that can hurt model performance and inflate model complexity. This is where embeddings emerge as a game-changer, offering a nuanced and powerful way to incorporate categorical information into neural networks. Let’s dive deeper into these concepts and see how embeddings can transform our approach to handling categorical data.

The Limitations of Traditional Encoding Methods

Before we explore the solution, it’s crucial to understand the challenges presented by traditional encoding techniques:

One-Hot Encoding

While one-hot encoding is widely used due to its simplicity, it suffers from several drawbacks:

  • Dimensionality Explosion: With each category represented as a separate feature, the feature space can become unwieldy, making models computationally expensive and difficult to train.
  • Sparsity: The resulting binary vectors are mostly zeros, which is inefficient and can degrade model performance.
  • Absence of Inter-category Relationships: This method treats each category as independent, failing to capture any inherent relationships between them.
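
To make the first two drawbacks concrete, here is a minimal sketch (using pandas, with a made-up zip_code column) of what one-hot encoding produces:

import pandas as pd

# A small categorical column; real ZIP code data has tens of thousands of categories
df = pd.DataFrame({'zip_code': ['94105', '10001', '60614', '94105', '73301']})

# One-hot encoding creates one binary column per distinct category
one_hot = pd.get_dummies(df['zip_code'], prefix='zip')
print(one_hot.shape)   # (5, 4) here; every US ZIP code would mean tens of thousands of columns
print(one_hot.head())  # mostly zeros: each row contains a single 1

Each new category adds a whole column, and none of the columns says anything about how categories relate to one another.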

Target Encoding

Target encoding attempts to mitigate some issues by replacing categories with a statistic (mean) of the target variable, but it introduces problems of its own:

  • Overfitting Risk: For categories with few data points, the encoded values are noisy estimates of the target, so the model can easily overfit the training data.
  • Information Loss: Reducing categories to a single statistic can strip away valuable categorical information.
  • Potential for Data Leakage: If not implemented carefully, it can lead to optimistic performance estimates due to leakage.
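
As a rough sketch of what target encoding does (pandas, with made-up columns), and why it needs care to avoid leaking the target:

import pandas as pd

df = pd.DataFrame({
    'zip_code': ['94105', '10001', '94105', '60614', '10001'],
    'price':    [900000, 650000, 950000, 500000, 700000],
})

# Replace each category with the mean target value observed for that category.
# In practice the means should be computed on the training fold only (often with
# smoothing or cross-fold estimates) to avoid target leakage and overfitting.
zip_means = df.groupby('zip_code')['price'].mean()
df['zip_code_encoded'] = df['zip_code'].map(zip_means)
print(df)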

Introducing Embeddings

Embeddings offer a sophisticated solution to these challenges by mapping each category to a dense vector within a continuous vector space. This representation allows the model to capture and leverage the relationships between categories, enhancing its predictive power.

How Embeddings Work

In a neural network, an embedding layer learns the vector representation of each category during the training process. These vectors are adjusted along with the network’s weights to minimize the loss function, allowing the embeddings to capture rich, task-specific information about each category.
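
In Keras terms, an embedding layer is essentially a trainable lookup table from integer category IDs to dense vectors. A minimal sketch, assuming 20 categories and a 4-dimensional embedding:

import numpy as np
from keras.layers import Embedding

# 20 categories, each mapped to a trainable 4-dimensional vector
embedding = Embedding(input_dim=20, output_dim=4)

# Looking up category IDs returns their current vectors
# (randomly initialized here; learned jointly with the rest of the network during training)
category_ids = np.array([[3], [7], [3]])
vectors = embedding(category_ids)
print(vectors.shape)  # (3, 1, 4)

During training, these vectors are updated by backpropagation just like any other weight in the network.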

Advantages of Embeddings

Embeddings address the shortcomings of traditional encoding methods by:

  • Reducing Dimensionality: They offer a compact, dense representation of categorical data.
  • Capturing Category Relationships: Embeddings can reflect the similarity between categories in their vector space (see the sketch after this list).
  • Enhancing Model Performance: By providing richer information to the model, embeddings can improve predictive accuracy.
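
For instance, once embeddings have been learned, their geometry can be inspected directly. A minimal sketch (plain NumPy, with made-up vectors) of measuring how similar two categories look to the model:

import numpy as np

# Hypothetical learned 4-dimensional embeddings for two ZIP codes
zip_a = np.array([0.8, -0.1, 0.3, 0.5])
zip_b = np.array([0.7, -0.2, 0.4, 0.4])

# Cosine similarity close to 1 means the model treats these categories as similar
cos_sim = np.dot(zip_a, zip_b) / (np.linalg.norm(zip_a) * np.linalg.norm(zip_b))
print(cos_sim)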

Example: Predicting House Prices with ZIP Codes

To illustrate the power of embeddings, consider predicting house prices based on features like square footage, number of bedrooms, and ZIP code. The ZIP code is a categorical variable that significantly influences house prices due to factors like neighborhood desirability and local amenities.

Generating a Synthetic Dataset

We generate a synthetic dataset with these features, introducing variability and patterns, such as certain ZIP codes having higher average prices. We also add noise and outliers to simulate real-world data complexities.

Building a Neural Network with Embeddings

Our model includes:

  • An embedding layer for ZIP codes.
  • Input layers for numerical features.
  • A concatenation of the embedding vector and numerical features, fed through dense layers to predict house prices.

Note that in this simple example a single embedding dimension for the ZIP code is sufficient; for more complex categorical features with many categories, you would typically use a larger embedding size.

The code is available in this Colab notebook:

import numpy as np
import pandas as pd
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def set_seed(seed=42):
    np.random.seed(seed)


# Function to generate a synthetic dataset for modeling
def generate_synthetic_data(num_samples=1000, num_zip_codes=20):
    # Random economic status for each ZIP code, influencing house prices
    zip_code_economic_status = np.random.rand(num_zip_codes)

    # Randomly assign ZIP codes to samples
    zip_codes = np.random.randint(0, num_zip_codes, size=num_samples)
    # Generate random square footage between 1000 and 2000 for each sample
    square_footage = np.random.rand(num_samples) * 1000 + 1000
    # Randomly assign number of bedrooms (1-4) for each sample
    num_bedrooms = np.random.randint(1, 5, size=num_samples)

    # Add Gaussian noise to square footage and number of bedrooms to introduce variability
    square_footage_noise = np.random.normal(0, 50, num_samples)
    num_bedrooms_noise = np.random.normal(0, 0.5, num_samples)
    square_footage += square_footage_noise
    num_bedrooms = np.round(num_bedrooms + num_bedrooms_noise).clip(1, 5)

    # Calculate base price based on square footage and number of bedrooms
    base_price = square_footage * 100 + num_bedrooms * 50000
    # Adjust base price based on ZIP code's economic status
    price_multiplier = zip_code_economic_status[zip_codes] * 0.5 + 1
    # Add additional Gaussian noise to prices
    price_noise = np.random.normal(0, 10000, num_samples)
    prices = base_price * price_multiplier + price_noise

    # Introduce outliers to prices for realism (modifies prices in place)
    introduce_outliers(prices, num_samples)

    # Return the DataFrame containing the generated features and target variable
    return pd.DataFrame({
        'ZIP Code': zip_codes,
        'Square Footage': square_footage,
        'Number of Bedrooms': num_bedrooms,
        'Price': prices
    })

# Function to introduce outliers into the prices (in place)
def introduce_outliers(prices, num_samples, percentage=0.01):
    num_outliers = int(percentage * num_samples)  # 1% of the dataset as outliers
    # Select random indices for high outliers
    high_indices = np.random.choice(num_samples, size=num_outliers // 2, replace=False)
    # Increase prices significantly for high outliers
    prices[high_indices] *= np.random.uniform(2, 3, size=num_outliers // 2)
    # Select random indices for low outliers
    low_indices = np.random.choice(num_samples, size=num_outliers // 2, replace=False)
    # Decrease prices significantly for low outliers
    prices[low_indices] *= np.random.uniform(0.1, 0.5, size=num_outliers // 2)

# Function to preprocess the data: normalize features and split the dataset
def preprocess_data(df):
    # Initialize standard scalers for features and target
    scaler_features = StandardScaler()
    scaler_target = StandardScaler()

    features = ['Square Footage', 'Number of Bedrooms']  # Feature column names
    target = 'Price'  # Target column name

    # Normalize features
    df[features] = scaler_features.fit_transform(df[features])
    # Normalize target
    df[target] = scaler_target.fit_transform(df[[target]].values.reshape(-1, 1))

    # Split the dataset into training and testing sets
    return train_test_split(df.drop('Price', axis=1), df['Price'], test_size=0.2, random_state=42), scaler_target

# Build a model with an embedding layer for ZIP codes and dense layers on top
def build_model(num_zip_codes, embedding_size):
    # One input for the integer ZIP code ID, one for the two numerical features
    zip_code_input = Input(shape=(1,), name='zip_code_input')
    numerical_input = Input(shape=(2,), name='numerical_input')

    # Map each ZIP code to a learned dense vector, then flatten it
    zip_code_embedding = Embedding(input_dim=num_zip_codes, output_dim=embedding_size, name='zip_code_embedding')(zip_code_input)
    zip_code_vec = Flatten(name='flatten_zip_code')(zip_code_embedding)

    # Concatenate the embedding with the numerical features and predict the price
    concatenated_features = Concatenate()([zip_code_vec, numerical_input])
    dense = Dense(128, activation='relu')(concatenated_features)
    output = Dense(1, name='output')(dense)

    model = Model(inputs=[zip_code_input, numerical_input], outputs=output)
    model.compile(optimizer=Adam(0.001), loss='mean_squared_error')
    return model

def plot_history(history):
    plt.figure(figsize=(10, 5))
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Model Loss Over Epochs')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(loc='upper right')
    plt.show()

def plot_predictions(y_test, y_pred):
    plt.figure(figsize=(10, 5))
    # The x-axis represents the actual prices of the houses.
    # The y-axis represents the predicted prices of the houses.
    plt.scatter(y_test, y_pred, alpha=0.3)
    plt.title('Actual vs. Predicted Prices')
    plt.xlabel('Actual Prices')
    plt.ylabel('Predicted Prices')
    # This line represents the points where the predicted prices perfectly
    # match the actual prices. In other words, if a prediction is exactly equal
    # to the actual price, its corresponding point would lie on this line.
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
    plt.show()

# Run the workflow
set_seed()
df = generate_synthetic_data()
(X_train, X_test, y_train, y_test), scaler_target = preprocess_data(df)
model = build_model(num_zip_codes=20, embedding_size=1)
model.summary()

# Train the model and save the history
history = model.fit(
    [X_train['ZIP Code'], X_train[['Square Footage', 'Number of Bedrooms']]],
    y_train,
    epochs=20,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluate the model on the test set
test_loss = model.evaluate(
    [X_test['ZIP Code'], X_test[['Square Footage', 'Number of Bedrooms']]],
    y_test,
    verbose=1
)
print(f'Test loss: {test_loss}')

# Predict on the test set and inverse transform target scaler
y_pred = model.predict([X_test['ZIP Code'], X_test[['Square Footage', 'Number of Bedrooms']]])
y_pred_inv = scaler_target.inverse_transform(y_pred)
y_test_inv = scaler_target.inverse_transform(y_test.values.reshape(-1, 1)).flatten()


# Visualizations
plot_history(history)
plot_predictions(y_test_inv, y_pred_inv)
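
Inspecting the Learned Embeddings

After training, the learned ZIP code vectors can be pulled out of the model and inspected. A short sketch using the layer name defined above (zip_code_embedding):

# Extract the learned ZIP code embedding matrix (shape: num_zip_codes x embedding_size)
embedding_weights = model.get_layer('zip_code_embedding').get_weights()[0]
print(embedding_weights.shape)  # (20, 1)

# With a 1-dimensional embedding, sorting ZIP codes by their learned value
# roughly orders them by how strongly they push predicted prices up or down
print(np.argsort(embedding_weights[:, 0]))

With more embedding dimensions, you could project this matrix with PCA or t-SNE to visualize which ZIP codes the model treats as similar.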

Conclusion

Embeddings revolutionize how we handle categorical data in neural networks, offering a robust solution to the limitations of traditional encoding methods. By leveraging embeddings, we can build more sophisticated, efficient, and accurate models that fully exploit the richness of categorical data. This approach not only enhances model performance but also provides deeper insights into the underlying patterns within our data, demystifying the process of integrating categorical information into neural networks.
