Scalable Recommender Systems with NVTabular: A Fast Tabular Data Loading and Transformation Library

Published in RAPIDS AI · July 23, 2020

By: Ronay Ak, Even Oldridge, Julio Perez, and Benedikt Schifferer

Introduction

Recommender Systems (RecSys) are part of our daily lives. When we visit a website, we often see a list of recommendations for products, content, and/or services. Netflix’s movie recommendations and Spotify’s playlist recommendations are familiar examples.

RecSys plays a critical role in motivating users to spend more time on online platforms and therefore drives user engagement. Deep learning (DL) based recommender systems have gained more attention in recent years due to their ability to generate relevant, personalized recommendations. DL-based models can better capture complex user-item interaction patterns by processing terabytes of data covering customer and item attributes, user interactions, ratings, and more. However, there are performance bottlenecks in building a large-scale DL-based recommender system with such high volumes of data:

  • Data ETL (extract-transform-load) takes a considerable amount of time,
  • Feature engineering experiments create numerous sets of new features from existing ones, which are then tested for effectiveness,
  • Data loading at train time can become the input bottleneck, if not well optimized, leading to the under-utilization of high-throughput computing devices such as GPUs.

Training Pipeline for a Recommender System.

The entire data ETL, feature engineering, training, and evaluation process likely must be repeated many times on many model architectures, requiring significant computational resources and time. Even after deployment, you want your recommender system to maintain high accuracy over time, which requires periodic retraining to account for model drift. For all these reasons, accelerating data preparation is critical to address the volume and scale that large-scale recommender systems require.

Speeding up the Data Pipeline

As we mentioned above, the ETL and feature engineering steps in large-scale recommender systems often take longer than model training and hyperparameter optimization, especially in CPU workflows. So, how can we run ETL much faster than training, and quickly and easily manipulate terabyte-scale tabular datasets? After running into this problem ourselves too many times to count, we developed NVTabular, a feature engineering and preprocessing library for recommender systems. NVTabular provides a high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS dataframe library, cuDF.

NVTabular is one of the building blocks of NVIDIA Merlin, an open source, GPU-accelerated recommendation framework that scales to datasets and user/item combinations of arbitrary size. Merlin accelerates the entire RecSys pipeline from ingesting and training to deploying GPU-accelerated deep recommender systems. For a detailed blog on Merlin, visit here.

NVIDIA Merlin Deep Recommender Application Framework.

Here’s how the pipeline works. NVTabular wraps the RAPIDS cuDF library, which provides the bulk of the functionality, accelerating dataframe operations on the GPU. On top of that, NVTabular provides mechanisms for iteration when the dataset exceeds GPU memory: data is processed in chunks that fit in GPU memory, so datasets larger than CPU/GPU memory are supported, and you can focus on what you want to do with your data rather than how you need to do it.

In this blog we will walk you through the NVTabular workflow steps using the ~1.3 TB Criteo dataset shared by CriteoLabs for the ad click-through rate (CTR) prediction Kaggle challenge. We’ll show you how to use NVTabular as a preprocessing library to prepare the Criteo dataset on a single V100 32 GB GPU. The large memory footprint of this dataset is an excellent opportunity to highlight how NVTabular loads and transforms, in an online fashion, data much larger than what fits into available GPU memory.

For this example,

  • We will use the parquet files converted from the raw tsv files published by Criteo. The optimize-criteo notebook can be used to generate the parquet files.
  • The training dataset consists of Criteo’s traffic over 23 days.
  • There are 13 features taking integer values and 26 categorical features; the first column (the label column) indicates whether the ad was clicked or not.

ETL with NVTabular on GPUs

There are five steps to get up and running on NVTabular:

  1. Download and Installation
  2. Define a Workflow
  3. Add OPs for Preprocessing
  4. Instantiate a Dataset Object
  5. Apply Phase

Step 1- Download and Installation

NVTabular is provided as an open source toolkit. Visit the getting started section of our GitHub repo for instructions on installing NVTabular with Anaconda or Docker.

Now, let’s import the necessary libraries:
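The import cell is embedded as a gist in the original post; as a minimal sketch, the imports used in the rest of this walkthrough might look like the following (the module layout reflects the early NVTabular releases this post describes and may differ in newer versions):

```python
import os

import rmm          # RAPIDS Memory Manager, used to set up a GPU memory pool
import numba.cuda   # used here only to query free GPU memory

import nvtabular as nvt
from nvtabular.ops import ZeroFill, LogOp, Normalize, Categorify
```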

For applications like the one that follows, where RAPIDS will be the only library using GPU memory and resources, a good practice is to use the RMM (RAPIDS Memory Manager) library to allocate a dedicated pool of GPU memory, which enables fast, asynchronous memory management. Here, we dedicate 80% of the free GPU memory to this pool to get the most utilization possible, and leave the remaining 20% untracked by the memory manager as headroom for allocations that spill outside the RMM buffer during compute-intensive operations. Leveraging RMM accounts for a 20–30% boost in ETL pipeline speed.
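As a sketch, dedicating roughly 80% of the currently free GPU memory to an RMM pool could look like the following (querying free memory via numba is one possible approach, not necessarily the one used in the original notebook):

```python
# Query free GPU memory and hand ~80% of it to an RMM pool; the remaining
# ~20% stays outside the pool for allocations made during heavy operations.
free_mem, total_mem = numba.cuda.current_context().get_memory_info()
pool_size = (int(free_mem * 0.8) // 256) * 256  # align the pool size to 256 bytes
rmm.reinitialize(pool_allocator=True, initial_pool_size=pool_size)
```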

Next, we define our continuous features (compute-intensive) and categorical features (memory-intensive, particularly for high-cardinality features), along with where our data lives and where the processed parquet files will be stored.
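For the Criteo dataset, the column definitions might look like this; the I*/C* naming follows the convention used in the NVTabular Criteo examples, and the directory paths are hypothetical placeholders:

```python
# 13 continuous (integer) features, 26 categorical features, 1 binary label.
CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]

# Point these at the converted parquet files and at the desired output location.
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "/raid/criteo/parquet")
OUTPUT_DATA_DIR = os.environ.get("OUTPUT_DATA_DIR", "/raid/criteo/processed")
```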

Here we prepare train and validation set paths. The first 23 days will be used for training, and the last day for validation.
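Assuming one parquet file per day, named day_0 through day_23 as produced by the optimize-criteo notebook, the split might be expressed as:

```python
# Days 0-22 feed training; day 23 is held out for validation.
train_paths = [os.path.join(INPUT_DATA_DIR, f"day_{i}.parquet") for i in range(23)]
valid_paths = [os.path.join(INPUT_DATA_DIR, "day_23.parquet")]

output_train_dir = os.path.join(OUTPUT_DATA_DIR, "train")
output_valid_dir = os.path.join(OUTPUT_DATA_DIR, "valid")
os.makedirs(output_train_dir, exist_ok=True)
os.makedirs(output_valid_dir, exist_ok=True)
```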

At this point, our data still isn’t in a form that’s ideal for consumption by neural networks. There are missing values, and our categorical variables are still represented by random, discrete identifiers that need to be transformed into contiguous indices for embedding lookups. The distributions of our continuous variables are uncentered.

We can perform these processes in a concise and GPU-accelerated manner with an NVTabular Workflow. In the next step, we’ll instantiate one with our current dataset schema, then symbolically add operations on that schema.

Step 2- Define a Workflow

It is time to define a Workflow, which represents the chain of feature engineering and preprocessing operations performed on a dataset. A Workflow is instantiated with a description of the dataset’s schema so that it can keep track of how columns are transformed by each operation.
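With the early NVTabular API this post describes, instantiating the Workflow with the dataset schema might look like the following sketch:

```python
# The Workflow tracks how each group of columns is transformed by the ops we add next.
proc = nvt.Workflow(
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_name=LABEL_COLUMNS,
)
```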

Step 3- Add OPs for Preprocessing

Now, we add operations to the Workflow by leveraging the add_(cat|cont)_feature and add_(cat|cont)_preprocess methods for categorical and continuous variables, respectively. We use the add_(cat|cont)_preprocess methods when the transformation applied to the targeted columns requires statistics to be calculated, as with the Normalize operator, and the add_(cat|cont)_feature methods when it does not, as with the ZeroFill operator.

We are applying the following preprocessing steps on this dataset:

Continuous Features:

  • imputing missing values (with 0) using the ZeroFill() operation,
  • log transforming with the LogOp() operation,
  • normalizing the features to have zero mean and a standard deviation of 1 with the Normalize() operation.

Categorical Features:

Encode categorical features into contiguous integer values with the Categorify() operation, keeping only categories that occur more often than a specified frequency threshold. All infrequent categories are mapped to a single special ‘unknown’ index, which keeps the model from overfitting to sparse signals.
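Putting the steps above together with the add_* methods described earlier, a sketch might look like this (the freq_threshold value is illustrative, not the setting used for the benchmark below):

```python
# Continuous features: ZeroFill and LogOp need no dataset-level statistics...
proc.add_cont_feature([ZeroFill(), LogOp()])
# ...while Normalize needs per-column mean and standard deviation.
proc.add_cont_preprocess(Normalize())

# Categorical features: encode to contiguous integer ids; categories seen fewer
# than freq_threshold times are all mapped to a shared "unknown" id.
proc.add_cat_preprocess(Categorify(freq_threshold=15))
```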

Step 4- Instantiate a Dataset Object

The Ops in our Workflow require statistics calculated across the entire dataset. For example, the Normalize op requires the dataset mean and standard deviation, and the Categorify op requires an accounting of all the unique categories a particular feature can take. However, we need to measure these properties across datasets that are too large to fit into GPU or even CPU memory.

NVTabular solves this by providing a dataset object which partitions the dataset into chunks that will fit into GPU memory. This dataset can be used to compute statistics in an online fashion.
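A sketch of wrapping the parquet files in Dataset objects, with a partition size chosen so that each chunk fits comfortably in GPU memory (the parameter names follow the early API; the exact part_mem_fraction value is an assumption):

```python
# part_mem_fraction sets the chunk size as a fraction of GPU memory, so the
# full ~1.3 TB dataset is streamed through the 32 GB GPU one chunk at a time.
train_dataset = nvt.Dataset(train_paths, engine="parquet", part_mem_fraction=0.12)
valid_dataset = nvt.Dataset(valid_paths, engine="parquet", part_mem_fraction=0.12)
```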

Step 5- Apply Phase

We apply our Workflow to our datasets and save the results to parquet files for fast reading at train time. We also measure and record statistics on our training set using the record_stats=True parameter so that our Workflow can use them at apply time. Transforms can be applied not only when the dataset is written out but also during data loading, with plans to support the same transforms during inference. At this stage, we compute the statistics, transform the data, and export the transformed data frames to disk.
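The apply phase itself might look like the following sketch; record_stats is the parameter mentioned above, while arguments such as shuffle and output_path follow the early API and may differ between versions:

```python
# Gather statistics on the training set, transform it, and write shuffled
# parquet files to disk for fast reading at train time.
proc.apply(train_dataset, record_stats=True, shuffle=True, output_path=output_train_dir)

# The validation set reuses the statistics recorded on the training set.
proc.apply(valid_dataset, record_stats=False, shuffle=False, output_path=output_valid_dir)
```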

And just like that, we have the training and validation sets ready to feed to a DL model!

Processing the Criteo Terabyte dataset takes five and a half days on a CPU using the original NumPy CPU script. With NVTabular, on the other hand, we are able to complete the ETL process in about 21 minutes on a single V100 32 GB GPU.

NVTabular Criteo comparison. GPU (Tesla V100 32 GB) vs. CPU (AWS r5d.24xl, 96 cores, 768 GB RAM).

Deep Learning

At this stage, you might be wondering how you can move to the DL phase and start training a DL-based recommender system. No worries! The NVTabular library is designed to be interoperable with both PyTorch (PyT) and TensorFlow (TF). Currently, batch data loaders for PyT, TF, and HugeCTR (a high-efficiency GPU framework designed for CTR estimation training) have been developed as extensions of native framework code, accelerating one of the most common bottlenecks in deep recommender system models to help reduce total training time.

The NVTabular data loaders rely on DLPack, an open in-memory tensor structure, to move data efficiently among frameworks. DLPack is currently supported by several frameworks, including cuDF, CuPy, PyT, TF 2.x, and MXNet.
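As an illustrative sketch only, feeding the processed files to PyTorch might look like this; the class names come from the nvtabular.loader.torch module in later NVTabular releases and are an assumption for the version described in this post, and the batch size is arbitrary:

```python
import glob

from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader

# The processed training files written during the apply phase.
train_files = glob.glob(os.path.join(output_train_dir, "*.parquet"))
train_data = nvt.Dataset(train_files, engine="parquet", part_mem_fraction=0.12)

# Asynchronous, GPU-resident batching; tensors are handed to PyTorch via DLPack.
train_batches = TorchAsyncItr(
    train_data,
    batch_size=400_000,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)
train_loader = DLDataLoader(train_batches, batch_size=None, pin_memory=False, num_workers=0)
```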

Check out the examples in the NVTabular GitHub repo for more details on DL training.

Summary

In this blog, we covered NVTabular, a feature engineering and preprocessing library designed to quickly and easily manipulate terabytes of tabular data. NVTabular provides a high-level abstraction and accelerates computation on GPUs using the RAPIDS cuDF library.

One of our goals is to provide data scientists and engineers with a more optimized ETL process for building DL-based recommender systems on large volumes of data, without requiring deep knowledge of complex frameworks (e.g., Spark, Dask). With just 10–20 lines of high-level API code, you can build a data engineering pipeline and achieve up to a 6X speedup (on a single GPU) compared to optimized CPU-based approaches, with no dataset size limitations, regardless of GPU/CPU memory capacity.

We believe NVTabular, as part of the Merlin framework, will see broad adoption across the e-commerce and online advertising industries to accelerate and scale recommender system pipelines.

NVTabular is under active development. Contributions are very welcome! If you are interested in contributing to NVTabular, please review the Contributing.md in the repo, and check out our open issues and feature requests. If you have any questions, suggestions, or requests regarding the NVTabular library, you can reach out to us via our GitHub repo.

Authors

Ronay Ak is a Sr. Data Scientist at NVIDIA working on deep learning-based recommender systems and NLP tasks. She holds a Ph.D. degree in Power and Energy Systems. She has authored more than 20 technical publications published in internationally reputed conferences and journals.

LinkedIn | Github

Julio Perez is a Sr. Software Engineer at NVIDIA with experience in cloud infrastructure and machine learning. He contributed to the fifteenth placed team in RecSys 2019. His previous work includes research in adversarial machine learning. He’s actively working on the development of the NVTabular library to leverage GPU architecture for ETL data pipelines.

LinkedIn | Github

Even Oldridge is a Sr. Applied Research Scientist at NVIDIA and leads the team developing NVTabular. He has a Ph.D. in Computer Vision but has spent the last five years working in the recommender system space with a focus on deep learning-based recommender systems.

Twitter | LinkedIn | Github

Benedikt Schifferer is a Deep Learning Engineer at NVIDIA working on recommender systems. Prior to his work at NVIDIA, he graduated from MSc. in Data Science program at Columbia University, New York, and developed recommender systems for a German e-commerce company.

LinkedIn | Github
