Scaling and Accelerating Large Deep Learning Recommender Systems — HugeCTR Series Part 1


By Minseok Lee and Joey Wang

Deep learning has become the mainstream machine learning technique in computer vision, natural language processing, and speech recognition, and it is now being applied to the recommender systems domain as well. Training deep learning recommender systems is not easy, and we discussed the challenges in a previous blog post. Deep learning recommender models often have a shallow architecture with a few fully connected layers and large user/item embedding tables, which can reach multiple gigabytes or even terabytes in size. To address these challenges, NVIDIA develops the open source framework NVIDIA Merlin to scale and accelerate deep learning recommender systems on GPUs end-to-end: from ETL to training to deployment. This is the first part of a blog post series explaining the challenges and solutions involved in training neural networks for really large recommender systems with HugeCTR. Merlin HugeCTR is a recommender-system-specific framework that accelerates the training and deployment of complex deep learning models on GPUs at scale.

Don't the Embedding Tables exceed the GPU memory?

Your first thought may be, “Yes, I want to train deep learning recommender systems, but how can I fit the embedding tables into the GPU memory?” You are right, and that is one of the main challenges we experienced in training large deep learning recommendation models. The embedding tables can reach hundreds of gigabytes to multiple terabytes in size and do not fit into a single GPU or host memory. As a solution to that bottleneck, HugeCTR scales training by exploiting both data and model parallelism, and distributes a single embedding table over multiple GPUs as visualized below.

As the fully connected layers have only a “few” parameters, each GPU holds a copy of their weights for data parallel training. The large embedding tables, in contrast, are distributed over multiple GPUs or multiple nodes. Depending on the specific batch, the forward/backward passes are executed accordingly and all parameters are updated. In this way, HugeCTR scales the training in both a model-parallel (embedding layers) and a data-parallel (dense layers) fashion. In our experiments, we observed a 114x speed-up in comparison to a TensorFlow CPU version and were able to scale embedding tables up to 10 terabytes in size.
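To make the hybrid parallelism concrete, here is a minimal, framework-free sketch in plain NumPy. It is a conceptual model, not the HugeCTR implementation: the shard layout, vocabulary size, and embedding dimension are illustrative assumptions.

```python
# Conceptual sketch only: one embedding table sharded across "GPUs" (model
# parallel) while the small dense weights are replicated (data parallel).
import numpy as np

NUM_GPUS = 4
VOCAB_SIZE = 1_000_000            # rows in the full embedding table
EMBED_DIM = 16

# Model parallel: each GPU owns a contiguous slice of the embedding table.
rows_per_gpu = VOCAB_SIZE // NUM_GPUS
shards = [np.random.randn(rows_per_gpu, EMBED_DIM).astype(np.float32)
          for _ in range(NUM_GPUS)]

def lookup(key: int) -> np.ndarray:
    """Route a categorical key to the shard (GPU) that owns its row."""
    gpu_id = key // rows_per_gpu      # which device holds this row
    local_row = key % rows_per_gpu    # offset inside that shard
    return shards[gpu_id][local_row]

# Data parallel: every GPU keeps a full copy of the (small) dense weights,
# so only their gradients need to be exchanged between devices each step.
dense_weights = np.random.randn(EMBED_DIM, 1).astype(np.float32)

batch_keys = np.random.randint(0, VOCAB_SIZE, size=8)
embeddings = np.stack([lookup(int(k)) for k in batch_keys])
logits = embeddings @ dense_weights
print(logits.shape)                   # (8, 1)
```

In the real system, the lookup routing and the gradient exchange for the dense layers are handled by HugeCTR's CUDA C++ kernels and collectives rather than by Python code.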

Can I use the same strategy with other Deep Learning frameworks?

HugeCTR is highly optimized for recommender systems, with its key value being that it scales well to multiple GPUs. Implementing these strategies is not trivial and requires custom CUDA C++ code. Meanwhile, TensorFlow is one of the most popular general purpose deep learning frameworks and has a strong user base. We want to provide this functionality to the community and, therefore, exported the HugeCTR embedding as a TensorFlow custom op, so that it is interoperable with customers’ existing code and pipelines built upon TensorFlow. As you can see below, using the HugeCTR TensorFlow op requires changing only a single line and works easily with a TensorFlow Keras model. Check out our Jupyter notebook to see how you can integrate the HugeCTR embedding into the DeepFM model by creating a new Keras layer. If you are interested in more details, such as its underlying mechanism and performance, watch our related talk at GTC 2021. In summary, our HugeCTR TensorFlow op achieves a 10x speedup over the original TensorFlow embedding layer on a single DGX A100. For end-to-end training with 7 fully connected layers, it delivers a 3.5x speedup.
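As a rough illustration of that single-line swap, here is a hedged Keras sketch. The HugeCTR embedding import and class name below are placeholders, not the actual op (the exact name and arguments are in the linked notebook); everything else is standard TensorFlow.

```python
import tensorflow as tf
# Placeholder import: the real module/class for the HugeCTR TensorFlow op is
# shown in the linked DeepFM notebook; the name below is illustrative only.
# from hugectr_tf_ops import HugeCTREmbedding

def build_model(vocab_size=100_000, embed_dim=16, use_hugectr=False):
    inputs = tf.keras.Input(shape=(1,), dtype=tf.int64)
    if use_hugectr:
        # The "single line change": a drop-in replacement for the stock Keras
        # embedding layer (placeholder name, see the notebook for the real op).
        # emb = HugeCTREmbedding(vocab_size, embed_dim)(inputs)
        raise NotImplementedError("plug in the HugeCTR TensorFlow op here")
    else:
        emb = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
    x = tf.keras.layers.Flatten()(emb)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, output)

model = build_model()  # baseline; set use_hugectr=True once the op is wired in
model.summary()
```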

Training on a Single Machine when Embedding Tables still exceed available memory

The embedding tables of recommender systems can reach hundreds of gigabytes to terabytes in size, and even a multi-GPU machine cannot fit them into GPU memory. As the number of users and items of a service grows, the table size also grows rapidly. As discussed above, HugeCTR can scale the model to multiple nodes, but it is not always possible for a company to scale out or scale up its GPU cluster whenever the embedding table size exceeds the aggregated GPU memory capacity.

HugeCTR’s Embedding Training Cache (ETC) feature enables training of deep learning recommender systems on a single machine, even if the embedding tables do not fit into the GPU memory. It tackles the aforementioned problem in a coarse-grained manner: ETC oversubscribes the model (Model Oversubscription) and facilitates the training of terabyte-scale embedding tables on a single machine such as a DGX A100. It assumes that the dataset is split into sub-datasets, each of which contains a user-specified number of samples, and that the unique values of the categorical features (the keyset) are extracted from each sub-dataset. For instance, a user may decide that each sub-dataset holds a month’s worth of data, so that its categorical features lead to an embedding table that can be loaded within the given GPU memory capacity. The training then consists of multiple passes, where every pass uses a different pair of keyset and sub-dataset, as sketched below.
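For concreteness, here is a minimal pandas sketch of what that preprocessing could look like. The file and column names (interactions.parquet, timestamp, user_id, item_id) are hypothetical; the actual keyset and file formats expected by ETC are described in the HugeCTR documentation.

```python
import pandas as pd

# Hypothetical input: one interaction log with a timestamp and two
# categorical features; column and file names are illustrative only.
df = pd.read_parquet("interactions.parquet")
df["month"] = pd.to_datetime(df["timestamp"]).dt.to_period("M")

cat_cols = ["user_id", "item_id"]
for month, sub_df in df.groupby("month"):
    # One sub-dataset per month ...
    sub_df.drop(columns="month").to_parquet(f"subset_{month}.parquet")
    # ... and its keyset: the unique categorical values that determine which
    # embedding rows must be resident in GPU memory during that pass.
    keyset = {col: sub_df[col].unique() for col in cat_cols}
    print(month, {col: len(vals) for col, vals in keyset.items()})
```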

By analogy, what model_subscriber.update() does at the beginning of every pass is similar to a process context switch. It dumps the embedding vectors trained in the previous pass to secondary storage such as an SSD. After that, it loads the embedding vectors for the current keyset into device memory, so that they can be trained in the current pass. In this way, we can effectively train a large embedding table whose size exceeds the available device memory capacity.
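The toy snippet below mimics that context switch with plain Python dictionaries standing in for SSD/host storage and GPU memory. It is a conceptual model of the dump-then-load behaviour described above, not the HugeCTR ETC implementation.

```python
import numpy as np

EMBED_DIM = 8
host_store = {}        # stands in for host memory / SSD
device_cache = {}      # stands in for GPU memory

def update(keyset):
    """Analogue of model_subscriber.update(): dump previous rows, load new ones."""
    host_store.update(device_cache)   # dump rows trained in the previous pass
    device_cache.clear()
    for key in keyset:                # stage only the rows the next pass needs
        device_cache[key] = host_store.setdefault(
            key, np.zeros(EMBED_DIM, dtype=np.float32))

# Two passes with overlapping keysets (e.g. two different months of data).
for keyset in [{1, 2, 3}, {3, 4, 5}]:
    update(keyset)
    for key in keyset:                # "train" only the cached embedding rows
        device_cache[key] += 0.1
host_store.update(device_cache)       # final dump

print(sorted(host_store))             # every row was trained across the passes
```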

The idea is that moving a large chunk of data into the GPU at once is efficient. For every sub-dataset, the required embedding vectors are loaded and multiple batch updates are executed. Afterwards, the model parameters are synchronized with the parameter server. As the accesses of embedding vectors often follow a power-law distribution, we only need a fraction of the embedding table for each subset. As we iterate over the full dataset sub-dataset by sub-dataset, we still train all embedding vectors over time.

HugeCTR ETC is currently only available in Python. We provide a notebook on how to enable and use it based upon our low-level Python API. We are actively researching how to utilize device memory bandwidth effectively while streamlining the path between the SSD and device memory.

Try out NVIDIA Merlin and HugeCTR to scale your Recommendation System Pipelines

We provide many examples for HugeCTR and end-to-end examples of ETL-Training-Inference in our GitHub repositories (NVIDIA Merlin, NVIDIA NVTabular and NVIDIA HugeCTR). You can read more about HugeCTR in our User Guide. HugeCTR provides multiple reference implementations of common deep learning models, such as Wide&Deep, Deep Cross Network and the Deep Learning Recommendation Model (DLRM). If you have any issues or feature requests while using HugeCTR, you can reach out to us by filing a GitHub issue in our repository. This was the first blog post about “Scaling and Accelerating Large Deep Learning Recommender Systems” in our HugeCTR series. The next one will cover the Python API with hands-on examples. Stay tuned!

