Why isn’t your recommender system training faster on GPU? (And what can you do about it?)

Even Oldridge
Published in
6 min readDec 3, 2020


Deep learning has taken the machine learning world by storm and in almost every field from Vision, to NLP, to Speech and beyond, GPUs are the obvious choice for accelerating your training. But recommender systems remain an outlier; deep learning is now widely used, but a lot of the training and inference for recommender systems in production still happens on CPU because GPUs don’t offer the same speedups that we see in other domains out of the box. In this article we’ll take a look at why that is and more importantly what we can do about it. My team at NVIDIA has been working on this problem for the past year and we’ve got some exciting developments to share.

Compute, Memory, and I/O (oh my!)

Part of the reason that GPUs have been so successful at Vision and NLP tasks is that models in those domains are large and complex. In vision, compute makes up the entire workload, and the parallelism of the GPU shines. For NLP the case is slightly more complex; BERT-base has 12-layers, with a 768-hidden width and 12 attention heads for a total of 110M parameters. It’s vocabulary size is 30K x 1024 meaning that the embeddings which store representation learned by the model make up ~30% of its total size. This is a significant proportion, but compute is still the dominant factor in throughput.

Relative model and embedding sizes for Computer Vision, NLP and Recommender Systems vary significantly. What works well in one domain isn’t necessarily going to work in another.

Compare that with recent session based recommender system architectures such as BST which follow the same general transformer architecture but are configured very differently. BST has a single transformer layer followed by a 1024–512–256 width MLP and 8 attention heads. Its equivalent ‘vocabulary’ on the other hand is a whopping 300M users and 12M items in the example Taobao dataset which are embedded with a width of 64. So the amount of compute is several orders of magnitude lower than we see in NLP and the embeddings which are IO bound, not compute bound, make up well over 95% of the model and are hard to fit on a single GPU (stay tuned for a future article on that). What that means is that in order to keep the GPU running efficiently we need to make sure other aspects of the workflow are well tuned..

A digression into dataloaders

Most people take this aspect of training for granted. Point the framework’s dataloader at the directory or files that you want the model to train on and you’re good to go. In situations where the model is dominated by compute this approach is often okay. An asynchronous dataloader only ever has to feed data faster than the forward and backward pass of the model. As long as it’s able to get the next batch ready before the GPU is done processing the current batch it has done its job.

For NLP or Vision architectures compute is significant relative to time taken to get the data to the model. You’re also usually working with small batches of large examples and the strategies that work there aren’t optimal for the kind of data that recommender systems use. Properly tuning the dataloader and I/O is important, and I highly recommend you check out the work being done by the DALI team at NVIDIA if your workload includes images or audio, but you can’t apply the same principles to recommendation.

For starters, not only is the compute smaller, but at the example level and even at the batch level recommender system data is usually quite small. Most dataloaders work by aggregating randomly selected examples into batches and then passing that information to the GPU. In future blog posts we’ll deep dive into the specifics of the different dataloaders available in deep learning frameworks but for this blog we’ll focus on the common dataloader case where batches are aggregated from random examples. Even if we try to solve the problem by piling on more workers to create batches, we’re still hammering on memory (or more likely disk) in an access pattern that is horribly inefficient. Pulling data example by example just doesn’t make sense for tabular data.

NVTabular dataloaders to the rescue!

When we ran into this issue a little over a year ago we started to look at ways to get data to the GPU more efficiently using RAPIDS in conjunction with PyTorch. Since then we’ve been iterating on the concept of dataloading for tabular data and have a solution. We now support fast tabular dataloading in Tensorflow 2.3+, PyTorch 1.2+, and Fast.ai v2 using APIs that are derived from each library. Using Dask-cuDF we were able to share the same back end for all frameworks and use dlpack to transfer data between our dataloader and the batch in tensor format. Our next blog post dropping in a week or so will share the details of our Tensorflow dataloader and we’ll deep dive into the others in the coming weeks.

I won’t overload you with details here, but at a high level the NVTabular dataloader transfers large chunks of data into GPU memory, creating a buffer of multiple batches on the GPU. These batches are grabbed as contiguous memory blocks from the buffer resulting in a much greater efficiency. The number of chunks grabbed and the size of the buffer are completely configurable. We also asynchronously update the buffer so that the GPU always has data available. It’s worth mentioning that the dataset can be much bigger than the total GPU or even CPU memory which we know is important in the recommender system space. An NVME or equivalent fast storage is recommended though for best performance.

Overall results are impressive. In our initial benchmarking of our dataloader in Tensorflow we saw GPU utilization jump by 2x from 40% to 85% with a corresponding 2x speedup in total training time. What’s even more amazing is that when we apply techniques meant to make compute more efficient like Automatic Mixed Precision (AMP) the gains compound to a 4x total speedup, which we don’t see at all when we only apply AMP. Increasing training speed on GPU by 4x over the standard TF windowed dataloader makes training 22x faster on GPU than CPU, a huge difference when comparing total cost to train in a cloud environment. We’ll have more details about each of the frameworks in our dataloader deep dives.

So we’re done?

Not even close. A 22x speedup over CPU is a great start, but there’s always more to be done and a lot of details yet to come. Storage, I/O, and memory bandwidth play a big role in performance. Getting embeddings to fit on the GPU is another interesting challenge that the Merlin HugeCTR team has tackled. There’s also the recently announced A100 80GB cards which will revolutionize the size of embeddings that we can fit, and with 2 TB/s memory bandwidth training times get even faster.

We’ll be sharing further improvements and ideas within the recsys space along with our research and general information related to the training and productionizing of recommender systems. Stay tuned to this blog for more recsys knowledge.

In the meantime go check out our examples in the NVTabular Repo to see how you can use NVTabular to speed up both the training your recommender models and the data preparation phase. We’d love to hear what you’re working on and any gaps you see. You can reach us through our github with any issue or feature request or by leaving a comment here. And finally, if you’re as passionate about recommender systems as we are please check out this open role for the team. We’re growing fast and would love to work with you to help make RecSys fast and easy to use on the GPU.



Even Oldridge

I’m a research scientist working at NVidia on deep learning for tabular data.