GPU-accelerated ML Inference at Pinterest

Unlocking 16% Homefeed Engagement by Serving 100x Bigger Recommender Models

Pinterest Engineering
Aug 4, 2022

Pong Eksombatchai | Software Engineer, Advanced Technology Group; Zhiyuan Zhang | Engineering Manager, ML Serving Platforms


We enabled serving 100x larger recommender models at Pinterest by transitioning our machine learning serving from CPU to GPU — increasing Homefeed Pinner engagement by 16% through a step function improvement in model quality. In this blog post, we’ll share our optimizations to achieve this at neutral cost and latency, including optimizing individual ops, consolidating memory transfers, executing static graphs on-device through CUDA Graphs, and rethinking our distributed system setup.

Background

Pinterest’s mission is to bring everyone the inspiration to create a life they love. To make that mission a reality, one of the key components across all of our product surfaces is a set of recommender models whose job is to predict the right content to show to the right person at the right time. These are machine learning models trained to understand Pinners’ behavior as they spend time on our app. We serve our recommender models using our in-house machine learning model server (Scorpion Model Server, or SMS).

The technical challenges we deal with in SMS are substantial: it has to serve 400+ million Pinners relevant recommendations from a corpus of 300+ billion Pins in milliseconds. SMS performs machine learning inference on CPU and has been heavily optimized over the years to fit our stringent latency and infrastructure cost requirements. Even with the latest generation of CPUs, we were close to the limit of what SMS could do, and we had to make sure that every increase in latency or infrastructure cost was justified by the corresponding modeling improvement.

The problem is made worse by recent trends in machine learning, where the number of model parameters and the amount of computation have exploded. Models that are 100x bigger, with 100B+ parameters, are now commonly described across the recommender systems industry. At Pinterest, however, we took a slightly different path: we made our models bigger primarily through computation, using modern model architectures such as Transformers. With bigger models, we immediately noticed step function improvements in model accuracy, which translate to massive increases in Pinner engagement. However, serving these modern models on CPU comes at an exorbitant price, increasing cost and latency by up to 40x. We therefore turned to GPUs to accelerate model inference and serve these models at reasonable cost.

[Figure: Cumulative engagement win, cumulative model size increase, and infra cost / latency increase (CPU) for each model variation: Bigger MLP, Transformer, User Action Sequence + Transformer, DCNv2 + Bigger Transformer, and Even Bigger DCNv2 and Transformer (not yet launched). Infra cost on CPU increases with each variation.]

Optimizations

When we tried GPU serving out of the box, we quickly realized that optimizations were required before we could cost-effectively use GPUs for recommender model serving. We started by using a profiler to examine what was happening under the hood during model inference. Looking closer at the profiling results, we noticed a large number of small CUDA kernels on the timeline chart. This is expected behavior for recommender models, where hundreds of features are processed individually before being concatenated at later stages of the model. With so many small operations, however, the overhead of launching CUDA kernels becomes more expensive than the actual computation. The problem was exacerbated by the relatively small batch sizes at serving time compared to the batch sizes at training time.

[Figure] Profiling results before (top) and after (bottom) optimizations. The CUDA kernel timeline (highlighted by red boxes) shows that the kernel launch overhead (gaps between blue blocks) is significantly reduced, so the GPU is better utilized and more cycles are spent on kernel execution.
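As a minimal sketch of this kind of investigation, here is how one might trace a single inference pass with the PyTorch profiler and inspect kernel launches. The model and input below are placeholders, not SMS internals; the post does not say which profiler was used, so this is one reasonable option.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch standing in for the real recommender model.
model = torch.nn.Linear(256, 1).cuda().eval()
batch = torch.randn(128, 256, device="cuda")

# Trace CPU-side launches and CUDA kernel execution for one inference pass.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(batch)

# Summarize the most expensive ops and export a timeline for visual inspection.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("inference_trace.json")  # open in chrome://tracing or Perfetto
```

A timeline dominated by many short kernels with gaps between them is the signature of launch overhead rather than compute being the bottleneck.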

Reducing the number of small model ops

The first approach we took was to identify opportunities to reduce the number of small operations. We looked for model architecture components that are used frequently and optimized them as best we could. One example is our embedding table lookup module, which consists of two computation steps: a raw-id-to-table-index lookup and a table-index-to-embedding lookup, repeated hundreds of times because of the number of features we have. We significantly reduced the number of operations by leveraging cuCollections to support GPU hash tables for the raw ids and by implementing a custom consolidated embedding lookup module that merges the individual lookups into a single one. We started seeing better performance right away after a few of these optimizations.
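To illustrate the consolidation idea, here is a minimal sketch of merging many per-feature embedding lookups into a single lookup, assuming table-local indices have already been produced by the GPU hash-table (cuCollections) step, which is omitted. The class and variable names are illustrative, not Pinterest code.

```python
import torch

class ConsolidatedEmbeddingLookup(torch.nn.Module):
    def __init__(self, table_sizes, dim):
        super().__init__()
        # One large table holding every feature's embedding rows back to back.
        self.merged_table = torch.nn.Embedding(sum(table_sizes), dim)
        # Row offset of each feature's sub-table inside the merged table.
        offsets = torch.tensor([0] + list(table_sizes[:-1])).cumsum(0)
        self.register_buffer("table_offsets", offsets)

    def forward(self, indices):
        # indices: [batch, num_features] table-local indices.
        # Shifting by per-table offsets lets one lookup kernel replace
        # num_features separate embedding lookups.
        return self.merged_table(indices + self.table_offsets)

# Usage: 3 features with tables of 1000, 500, and 2000 rows, 64-dim embeddings.
lookup = ConsolidatedEmbeddingLookup([1000, 500, 2000], 64)
batch_indices = torch.stack(
    [torch.randint(0, n, (8,)) for n in (1000, 500, 2000)], dim=1
)
embeddings = lookup(batch_indices)  # shape [8, 3, 64]
```

The point is simply that hundreds of tiny lookup kernels collapse into one, which is where the launch-overhead savings come from.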

Consolidating memory copies

Similarly, there is an opportunity to consolidate our data transfer when we move tensors between host and GPU memory. A common recommender model takes hundreds of features as input for each candidate. For every inference, each feature is copied to GPU memory as an individual tensor. While it’s very fast to move data between host and GPU memory, the overhead of scheduling hundreds of cudaMemcpy() calls for each request quickly becomes the bottleneck.

To resolve this problem, we applied a simple optimization that reduces the number of cudaMemcpy() calls from hundreds to one: instead of relying on the Torch framework to move tensors to the GPU individually, we first pack all tensors’ data into a pre-allocated, contiguous memory buffer and copy that buffer to the GPU once. The GPU tensors are then reconstructed as views into the GPU memory buffer at different offsets.

[Figure] Copying tensors from host to GPU individually vs. copying the memory buffer one time.

This optimization comes at the cost of explicitly managing the lifecycle of the pre-allocated memory buffers and manually handling GPU memory alignment for various data types. But as a result, the P50 data copy latency was reduced from 10ms to sub-1ms, which justified the extra complexity.
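A minimal sketch of this consolidation in PyTorch follows. It assumes all features share a single dtype (float32) and allocates the staging buffer per call for clarity; the real system pre-allocates and reuses buffers and handles alignment for mixed data types, as noted above. The function name is illustrative.

```python
import torch

def copy_features_to_gpu(features, device="cuda"):
    # features: list of float32 CPU tensors, one per feature.
    sizes = [f.numel() for f in features]
    total = sum(sizes)

    # Page-locked (pinned) host staging buffer; in production this would be
    # pre-allocated once and reused across requests.
    host_buffer = torch.empty(total, dtype=torch.float32, pin_memory=True)
    offset = 0
    for f, n in zip(features, sizes):
        host_buffer[offset:offset + n].copy_(f.reshape(-1))
        offset += n

    # One host-to-device transfer instead of len(features) transfers.
    gpu_buffer = torch.empty(total, dtype=torch.float32, device=device)
    gpu_buffer.copy_(host_buffer, non_blocking=True)

    # Reconstruct per-feature GPU tensors as offset views into the buffer.
    gpu_tensors, offset = [], 0
    for f, n in zip(features, sizes):
        gpu_tensors.append(gpu_buffer[offset:offset + n].view(f.shape))
        offset += n
    return gpu_tensors
```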

Utilizing CUDA Graph

To further optimize model inference, we relied on CUDA Graphs to eliminate the remaining overhead of small operations. CUDA Graphs let us capture the model inference process as a static graph of operations instead of a sequence of individually scheduled ones, so the computation can be executed as a single unit without any kernel launch overhead. We added CUDA Graph support as a new backend of our model server: when a model is first loaded, the server executes the model inference once to build the graph instance, which can then be executed repeatedly for live traffic.

[Figure] CUDA Graph executes the kernels in one batch (bottom) instead of one-by-one in a sequence (top), which removes the CPU launch-overhead gaps between kernels. Diagram credited to https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/.

CUDA Graph comes with a few limitations and brings extra complexity to our model server. The biggest limitation is that CUDA Graph requires all tensors to have static shapes and layouts, which makes it challenging to form dynamically-sized batches and ragged tensors with varying lengths. However, we believed that the tradeoff for significantly better performance was worth it, and we were able to pad input tensors to well-chosen static shapes.
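A minimal sketch of graph capture and replay with PyTorch’s torch.cuda.CUDAGraph is shown below. The model, batch size, and padding scheme are placeholders, not the SMS backend; the key ideas are the static input/output tensors and the single-launch replay.

```python
import torch

model = torch.nn.Linear(256, 1).cuda().eval()       # stand-in for the real model
static_input = torch.zeros(512, 256, device="cuda")  # fixed (padded) batch shape

# Warm up on a side stream before capture, as required for CUDA Graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        for _ in range(3):
            model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one inference pass into a graph; static_output is reused on replay.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

def infer(batch):
    # Pad the live batch into the captured static input, then replay the whole
    # graph with a single launch instead of scheduling each kernel.
    static_input.zero_()
    static_input[: batch.shape[0]].copy_(batch)
    graph.replay()
    return static_output[: batch.shape[0]].clone()
```

Copying live requests into fixed-shape static tensors is exactly where the padding to well-chosen static shapes mentioned above comes in.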

Forming larger batches

Last but not least, we revisited the batching strategy that SMS uses for model inference. SMS supports dynamic batching, which merges items from multiple requests into larger batches. This normally yields better throughput at the cost of a short wait to gather enough items from the request queue. For ML inference on CPU, we normally want to increase parallelism and reduce latency by splitting requests into small batches. For GPUs, however, latency is much less sensitive to batch size, and forming larger batches is important to keep the inference workload efficient.
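For illustration, a dynamic batcher boils down to logic like the following sketch (not SMS code): collect items until either the target batch size is reached or a short wait budget expires, then run one large inference.

```python
import queue
import time

def gather_batch(request_queue, max_batch_size=512, max_wait_ms=5):
    # Block for the first item, then opportunistically gather more until the
    # batch is full or the wait budget is exhausted.
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The tunable tradeoff is the wait budget: a few extra milliseconds of queueing buys a much larger, more GPU-efficient batch.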

This requirement on batch sizes made us revisit the distributed system setup in SMS. For ML inference on CPU, we used a scatter-gather architecture to split the original request into small ones and run them in parallel on multiple leaf nodes for better latency. The architecture also allowed us to assign a fixed data shard to each leaf node to optimize the cache hit rate during feature fetching. Since small batches are no longer preferred on GPU, it makes more sense to remove the root layer and use the larger batches in the original requests directly. We ended up using CacheLib’s hybrid cache, which uses both DRAM and SSD, to compensate for the cache capacity lost relative to the scatter-gather setup.

Results

We first measure the latency for a single run of model inference. We use c5.18xlarge AWS instances for CPU serving and g5.4xlarge AWS instances for GPU serving.

[Figure: Inference latencies before and after optimizations (log scale). Latency (y-axis) vs. batch size (x-axis) for CPU, out-of-the-box GPU, and optimized GPU. All latencies increase with batch size.]

The CPU latency scales linearly with the batch size. The GPU latency at smaller batch sizes is virtually flat, because the kernel launch cost dominates. As the batch size increases, the actual computation dominates and the GPU latency scales sub-linearly. In practice, SMS works with bigger batches, where the GPU efficiency improvement shines. With all the optimizations combined, GPU serving improves latency for bigger batch sizes by more than 100x compared to CPU serving.

Latency improvement vs. CPU baseline:
- Batch size 256: out-of-the-box GPU 70x, optimized GPU 109x
- Batch size 512: out-of-the-box GPU 93x, optimized GPU 128x

Our server metrics show equally impressive results. By optimizing model ops, revisiting the distributed system setup, optimizing data transfer, and utilizing CUDA Graphs, we are able to serve a 77x bigger model at 30% lower latency and support 20% more throughput at neutral cost.

GPU SMS improvement vs. CPU SMS:
- Model size: 30x (out-of-the-box GPU) vs. 77x (optimized GPU)
- Latency: +332% vs. -29%
- Throughput: -76% vs. +23%
- Cost: 1x for both

Last but not least, the two-orders-of-magnitude increase in efficiency unlocks state-of-the-art recommender model architectures at Pinterest. We see a step function improvement in model quality, which translates directly to massive engagement wins. Over the past year, we were able to increase engagement on one of our major product surfaces by 16% at neutral infra cost. We are on track to launch our biggest model yet, more than 100x bigger than our CPU-served models, very soon.

[Figure: Cumulative engagement win, cumulative model size increase, and infra cost / latency increase (GPU) for each model variation: Bigger MLP, Transformer, User Action Sequence + Transformer, DCNv2 + Bigger Transformer, and Even Bigger DCNv2 and Transformer (not yet launched). Infra cost on GPU stays at 1x across the board.]

Conclusion

Our path from a CPU-based model server to a GPU-based one was complicated, but it was a necessary step to enable state-of-the-art recommender models at Pinterest. We can now serve 100x bigger recommender models at neutral cost, which gives our ML engineers a foundation to unlock more relevant and responsive recommendations for our Pinners.

Acknowledgements

This project is the result of a close collaboration from many teams at Pinterest. We’d like to thank the following people for their contributions: Po-Wei Wang, Nazanin Farahpour, Saurabh Vishwas Joshi, Xue Xia, Chia-Wei Chen, Prabhat Agarwal, Li Tang, Sihan Wang, Dhruvil Deven Badani, Karthik Anantha Padmanabhan, Andrew Zhai, all our partners in SRE teams, AWS and many others. Also a special thanks to our partners at NVIDIA for their technical support and guidance.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
