GPU-accelerated ML Inference at Pinterest
Unlocking 16% Homefeed Engagement by Serving 100x Bigger Recommender Models
Pong Eksombatchai | Software Engineer, Advanced Technology Group; Zhiyuan Zhang | Engineering Manager, ML Serving Platforms
We enabled serving 100x larger recommender models at Pinterest by transitioning our machine learning serving from CPU to GPU — increasing Homefeed Pinner engagement by 16% through a step function improvement in model quality. In this blog post, we’ll share our optimizations to achieve this at neutral cost and latency, including optimizing individual ops, consolidating memory transfers, executing static graphs on-device through CUDA Graphs, and rethinking our distributed system setup.
Pinterest’s mission is to bring everyone the inspiration to create a life they love. To make our mission a reality, one of the key components in all of our product surfaces are various recommender models whose jobs are to predict the right content to show to the right person at the right time. Our recommender models are machine learning models that we trained using advanced algorithms to understand Pinners’ behavior as they spend time on our app. We serve our recommender models using our in-house machine learning model server (Scorpion Model Server, or SMS).
The technical challenges that we deal with for SMS are very difficult as it has to provide 400+ million Pinners relevant recommendations from a corpus of 300+ billion Pins in milliseconds. SMS performs machine learning inference on CPU and is heavily optimized over the years to fit our stringent latency and infrastructure cost requirements. We were pretty much at the limit of what SMS could do even with the latest generation of CPUs and had to be mindful that our modeling changes justify every latency and infrastructure cost increase.
The problem is made worse by the recent trend in machine learning where there is an explosion in the number of model parameters and computation. Models that are 100x bigger with 100B+ parameters are now commonplace in recommender systems and are commonly described in the industry. At Pinterest however, we took a slightly different path to make our models bigger through computation by using modern model architectures such as Transformers. With bigger models, we immediately noticed step function improvements in model accuracy, which translates to massive increases in Pinner engagement. However, serving these modern models on CPU comes at an exorbitant price and would increase cost and latency up to 40x. Therefore, we turned to using GPUs to accelerate the model inference in order to serve those models at reasonable cost.
When we tried GPU serving out-of-box, we quickly realized optimizations were required before we could cost effectively utilize GPUs for recommender model serving. We started off by using a profiler to examine what was happening under the hood during model inference. While taking a closer look at the profiling results, we noticed a large number of small CUDA kernels on the timeline chart. It is an expected behavior for recommender models where hundreds of features are processed individually before they are concatenated at later stages of the model. However, with a large number of small operations, the overhead of launching the CUDA kernels is more expensive than the actual computation. The problem was exacerbated by having relatively small batch sizes at serving time compared to the batch sizes at training time.
Reducing the number of small model ops
The first approach that we took was to identify opportunities to reduce the number of small operations. We looked for model architecture components that are frequently used and optimized them the best that we could. One example is our embedding table lookup module, which consists of two computation steps: raw id to table index lookup and table index to embedding lookup, which are repeated hundreds of times due to the number of features that we have. We were able to significantly reduce the number of operations by leveraging cuCollections to support hash tables for the raw ids on GPUs and implementing a custom consolidated embedding lookup module to merge the lookups into one lookup. We started seeing better performance right away after performing a few of these optimizations.
Consolidating memory copies
Similarly, there is an opportunity to consolidate our data transfer when we move tensors between host and GPU memories. A common recommender model normally takes hundreds of features as input for each candidate. For every inference, each feature is copied to the GPU memory as individual tensors. While it’s very fast to move data between the host and GPU memories, the overhead of scheduling hundreds of cudaMemcpy() calls for each request quickly becomes the bottleneck.
To resolve this problem, we applied a simple optimization that reduces the the number of cudaMemcpy() calls from hundreds to one: instead of relying on the Torch framework to move tensors to GPU individually, we first put all tensors’ data onto a pre-allocated continuous memory buffer and copy the buffer to GPU once. The GPU tensors are then reconstructed by pointing to the GPU memory buffer by different offsets.
This optimization comes at the cost of explicitly managing the lifecycle of the pre-allocated memory buffers and manually handling GPU memory alignment for various data types. But as a result, the P50 data copy latency was reduced from 10ms to sub-1ms, which justified the extra complexity.
Utilizing CUDA Graph
To further optimize our model inference, we relied on CUDA Graphs to completely eliminate the remaining small operations overhead. CUDA Graphs allowed us to capture the model inference process as a static graph of operation instead of individually scheduled ones, allowing the computation to be executed as a single unit without any kernel launching overheads. We supported CUDA Graph as a new backend of our model server. When a model is initially loaded, the model server executes the model inference once to build the graph instance, which can be executed repeatedly for live traffic.
CUDA Graph comes with a few limitations and brings extra complexity to our model server. The biggest limitation is that CUDA Graph requires all tensors to have static shapes and layouts, which makes it challenging to form dynamically-sized batches and ragged tensors with varying lengths. However, we believed that the tradeoff for significantly better performance was worth it, and we were able to pad input tensors to well-chosen static shapes.
Forming larger batches
Last but not least, we revisited the batching strategy that SMS performs for model inference. SMS supports dynamic batching, which allows merging items from multiple requests into larger batches. It normally leads to better throughput at the cost of a short waiting time to gather enough items from the request queue. For ML inference on CPU, we normally want to increase parallelism and reduce latency by splitting the request into small batches. However for GPUs, the latency is less sensitive to the batch sizes, and it is important to form larger batches to make the inference workload efficient for GPUs.
This requirement on batch sizes made us revisit the distributed system setup in SMS. For ML inference on CPU, we used a scatter-gather architecture to split the original request into small ones and run them in parallel on multiple leaf nodes for better latency. Additionally, the architecture allowed us to assign a fixed data shard to each leaf node to optimize the cache hit-rate during feature fetching. However, since small batches are not preferred with GPU anymore, it makes more sense to remove the root layer and utilize the larger batches in the original requests directly. We ended up using CacheLib’s hybrid cache which utilizes both DRAM and SSD to compensate for the cache capacity loss compared to the scatter-gather architecture setup.
We first measure the latency for a single run of model inference. We use c5.18xlarge AWS instances for CPU serving and g5.4xlarge AWS instances for GPU serving.
The CPU latency scales linearly with the batch size. The GPU latency with smaller batch sizes are virtually the same due to the kernel launch cost dominating the latency. However as the batch size increases, the actual computation dominates the latency and the GPU latency scales in a sub-linear fashion. In practice, SMS works with bigger batches where the GPU efficiency improvement shines. With all the optimizations combined, we achieved amazing results where GPU serving improves latency for bigger batch sizes by more than 100x compared to CPU serving.
Our server metrics show equally impressive results. By optimizing model ops, revisiting the distributed system setup, optimizing data transfer, and utilizing CUDA Graphs, we are able to serve a 77x bigger model at 30% lower latency and support 20% more throughput at neutral cost.
Last but not least, the two order of magnitude increase in efficiency unlocks state of the art recommender model architectures at Pinterest. We see a step function improvement in model quality, which translates directly to massive engagement wins. Over the past year, we were able to increase engagement on one of our major product surfaces by 16% with neutral infra cost. We are on track to launch our biggest model yet which is more than 100x bigger than our CPU model very soon.
Our path to transition our CPU-based model server to a GPU-based one was complicated, but it is a necessary step for us to enable state of the art recommender models at Pinterest. We are able to serve 100x bigger recommender models at neutral cost which provides a basis for our ML engineers to unlock more relevant and responsive recommendations for our Pinners.
This project is the result of a close collaboration from many teams at Pinterest. We’d like to thank the following people for their contributions: Po-Wei Wang, Nazanin Farahpour, Saurabh Vishwas Joshi, Xue Xia, Chia-Wei Chen, Prabhat Agarwal, Li Tang, Sihan Wang, Dhruvil Deven Badani, Karthik Anantha Padmanabhan, Andrew Zhai, all our partners in SRE teams, AWS and many others. Also a special thanks to our partners at NVIDIA for their technical support and guidance.