How we scaled the size of Pinterest’s ad corpus by 60x

Pinterest Engineering
Pinterest Engineering Blog
7 min readAug 5, 2021


Nishant Roy | Tech Lead, Ads Serving Platform

In May 2020, Pinterest launched a partnership with Shopify that allowed merchants to easily upload their catalogs to the Pinterest platform and create Product Pins and shopping ads. This vastly increased the number of shopping ads in our corpus available for our recommendation engine to choose from, when serving an ad on Pinterest. In order to continue to support this rapid growth, we leveraged a key-value (KV) store and some memory optimizations in Go to scale the size of our ad corpus by 60x. We had three main goals:

  • Simplify scaling our ads business without a linear increase in infrastructure costs
  • Improve system performance
  • Minimize maintenance costs to boost developer productivity

This blog explains how we scaled a business-critical, high traffic recommendation engine and the benefits we saw from it.


In 2018, here’s what the ads-serving architecture looked like:

Flowchart showing how data flows through the old Pinterest Ads-Serving Architecture in 2018, and the online request path. The Index Publisher reads from the Ads Database and uploads index data to the candidate funnel and candidate ranking systems. The Ads Mixer receives an ads request from downstream clients and sends sequential requests to candidate retrieval, candidate funnel, and candidate ranking, and returns the winning ads to the client.
Fig 1: Ads-Serving Architecture in 2018
  • An offline workflow would run every few hours to publish the index of active ads to the system. The candidate funnel service would load this index in memory. Due to memory constraints, the funnel service had nine shards.
  • Downstream clients would then call the ads mixer service, which would perform feature expansion
  • Next, the candidate retrieval service would return a list of the best candidates for the given user and context features
  • Then, the candidate funnel would enrich the candidates with index data, call the ranking service for scoring, and trim the candidates based on various rules (e.g. relevance, price, budget, etc.)
  • Finally, the mixer service would run the auction to select the best candidates, assign prices, and return the ads to show to the user

This system served us well for a few years, but we soon started to run into a few challenges that were preventing us from scaling our ads business.


Memory Bottleneck

As our ads corpus started to grow, so did our index size, causing our candidate funnel service to start hitting memory limits. A larger corpus is highly desirable; it improves the quality of ads that Pinners see, allows us to onboard more advertisers, and improves their ad delivery rates.

We considered a couple short-term solutions:

  • Adding more shards: By growing the number of shards to be greater than nine, we could support a large ads index. However, this is an expensive and complex process. Adding more clusters would increase our infrastructure and maintenance costs, and it is not entirely future-proof (i.e., we may have to reshard again in the future).
  • Vertical scaling: We experimented with EC2 instances that were twice as large and cut the cluster size in half to keep cost constant. However, we found that our service was concurrency bound, so the smaller fleet couldn’t handle the same amount of traffic. This would be a harder problem to solve and is also not future-proof.

Solution: Deprecate In-Memory Index

Rather than horizontally or vertically scaling the existing service, we decided to move the in-memory index to an external data store. This removed the need for shards entirely, allowing us to merge all nine shards into a single, stateless, ads-mixer service. This vastly simplified the system as well, bringing us down from 10 clusters to just one.

Flowchart showing how data flows through the new Pinterest Ads-Serving Architecture and the online request path. The Index Publisher reads from the Ads Database and sends batch and realtime data updates to a distributed key-value store. The Ads Mixer receives an ads request from downstream clients, calls candidate retrieval, then sends parallel requests to candidate ranking for scoring and key-value store for index data, and returns the winning ads to the client.
Fig 2: Ads-Serving Architecture now

Another benefit of removing the in-memory index was that it significantly reduced startup time from 10 minutes to <2 minutes (since we no longer needed to parse and load this index into memory). This allowed us to move from time-based cluster auto scaling to CPU-based cluster auto scaling, which makes our infra cost more reflective of actual traffic as opposed to dependent on hard coded capacity numbers that are not tuned frequently enough. It also makes our cluster more resilient to growth in traffic, since a larger cluster is automatically provisioned, without the need for manual intervention.

Garbage Collection CPU Pressure

The Go Garbage Collector (GC) is heavily optimized and provides good performance for most use cases. However, for systems with a high rate of allocation, Go GC starts to steal CPU resources from the main program to ensure the rate of allocation is not greater than the rate of collection (which would cause an OOM crash). As a result, Go GC becomes more expensive as the number of heap objects grows, since it needs to scan every object on the heap to identify what is collectible.

We found that our system was subject to this phenomenon, causing intermittent spikes of nearly 10% in CPU usage. This caused large latency spikes and drops in success rate, which meant we were losing potential ad impressions.

Solution: Heap Optimizations

We were able to reduce the number of objects on the heap from ~280 million to ~60 million, which significantly improved our system performance — almost a 10% reduction in tail latency and a 1% improvement in success rate. We achieved this by cleaning up unused data fields, reducing the number of long-living objects in favor of creating objects on demand, and pooling objects to reduce large allocation bursts.

Line graph showing the number of objects on heap for the Ads-Serving service. The number of objects drops from 280 million to 60 million after the memory optimizations were applied.
Fig 3: Number of Objects on Heap for Ads-Serving

After entirely deprecating the in-memory forward index, we also saw the intermittent CPU spikes smooth out, making our system more stable and reliable.

Line graph showing the CPU usage for the Ads-Serving service. The CPU usage used to spike by nearly 10–15% due to GC pressure, and the spikes were entirely eliminated after the GC optimizations were applied.
Fig 4: CPU usage before and after GC optimizations

For more detailed information on how Go GC works, how it may cause performance regressions, and how to optimize your system’s performance, you may refer to this blog post or talk on Heap Optimization for Go Systems.

Scale & Latency

Now that the index data is in an external KV store, we need to introduce another RPC to fetch the data. Since we can’t process the ad candidates without this index data, we are blocked till we get the response back. Waiting for this response would increase our end-to-end latency by ~12%. This would impact our success rate, resulting in fewer ad impressions and degraded Pinner experience due to slower responses. We are also requesting data for millions of candidates per second, resulting in high infrastructure cost for the KV store cluster.

Flowchart showing the problems of introducing the index fetcher stage — an increase in latency in the critical path since it’s called sequentially, and a very high QPS to the key-value store, requiring several replicas, causing high infra cost.
Fig 5: Increased ads-serving latency and infra cost to fetch index data from external KV Store

Solution: Parallelization and Caching

By running the index fetcher in parallel with the ranking stage, we were able to minimize the latency impact. Looking at Figure 5 above, we see that both the ranking stage and index fetcher are primarily blocked on RPCs. The ranking stage is the slowest stage in the funnel, so its latency typically dominates.

Flowchart showing that by parallelizing the ranking stage and the index fetcher stage, we minimize the latency impact, but this does not reduce the QPS to the key-value store and therefore does not reduce the infra cost.
Fig 6: Parallelizing candidate ranking and index retrieval to minimize latency

Next, by implementing a local cache in the ads-serving funnel, we were able to reduce traffic to the KV store by ~94%. We set a low Time to live (TTL), on the order of minutes, to maintain a balance between high traffic and data freshness.

Flowchart showing that by adding a local cache for the index fetcher stage, we can significantly lower the QPS to the KV store cluster and minimize the infra cost.
Fig 7: Adding a local cache to lower the QPS to the KV Store cluster, and minimize infra cost
Line graph showing that by introducing a local cache, we lowered the QPS to the key-value store by 94%
Fig 8: Reduction in QPS to KV Store cluster due to local cache. 6% and 10% traffic refers to how much experimental traffic was being served by the new architecture.

Data Delay

Due to delays in the KV Store update pipeline relative to candidate retrieval, newly created ads were frequently missing from the KV Store index, preventing them from being served to the user. This delay was due to the fact that the batch workflows for updating the KV Store were run every few hours, while the candidate retrieval index was updated in real time.

Even for ads that were not missing, the data could be stale. Since this data includes ad budgets, we could potentially be serving ads that have already exhausted their budget, resulting in “overdelivery” (i.e. ad impressions that are not billable).

Solution: Real-Time Data Updates

By enabling real-time updates for the KV Store, we were able to reduce the number of missing candidates and improve data freshness from a few hours to a few minutes. Our KV store follows a lambda architecture pattern, which allows us to push real-time updates through Kafka in addition to pushing batch workflow updates.

Flowchart demonstrating how the lambda architecture for the distributed key-value store works, receiving a batch data upload from a Hadoop workflow, and realtime data from Kafka, and merging them together.
Fig 8: Lambda Architecture with Batch + Real-time Data Updates
Line graph showing that the number of missing keys in the key-value store reduced drastically after we enabled real-time index data updates.
Fig 9: Drop in missing keys due to real-time index data updates


We redesigned our ads-serving architecture to address our memory and performance bottlenecks and prepare for future growth of the ads corpus. In fact, we were able to quadruple our ads index size in 2020, just a few months after launching this new system, with no adverse impact except for a slightly lower cache hit rate. We also greatly reduced the maintenance costs and on-call burden by vastly simplifying the system.

We were able to improve our system performance, reduce latency, and improve reliability, which resulted in an increase in ad revenue, ad impressions, and ad clicks. Our performance improvements also resulted in a large reduction in our infrastructure costs, since we were able to run at a much higher CPU utilization without any degradation in quality.


A huge shout out to everyone who helped make this multi-year endeavor a success. I would like to thank Shu Zhang, Joey Wang, Zack Drach, Liang He, Mingsi Liu, Sreshta Vijayaraghavan, Di An, Shawn Nguyen, Ang Xu, Pihui Wei, Danyal Raza, Caijie Zhang, Chengcheng Hu, Jessica Chan, Rajath Prasad, Indy Prentice, Keyi Chen, Sergei Radutnuy, Chen Hu, Siping Ji, Hari Venkatesan, and Javier Llaca Ojinaga for their help in the design and execution of this project.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.