<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Pinterest Engineering Blog - Medium]]></title>
        <description><![CDATA[Inventive engineers building the first visual discovery engine, 300 billion ideas and counting. - Medium]]></description>
        <link>https://medium.com/pinterest-engineering?source=rss----4c5a5f6279b6---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Pinterest Engineering Blog - Medium</title>
            <link>https://medium.com/pinterest-engineering?source=rss----4c5a5f6279b6---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 06 May 2026 14:24:20 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/pinterest-engineering" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer]]></title>
            <link>https://medium.com/pinterest-engineering/optimizing-ml-workload-network-efficiency-part-i-feature-trimmer-ae20beb08d69?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/ae20beb08d69</guid>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[efficiency]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Fri, 01 May 2026 16:01:02 GMT</pubDate>
            <atom:updated>2026-05-01T16:01:01.977Z</atom:updated>
<content:encoded><![CDATA[<p>Guangtong Bai | Staff Software Engineer, Product ML Infrastructure*; Shantam Shorewala | Software Engineer II, Product ML Infrastructure*; Chi Zhang | Staff Software Engineer, AI Platform*; Neha Upadhyay | Software Engineer II, AI Platform*; Haoyang Li | Director, Product ML Infrastructure</p><p><em>*These authors contributed equally to this article.</em></p><h3>Background</h3><p>At Pinterest, our online ML serving systems employ a root-leaf architecture. At a high level, the architecture looks as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BRB0_pxzDH7wnHwiJplvVg.png" /><figcaption><em>Figure 1: Root-leaf Architecture of Online ML Serving Systems at Pinterest</em></figcaption></figure><p>In the diagram, “Client Service” is responsible for recommending organic or promoted Pins to users. To know whether a given Pin is relevant to a particular user request, the client service sends a score request to the online ML serving system to have the Pin scored by a set of ML models, each of which scores an aspect of “relevancy”.</p><p>The online ML serving system is composed of two parts:</p><ol><li><strong>Root:</strong> This component handles initial feature processing. Its responsibilities include retrieving necessary features from the feature store, performing required preprocessing, and distributing (fanning out) the scoring requests to the various leaf partitions.</li><li><strong>Leaf:</strong> This is where the actual model inference takes place, typically utilizing GPU machines. It is structured into multiple partitions, each of which hosts a related group of models, such as one production model and several experimental variants.</li></ol><p>What flows between these services is ML features. In this blog, we share how passing too many features from root to leaf created a network bottleneck and how we resolved it with Feature Trimmer.</p><h3>Motivation</h3><p>The root-leaf architecture provides us with significant benefits, namely:</p><ol><li><strong>Simplified Model Onboarding:</strong> New ML models can easily be onboarded for online serving by creating new leaf partitions, transparent to root and upstream clients.</li><li><strong>Reduced Feature Store QPS:</strong> The system minimizes RPCs to the feature store for fetching ML features by having all leaf partitions share a large in-memory feature cache in the root.</li><li><strong>Optimized Resource Utilization:</strong> Separating CPU (feature fetching, preprocessing) and GPU (model inference) workloads allows for optimized resource use, improving efficiency and reducing cost.</li></ol><p>However, this setup introduced a new challenge — <strong>the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute</strong>. We observed this pressure in the Ads server on both the root and leaf partitions:</p><ul><li>On leaf partitions, peak network usage was significantly higher than peak GPU SM activity (see Figure 2). Consequently, the network bottleneck prevented us from fully utilizing the available GPU compute power.</li><li>On root, we had to use the network-optimized AWS instance type m6in to ensure the server latency met our internal SLA.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Fum5MdMGTpIw4Ho18ia7Qg.png" /><figcaption><em>Figure 2.
Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server</em></figcaption></figure><p>That led to a straightforward idea: reduce the root-leaf network bandwidth usage to unlock immediate fleet downscaling and infrastructure savings. If we could cut bandwidth enough, we could also move the root from network-optimized m6in instances to standard m6i instances (about 20% cheaper), further reducing cost.</p><h3>Enable compression to reduce network usage</h3><p>The most direct way to reduce the root-leaf network bandwidth usage is to compress the requests between them.</p><p>This compression strategy is well-suited for the requests sent from the root to the leaf, which primarily carry ML features for multiple candidate Pins for a given user request. These requests are compressible for several reasons:</p><ol><li><strong>Feature Set Consistency:</strong> The set of features requested is identical across different candidate Pins, although the actual feature values vary.</li><li><strong>Feature Similarity:</strong> There are groups of features that share similar representations (e.g., last_x_pins_user_viewed and last_x_pins_user_clicked).</li><li><strong>Sparsity:</strong> Many features are sparse, containing numerous empty or zero values.</li></ol><p>After a few quick tests, we enabled lz4 compression in fbthrift (the RPC framework used by root and leaf) for root-leaf traffic. That reduced root-leaf network usage by 20%, at the cost of a 5% increase in CPU usage and a 5 ms (~10%) increase in p90 latency.</p><p>Compression was a solid early win, but it didn’t change the underlying problem: we were still shipping too much unused data. The bigger lever was to stop sending unused features altogether, which led to our “Send What You Use” approach.</p><h3>Send What You Use</h3><p>In our root–leaf architecture, the root is shared across many leaf partitions and must fetch ML features for all models. To minimize feature store QPS, the root fetches the union of features needed across models (per candidate Pin), stores them in an efficient in-memory cache, and then fans out the full feature set to each leaf model. Each model converts and uses only the features it needs; the rest are effectively discarded before inference.</p><p>This approach was acceptable in our prior architecture, where the same GPU host handled both feature fetching/preprocessing and local model inference. In that context, the unnecessary features only increased main memory usage, which was not a bottleneck on GPU machines. However, within the new root-leaf architecture, transmitting these unneeded features across the network introduces a significant efficiency problem.</p><p>If we could send only the required features and trim everything else, similar to C++’s “<a href="https://include-what-you-use.org/">include what you use</a>” header management tool removing unnecessary #includes, we could potentially cut root-leaf network usage by ~50%. Like compression, this trades network savings for some additional CPU work and potential latency overhead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KSUZfYt1hiKiSnjjg1Plig.png" /><figcaption><em>Figure 3: Overview of the ML inference engine with root-leaf setup and feature trimming</em></figcaption></figure><p>To make this work, the root must know the exact feature list required by each leaf model.
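</p><p>In essence, trimming is a per-model filter over the per-candidate feature maps in the fan-out request. As a rough illustration (a Python sketch of the idea, not our production service code; the helper name is ours):</p><pre># Hedged sketch: keep only the features one leaf model actually consumes.<br># `candidates` holds one feature map per candidate Pin.<br>def trim_request(candidates, allowlist):<br>    return [<br>        {name: value for name, value in features.items() if name in allowlist}<br>        for features in candidates<br>    ]</pre><p>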
Since models refresh continuously, we also need to keep the feature allowlist on root in sync with the feature expectations of the latest model version on the leaf.</p><h4>Source of Truth: Model Signature</h4><p>The source of truth for which features are needed by a model is its <em>model signature</em>. The model signature defines the inputs and outputs of a model, similar to a function signature. When a model version finishes training, its model signature is exported as an extra file alongside the TorchScript artifact in the .pt archive file. Below is what a model signature looks like:</p><pre>❯ unzip -p model.pt archive/extra/module_info.json | jq<br>{<br>  &quot;input_names&quot;: [<br>    &quot;feature_id_1&quot;,<br>    &quot;feature_id_2&quot;,<br>    &quot;feature_id_3&quot;,<br>    ...<br>  ],<br>  &quot;output_names&quot;: [<br>    &quot;output_score_1&quot;,<br>    &quot;output_score_2&quot;<br>  ]<br>}</pre><p>When the leaf loads a specific model version from the .pt archive, it not only deserializes the weights from the TorchScript artifact, but also builds a feature converter from the model signature. The converter transforms input features from the internal company format into native PyTorch tensors before passing them to the model. Because it knows the model’s inputs, it converts only the required features and discards the rest.</p><p>A crucial convention is that a model’s signature remains unchanged across different versions. If a signature modification is necessary — for instance, to introduce a new input feature — a new model is forked from the original. This practice is essential because it underpins the fallback mechanism for the versioned lookup feature of the Feature Trimmer, a topic discussed in detail later in the “Versioned Lookups and Fallback” section.</p><h4>Model Deploy Synchronization</h4><p>Feature Trimmer only works if the root knows exactly the features that the leaf model expects.
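</p><p>In the simplest world, keeping the root informed is one JSON parse per model. A minimal sketch, assuming the module_info.json layout shown above (the helper name is ours):</p><pre>import json<br><br># Minimal sketch: derive a feature allowlist from an exported signature.<br>def load_allowlist(module_info_path):<br>    with open(module_info_path) as f:<br>        signature = json.load(f)<br>    return set(signature[&quot;input_names&quot;])</pre><p>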
That sounds simple until you factor in reality: models are refreshed frequently (hourly to daily), multiple models are shipped together as a “bundle”, and rollouts happen gradually (canary → prod, rolling deploys, occasional rollbacks).</p><p>This section explains how we keep the root up to date with what’s actually deployed on the leaf without adding heavy runtime dependencies or introducing brittle, manually managed configs.</p><p>At a high level, our approach is:</p><ul><li><strong>Treat the model signature as the source of truth,</strong> exported as module_info.json.</li><li><strong>Publish signatures as lightweight artifacts</strong> that can be consumed by deployment pipelines.</li><li><strong>Aggregate per-model signatures into a per-bundle artifact</strong> that is deployed to the root alongside existing root configs.</li><li><strong>Use the same staged delivery semantics as model rollout</strong> (canary, automated canary analysis, prod, rollback), so trimmer config changes ride the same operational rails as everything else.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T26gZ-QGzGZiOL1AJ91P_w.png" /><figcaption><em>Figure 4: Root configurations artifact generation and delivery integrated with existing model deployment</em></figcaption></figure><p><strong>Publish module_info.json as a standalone artifact</strong></p><p>To make the model signature easy to ship and consume, we export module_info.json as a standalone file as part of the model training workflow, next to other model files (for example, alongside the model artifact and config files). This is important for synchronization as it ensures signatures are available before deployment, and available in a form that can be aggregated and deployed without any heavy runtime dependencies.</p><p><strong>Generate a bundle-level module_info mapping during bundle build</strong></p><p>In production, roots don’t serve a single model; they typically serve bundles containing multiple models (and sometimes multiple versions during a rollout window). So instead of deploying N per-model signatures independently, the bundle pipeline generates one bundle-level artifact that looks like:</p><pre>{<br>  &quot;model_A&quot;: [<br>    {<br>      &quot;version&quot;: &quot;1&quot;,<br>      &quot;input_names&quot;: [&quot;feature_id_1&quot;, &quot;feature_id_2&quot;, &quot;...&quot;],<br>      &quot;output_names&quot;: [&quot;score_1&quot;, &quot;...&quot;]<br>    },<br>    {<br>      &quot;version&quot;: &quot;2&quot;,<br>      &quot;input_names&quot;: [&quot;feature_id_1&quot;, &quot;feature_id_2&quot;, &quot;...&quot;],<br>      &quot;output_names&quot;: [&quot;score_1&quot;, &quot;...&quot;]<br>    }<br>  ],<br>  &quot;model_B&quot;: [<br>    {<br>      &quot;version&quot;: &quot;7&quot;,<br>      &quot;input_names&quot;: [&quot;feature_id_9&quot;, &quot;...&quot;],<br>      &quot;output_names&quot;: [&quot;score_x&quot;, &quot;...&quot;]<br>    }<br>  ]<br>}</pre><p>During the build step, the model deploy pipeline iterates over the model versions that will be shipped in the bundle.</p><ul><li>If a model version includes module_info.json, the pipeline parses it and records the signature.</li><li>If the signature is missing, the pipeline logs a warning and skips that version rather than failing the entire build.
This keeps the system resilient while signature publishing is being rolled out across use cases.</li></ul><p>Finally, the bundle-level module_info file is packaged and uploaded together with other root configuration files, so the root receives one coherent “configs” package.</p><p><strong>Deploy root configs through the same staged delivery flow</strong></p><p>Once the bundle build produces the root-config package, deployment follows the standard staged delivery pattern:</p><ol><li>Deploy root configs to Canary</li><li>Deploy model configs to Canary</li><li>Run Automated Canary Analysis (ACA)</li><li>Deploy root configs to Production</li><li>Deploy model configs to Production</li></ol><p>This is important because it integrates the feature trimmer into the existing model deployment system and ensures that the “root’s trimming view of the world” is updated using the same guardrails and rollback mechanics as other model changes.</p><p>We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming rather than apply a stale config, which can leave a temporary gap in trimming coverage during a rollout. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions. This is discussed in more detail in the later section “Versioned Lookups and Fallback.”</p><p>On successful completion, the root hosts receive the bundle-level signature mapping at a known location on disk, and the trimmer can begin using it for per-model feature allowlisting.</p><h3>A Closer Look into Trimmer Internals</h3><h4>Feature Allowlist or Blocklist</h4><p>Once the root hosts have an idea of which features each model requires, we keep only the needed features in the fan-out request to leaf partitions. This <em>allowlist</em> approach, compared to a <em>blocklist</em> where we keep features <em>not</em> in the list, does not carry the burden of tracking all the features that might be in development or deprecated. Given the evolving nature of ML models and the volume of experiments at Pinterest, the blocklist is significantly larger for any given model and will likely grow faster than the allowlist in the future.</p><h4>Concurrent Updates Across Model Bundles</h4><p>As mentioned earlier, a model bundle can contain multiple ML models. Additionally, model bundles do not map 1:1 to root clusters — each root cluster can receive traffic for multiple bundles. The bundles, each with their own module_info artifact, are deployed independently and often at different cadences. Further, we need to support independent rollbacks of individual model bundles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xH6sBT46q3B35u6LiTcbNQ.png" /><figcaption><em>Figure 5: Concurrent update handling for multiple bundles</em></figcaption></figure><p>A feature trimmer module is initialized on each root host when it comes online. This module maintains a consolidated, in-memory mapping from models to their versioned feature allowlists. Each trim request is efficiently serviced by looking up the model name and version within this consolidated map.
The consolidated map uses the model name and version as nested keys for fast read access, as follows.</p><pre>{<br>  &quot;model_A&quot;: {<br>    &quot;version_N&quot;: [&quot;feature_id_1&quot;, &quot;feature_id_2&quot;, &quot;...&quot;],<br>    &quot;version_M&quot;: [&quot;feature_id_1&quot;, &quot;feature_id_2&quot;, &quot;...&quot;]<br>  },<br>  &quot;model_B&quot;: {<br>    &quot;version_N&quot;: [&quot;feature_id_3&quot;, &quot;feature_id_4&quot;, &quot;...&quot;],<br>    &quot;version_K&quot;: [&quot;feature_id_4&quot;, &quot;feature_id_5&quot;, &quot;...&quot;]<br>  }<br>}</pre><p>This per-model feature allowlist map needs to be continuously refreshed as the model bundle is updated. Here is how it is managed:</p><ul><li><strong>Configuration:</strong> The root cluster is configured with the active model bundles, and the file path for each corresponding module_info.json is set using GFlags.</li><li><strong>Initial Loading:</strong> The feature trimmer module loads the content of each module_info.json file into an independent in-memory map.</li><li><strong>Monitor for Content Updates:</strong> A file watcher is attached to each module_info.json. Any content refresh triggers a reload of its contents into the in-memory map for the given model bundle.</li><li><strong>Consolidation:</strong> On initial loading or when any model bundle is refreshed, the module:<br> — Scans and merges <em>all</em> independent maps.<br> — Creates a new consolidated map.<br> — Atomically replaces the current active consolidated map with the new one.</li><li><strong>Concurrency Management with Read-Write Lock:<br> — </strong>Concurrent reads of the consolidated and independent maps are managed with a <strong>shared lock</strong>.<br> — Write access during the map replacement is managed with a <strong>unique lock</strong>.</li></ul><h4>Versioned Lookups and Fallback</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cL_LOfUl-3WWPjI4K07TOg.png" /><figcaption><em>Figure 6: Request flow for versioned lookup and fallback</em></figcaption></figure><p>Each scoring request sent to the root cluster must include the model name and, optionally, the model version. If the version is omitted, it defaults to the <em>latest</em> version. The feature trimmer parses these fields to determine the version-specific feature allowlist for the requested model.</p><ul><li><strong>If no feature allowlist exists for the model,</strong> the request proceeds untrimmed.</li><li><strong>If both model name and version are specified and found,</strong> the specific version’s allowlist is used.</li><li><strong>If the model name is found but the version is either not specified or not found,</strong> the trimmer uses the latest version of the allowlist. This design choice rests on the Pinterest-wide assumption that the model signature remains consistent across versions; it also simplifies deployment by avoiding the need to keep multiple versions in memory during a rolling deployment.</li></ul><p>The adoption of the feature trimmer reduces network bandwidth consumption on root-leaf connections, which places it on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures.
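</p><p>Condensed into code, the lookup-and-fallback rules above amount to a few branches. A sketch against the consolidated map shown earlier (the helper is ours, and the version ordering is simplified):</p><pre># Hedged sketch of versioned allowlist lookup with fallback.<br>def resolve_allowlist(consolidated, model, version=None):<br>    versions = consolidated.get(model)<br>    if not versions:<br>        return None  # no allowlist at all: forward the request untrimmed<br>    if version is not None and version in versions:<br>        return versions[version]  # exact model name + version match<br>    # Version omitted or unknown: fall back to the latest allowlist,<br>    # relying on the convention that signatures are stable across versions.<br>    latest = max(versions)  # assumes version keys sort by recency<br>    return versions[latest]</pre><p>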
Robust handling of artifact (module_info.json) corruption or deployment failures is therefore essential.</p><p>We have implemented the following safeguards:</p><ul><li><strong>Initialization Failure Guardrail:</strong> Upon Feature Trimmer module initialization, any failures while parsing the required module_info artifacts are emitted to our observability dashboard and trigger an on-call alert. We specifically chose <em>not</em> to block host launch on initialization failure. This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself.</li><li><strong>Isolate Failures from a Single Model Bundle:</strong> The feature trimmer loads the module_info contents for each model bundle into a separate map in its memory. If a model bundle’s file gets corrupted on disk during an update, the feature trimmer keeps using the old, in-memory version for that bundle. Because each bundle has its own map, the feature trimmer can still successfully update the information for all the other model bundles.</li></ul><p>The fundamental assumption that the model signature is consistent across different model versions allows us to implement these precautions, ensuring the Feature Trimmer remains reliably operational even in the event of intermittent deployment failures.</p><h3>Efficiency Wins</h3><h4>Reduced Network Stress</h4><p>The Ads root-leaf server setup was the biggest beneficiary of this launch. Figures 7–9 compare the network performance of the Ads root and leaf clusters after the launch of the feature trimmer module.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TTbTMvh4X4xfPFCW0kd81Q.png" /><figcaption><em>Figure 7. Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server after feature trimmer was enabled. The reduction in network usage allowed us to tune the cluster size and batch size config to improve the GPU utilization.</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I7b2K-b91yh5ECdkQx0DBA.png" /><figcaption><em>Figure 8: Comparison of the network bandwidth consumption before and after launch of the feature trimmer on the Ads root cluster. It dropped from a peak of 4 GB/s to &lt;1.5 GB/s even after downsizing the root cluster by 27%.</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jf7YwDcCKcseb1g7XdQWhQ.png" /><figcaption><em>Figure 9: Comparison of network bandwidth performance on Ads leaf partitions after the launch of the feature trimmer. The peak usage dropped from 1000–1200 MB/s in some clusters to &lt;200 MB/s for all clusters.</em></figcaption></figure><p>Later, we also applied the feature trimmer to other use cases such as Homefeed and Related Pins and saw latency and network reductions similar to Ads, amplifying the overall impact of this initiative.
Figures 10 and 11 show the network savings in Homefeed Root and Leaf.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AnxwxlllewIjg78awFrUpw.png" /><figcaption><em>Figure 10: In our Homefeed Root cluster, outbound network usage dropped substantially from ~1.2–2.1 GB/s to ~0.45–1.1 GB/s</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IinF_XqRqT8lbbcrKI3CIw.png" /><figcaption><em>Figure 11: We saw a 65–75% reduction in inbound network usage across Homefeed GPU leaf clusters</em></figcaption></figure><p>As a result, we reduced the Homefeed root cluster fleet size by 33% and are still working on rightsizing the Homefeed leaf clusters, unlocking significant infrastructure savings.</p><h4>Latency Improvement</h4><p>While the payload size reduction directly contributed to the network performance improvement, we also saw a reduction in CPU utilization on the root cluster and a reduction in both server-side and client-side root latency. We believe this is largely because a smaller payload means fewer CPU cycles spent on SerDe (serialization/deserialization). This latency headroom allowed Ads to trade some latency back for additional cost savings, and the remainder was used to unblock future experiments (see the latency increases in late June).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FXcyPhk-4dQ7_C2ddQ27oQ.png" /><figcaption><em>Figure 12: Ads client (AdMixer) p90 latency dropped significantly as well, from peaks above 90 ms before the launch to &lt;80 ms peaks after feature trimmer was enabled.</em></figcaption></figure><p>For our Related Pins surface, the model score p99 latency (ms) before the feature trimmer sat around ~130–180 ms for most models, with frequent spikes above 200 ms. After the feature trimmer was enabled, the p99 baseline shifted down to roughly ~95–125 ms for most models, a notable ~25–30% drop in latency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XGsEN7qozp-buHJbHmu6VA.png" /><figcaption><em>Figure 13: Feature Trimmer reduces Related Pins model p99 latency by ~25–30%. Note that the feature trimmer was not available for some models because they did not have a valid feature allowlist, so these models still see the same peak latency post-rollout.</em></figcaption></figure><h4>Cost Saving</h4><p>Based on the network performance and client latency efficiencies we realized, we were able to resize the ML servers at Pinterest for significant cost savings:</p><ul><li>Ads was the biggest beneficiary of this project — the team could downsize the root cluster by 27% without any performance regression. On the leaf side, the network improvement allowed us to adjust the batching logic to fine-tune GPU utilization without impacting any other metrics, a gain representing roughly 5% of the total GPU capacity at the time.<br> — The latency reduction unblocked future improvements and reduced failures due to server timeouts — this led to a marginal 0.17% increase in revenue as well.</li><li>Across other use cases like Search and Notification, we saw approximately 45% and 65% drops in egress network throughput, with no material change in p99 latency.
Because these clusters were initially network-bound, the feature trimmer allowed us to move to more optimized instance types, resulting in a ≥30% cost reduction for both.<br> — This realized an additional $0.98M in annual infrastructure cost savings from rightsizing the clusters.</li></ul><p>Overall, this project saved over <strong>$4M</strong> in annual infrastructure costs for Pinterest while creating headroom to test bigger models and features without latency or network performance concerns. It effectively shifted the bottleneck from network to CPU cycles on the root cluster. This also allows the team to switch focus to optimizing the payload between the client and the root to further fine-tune resource utilization end-to-end.</p><h3>Wrap Up</h3><p>Feature Trimmer successfully addressed a critical network bottleneck in Pinterest’s root-leaf ML serving architecture, moving beyond simple payload compression to implement a “Send What You Use” philosophy. By establishing the model signature as the source of truth for required features and deploying a robust, version-aware feature allowlisting system in sync with model rollouts, we significantly reduced the data volume passed between the root and leaf clusters. This optimization resulted in substantial network bandwidth reduction, improved client-side latency, and ultimately delivered significant cost savings.</p><p>In Part II of this blog series, we will shift focus to how request feature compression further optimizes the network connection between the client and the root. Keep an eye out for the next installment to discover how we achieve even greater efficiencies in our ML serving infrastructure.</p><h3>Acknowledgement</h3><p>This project would not have been possible without former team members Yiran Zhao and Queena Zhang’s early exploration and prototyping. We extend our sincere gratitude to the following individuals for their invaluable support in deploying Feature Trimmer into production: Miao Wang, Randy Carlson, Runze Su, Qifei Shen, and Tao Mo. We would also like to thank Nazanin Farahpour, Howard Nguyen, Bo Liu, Sihan Wang, Renjun Zheng and Zheng Liu for their helpful review of this blog post.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ae20beb08d69" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/optimizing-ml-workload-network-efficiency-part-i-feature-trimmer-ae20beb08d69">Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation at Pinterest]]></title>
            <link>https://medium.com/pinterest-engineering/from-clicks-to-conversions-architecting-shopping-conversion-candidate-generation-at-pinterest-04cae5e1455b?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/04cae5e1455b</guid>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[monetization]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Mon, 27 Apr 2026 16:01:05 GMT</pubDate>
            <atom:updated>2026-04-27T16:01:05.242Z</atom:updated>
<content:encoded><![CDATA[<p>Authors: Richard Huang | Machine Learning Engineer II; Yu Liu | Senior Machine Learning Engineer; Ziwei Guo | Senior Machine Learning Engineer; Andy Mao | Staff Machine Learning Engineer; Supeng Ge | Senior Staff Machine Learning Engineer</p><h3>Introduction</h3><p>At Pinterest, conversion ads are crucial for matching users with products they are likely to purchase, boosting value for both users and advertisers¹. While conversion actions like checkout or add-to-cart are highly valuable, they are also technically challenging to optimize for. Because they occur offsite, conversion events are significantly sparser and noisier than onsite engagement signals. Historically, Pinterest’s shopping ads retrieval relied on engagement-based models. While effective for driving interaction, this system was not designed to optimize for lower-funnel conversions. This gap motivated us to build a dedicated candidate generation model tailored for conversions, aiming to surface higher-intent products and improve advertiser performance.</p><p>We launched our first shopping conversion model in 2023, achieving meaningful wins across both conversion and engagement, including a higher clickthrough rate (CTR). Further iterations in 2025 unlocked even stronger conversion value and improved Return on Ad Spend (RoAS) for our advertisers. This blog post documents our journey building this conversion candidate generation model, from its technical design and challenges to the key learnings from deploying it to our 600+ million monthly active users at Pinterest.</p><h3>Training Data Design</h3><p>Modeling conversion events is challenging. Unlike frequent, real-time onsite engagements (e.g., clicks), offsite conversions are reported by advertisers, making the data sparse, noisy, and delayed. Despite these difficulties, conversions remain one of the most valuable signals for a purchase intent model, offering a far stronger indication of advertiser value and true user intent than engagement alone. To address the inherent sparsity of conversions, we made several key design decisions:</p><ul><li><strong>Multi-Surface Model:</strong> We train a single model across all shopping surfaces (Homefeed, Related Pins, Search) to avoid fragmenting sparse conversion labels. At the same time, we incorporated surface-specific features to learn contextual differences between these surfaces.</li><li><strong>Dual Positive Signals:</strong> We supplement primary conversion signals with onsite engagement data (clicks, repins). This broadens data coverage, improving model generalization and ad funnel survival rates. To mitigate click data noise and decrease false positive clicks, we apply a log-based re-weighting function <em>w</em> based on the click duration:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rb9bNgDkjsxxfM8n42DdmQ.png" /></figure><p>where <em>t</em> is the non-negative click duration in seconds and <em>tₘₐₓ</em> is a tunable constant used to cap the re-weighting function.</p><ul><li><strong>Negative Sampling:</strong> On top of the existing in-batch negatives, we use ad impressions with no engagement as “harder negatives.” These samples reflect the real distribution of served ads, exposing the model to a more representative inventory and promoting robust contrastive learning.</li></ul><p>In summary, our multi-task approach uses engagement prediction as an auxiliary task to stabilize training and boost performance.
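</p><p>As an aside on the click re-weighting above: the authoritative form is the formula shown in the figure; one plausible reading in code (our own hedged sketch, with t_max as the tunable cap described in the text) is:</p><pre>import math<br><br># Hedged sketch of a log-based click-duration re-weighting; the<br># authoritative form is the formula in the figure. t_max caps the<br># weight so very long clicks do not dominate training.<br>def click_weight(t, t_max):<br>    return math.log(1.0 + min(t, t_max))</pre><p>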
The crucial challenge is balancing the two tasks, ensuring the high-value conversion signal is not diluted by the more frequent engagement data.</p><h3>Feature Engineering</h3><p>At the core of our model are features that capture critical signals about our users and shopping catalog, grouped into two categories: User-side and Pin-side.</p><p><strong>User-side features</strong> are split into two types. First, context features capture a user’s real-time intent, which is vital for applications like Related Pins and Search. Examples include a subject Pin’s visual and GraphSAGE² embeddings. Second, preference &amp; historical features capture long-term interests for personalization. These include demographics, aggregated historical actions, and sequential data processed by a Transformer to create a user history embedding.</p><p><strong>Pin-side features</strong> take a multi-faceted approach, incorporating ID features, multi-modal/content features for semantic understanding, and performance features tracking engagement.</p><p>This structured representation of users and Pins ensures an effective matching process, delivering both personalization and relevance in recommendations.</p><h3>Model Architecture and Loss Function Design</h3><p>We use a two-tower model for retrieval, where user and Pin features are encoded separately, as there are no explicit user-Pin interaction features at this retrieval stage. To capture richer relationships among features within each tower, we employ DCN v2 (Deep &amp; Cross Network v2)³ as the foundation of our cross layers. This enhances the model’s capacity to model non-linear interactions and boosts retrieval quality. After the cross layers, the output embeddings are fed into the final MLP head(s).</p><p><strong>1. Parallel DCN v2 and MLP Cross Layers Architecture<br></strong>Early in our iterations, our cross-layer design was simple: a stacked architecture where the DCN v2 cross network processed the input first, feeding its output into an MLP for dimension reduction. This was efficient, but we hypothesized that the sequential arrangement imposed a fundamental limit on the model’s learning capacity. To move beyond it, we designed a new architecture that adds an MLP in parallel (see Figure 1). Its success stems from eliminating the primary drawback of a sequential flow: the information bottleneck. In the old setup, the MLP could only learn from features already processed by DCN v2, potentially losing valuable signals from the original input.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Xz-HO7NGsT2s9MLBKD_cTg.png" /><figcaption>Figure 1: Sequential (left) and Parallel DCN v2 and MLP (right) Cross Layers Architecture</figcaption></figure><p>In contrast, our parallel design allows both the cross network and the deep network to learn directly and simultaneously from the same input features. This effectively decouples the learning tasks: the cross network captures richer and more expressive explicit feature interactions by applying cross operations that combine the original input with each successive layer’s output, while the 3-layer MLP learns implicit abstract patterns in parallel. Because the cross network always references the original input at every layer, it constructs higher-order feature crosses without any information being lost or distorted by a preceding MLP transformation.
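</p><p>To make the contrast concrete, here is a minimal PyTorch sketch of the parallel arrangement (a sketch under assumed dimensions and layer counts, not our production tower code):</p><pre>import torch<br>import torch.nn as nn<br><br>class CrossLayerV2(nn.Module):<br>    # One DCN v2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl<br>    def __init__(self, dim):<br>        super().__init__()<br>        self.linear = nn.Linear(dim, dim)<br><br>    def forward(self, x0, xl):<br>        return x0 * self.linear(xl) + xl<br><br>class ParallelCrossTower(nn.Module):<br>    def __init__(self, dim, num_cross_layers=3):<br>        super().__init__()<br>        self.cross_layers = nn.ModuleList(<br>            [CrossLayerV2(dim) for _ in range(num_cross_layers)])<br>        # The deep branch sees the same raw input as the cross branch.<br>        self.mlp = nn.Sequential(<br>            nn.Linear(dim, dim), nn.ReLU(),<br>            nn.Linear(dim, dim), nn.ReLU(),<br>            nn.Linear(dim, dim))<br><br>    def forward(self, x0):<br>        xl = x0<br>        for layer in self.cross_layers:<br>            xl = layer(x0, xl)   # explicit feature crosses, anchored at x0<br>        deep = self.mlp(x0)      # implicit patterns, learned in parallel<br>        return torch.cat([xl, deep], dim=-1)</pre><p>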
The combined output of both funnels yields a richer and more expressive representation, unlocking a higher level of performance.</p><p>We applied this design to both the Pin and query towers, validating it on the conversion task, where it delivered a <strong>+11% gain in offline recall@1000</strong>⁴. Given its success in boosting core learning ability, particularly in surfacing stronger feature interactions while keeping retrieval latency low, this parallel architecture was subsequently <strong>adopted by all our production engagement retrieval models</strong>, achieving similar recall improvements as well as significant gains in online metrics.</p><p><strong>2. From a Multi-Head to a Unified Multi-Task Architecture<br></strong>In the first version of our model, we designed a <strong>multi-head architecture</strong> to make comprehensive use of both the conversion and engagement data. To leverage the relative abundance of click data, we used shared encoders followed by engagement and conversion heads. The engagement head helped stabilize shared parameters, while the conversion head preserved the unique purchase-intent signal. The two heads were trained simultaneously, each with a distinct sampled softmax loss (see Figure 2). To balance the influence of engagement data without diluting the conversion signal, different loss weights were applied. At serving time, only the conversion Pin and query embeddings were used.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q_E0bjg-oNEdv-PuhmctYA.png" /><figcaption>Figure 2: Multi-head architecture, 2023 (left) and Unified multi-task architecture, 2025 (right)</figcaption></figure><p>Through in-depth data analysis and several online experiments, we identified sparsity and noise in the conversion labels as one of the main bottlenecks in the previous model’s performance. To better stabilize query embeddings in regions of low conversion coverage, we moved from the multi-head architecture to a <strong>unified single-head multi-task architecture</strong> (cf. Figure 2). By merging the conversion and engagement heads, it allows the final embeddings to directly benefit from the multi-task optimization during serving.</p><p>Building on top of this, we also observed that conversion data at the Pin level exhibit high variance, making it challenging to reliably model purchase intent from Pin-level supervision alone. To address this, we introduced an <strong>advertiser-level loss function</strong> as an additional training objective, enabling the model to better capture conversion signals at a more stable and consistent granularity. With other model improvements and feature additions, we saw on average an <strong>increase of +42% recall@100</strong>⁴ for conversion tasks compared to our previous 2023 model.</p><h3>Conclusion</h3><p>In summary, our modeling journey in crafting the shopping conversion candidate generation was driven by the necessity of overcoming the inherent sparsity and noise of offsite conversion events. We addressed this through a sequence of loss design and architectural innovations. Key modeling decisions included the adoption of a unified model across all surfaces and the strategic use of conversion and click duration-weighted engagement data.
Architecturally, we leveraged a highly effective Parallel DCN v2 and MLP Cross Layers architecture, and we progressed from an initial separate multi-head design to a unified multi-task architecture that introduced an advertiser-level matching objective to better align with the natural granularity of the conversion signal.</p><p>Introducing this new candidate generation (CG) model to production in 2023 delivered a <strong>2.3% increase in shopping conversion volume</strong> and a <strong>2.7% lift in the shopping impression-to-conversion rate</strong>. Beyond conversions, it also improved the Pinners’ shopping experience, with <strong>CTR increasing by 1.5%</strong> and <strong>CTR over 30 seconds rising by 2.2%</strong>. Building on this foundation, further iterations and refinements throughout 2025 continued to push the model’s performance forward, resulting in a <strong>3.1% improvement in RoAS</strong> for US shopping campaigns⁴, reinforcing that strong advertiser outcomes and a great Pinner experience are not at odds, but deeply intertwined.</p><h3>Acknowledgments</h3><p>Ads Retrieval: Yang Liu, Jay Ma (former), Peifeng Yin (former), Qingmengting Wang, Richika Sharan, Jitong Qi, Yufeng Su, Huiqin Xin</p><p>Ads Ranking: Weiwei Ying (former), Yiwei Sun (former), Aayush Mudgal, Hongda Shen, Han Sun</p><p>Ads Signal: Jiayin Jin (former), Daniel Yang (former), Chongyuan Xiang, Lakshmi Manoharan, Litian Tao, Siping Ji</p><p>Leadership: Alice Wu, Leo Lu (former), Ling Leng (former), Hari Venkatesan (former), Behnam Rezaei (former), Jamieson Kerns</p><h3>References</h3><p>¹ A. Mudgal, et al. 2024. <a href="https://medium.com/pinterest-engineering/evolution-of-ads-conversion-optimization-models-at-pinterest-84b244043d51">Evolution of Ads Conversion Optimization Models at Pinterest</a>. Pinterest Engineering Blog.</p><p>² W. L. Hamilton, et al. 2017. <a href="https://arxiv.org/pdf/1706.02216">Inductive Representation Learning on Large Graphs</a>. In NIPS.</p><p>³ R. Wang, et al. 2020. <a href="https://arxiv.org/pdf/2008.13535">DCN V2: Improved Deep &amp; Cross Network and Practical Lessons for Web-scale Learning to Rank Systems</a>. WWW ’21: Proceedings of the Web Conference 2021.</p><p>⁴ Pinterest Internal Data, US, 2023 to 2025.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=04cae5e1455b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/from-clicks-to-conversions-architecting-shopping-conversion-candidate-generation-at-pinterest-04cae5e1455b">From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation at Pinterest</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest]]></title>
            <link>https://medium.com/pinterest-engineering/smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication-at-pinterest-4aa42e807d7d?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/4aa42e807d7d</guid>
            <category><![CDATA[pinner-experience]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[eng-culture]]></category>
            <category><![CDATA[pinterest]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Mon, 20 Apr 2026 16:01:04 GMT</pubDate>
            <atom:updated>2026-04-20T16:01:04.436Z</atom:updated>
<content:encoded><![CDATA[<p>Shanhai Liao | Senior Software Engineer, Content Acquisition and Media Platform; Di Ruan | Senior Staff Software Engineer, Content Acquisition and Media Platform; Evan Li | Senior Engineering Manager, Content Acquisition and Media Platform</p><h3>Introduction</h3><p>Accurate content understanding underpins Pinterest’s ability to drive distribution and engagement. This requires deep insight not just into the image itself, but also the outbound links or items to which those images point. At the foundation of this process lies a deceptively simple problem: URL normalization.</p><p>When Pinterest ingests content from millions of merchant domains, the same product page often appears under many different URLs. A single pair of shoes might be referenced by dozens of URL variations — each one decorated with different tracking parameters, session tokens, or analytics tags. While downstream systems can eventually deduplicate by content identity, the inability to recognize these duplicates at the URL level means every variation is independently fetched, rendered, and processed. At scale, this redundant ingestion and processing represents a significant waste of computational resources — rendering the same page dozens of times simply because its URLs differ in irrelevant parameters.</p><p>Item canonicalization — ensuring that identical items represented by different URLs are unified — is critical for organizing shopping catalogs and presenting a consistent experience to users. For many partners, a provided item ID determines canonical identity, but in its absence, the onus falls to advanced URL normalization to deduplicate effectively.</p><p>This post details the technical journey behind the <strong>Minimal Important Query Param Set (MIQPS)</strong> algorithm: a system that automatically learns which URL parameters matter for content identity, enabling dynamic and precise URL normalization at scale.</p><h3>Background: The URL Normalization Challenge</h3><p>Consider a typical product URL from an e-commerce site:</p><pre>https://example.com/shoes?id=42&amp;color=red</pre><p>This URL identifies a specific product variant. But in practice, the same product page is often reached through URLs like:</p><pre>https://example.com/shoes?id=42&amp;color=red&amp;utm_source=facebook&amp;session=abc123<br>https://example.com/shoes?id=42&amp;color=red&amp;ref=pinterest&amp;click_id=xyz<br>https://example.com/shoes?id=42&amp;color=red&amp;tracking=campaign_spring</pre><figure><img alt="Diagram showing three different URLs with different query parameters all pointing to the same product page content, illustrating the URL duplication problem." src="https://cdn-images-1.medium.com/max/1024/1*04DW89j1STxyHKOzzaY4NA.png" /><figcaption><em>Figure 1: The URL duplication problem. Multiple URLs with different tracking parameters all resolve to the same product page content.</em></figcaption></figure><p>The parameters utm_source, session, ref, click_id, and tracking are all <strong>neutral</strong> - they don’t change the content of the page. Meanwhile, id and color are <strong>non-neutral</strong> - they determine which product and variant are displayed.</p><p>The challenge is distinguishing between the two. For well-known e-commerce platforms, this can be solved with curated rules. Shopify URLs, for example, use variants as the key product differentiator.
Salesforce Commerce Cloud uses parameters like start, sz, prefn1, and prefv1. For these platforms, static allowlists are sufficient.</p><p>But Pinterest ingests content from a large number of domains, operating on a wide variety of platforms.</p><p>For this long tail of domains, URL parameter conventions vary wildly. Static rules cannot scale to cover them all. We need a dynamic, data-driven approach.</p><h3>The MIQPS Algorithm</h3><p>The core insight behind MIQPS is straightforward: <strong>if removing a query parameter changes the content of a page, that parameter is important; if it doesn’t, the parameter is noise and can be safely stripped.</strong> Crucially, this analysis runs independently per domain — each merchant site gets its own MIQPS map, because the same parameter name can be meaningful on one domain and irrelevant on another.</p><p>The algorithm operates in three steps.</p><h4>Step 1: Collect the URL Corpus</h4><p>As Pinterest’s content ingestion pipeline processes URLs from domains, the system accumulates a corpus of observed URLs per domain. This corpus is stored durably and represents a snapshot of all the URL variations seen for a given domain. It serves as the input to the MIQPS analysis.</p><h4>Step 2: Group URLs by Query Parameter Pattern</h4><p>Not all URLs from a domain share the same set of query parameters. A product page URL might carry {id, color, utm_source} while a category page might carry {category, page, sort}. Analyzing them together would be meaningless.</p><p>Moreover, the same parameter name can play different roles depending on its context. Consider the parameter ref: on a product page URL like example.com/product?id=42&amp;ref=homepage, ref is purely a tracking parameter and is neutral - removing it doesn’t change the product displayed. But on a comparison page URL like example.com/compare?ref=99, the same ref parameter identifies which items to compare and is non-neutral. By grouping URLs by their full parameter pattern, the algorithm evaluates each parameter within its specific context, correctly classifying it as neutral in one pattern and non-neutral in another.</p><p>To address this, the algorithm groups URLs by their <strong>query parameter pattern</strong> — the sorted set of parameter names present in the URL. For example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AipLT77mJ6OY2li1bYIRwQ.png" /></figure><p>URLs sharing the same query pattern are grouped together. The top <em>K</em> patterns by URL count are selected for analysis, focusing computational resources on the patterns that matter most.</p><h4>Step 3: For Each Pattern, Test Each Parameter</h4><p>For each query parameter within a pattern, the algorithm determines whether it is neutral or non-neutral through empirical testing:</p><p>1. <strong>Sample:</strong> Select up to <em>S</em> URLs with distinct values for the parameter under test.</p><p>2.
<strong>Compare:</strong> For each sampled URL, compute the <strong>content ID</strong> — a fingerprint derived from the page’s rendered visual content — for both:<br> — The original URL (with the parameter present)<br> — A modified URL (with the parameter removed)</p><p>3. <strong>Classify:</strong> If removing the parameter changes the content ID in at least <em>T</em>% of samples, the parameter is classified as <strong>non-neutral</strong> (important). Otherwise, it is <strong>neutral</strong> (safe to drop).</p><p>The content ID is a hash of the page’s visual representation, meaning two URLs that render the same visible content will produce the same content ID, even if their underlying HTML differs slightly. This particular fingerprinting approach leverages Pinterest’s in-house page rendering infrastructure, which is tailored to our content pipeline. The core MIQPS algorithm, however, is agnostic to how the content fingerprint is produced — it only requires a function that returns the same identifier for the same page content. Third parties looking to adopt a similar approach could substitute alternatives such as DOM tree hashing, HTTP response body checksums, or even simpler heuristics like comparing the &lt;title&gt; and Open Graph metadata across URL variants. The key principle remains the same: compare some representation of the page content with and without each parameter to determine its importance.</p><p>A natural question is: why not simply use the <strong>canonical URL</strong> declared in the page’s HTML (via the &lt;link rel=”canonical”&gt; tag) to resolve duplicates? If the merchant provides a canonical URL, two variant URLs pointing to the same product should share the same canonical, making deduplication trivial. In practice, however, canonical URLs are unreliable at scale. Many merchant sites omit them entirely, set them incorrectly (e.g., pointing every page to the homepage), or include tracking parameters in the canonical URL itself. Because we cannot assume canonical URLs are present or correct across the long tail of merchant domains, MIQPS uses visual content comparison as a ground-truth signal that works regardless of how well-maintained a site’s metadata is.</p><h3>Algorithm Parameters</h3><p>The behavior of the MIQPS algorithm is governed by a small set of tunable parameters:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hwujQfq8faQhSV4A90jh4Q.png" /></figure><p>Two additional design choices make the algorithm practical at scale:</p><ul><li><strong>Early exit optimization:</strong> If the mismatch rate already exceeds <em>T</em>% after <em>N</em> successful tests, we stop testing that parameter early. This avoids unnecessary page rendering calls for parameters that are clearly non-neutral.</li><li><strong>Conservative default:</strong> When fewer than <em>N</em> sample URLs are available for a parameter, it is treated as non-neutral by default. The system errs on the side of keeping parameters rather than dropping ones that might matter.</li></ul><h3>Putting It Together</h3><p><strong>Figure 2: The MIQPS computation pipeline.</strong></p><p>The output of this pipeline is a <strong>MIQPS map</strong>: a mapping from each query parameter pattern to the set of non-neutral parameters within that pattern.
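</p><p>Condensed to its essentials, the per-pattern analysis might look like the following sketch (our own illustration, not Pinterest’s production code: content_id() stands in for the rendering-based fingerprint described above, and the sampling, threshold, and minimum-sample arguments correspond to <em>S</em>, <em>T</em>, and <em>N</em>):</p><pre>from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse<br><br>def drop_param(url, param):<br>    # Return `url` with one query parameter removed.<br>    parts = urlparse(url)<br>    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]<br>    return urlunparse(parts._replace(query=urlencode(kept)))<br><br>def sample_urls(urls, param, limit):<br>    # Pick up to `limit` URLs whose values for `param` are all distinct.<br>    seen, picked = set(), []<br>    for url in urls:<br>        value = dict(parse_qsl(urlparse(url).query)).get(param)<br>        if value is not None and value not in seen:<br>            seen.add(value)<br>            picked.append(url)<br>        if len(picked) == limit:<br>            break<br>    return picked<br><br>def non_neutral_params(pattern, urls, content_id, s, t, n):<br>    # t is the mismatch-rate threshold (T% expressed as a fraction).<br>    important = set()<br>    for param in pattern:<br>        samples = sample_urls(urls, param, s)<br>        if len(samples) &lt; n:<br>            important.add(param)  # too few samples: keep it (conservative)<br>            continue<br>        mismatches = sum(<br>            content_id(u) != content_id(drop_param(u, param)) for u in samples)<br>        if mismatches / len(samples) &gt;= t:<br>            important.add(param)  # removing the param changes the page<br>    return important</pre><p>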
This map is published to a configuration store and consumed at runtime during URL normalization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CIlpLWHKpPdWG224AbX3SA.png" /></figure><h3>Multi-Layer Normalization Strategy</h3><p>MIQPS does not operate in isolation. In production, URL normalization combines <strong>static rules</strong> with the <strong>dynamically computed MIQPS</strong>. Static rules capture known conventions — curated allowlists for recognized e-commerce platforms and regex patterns for widely used parameter naming schemes. These rules handle cases where we already have high confidence about which parameters matter.</p><p>MIQPS complements these static rules by covering the long tail of domains where no predefined rules exist. A URL parameter is kept if it is matched by either the static rules or the MIQPS non-neutral set. Only parameters that pass neither check are stripped. This combination ensures broad coverage: static rules provide immediate, reliable handling for known platforms, while MIQPS dynamically adapts to everything else.</p><h3>Anomaly Detection: Guarding Against Regressions</h3><p>Computing MIQPS is inherently dependent on external page rendering. Pages can change, rendering infrastructure can have transient issues, and a domain’s URL structure can shift between analysis runs. Without safeguards, a bad MIQPS computation could cause the system to start dropping parameters that are actually important — leading to content deduplication errors and degraded catalog quality.</p><p>To address this, the system includes an anomaly detection layer that compares each newly computed MIQPS against the previously published version. The comparison follows a set of conservative rules:</p><ul><li><strong>Parameter removed from non-neutral set (anomaly):</strong> If a parameter that was previously classified as non-neutral is now classified as neutral, the pattern is flagged as anomalous. This is the dangerous case — it means we would start stripping a parameter that we previously determined was important.</li><li><strong>Parameter added to non-neutral set (not anomalous):</strong> If a previously neutral parameter is now classified as non-neutral, this is not considered an anomaly. It simply means we discovered a new important parameter, and the worst case is keeping slightly more parameters than necessary.</li><li><strong>Pattern removed entirely (not anomalous):</strong> If a query pattern from the previous MIQPS is absent in the new one, this is not flagged. Patterns can naturally disappear as a domain’s URL structure evolves.</li></ul><p>If more than <em>A</em>% of existing patterns are flagged as anomalous, the entire MIQPS update is rejected and the previous version is retained. This ensures the system never regresses — it errs on the side of over-keeping parameters rather than accidentally dropping ones that affect content identity.</p><h3>System Architecture and Integration</h3><p>The MIQPS system fits into Pinterest’s content processing pipeline as follows:</p><figure><img alt="System architecture diagram with three phases: content ingestion produces a URL corpus, offline MIQPS computation uses page rendering for content ID comparison with anomaly detection before publishing, and the URL normalization phase where the URL processor reads MIQPS from the config store."
src="https://cdn-images-1.medium.com/max/1024/1*f_UtyminfV-Y6Z5Mnyu16Q.png" /><figcaption><em>Figure 3: End-to-end system architecture. The content ingestion pipeline produces a URL corpus per domain. An offline job analyzes parameter importance via content ID comparison, then publishes the MIQPS to a config store after anomaly checks. The URL processor reads the MIQPS at runtime to normalize URLs during content processing.</em></figcaption></figure><p>The architecture has three distinct phases:</p><ul><li><strong>Content Ingestion:</strong> As URLs are processed from domains, the system writes each unique URL to a per-domain corpus stored in S3. This happens continuously as part of normal content processing.</li><li><strong>MIQPS Computation:</strong> After a content processing cycle completes for a domain, an offline job is triggered. This job downloads the URL corpus, runs the MIQPS algorithm (grouping, sampling, content ID comparison), performs anomaly detection, and publishes the result to both a config store (for runtime consumption) and S3 (for archival and debugging).</li><li><strong>URL Normalization:</strong> At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the four normalization layers.</li></ul><p>This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup.</p><p>An alternative design would be to determine parameter importance **in realtime** — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons:</p><p>- <strong>Latency</strong>: Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline.</p><p>- <strong>Cost</strong>: Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive.</p><p>- <strong>Reliability</strong>: Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing.</p><p>In practice, the offline approach is a natural fit because URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity.</p><h3>Conclusion</h3><p>URL normalization may seem like a mundane infrastructure problem, but at Pinterest’s scale — with a large number of domains and billions of URLs — getting it right has outsized impact on content quality.</p><p>The MIQPS algorithm brings several key properties to this challenge:</p><ul><li><strong>Dynamic and data-driven:</strong> MIQPS automatically adapts to each domain’s URL conventions without requiring manual configuration or domain-specific rules. 
<h3>System Architecture and Integration</h3><p>The MIQPS system fits into Pinterest’s content processing pipeline as follows:</p><p><strong>Figure 3: End-to-end system architecture.</strong></p><figure><img alt="System architecture diagram with three phases: content ingestion produces a URL corpus, offline MIQPS computation uses page rendering for content ID comparison with anomaly detection before publishing, and the URL normalization phase where the URL processor reads MIQPS from the config store." src="https://cdn-images-1.medium.com/max/1024/1*f_UtyminfV-Y6Z5Mnyu16Q.png" /><figcaption><em>Figure 3: End-to-end system architecture. The content ingestion pipeline produces a URL corpus per domain. An offline job analyzes parameter importance via content ID comparison, then publishes the MIQPS to a config store after anomaly checks. The URL processor reads the MIQPS at runtime to normalize URLs during content processing.</em></figcaption></figure><p>The architecture has three distinct phases:</p><ul><li><strong>Content Ingestion:</strong> As URLs are processed from domains, the system writes each unique URL to a per-domain corpus stored in S3. This happens continuously as part of normal content processing.</li><li><strong>MIQPS Computation:</strong> After a content processing cycle completes for a domain, an offline job is triggered. This job downloads the URL corpus, runs the MIQPS algorithm (grouping, sampling, content ID comparison), performs anomaly detection, and publishes the result to both a config store (for runtime consumption) and S3 (for archival and debugging).</li><li><strong>URL Normalization:</strong> At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the normalization layers.</li></ul><p>This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup.</p><p>An alternative design would be to determine parameter importance <strong>in realtime</strong> — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons:</p><ul><li><strong>Latency:</strong> Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline.</li><li><strong>Cost:</strong> Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive.</li><li><strong>Reliability:</strong> Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing.</li></ul><p>In practice, the offline approach is a natural fit because URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity.</p><h3>Conclusion</h3><p>URL normalization may seem like a mundane infrastructure problem, but at Pinterest’s scale — with a large number of domains and billions of URLs — getting it right has outsized impact on content quality.</p><p>The MIQPS algorithm brings several key properties to this challenge:</p><ul><li><strong>Dynamic and data-driven:</strong> MIQPS automatically adapts to each domain’s URL conventions without requiring manual configuration or domain-specific rules. As a domain’s URL structure evolves, the algorithm discovers new patterns and adjusts accordingly.</li><li><strong>Layered and defense-in-depth:</strong> The multi-layer normalization strategy combines static allowlists, regex patterns, and dynamically computed MIQPS. Each layer catches a different class of parameters, and a parameter only needs to match one layer to be preserved.</li><li><strong>Conservative and regression-resistant:</strong> The anomaly detection system ensures that MIQPS updates never regress — previously important parameters cannot be silently dropped. The system consistently errs on the side of keeping parameters rather than stripping them.</li><li><strong>Scalable and cost-efficient:</strong> By grouping URLs by pattern, focusing on the top <em>K</em> patterns, and using early exit optimizations, the algorithm keeps computational costs manageable even across hundreds of thousands of domains.</li></ul><p>By aligning normalization strategies with proven content identity signals, MIQPS ensures every unique item or experience is surfaced cleanly — improving search and recommendations, downstream catalog management, and ultimately the user experience.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4aa42e807d7d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication-at-pinterest-4aa42e807d7d">Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Finding zombies in our systems: A real-world story of CPU bottlenecks]]></title>
            <link>https://medium.com/pinterest-engineering/finding-zombies-in-our-systems-a-real-world-story-of-cpu-bottlenecks-ea4722e552eb?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/ea4722e552eb</guid>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Wed, 15 Apr 2026 16:01:04 GMT</pubDate>
            <atom:updated>2026-04-15T16:01:04.668Z</atom:updated>
<content:encoded><![CDATA[<p>Vaibhav Shankar; Staff Software Engineer | Raymond Lee; Staff Software Engineer | Chia-Wei Chen; Staff Software Engineer | Shunyao Li; Sr. Software Engineer | Yi Li; Staff Software Engineer | Ambud Sharma; Principal Engineer | Saurabh Vishwas Joshi; Principal Engineer | Charles-A. Francisco; Senior Engineer | Karthik Anantha Padmanabhan; Director, Engineering | David Westbrook; Sr. Manager, Engineering</p><p>One day in early 2025, the Kubernetes platform team at Pinterest (<a href="https://medium.com/pinterest-engineering/pincompute-a-kubernetes-backed-general-purpose-compute-platform-for-pinterest-8ad408df2d6f">PinCompute</a>) got a ping from our partners on the ML platform team. Their <a href="https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52">Ray-based training jobs</a>, which often take hours of computation on expensive GPU hardware, were crashing. Not every time, but often enough that it was becoming noticeable. Their logs indicated that their distributed training jobs were seeing intermittent loss of network connectivity, and that ultimately caused their jobs to crash. Their ask was simple:</p><ol><li>Why is this happening?</li><li>Can you please make it stop?</li></ol><p>What started there led to a more than three-month-long investigation and a great lesson in profiling performance bottlenecks. Read on to learn from our fun story about CPU bottlenecks, AWS network drivers, and yes, how we discovered Zombies in our system!</p><h3>Background: Ray at Pinterest</h3><p>At Pinterest, Ray has become the backbone of our next-gen ML training and inference. Over the past few years, it has enabled us to scale systems, accelerate experimentation, and significantly boost the performance of models powering our diverse ML workloads.</p><p>We have previously shared deep dives on our progress, including: <strong>Ray Infrastructure</strong> (provisioning Ray clusters on in-house K8s clusters at scale [<a href="https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52">blog</a>]), <strong>Batch Inference with Ray</strong> (scaling to hundreds of nodes [<a href="https://medium.com/pinterest-engineering/ray-batch-inference-at-pinterest-part-3-4faeb652e385">blog</a>][<a href="https://www.youtube.com/watch?v=HDSy09hrm2I">talk</a>]), <strong>Ray for Training</strong> (distributed dataloaders and throughput optimization [<a href="https://www.youtube.com/watch?v=yqVLRONwDJs">talk</a>]), and <strong>Last-Mile Data Processing</strong> (reducing experimentation cycles [<a href="https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff">blog 1</a>][<a href="https://medium.com/pinterest-engineering/scaling-pinterest-ml-infrastructure-with-ray-from-training-to-end-to-end-ml-pipelines-4038b9e837a0">blog 2</a>]).</p><p>Today, we run more than half of the offline ML workload company-wide on Ray, provisioning tens of thousands of Ray clusters per month, a feat made possible only by a robust Kubernetes environment.</p><h4><strong>Network Model &amp; Challenges</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LI6XdmCMfpirO6JDoESaWA.png" /><figcaption><em>Figure 1: Ray architecture at Pinterest</em></figcaption></figure><p>What makes network stability challenging is Ray’s unique network model.</p><p>Ray operates as a highly “network-active” system. 
A Ray cluster generates constant, intensive inter-pod gRPC traffic that is fundamental to the cluster’s operation, with the following two distinct layers:</p><ul><li><strong>Control Plane:</strong> Handles stateful operations, such as node health checks, task submission, actor scheduling, and the maintenance of Object References.</li><li><strong>Data Plane:</strong> Handles the high-volume transfer of values within the Object Store. Our large-scale ML training relies on this plane to move data rapidly between nodes.</li></ul><p>Because this traffic is highly distributed and latency-sensitive, the impact of network instability is often non-deterministic, manifesting across various components of the Ray cluster:</p><ul><li><strong>Job Hanging:</strong> Caused by actor state corruption following brief network interruptions. [github issue]</li><li><strong>ObjectFetchTimedOutError</strong> / <strong>ObjectLossError</strong></li><li><strong>ActorDiedError</strong></li><li>Node failed the health check and crashed</li><li>…</li></ul><p>All of these occurrences resulted in one common outcome: our Ray training jobs would crash (some use cases saw a &gt;25% success rate drop), resulting in loss of expensive compute hours and a significant slowdown in model building and experimentation. After grinding for over a month seeking solutions for individual issues in the Ray stack, the ML Platform team realized it was necessary to look for lower-level network issues together with our friends on the PinCompute team.</p><h3>Symptom 1: Network driver resets</h3><p>At Pinterest, our Kubernetes clusters are backed by AWS EC2 instances, which leverage the ENA network driver (<a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/RELEASENOTES.md">ref</a>) as a standard traffic component. This network driver works with AWS Elastic Network Interfaces (ENIs) and sets up receive and transmit queues for buffering packets. The first symptom that something was wrong was that whenever the ML training jobs failed with network connectivity issues, the failures correlated with a network driver ‘<a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/ENA_Linux_Best_Practices.rst">reset</a>’, as seen in our system logs.</p><pre>[] ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1<br>[] ena 0000:20:03.0 eth0: napi handler hasn&#39;t been called for a long time but is scheduled<br># .... Bunch of stats excluded....<br>[] ena 0000:20:03.0: ENA Large LLQ is disabled<br>[] ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g</pre><p>From the reference docs:</p><p><em>Q: What is [the] ENA device reset?</em></p><p><em>A: ENA device reset is a self healing mechanism that is triggered when the driver detects unexpected device behavior. Example of such behavior could be an unresponsive device, missing keep-alive events from the device, </em><strong><em>Tx completions timeouts</em></strong><em>, netdev timeout etc. 
The device reset is a rare event, lasts less than a millisecond and might incur loss of traffic during this time, which is expected to be recovered by the transport protocol in the instance kernel.</em></p><p><strong>Ok, so the driver saw Tx threads paused for an extended period of time (hardcoded to 5s in AWS ENA kernel drivers), and caused the device to be reset, which could cause some packet drops.</strong> A typical reason for resets was documented as <a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/ENA_Linux_Best_Practices.rst#cpu-starvation">CPU starvation</a>, i.e., when the network driver’s threads don’t get CPU time for several seconds. So perhaps something CPU intensive was starving out the network driver threads?</p><h3>Symptom 2: CPU utilization</h3><p>Our next observation was that some of the machines where we saw network resets exhibited high system CPU usage, and that correlated nicely with the CPU starvation theory in the ENA documentation. We speculated that our training jobs were leveraging inefficient memory allocators and that was resulting in high page faulting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EE9kxYskRdD-5KXyU2B0Hw.png" /><figcaption><em>Figure 2: Page faults per second on impacted machines</em></figcaption></figure><p>We did what many reasonable people would do:</p><ul><li>We tried using Huge pages (by turning on <a href="https://docs.kernel.org/admin-guide/mm/transhuge.html">TransparentHugePages</a>) to reduce page faulting.</li><li>We experimented with more efficient memory allocators like <a href="https://jemalloc.net/">jemalloc</a>.</li><li>We tried to give the training jobs their own CPU cores by providing them CPU affinity via <a href="https://man7.org/linux/man-pages/man1/taskset.1.html">taskset</a>.</li><li>Out of desperation, we played with interrupt pinning for ENA drivers by steering network interrupts to other cores.</li></ul><p>Nothing worked. While we saw some drops in overall CPU utilization and page faulting from the memory allocators and huge pages settings, the network resets continued. They sometimes happened very early in a training job run and sometimes several hours into execution. Across hundreds of training job runs, it was hard to predict when exactly we’d see a network reset, if at all.</p><p><strong>One mitigation <em>did</em> work, albeit briefly, and it’s everyone’s favourite <em>IT Crowd</em> advice: yes, we turned it off and on again. </strong>When we rebooted machines with high numbers of resets, they were able to support running ML jobs just fine… that is, until they weren’t. We clocked it at approximately one week of uptime, after which the network resets returned on the rebooted machines.</p><h3>Symptom 3: Availability zone differences</h3><p>To further understand the problem, the ML platform team started emitting metrics whenever an ENA reset was observed. Once the metrics were available, the team noticed something odd — the network resets were happening on machines in only one AWS Availability Zone, while jobs with identical parameters were running just fine in other zones.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*etrE8z45SnNTa3RY55HQxQ.png" /><figcaption><em>Figure 3: Network resets per Availability Zone</em></figcaption></figure><p>The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability Zone) but when the team looked at our cluster configurations across different zones, they seemed identical. 
They were running the same version of Kubernetes and the same system image. So, did we get a bad hardware batch!? We reached out to our excellent AWS support team and, after several engagements, were convinced that the issue was definitely not on the AWS side. Their analysis was clear: there was something on our machines in the us-east-1a zone which was heavily using the CPU and starving the network threads. So why would only one Availability Zone’s machines exhibit this network reset behaviour?</p><h3>Profiling attempts: perf and mpstat</h3><p>We decided it was time to stop with high-level metrics and start profiling what was actually using the high amounts of CPU. Performance engineers know all about <a href="https://www.brendangregg.com/perf.html">perf</a> and its versatility. perf is a Linux profiler that can provide insights into ‘hot’ code paths and a call stack indicating CPU time spent by a particular process on a machine. Initially, our rudimentary snapshots of perf revealed the same suspected actors: page faulting and some heavy computation from our ML jobs. However, this didn’t indicate CPU starvation all on its own.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lBgwqzlNnaBME__PO3Mdkw.png" /><figcaption><em>Figure 4: perf snapshot on an impacted machine</em></figcaption></figure><p>We realized that for CPU starvation to happen, it may take as little as one heavily utilized CPU core to block an unlucky network thread scheduled onto it. Moreover, we realized that our GPU machines had 96 vCPU cores, which meant that an overall perf view told us very little about what was happening on each individual core.</p><p>To address this, we used <a href="https://linux.die.net/man/1/mpstat">mpstat</a> to get an overview of per-core utilization on a per-second basis for an hour, to identify if specific cores were using up large amounts of CPU. <strong>In our offline analysis, we found that a single CPU core (in the following screenshot, CPU 39) was sometimes using 100% of its system CPU for multiple seconds! </strong>This also correlated with when a network reset happened. We were finally closing in on the root cause!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rWXuQ_YVtG-e9WSmrrqHzA.png" /><figcaption><em>Figure 5: 100% System CPU utilization on a single core (Core 39) when profiled per second.</em></figcaption></figure><p>Given these network resets were happening at unpredictable times and we lacked perf runs from the times of the reset, we were still missing one key detail: what process was using up the CPU for this extended period of time?</p><h3>Temporal profiling: Time is an important factor</h3><p>We realized that if there was a sporadic process (think something in your crontab or some kind of periodic sync loop in a process) causing high CPU utilization at specific times on the machine, then a random perf sample wouldn’t tell us about it. We needed a tool like <a href="https://github.com/intel/gprofiler">gProfiler</a> to be running for an extended period of time and then ‘time travel’ to a specific point in time to look at what was happening on the CPU cores at that moment. Unfortunately, at the time of this incident, we didn’t have gProfiler running everywhere within our fleet, but the principles were sound! Thanks to some creative work from our ML platform team, we created the following experimental setup:</p><p>1. 
Reserved a small number of machines (via Kubernetes <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/">taints</a>) for analysis</p><p>2. Kicked off a series of training jobs in parallel on these machines. For simplicity, we repurposed our in-house hyper-parameter tuning system to orchestrate identical model training across reserved machines, allowing each training run’s resource footprint to remain fairly constant.</p><p>3. Kicked off a script that ran perf in 2-minute increments, with profiles and CPU stacks data saved to disk. The script looked a bit like this and ran on all of our reserved machines as a system process.</p><pre># Bash program to generate CPU stacks snapshots on a machine. <br><br># Run perf record for 2 minutes at a time, since each perf data file can become very large for longer periods. Record the start time in the filename for &#39;time traveling&#39; later! Running this 360 times covers roughly a 12 hour period of profiles<br>$ for i in {1..360} <br>  do <br>    sudo perf record -F 97 -g -a -o perf-$(hostname)-$(date +&quot;%Y%m%d-%H-%M-%S&quot;)-120s.data -- sleep 120  <br>  done<br><br># Generate perf stacks (glob the .data files directly rather than parsing ls)<br>$ for datafile in perf-*.data <br>  do <br>    perf script --header -i $datafile &gt; $datafile.stacks<br>  done</pre><p>4. We ran the data collection overnight (~12 hours) and waited for a reset to be triggered. Since our ML training jobs typically ran for 8–12 hours, we were confident that we would observe a reset over this period across at least a subset of the training jobs.</p><p>Sure enough, when we came to analyze the data the next day, we found that network driver resets had been triggered along with job failures. Unlike before, we now had perf data to examine from the time of the reset! We fetched the perf results for the 2-minute time window around the reset event and visualized it with the excellent <a href="https://github.com/Netflix/flamescope">Flamescope</a> tool, courtesy of our friends at Netflix. Flamescope lets us view a 2-minute CPU profile with a time-travel view, zooming into a subset of the time window to observe what was happening on the CPU <em>at that time. </em>From the ENA reset logs, we found that the reset had happened about 70 seconds into this profile, so we zoomed in to a 5-second region from the high-level view around the reset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KIoimNbGJtFaIa6INS5uSg.png" /><figcaption><em>Figure 6: Temporal high-level view of CPU utilization from flamescope. X-axis is time from 0–120 seconds for the 2 minute snapshot</em></figcaption></figure><p>Our first observation was that the kubelet, our lightweight Kubernetes agent, was occupying ~6.5% of total CPU usage a few seconds before an ENA reset. This was alarming and interesting because the rest of the time, the kubelet barely broke 1% of CPU usage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VgPVLlK-5kpVBNYUewOg6w.png" /><figcaption><em>Figure 7: Profile of the CPU just before ENA resets. 
Notice the high kubelet utilization.</em></figcaption></figure><p>We zoomed in a bit deeper and found that the kubelet was spending a lot of time in a kernel function: <em>mem_cgroup_nr_lru_pages</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1-3NSvIcBCLgT_VAqmTcZQ.png" /><figcaption><em>Figure 8: Zoomed in profile of the CPU stacks for the kubelet process</em></figcaption></figure><p>We now had a suspect: something was causing the kubelet to iterate over all the <a href="https://docs.kernel.org/admin-guide/cgroup-v1/memcg_test.html">memory cgroups</a> on the host and spend significant time on the CPU. While we were researching this, we came across an <a href="https://blogs.oracle.com/linux/zombie-memcg-issues">excellent post</a> on the Oracle blog describing the problem of <em>zombie memory cgroups. </em>Could we be running into this problem? Fortunately, that blog post guided us perfectly, and we saw the following on a machine exhibiting network driver resets:</p><pre># Kernel tracked cgroups (including zombies)<br><br>$ cat /proc/cgroups | grep memory | awk &#39;{print $3}&#39;<br>68680<br><br># Actual cgroups<br><br>$ find /sys/fs/cgroup/memory/ -type d  | wc -l<br>240</pre><p>Yup, we definitely had zombies! Nearly 70,000 memory cgroups tracked in the kernel but only 240 actually in use. Iterating over that long list in the kernel was likely what was causing the CPU utilization spikes on a single core, and if a network thread landed on that core at just the wrong time, it could become starved! But what was causing the high build-up of memcgs?</p><h3>Beware of system defaults</h3><p>Our theory at this point was that the build-up of memcgs was from some crash-looping container, which kept re-creating cgroups and leaking memcgs. We didn’t see any such container created by Kubernetes, but we spotted a container that was always only a few seconds old when we queried the Docker API:</p><pre>$ docker ps -a<br>CONTAINER ID   IMAGE                                                                                                                       COMMAND                  CREATED          STATUS                             PORTS     NAMES<br>c6fdfc760921   amazon/amazon-ecs-agent:latest                                                                                              &quot;/agent&quot;                 11 seconds ago   Up 10 seconds (health: starting)             ecs-agent</pre><p>Why was the Amazon ECS Agent running (and repeatedly crashing!) on our <em>Kubernetes</em> nodes? This was certainly unintentional given <a href="https://aws.amazon.com/ecs/">ECS</a> is an alternative container orchestration platform that we weren’t using. It turns out that for our GPU instances, we were leveraging the <a href="https://docs.aws.amazon.com/dlami/">AWS Deep Learning AMI</a> (Ubuntu 20.04) as a base machine image, and it set up ecs-agent as a default systemd unit. <strong>As part of the machine’s bootstrap process, it also started the ECS agent, which over several days of crashing accumulated a massive number of memory cgroups.</strong> The ECS Agent was expectedly crashing since we did not give our machines permissions to join an ECS cluster, so it was natural that the container failed to start up. 
This also explained why rebooting the machines gave us temporary relief: rebooting reset the memcg counts!</p><p><strong>We fixed the issue by simply turning off the ECS agent systemd unit in our base images and rebooting all our machines to purge the zombie memcgs</strong>. After this, we noticed that memory cgroups remained stable and, most importantly, Ray training jobs were running with their expected high success rate again. The problem of ENA resets and the zombies in our machines was fully resolved, and our ML training teams could go back to building awesome new models to serve Pinterest customers!</p><h3>Hold on! What about the Availability Zone disparity?</h3><p>Oh… right. Well, erm, we messed up a little. See, when we said that the two node configurations were identical across the two clusters, that was only <em>mostly </em>true. For our Kubernetes cluster in the unaffected Availability Zone, we had an independent bug where we delivered the <strong>same Kubernetes binary</strong> via two different URLs to the two clusters. Long story short, the difference in URLs caused a final step that emitted a metric to fail, which caused the node bootstrap script to be marked as failed. This prevented the ECS agent from starting up because <strong><em>its</em> systemd unit depended on the bootstrap script to successfully complete,</strong> which in turn allowed those nodes to remain ‘healthy’, at least from the perspective of not accumulating memcgs! The Kubernetes team was aware of this different-URL issue and was independently working on fixing it; that fix would, in turn, have brought the network reset issue to the unaffected Availability Zone as well.</p><h3>Key Takeaways</h3><ul><li>Introducing fleet-wide metrics to track transient issues on the platform helps identify failure patterns. In this case, it helped us understand that the issue was correlated with AZ/cluster setup, which led us to isolate and consistently reproduce the problem.</li><li>Create reproducible, closed environments for iterative debugging. In our case, the partnership between the PinCompute and ML Platform teams to set up debugging experiments was critical to quickly identifying the root cause of the issue.</li><li>Invest in profiling tools, and especially temporal profiling tools! They’re great and will save you hours and hours when working on hard-to-debug performance problems. At Pinterest, we’re developing and rolling out <a href="https://github.com/intel/gprofiler">gProfiler</a> in close collaboration with Intel for debugging situations like this.</li><li>Be aware of what processes are running on your base OS images. Sometimes, the defaults aren’t necessarily the right ones for your environment. Invest in profiling the success rate of your systemd units and watch out for the impact of regular failures.</li><li>When looking at differences between two environments that look the same but act differently, look closer: you’re probably missing some piece of configuration that is causing the two paths to diverge. 
Better yet, invest in good automated tooling to ensure your environments are truly identical.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ea4722e552eb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/finding-zombies-in-our-systems-a-real-world-story-of-cpu-bottlenecks-ea4722e552eb">Finding zombies in our systems: A real-world story of CPU bottlenecks</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Recommendation Systems with Request-Level Deduplication]]></title>
            <link>https://medium.com/pinterest-engineering/scaling-recommendation-systems-with-request-level-deduplication-93bd514142d9?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/93bd514142d9</guid>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Mon, 13 Apr 2026 19:01:01 GMT</pubDate>
            <atom:updated>2026-04-13T19:01:01.524Z</atom:updated>
<content:encoded><![CDATA[<p><strong>Authors:</strong> Matt Lawhon | Sr. Machine Learning Engineer; Filip Ryzner | Machine Learning Engineer II; Kousik Rajesh | Machine Learning Engineer II; Chen Yang | Sr. Staff Machine Learning Engineer; Saurabh Vishwas Joshi | Principal Engineer</p><p>At Pinterest, scaling our recommendation models delivers outsized impact on the quality of the content we serve to users. Our <a href="https://arxiv.org/abs/2507.12704">Foundation Model</a> (oral spotlight, ACM RecSys 2025), for example, achieved a 100x increase in transformer dense parameter counts and a 10x increase in model dimension, translating directly into meaningful quality improvements across multiple recommendation surfaces.¹</p><p>But a 100x scaleup creates massive infrastructure pressure. Storage, training, and serving costs all threaten to grow proportionally unless you’re deliberate about efficiency. The single highest-impact technique we’ve deployed to hold costs in check across all three dimensions is <strong>request-level deduplication:</strong> a family of techniques that ensures we process and store request-level data once, not once per item.</p><p>In this post, we’ll walk through what request-level deduplication is, why it matters so much for modern recommendation systems, and how we applied it across the full ML lifecycle, from storage compression, to training correctness and speedups, to serving throughput gains.</p><h3>Background</h3><p>A <em>request</em> is triggered when a user opens their feed, kicking off the recommendation funnel:</p><ul><li><strong>Retrieval</strong>: Aggregate user and request information into one or multiple embeddings, then fetch a large set of potentially relevant items from the entire corpus using techniques like nearest neighbor search.</li><li><strong>Ranking</strong>: Aggregate user, request, and item information to make predictions about relevance or engagement. Typically there are early-stage ranking models (which need cheap per-item inference since they score many items) and late-stage ranking models (which can afford more expensive per-item inference since fewer items are ranked).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t3rKtnFd9I7yiyfjslqg4w.png" /></figure><p>The same user data flows through every stage of this funnel, and within each stage, it’s duplicated across every item scored. Request-level deduplication refers to the category of techniques that eliminate this redundancy when storing, moving, or transforming this data.</p><p>The impact can be extremely high because:</p><ul><li><strong>Request-level data is massive.</strong> It largely consists of user sequences, approximately 16K tokens encoding all actions a user has taken on the platform. These sequences power sequential user understanding components like the <a href="https://arxiv.org/abs/2507.12704">Pinterest Foundation Model</a> and <a href="https://arxiv.org/abs/2506.02267">TransAct</a>. 
Each sequence is duplicated identically for every candidate item scored: hundreds to thousands of copies per request.</li><li><strong>Processing this data is expensive.</strong> The computation associated with user tower models in retrieval and user sequence understanding components in ranking represents a significant proportion of total recommendation system compute.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/954/1*kWBPDdEOxaoI-VWGfpVfdw.png" /></figure><h3>Storage</h3><p>One of the key ways deduplication pays off is at the storage level. A row in a training dataset typically consists of [request/user, content item, engagement label], and we can have hundreds or thousands of content/engagement labels associated with a single request. Without deduplication, the same massive user sequence is stored redundantly for every single content interaction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vmof2oFg1mMtESr2y-UIsg.png" /></figure><p>By leveraging <a href="https://iceberg.apache.org/">Apache Iceberg</a> with user ID and request ID based sorting (<a href="https://medium.com/pinterest-engineering/how-pinterest-accelerates-ml-feature-iterations-via-effective-backfill-d67ea125519c">How Pinterest Accelerates ML Feature Iterations via Effective Backfill</a>, <a href="https://medium.com/pinterest-engineering/scaling-pinterest-ml-infrastructure-with-ray-from-training-to-end-to-end-ml-pipelines-4038b9e837a0">Scaling Pinterest ML Infrastructure with Ray</a>), we achieve 10–50x storage compression on user-heavy feature columns.² When rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically.</p><p>Beyond raw storage savings, request-sorted data enables improved dataset tooling:</p><ul><li><strong>Bucket joins</strong>: Matching keys are co-located, eliminating expensive shuffle operations.</li><li><strong>Efficient backfills</strong>: We can update only affected user segments rather than reprocessing entire datasets.</li><li><strong>Incremental feature engineering</strong>: Adding new request-level features becomes a localized operation: we can append new columns to existing row groups without duplicating the entire dataset.</li><li><strong>Stratified sampling</strong>: Request-sorted data enables user-level sampling, ensuring training datasets maintain proper diversity without over-representing highly active Pinners.</li></ul><h3>Training</h3><h4>Addressing Independent and Identically Distributed (IID) Disruption</h4><p>Early experiments with request-sorted data revealed 1–2% regressions on key offline evaluation metrics in our ranking models.² The root cause was the disruption of the IID assumption.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3uAJ__0WgVx6wVEcLNJfUQ.png" /></figure><p>With IID sampling, each batch contains engagements spread across many users, yielding stable and representative statistics. With request-sorted data, batches become concentrated around fewer users, causing batch-level statistics to fluctuate dramatically based on individual user behavior. Each gradient update is computed from a less representative slice of the data: the model sees a noisier, more biased view of the training distribution, which slows convergence and degrades final quality.</p><p>The specific vulnerability lies in Batch Normalization (BatchNorm), which normalizes intermediate values by computing mean and variance <em>across the batch</em>. Standard BatchNorm computes these statistics independently on each device’s local batch. When batches are request-sorted and highly correlated, a batch dominated by a single power user will have dramatically different statistics than one with a casual browser.</p><h3>Fix: Synchronized Batch Normalization (SyncBatchNorm)</h3><p>SyncBatchNorm aggregates statistics across all devices before normalization. This effectively increases the “statistical batch size” used for computing means and variances, even though each device still processes its local request-sorted batch. The result is that normalization statistics are computed over a much more diverse set of users and requests, restoring the representative statistics that standard BatchNorm enjoyed with IID data.</p><p>In practice, this simple one-line change fully recovered the performance gap. The communication overhead of synchronizing statistics across devices was negligible compared to the training speedups gained from deduplicated computation.</p>
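<p>In PyTorch this really is a one-line conversion; a minimal sketch (the toy model is illustrative, and in production the conversion happens before wrapping the model in DistributedDataParallel with an initialized process group):</p><pre>import torch.nn as nn<br><br># Toy ranking tower with BatchNorm layers<br>model = nn.Sequential(nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU())<br><br># Swap every BatchNorm layer for SyncBatchNorm so that means and variances<br># are aggregated across all workers before normalization, instead of being<br># computed per device over a request-correlated local batch<br>model = nn.SyncBatchNorm.convert_sync_batchnorm(model)</pre>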
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8QD9F1V0lpKGjMyW7nQGdw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-904a8xg-X9sL4XTl_e6hw.png" /></figure><h4>Addressing In-Batch False Negatives</h4><p>With IID sampling, the probability that a randomly sampled in-batch negative is actually a positive for the anchor user is negligible: users engage with a tiny fraction of the total item corpus. With request-sorted data, however, batches are concentrated around fewer users, and each user may have dozens or hundreds of engagements grouped together. Many in-batch “negatives” are actually items the user engaged with: they’re false negatives. The false negative rate jumps from ~0% with IID sampling to as high as ~30% with request-sorted data, depending on the number of unique users per batch.²</p><p>Training the model to push apart items the user <em>actually</em> engaged with actively degrades retrieval quality.</p><h3>Fix: User-Level Masking</h3><p>To address this, we extended our existing identity masking to also exclude negatives that belong to the same user as the anchor. The standard InfoNCE loss with logit correction:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dFqGHbIUA-0iEUodFdW8pw.png" /></figure><p>becomes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UdjsphIzbevwYdewpg5COw.png" /></figure><p>where:</p><ul><li><em>s(·,·)</em> is the similarity function (e.g., dot product) between user and item embeddings</li><li><em>x_i</em> is the user embedding for the anchor engagement <em>i</em></li><li><em>y_i</em> is the positive (target) item for engagement <em>i</em></li><li><em>y_k</em> represents candidate negative items from batch <em>B</em></li><li><em>x_k</em> is the user associated with candidate <em>k</em></li><li><em>p_y</em> values are streaming frequency estimates (<a href="https://research.google/pubs/pub48840/">Yi et al., 2019</a>) used for logit correction</li><li><strong><em>x_k ≠ x_i</em></strong> is the new constraint: only use engagements from <em>other</em> users as negatives</li></ul><p>This simple masking change allowed us to successfully adopt request-sorted data for retrieval model training while preserving model quality.</p>
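<p>A hedged PyTorch sketch of the masked, logit-corrected loss (tensor names mirror the symbols above; the production loss differs in detail):</p><pre>import torch<br>import torch.nn.functional as F<br><br>def masked_infonce(user_emb, item_emb, user_ids, log_py):<br>    # user_emb, item_emb: [B, d] for B (user, engaged item) training pairs<br>    # user_ids: [B] anchor user id per row; log_py: [B] streaming log<br>    # frequency estimates per item, used for the logit correction<br>    logits = user_emb @ item_emb.t() - log_py  # s(x_i, y_k) - log p(y_k)<br>    # Mask in-batch negatives that share the anchor user (x_k == x_i),<br>    # keeping the diagonal, which holds each row&#39;s own positive item<br>    same_user = user_ids.unsqueeze(0) == user_ids.unsqueeze(1)<br>    diag = torch.eye(len(user_ids), dtype=torch.bool)<br>    logits = logits.masked_fill(same_user &amp; ~diag, float(&quot;-inf&quot;))<br>    labels = torch.arange(len(user_ids))<br>    return F.cross_entropy(logits, labels)</pre>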
<h3>Manifesting Training Speedups</h3><p>The previous sections focused on correctness, ensuring model quality is preserved when switching to request-sorted data. Here we discuss how to actually realize the compute and memory savings that deduplication enables.</p><h4>Data Loading</h4><p>Our data loading infrastructure, shared across ranking and retrieval models, is designed to maintain deduplication as long as possible in the pipeline. All preprocessing and feature transformations operate on deduplicated request-level data. We only reduplicate (expand) at the very end, on GPU or directly in the model’s forward pass. This minimizes CPU-to-GPU transfer costs and memory allocation overhead.</p><h4>Retrieval Models</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-9CCQgRrDZj8PPdN4bsL_Q.png" /></figure><p>Achieving request-level compute deduplication in retrieval models is straightforward thanks to the two-tower architecture. Since the user tower has no item dependencies by definition, we rewrite the forward pass to run the user tower on the deduplicated batch of <em>R</em> unique requests rather than the full batch of <em>B</em> user-item pairs. The item tower continues to operate on the full batch. Gradients for the user tower are computed at the deduplicated level and appropriately accumulated.</p><p>Though conceptually simple, the savings compound in practice: memory allocation, I/O, and compute all benefit, particularly for large user sequence models where the user tower dominates training cost.</p>
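<p>A minimal sketch of that rewritten forward pass (names are illustrative; gradient accumulation into the R deduplicated rows falls out of the indexing automatically):</p><pre>import torch<br><br>def two_tower_forward(user_tower, item_tower, request_feats, item_feats, inverse_idx):<br>    # request_feats: [R, ...] features for the R unique requests<br>    # item_feats: [B, ...] features for the full batch of B user-item pairs<br>    # inverse_idx: [B] maps each pair to its unique request, e.g. from<br>    # torch.unique(request_ids, return_inverse=True)<br>    user_emb = user_tower(request_feats)      # user tower runs R times, not B<br>    user_emb = user_emb[inverse_idx]          # expand back to B rows<br>    item_emb = item_tower(item_feats)         # item tower on the full batch<br>    return (user_emb * item_emb).sum(dim=-1)  # dot-product similarity</pre>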
<h3>Ranking Models: Deduplicated Cross-Attention Transformer (DCAT)</h3><p>Ranking models present a greater challenge because transformer architectures used for user understanding typically have item dependencies: each candidate item attends to the user history, coupling request-level and item-level computation.</p><p>To address this, we developed DCAT, described in detail in the <a href="https://arxiv.org/abs/2507.12704">Pinterest Foundation Model paper</a>. The key insight is to separate the transformer into two components:</p><ol><li><strong>Context</strong>: Apply the transformer to the user’s historical action sequence once per deduplicated request. The keys and values (KV) from each layer are cached.</li><li><strong>Crossing</strong>: Each candidate item performs cross-attention with the cached user history KV, reusing the deduplicated context computation.</li></ol><p>This optimization, implemented with custom <a href="https://triton-lang.org/">Triton</a> kernels for both training and serving, achieved significant throughput gains over standard self-attention with <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>.</p>
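<p>Conceptually, the context/crossing split looks like the sketch below. Standard PyTorch attention modules stand in for the custom Triton kernels, and the explicit expansion of the cached context is purely illustrative; the real kernels avoid materializing those copies:</p><pre>import torch<br>import torch.nn as nn<br><br>d = 64<br>self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)<br>cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)<br><br>def dcat(user_seq, items, inverse_idx):<br>    # Context: encode each unique request&#39;s action sequence once; in the<br>    # real system the per-layer keys/values are cached at this point<br>    ctx, _ = self_attn(user_seq, user_seq, user_seq)   # [R, S, d]<br>    # Crossing: each candidate cross-attends to its request&#39;s cached context<br>    ctx_b = ctx[inverse_idx]                           # [B, S, d]<br>    out, _ = cross_attn(items, ctx_b, ctx_b)           # queries are the items<br>    return out<br><br># Example: R=2 unique requests, S=16 actions each, B=5 candidate items<br>scores = dcat(torch.randn(2, 16, d), torch.randn(5, 1, d),<br>              torch.tensor([0, 0, 0, 1, 1]))           # -&gt; [5, 1, 64]</pre>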
<h3>Training Impact</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8lOk3mOBJrqkS0Si2BuQyQ.png" /></figure><p>Taken together, request-level deduplication delivered a <strong>4x end-to-end training speedup for retrieval</strong> and a <strong>~2.8x speedup for ranking</strong> (40% from deduplicated data loading compounded with a 2x gain from DCAT cross-attention).²</p><h3>Serving</h3><p>For retrieval, serving has always been correctly deduplicated by design: we embed the user once and search against the item index. No changes were needed.</p><p>For ranking, the DCAT architecture provides the same deduplication benefit at serving time as it does during training. The context transformer processes the user’s action sequence once per request, the key-value (KV) cache stores the intermediate representations, and each candidate item cross-attends to this cached context. This avoids redundantly recomputing the full user sequence for every item scored.</p><p>The result is a <strong>7x increase in ranking serving throughput</strong>.² This is what made it possible to deploy a 100x larger model without proportional serving cost increases, absorbing the full Foundation Model scaleup while holding infrastructure budgets in check.</p><h3>Conclusion</h3><p>Request-level deduplication delivered impact across every layer of our ML lifecycle:</p><ul><li><strong>Storage</strong>: 10–50x compression on user-heavy feature columns via Iceberg and request sorting²</li><li><strong>Training</strong>: 4x retrieval speedup and 2.8x ranking speedup from deduplicated data loading and DCAT²</li><li><strong>Serving</strong>: 7x throughput increase via DCAT and custom Triton kernels²</li></ul><p>Three lessons stand out:</p><ol><li><strong>Request-level deduplication is a cross-cutting technique.</strong> It improves storage, training, and serving simultaneously because the same fundamental redundancy exists at every layer.</li><li><strong>Simple fixes unlock big wins.</strong> SyncBatchNorm and user-level masking are minimal code changes with outsized impact. The hardest part was identifying the problems; the solutions were straightforward.</li><li><strong>Impact compounds across the stack.</strong> Storage compression enables faster data pipelines, training speedups accelerate experimentation velocity, and serving throughput reduces infrastructure cost, freeing capacity for the next round of model scaling.</li></ol><p><em>¹ </em><a href="https://arxiv.org/abs/2507.12704"><em>Pin Foundation Model</em></a><em>, ACM RecSys 2025.</em> <em>² Pinterest Internal Data, Global, 2025.</em></p><h3>Acknowledgements</h3><p>This work reflects joint efforts across multiple teams at Pinterest. We’d like to thank: Devin Kreuzer, Piyush Maheshwari, Hanlin Lu, Xue Xia, Abhinav Naikawadi, Yuming Chen, and Aditya Mantha (Personalization); Kousik Rajesh, Xiangyi Chen, Zelun Wang, Hanyu Li, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, and Hongtao Lin (Applied Sciences); Raymond Lee, Sheng Huang, Neha Upadhyay, Nazanin Farahpour, Henry Feng, Alekhya Pyla, Rubin Fergerson, and Shengtong Zhang (ML Platform); Shivin Thukral, Joseph Bongo, Zach Barnes, and Yang Cao (Search); and Anya Trivedi, Akshay Iyer, and Rui Liu (Notifications).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=93bd514142d9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/scaling-recommendation-systems-with-request-level-deduplication-93bd514142d9">Scaling Recommendation Systems with Request-Level Deduplication</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Performance for Everyone]]></title>
            <link>https://medium.com/pinterest-engineering/performance-for-everyone-21a560260d08?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/21a560260d08</guid>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[performance-metrics]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[user-experience]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 16:01:01 GMT</pubDate>
            <atom:updated>2026-04-08T16:01:01.814Z</atom:updated>
<content:encoded><![CDATA[<p>Author: Lin Wang (Android Performance Engineer)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aAqbT-AdcudKcE8RPb8w4A.png" /></figure><h4><strong>Default Feature</strong></h4><p>For mobile apps, performance is considered the “default feature”: apps are expected to run fast and be responsive, just as a watch is expected to show the time. Pinterest is no exception: we measure, protect, and improve performance for all of our key user experience surfaces, such as “Home Feed” and “Search Result Feed”.</p><h4><strong>Hard to Measure</strong></h4><p>Among all the performance metrics, <strong>user perceived latency</strong> is a crucial one. It measures the time from when the user performs an action until they see the content. This is also called “<strong>Visually Complete</strong>”.</p><p><strong>Visually Complete</strong> can be very different from app to app, or even from surface to surface within one app. On Pinterest’s “Video Pin Closeup” surface, <strong>Visually Complete</strong> means the full-screen video starts playing; on our “Home Feed” surface, <strong>Visually Complete</strong> is defined as all the images rendered and videos playing; on our “Search Auto Complete Page”, <strong>Visually Complete </strong>refers to the search autocomplete suggestions’ text rendered along with the avatar images.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CqFgL-xHHzwp0sJIPCkgkQ.png" /></figure><p>Given this dynamic nature of <strong>Visually Complete</strong>, engineers had to create customized measurement logic for each surface, which takes a lot of engineering effort and maintenance cost. This ends up as a major barrier for general product engineers to work on performance, especially on newly created surfaces. On average, it takes <strong>two engineer-weeks</strong> to implement a User Perceived Latency metric on the Android client and wire it up to all the toolsets for production usage.</p><h4><strong>All-In-One Solution</strong></h4><p>Over the years, the performance team at Pinterest has been thinking about how to offer performance measurement at the lowest cost to product engineers, so that more product engineers can easily access their feature’s user perceived latency information and work on performance.</p><p>Recently, it seems we have found an answer to this. In a nutshell, we built the <strong>Visually Complete</strong> logic into the base UI class (e.g. <strong>BaseSurface</strong>). Therefore, the <strong>Perceived Latency </strong>of any UI surface (existing or new) will be automatically measured as long as the feature is built on top of this base UI class.</p><h4><strong>Walk the View Tree</strong></h4><p>First we define a few common media view interfaces: <strong>PerfImageView</strong>, <strong>PerfTextView</strong>, <strong>PerfVideoView</strong>. Each of them contains a few methods to report their rendering status: <strong>isDrawn()</strong>, <strong>isVideoLoadStarted()</strong>, <strong>x(),</strong> <strong>y()</strong>, <strong>height()</strong>, <strong>width(),</strong> etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cziK0nCGc-N01lnwxmKfWw.png" /></figure><p>At the <strong>BaseSurface</strong> level, we have access to the root Android ViewGroup (e.g. <strong>RootView</strong>). 
We can iterate through the view tree starting from the <strong>RootView</strong>, visiting all the views in the tree. We focus on the visible views and check whether all the <strong>PerfImageView</strong> and <strong>PerfTextView</strong> instances are drawn, and all the <strong>PerfVideoView</strong> instances have started playing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gMTthN7j-Afym3txQUx-8g.png" /></figure>
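<p>The traversal itself is simple. An illustrative Python sketch of the logic (the production implementation walks Android Views that implement the Perf* interfaces above; the class here is a hypothetical stand-in):</p><pre>class ViewNode:<br>    # Stand-in for an Android View; media=True marks Perf* media views<br>    def __init__(self, children=(), visible=True, media=False, drawn=False):<br>        self.children = list(children)<br>        self.visible, self.media, self.drawn = visible, media, drawn<br><br>def is_visually_complete(view):<br>    if not view.visible:<br>        return True   # invisible views cannot block visual completeness<br>    if view.media and not view.drawn:<br>        return False  # an image/text not yet drawn, or a video not started<br>    return all(is_visually_complete(c) for c in view.children)<br><br># Example: a feed with one drawn image and one video still loading<br>root = ViewNode(children=[<br>    ViewNode(media=True, drawn=True),<br>    ViewNode(media=True, drawn=False),<br>])<br>print(is_visually_complete(root))  # False until the video starts</pre>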
<h4><strong>In Production</strong></h4><p>Since the release of this system on Android, it constantly visualizes the User Perceived Latency on over <strong>60 surfaces</strong> at any given time. It is well received by many product teams, who have started using it to protect and improve their surfaces’ performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aLn9Q_fxY3Oc-acZg2MwTA.png" /></figure><h4><strong>Interesting Cases</strong></h4><ul><li>Since all surfaces are measured by the same standard, we can compare multiple surfaces’ performance fairly.</li><li>For some features with short shelf time (e.g. a Christmas landing page), we previously weren’t able to code their latency metrics in time, but now those metrics are ready as soon as the surface is built.</li></ul><h4><strong>Conclusion</strong></h4><p>Offering performance metrics to product engineers for free makes Pinterest’s performance more visible and encourages everyone to protect and optimize the User Perceived Latency on their surfaces.</p><p>Following the success on Android, we have also extended the same concept to iOS and web platforms.</p><h4><strong>Acknowledgements</strong></h4><p>Special thanks: Arun K</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=21a560260d08" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/performance-for-everyone-21a560260d08">Performance for Everyone</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Evolution of Multi-Objective Optimization at Pinterest Home feed]]></title>
            <link>https://medium.com/pinterest-engineering/evolution-of-multi-objective-optimization-at-pinterest-home-feed-06657e33cd10?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/06657e33cd10</guid>
            <category><![CDATA[results-diversification]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[slate-optimization]]></category>
            <category><![CDATA[recommendation-system]]></category>
            <category><![CDATA[pinterest]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 16:01:02 GMT</pubDate>
            <atom:updated>2026-04-21T18:11:22.795Z</atom:updated>
<content:encoded><![CDATA[<p><strong>Homefeed: </strong>Jiacong He, Dafang He, Jie Cheng (former), Andreanne Lemay, Mostafa Keikha, Rahul Goutam, Dhruvil Deven Badani, Dylan Wang<br><strong>Content Quality:</strong> Jianing Sun, Qinglong Zeng<br><strong>ML Serving: </strong>Li Tang</p><h3>Introduction</h3><p>In feed recommendation, we recommend a list of items for the user to consume. This is typically handled separately from the ranking model, which gives probability predictions for user-item pairs.</p><p>Pinterest’s feed recommendation follows a cascaded system design with retrieval [1][2], pre-ranking [3], ranking [4][5], and re-ranking. While most of these prior works focus on optimizing immediate actions for each candidate Pin, this work will primarily focus on how we build the final layer of the recommendation funnel for multi-objective optimization. This is a critical part of our recommendation system as it helps us balance short-term and long-term engagement, drive new use case adoption, and satisfy various business requirements. Throughout the years, we have made substantial improvements on this layer through both algorithmic and infrastructure upgrades. In this tech blog post, we will share our experiences, learnings, and improvements we’ve made over the years on this critical layer.</p><h3>Overall System Design</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nn5FGuO-CFUwCDLlt5swNA.png" /><figcaption>Figure 1. Cascaded Design of Pinterest Funnel.</figcaption></figure><p>Figure 1 illustrates the cascaded funnel design of our feed recommendation system from retrieval to ranking to the multi-objective optimization component. While earlier stages mostly optimize for certain positive actions (e.g., saves) given an impression, the multi-objective optimization layer tackles a different problem: determining the best composition of a feed served to the user. This is critical as users tend to have lower intent when visiting Home Feed and their browsing behavior will be significantly impacted by what they see. For example, visually repetitive content is less engaging and is likely to reduce the user’s session length and the likelihood that a user will revisit Pinterest.</p><h3>Multi-Objective Optimization Design</h3><p>In this section, we describe the detailed design of our multi-objective optimization layer.</p><h4>Diversification</h4><p>Feed diversification is an important factor for continued user satisfaction. We empirically found that when we removed the feed-level diversity component, users’ immediate actions (e.g., saves) increased on day 1 but quickly turned <em>negative</em> by the second week. This also came with a reduced session time and other negative downstream effects, which significantly reduce the user’s long-term satisfaction. It is important to note that when users engage with less diverse content, engagement signals are also affected, creating a feedback loop in which the system generates even less diverse content.</p><p>To achieve better short-term and long-term engagement, we applied a diversity-based re-ranking algorithm in our feed as the main part of the multi-objective optimization layer; it is one of the most important components of the multi-objective re-ranking system.</p><h4>V1: Determinantal Point Process (DPP)</h4><p>DPP is widely used in the industry for feed diversification [6][7]. In our first generation of feed diversification, we leveraged DPP as the main component.</p><p>Mathematically, DPP is parametrized by a kernel matrix Lₙₓₙ where the diagonal entry Lᵢᵢ measures the relevance/quality of the i-th item, and the off-diagonal entries Lᵢⱼ = Lⱼᵢ measure the similarity between item i and j. Practically, we use learned embeddings such as GraphSAGE [8] and categorical taxonomy to determine item-item similarity. Thus, DPP’s kernel matrix can be generalized to L = f₀(Λ) g𝜓(S) f₀(Λᵀ) where Λ is the diagonal matrix whose diagonal entries are relevance scores of items, f₀(·) is a monotonically increasing element-wise transformation, and g𝜓(·) is a transformation of the similarity matrix S.</p>
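<p>For intuition, here is a small NumPy sketch of greedy DPP inference over such a kernel. This is the naive textbook version that recomputes log-determinants at every step; production systems use incremental updates instead:</p><pre>import numpy as np<br><br>def dpp_greedy(L, k):<br>    # Greedy MAP inference: repeatedly add the item that maximizes the<br>    # log-determinant of the kernel restricted to the selected set<br>    selected, remaining = [], list(range(L.shape[0]))<br>    for _ in range(k):<br>        best, best_gain = None, -np.inf<br>        for i in remaining:<br>            idx = selected + [i]<br>            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]<br>            if gain &gt; best_gain:<br>                best, best_gain = i, gain<br>        selected.append(best)<br>        remaining.remove(best)<br>    return selected<br><br># Kernel built as in the text, with f = exp and g = identity<br>relevance = np.array([0.9, 0.8, 0.7])<br>S = np.eye(3); S[0, 1] = S[1, 0] = 0.95  # items 0 and 1 are near-duplicates<br>F = np.diag(np.exp(relevance))<br>L = F @ S @ F<br>print(dpp_greedy(L, 2))  # [0, 2]: item 1 is skipped despite high relevance</pre>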
In our first generation of feed diversification, we leveraged DPP as the main component.</p><p>Mathematically, DPP is parametrized by a kernel matrix Lₙₓₙ, where the diagonal entry Lᵢᵢ measures the relevance/quality of the i-th item, and the off-diagonal entries Lᵢⱼ = Lⱼᵢ measure the similarity between items i and j. Practically, we use learned embeddings such as GraphSAGE [8] and the categorical taxonomy as levers to determine item-to-item similarity. Thus, DPP’s kernel matrix can be generalized to L = f₀(Λ) g𝜓(S) f₀(Λ)ᵀ, where Λ is the diagonal matrix whose diagonal entries are the items’ relevance scores, f₀(·) is a monotonically increasing element-wise transformation, and g𝜓(·) is a transformation of the item similarity matrix S.</p><p>Our first version of the feed diversification algorithm was implemented in 2021 based on the DPP algorithm.</p><p>Since its launch, it has become one of the most impactful components in our system. As the system becomes increasingly responsive through more real-time signal adoption, such as in TransAct [5], we have found that user satisfaction improves when users receive more diverse feed recommendations through DPP. We conducted an ablation study by removing the DPP component and found that time spent per impression dropped by over 2% after the first week.</p><h4>V2: Sliding Spectrum Decomposition</h4><p>Sliding Spectrum Decomposition (SSD) [9] is a position‑adaptive diversification method for recommendation systems that views a candidate feed as a mixture of latent “spectra” (topics/intents/styles). As we render the feed top‑down, SSD repeatedly decomposes the local similarity structure within a sliding window and rebalances exposure: under‑represented spectra are promoted while over‑represented spectra are softly penalized. This yields locally smooth yet globally balanced diversity, complementing slate‑global methods like DPP.</p><p>Mathematically, let X ∈ Rⁿˣᵈ be item embeddings and S ∈ Rⁿˣⁿ a symmetric similarity matrix built from learned representations (e.g., GraphSAGE). At position <em>t</em> with window size <em>w</em>, restrict S to the window S^(ᵗ) and compute a top-K spectral decomposition S^(ᵗ) ≈ U^(ᵗ) Λ^(ᵗ) U^(ᵗ)ᵀ. Let r ∈ Rⁿ be base relevance scores. SSD tracks cumulative exposure Eₖ(𝑡) per local spectrum k and defines an adjusted utility Uᵢ(𝑡) = f(rᵢ) − β ∑ₖ₌₁ᴷ wₖ(𝑡)·(uₖ^(ᵗ)[i])², where f(·) is a monotone transform of relevance, β controls diversity strength, and wₖ(𝑡) increases with exposure relative to current spectral mass (e.g., wₖ(𝑡) ∝ Eₖ(𝑡) / (ε + λₖ^(ᵗ))). The next item is <em>i</em>⁎ = argmaxᵢ Uᵢ(𝑡); exposures are updated and the window slides.</p><p>Compared to DPP, sliding spectrum decomposition has lower computational complexity because it avoids Cholesky-style similarity matrix decompositions. The original paper introducing the SSD algorithm (<a href="https://arxiv.org/pdf/2107.05204">link</a>) gives a comprehensive comparison between different variations of DPP algorithms and SSD:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*czzIt1PoaySCQL5N0D7rpA.png" /><figcaption>Table 1: Comparisons of greedy inference complexity for SSD and DPP with dense item embeddings. In general, we have 𝑁 &gt; 𝑇 &gt; 𝑤 and 𝑑 &gt; 𝑤. [9]</figcaption></figure><p>Moreover, the implementation logic of sliding spectrum decomposition is built from standard linear-algebra blocks (windowed similarity, top-K eigen/SVD, weighted penalties, etc.) and can be implemented cleanly in PyTorch with straightforward operations. 
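</p><p>To make the implementation shape concrete, the following is a minimal PyTorch sketch of the greedy selection loop with a window-based similarity penalty. It is illustrative only: the spectral decomposition and exposure tracking described above are omitted, and the function and parameter names are ours rather than the production code’s.</p><pre>import torch<br><br>def greedy_diversified_rank(relevance, emb, beta=0.5, window=10, slate_size=50):<br>    # relevance: (n,) base scores; emb: (n, d) L2-normalized item embeddings.<br>    n = relevance.shape[0]<br>    selected = []<br>    remaining = torch.ones(n, dtype=torch.bool)<br>    for _ in range(min(slate_size, n)):<br>        if selected:<br>            recent = emb[selected[-window:]]        # backward-window items<br>            sim = emb @ recent.T                    # (n, w) cosine similarities<br>            penalty = beta * (sim ** 2).sum(dim=1)  # squared-similarity penalty<br>        else:<br>            penalty = torch.zeros(n)<br>        utility = relevance - penalty<br>        utility[~remaining] = float(&#39;-inf&#39;)        # mask already-placed items<br>        i = int(utility.argmax())<br>        selected.append(i)<br>        remaining[i] = False<br>    return selected</pre><p>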
It avoids positive semi-definite enforcement, log-determinants, and fragile numerical issues common in DPP (e.g., jittered kernels, Cholesky failures), enabling a straightforward “PyTorch-style” model approach with vectorized scoring and lower serving latency.</p><p>In early 2025, we launched the SSD algorithm, leveraging PyTorch for its diversification logic and executing it on our company-wide model serving clusters. The SSD algorithm’s simplicity allowed us to incorporate more features for evaluating pairwise Pin similarities, ultimately leading to an improved balance between engagement and diversification.</p><h4>Unified Soft-Spacing Framework</h4><p>SSD further enabled us to incorporate quality goals when evaluating pairwise Pin similarities in the backward window. For content less aligned with our quality standards, we added a quality penalty score on top of the SSD objective, which we call “soft spacing”: it allows us to avoid having such content clustered together while still balancing engagement and diversification.</p><p>We define the soft spacing penalty qᵢ(t) = 𝟙[cᵢ ∈ R] ∑_{d=1}^{w} (1/d) · 𝟙[c_{t−d} ∈ R]. It applies when item <em>i</em> belongs to the sensitive set <em>R</em> and nearby previously placed items in the backward window also belong to <em>R</em>, with each prior item inversely weighted by distance. We then subtract the soft spacing penalty from the adjusted utility Uᵢ(t), scaled by a coefficient λ to balance it against the other objectives.</p><p>This is an important next step for improving content quality on Pinterest and protecting users from content that warrants additional caution. In the past, we usually relied on strong enforcement such as filtering, which can lead to a less satisfying user experience if there is no backfill. In mid-2025, we launched the soft spacing penalty on content with elevated quality risk to restrict its distribution and uphold Pinterest’s quality standards. In late 2025, we abstracted this logic into an easy-to-use, config-based framework that can be extended as quality needs evolve.</p><h4>System Infrastructure Evolution</h4><p>At the launch of DPP, the main multi-objective optimization (blending) layer was composed of a sequence of “nodes.” Several Lightweight Reranking nodes first perform low-latency reordering to optimize for short-term engagement and coarse diversity. Candidate Pins are then passed to the DPP node, where the more time-intensive DPP algorithm is applied. Before the system outputs the final recommendation list, additional heuristic reordering logic is still needed, such as the spacing strategies mentioned earlier. This chain of nodes is embedded within the Home Feed recommendation backend system. While this setup is relatively robust because it can directly leverage existing backend dependencies, it makes iteration on blending-layer logic challenging due to limited flexibility for local testing and the difficulty of experimenting with new features.</p><p>With the introduction of SSD, a significant portion of the blending layer’s logic, including much of the diversification logic, has been migrated to PyTorch and is now hosted within the company’s model serving cluster. 
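</p><p>As an illustration of the kind of blending logic that now lives in PyTorch-hosted code, here is a minimal sketch of the distance-weighted soft-spacing penalty defined above (illustrative names only, not the production implementation):</p><pre>def soft_spacing_penalty(candidate_category, placed_categories, sensitive, window=5):<br>    # q_i(t) = 1[c_i in R] * sum_{d=1..w} (1/d) * 1[c_{t-d} in R]<br>    # placed_categories: categories of already-placed items, in top-down order.<br>    if candidate_category not in sensitive:<br>        return 0.0<br>    penalty = 0.0<br>    for d in range(1, min(window, len(placed_categories)) + 1):<br>        if placed_categories[-d] in sensitive:<br>            penalty += 1.0 / d  # closer prior items are weighted more heavily<br>    return penalty</pre><p>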
Our ongoing efforts aim to transfer more heuristic logic from the blending layer to the model server, thereby simplifying chain execution within the blending layer.</p><p>The evolution of the blending layer is illustrated in the figure below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-exW-8kiyf2wFzN98vxSeg.png" /><figcaption>Figure 2. Homefeed Blender System Infrastructure Evolution.</figcaption></figure><h4>Evolution of Diversity Signals</h4><p>With DPP, our feed diversification stack relied primarily on categorical signals (taxonomy labels such as home decor, fashion, and cooking) and on GraphSAGE embeddings as the mechanism for defining similarity between Pins.</p><p>In early 2025, we migrated our diversification process to a CPU-served SSD algorithm implemented in PyTorch. This made it easier to incorporate richer embedding representations when computing pairwise Pin similarity. SSD’s lower serving latency, relative to DPP, allows us to use a broader set of signals. Specifically, SSD uses the following embeddings to represent Pins and drive diversification:</p><p><strong>Visual embeddings</strong>: capture visual redundancy and style similarity.</p><p><strong>Text embeddings</strong>: capture overlap in titles and descriptions.</p><p><strong>Graph embeddings</strong> (GraphSAGE): capture relatedness in the Pin graph, including co-engagement patterns and neighborhood similarity.</p><p>In Q2 2025, we added soft-spacing capabilities to address a business need: reducing clustered content exposure without relying on brittle, one-size-fits-all hard-spacing rules. As part of this work, we incorporated content quality signals that identify content requiring additional caution, allowing SSD to demote a candidate when similar content has appeared within a preceding window.</p><p>In Q3 2025, we upgraded SSD’s visual embedding to use PinCLIP image features [10]. PinCLIP provides a stronger multimodal visual representation, learned through image-text alignment with additional graph-aware objectives. Critically, this signal is also available in near real-time, which improves representation quality, and in turn downstream similarity and diversification behavior, for recently ingested Pins.</p><p>More recently, in Q4 2025, we added a Semantic ID signal [11] to address a practical gap: while embeddings are excellent at capturing how close two Pins are, they do not always provide a stable, category-like notion of semantics that is useful for controlling diversity. Semantic IDs provide a hierarchical representation derived through coarse-to-fine discretization of content representations, enabling us to reason more explicitly about semantic overlap between items. In SSD, we discourage recommending too many Pins with high Semantic ID prefix overlap by applying a penalty term. This improves both perceived diversity and engagement by reducing repeated content clusters.</p><p>For future work, we are focusing on ensuring diversity across user-specific interests and on properly representing the interests a user has historically engaged with.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Hai8CwUmLUN1FV8Fet_bw.png" /><figcaption>Figure 3: Diversity component timeline</figcaption></figure><h4>On-going and Future Works</h4><p>We currently have several ongoing efforts to optimize this final layer. 
These include two major workstreams: 1) a unified generative post-ranking model that optimizes final slate generation in an end-to-end manner, and 2) a reinforcement-learning-based value model. We will share more details in later blog posts.</p><h4>Acknowledgement</h4><p>We would like to thank all of our collaborators across Pinterest: Ruimin Zhu, Yaron Greif, Ludek Cigler, Jason Madeano, Alekhya, Jaewon Yang, Xianxing Zhang</p><p><strong>References:<br></strong>[1] <a href="https://medium.com/pinterest-engineering/establishing-a-large-scale-learned-retrieval-system-at-pinterest-eb0eaf7b92c5">Establishing a Large Scale Learned Retrieval System at Pinterest</a><br>[2] <a href="https://medium.com/pinterest-engineering/advancements-in-embedding-based-retrieval-at-pinterest-homefeed-d7d7971a409e">Advancements in Embedding-Based Retrieval at Pinterest Homefeed</a><br>[3] <a href="https://medium.com/pinterest-engineering/pinterest-home-feed-unified-lightweight-scoring-a-two-tower-approach-b3143ac70b55">Pinterest Home Feed Unified Lightweight Scoring: A Two-tower Approach</a><br>[4] <a href="https://arxiv.org/abs/2209.08435">Rethinking Personalized Ranking at Pinterest: An End-to-End Approach</a><br>[5] <a href="https://arxiv.org/abs/2306.00248">TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest</a><br>[6] <a href="https://arxiv.org/abs/1207.6083">Determinantal point processes for machine learning</a><br>[7] <a href="https://jgillenw.com/cikm2018.pdf">Practical Diversified Recommendations on YouTube with Determinantal Point Processes</a><br>[8] <a href="https://arxiv.org/abs/1706.02216">Inductive Representation Learning on Large Graphs</a><br>[9] <a href="https://arxiv.org/abs/2107.05204">Sliding Spectrum Decomposition for Diversified Recommendation</a><br>[10] <a href="https://arxiv.org/pdf/2603.03544">PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest</a><br>[11] <a href="https://arxiv.org/pdf/2305.05065">Recommender Systems with Generative Retrieval</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=06657e33cd10" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/evolution-of-multi-objective-optimization-at-pinterest-home-feed-06657e33cd10">Evolution of Multi-Objective Optimization at Pinterest Home feed</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building an MCP Ecosystem at Pinterest]]></title>
            <link>https://medium.com/pinterest-engineering/building-an-mcp-ecosystem-at-pinterest-d881eb4c16f1?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/d881eb4c16f1</guid>
            <category><![CDATA[engineering-culture]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Thu, 19 Mar 2026 16:01:01 GMT</pubDate>
            <atom:updated>2026-03-19T16:01:01.208Z</atom:updated>
            <content:encoded><![CDATA[<p>Tan Wang | Software Engineer, Agent Foundations</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NxS4wACf5xatHauDP_ExXQ.png" /></figure><p>Over the last year, Pinterest has gone from “MCP sounds interesting” to running a growing ecosystem of <strong>Model Context Protocol (MCP) servers</strong>, a <strong>central registry</strong>, and production integrations in our IDEs, internal chat surfaces, and AI agents. This post walks through what we’ve built so far, how we designed it, and where we’re taking MCP next.</p><h3>What Is MCP and Why Did We Care?</h3><p><a href="https://modelcontextprotocol.io/docs/getting-started/intro"><strong>Model Context Protocol (MCP)</strong></a> is an open-source standard that lets large language models talk to tools and data sources over a unified client-server protocol, instead of bespoke, one-off integrations for every model and every tool. At Pinterest, we’re using MCP as the substrate for AI agents that can safely automate engineering tasks, not just answer questions. That includes everything from “read some logs and tell me what’s wrong” to “look into a bug ticket and propose a fix PR.”</p><h3>The Initial Architecture: Internal MCP + Registry</h3><h4>Hosted, Not Local</h4><p>Although MCP supports local servers (running on your laptop or personal cloud development box, communicating over stdio), we explicitly optimized for <strong>internal cloud-hosted MCP servers</strong>, where our internal routing and security logic can best be applied.</p><p>Local MCP servers are still possible for experimentation, but the paved path is “write a server, deploy it to our cloud compute environment, list it in the registry.”</p><h4>Many Small Servers, Not One Giant One</h4><p>We debated a <strong>single monolithic MCP server</strong> vs. multiple domain-specific servers. We chose the latter: <strong>multiple MCP servers</strong> (e.g., Presto, Spark, Airflow) each own a small, coherent set of tools. This lets us apply <strong>different access controls</strong> per server and avoid crowding the model’s context.</p><p>A common piece of feedback we received early on was that spinning up a new MCP server required too much work: deployment pipelines, service configuration, and operational setup before writing any business logic. To address this, we created a unified deployment pipeline that handles infrastructure for all MCP servers: teams define their tools and the platform handles deployment and scaling of their service. This lets domain experts focus on their business logic rather than figuring out deployment mechanics.</p><h4>The Internal MCP Registry</h4><p>The <strong>MCP </strong><a href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/"><strong>registry</strong></a> is the source of truth for which MCP servers are approved and how to connect to them. It serves two audiences. The <strong>web UI</strong> lets humans discover servers, the owning team, corresponding support channels, and security posture. The Web UI also shows the MCP server’s live status and visible tools. 
The <strong>API</strong> lets AI clients (e.g., our internal AI chat platform, AI agents on our internal communications platform, IDE integrations) discover and validate servers, and lets internal services ask “Is this user allowed to use server X?” before letting an agent call into it.</p><p>This is also the backbone for governance: only servers registered here count as “approved for use in production.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aQrjPcAfUoIF-WUdyRIxBg.png" /><figcaption>Figure 1: architectural diagram of Pinterest’s MCP ecosystem.</figcaption></figure><h3>What We Shipped</h3><h4>A Growing Fleet of MCP Servers</h4><p>We started by seeding a small set of high-leverage MCP servers that solved real pain points, then let other teams build on top of that.</p><p>Representative examples (by usage):</p><ul><li><strong>Presto MCP server</strong>: consistently our highest-traffic MCP server. Presto tools let agents (including AI-enabled IDEs) pull Presto-backed data on demand so agents can bring data directly into their workflows instead of context-switching into dashboards.</li><li><strong>Spark MCP server</strong>: underpins our AI Spark debugging experience, used to diagnose Spark job failures, summarize logs, and help record structured root-cause analyses, turning noisy operational threads into reusable knowledge.</li><li><strong>Knowledge MCP server</strong>: a general-purpose knowledge endpoint (used by our internal AI bot for company knowledge and Q&amp;A and other agents to answer documentation and debugging questions across internal sources), so agents can reach for institutional knowledge with the same ease as calling a tool.</li></ul><h4>Integrations Into Pinterest Surfaces</h4><p>We didn’t want MCP to be a science project; it had to show up where engineers already work.</p><p>Our internal LLM web chat interface is used by the majority of Pinterest employees daily. The frontend automatically performs OAuth flows where required, and returns a list of usable tools for the current user, scoped to respect security policies. Once connected, our AI chat agent binds MCP tools directly into its agent toolset so invoking MCP feels no different from calling any other tool.</p><p>We also have AI bots embedded in our internal chat platform, which also exposes MCP tools. Like our LLM web chat interface, it handles authentication and authorization through the registry API. It also supports functionality such as restricting certain MCP tools to certain communication channels (for example, Spark MCP tools are only available in Airflow support channels).</p><p>An overview of the flow from starting to build an MCP server to when it’s consumed by an end user:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y5mu5OeZuuUP5PTOuFBvhg.png" /><figcaption>Figure 2: end-to-end flow of developing an MCP server</figcaption></figure><h3>Security, Governance, and Policy</h3><p>Letting AI agents call tools that <strong>touch real systems and data</strong> raises obvious security questions. We’ve treated MCP as a joint project with Security from day one.</p><h4>Security Standards and Review</h4><p>We defined a dedicated <strong>MCP Security Standard</strong>. Every MCP server that is not a one-off experiment must be tied to an owning team, appear in the <strong>internal MCP registry</strong>, and go through review, yielding Security, Legal/Privacy, and (where applicable) GenAI review tickets that must be approved before production use. 
This set of reviews determines the security policies that are put in place around the MCP server, such as which user groups the server’s access is limited to.</p><h4>AuthN and AuthZ</h4><p>At runtime, almost every MCP call is governed by two layers of auth: <strong>end-user JWTs</strong> and <strong>mesh identities</strong>.</p><p><strong>End-user flow (JWT-based)</strong></p><ol><li>A user interacts with a surface like our web AI chat interface, an IDE plugin, or an AI bot.</li><li>The client performs an OAuth flow against our internal auth stack and sends the resulting JWT when it connects to the MCP registry and the target MCP server.</li><li>Envoy validates the JWT, maps it to X-Forwarded-User, X-Forwarded-Groups, and related headers, and enforces coarse-grained security policies (for example, “AI chat webapp in prod may talk to the Presto MCP server, but not to experimental MCP servers in dev namespaces”).</li><li>Inside the server, tools use a lightweight @authorize_tool(policy=&#39;…&#39;) decorator to enforce finer-grained rules (for example, only Ads-eng groups can call get_revenue_metrics, even if the server itself is reachable from other orgs). A sketch of this pattern follows the lists below.</li></ol><p>Note that since some MCP servers can execute queries against sensitive internal data systems (like the Presto MCP server), we implemented <strong>business-group-based access gating</strong>. Rather than granting access to all authenticated Pinterest employees and contractors, some servers will:</p><ol><li>Extract business group membership from the user’s JWT</li><li>Validate that the user belongs to an authorized group before accepting the connection (the list of approved groups is set during the initial review stage)</li><li>Selectively enable capabilities only for users whose roles require data access</li></ol><p>At Pinterest, this means that even though the Presto MCP server is technically reachable from broad surfaces like our LLM web chat interface, only a specific set of approved business groups (for example, Ads, Finance, or specific infra teams) can establish a session and run the higher-privilege tools. Turning on a powerful, data-heavy MCP server in a popular surface therefore doesn’t silently expand who can see sensitive data.</p><p>Some servers require a valid JWT even for tool discovery. That gives us user-level attribution for every invocation and a clean way to reason about “who did what” when we look at logs.</p><p><strong>Service-only flows (SPIFFE-based)</strong></p><p>For low-risk, read-only scenarios, we can rely on <strong>SPIFFE-based auth</strong> (mesh identity only). Our internal service mesh still enforces security policies, but the server authorizes based on the calling service’s mesh identity instead of a human JWT. We reserve this pattern for cases where there’s no end user in the loop and the blast radius is tightly constrained.</p><p><strong>Contrast with the MCP OAuth Standard</strong></p><p>The MCP specification defines an <a href="https://modelcontextprotocol.io/specification/draft/basic/authorization">OAuth 2.0 authorization flow</a> where users explicitly authenticate with each MCP server, typically involving consent screens and per-server token management. Our approach is different: users already authenticate against our internal auth stack when they open a surface like the AI chat interface, so we piggyback on that existing session. There is no additional login prompt or consent dialog when a user invokes an MCP tool. 
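</p><p>As a rough sketch of what tool-level enforcement can look like (the decorator internals below are hypothetical; only the @authorize_tool pattern itself comes from our stack):</p><pre>from functools import wraps<br><br>def authorize_tool(policy):<br>    # Hypothetical sketch: gate a tool behind a required user group.<br>    def decorator(fn):<br>        @wraps(fn)<br>        def wrapper(*args, user_groups=(), **kwargs):<br>            # user_groups would come from the validated JWT headers.<br>            if policy not in user_groups:<br>                raise PermissionError(f&#39;missing required group: {policy}&#39;)<br>            return fn(*args, **kwargs)<br>        return wrapper<br>    return decorator<br><br>@authorize_tool(policy=&#39;ads-eng&#39;)<br>def get_revenue_metrics(advertiser_id):<br>    ...  # query and return metrics for the advertiser<br><br># get_revenue_metrics(12345, user_groups=(&#39;ads-eng&#39;,)) succeeds;<br># any caller without the group gets a PermissionError.</pre><p>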
Envoy and our policy decorators handle authorization transparently in the background, giving us fine-grained control over who can call which tools without surfacing the complexity of per-server authorization flows to the end user.</p><h4>Human in the Loop</h4><p>Because MCP servers enable automated actions, the blast radius is larger than if a human manually wielded these tools. Our agent guidance therefore mandates <strong>human-in-the-loop</strong> before any sensitive or expensive action: agents propose actions using MCP tools, and humans approve or reject (optionally in batches) before execution. We also use <a href="https://modelcontextprotocol.io/specification/draft/client/elicitation"><strong>elicitation</strong></a> to confirm dangerous actions. In practice, this looks like our AI agents asking for confirmation before applying a change that would, for example, overwrite data in a table.</p><h3>Observability and Success Metrics</h3><p>We didn’t want MCP to become a black box. From the start, we designed it to be <strong>measured and observable</strong>. All MCP servers at Pinterest use a set of library functions that provide logging for inputs/outputs, invocation counts, exception tracing, and other telemetry for impact analysis out of the box. At the ecosystem level, we measure the <strong>number of MCP servers</strong> and tools registered, the <strong>number of invocations</strong> across all servers, and the <strong>estimated time-savings per invocation</strong> provided as metadata by server owners.</p><p>These roll up into a single north-star metric: <strong>time saved</strong>. For each tool, owners provide a directional “minutes saved per invocation” estimate (based on lightweight user feedback and comparison to the prior manual workflow). Combined with invocation counts, we get an order-of-magnitude view of impact, which we treat as a directional signal of value. As of January 2025, MCP servers have ramped up to <strong>66,000 invocations per month</strong> across <strong>844 monthly active users</strong>. Using these estimates, MCP tools are saving on the order of <strong>7,000 hours per month</strong> (an average of roughly 6.4 minutes saved per invocation).</p><h3>Conclusion</h3><p>In the past year, Pinterest has successfully transitioned from an initial concept to a robust, production-ready ecosystem for the Model Context Protocol (MCP). By explicitly choosing an architecture of multiple internal, cloud-hosted, domain-specific MCP servers connected via a central registry, we have built a flexible and secure substrate for AI agents. These high-leverage tools are integrated directly into employees’ daily workflows, meeting them where they work.</p><p>Crucially, this entire system was built with a security-first mindset. Our two-layer authorization model using end-user JWTs and mesh identities, combined with a dedicated MCP Security Standard and business-group-based access gating on sensitive servers like Presto, ensures that powerful AI agents operate with the principles of least privilege and full auditability.</p><p>The results are clear: the MCP ecosystem has already grown to over 66,000 invocations per month, delivering an estimated 7,000 hours of time saved monthly for our engineers. 
This success confirms the value of using an open-source standard to unify tool access for AI.</p><p>Looking ahead, we will continue to expand the fleet of MCP servers, deepen integrations across more engineering surfaces, and refine our governance models as we empower more AI agents to safely automate complex engineering tasks, further boosting developer productivity at Pinterest.</p><h3>Acknowledgements</h3><p>This AI-enabled MCP ecosystem would not have been possible without:</p><ul><li>Nick Borgers, Kalpesh Dharwadkar, Amine Kamel from our security engineering team</li><li>Scott Beardsley, James Fish from our traffic engineering team</li><li>Leon Xu, Charlie Gu, Kingsley Ochu from our AI Agent Foundations team</li><li>Scott Herbert, Anthony Suarez, Kartik Paramasivam for their engineering sponsorship and guidance</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d881eb4c16f1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/pinterest-engineering/building-an-mcp-ecosystem-at-pinterest-d881eb4c16f1">Building an MCP Ecosystem at Pinterest</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unified Context-Intent Embeddings for Scalable Text-to-SQL]]></title>
            <link>https://medium.com/pinterest-engineering/unified-context-intent-embeddings-for-scalable-text-to-sql-793635e60aac?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/793635e60aac</guid>
            <category><![CDATA[agentic-bi]]></category>
            <category><![CDATA[context-engineering]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[text-to-sql]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 22:01:01 GMT</pubDate>
            <atom:updated>2026-03-09T16:45:08.451Z</atom:updated>
<content:encoded><![CDATA[<p>Your Analysts Already Wrote the Perfect Prompt</p><p>Authors: Keqiang Li, Bin Yang</p><p>In our <a href="https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff">previous blog post</a>, we shared how Pinterest built Text-to-SQL with retrieval-augmented generation (RAG)-based table selection. That system introduced schema-grounded SQL generation and retrieval-augmented table selection. These were important first steps, but not enough for reliable analytics at Pinterest scale.</p><p>The challenge was fundamental: with over 100,000 analytical tables and 2,500+ analytical users across dozens of domains, simple keyword matching and table summaries were not enough. When an analyst asks “What’s the engagement rate for organic content by country?”, they need more than a list of tables with similar names. They need the system to understand <em>analytical intent</em> (the business question behind the query) and to surface patterns that have actually worked for similar analyses.</p><p>This article describes how we evolved from basic Text-to-SQL to a production Analytics Agent that helps analysts discover tables, find reusable queries, and generate validated SQL from natural language. Now the most widely adopted agent at Pinterest, it was built on two key engineering choices:</p><ol><li><strong>Unified context-intent embeddings</strong> — We transform historical analyst queries into context-rich semantic representations that capture analytical intent (the business question a query was designed to answer) rather than raw SQL syntax. This enables semantic retrieval that understands meaning, not just keywords.</li><li><strong>Structural and statistical patterns with governance-aware ranking</strong> — We extract validated join keys, filters, aggregation logic, and usage signals from query history, and combine them with governance metadata (table tiers, freshness, documentation quality) to rank results. This ensures the system surfaces not just relevant tables, but <em>trustworthy</em> ones grounded in patterns that have actually worked.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3jC8YZyfiS0t727luJGuTw.png" /></figure><h3>The Foundation: From 400K Tables to AI-Ready Data</h3><p>Before we could build an intelligent analytics assistant, we needed to solve a more basic problem: our data warehouse was a mess.</p><p>A few years ago, Pinterest’s data warehouse had <strong>hundreds of thousands of tables</strong>, most with no clear owner or documentation. 
Our governance roadmap called for reducing the table footprint from roughly 400K to around 100K through standardization and cleanup.</p><p>We launched a table governance and tiering program:</p><ul><li><strong>Tier 1</strong>: Cross-team, production-quality tables with strict documentation and quality requirements.</li><li><strong>Tier 2</strong>: Team-owned tables with lighter but still enforced standards.</li><li><strong>Tier 3</strong>: Everything else, including staging, temporary, and legacy tables, subject to aggressive retention and deprecation policies.</li></ul><p>With these governance constructs, PinCat, Pinterest’s internal data catalog built on open source <a href="https://datahubproject.io/">DataHub</a>, became the system of record for:</p><ul><li>Table tier tags, owners, and retention policies</li><li>Column-level semantics via <a href="https://docs.datahub.com/docs/glossary/business-glossary"><strong>glossary terms</strong></a> (reusable business concepts like user_id or pin_id)</li></ul><p>This governance work laid the groundwork for everything that followed. It gave us a clear map of “good” tables to prioritize and a structured way to express meaning at the column level, which are essential inputs for any AI system.</p><h3>Encoding Analytical Knowledge from Query History</h3><p>Here is where our approach diverges from traditional Text-to-SQL systems.</p><p>Why not just use an LLM with standard RAG? Most approaches index tables by their documentation and maybe some sample queries, then retrieve tables with semantically similar descriptions when a user asks a question. This works for simple cases, but breaks down in an environment like ours:</p><ul><li>The analytical question does not match any table description’s wording</li><li>Multiple tables could answer the question, but only specific join patterns work</li><li>The “right” way to compute a metric involves Pinterest-specific conventions</li><li>Quality signals (table tiering), authoritative schemas, and established query patterns live in different systems, so no single search retrieves all the context needed</li></ul><p>Without systematic access to how analytics is actually done at Pinterest (the tables, joins, filters, and metric definitions that analysts rely on daily), success depends on chance rather than grounded knowledge.</p><p>Our solution: encode analytical knowledge from query history along two complementary dimensions — <strong>unified context-intent embeddings</strong> that capture the meaning behind queries, and <strong>structural and statistical patterns</strong> that capture how queries are built and how well they perform.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vk70fy_LLMNMU3jQSJlOew.png" /></figure><h4><strong>Analytical Intent as Unified Context-Intent Embeddings</strong></h4><p>We convert each SQL query into a semantically rich natural-language description that captures the business question the query was designed to answer. This happens through a three-step pipeline:</p><p><strong>Step 1: Domain Context Injection</strong></p><p>Before we attempt to interpret a query, we inject Pinterest-specific semantic information alongside the raw SQL:</p><ul><li><strong>Table and column descriptions</strong> from PinCat to add business context</li><li><strong>Standardized glossary terms</strong> (e.g., “advertiser_id” maps to 
g_advertiser_id in one table and adv_id in another)</li><li><strong>Metric definitions</strong> (e.g., “engaged user” means specific action types)</li><li><strong>Domain expertise</strong> such as data quality caveats or recommended date ranges</li></ul><p>At Pinterest’s scale, maintaining this context manually would be impractical. As we describe in Scaling Documentation with AI and Lineage, we use AI-generated documentation, join-based glossary propagation, and search-based semantic matching to keep this context rich and up to date automatically.</p><p>This context is critical: without it, a downstream LLM would see only raw table and column names and miss the business meaning behind them.</p><p><strong>Step 2: SQL to Text</strong></p><p>With domain context in hand, we use an LLM to translate each SQL query into a structured description of the query author’s original analytical intent. Rather than producing a simple one-line summary, the LLM generates three complementary outputs: a <strong>high-level summary</strong> that captures business purpose and domain, a set of <strong>analytical questions</strong> the query could help answer, and a <strong>detailed breakdown</strong> of the query’s logic in plain English.</p><p>Consider this ads performance query:</p><pre>SELECT<br>    keyword,<br>    SUM(impressions) AS total_impressions,<br>    SUM(revenue) / NULLIF(SUM(IF(is_first_conversion, clicks, 0)), 0) AS cpc,<br>    (SUM(revenue) / NULLIF(SUM(IF(is_first_conversion, impressions, 0)), 0)) * 1000 AS cpm<br>FROM ads.keyword_performance<br>WHERE dt BETWEEN &#39;2024-10-01&#39; AND &#39;2024-10-31&#39;<br>  AND advertiser_id = 12345<br>  AND keyword IS NOT NULL<br>GROUP BY keyword<br>ORDER BY total_impressions DESC</pre><p>Our SQL-to-text transformation produces:</p><p><strong>Summary:</strong> <em>“Extracts ad performance metrics — total impressions, CPC, and CPM by keyword for a specific advertiser. CPC and CPM are calculated based on first-conversion events, focusing on ad effectiveness in acquiring new customers.”</em></p><p><strong>Analytical questions:</strong></p><ul><li><em>What are the top-performing keywords by impressions for a given advertiser?</em></li><li><em>How cost-effective are ad campaigns based on CPC and CPM for different keywords?</em></li></ul><p><strong>Detailed breakdown:</strong> Column definitions, transformation logic (CPC derived from first-conversion revenue divided by first-conversion clicks), filters applied, and the business purpose of optimizing keyword targeting within the advertising ecosystem.</p><p>Two design choices make this process effective at scale. First, the <strong>analytical questions</strong> create a direct bridge between future user questions and indexed queries. When a new analyst asks “What’s the CPC for our top keywords?”, the system matches their question against questions it already knows how to answer — not just query descriptions. This is what enables intent-based retrieval to work across different phrasings, table names, and column structures.</p><p>Second, the descriptions are kept <strong>deliberately generalizable</strong>: the LLM strips temporal specifics (exact dates, individual IDs) while preserving business-meaningful values like metric types and entity categories. A query originally written for “October 2024 keyword performance” generalizes to match future questions about “ad CPC by keyword” regardless of date range. 
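</p><p>As a toy illustration of this generalization step (a simple regex pass with made-up placeholder tokens; the real system performs this with an LLM):</p><pre>import re<br><br>def generalize(description):<br>    # Replace exact dates and long numeric IDs with placeholders so the<br>    # description matches future questions regardless of specifics.<br>    description = re.sub(r&#39;\d{4}-\d{2}-\d{2}&#39;, &#39;[DATE]&#39;, description)<br>    description = re.sub(r&#39;\b\d{4,}\b&#39;, &#39;[ID]&#39;, description)<br>    return description</pre><p>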
Together, these choices turn years of analysts’ institutional SQL knowledge into a reusable, searchable knowledge base.</p><p><strong>Step 3: Text to Embedding</strong></p><p>The natural-language description is then embedded into a vector representation. This enables <strong>intent-based retrieval</strong>: when a new question comes in, we embed it the same way and find historical queries that answered similar analytical questions, regardless of exact keyword matches. A question about “organic engagement by market” can match a query originally described as “non-promoted pin interaction rates by country” because the embeddings capture semantic similarity, not lexical overlap.</p><h4>Structural &amp; Statistical Patterns</h4><p>While analytical intent captures <em>what</em> a query means, we also need to capture <em>how</em> queries are built and <em>how well</em> they perform. We extract two categories of hard facts from query history:</p><p><strong>Structural patterns</strong> are derived by parsing SQL queries:</p><ul><li><strong>Join patterns</strong>: Which tables are joined, on which keys, and with what conditions</li><li><strong>Common filters</strong>: Typical WHERE clauses and partition filters for each table</li><li><strong>Aggregation patterns</strong>: How metrics are computed (COUNT DISTINCT vs SUM, grouping dimensions)</li><li><strong>Subquery structures</strong>: Common CTEs (Common Table Expressions) and nested query patterns for complex analyses</li></ul><p><strong>Statistical signals</strong> are aggregated from query execution metadata:</p><ul><li><strong>Table co-occurrence frequency</strong>: How often tables are queried together signals analytical relationships</li><li><strong>Query success rates</strong>: Patterns from successful queries are weighted higher than failed attempts</li><li><strong>Usage recency and volume</strong>: Recent, frequently-used patterns reflect current best practices</li><li><strong>Author expertise</strong>: Queries from experienced analysts in specific domains carry higher weight</li></ul><p>These statistical signals combine with <strong>governance metadata</strong> (table tiers, data freshness, documentation completeness) to form what we call <strong>governance-aware ranking</strong>. When retrieval returns candidate tables and patterns, the system does not rank by semantic similarity alone. It fuses similarity scores with trust signals: a Tier-1 table with active ownership and fresh data ranks higher than a semantically similar but deprecated or undocumented alternative. This ensures the system surfaces not just <em>relevant</em> tables, but <em>trustworthy</em> ones.</p><p>Together, structural patterns and governance-aware ranking form a <strong>library of validated, trusted solutions</strong> that guide query generation. When the agent generates SQL, it does not guess at join keys or filters — it uses patterns that have been <strong>actively used and validated by Pinterest analysts</strong> thousands of times, drawn from the most reliable sources in the warehouse.</p><h4>How the Two Dimensions Work Together</h4><p>These two dimensions complement each other: analytical intent enables semantic retrieval by converting queries into meaning-rich embeddings, while structural and statistical patterns provide the concrete, validated SQL building blocks needed to act on that retrieval. 
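</p><p>The sketch below is a minimal illustration of the retrieval half: it embeds stored analytical questions and matches a new question by cosine similarity. The embedding function is a deterministic stand-in for a real model, and the in-memory index stands in for our vector database:</p><pre>import numpy as np<br><br>def embed(text):<br>    # Stand-in for a real embedding model: a deterministic toy unit vector.<br>    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))<br>    v = rng.standard_normal(64)<br>    return v / np.linalg.norm(v)<br><br>questions = [<br>    &#39;What are the top-performing keywords by impressions for a given advertiser?&#39;,<br>    &#39;How cost-effective are ad campaigns based on CPC and CPM for different keywords?&#39;,<br>]<br>index = np.stack([embed(q) for q in questions])<br><br>def retrieve(question, k=2):<br>    scores = index @ embed(question)  # cosine similarity on unit vectors<br>    return [questions[i] for i in np.argsort(-scores)[:k]]</pre><p>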
The following diagram illustrates how a single SQL query flows through both dimensions to produce encoded knowledge:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rVEENfxEudrFu9txjhPImA.png" /></figure><p>To see this in practice, consider a common analytical task:</p><p><strong>The user asks:</strong> <em>“What’s the engagement rate for organic Pins by country?”</em></p><p><strong>What the agent retrieves:</strong></p><ol><li><strong>Analytical Intent</strong>: By leveraging its unified context-intent embedding space, the agent can retrieve highly relevant queries based on intent semantics. This capability is robust against variations in table names, column structures, and specific filters (like “by country”), which would otherwise cause failures in traditional keyword-based search. Furthermore, the agent understands that “engagement rate” at Pinterest means specific action types (saves, clicks, closeups) divided by impressions, and “organic” excludes promoted content.</li><li><strong>Structural &amp; Statistical Patterns</strong>: Surfaces validated join keys (engagement queries typically join user_actions to pins on pin_id with specific filters for organic content), prioritizes patterns from frequently-used, successful queries (98%+ success rate, high monthly usage), and applies proven aggregation logic.</li></ol><p><strong>Result</strong>: The agent generates SQL that follows established patterns, uses correct join keys, and applies domain-specific business logic — all learned from the accumulated knowledge encoded in query history.</p><h4>The Self-Reinforcing Learning Cycle</h4><p>This setup works because of a core insight: <strong>your analysts already wrote the perfect prompt</strong>. Every SQL query an analyst has ever written (the tables they chose, the joins they constructed, the filters they applied, the metrics they computed) encodes hard-won domain expertise. Traditional Text-to-SQL systems ask an LLM to figure out these patterns from scratch for every question. We instead treat query history as a vast library of expert-authored analytical solutions, and unified context-intent embeddings are the key that makes this library searchable by meaning rather than syntax.</p><p>And because every new query enriches the library, the system is self-reinforcing. As analysts across Pinterest write more queries, each one becomes a new entry in the knowledge base:</p><ul><li>New analytical patterns emerge as teams develop novel approaches to measurement</li><li>Metric calculation standards evolve and propagate across teams</li><li>Join conventions spread as validated patterns are reused</li><li>Domain-specific filters and aggregations become discoverable to analysts outside the original domain</li></ul><p>The analyst who figures out how to compute retention by acquisition channel doesn’t just answer their own question — they write a reusable recipe that any future analyst can discover by simply asking in plain English. The more analysts use the data warehouse, the more knowledge the agent absorbs, and the better it gets at helping the next analyst. In effect, every analyst at Pinterest is continuously teaching the system, making the combined expertise of over 2,500 analysts accessible to everyone rather than siloed within teams.</p><h4>Scaling Documentation with AI and Lineage</h4><p>Unified context-intent embeddings require rich documentation to inject domain context. 
But manual documentation alone was never going to keep pace with a warehouse of this size.</p><p>We attacked the problem on three fronts.</p><h4><strong>AI-Generated Table and Column Docs</strong></h4><p>We built <strong>AI Table Documentation</strong>, a system that uses LLMs to generate table and column descriptions from multiple signals:</p><ul><li>Data lineage - upstream and downstream tables and their documentation</li><li>Existing PinCat docs, if present</li><li>Column-level glossary terms</li><li>Representative example queries from QueryBook (Pinterest’s collaborative SQL editor, where analysts write, run, and share queries)</li></ul><p>For highly curated Tier-1 tables, we kept humans in the loop. For Tier-2 tables, we flipped the ratio: LLMs draft, humans review. All AI-generated docs are clearly marked as such in PinCat, and owners are notified to review and edit over time.</p><h4><strong>Column Semantics via Join-Based Lineage</strong></h4><p>To make documentation reusable across tables, we invested heavily in <strong>glossary term propagation</strong>, which automatically infers column semantics from join patterns:</p><ul><li>We analyzed query logs to build a <strong>join graph</strong> between columns (e.g., data.pins_d.id joining to ad.ad_video_event_flat_spark.objectid)</li><li>When a well-documented column (with a glossary term like pin_id) repeatedly joins to an undocumented column, we propagate that glossary term to the undocumented side</li></ul><p>This join-derived lineage allowed us to auto-tag thousands of columns with high-quality glossary terms.</p><h4>Search-Based Propagation</h4><p>For cases where join patterns were sparse, we complemented lineage with <strong>search-based propagation</strong>: indexing glossary terms and column docs into a vector database, enabling semantic similarity search between column descriptions and existing glossary term definitions.</p><p>Together, these efforts mean that as high-quality docs are added in one place, they automatically propagate to related columns and tables, dramatically reducing the manual documentation burden.</p><p>The results have been significant. AI-generated table descriptions reduced manual documentation effort by approximately 40%, with user surveys rating over 75% of these descriptions as “usable” or better. Join-based lineage auto-tagged over 40% of columns in scope, and combined with search-based propagation, these efforts reduced overall manual documentation work by nearly 70% while keeping humans in the loop for critical assets.</p><h4>Infrastructure: Vector DB as a Service</h4><p>Building unified context-intent embeddings and generating AI documentation both produce vectors that need to be stored, searched, and kept up to date. As more teams across Pinterest started building LLM features (table search, Text-to-SQL, AI documentation), it became clear we were all reinventing the same infrastructure: custom indexes, ad hoc ingestion jobs, and brittle retrieval logic.</p><p>To avoid a proliferation of one-off solutions, we built an internal <strong>Vector Database as a Service</strong>.</p><h4><strong>Built on OpenSearch, Integrated with Our Data Stack</strong></h4><p>After evaluating several options, we standardized on <strong>AWS OpenSearch</strong> for our internal productivity use cases. 
We paired it with existing infrastructure:</p><ul><li><strong>Tables</strong> as the source of truth for vectorized datasets</li><li><strong>Airflow</strong> to run index creation and ingestion DAGs</li></ul><p>Teams define a vector index via a simple JSON schema specifying the index alias, vector field dimensionality (e.g., 1536-dim embeddings), and source Hive table mappings. An Airflow workflow then validates the config, creates the index, and publishes metadata so other teams can discover and reuse existing knowledge bases.</p><h4>Scalable Indexing with Daily Updates</h4><p>The service handles <strong>millions of embeddings</strong> across tables, queries, column descriptions, and documentation, with daily incremental updates as new data assets and queries are created.</p><p>It supports hybrid patterns that combine semantic similarity (vector distance) with traditional metadata filters. For example, you can search for “tables semantically similar to user_actions that are Tier 1 and contain impression data.”</p><p>This pattern lets teams go from <strong>zero to a production-grade vector index in days instead of weeks</strong>, without having to solve embedding, ingestion, and monitoring from scratch.</p><h3>The Pinterest Analytics Agent: Putting It All Together</h3><p>With governance, documentation, query indexing, and vector infrastructure in place, we could finally build what many analysts actually wanted: <strong>a natural-language assistant that understands Pinterest’s data</strong>.</p><p>The <strong>Pinterest Analytics Agent</strong> is a specialized LLM-driven system that:</p><ul><li>Answers questions like “<em>What table should I use to analyze retention for organic content?</em>”</li><li>Generates and validates SQL from natural language</li><li>Finds and reuses existing analytical assets where possible</li></ul><p>A core design principle is the <strong>asset-first approach</strong>: the agent should surface existing, trusted assets (tables, curated queries, dashboards, metric definitions) before generating new SQL. Today, this is implemented for table and query discovery; as we index more asset types, the agent progressively expands what it can surface, promoting reuse and consistency across teams.</p><h3>Architecture Overview</h3><p>The agent’s architecture has four layers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0G9xhvQ8iX6LPwWBx0q76A.png" /></figure><p><strong>Agent Orchestration Layer</strong>: An LLM with Pinterest-specific prompts classifies tasks (documentation lookup, table discovery, query discovery, Text-to-SQL, execution) and decides which tools to call and in what order.</p><p><strong>MCP Integration Layer</strong>: A set of Model Context Protocol (MCP) tools providing a unified interface to table search (backed by vector DB + PinCat), query search (our query description index), knowledge search (internal docs), and Presto execution with EXPLAIN validation.</p><p><strong>Context Layer</strong>: The knowledge foundation, including PinCat schemas and table tiers, vector indexes of tables and queries, expert-curated docs and metric definitions, and usage patterns from query logs.</p><p><strong>Execution Layer</strong>: Presto for validated SQL with EXPLAIN-before-EXECUTE, tight LIMITs, and error-recovery loops.</p><h4>An End-to-End Query Flow</h4><p>When a user asks:</p><p>“Show me weekly retention for new users in the US over the past three months.”</p><p>The agent:</p><p><strong>1. 
Classifies the task as Text-to-SQL</strong></p><p><strong>2. Retrieves context in parallel</strong><br>• Table search and ranking, using our knowledge base for semantic search and statistics-based ranking<br>• Relevant historical queries from the query index (using unified context-intent embeddings)<br>• Table metadata from PinCat (tiers, owners, freshness)<br>• Any metric definitions or docs that mention retention</p><p><strong>3. Generates SQL with strict validation:<br></strong> • References only existing tables/columns (PinCat validation)<br>• Uses column profiling data to ensure filter values match actual data (e.g., &#39;WEB&#39; not &#39;web&#39;), avoiding “looks right but returns nothing” failures<br>• Reuses known join keys and filters from historical queries<br>• Runs EXPLAIN before executing; if it fails, iterates with fixes up to a bounded retry limit<br>• Enforces a conservative LIMIT (100 rows or fewer) by default</p><p><strong>4. Returns results with transparency</strong>:<br>• The SQL it ran<br>• Tables and date ranges used<br>• Source references (schemas, queries, docs)<br>• Confidence indicators or warnings (e.g., suspicious joins, empty results)</p><p>From the user’s perspective, they get <strong>a working analysis in minutes</strong>, and crucially, it is grounded in the same governed tables and metrics their teammates use, not a hallucinated subset of the warehouse.</p><h4>Resolving Conflicting Signals</h4><p>With multiple sources of context, conflicts are inevitable. A query pattern might suggest one join key while documentation recommends another. When multiple sources provide conflicting information, the agent follows a defined hierarchy:</p><ol><li><strong>Expert-curated documentation</strong> (canonical guides, metric definitions) serves as the primary source of truth for business logic</li><li><strong>Schema metadata from PinCat</strong> is authoritative for column names, types, and table structure</li><li><strong>Query patterns</strong> provide guidance but are validated against schemas before use</li><li><strong>General knowledge base</strong> supplements when specialized sources lack coverage</li></ol><p>This hierarchy ensures that carefully curated Pinterest-specific knowledge takes precedence over general information, while schema metadata provides the ultimate ground truth for what actually exists in the data warehouse. The result: the agent generates SQL that is both semantically correct (aligned with business intent) and syntactically valid (grounded in actual schemas).</p><h3>Impact and Adoption</h3><p>With the full system in production, the benefits span three areas:</p><ul><li><strong>Speed</strong>: Analysts go from question to working SQL in minutes rather than hours of table exploration and debugging.</li><li><strong>Cross-domain discovery</strong>: Query patterns developed by one team become accessible to all through the shared index.</li><li><strong>Consistency</strong>: Generated queries follow established conventions and governed tables rather than ad-hoc approaches.</li></ul><p>Early adoption has validated these benefits. Within two months of launch, the Analytics Agent already covers 40% of our analyst population, with a goal to reach 50% by year-end. 
It is the <strong>#1 agent at Pinterest</strong>, with 10x the usage of the next most-used agent.</p><p>Beyond the agent itself, the semantic search capabilities we built to power it have become widely adopted across the company: our MCP tools for table and query search rank among Pinterest’s most popular internal tools.</p><h3>Evaluation and What We’re Learning</h3><p>To measure the agent’s effectiveness, we built a benchmarking framework focusing on two core capabilities: finding the correct tables to answer an analytical question, and generating correct SQL. Early results show that the agent meets expectations for table discovery. SQL generation has room for improvement, and the hardest cases are teaching us where to invest next:</p><ul><li><strong>Complex analytical logic</strong>: Multi-step calculations and window functions that require chaining multiple reasoning steps</li><li><strong>Ambiguous business terms</strong>: Concepts not yet captured in documentation, where the agent must fall back on general knowledge</li><li><strong>Cross-domain queries</strong>: Analyses spanning multiple domains that may surface conflicting join patterns or metric definitions</li><li><strong>Schema evolution</strong>: Recently deprecated tables whose patterns still appear in the index</li></ul><p>We mitigate these through human review, EXPLAIN validation before execution, and continuous index updates. We continue to expand test coverage with SME-verified answers, improve our evaluation judges, and incorporate real user interactions to create more representative test cases. As the agent gains new capabilities, we will add corresponding test coverage to ensure quality across all supported functionality.</p><h3>Looking Ahead</h3><p>This multi-year journey demonstrates that effective AI-powered analytics requires <strong>systematic infrastructure investment</strong>, not just plugging an LLM into existing tools.</p><p>Several lessons have already proven out:</p><p><strong>Governance and AI reinforce each other.</strong> A disciplined tiering and documentation program made AI assistance viable; the AI systems, in turn, made large-scale governance and documentation tractable.</p><p><strong>Query history is valuable.</strong> Systematically indexing and semantically enriching queries gave us a reusable knowledge base that powers table and query search, Text-to-SQL, and documentation alike.</p><p><strong>Unified context-intent embeddings beat simple RAG.</strong> By capturing analytical intent (domain-enriched, semantically embedded query descriptions) alongside structural and statistical patterns (validated joins, filters, co-occurrence, and success rates), we achieve far higher relevance than keyword matching or simple table summaries.</p><p><strong>Specialization beats generic agents.</strong> Grounding the agent in Pinterest’s schemas, metrics, and assets through MCP tools and a rich context layer produces significantly more reliable results than a generic “LLM + search” stack.</p><p>Looking ahead, we are expanding the agent’s capabilities across several dimensions:</p><ul><li><strong>Broader asset discovery</strong>: Extending our asset-first principle beyond tables and queries to dashboards, datasets, metric definitions, curated query libraries, and workflow artifacts, surfacing trusted, pre-existing answers before generating new queries, and making the full breadth of Pinterest’s analytical assets discoverable through natural language.</li><li><strong>Deeper product integration</strong>: Embedding the 
<h3>Looking Ahead</h3><p>This multi-year journey demonstrates that effective AI-powered analytics requires <strong>systematic infrastructure investment</strong>, not just plugging an LLM into existing tools.</p><p>Several lessons have already proven out:</p><p><strong>Governance and AI reinforce each other.</strong> A disciplined tiering and documentation program made AI assistance viable; the AI systems, in turn, made large-scale governance and documentation tractable.</p><p><strong>Query history is valuable.</strong> Systematically indexing and semantically enriching queries gave us a reusable knowledge base that powers table and query search, Text-to-SQL, and documentation alike.</p><p><strong>Unified context-intent embeddings beat simple RAG.</strong> By capturing analytical intent (domain-enriched, semantically embedded query descriptions) alongside structural and statistical patterns (validated joins, filters, co-occurrence, and success rates), we achieve far higher relevance than keyword matching or simple table summaries.</p><p><strong>Specialization beats generic agents.</strong> Grounding the agent in Pinterest’s schemas, metrics, and assets through MCP tools and a rich context layer produces significantly more reliable results than a generic “LLM + search” stack.</p><p>Looking ahead, we are expanding the agent’s capabilities across several dimensions:</p><ul><li><strong>Broader asset discovery</strong>: Extending our asset-first principle beyond tables and queries to dashboards, datasets, metric definitions, curated query libraries, and workflow artifacts; surfacing trusted, pre-existing answers before generating new queries; and making the full breadth of Pinterest’s analytical assets discoverable through natural language.</li><li><strong>Deeper product integration</strong>: Embedding the agent directly into <a href="https://www.querybook.org/">QueryBook</a> and Superset so analysts can get assistance in context, without switching tools.</li><li><strong>Richer analysis capabilities</strong>: Moving beyond SQL generation to include visualization recommendations, Python-based analysis, and the ability to create dashboards and charts directly.</li><li><strong>Interoperability with other agents</strong>: As AI assistants proliferate across the organization, enabling our analytics agent to collaborate with agents in other domains.</li></ul><p>These same foundations of governance, semantic indexing, and unified context-intent embeddings will continue to be the core of how we make Pinterest’s data understandable and useful to everyone.</p><h3>Acknowledgements</h3><p>The Analytics Agent was a cross-functional initiative spanning multiple data platform teams at Pinterest. We thank:</p><ul><li>Product and Integration<br>- Laura Palmer for product leadership and testing<br>- Aaron Wang for product integration<br>- Adam Podraza for documentation and prompting</li><li>Platform and Evaluation<br>- Kingsley Ochu and Charlie Gu for LLM/Agent infrastructure support<br>- Chris Moradi for the measurement and evaluation framework<br>- Jin Hyuk Chang, Kevin Singleton, and Gerardo Gonzalez for supporting the Vector DB Service</li><li>Data Governance<br>- Ashish Singh, Felix Loesing, Aaron Wang, Yi Yin, Keith Regier, and Bohdan Demydov for support on data governance at Pinterest, which helped lay the groundwork for this work</li><li>Leadership<br>- Anirudh Koul for bridging teams and resources<br>- Aman Gairola, Bryant Xiao, and Jooseong Kim for their continued support for investment in this area</li></ul><hr><p><a href="https://medium.com/pinterest-engineering/unified-context-intent-embeddings-for-scalable-text-to-sql-793635e60aac">Unified Context-Intent Embeddings for Scalable Text-to-SQL</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unifying Ads Engagement Modeling Across Pinterest Surfaces]]></title>
            <link>https://medium.com/pinterest-engineering/unifying-ads-engagement-modeling-across-pinterest-surfaces-4b5cd3d99e67?source=rss----4c5a5f6279b6---4</link>
            <guid isPermaLink="false">https://medium.com/p/4b5cd3d99e67</guid>
            <category><![CDATA[monetization]]></category>
            <category><![CDATA[model-unification]]></category>
            <category><![CDATA[recommender-systems]]></category>
            <category><![CDATA[pinterest]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Pinterest Engineering]]></dc:creator>
            <pubDate>Tue, 03 Mar 2026 20:01:02 GMT</pubDate>
            <atom:updated>2026-03-03T20:01:01.891Z</atom:updated>
            <content:encoded><![CDATA[<p>Authors: Duna Zhan | Machine Learning Engineer II; Qifei Shen | Senior Staff Machine Learning Engineer; Matt Meng | Staff Machine Learning Engineer; Jiacheng Li | Machine Learning Engineer II; Hongda Shen | Staff Machine Learning Engineer</p><h3>Introduction</h3><p>Pinterest ads appear across multiple product surfaces, such as the Home Feed, Search, and Related Pins. Each surface has different user intent and different feature availability, but they all rely on the same core capability: predicting how likely a user is to engage with an ad.</p><p>Before this project, the ads engagement stack relied on three independent production models, one per surface. Although the models were initially derived from a similar design, they diverged over time in several core components, including user sequence modeling, feature crossing modules, feature representations, and training configurations. This fragmentation led to persistent operational and modeling inefficiencies:</p><ul><li>Low iteration velocity: Platform-wide improvements required duplicating work across multiple codepaths, and hyperparameters tuned for one surface often could not transfer to others.</li><li>Redundant training cost: Similar ideas had to be validated separately on each model, substantially increasing experimentation and training overhead.</li><li>High maintenance burden: Operating, debugging, and evolving three materially different systems was significantly more complex than maintaining a unified stack.</li></ul><p>These challenges motivated the development of a unified engagement framework to gradually consolidate surface-specific models while retaining the flexibility needed for each surface.</p><p>In this post, we present our approach to unifying two previously separate engagement models into a single architecture with surface-specific calibration and lightweight surface-specialized components. We also describe several efficiency optimizations, such as projection layers and request-level broadcasting, that reduce infrastructure costs. Overall, the unified model not only resolves the iteration, cost, and maintenance issues described above, but also strengthens representation learning by combining complementary features and modeling choices across surfaces, leading to significant online metric improvements.</p><h3>Methodology: modeling &amp; architecture evolution</h3><h4>Unification strategy and guiding principles</h4><p>We treated model unification as a major architectural change and followed three principles to avoid common failure modes:</p><ol><li>Start simple: Establish a pragmatic baseline by merging the strongest existing components across surfaces.</li><li>Iterate incrementally: Introduce surface-aware modeling (e.g., multi-task heads, surface-specific exports) only after the baseline demonstrates clear value.</li><li>Maintain operational safety: Design for safe rollout, monitoring, and fast rollback at every step.</li></ol><p>We also set explicit milestones based on serving constraints. Since the serving costs of Related Pins (RP), Home Feed (HF), and Search (SR) differ substantially, we first unified Home Feed and Search, which have similar CUDA throughput characteristics, and expanded to Related Pins only after the throughput and efficiency work stabilized.</p><h4>Baseline unified model</h4><p>As a first step, we built a baseline unified model by:</p><ul><li>Unioning features across the three surface models,</li><li>Merging existing modules into a single architecture, and</li><li>Combining training datasets across surfaces.</li></ul><p>This baseline delivered promising offline improvements, but it also materially increased training and serving cost. As a result, additional iterations were required before the model was production-ready.</p><h4>Architecture refinement for Home Feed and Search</h4><p>Because RP had a substantially higher cost profile, we focused next on unifying HF and SR. We incorporated key architectural elements from each surface, such as MMoE [1] and long user sequences [2]. When applied in isolation (e.g., MMoE on HF alone, or long-sequence Transformers on SR alone), these changes did not produce consistent gains, or the gain-to-cost trade-off was unfavorable. However, when we integrated these components into a single unified model and expanded training to leverage combined HF+SR features and multi-surface training data, we observed stronger improvements with a more reasonable cost profile.</p><p>The diagram below shows the final target architecture: a single unified model that serves three surfaces while still supporting the development of surface-specific modules (for example, surface-specific tower trees and late fusion with surface-specific modules within those tower trees). During serving, each surface-specific tower tree and its associated modules handle only that surface’s traffic, avoiding unnecessary compute cost from modules that don’t benefit other surfaces. As a first step, the unified model currently includes only the HF and SR tower trees.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QCeZtb0wPtAGMMCRQXNWvQ.png" /></figure><h4>Surface-specific calibration</h4><p>Since the unified model serves both HF and SR traffic, calibration is critical for CTR prediction. We found that a single global calibration layer can be suboptimal because it implicitly mixes traffic distributions across surfaces.</p><p>To address this, we introduced a view-type-specific calibration layer, which calibrates HF and SR traffic separately (a minimal sketch follows below). Online experiments showed this approach improved performance compared to the original shared calibration.</p>
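<p>To make the view-type-specific calibration concrete, here is a minimal PyTorch sketch. The Platt-scaling form, module name, and surface encoding are our illustrative assumptions, not the production implementation.</p><pre><code>import torch
import torch.nn as nn

class PerSurfaceCalibration(nn.Module):
    """Platt-style calibration with separate parameters per view type.

    Instead of one global layer that mixes HF and SR traffic
    distributions, each surface learns its own scale and bias.
    """

    def __init__(self, num_surfaces=2):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_surfaces))
        self.bias = nn.Parameter(torch.zeros(num_surfaces))

    def forward(self, logits, surface_id):
        # surface_id labels each example's view type (0 = HF, 1 = SR)
        # and selects that surface's calibration parameters.
        a = self.scale[surface_id]
        b = self.bias[surface_id]
        return torch.sigmoid(a * logits + b)

# Usage: calibrate raw CTR logits for a mixed HF/SR batch.
calib = PerSurfaceCalibration()
logits = torch.randn(4)
surface_id = torch.tensor([0, 0, 1, 1])  # first two HF, last two SR
ctr = calib(logits, surface_id)
</code></pre>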
<h4>Multi-task learning and surface-specific exports</h4><p>Using a single shared architecture for HF and SR CTR prediction limited flexibility and made it harder to iterate on surface-specific features and modules. To restore extensibility, we introduced a multi-task learning design within the unified model and enabled surface-specific checkpoint exports, so that each surface could adopt the most appropriate architecture while still benefiting from shared representation learning.</p><p>This enabled more flexible, surface-specific CTR prediction and established a foundation for continued surface-specific iteration.</p>
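<p>A minimal PyTorch sketch of this design is shown below; the module shapes, names, and export logic are illustrative assumptions rather than the production code.</p><pre><code>import torch
import torch.nn as nn

class UnifiedEngagementModel(nn.Module):
    """Shared backbone with lightweight per-surface towers (sketch).

    The shared trunk learns from combined HF+SR data; each surface owns
    a small tower (task head), and only that tower runs on its traffic.
    """

    def __init__(self, input_dim=256, hidden_dim=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One tower per surface: 0 = Home Feed, 1 = Search.
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(2)
        )

    def forward(self, features, surface):
        # Route each surface's traffic through its own tower, so
        # surface-specific modules add no compute cost to other surfaces.
        return self.towers[surface](self.shared(features))

model = UnifiedEngagementModel()
hf_logits = model(torch.randn(8, 256), surface=0)  # Home Feed traffic

# Surface-specific export: keep the shared trunk plus one surface's tower.
hf_state = {k: v for k, v in model.state_dict().items()
            if k.startswith("shared") or k.startswith("towers.0")}
torch.save(hf_state, "hf_checkpoint.pt")
</code></pre>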
<h4>Model and serving efficiency improvements</h4><p>Infrastructure cost is mainly driven by traffic and per-request compute, so unifying models does not automatically reduce infra spend. In our case, early unified versions actually increased latency because merging feature maps and modules made the model larger, so we paired the unification with targeted efficiency work.</p><p>We simplified the expensive compute paths by using DCNv2 to project the Transformer outputs into a smaller representation before the downstream crossing and tower tree layers, which reduced serving latency while preserving signal. We also enabled fused kernel embedding to reduce inference latency and TF32 to speed up training.</p><p>On the serving side, we reduced redundant embedding table lookups with request-level broadcasting. Instead of repeating heavy user embedding lookups for every candidate in a batch, we fetch embeddings once per unique user and then broadcast them back to the original request layout, keeping model inputs and outputs unchanged. The main trade-off is an upper bound on the number of unique users per batch; if it is exceeded, the request can fail, so we cap batches at a tested unique-user count to keep the system reliable.</p>
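<p>Request-level broadcasting reduces to a dedup-then-gather pattern, illustrated by the PyTorch sketch below; the embedding interface and the cap value are illustrative assumptions.</p><pre><code>import torch

def broadcast_user_embeddings(user_ids, embedding_table, max_unique_users=64):
    """Fetch each unique user's embedding once, then broadcast it back
    to the original per-candidate layout (inputs and outputs match a
    naive per-row lookup, so downstream model code is unchanged)."""
    # Deduplicate: unique ids, plus each row's index into the unique set.
    unique_ids, inverse = torch.unique(user_ids, return_inverse=True)
    # Guard the trade-off: too many unique users in one batch fails fast.
    assert unique_ids.numel() &lt;= max_unique_users, "unique-user cap exceeded"
    # One heavy lookup per unique user instead of per candidate...
    unique_emb = embedding_table(unique_ids)
    # ...then a cheap gather restores the original request layout.
    return unique_emb[inverse]

# Usage: a batch of 6 candidates belonging to only 2 unique users.
table = torch.nn.Embedding(1000, 16)
user_ids = torch.tensor([7, 7, 7, 42, 42, 42])
emb = broadcast_user_embeddings(user_ids, table)  # shape (6, 16)
</code></pre>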
<h3>Evaluation</h3><p>In offline experiments, we observed improvements across HF and SR, and we validated the gains in online experiments. As shown in the table below, the unified model delivered significant improvements on both online and offline metrics [3].</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-etu7FaKWpF2QAJRBbUm_A.png" /></figure><h3>Conclusion</h3><p>Unifying ads engagement modeling isn’t simply a matter of replacing three separate models with one. The real objective is to build a single, cohesive framework that can share learning wherever it reliably generalizes across surfaces, while still making room for surface-specific features and behavioral nuances when they genuinely matter. At the same time, the framework has to remain efficient enough to serve at scale. Ultimately, by consolidating the core approach, we eliminate duplicated work and put ourselves in a position to ship improvements faster and more consistently.</p><p>In the next milestone, we plan to bring the RP surface into the unified engagement model, creating a more consistent experience and further consolidating the stack. The primary challenge will be model efficiency, so we will integrate additional efficiency improvements to meet our performance targets and achieve this goal.</p><h3>Acknowledgements</h3><p>This work is the result of collaboration among the ads ranking team and multiple other teams at Pinterest.</p><p>Engineering Teams:</p><ul><li>Ads Ranking: Yulin Lei, Randy Carlson, Erika Sun (former), Zhixuan Shao, Kungang Li</li><li>Ads ML Infra: Sihan Wang, Yuying Chen, Anton Kustov, Xinyi Zhang</li><li>Leadership: Jamieson Kerns, Ling Leng (former), Jinfeng Zhuang (former), Dongtao Liu (former), Liangzhe Chen, Degao Peng, Zhifang Liu, Caijie Zhang, Shu Zhang (former), Haoyang Li (former), Xiaofang Chen (former), Yang Tang</li></ul><h3>References</h3><p>[1] Li, Jiacheng, et al. “<a href="https://medium.com/pinterest-engineering/multi-gate-mixture-of-experts-mmoe-model-architecture-and-knowledge-distillation-in-ads-08ec7f4aa857">Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development</a>”. Pinterest Engineering Blog.</p><p>[2] Lei, Yulin, et al. “<a href="https://medium.com/pinterest-engineering/user-action-sequence-modeling-for-pinterest-ads-engagement-modeling-21139cab8f4e">User Action Sequence Modeling for Pinterest Ads Engagement Modeling</a>”. Pinterest Engineering Blog.</p><p>[3] Pinterest Internal Data, US, 2025.</p><hr><p><a href="https://medium.com/pinterest-engineering/unifying-ads-engagement-modeling-across-pinterest-surfaces-4b5cd3d99e67">Unifying Ads Engagement Modeling Across Pinterest Surfaces</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>