ML Label Definitions in Search and Ads

Andrew Yates
@ Promoted
Sep 16, 2023

To get the maximum performance from scaled recommendation and ad ML systems, model complexity is not the only consideration. What you optimize, or “label definitions,” is also critical to achieving that last +5% in profits at scale. If you are spending engineering months to reduce model loss by 0.1%, carefully defining what you have optimized and why can be a comparably better investment. Labeling improvements cost no additional inference infrastructure to deploy, and their gains compound with model complexity improvements like larger models and more features. For ads, label definitions also determine ad pricing. Large advertisers will demand (and may audit) definitions of billable ad metrics, and advanced ad delivery systems like ROAS optimization are more tractable to design when ad prices are well-defined.

Good ML Labeling has the following characteristics:

  • Semantics: predictions mean something
  • Causality: a decision made by an inference “causes” what is predicted
  • State: limit inference to what’s uncertain

Semantics: Why Semantic Inferences

Promoted’s model predictions, or “inferences,” are semantic, or have a human-intelligible meaning that is useful analytically. This means inferences should have:

  • units of measurement
  • statistical conditional formulas p() and E()
  • calibration with real measurements, i.e., the sum of observed clicks should equal the sum of predicted clicks.
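
For example, one way to check this calibration property is to compare summed predictions against summed observed outcomes on held-out traffic, overall and by segment. A minimal sketch (the function and variable names are illustrative, not Promoted’s tooling):

```python
import numpy as np

def calibration_ratio(predicted: np.ndarray, observed: np.ndarray) -> float:
    """Ratio of summed predicted probabilities to summed observed labels.

    A well-calibrated p(click) model should give a ratio close to 1.0:
    the expected number of clicks equals the number of clicks observed.
    """
    return predicted.sum() / max(observed.sum(), 1.0)

# Example: 3 impressions, one observed click.
p_click = np.array([0.10, 0.45, 0.30])
clicks = np.array([0, 1, 0])
print(calibration_ratio(p_click, clicks))  # ~0.85 -> model slightly under-predicts
```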

Without inference semantics, there is no systematic way for engineers to improve, combine, or monitor these inference systems other than with continuous trial-and-error tuning in live production with user response metrics. We call these “biology experiments,” because the management of this process resembles running cell culture experiments where the true mechanisms cannot be manipulated or known exactly. The results are highly variable and take a long time. This is an inefficient state for an engineering organization compared to solving an equation.

The “semantic inference” approach contrasts with most recommendation systems, where inference definitions are loosely understood and do not have units. Sometimes, this is by technical necessity. For example, “embedding vector similarity” has no “meaning” other than the relative meaning that higher scores are “better” than lower scores in the context of a single search request. Another example is a ranking score produced by a learn-to-rank (LTR) model. While these techniques can produce an ordering of listings by “goodness” that improves business metrics like sales or clicks in search, as verified by production A/B testing, there is no observable user response metric to validate the correctness of each inference directly, and there is no clear analytical method to combine these inferences into higher-level models.

Prediction semantics become critical when building ad systems because the inference value, not just its relative value, is used to compute ad prices. The price will not be intelligible if the inference does not align with the billable or optimized event. This applies to any term included in an ad bid or “utility function,” including “quality scores.”
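
To illustrate why the absolute value matters, consider a textbook second-price, cost-per-click rule of the kind many ad auctions use (a generic example, not Promoted’s pricing):

```python
def gsp_click_price(next_bid: float, next_p_click: float, winner_p_click: float) -> float:
    """Generalized second-price style cost-per-click.

    The winner pays just enough to beat the runner-up's expected value:
        price = (next_bid * next_p_click) / winner_p_click
    If p(click) has no units or is miscalibrated, the resulting price is not
    intelligible to the advertiser being billed.
    """
    return (next_bid * next_p_click) / winner_p_click

# Runner-up bids $2.00 with p(click)=0.02; the winner's p(click)=0.05.
print(gsp_click_price(2.00, 0.02, 0.05))  # $0.80 per click
```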

Even for non-ad systems, it’s easier to reason about well-defined events. Example benefits include:

  • Well-understood methods of using a more common event (like checkout) as a proxy for a more rare but correlated event (like purchase) in both utility functions (proxy objectives) and multi-task modeling labels (e.g., augment sparse purchase labels with similar but more common checkout labels)
  • Sensible calibration models that continuously align event predictions with observed events, which helps smooth sudden changes in optimization system performance from either user behavior or system failure
  • Arithmetic composition of inferences to achieve other higher-level models (see the sketch after this list)
  • More sensibly designed data-driven attribution models
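
As a sketch of the composition point above, calibrated and well-defined events can be chained directly into higher-level predictions rather than retrained from scratch. This assumes purchases are attributed post-click, as defined later in this post:

```python
def p_purchase_given_impression(p_click: float, p_purchase_given_click: float) -> float:
    """Chain rule on semantically defined events:
        p(purchase | impression) = p(click | impression) * p(purchase | click)
    This only makes sense if both factors are calibrated probabilities with
    consistent conditioning (same insertion, same join windows).
    """
    return p_click * p_purchase_given_click

print(p_purchase_given_impression(0.05, 0.02))  # 0.001
```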

Prioritizing Inference Semantics: In small inference systems, inference semantics are less important: at small volumes of user responses, small improvements in quality are not statistically measurable, and the improvement from no quality control to any reasonable quality control system is large. Therefore, inference semantics may be neglected in original search and recommendation system designs.

At large scales, and especially at large e-commerce and tech companies, inference semantics become relatively important for realizing user metric improvements in sales and clicks, compared to the increasingly marginal quality improvements from more complex machine learning models. Furthermore, inference semantics are an orthogonal improvement to model improvements, facilitate the use of more complex models through model stacking, make debugging and analyzing the operation of inference systems more manageable, and, unlike increasing model complexity, do not incur recurring infrastructure costs in computation or latency. The trade-off is that it can be organizationally infeasible to change inference semantics across many systems in a large organization without strong design leadership, synchronization across multiple teams, and potential confusion in measuring impact when previous KPIs are questioned.

Promoted’s Semantic Inference for “Click” or “Navigate”

Here is our formula. We have similar formulas for all events that we predict.

p(click) = p(any(navigate|joinwindow)|insertion with IAB impression @ relativePosition=0)

A Navigate (aka click) is when a user actively and consciously selects a response insertion with a tap or mouse click to “Navigate” to another view, like a product details page. The event is not the Page View of the destination page navigated to, although that event may be used to construct or infer a navigate event if necessary. “Navigate” comes from iOS nomenclature.

Definitions:

  • P: a calibrated probability between 0% and 100%. Predictions over 100% indicate an error.
  • Any: a deduplicated navigate, or, there is “any” navigate attributed to this insertion. We deduplicate to the insertion. Per 2009 IAB Guidelines, we use the “One-Click-Per-Impression Method.”
  • Joinwindow: A click has a limited amount of time to occur before it’s no longer counted as a “valid click” for optimization purposes. For example, one hour. For example, you may load a screen, go to sleep, and then click a search result. We discard that click for optimization purposes. Implementing an unbounded join window is technically infeasible (there will be some limit in practice).
  • Insertion: all training examples (and model inferences) must be mapped to a “response insertion,” or a decision from the backend sent to the user for display. It may not necessarily be displayed (that is an “impression”)
  • IAB impression: all training examples must be an IAB impression. In the absence of an attributed navigate, this requires sufficient visibility logging. If a navigate is attributed to the response insertion by insertion ID, this is an IAB impression (because the user must have seen the insertion to use it for navigation). This is not true for inferred attribution of navigates.
  • relativePosition=0: this is the first position on the page, or the “offset” for paged examples. At inference time, all insertions already share the same paging offset. It’s only the relative position from the offset that cannot be known at inference time for independent point-estimate L2 ranking. Setting a pageOffset while setting the position to 0 at inference time is not a good approximation because this combination would never be seen in training data. If multiple position features are used (e.g., relative and absolute), set them as if allocating in the first position of the current page (i.e., 0 and pageOffset, respectively).
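
Putting the formula and definitions together, here is a minimal sketch of how a click training label could be assembled per response insertion (field and function names are illustrative, not Promoted’s actual schema or pipeline):

```python
from dataclasses import dataclass
from typing import List, Optional

JOIN_WINDOW_SECS = 60 * 60  # e.g., one hour

@dataclass
class InsertionExample:
    insertion_id: str
    request_time: float            # when the insertion was served, epoch seconds
    has_iab_impression: bool       # sufficient visibility logging observed
    navigate_times: List[float]    # navigates attributed to this insertion_id

def click_label(ex: InsertionExample) -> Optional[int]:
    """Training label for p(click) per the formula above.

    Returns 1 if any attributed navigate occurs within the join window
    (deduplicated: at most one click per impression), 0 for an IAB impression
    with no such navigate, and None to exclude the example from training
    (no IAB impression and no attributed navigate).
    """
    any_navigate = any(
        0 <= t - ex.request_time <= JOIN_WINDOW_SECS for t in ex.navigate_times
    )
    if any_navigate:
        return 1  # an attributed navigate implies an IAB impression
    if not ex.has_iab_impression:
        return None
    return 0

example = InsertionExample("ins-1", request_time=0.0,
                           has_iab_impression=True, navigate_times=[120.0])
print(click_label(example))  # 1
```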

Promoted’s Semantic Inference for “Purchase”

A Purchase is our default “objective.” If there is only one “conversion” event, we ask customers to map it to “purchase.”

p(conv) = p(any(conv) * f(credit) | insertion with IAB impression with any(click) @ relativePosition=0)

Definitions:

Definitions are similar to “Navigate” above. However, unlike Navigate, a single Purchase may be joined to many response insertions according to a multi-touch attribution model. There may be multiple attribution models that assign different credit values for the same purchase-insertion join in the future.

  • f(credit): the value assigned by the multi-touch attribution model. If f(credit) = 0, then the purchase is typically not joined to a response insertion at all (though it may still be joined with zero credit).
  • any(conv): Was there any distinct purchase decision for this item for this user? A “distinct” purchase decision is a heuristic defined as 24h deduplication windows for all purchases of the same contentID by the same authenticated userID. For optimization purposes, purchases outside this join window for the same item by the same user with new order IDs are “new” purchase decisions with their own attribution logic. Multiple purchases and post-purchase value are modeled separately and are not described here.
  • Post-click “With any click”: We decompose attributed purchase credit to “clicks” and “non-clicks.” Most or all of the attribution credit is assigned to “click.” When we assign credit to non-clicks, the total purchase attribution summed over all attributed clicks may be less than 100%.
  • View-Through: The fraction of conversion credit attributed to non-clicked insertions.
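
The corresponding purchase label is a fractional credit rather than a binary outcome. A minimal sketch under the same caveats (illustrative names; the real attribution joins are more involved):

```python
from typing import List, Optional

def purchase_label(
    has_iab_impression: bool,
    attribution_credits: List[float],  # f(credit) per distinct purchase decision
) -> Optional[float]:
    """Training label for p(conv): the sum of any(conv) * f(credit) over the
    distinct purchase decisions attributed to this response insertion.

    Purchases are first deduplicated into distinct purchase decisions
    (e.g., 24h windows per contentID per authenticated userID); each
    decision's multi-touch credit for this insertion is then summed.
    """
    if not has_iab_impression:
        return None  # exclude from training, as with clicks
    return sum(attribution_credits)

# One attributed purchase with 60% of its credit assigned to this insertion.
print(purchase_label(True, [0.6]))  # 0.6
```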

Causality: Solving for Incrementality

All predictions are trying to capture “causality.” We want to predict the probability that showing an insertion will cause a response. It’s possible to predict the probability of an event without a causal relationship (e.g., a sale that would happen anyway). Such “correlational” inferences are unlikely to cause sales to increase when used to decide what content to deliver to users in live production.

We use concepts from ads measurement to model “incrementality,” which we apply to our ML labeling systems for inference. This makes sense: you predict what you measure, and to predict incremental value, you need a model of incrementality.
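
At its simplest, an incrementality model targets the difference between what happens when the insertion is shown and what would have happened anyway, rather than the raw conversion probability (a sketch, not Promoted’s estimator):

```python
def incremental_conversion(p_conv_if_shown: float, p_conv_if_not_shown: float) -> float:
    """Incrementality: the change in conversion probability *caused* by showing
    the insertion, not the raw conversion probability."""
    return p_conv_if_shown - p_conv_if_not_shown

# A user who would likely have purchased anyway contributes ~0 incremental value.
print(incremental_conversion(0.30, 0.29))  # ~0.01
```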

Sometimes, in ad systems, causality modeling is disregarded to boost “credit assignment” in return-on-ad-spend (ROAS) calculations. This merely shifts credit toward ads from the same pool of sales in the e-commerce app or marketplace, rather than creating new sales as true causality modeling would. We measure this by measuring all sales across both ads and organic systems. We strongly prefer causality modeling across native ads and organic insertions when possible.

Causality is primarily a consideration for “conversions,” like purchases. We use a combination of last-click, multi-touch, and data-driven attribution models across insertions, sometimes with weights based on the time elapsed between the last user action on an insertion and the conversion.
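
As one illustration, a time-decayed multi-touch model might distribute credit across clicked insertions like this (the half-life and normalization are assumptions for the example, not Promoted’s production model):

```python
import math
from typing import List

def time_decay_credits(
    click_times: List[float], conversion_time: float, half_life_secs: float = 24 * 3600
) -> List[float]:
    """One common multi-touch scheme: decay each click's credit exponentially
    in the time between the click and the conversion, then normalize so the
    credits sum to 1. (Illustrative; real models may be last-click or data-driven.)"""
    weights = [0.5 ** ((conversion_time - t) / half_life_secs) for t in click_times]
    total = sum(weights)
    return [w / total for w in weights]

# Two clicks: 48 hours and 1 hour before the conversion.
print(time_decay_credits([0.0, 47 * 3600], conversion_time=48 * 3600))
# -> roughly [0.20, 0.80]: the most recent click receives most of the credit
```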

State: Limiting Inference to the Uncertain

Stateful Actions: For some “stateful” events like “save” or “checkout,” once the user has performed the action on an item, additional actions by the same user on the same item are either ambiguous (multiple checkouts on the same item are not necessarily better or a higher order value) or undefined (you cannot “save” an already saved item). In a real-time inference system like Promoted’s, this can be a problem when using stateful events in utility functions (i.e., ranking or quality scores). Once the state is achieved, it immediately becomes a user feature, but in the next inference, the event is undefined and the inference may converge to 0. For example, if “goodness” is defined to include “probability to like,” then “goodness” will decrease when a user likes an item! The correction is to use the known state (e.g., “user_liked”) as a conditional with its own weight in utility functions.
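
A minimal sketch of that correction, with purely illustrative weights:

```python
def like_utility_term(p_like: float, user_liked: bool,
                      w_like: float = 1.0, w_liked: float = 1.2) -> float:
    """Utility term for "like" conditioned on the known state.

    Without the conditional, p(like) collapses toward 0 once the user has
    liked the item, and the item's "goodness" would drop after the like.
    """
    if user_liked:
        return w_liked          # known state: use its own weight
    return w_like * p_like      # uncertain state: weight the prediction

print(like_utility_term(0.4, user_liked=False))  # 0.4
print(like_utility_term(0.4, user_liked=True))   # 1.2
```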

Known State: Promoted estimates post-purchase value, like sales revenue, the probability of continuing subscriptions or consummating a delayed transaction like a vacation rental, and cancelations. However, not all of the “future value state” is uncertain. For example, the purchase price of an item is known. If we fit a model to “post-purchase” value across a mixture of purchases for different items, then the model will likely overfit primarily on features we already know at inference time, like “price.” To correct this, we decompose post-purchase value predictions into only the “uncertain” part of the value, like the user’s probability of completing a delayed transaction or canceling. We then combine the known state (like price) with the inferred state (like the probability of completing the transaction) analytically for optimization and bidding. For example: p(completePurchase)*price.
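
A minimal sketch of this decomposition:

```python
def expected_purchase_value(p_complete_purchase: float, price: float) -> float:
    """Combine the known state (price) with the inferred, uncertain part
    (the probability the delayed transaction completes):
        E[value] = p(completePurchase) * price
    """
    return p_complete_purchase * price

# A $250 vacation rental booking with a 90% chance of being completed.
print(expected_purchase_value(0.9, 250.0))  # 225.0
```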
