Ensuring High Availability of Ads Realtime Streaming Services

Pinterest Engineering
Pinterest Engineering Blog
Sep 28, 2021

Sreshta Vijayaraghavan | Tech Lead, Ads Indexing Platform

The Pinterest Ad Business has grown multi-fold over the past couple of years, with respect to both advertisers and users. As we scale our revenue, it becomes imperative to:

  • Distribute advertiser spend smoothly over the course of the day
  • Avoid over-spending beyond the advertiser’s daily / lifetime budget
  • Maximize advertiser value

Background

To meet these goals, we maintain 3 real-time streaming services with low latency and high uptime requirements. Here’s an overview of how they work together:

Fig. 1. Simplified overview of the Ad systems interaction. The Ads Server retrieves ads and sends insertion / billable events to the Spend Aggregator, which sends the attributed Spend Events to the 3 real-time streaming services.
  • Ads Server: a set of services that serves the most relevant ads to the user and logs the associated insertion and billable events received, such as ad impressions and ad actions
  • Spend Aggregator: a Flink-based streaming service that receives a stream of insertion and billable events in real time, aggregates them, and emits a stream of spend events attributed to the respective advertisers
  • Three Kafka Streams-based real-time streaming services:
      1) Budget Enforcer: prevents over-spending beyond the advertiser’s daily / lifetime budget. Receives a real-time stream of spend events and stops advertiser spend based on their daily / lifetime budgets and current cumulative spend
      2) Pacer: ensures advertisers spend smoothly through the course of the day. Receives a real-time stream of spend events and paces ad spend throughout the day based on total budget and current cumulative spend
      3) Bidder: ensures we provide maximum return for the advertiser. Receives a real-time stream of events from the Pacer and dynamically adjusts the advertiser bid through the day based on performance, total budget, and current cumulative spend
  • Ads Retriever: retrieves, from the entire Ad Corpus, eligible ads with available budget (based on the caps enforced by the Budget Enforcer) that can be shown at the current time of day (based on the dynamic controller from the Pacer) and have a chance at winning the auction (based on the dynamic bid from the Bidder). For more details on how the Ad Corpus is built, see this post
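For concreteness, the attributed spend event that flows from the Spend Aggregator into these services can be pictured as a small record like the sketch below. The field names and types are illustrative assumptions, not the actual production schema.

```java
// Illustrative sketch only; the real event schema is not described in this post.
// An attributed spend event emitted by the Spend Aggregator.
public record SpendEvent(
        long advertiserId,     // advertiser the spend is attributed to
        long adId,             // ad that generated the billable event
        long spendMicros,      // aggregated spend amount, in micro-currency units
        long eventTimeMillis   // time the Spend Aggregator generated the event
) {}
```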

Challenge

Since the Budget Enforcer, Pacer and Bidder play a crucial role in ensuring we retrieve and serve the best ads to users, we need to maintain these services with low latency and high uptime. Failure to do so can lead to two opportunity loss scenarios for Pinterest:

  1. Over Delivery: not capping advertiser spend once the budget is hit leads to overspending, which is an opportunity loss because we a) cannot charge the advertiser beyond their budget and b) could have inserted other valid ads in those slots
  2. Under Delivery: not pacing or bidding at the most optimal values through the day can leave an ad’s daily budget unspent. We under-spend and lose the opportunity to have shown more of that specific ad to users

In practice, it is very difficult to maintain these services at 100% uptime given the high volume of events and nature of distributed system failures.

Solution

To tackle this, we built a high availability (HA) mode by running two identical pipelines of each service hot-hot, with an automatic failure detection and switch-over mechanism.

Fig. 2. Two identical pipelines (Primary and Standby) run in parallel for each real-time streaming service. Each receives events from a dedicated Spend Aggregator and emits Health Metrics to a Health Monitor service, which uploads Health Stats to a distributed config file used by the Ads Server.

As depicted, we run a Primary and a Standby pipeline of each service in parallel. These are hot-hot and fully isolated from each other, running on two separate Kafka clusters, to provide high availability in the true sense. The deploy cycles between the two pipelines are staggered to ensure that uncaught buggy code is not released to both concurrently. Hence, if one pipeline is backed up on events or fails, the other continues to run unaffected. A Health Monitor (another lightweight Kafka Streams-based application) consumes “health metrics” emitted continuously by each real-time service and aggregates them to periodically (every 2–5 minutes) provide a notification of system health.
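As a rough illustration of the health-metric side of this design, each pipeline could periodically publish a small health record keyed by service and pipeline for the Health Monitor to aggregate. The topic name, key format, and JSON payload below are assumptions for the sketch, not the production format.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public final class HealthMetricEmitter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key identifies service + pipeline; value carries the two health signals
            // the Health Monitor aggregates (input QPS and event delay).
            String key = "budget-enforcer.primary";
            String value = "{\"inputQps\": 1250.0, \"eventDelayMs\": 840}";
            producer.send(new ProducerRecord<>("health-metrics", key, value));
        }
    }
}
```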

Determining “Healthy” Status

We use two metrics from each service to measure the health:

  1. Input QPS: Denotes the number of Spend Events received per second from the Spend Aggregator by the service
  2. Event Delay: Denotes the latency between the time the Spend Event was generated by the Spend Aggregator versus the time it was received by the consuming service

The Health Monitor aggregates these two metrics over windowed time ranges and emits a health metric per pipeline (Primary / Standby) per service (Budget Enforcer / Pacer / Bidder) to a distributed config file (backed by ZooKeeper). A lower-than-normal input QPS or a higher-than-normal event delay indicates a fault in ingestion by the service or in emission by the Spend Aggregator, and hence deems the pipeline unhealthy.

| Input QPS         | Event Delay        | Healthy |
|-------------------|--------------------|---------|
| Normal            | Normal             | True    |
| Lower than Normal | Normal             | False   |
| Normal            | Higher than Normal | False   |
| Lower than Normal | Higher than Normal | False   |
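In code, the decision in the table above reduces to a simple conjunction once the windowed aggregates are available. This is a minimal sketch; the threshold constants are placeholders, and in practice the baselines would come from the observed normal ranges for each service.

```java
public final class PipelineHealthCheck {

    // Placeholder baselines; real values would reflect each service's normal range.
    private static final double MIN_EXPECTED_QPS = 1_000.0;
    private static final long MAX_EXPECTED_DELAY_MS = 30_000L;

    /** Healthy only when input QPS is not lower than normal AND event delay is not higher than normal. */
    public static boolean isHealthy(double windowedInputQps, long windowedEventDelayMs) {
        boolean qpsNormal = windowedInputQps >= MIN_EXPECTED_QPS;
        boolean delayNormal = windowedEventDelayMs <= MAX_EXPECTED_DELAY_MS;
        return qpsNormal && delayNormal;
    }
}
```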

Automatic Pipeline Switchover

The Ads Retriever consumes the output of each service from both pipelines in parallel irrespective of their health. We now need a mechanism to let the Ads Retriever know which pipeline’s values should be applied to the retrieved Ad candidates in order to trim them based on budget, pacing and bid.

We could do this with a manual approach: whenever a pipeline is deemed unhealthy, it alerts the on-call engineer, who then performs a manual switch on the Ads Retriever. However, this adds operational burden, especially if such a scenario happens outside business hours. Hence we decided to go one step further and automate the switchover.

We maintain a lightweight background thread on the Ads Server that continuously reads the latest health metrics emitted by the Health Monitor from the distributed config file. Prior to sending a retrieval request to the Ads Retriever, the Ads Server performs a lightweight check against the following matrix to determine which pipeline’s values should be used and appends this metadata to the retrieval request. The Ads Retriever then applies the respective values from the corresponding ingested pipeline.

| Primary   | Standby   | Use     |
|-----------|-----------|---------|
| Healthy   | Healthy   | Primary |
| Unhealthy | Healthy   | Standby |
| Healthy   | Unhealthy | Primary |
| Unhealthy | Unhealthy | Primary |
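A compact sketch of the Ads Server side is below: a background task keeps an in-memory snapshot of the latest health stats, and a small function implements the matrix above. How the distributed config file is read (the `Supplier` here), the refresh interval, and the key format are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class PipelineSelector {

    public enum Pipeline { PRIMARY, STANDBY }

    private final Supplier<Map<String, Boolean>> healthConfigReader;   // parses the ZooKeeper-backed config file
    private volatile Map<String, Boolean> healthByPipeline = Map.of(); // e.g. "budget-enforcer.primary" -> true

    public PipelineSelector(Supplier<Map<String, Boolean>> healthConfigReader) {
        this.healthConfigReader = healthConfigReader;
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Refresh well within the Health Monitor's 2-5 minute publish cadence.
        scheduler.scheduleAtFixedRate(this::refresh, 0, 30, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            healthByPipeline = Map.copyOf(healthConfigReader.get());
        } catch (RuntimeException e) {
            // On a failed refresh, keep serving the last known snapshot.
        }
    }

    /** Implements the matrix above: switch only when Primary is unhealthy and Standby is healthy. */
    public Pipeline choosePipeline(String service) {
        boolean primaryHealthy = healthByPipeline.getOrDefault(service + ".primary", false);
        boolean standbyHealthy = healthByPipeline.getOrDefault(service + ".standby", false);
        return (!primaryHealthy && standbyHealthy) ? Pipeline.STANDBY : Pipeline.PRIMARY;
    }
}
```

Note the bias toward Primary: it is chosen whenever it is healthy, and also when both pipelines are unhealthy, matching the last row of the matrix.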

Avoiding Flip-Flop

In the first iteration of the launch, we noticed that, due to the nature of Kafka Streams-based applications and how they recover from failure, the system occasionally flip-flopped between the Primary and Standby pipelines within a very short interval of time. It roughly followed this sequence:

  1. T0: Primary detected to be unhealthy -> Switch to Standby
  2. T0 + 2min: Primary detected to be healthy -> Switch to Primary
  3. T0 + 4min: Primary detected to be unhealthy -> Switch to Standby
  4. T0 + 6min: Primary detected to be healthy -> Switch to Primary

This constant churn of switching between the two pipelines led to operational and on-call burden. To overcome this, we implemented a “grace period” mechanism in the Ads Server’s automatic switchover library. The grace period ensures that a pipeline has been healthy for a minimum sustained period of time before the switch to it occurs. In the example above, unless the Primary is detected to be healthy for a sustained period (say 10 minutes), the switch back does not occur, avoiding the flip-flop scenario described.
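One way to express the grace period, reusing the hypothetical `Pipeline` enum from the selector sketch above: the switch back to Primary is allowed only after Primary has reported healthy continuously for the full grace period, while failover away from an unhealthy Primary remains immediate. The class and field names are illustrative.

```java
import java.time.Duration;
import java.time.Instant;

public final class GracePeriodSwitchover {

    private final Duration gracePeriod;                         // e.g. Duration.ofMinutes(10)
    private PipelineSelector.Pipeline active = PipelineSelector.Pipeline.PRIMARY;
    private Instant primaryHealthySince = null;                 // start of the current healthy streak

    public GracePeriodSwitchover(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    /** Called on every periodic health update; returns the pipeline whose values should be used. */
    public PipelineSelector.Pipeline onHealthUpdate(boolean primaryHealthy, boolean standbyHealthy, Instant now) {
        if (primaryHealthy) {
            if (primaryHealthySince == null) {
                primaryHealthySince = now;                      // healthy streak starts
            }
            boolean graceElapsed =
                    Duration.between(primaryHealthySince, now).compareTo(gracePeriod) >= 0;
            if (active == PipelineSelector.Pipeline.STANDBY && graceElapsed) {
                active = PipelineSelector.Pipeline.PRIMARY;     // switch back only after a sustained healthy streak
            }
        } else {
            primaryHealthySince = null;                         // streak broken
            // Failover away from an unhealthy Primary is immediate; if both are
            // unhealthy, fall back to Primary per the matrix above.
            active = standbyHealthy ? PipelineSelector.Pipeline.STANDBY : PipelineSelector.Pipeline.PRIMARY;
        }
        return active;
    }
}
```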

Kafka State Store Reuse Bug

While setting up the Health Monitor service, we discovered a tricky Kafka State Store reuse bug.

Symptom: We observed large spikes in event delay within the Health Monitor whenever it received the latest health metric from upstream, i.e. the time at which the health metric was received minus the time at which it was published was very high.

Potential Causes:

  1. Upstream Service Delay: close monitoring and several metrics showed there was no delay within the upstream service in emitting the health metric, so this was ruled out.
  2. Kafka Propagation Delay: checking the Kafka topics, brokers, and coordinators showed no backed-up events that could cause pile-up and delay, so this was ruled out as well.

Root Cause: After several in-depth rounds of debugging, we realized this was due to a vicious cycle between the events received and the events processed internally by the Health Monitor.

Fig. 3. Prior to Fix: Health Metric input is repartitioned and stored in the Kafka State Store. This triggers processing and the processed value is stored back in the same State Store.

As the first step, the multi-partition health metric input stream is repartitioned into a single-partition event stream for easier internal processing with low resource consumption and infra cost. This repartitioned stream of events is persisted in an internal State Store, which then triggers a process cycle. The process cycle does the necessary computations and stores the computed values back into the same State Store, which re-triggers processing of the same event, creating the cycle.

Due to the way Kafka Streams applications are configured and set up, and the lack of easy visibility into the internals of Kafka State Stores, this reuse path was not apparent. To solve this, we explicitly separated the State Stores used to hold the repartitioned events and the processed events.

Fig. 4. After Fix: Health Metric input is repartitioned and stored in a Kafka State Store. This triggers processing and the processed value is stored in another State Store.
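A minimal Processor API sketch of the shape of the fix follows: the repartitioned metrics and the computed results are written to two explicitly named state stores, each owned by a different processor, so a store write can never feed back into the step that produced it. Topic names, store names, serdes, and the trivial processor bodies are illustrative assumptions rather than the production topology.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public final class HealthMonitorTopology {

    // Stores the raw repartitioned metric in its own store, then hands it downstream.
    static final class IngestProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> repartitionedStore;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.repartitionedStore = context.getStateStore("repartitioned-metrics-store");
        }

        @Override
        public void process(Record<String, String> record) {
            repartitionedStore.put(record.key(), record.value());  // raw input lives here only
            context.forward(record);                               // explicit hand-off, not a store read-back
        }
    }

    // Computes the aggregate and stores it in a *separate* store before emitting it.
    static final class AggregateProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> aggregatedStore;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.aggregatedStore = context.getStateStore("aggregated-health-store");
        }

        @Override
        public void process(Record<String, String> record) {
            String aggregated = record.value();                    // placeholder for the real computation
            aggregatedStore.put(record.key(), aggregated);         // results never touch the input store
            context.forward(record.withValue(aggregated));
        }
    }

    public static Topology build() {
        Topology topology = new Topology();
        // Assumes default String serdes are set in the application config.
        topology.addSource("metric-source", "health-metrics-repartitioned");
        topology.addProcessor("ingest", IngestProcessor::new, "metric-source");
        topology.addProcessor("aggregate", AggregateProcessor::new, "ingest");
        topology.addStateStore(
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("repartitioned-metrics-store"),
                        Serdes.String(), Serdes.String()),
                "ingest");
        topology.addStateStore(
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("aggregated-health-store"),
                        Serdes.String(), Serdes.String()),
                "aggregate");
        topology.addSink("health-sink", "pipeline-health", "aggregate");
        return topology;
    }
}
```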

Impact

The Budget Enforcer was the first use case onboarded to this automatic failure detection and switchover mechanism. It has proven highly effective in ensuring reliability with low maintenance and operational overhead. The high availability framework is especially crucial during revenue-heavy holiday periods, when the volume of incoming events and ad spend is typically higher, to ensure the best experience for both advertisers and users. The Pacer and Bidder will follow suit, and the framework is extensible to support more use cases as needed in the future.

Acknowledgements

A huge thank you to the entire team for all the feedback during ideation, design, implementation and testing to deliver a long lasting and robust solution. Special thanks to Shawn Nguyen, Di An, Mingsi Liu, Chengcheng Hu, Shu Zhang, Aniket Ketkar, Collins Chung, Timothy Koh for their continued support in seeing this to completion.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
