System Design — Ad Click Aggregator

Tahir Rauf
7 min read · May 1, 2024


An Ad Click Aggregator is a system that tracks user clicks on ads. The tracked data is aggregated so that advertisers can analyze the performance of their ads and optimize their campaigns. For our purposes, we will assume these are ads displayed on a website or app, like Facebook.

Requirements

  1. Users are redirected to the ad's target URL when they click an ad.
  2. Advertisers can query ad click metrics over time at 1-minute intervals.
  3. The system should handle a high click volume (on the order of 10K clicks per second) without losing click data.

High Level Design

NOTE

Please refer to the Appendix to learn about the details of the software services and technologies we used in the design.

1. Users will be redirected to the target on click

When a user clicks on an ad placed by the Ad Placement Service, the browser sends a request to our /click endpoint. Our server can then track the click and respond with a redirect to the advertiser’s website via a 302 (redirect) status code.
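
Here is a minimal sketch of what the /click handler could look like, assuming a Python/Flask service; the ad lookup table and the publish_to_stream helper are illustrative placeholders, not part of the original design.

```python
from flask import Flask, redirect, request

app = Flask(__name__)

# Illustrative lookup; in the real system the Ad Placement Service associates
# each ad with its redirect URL.
AD_TARGET_URLS = {"ad-123": "https://advertiser.example.com/landing"}


def publish_to_stream(event: dict) -> None:
    """Placeholder: enqueue the click event (e.g. to Kafka or Kinesis)."""
    pass


@app.route("/click")
def click():
    ad_id = request.args.get("ad_id", "")
    impression_id = request.args.get("impression_id", "")

    # Track the click first, then immediately send the user on their way.
    publish_to_stream({"ad_id": ad_id, "impression_id": impression_id})

    target = AD_TARGET_URLS.get(ad_id, "https://example.com")
    return redirect(target, code=302)
```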

2. Advertisers can query ad click metrics over time at 1 minute intervals

Advertisers need to query metrics about their ads to see how they are performing.

Separate Analytics database with Batch processing

One way is to store the raw events in an event database and then, in batches, process and aggregate them into a separate database that is optimized for querying.
We can use Cassandra for the raw events, since it is optimized for a high volume of writes. Cassandra is not optimized for aggregations or range queries, which is why we need batch processing to pre-aggregate. We’ll use Spark for the batch processing that produces the aggregations. For range queries, we need something optimized for reads; OLAP databases like Redshift, Snowflake, or BigQuery are good choices. A sketch of such a batch job follows.

Batch Processing
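
As a rough sketch, the Spark batch job could look like the following, assuming the raw click events have been exported to Parquet on S3 with ad_id and click_ts columns (illustrative names; reading straight from Cassandra would use the Spark-Cassandra connector instead).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ad-click-batch-aggregation").getOrCreate()

# Assumed layout: one row per raw click, with an ad_id and a click timestamp.
clicks = spark.read.parquet("s3://ad-clicks/raw/")

# Pre-aggregate clicks into 1-minute buckets per ad.
per_minute = (
    clicks
    .withColumn("minute", F.date_trunc("minute", F.col("click_ts")))
    .groupBy("ad_id", "minute")
    .agg(F.count("*").alias("clicks"))
)

# Write the pre-aggregated results out for loading into the OLAP database
# (the destination and format are illustrative).
per_minute.write.mode("overwrite").parquet("s3://ad-clicks/aggregated/")
```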

The issue with this approach is latency. Batch processing introduces significant delays, so advertisers are always querying stale data.

Real-Time Analytics with Stream Processing

To address the latency issue of batch processing, we can introduce a stream for real-time processing. This allows us to process events as they come in, rather than waiting for a batch job to run.

  1. When a click comes in, the click processing service immediately writes the event to a stream like Kafka or Kinesis (AWS’s stream ingestion service).
  2. A stream processor like Flink or Spark Streaming reads the events from the stream and aggregates them in real time (see the sketch after this list).
    This works by keeping a running count of click totals in memory and updating them as new events come in. When we reach the end of a time window, we flush the aggregated data to our OLAP database.
  3. Advertisers can query the OLAP DB to get metrics on their ads in near real-time.
Real Time analytics with Stream processing
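
To make the running-count idea concrete, here is a toy Python sketch of a tumbling one-minute window, assuming a kafka-python consumer and JSON click events with an ad_id field; a real Flink job would use keyed, event-time windows with watermarks rather than this processing-time loop, and flush_to_olap is a placeholder.

```python
import json
import time
from collections import defaultdict

from kafka import KafkaConsumer  # assumed: kafka-python client


def flush_to_olap(window_start: float, counts: dict) -> None:
    """Placeholder: write the per-ad counts for this window to the OLAP DB."""
    print(int(window_start), counts)


WINDOW_SECONDS = 60
consumer = KafkaConsumer("ad-clicks", bootstrap_servers="localhost:9092")

window_start = time.time()
counts = defaultdict(int)  # ad_id -> running click count in the current window

for message in consumer:
    event = json.loads(message.value)
    counts[event["ad_id"]] += 1

    # At the end of each window, flush the aggregates and start a new window.
    if time.time() - window_start >= WINDOW_SECONDS:
        flush_to_olap(window_start, dict(counts))
        counts.clear()
        window_start = time.time()
```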

Challenge:
If we aggregate on minute boundaries and also run our Spark jobs every minute, the latency is about the same. However, the difference is that now we have levers to play with. It’s much more feasible for us to decrease the Flink aggregation window than it is to decrease the frequency of the Spark jobs given the overhead of running Spark jobs. This means we could aggregate clicks every couple of seconds with this streaming approach to decrease latency, while it would be challenging to run Spark jobs every couple of seconds.

Deep Dives

How to scale to 10K clicks per second

  1. Click processor: Can be easily scaled horizontally by adding more instances.
  2. Kinesis event stream: Both Kafka and Kinesis are distributed and can handle a large number of events per second.
  3. Stream processor: Stream processors like Flink can be scaled horizontally by adding more tasks or jobs. We can have a separate Flink job reading from each shard and performing the aggregation for the AdIds in that shard (see the partition-key sketch after this list).
  4. OLAP DB: The OLAP DB can be scaled horizontally by adding more nodes. While we could shard by AdId, we may also consider sharding by AdvertiserId instead, so that all the data for a given advertiser lives on the same node.
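
One concrete way to line shards up with aggregation work is to key events by AdId when the click processor writes to the stream. A sketch with kafka-python (topic name and fields are illustrative):

```python
import json

from kafka import KafkaProducer  # assumed: kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_click(ad_id: str, impression_id: str) -> None:
    # Keying by ad_id means every click for the same ad lands on the same
    # partition, so a single stream-processor task owns that ad's aggregation.
    producer.send("ad-clicks", key=ad_id,
                  value={"ad_id": ad_id, "impression_id": impression_id})
```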

How can we ensure that we don’t lose any click data?

Click data matters a lot. If we lose click data, we lose money. We need to ensure that our data is correct. Guaranteeing correctness and low latency are often at odds.
We can utilize a hybrid approach of batch and stream processing to ensure correctness. We can dump the raw click events into a data lake like S3. A periodic batch processing job can then run to read the raw click events and re-aggregate them. Later, we can compare the results of the batch job to the results of the stream processor to validate that they match.

Reconciliation
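
A toy sketch of the reconciliation step, assuming we can pull per-minute counts for the same window from both the batch output and the stream output as dictionaries keyed by (ad_id, minute); the names and data are illustrative.

```python
def reconcile(batch_counts: dict, stream_counts: dict) -> list:
    """Compare batch and stream aggregates; return the keys whose counts differ."""
    mismatches = []
    for key in set(batch_counts) | set(stream_counts):
        b, s = batch_counts.get(key, 0), stream_counts.get(key, 0)
        if b != s:
            mismatches.append((key, b, s))
    return mismatches


# Example: the batch job saw 12 clicks for ad-123 in this minute, the stream saw 11.
print(reconcile({("ad-123", "2024-05-01T10:00"): 12},
                {("ad-123", "2024-05-01T10:00"): 11}))
```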

Prevent abuse from users clicking on ads multiple times

  1. Ad placement service generates a unique impression ID for each ad.
  2. The impression ID is signed with a secret key and sent along with the ad.
  3. When a user clicks on the ad, the impression ID is sent to the Click Processor along with the click data.
  4. The Click Processor verifies the signature of the impression ID.
  5. The Click Processor checks if the impression ID exists in a cache. If it does, it’s a duplicate and we can ignore it. If it does not, we put the click in the stream and add the impression ID to the cache (see the sketch below).
Enforcing Click Idempotency using ImpressionId signed with Secretkey
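
A minimal sketch of the signing and dedup logic, using Python’s hmac module and an in-memory set as the cache (a real deployment would use a distributed cache such as Redis); the secret key and helper names are illustrative.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # illustrative
seen_impressions = set()  # in practice: a distributed cache such as Redis


def sign_impression(impression_id: str) -> str:
    """Ad Placement Service: sign the impression ID before serving the ad."""
    return hmac.new(SECRET_KEY, impression_id.encode(), hashlib.sha256).hexdigest()


def handle_click(impression_id: str, signature: str) -> bool:
    """Click Processor: verify the signature, then drop duplicates."""
    expected = sign_impression(impression_id)
    if not hmac.compare_digest(expected, signature):
        return False  # forged or tampered impression ID
    if impression_id in seen_impressions:
        return False  # duplicate click: ignore it
    seen_impressions.add(impression_id)
    return True  # first valid click: put it on the stream
```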

Appendix

Stream

A “stream” refers to a sequence of data elements made available over time, of indefinite length.

The key characteristics of a stream include:
Sequentiality: Data in a stream is processed in the order it is received.
Unbounded data: The data has no known size and no defined beginning or end; it can continue indefinitely.
Real-time processing: Streams are often processed on the fly, enabling actions or insights to be derived from incoming data almost immediately after it is generated.

Stream Processing

Batch processing refers to the computational analysis of a bounded dataset. Bounded datasets are available and retrievable as a whole from some form of storage; we know the size of the dataset at the start of the computation, and the duration of the process is limited in time.
Stream processing is concerned with processing data as it arrives in the system. Given the unbounded nature of data streams, the stream processor needs to run constantly for as long as the stream is delivering new data, which might theoretically be forever.

In other words, stream processing is the processing of an input sequence of indefinite length, observed over time, in real time.

Amazon Kinesis Data Streams

Kinesis Data Streams is all about ingestion and provides temporary data storage. A few example use cases are:

  1. Clickstream events (ad clicks, video likes, etc.)
  2. Telemetry data, e.g., sensor data.
  3. Log files: logs come from a bunch of different systems and you want to correlate, analyze, and deliver them somewhere.

Once data is in the stream, you may want to derive insights from it. For that you can use Kinesis Data Analytics.
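
For ingestion, a producer can push click events into a Kinesis data stream with boto3; a sketch (stream name and region are illustrative):

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def put_click(ad_id: str, impression_id: str) -> None:
    # PartitionKey controls which shard the record lands on; using ad_id
    # keeps all clicks for one ad on the same shard.
    kinesis.put_record(
        StreamName="ad-clicks",  # illustrative stream name
        Data=json.dumps({"ad_id": ad_id, "impression_id": impression_id}).encode("utf-8"),
        PartitionKey=ad_id,
    )
```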

OLAP (On Line Analytical Processing) databases

The OLAP DB supports the complex queries by allowing for fast retrieval of aggregated data, slicing and dicing through different dimensions, and performing rapid on-the-fly aggregations. Because the data is already pre-aggregated to some extent by Flink, the response time for such queries is much quicker than if they had to be computed from raw data on the fly. This capability is crucial for business intelligence tasks where timely insights can inform strategic decisions.
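
Assuming a Redshift (Postgres-compatible) OLAP store with a pre-aggregated clicks_per_minute(ad_id, minute, clicks) table, which is an illustrative schema rather than part of the original design, a per-minute query for one ad over a time range might look like:

```python
import psycopg2  # assumed: Redshift queried through its Postgres-compatible interface

conn = psycopg2.connect(host="olap.example.com", dbname="ads",
                        user="report", password="example")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT minute, clicks
        FROM clicks_per_minute          -- illustrative pre-aggregated table
        WHERE ad_id = %s
          AND minute BETWEEN %s AND %s
        ORDER BY minute
        """,
        ("ad-123", "2024-05-01 10:00", "2024-05-01 11:00"),
    )
    rows = cur.fetchall()
```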

Glossary

Ad Placement Service: A service responsible for placing ads on the website and associating them with the correct redirect URL.
Apache Cassandra: Cassandra is a distributed NoSQL database. It is not optimized for range queries or aggregations.
Apache Spark: A distributed computing engine optimized for batch processing.
Apache Flink: A processing engine for stateful computations over unbounded data streams.
Spark Streaming: Spark Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
