How to Handle Late Arriving Data, Part 1

IAS Tech Blog

By Matt Chudoba

At Integral Ad Science, we process high volumes of partner data through our platform every day, so that we can quickly provide meaningful insights to customers. Our partners control when the data is made available and sometimes there are delays due to API outages or planned downtime.

This two-part series will share an overview of our data intake and processing pipelines, focusing on how we built automatic monitoring and tools to handle late arriving data most effectively.

This post addresses how we designed our architecture to accommodate late arriving data, while the next outlines specific steps in the pipeline that handle reprocessing late arriving data.

Designing the Pipeline

Our pipelines use Airflow and various AWS services, including S3, EMR, and DynamoDB. Four key requirements influenced the pipeline architecture we built to handle late arriving data:

  1. It is critical that the entire pipeline design supports re-running against inputs that have already been processed.
  2. Automatically processing late arriving data is essential. By contrast, manually monitoring for late data, kicking off scripts, and verifying the results is costly, error-prone, and tedious.
  3. Thresholds for reprocessing data need to be configurable per data source, so that we only reprocess when the amount of late data justifies the infrastructure cost.
  4. For compliance and tracking purposes, real-time alerts are required to indicate when late arriving data is processed and the impact on overall results.

When collecting partner data, some partners push data to the IAS platform via an API, while others require data to be pulled in batches throughout the day. We created two separate components to manage these different data sources: a Receiver Pipeline and a Processing Pipeline. The structure of these data sources is similar, so we generalized both components to work for all partners with minimal code duplication.

Receiver Pipeline

The receiver pipeline collects data from various sources and partitions it by timestamp, creating a partition in S3 for each source, date, and hour of data.

For partners that push data to IAS through an API, we stream the data through several lightweight steps to clean, enrich, and partition it.
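
To make the push path concrete, here is a minimal sketch of the partitioning step, assuming a JSON payload with an event timestamp; the bucket, key layout, and field names are illustrative placeholders, and in practice the clean, enrich, and partition steps run as separate lightweight stages.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "partner-raw-data"  # placeholder bucket name


def land_pushed_record(record: dict, source: str) -> None:
    """Clean and enrich a pushed record, then write it under a source/date/hour partition."""
    event_time = datetime.fromtimestamp(record["event_ts"], tz=timezone.utc)
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()  # simple enrichment

    key = (
        f"{source}/dt={event_time:%Y-%m-%d}/hr={event_time:%H}/"
        f"{record['event_id']}.json"
    )
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(record))
```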

For partners that require multiple daily data pulls, the IAS platform downloads the data in batches based on a set schedule. Our platform pulls data by downloading files rather than querying individual records from an API. When downloading files, we use DynamoDB as a cache to track files that have already been processed.
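
A conditional write to DynamoDB is one way to implement that cache; the sketch below assumes a table keyed on the file name, and the table and attribute names are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
processed_files = dynamodb.Table("processed-partner-files")  # placeholder table name


def claim_file(file_key: str) -> bool:
    """Return True if this file has not been downloaded before, and mark it as processed."""
    try:
        processed_files.put_item(
            Item={"file_key": file_key},
            ConditionExpression="attribute_not_exists(file_key)",
        )
        return True  # first time we have seen this file
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already handled in an earlier batch
        raise
```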

With both intake processes, partitioned data lands in an S3 bucket. We create matching partitions for the raw data in an AWS Glue database: for each partner, a Glue table is partitioned by date and hour, and the backend processing pipeline uses it to query the data from S3.
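
One way to register those partitions is with Athena DDL against the Glue catalog, roughly as sketched below; the database, table, and bucket names are placeholders, and a Glue crawler or the Glue API would work just as well.

```python
import boto3

athena = boto3.client("athena")


def register_partition(source: str, dt: str, hr: str) -> None:
    """Add an hourly partition to the Glue table so downstream queries can find it."""
    ddl = (
        f"ALTER TABLE raw_{source} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}', hr = '{hr}') "
        f"LOCATION 's3://partner-raw-data/{source}/dt={dt}/hr={hr}/'"
    )
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "partner_raw"},  # placeholder database name
        ResultConfiguration={"OutputLocation": "s3://partner-athena-results/"},
    )
```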

Backend Processing Pipeline

Once data is available in S3 and Glue, it is ready to process. We perform several steps to aggregate the data and generate business metrics. Regardless of how the receiving pipeline fetches raw data (push vs. pull), the backend pipeline processes it in batches. Leaving a buffer of a few hours between receiving the last hour of data and processing it minimizes the amount of late arriving data we have to handle, except for data that arrives several hours or days late.
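
As a simple illustration of that buffer, a batch run might only pick up hourly partitions older than a configurable cutoff; the buffer and batch sizes below are made-up values.

```python
from datetime import datetime, timedelta, timezone

BUFFER_HOURS = 3  # illustrative value; the real buffer is tuned per data source


def hours_to_process(run_time: datetime, batch_hours: int = 24) -> list[str]:
    """Return the hourly partitions old enough to be safely included in this batch."""
    cutoff = run_time.replace(minute=0, second=0, microsecond=0) - timedelta(hours=BUFFER_HOURS)
    return [(cutoff - timedelta(hours=i)).strftime("%Y-%m-%d/%H") for i in range(batch_hours)]


# A run at 03:30 UTC with a 3-hour buffer processes partitions up to the 00:00 hour and back.
print(hours_to_process(datetime(2021, 6, 2, 3, 30, tzinfo=timezone.utc))[:3])
# ['2021-06-02/00', '2021-06-01/23', '2021-06-01/22']
```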

We use Airflow to build this part of the pipeline because it provides:

  1. Flexible scheduling and orchestration: Airflow makes it easy to schedule our jobs to run at regular intervals using cron syntax. It also lets us declare dependencies between tasks so that independent steps of the pipeline run in parallel (see the sketch after this list).
  2. Replayability: This is a critical feature that allows us to rerun our pipeline at any point for any batch that was already processed. For example, when late data arrives for a previous date, we can kick off the Airflow job for that exact date again.
  3. Monitoring, error handling, and resilience: Being able to monitor progress throughout the day, track errors, and retry failed steps is critical, and Airflow gives us all three out of the box.
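
A stripped-down sketch of what such a DAG could look like is below; the DAG id, schedule, and task names are placeholders, and the real pipeline triggers Spark jobs on EMR rather than no-op Python callables.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _placeholder(**_):
    """Stand-in for the real work each task performs."""


with DAG(
    dag_id="partner_backend_pipeline",  # placeholder DAG id
    schedule_interval="0 6 * * *",      # cron syntax: run daily at 06:00 UTC
    start_date=datetime(2021, 1, 1),
    catchup=False,                      # past dates are re-run explicitly when late data arrives
) as dag:
    aggregate = PythonOperator(task_id="aggregate_metrics", python_callable=_placeholder)
    load_warehouse = PythonOperator(task_id="load_warehouse", python_callable=_placeholder)
    send_alerts = PythonOperator(task_id="send_alerts", python_callable=_placeholder)

    # Independent partners can fan out into parallel branches before these steps.
    aggregate >> load_warehouse >> send_alerts
```

Because each run is tied to an execution date, reprocessing a past batch is just a matter of clearing or triggering the DAG run for that date.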

We set up Airflow to integrate with various AWS services. Given the scale of data IAS processes, we use AWS EMR with Apache Spark.

The Airflow pipeline includes a sequence of tasks that orchestrate Spark jobs to process raw data. Each Spark job reads data from S3 using Glue tables created by the receiver pipeline. Once processing is complete, the final output is written to an S3 bucket and loaded into the IAS data warehouse.
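
The aggregation step itself might look roughly like the PySpark sketch below, assuming the EMR cluster is configured to use the Glue Data Catalog as its metastore; the database, table, column, and bucket names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("partner-hourly-aggregation")
    .enableHiveSupport()  # on EMR this can be backed by the Glue Data Catalog
    .getOrCreate()
)

# Partition pruning: only the requested date's partitions are read from S3.
events = spark.table("partner_raw.raw_partner_a").where(F.col("dt") == "2021-06-01")

daily_metrics = (
    events.groupBy("dt", "campaign_id")
          .agg(
              F.count("*").alias("events"),
              F.countDistinct("event_id").alias("unique_events"),
          )
)

daily_metrics.write.mode("overwrite").parquet("s3://partner-processed/partner_a/dt=2021-06-01/")
```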

We built a series of pipelines that work together to gather data from our partners, aggregate it, and load it into a data warehouse for consumption. By using Airflow and partitioning the data, we designed the architecture to accommodate late arriving data. In the next part of this series, I will go into detail about the steps required to reprocess late arriving data.
