Data Pipelines: Backfill Strategy

Published in 99P Labs · Dec 17, 2021

Written By: Nithin Santhanam

Introduction

“Backfilling” data refers to the process of retroactively processing historical data in a data pipeline. It is an important component of any actively growing data pipeline. In these types of pipelines, data is ingested every day and the pipeline applies ETL logic to any ingested data found during the day. In the base case, you can target the date partition containing the current day’s data and read all relevant files inside that partition. However, scenarios arise where you need to apply an ETL to more than just the current day’s data.
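As a concrete illustration of the base case, a minimal PySpark sketch might look like the following; the bucket, dataset, and table names are hypothetical, and the path layout assumes the day-level partitioning convention described later in this post.

```python
from datetime import date

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-etl").getOrCreate()

# Hypothetical bucket, dataset, and table names; the layout follows a
# dataset/table/year/month/day partitioning scheme.
today = date.today()
input_path = (
    f"s3://example-data-lake/telemetry/trips/"
    f"{today.year}/{today.month:02d}/{today.day:02d}/"
)

# Read every file that landed in today's date partition and hand it to the ETL.
daily_df = spark.read.parquet(input_path)
# ... ETL transformations and writes would follow here ...
```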

Some example scenarios:

· You’ve recently implemented a new daily-ingesting data pipeline that applies the ETL to data you receive from a source, but you now also need to apply the ETL to all data that arrived before the pipeline was successfully productionized

· You’ve changed your ETL process and need to reprocess your original raw data using the new ETL process

· You’re fixing a bug found in your ETL process and need to reprocess the affected data

These scenarios require a backfilling process that is capable of applying your ETL logic to the desired date range.

General Steps Involved in this Process:

· Accept configurable user input specifying the desired date range

· Translate that input into the programmatic logic of your pipeline

· Read all data in the desired date range (a sketch of the first two steps follows this list)
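As a rough sketch of those first two steps (the argument names and ISO date format are illustrative, not our actual interface), a backfill job might accept and expand the range like this:

```python
import argparse
from datetime import date, timedelta


def parse_args():
    # Step 1: accept configurable user input describing the desired date range.
    parser = argparse.ArgumentParser(description="Backfill a date range")
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    return parser.parse_args()


def dates_in_range(start: date, end: date):
    # Step 2: translate the range into the concrete dates the pipeline will target.
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


if __name__ == "__main__":
    args = parse_args()
    for day in dates_in_range(args.start_date, args.end_date):
        print(day)  # each date maps to one date partition to backfill
```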

The focus of this blog post is to explore different algorithmic approaches for implementing backfilling logic in our pipelines and to compare their runtime efficiency.

Why is Backfilling Efficiency Important?

In the era of big data, having a good backfilling method is crucial. For smaller datasets or one-off jobs, the backfilling strategy is not as critical. However, in the case where a dataset contains billions to trillions of records and is constantly growing, the impact of the strategy becomes much more significant. Data teams have limited resources in terms of cost, time, storage, processing, etc. and therefore it becomes important to do larger scale jobs as efficiently as possible. The goal for our team is generally to limit requests, reads, writes, etc. when possible.

Infrastructure Assumptions:

Our team has architected some data processing strategies and tools that we use in all our data pipelines. The platform is in the process of integrating data streaming pipelines, but the types of backfilling mechanisms discussed in this post are designed to run as batch jobs.

Some components of our platform architecture:

1. Scalable Cloud Storage

2. Scalable Cloud Computation tools to execute data processing (e.g. AWS EMR, Azure HDInsight)

3. Available Spark Cluster to execute data processing and analysis

4. Consistent Partitioning Strategy

a. Generally, all of our pipelines partition by dataset/table/year/month/day/… (an illustrative helper follows this list)
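For illustration, a small helper along these lines (with hypothetical dataset and table names) captures that convention:

```python
from datetime import date


def partition_prefix(dataset: str, table: str, day: date) -> str:
    # Mirrors the dataset/table/year/month/day layout used across our pipelines.
    return f"{dataset}/{table}/{day.year}/{day.month:02d}/{day.day:02d}/"


# partition_prefix("telemetry", "trips", date(2021, 12, 17))
# -> "telemetry/trips/2021/12/17/"
```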

Backfilling Algorithms:

Listed below, at a conceptual level, are the two algorithms tested in this project:

Option 1: Path Creation + Check if Path Exists

1. Create potential paths within the time range

a. Create a list of dates between the start and end of the range

b. Iterate through the list, combining the bucket, directories, and each date to generate a path

2. Check one by one whether each path exists, using a general try/except framework

a. If an error is thrown, move to the next path

b. Otherwise, add the path to the list of valid input paths (a sketch of this approach follows the list)
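Below is a minimal sketch of this option, assuming AWS S3, boto3, and hypothetical bucket and table names. Our implementation wraps the existence check in try/except; this sketch approximates that check with a single-key list call, which is one way to test whether a prefix contains any objects.

```python
from datetime import date, timedelta

import boto3

BUCKET = "example-data-lake"  # hypothetical bucket name
PREFIX_TEMPLATE = "telemetry/trips/{y}/{m:02d}/{d:02d}/"  # hypothetical dataset/table


def candidate_prefixes(start: date, end: date):
    # Step 1: build every potential partition path in the requested range.
    for i in range((end - start).days + 1):
        day = start + timedelta(days=i)
        yield PREFIX_TEMPLATE.format(y=day.year, m=day.month, d=day.day)


def existing_paths(start: date, end: date):
    # Step 2: check the candidate paths one by one and keep only those that
    # actually contain objects. Each check is a separate S3 request, which is
    # the per-iteration cost discussed in the conclusions.
    s3 = boto3.client("s3")
    valid = []
    for prefix in candidate_prefixes(start, end):
        response = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1)
        if response.get("KeyCount", 0) > 0:
            valid.append(f"s3://{BUCKET}/{prefix}")
    return valid
```

The resulting list of valid paths can then be passed to the Spark read for the backfill.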

Option 2: Date Partition Based

1. List all date partitions in cloud storage

2. Parse dates from file paths

3. Filter the paths using the parsed dates (a sketch follows the list)
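A corresponding sketch of this option, under the same assumptions (S3, boto3, and hypothetical names), might look like:

```python
import re
from datetime import date

import boto3

BUCKET = "example-data-lake"  # hypothetical bucket name
TABLE_PREFIX = "telemetry/trips/"  # hypothetical dataset/table prefix
DATE_PATTERN = re.compile(r"(\d{4})/(\d{2})/(\d{2})/")


def partitions_in_range(start: date, end: date):
    # Step 1: list everything under the table prefix (one paginated listing).
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    prefixes = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=TABLE_PREFIX):
        for obj in page.get("Contents", []):
            # Step 2: parse the year/month/day portion out of each key.
            match = DATE_PATTERN.search(obj["Key"])
            if not match:
                continue
            y, m, d = (int(g) for g in match.groups())
            # Step 3: keep only partitions whose date falls inside the range.
            if start <= date(y, m, d) <= end:
                prefixes.add(f"s3://{BUCKET}/{TABLE_PREFIX}{y}/{m:02d}/{d:02d}/")
    return sorted(prefixes)
```

Here cloud storage is interrogated with a single (paginated) listing, and all of the date filtering happens locally on the returned keys.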

Explanation and Reasoning of Algorithms:

I thought it would be interesting to test two general philosophies that I would describe as “top-down” and “bottom-up.” I consider the first option “top-down” because with this algorithm we already know the date range we are looking for, so from the start we limit our search by creating all of the possible paths and then confirming whether objects exist at those paths. I consider the second option “bottom-up” because you begin by listing all of the available options and then determine which, if any, of those options fall in your date range.

Ultimately, the biggest reason I wanted to test these two styles of algorithms was their apparent limiting factors as time passes and the data to search grows. Inherently, the top-down approach’s runtime is limited by the number of days you are attempting to backfill, while the bottom-up approach is limited by the number of date partitions present in your data lake.

What this means is that as time progresses, if you attempt to backfill x days of data, the runtime of option 2 will keep increasing, while the runtime of option 1 will vary only with the amount of data being backfilled and has no dependency on the number of date partitions in cloud storage.

One important observation is that, regardless of the implementation details we choose for our search algorithms, it is not desirable for the runtime to depend on the entire domain space. Ideally, we would instead create a “guided” search mechanism that looks only for specific items and is independent of the entire domain.

Runtime Equations:

There are two key influences on these algorithms’ runtimes: the total amount of data stored in our data lake and the number of days we are attempting to backfill. In our case, the smallest unit of time we use for partitioning is a day, so the number of date partitions in our data lake is the variable that represents the amount of data in the lake.
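In rough notation (d is the number of backfill days, N is the number of date partitions in the data lake, and the t terms are the per-request initialization, existence-check, bulk-listing, and parse/filter costs), the two runtimes take approximately the form:

$$
T_{\text{option 1}}(d) = \sum_{i=1}^{d} \left( t_{\text{request}} + t_{\text{exists}} \right) \approx d \cdot \left( t_{\text{request}} + t_{\text{exists}} \right)
$$

$$
T_{\text{option 2}}(N) = t_{\text{request}} + t_{\text{list}}(N) + N \cdot t_{\text{parse/filter}}
$$

In other words, option 1 pays the request-initialization cost once per backfill day, while option 2 pays it once and then does local work proportional to the total number of partitions.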

Testing Methodology:

The general methodology we used was to run each of these algorithms on a dataset, varying the number of days to backfill. We then observed how the runtime of each behaved as the number of days increased.

I attempted to make the testing methodology as robust as possible. Different pipelines ingest varying amounts of data per day, which introduces some potential variance in results depending on which pipeline is being analyzed. For this reason, I opted to test the backfill algorithms on 3 separate pipelines that I believe represent 3 generally different data cases. We applied the test methodology to each of these datasets and recorded the results for later analysis.

Pipelines:

· Pipeline 1:

o API request driven

o Consistent number of files (roughly 1–24 per date partition)

· Pipeline 2:

o Database query driven

o 1 File per date partition

· Pipeline 3:

o Event driven

o Large, varying number of files per partition (anywhere from 0 to about 6,000 per date partition)

Using the pipelines above, I was able to get an idea of how the algorithms behave based on consistency and volume of files per date partition.

To avoid outlier runtimes, I also added a repetition component to the testing, running the algorithms on each dataset 50 times and then computing the mean, median, max, and min of the results. The hope was that this step would smooth out the results.
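A minimal sketch of that timing harness is shown below; the functions existing_paths and partitions_in_range refer to the illustrative sketches earlier in this post rather than our production code.

```python
import statistics
import time


def time_algorithm(algorithm, start, end, repetitions=50):
    # Run one backfill algorithm repeatedly and summarize its runtimes to
    # smooth out outliers from network and cluster variance.
    runtimes = []
    for _ in range(repetitions):
        began = time.perf_counter()
        algorithm(start, end)  # e.g. existing_paths or partitions_in_range
        runtimes.append(time.perf_counter() - began)
    return {
        "mean": statistics.mean(runtimes),
        "median": statistics.median(runtimes),
        "max": max(runtimes),
        "min": min(runtimes),
    }
```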

We formatted the output as a table showing the time taken by each algorithm for each number of backfill days. Each dataset produced such a table, which we then analyzed and plotted to draw conclusions.

Conclusions

For Pipeline 1 and Pipeline 3, option 2 (“Listing Partitions”) was the more time-efficient algorithm by a significant margin. Particularly in cases where there is a relatively small number of date partitions available, this algorithm performs very well. Its runtime does increase as the number of backfill days grows, but with 365 days in a year and a relatively small number of years of data, there are natural bounds on the runtime that keep it more than manageable.

For Pipeline 2, option 1 (“Create Paths + Check”) was the more efficient algorithm by a couple of seconds, which is more or less negligible. The characteristics of Pipeline 2 (a consistent number of partitions and a single file in each of those partitions) most likely led to much lower runtimes for the path-checking algorithm than in the other pipelines with different characteristics. This observation could help us narrow down exactly what modifications to make in this portion of the algorithm. I believe these results would hold true for any number of backfill days and for any of our data pipelines.

Logically, I believe option 1 (“Create Paths + Check”) should mathematically yield the quicker runtime, but the implementation contains a couple of unexpected bottlenecks that significantly inflated it. The biggest contribution to runtime for both algorithms was the initialization required to make a request to our cloud provider and retrieve the corresponding response. Since the Listing Partitions algorithm makes a single large request, it requires a single initialization to retrieve the entire response. The Path Creation algorithm avoids that single big request but instead makes many smaller requests while checking whether each partition path exists in our cloud storage. Each of these smaller requests requires its own initialization step, which inflates the runtime on every iteration.

If we refer back to the runtime equations presented earlier in the post, we can see that in the first option the list-request cost accumulates with each iteration of the summation as the number of backfill days increases. Meanwhile, the second option makes its list request only once, so that cost is incurred a single time regardless of the number of backfill days.

I believe the strategy of option 1 (“Create Paths + Check”) is more logically sound, but the mechanism for checking whether paths exist will need to be modified or rethought. I would absolutely favor basing the algorithm’s runtime on the amount of data being processed, which we can scope, rather than on the unbounded and ever-increasing number of date partitions that exist in cloud storage.

I appreciate any reader who took the time to walk through this small experiment with me. I’m always trying to improve my approach to how I look at problems on both a large and small scale so the team and I would appreciate feedback from anyone interested! You can get in contact with our team here.
