Optimizing Backfills

Matt Weingarten
4 min read · Aug 12, 2022


I looked up backfill and got landfill. Close enough.

Introduction

We’re currently doing various backfills on account of a weird production bug we’ve run into over the last few days (I promise our system is normally stable, even though my history of blog posts may indicate otherwise; where’s the fun in posting about things that work, anyway?). However, getting our data back on track has involved more effort than necessary. Can we do better?

Current Situation

We use our daily data reconciliation job to identify any mismatches we may have across the different layers of our processing. Generally, when we have to reprocess, there are a few different scenarios that might occur:

  • Backfill an entire day of data
  • Backfill a subset of a day of data (perhaps one or more of the feeds has an issue while the others are fine)
  • Backfill multiple days of data in a row (this could be to apply new schema changes to previous days of data)

For each of these scenarios, we have a different Airflow DAG that handles the creation of the EMR cluster, submission of the job, and termination of said cluster upon completion. These DAGs each have their own EMR configuration, stored in YAML files that are loaded at runtime, because they require different numbers of nodes to run.
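As a rough sketch of what one of these per-scenario DAGs looks like today (the operator names come from the Airflow Amazon provider; the YAML path, DAG ID, and config contents are hypothetical placeholders):

```
# One of today's per-scenario backfill DAGs: load that scenario's EMR config
# from YAML, create a cluster, and terminate it when the work is done.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

with open("configs/backfill_full_day_emr.yaml") as f:
    emr_config = yaml.safe_load(f)  # instance types, node counts, tags, ...

with DAG(dag_id="backfill_full_day", start_date=datetime(2022, 8, 1),
         schedule_interval=None) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=emr_config,
    )
    # (step submission omitted for brevity)
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )
    create_cluster >> terminate_cluster
```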

However, this approach is frustrating because of all the DAG maintenance it requires. If we need to make a change across our backfill DAGs (like adding a tag or migrating to a new instance type), we need to make that change in many places. Furthermore, we’re unable to properly cover the scenario we’re facing now: reprocessing various (noncontiguous) days of data for various feeds. To handle that, and to reduce the strain on our on-call support, we need to be more flexible.

Achieving Flexibility

The ideal way to handle this would be to create a configuration file that defines all the backfilling that needs to be done. This could be a JSON or YAML file that the DAG reads in at runtime and parses to create the necessary steps. All of these steps would then run on one cluster, so only one DAG submission would be needed to handle our backfills. Of course, this DAG could cover the aforementioned backfill scenarios as well, so we’d be all set.
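As a minimal sketch of the idea (the structure, field names, and job path below are made up for illustration; on disk it could just as easily be YAML), the DAG would loop over the entries and build one EMR step per backfill:

```
# Hypothetical parsed backfill config: one entry per backfill to run.
backfill_config = {
    "backfills": [
        {"date": "2022-08-10", "feeds": ["all"]},               # full day
        {"date": "2022-08-08", "feeds": ["feed_a", "feed_b"]},  # subset of feeds
        {"date": "2022-08-05", "feeds": ["feed_a"]},            # noncontiguous day
    ]
}

def build_emr_steps(config):
    """Turn each backfill entry into an EMR step, all run on one shared cluster."""
    steps = []
    for entry in config["backfills"]:
        steps.append({
            "Name": f"backfill-{entry['date']}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://our-bucket/jobs/backfill.py",  # placeholder job location
                    "--date", entry["date"],
                    "--feeds", ",".join(entry["feeds"]),
                ],
            },
        })
    return steps
```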

One caveat would be the EMR configuration. Since we won’t know ahead of time what types of backfills we’ll be doing (and we could be running different types of backfill on a given day), we likely need to give the EMR cluster a configuration that can support processing an entire day of data, just in case that’s what’s needed. To keep costs in check, we’d use managed scaling as needed. EMR Serverless, when it becomes a reality for us (soon, I hope), would help as well.
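For instance, the cluster definition could be sized for a full day of data but wrapped in a managed scaling policy so EMR can shrink it when the config only asks for a subset. This is only a sketch with placeholder instance types and counts, not our actual sizing:

```
# Cluster overrides sized for the worst case (a full day), with managed
# scaling so smaller backfills don't pay for idle capacity.
job_flow_overrides = {
    "Name": "backfill-cluster",
    "ReleaseLabel": "emr-6.7.0",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "ManagedScalingPolicy": {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 20,  # enough headroom for a full day of data
        }
    },
}
```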

We can also apply this logic to the “replay” processing we do in our microservice architecture. We don’t use EMR there (we just send messages to the SQS queue that serves as the input to our ECS processing), but we can still read the configuration file to support all the different backfill scenarios, instead of forcing a limited set of scenarios through the arguments we pass to the replay DAG (which in turn calls the Lambda that re-ingests those SQS messages). This gives us ultimate freedom in our reprocessing.
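On the replay side, the same config could drive the re-ingestion; a minimal sketch, assuming one SQS message per date/feed pair and a placeholder queue URL:

```
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replay-queue"  # placeholder

def replay_from_config(config):
    """Re-enqueue one message per (date, feed) pair for the ECS consumers to reprocess."""
    for entry in config["backfills"]:
        for feed in entry["feeds"]:
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"date": entry["date"], "feed": feed}),
            )
```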

A Step Further

We don’t have console write access to S3, so what’s the best way to store our backfill configurations there for the DAG to read? One option would be to maintain the config in a version-controlled file and use a CI/CD job to upload it to S3. That works, but you’d have to go through Jenkins or the like every time the file needs updating. Can we do better?

I’ve talked in the past about automating ad-hoc DAGs, and that concept could definitely apply here as well. If we uploaded our configurations to S3 and a Lambda function picked them up and triggered the DAG, passing the S3 location as an argument, the whole flow would be completely event-based.
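A sketch of what that Lambda could look like, assuming Airflow’s stable REST API is reachable (the URL, DAG ID, and auth header are placeholders):

```
import json

import urllib3  # bundled in the Lambda Python runtime

http = urllib3.PoolManager()
AIRFLOW_URL = "https://our-airflow.example.com"  # placeholder
DAG_ID = "generic_backfill"                      # hypothetical DAG name

def handler(event, context):
    # S3 put event: grab the bucket/key of the uploaded backfill config
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Trigger the DAG, passing the config location through the run conf
    resp = http.request(
        "POST",
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        body=json.dumps({"conf": {"config_s3_path": s3_path}}),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer <token>",  # however auth is handled in our setup
        },
    )
    return {"status": resp.status}
```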

How would those files get uploaded to S3? A Jenkins job would be tedious, so the best route would be to use Databricks. If we need to add a new S3 bucket to send these files to, that’s just a matter of modifying our federated role. We should be able to use our pre-existing auto insights Lambda, as we already do for making API calls from within Databricks. That Lambda already integrates with the Slack and PagerDuty APIs, so Airflow would be a logical next step: as long as we pass the DAG ID, an authentication token, and the arguments the DAG needs to run, we’d be all set.
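The Databricks side could then be as simple as writing the config to the bucket the Lambda is watching (the bucket and key here are placeholders):

```
import json

import boto3

config = {"backfills": [{"date": "2022-08-10", "feeds": ["feed_a"]}]}

# Drop the config where the S3 event notification will pick it up
boto3.client("s3").put_object(
    Bucket="our-backfill-configs",      # new bucket covered by the federated role
    Key="backfills/2022-08-12.json",
    Body=json.dumps(config).encode("utf-8"),
)
```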

Conclusion

I’m not sure whether it’s looking more closely at our gaps or a lack of sleep (likely the latter) that’s driven my focus on automation as of late. Making the on-call process smooth should be a goal, and simplifying and automating whatever we can will move us in that direction.

