Writing Your First DAG? Use SQL For More Accurate Data Availability Checks
Avoid the dreaded problems of data downtime and duplication by using simple SQL queries to establish data availability.
Upstream Task Failed: Where’s My Data?
Between two programming languages (yes, I consider SQL a programming language), cloud apps, and third-party tools, the two functions I use most in my data pipelines are familiar to all:
Copy. Paste.
Though I’m being a bit facetious, when you’re building out data infrastructure, especially at an early stage of organizational maturity, there will be pipelines and scripts that are derivative of earlier work.
In these cases there is little to no shame in reusing pre-existing code. However, you can fall into the trap I fell into, especially with my Airflow DAG creation: assuming prewritten code would apply to my use case and failing to write data availability checks unique to my build.
Recently, as I’ve been converting more of my work from isolated functions and VMs to orchestrated processes, I’ve been thinking more deeply about the logic I need to integrate to “kick off” a DAG.
The queries themselves are simple, usually no more than three to five lines, but most of my time goes into understanding the parameters before writing the SQL itself.
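To make the idea concrete, here is a minimal sketch of the kind of short availability query I mean, run against an in-memory SQLite database so it stands alone. The table name `events`, the column `loaded_date`, and the `data_is_available` helper are all hypothetical names for illustration, not from any particular pipeline; in Airflow you would typically wrap a query like this in a sensor or an upstream check task.

```python
import sqlite3
from datetime import date

# Hypothetical table and sample rows, standing in for an upstream load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, loaded_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, date.today().isoformat()), (2, date.today().isoformat())],
)

def data_is_available(conn, run_date):
    """Return True only if at least one row landed for the run date."""
    # The availability check itself: a short, targeted SQL query.
    (row_count,) = conn.execute(
        "SELECT COUNT(*) FROM events WHERE loaded_date = ?",
        (run_date.isoformat(),),
    ).fetchone()
    return row_count > 0

if data_is_available(conn, date.today()):
    print("upstream data present; safe to kick off the DAG")
else:
    print("no rows yet; skip or retry instead of processing stale data")
```

The point is less the query than the gate around it: downstream tasks only run once the check confirms the specific partition they need actually exists.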