Orchestration

Writing Your First DAG? Use SQL For More Accurate Data Availability Checks

Avoid the dreaded problems of data downtime and duplication by using simple SQL queries to establish data availability.

Zach Quinn
Published in Learning SQL
5 min read · Dec 11, 2023


Upstream Task Failed: Where’s My Data?

Between two programming languages (yeah, I consider SQL a programming language), cloud apps and third-party tools, the two functions I use in my data pipelines are familiar to all:

Copy. Paste.

Though I’m being a bit facetious, when you’re building out data infrastructure, especially at a younger stage of organizational maturity, there will be pipelines and scripts that are derivative of earlier work.

In these cases there is little to no shame in reusing pre-existing code. However, you can fall into the trap I fell into, especially with my Airflow DAG creation: assuming prewritten code would apply to my use case and failing to write data availability checks unique to my build.

Recently, as I’ve been converting more of my work from isolated functions and VMs to orchestrated processes, I’ve been thinking more deeply about the logic I need to integrate to “kick off” a DAG.

The queries themselves are simple, usually no more than 3 to 5 lines, but most of my time goes to understanding the parameters before I write any SQL.
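To make that concrete, here is a minimal sketch of what such a check might look like. It is not taken from any real pipeline: the `daily_orders` table, the `load_date` column, and both function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse so the example runs on its own.

```python
import sqlite3

# Hypothetical availability check: has the upstream table received
# any rows for the date this run is supposed to process?
AVAILABILITY_SQL = """
    SELECT COUNT(*) AS row_count
    FROM daily_orders
    WHERE load_date = ?
"""


def data_is_available(conn: sqlite3.Connection, run_date: str) -> bool:
    """Return True when at least one row exists for run_date."""
    (row_count,) = conn.execute(AVAILABILITY_SQL, (run_date,)).fetchone()
    return row_count > 0


def build_demo_connection() -> sqlite3.Connection:
    """Create an in-memory stand-in for the warehouse (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_orders (order_id INTEGER, load_date TEXT)")
    conn.executemany(
        "INSERT INTO daily_orders VALUES (?, ?)",
        [(1, "2023-12-10"), (2, "2023-12-10")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_demo_connection()
    for run_date in ("2023-12-10", "2023-12-11"):
        status = "available" if data_is_available(conn, run_date) else "missing"
        print(f"{run_date}: {status}")
```

In an Airflow DAG, a boolean like this could gate the downstream tasks, for instance through a `ShortCircuitOperator`, so the pipeline only "kicks off" once the rows it depends on have landed. The parameters are where the thinking goes: which table, which date column, and what counts as "enough" data will differ for every build.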
