ELT with Airflow: a no-install, click-and-play blueprint project running in GitHub Codespaces

Tamara Janina Fingerlin
Published in Apache Airflow
Mar 28, 2023

I’m in a hurry, where is the code? Here: https://github.com/TJaniF/airflow-elt-blueprint

Are you a data professional who knows their way around Python and wants to get started with Airflow without reading a tutorial or installing any packages? Or maybe you’ve used Airflow before and want to explore new features in a pre-built sandbox environment? Or are you considering learning Airflow but want to see it do something cool first, without needing any previous knowledge?

This project was made for you! 😊

Airflow UI showing the DAGs present in the blueprint repository.

The Airflow ELT Blueprint repository contains a fully functional extract-load-transform (ELT) pipeline which ingests climate- and weather-related data from two different sources into MinIO, loads and transforms the data in a DuckDB instance using the Astro Python SDK, and powers a Streamlit App.

To run and explore the project yourself, you only need to follow four simple steps:

Step 1: Create a codespace

Thanks to the work of Faisal Hoda, you can use GitHub codespaces to run Airflow with the Astro CLI, without having to install or download anything locally. Just fork the repo and create a codespace with at least four cores; the Astro CLI will automatically start an Airflow environment.

Step 2: Enter your name and city

Navigate to include/global_variables/global_variables.py and provide your values for MY_NAME and MY_CITY.
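
For reference, that file boils down to two plain Python assignments. The values below are just placeholders; swap in your own:

```python
# include/global_variables/global_variables.py (excerpt; values are placeholders)
MY_NAME = "Jane"     # your name, used to personalize the pipeline and app
MY_CITY = "Lisbon"   # the city whose local weather the pipeline will fetch
```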

Step 3: Run the pipeline

Once Airflow is running, you can access the Airflow UI at the Local Address of the forwarded port 8080.

Screenshot of a codespace running the blueprint repo, showing the Local Address for the 8080 port.

In the UI, unpause the DAGs (Directed Acyclic Graphs, Airflow’s unit of workflow) by clicking the toggle to the left of each DAG name, starting from the first DAG in the list and continuing down to the DAG named start.

As soon as the start DAG runs, the other DAGs will run based on data-driven scheduling using Datasets. In the Datasets tab, you can view the dependencies between DAGs and Datasets.

Screenshot of the Datasets tab of the blueprint repository.
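
If you haven’t used Datasets before, the pattern behind this behavior looks roughly like the minimal sketch below. It is a generic illustration, not the repository’s exact code: the DAG names and the Dataset URI are made up. A producer task lists a Dataset in its outlets, and a consumer DAG uses that same Dataset as its schedule, so it runs as soon as the producer task completes successfully.

```python
from pendulum import datetime
from airflow.datasets import Dataset
from airflow.decorators import dag, task

my_dataset = Dataset("duckdb://include/dwh/my_table")  # hypothetical URI

@dag(start_date=datetime(2023, 3, 1), schedule="@daily", catchup=False)
def producer_dag():
    @task(outlets=[my_dataset])
    def update_table():
        ...  # extract/load work that updates the data behind the Dataset

    update_table()

@dag(start_date=datetime(2023, 3, 1), schedule=[my_dataset], catchup=False)
def consumer_dag():
    @task
    def use_table():
        ...  # runs automatically once the producer updates the Dataset

    use_table()

producer_dag()
consumer_dag()
```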

Step 4: View the Streamlit app

As you can see in the image above, the last DAG to run is the run_streamlit_app DAG. It stays in a running state while it creates a Streamlit app based on the data that the pipeline ingested and transformed.

Open the Local Address of the forwarded port 8501 to view the interactive Streamlit App.

Screenshot of the upper half of the Global Climate and Local Weather Streamlit App.
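
If you are curious how a DAG can stay in a running state while serving an app: one simple way to do it, sketched below, is a BashOperator whose command starts the Streamlit server and therefore never exits while the app is up. This is only an illustration of the idea; the file path is hypothetical and the repository’s actual run_streamlit_app DAG may be implemented differently.

```python
from pendulum import datetime
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

@dag(start_date=datetime(2023, 3, 1), schedule=None, catchup=False)
def run_streamlit_app():
    # The bash command blocks for as long as the Streamlit server is running,
    # which keeps this task (and the DAG run) in a running state.
    BashOperator(
        task_id="run_streamlit",
        bash_command="streamlit run include/streamlit_app/app.py --server.port 8501",
    )

run_streamlit_app()
```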

(Some of the) Airflow features shown

The blueprint repository uses a variety of Airflow features:

  • Datasets: as explained above, Datasets are used to create data-driven dependencies between DAGs.
  • Dynamic task mapping: Several tasks in this repository are dynamically mapped, adjusting the number of mapped task instances to the inputs available at runtime (a sketch follows below).
  • Astro Python SDK: The create_reporting_table DAG uses the transform operator of the Astro Python SDK to create a reporting table from a SQL SELECT statement with minimal boilerplate (a sketch follows below).
  • Custom operators and hooks: When interacting with MinIO, the blueprint repository uses custom MinIO operators, which are stored locally in include/custom_operators/minio.py. These operators use a custom Airflow hook to interact with MinIO, using credentials saved in an Airflow connection.
  • Custom reusable task group: The pattern of checking whether a MinIO bucket of a specific name already exists and, if not, creating it occurs several times in this repository. This is a great use case for a reusable task group instead of redefining the tasks every time (a sketch follows below). Explore the task group code at include/custom_task_groups/create_bucket.py.
  • Airflow XCom and Airflow Variables: Small amounts of data, like the current weather in the user-defined city, are passed from one task to another using XCom. The coordinates of the city are saved as an Airflow Variable (a sketch follows below).
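
Dynamic task mapping in a nutshell, as a generic sketch rather than the repository’s code: the number of mapped task instances is determined at runtime from the output of an upstream task.

```python
from pendulum import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 3, 1), schedule=None, catchup=False)
def dynamic_mapping_example():
    @task
    def list_files():
        # imagine this list is only known at runtime, e.g. objects in a bucket
        return ["file_a.csv", "file_b.csv", "file_c.csv"]

    @task
    def process(file_name: str):
        print(f"processing {file_name}")

    # one mapped task instance is created per element of the returned list
    process.expand(file_name=list_files())

dynamic_mapping_example()
```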
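
With the Astro Python SDK, the transform operator turns a decorated Python function that returns a SELECT statement into a task that materializes the query result as a new table. The sketch below is illustrative only: the query, table name, and connection ID are made up, and import paths can vary slightly between SDK versions.

```python
from astro import sql as aql
from astro.table import Table

@aql.transform
def create_reporting_table(in_table: Table):
    # The returned string is executed as SQL; {{ in_table }} is templated
    # with the table passed in when the task is called inside a DAG.
    return "SELECT city, AVG(temperature) AS avg_temp FROM {{ in_table }} GROUP BY city;"

# Usage inside a DAG (hypothetical table name and connection ID):
# create_reporting_table(in_table=Table(name="weather_data", conn_id="duckdb_default"))
```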
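
A reusable task group can be as small as a @task_group-decorated function that you call from every DAG that needs it. The sketch below only illustrates the check-then-create pattern; the repository’s version in include/custom_task_groups/create_bucket.py uses its custom MinIO operators, so the details differ.

```python
from airflow.decorators import task, task_group

@task_group
def create_bucket(bucket_name: str):
    """Reusable group: check whether a bucket exists and create it if it does not."""

    @task
    def bucket_exists(bucket: str) -> bool:
        # Placeholder check; the repository's version asks MinIO via a custom hook.
        return False

    @task
    def create_if_missing(bucket: str, exists: bool):
        if not exists:
            print(f"creating bucket {bucket}")

    create_if_missing(bucket_name, bucket_exists(bucket_name))

# Usage inside any DAG: create_bucket(bucket_name="climate-data")
```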
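
Finally, a small sketch of the XCom and Variable pattern (the Variable key and return values here are hypothetical): whatever a @task-decorated function returns is pushed to XCom for downstream tasks, while Airflow Variables hold values, such as the city coordinates, that should survive across DAG runs.

```python
from airflow.decorators import task
from airflow.models import Variable

@task
def get_current_weather():
    # Read the stored coordinates; the Variable key here is hypothetical.
    coords = Variable.get("my_city_coordinates", deserialize_json=True, default_var={})
    # Whatever is returned is pushed to XCom and available to downstream tasks.
    return {"coordinates": coords, "temperature_celsius": 21.3}
```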

Have fun!

Use this project and the patterns and features shown within it as a blueprint for your own work or to try out new things (bonus points for sharing what you create). For data professionals new to Airflow, I hope this repo has sparked a desire to learn more and join the great community of Airflow users!

Disclaimer: This project and blog post were created with ❤️ by the DevRel team at Astronomer, the commercial developer behind Apache Airflow. All tools and features shown are fully open source and free for you to use.
