ELT with Airflow: a no-install click-and-play blueprint project running in GitHub Codespaces
I’m in a hurry, where is the code? Here: https://github.com/TJaniF/airflow-elt-blueprint
Are you a data professional who knows their way around Python and wants to get started with Airflow without reading a tutorial or installing any packages? Or have you used Airflow before and want to explore new features in a pre-built sandbox environment? Or are you considering learning Airflow and want to see it do something cool first, without any prior knowledge?
This project was made for you! 😊
The Airflow ELT Blueprint repository contains a fully functional extract-load-transform (ELT) pipeline that ingests climate- and weather-related data from two different sources into MinIO, loads and transforms the data in a DuckDB instance using the Astro Python SDK, and powers a Streamlit app.
To run and explore the project yourself, you only need to follow four simple steps:
Step 1: Create a codespace
Thanks to the work of Faisal Hoda, you can use GitHub Codespaces to run Airflow with the Astro CLI without having to install or download anything locally. Just fork the repo and create a codespace with at least four cores; the Astro CLI will automatically start an Airflow environment.
Step 2: Enter your name and city
Navigate to `include/global_variables/global_variables.py` and provide your values for `MY_NAME` and `MY_CITY`.
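For orientation, the part of the file you need to edit looks roughly like this; the variable names come from the repo, but the example values are placeholders and the real file likely contains additional project-wide settings:

```python
# include/global_variables/global_variables.py (excerpt, sketch only)
MY_NAME = "Ada"    # your name, used by the pipeline and the Streamlit app
MY_CITY = "Bern"   # the city to fetch current weather for
```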
Step 3: Run the pipeline
Once Airflow is running, you can access the Airflow UI at the Local Address of the forwarded port 8080.
In the UI, unpause the DAGs (Directed Acyclic Graphs, Airflow's unit of workflow) by clicking the toggle to the left of each DAG name, starting from the first DAG in the list and ending with the DAG named `start`.
As soon as the `start` DAG runs, the other DAGs will run based on data-driven scheduling using Datasets. In the Datasets tab, you can view the dependencies between DAGs and Datasets.
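If you have not used Datasets before, here is a minimal, self-contained sketch of the mechanism; the DAG ids and the Dataset URI are illustrative, not the blueprint's own:

```python
from pendulum import datetime

from airflow import Dataset
from airflow.decorators import dag, task

# Illustrative Dataset URI; the blueprint defines its own Datasets.
weather_table = Dataset("duckdb://include/dwh/weather_table")

@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def producer():
    @task(outlets=[weather_table])
    def update_weather_table():
        ...  # any task that writes the data this Dataset represents

    update_weather_table()

# No time-based schedule here: this DAG runs as soon as the producing
# task above completes successfully and thereby updates the Dataset.
@dag(start_date=datetime(2023, 1, 1), schedule=[weather_table], catchup=False)
def consumer():
    @task
    def use_weather_table():
        ...

    use_weather_table()

producer()
consumer()
```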
Step 4: View the Streamlit app
The last DAG to run is the `run_streamlit_app` DAG. It stays in a running state while serving a Streamlit app built on the data the pipeline ingested and transformed.
Open the Local Address of the forwarded port 8501 to view the interactive Streamlit app.
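The app itself is a regular Streamlit script querying DuckDB. A minimal sketch of that pattern, assuming a reporting table in a local DuckDB file; the path, table, and column names here are placeholders, not the repo's actual ones:

```python
import duckdb
import streamlit as st

st.title("Climate and weather report")

# Path and table name are illustrative placeholders.
conn = duckdb.connect("include/dwh", read_only=True)
df = conn.execute("SELECT date, avg_temperature FROM reporting_table").df()

st.line_chart(df, x="date", y="avg_temperature")
```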
(Some of the) Airflow features shown
The blueprint repository uses a variety of Airflow features:
- Datasets: as explained above, Datasets are used to create data-driven dependencies between DAGs.
- Dynamic task mapping: Several tasks in this repository are dynamically mapped, adjusting the number of mapped task instances to the inputs present at runtime (see the first sketch after this list).
- Astro Python SDK: The `create_reporting_table` DAG uses the `transform` operator of the Astro Python SDK, which makes creating a reporting table from a SQL SELECT statement simple (see the second sketch after this list).
- Custom operators and hooks: When interacting with MinIO, the blueprint repository uses custom MinIO operators, stored locally in `include/custom_operators/minio.py`. The operators use a custom Airflow hook to interact with MinIO using credentials saved in an Airflow connection.
- Custom reusable task group: The pattern of checking whether a MinIO bucket with a specific name already exists and, if not, creating it occurs several times in this repository, making it a great use case for a reusable task group instead of redefining the tasks every time. Explore the task group code at `include/custom_task_groups/create_bucket.py`.
- Airflow XCom and Airflow Variables: Small amounts of data, like the current weather in the user-defined city, are passed from one task to another using XCom. The coordinates of the city are saved as an Airflow Variable.
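As a taste of dynamic task mapping, here is a minimal, self-contained sketch; the DAG id, task names, and city list are illustrative, not the blueprint's actual tasks:

```python
from pendulum import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def dynamic_mapping_example():
    @task
    def get_cities():
        # In practice this list would be determined at runtime,
        # e.g. from an API response or a bucket listing.
        return ["Bern", "Hanoi", "Quito"]

    @task
    def fetch_weather(city: str):
        print(f"fetching weather for {city}")

    # .expand() creates one mapped task instance per list element.
    fetch_weather.expand(city=get_cities())

dynamic_mapping_example()
```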
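And a minimal sketch of the Astro Python SDK `transform` pattern; the table names, connection id, and SQL are placeholders, not the blueprint's actual query:

```python
from astro import sql as aql
from astro.table import Table

# @aql.transform turns the returned SELECT into a new table; the {{ }}
# placeholders resolve to the tables passed in as arguments.
@aql.transform
def create_reporting_table(in_climate: Table):
    return """
        SELECT date, AVG(temperature) AS avg_temperature
        FROM {{ in_climate }}
        GROUP BY date;
    """

# Inside a DAG, the call would look roughly like this (placeholder names):
# create_reporting_table(
#     in_climate=Table(name="climate_data", conn_id="duckdb_default"),
#     output_table=Table(name="reporting_table"),
# )
```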
Have fun!
Use this project and the patterns and features shown within it as a blueprint for your own work or to try out new things (bonus points for sharing what you create). For data professionals new to Airflow, I hope this repo has sparked a desire to learn more and join the great community of Airflow users!
Disclaimer: This project and blog post were created with ❤️ by the DevRel team at Astronomer, the commercial developer behind Apache Airflow. All tools and features shown are fully open source and free for you to use.