Automating An ETL Pipeline With Apache Airflow

David K
5 min read · Jun 6, 2023


Prerequisites: Docker and PostgreSQL

Introduction:

Hey All! This article is a continuation of my previous article on building an ETL pipeline here. Today, we’ll focus on automating the data pipeline we created, so that we no longer have to run the code manually to keep the pipeline going.

In today’s data-driven world, automating the extraction, transformation, and loading (ETL) process is crucial for efficient data workflows. A powerful tool that simplifies ETL automation is Apache Airflow. In this article, we will explore what Airflow is, how it works, and the concept of Directed Acyclic Graphs (DAGs) that underpins its functionality.

What is Apache Airflow?
Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows. It allows you to define ETL pipelines as DAGs, where tasks and their dependencies are represented as nodes and edges in a graph-like structure.

How does Airflow work?
Airflow consists of several key components:

1. Scheduler: The scheduler takes care of triggering tasks based on their dependencies and schedules defined in the DAGs.

2. Executors: Executors run tasks based on the scheduler’s instructions. Airflow supports several executor types (for example, the SequentialExecutor, LocalExecutor, and CeleryExecutor), which determine how and where tasks are executed.

3. Metadata database: Airflow relies on a metadata database, such as PostgreSQL or MySQL, to store task execution metadata, DAG definitions, and other related information.

4. Web UI: Airflow provides a user-friendly web interface that allows users to monitor and manage workflows, view task logs, and access other administrative features.

Understanding Directed Acyclic Graphs (DAGs):
DAGs are at the core of Airflow’s workflow management. A DAG represents a collection of tasks and their dependencies, forming a directed graph structure. Each task represents a unit of work, and the dependencies define the order in which tasks should be executed. By defining tasks and their dependencies within a DAG, you can orchestrate complex ETL processes and automate data workflows efficiently.

To understand a DAG better, consider the diagram below, which represents a Python script with three functions and illustrates the dependencies between them. In simple terms, the data transformation function depends on the fetch-data function, and saving data to the database depends on the data transformation function. These three functions, represented as a directional flow diagram, are what Airflow calls a DAG. If an earlier step fails, the subsequent steps cannot complete because they depend on it. Next, let’s see what a DAG looks like in code and how it is written and structured.
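As a preview, here is a minimal sketch of what such a three-task DAG could look like. The dag_id, schedule, and function bodies below are placeholders for illustration, not the exact code from the project repository:

```python
# A minimal sketch of the fetch -> transform -> save DAG described above.
# The dag_id, schedule, and function bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_data():
    ...  # e.g. call the jokes API and hand the raw payload to the next task


def transform_data():
    ...  # e.g. clean and reshape the fetched payload


def save_to_db():
    ...  # e.g. insert the transformed rows into PostgreSQL


with DAG(
    dag_id="etl_pipeline_sketch",        # placeholder name
    start_date=datetime(2023, 6, 1),
    schedule_interval="*/2 * * * *",     # run every 2 minutes
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    save = PythonOperator(task_id="save_to_db", python_callable=save_to_db)

    # Dependencies: fetch -> transform -> save
    fetch >> transform >> save
```

The `>>` operator expresses the edges of the graph: it tells the scheduler that `transform_data` may only start after `fetch_data` succeeds, and `save_to_db` only after `transform_data`.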

Getting Started With Airflow:

To get a better understanding of Airflow and see it in action, let’s get our hands dirty with a small project, shall we?

1. Install Docker: Ensure you have Docker installed on your machine. Docker allows you to encapsulate and run applications in isolated containers, providing a consistent environment for Airflow. Please install Docker here.

2. Clone the GitHub Repository: Visit my GitHub repository here and clone it to your local machine. This repository contains an example ETL pipeline implemented with Airflow. You can use GitHub Desktop to clone it easily; please refer to the GitHub documentation here for how to do this.

3. Open PostgreSQL and run a quick select query to make sure the table created in part one (1) of this article still exists. Keep PostgreSQL up and running, as we will need it to confirm that the table is being populated automatically.

4. Modify the database config files: The cloned repository contains two credential files: one in the Dags folder named “db_config_airflow.txt” and one at the top level of the directory named “postgres_db_credentials.txt”. Open these files and set your database credentials. Please go through the Readme file, as it contains very important information about the project. Because Airflow will run inside Docker for this project, the localhost value in the “db_config_airflow.txt” file should be set to host.docker.internal.

NB: host.docker.internal allows a Docker container to reach the machine it is running on. This is pivotal here because we are not working with an online database, but with one hosted locally on that machine.
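To make this concrete, here is a small connectivity sketch showing how code running inside a container would reach the locally hosted database. The database name, user, and password are placeholders for whatever you set in your credential files:

```python
# Connectivity check from *inside* a Docker container: note the host value.
# dbname, user, and password are placeholders for your own credentials.
import psycopg2

conn = psycopg2.connect(
    host="host.docker.internal",  # resolves to the machine running Docker
    port=5432,
    dbname="your_database",
    user="your_user",
    password="your_password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])      # prints the PostgreSQL server version
conn.close()
```

Run from your host machine instead of a container, the same code would use host="localhost".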

5. Start Airflow with Docker Compose: Navigate to the repository’s directory and use Docker Compose to start the Airflow environment. To do this, open a command prompt and change into the directory of the cloned GitHub repository on your machine. Then make sure Docker is up and running. Once it is, run the command “docker-compose up”. See the image below to make sure you’re on the right path.

The text highlighted in yellow gives a pictorial view of step 5. After hitting enter, you should see output like that in the picture above, showing that Airflow is running.

6. Explore the DAG: Once the Airflow environment is up and running, access the Airflow web UI by typing localhost:8080 into any web browser. It should load a page like the ones in the images below. To run the DAG, first toggle the on/off button to turn the DAG on, then select the “play” button. Please have a look at the images below to make sure we’re on the same page.

Toggle the on/off button to turn on the DAG
Click the play button (I have surrounded it by a red box to make it easily noticeable) to start running the DAG

After a few seconds or minutes, you should see your PostgreSQL database being incrementally populated with jokes from the jokes API every 2 minutes. You can simply run a select * query on the table to confirm this. Voila!
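If you prefer to confirm this programmatically rather than from a SQL client, here is a small sketch that counts the rows twice, just over one scheduling cycle apart. The table name jokes and the credentials are placeholders for your own setup from part one:

```python
# Watch the table grow from your host machine.
# The table name and credentials are placeholders for your own setup.
import time

import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="your_database", user="your_user", password="your_password",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM jokes;")   # placeholder table name
    before = cur.fetchone()[0]
    time.sleep(150)                              # just over one 2-minute run
    cur.execute("SELECT COUNT(*) FROM jokes;")
    after = cur.fetchone()[0]
conn.close()
print(f"rows before: {before}, rows after: {after}")
```

If the DAG is running correctly, the second count should be larger than the first.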

Key Points To Note:

It is worth noting that the docker-compose file references a GitHub repository which provides a Docker image for running Apache Airflow. That Docker image is based on the official python:3.7-slim-buster image, which includes a minimal Python installation. The essential libraries this project needs from the Python data ecosystem are already included in the image, so we don’t have to worry about installing them manually.

However, if you ever need to install an additional library for your own project, you can do so by adding a requirements.txt file to the same directory as the docker-compose file and modifying the docker-compose file accordingly. If you are a beginner, don’t stress about understanding it all at once. Just keep exploring incrementally, and in no time you’ll master this whole Airflow world of Docker images, DAGs, and Docker Compose files. For now, focus on understanding the foundational concepts of Airflow, how it works, and the implementation of this Airflow project. Remember to have fun as you do this!

Conclusion:

In this article, you have learned how to automate an ETL pipeline with Airflow. Keep exploring Airflow to sharpen your knowledge and skills and to delve deeper into its capabilities. Happy automating!

Congratulations on making it this far. Connect with me on LinkedIn here, follow me on GitHub here, and feel free to leave any questions in my inbox.
