Hello World! using Apache-Airflow

(An Illustration of the Apache-Airflow Fundamentals)

Mukesh Kumar
Accredian
5 min read · Aug 17, 2022


Written in collaboration with Hiren Rupchandani

Preface

In the previous stories, you learned how to set up Airflow on Windows (using WSL), Ubuntu, and macOS. It’s finally time to show you how to create your first DAG in Airflow!

In this story, you will go through some essential concepts to keep in mind while writing a DAG. You will learn how the individual DAG components work, both on their own and together, and how to schedule a DAG to run at a specific time and interval in Airflow.

Creating a Python File

  • Activate your virtual environment and navigate to your Airflow directory containing the dags folder and some other files.
  • Open your favorite editor and create a new file with the name “hello_world_dag.py”.

Importing the modules

  • To create a proper pipeline in Airflow, we need to import the “DAG” class and the “PythonOperator” from the “airflow.operators.python” module of the airflow package.
  • We will also import the “datetime” module to schedule the DAGs.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

Creating a DAG object

  • Next, we will instantiate a DAG object to nest the tasks of the pipeline. We pass a “dag_id” string, which is the unique identifier of the DAG.
  • We recommend keeping the python file name and the dag_id the same, so we will assign the “dag_id” as “hello_world_dag”.
  • We will also set a “start_date” parameter, which indicates the timestamp from which the scheduler will attempt to backfill.
  • It is followed by a “schedule_interval” parameter, which indicates the interval between subsequent DAG Runs created by the scheduler. It takes a “datetime.timedelta” object or a cron expression. Airflow also provides a few cron presets, such as ‘@hourly’, ‘@daily’, and ‘@yearly’; you can read more about them in the Airflow documentation. (A timedelta equivalent is sketched after the code below.)
  • If the “start_date” is set to January 1, 2021, with an hourly “schedule_interval”, the scheduler will create a DAG Run for every hour from that date until the present hour or until the optional “end_date” is reached. This behavior is called catchup, and we can turn it off by setting the “catchup” parameter to False.
  • After setting these parameters, our DAG initialization should look like this:
with DAG(dag_id="hello_world_dag",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@hourly",
         catchup=False) as dag:
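
If you prefer an explicit interval over a cron preset, the same schedule can be expressed with a “datetime.timedelta” object; a minimal equivalent sketch (only the schedule_interval line changes):

from datetime import timedelta

with DAG(dag_id="hello_world_dag",
         start_date=datetime(2021, 1, 1),
         schedule_interval=timedelta(hours=1),  # equivalent to "@hourly"
         catchup=False) as dag:
    ...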

Creating a Task

  • According to the Airflow documentation, an object instantiated from an operator is called a task. Airflow has various types of operators, but for now, we will focus only on the PythonOperator.
  • A PythonOperator is used to call a python function inside your DAG. We will create a PythonOperator object that calls a python function which prints “Hello World” when the task runs.
  • Just as a DAG object has a “dag_id”, a PythonOperator object has an identifier called “task_id”.
  • It also has the “python_callable” parameter, which takes the name of the callable function as its input.
  • After setting the parameters, our task should look like this:
task1 = PythonOperator(
    task_id="hello_world",
    python_callable=helloWorld)

Creating a Callable Function

  • We also need to create a function that will be called by the PythonOperator as shown below:
def helloWorld():
    print("Hello World")

Setting Dependencies

  • We can set the task dependencies by writing the task names along with the >> or << operators to indicate downstream or upstream flow.
  • Since we have a single task here, we don’t need to indicate any flow; we simply write the task name on a line of its own, as shown below.
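
With only one task, the “dependencies” section of the file is just the task name. Purely for illustration (task2 below is hypothetical and not part of this DAG), a downstream dependency between two tasks would be written with the >> operator:

task1

# with a hypothetical second task, a downstream dependency would read:
# task1 >> task2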

Voila, it’s a DAG file

After compiling all the elements of the DAG, our final code should look like this:

A DAG file
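
Assembled from the snippets above, the complete hello_world_dag.py reads roughly as follows (a minimal sketch of the file shown in the image above; note that helloWorld must be defined before the PythonOperator references it):

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime


def helloWorld():
    print("Hello World")


with DAG(dag_id="hello_world_dag",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@hourly",
         catchup=False) as dag:

    task1 = PythonOperator(
        task_id="hello_world",
        python_callable=helloWorld)

task1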

Execution of the DAG in Webserver UI

  • To see the file running, activate the virtual environment and start your Airflow webserver and scheduler.
  • Go to http://localhost:8080/home (or your dedicated Airflow port), and you should see the new DAG listed in the webserver UI.
  • The DAG should run successfully. You can check the graph view or the tree view by hovering over Links and selecting the Graph or Tree option.
Graph View of the DAG
  • You can also view the task’s execution information using logs. To do so, click on the task, which will lead you to the following dialog box:
Task Information
  • Next, click on the Log button and you will be redirected to the task’s log.
Task Log

Congratulations! We have built our first DAG using Airflow. In the coming stories, we will show you how to create a proper DAG with multiple tasks and set dependencies between them.

Final Thoughts and Closing Comments

There are some vital points many people overlook while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, as it covers your foundations plus machine learning algorithms (basic to advanced).

And that’s it. I hope you liked this explanation of Hello World! using Apache-Airflow and learned something valuable. Please let me know in the comment section if you have anything to share with me. I would love to know your thoughts.

Follow me for more forthcoming articles based on Python, R, Data Science, Machine Learning, and Artificial Intelligence.

If you found this read helpful, hit the Clap 👏. Your encouragement will inspire me to keep going and develop more valuable content.

