Automate AWS Tasks Thanks to Airflow Hooks
This article is a step-by-step tutorial that will show you how to upload a file to an S3 bucket using an Airflow ETL (Extract Transform Load) pipeline. ETL pipelines are defined by a set of interdependent tasks.
A task might be "download data from an API" or "upload data to a database", for example. A dependency would be "wait for the data to be downloaded before uploading it to the database". After an introduction to ETL tools, you will discover how to upload a file to S3 using boto3.
A bit of context around Airflow
Airflow is a platform composed of a web interface and a Python library. The project was started by Airbnb in January 2015 and has been incubated by The Apache Software Foundation since March 2016. The Airflow community is really active, with more than 690 contributors and over 10k stars on the repository.
Also, The Apache Software Foundation recently announced Airflow as a top-level project. This gives us a measure of the community's and project management's health so far.
Build your pipeline step by step
Step 1 : Install Airflow
As with any Python project, create a folder for your project and a virtual environment.
# Create your virtual environment
virtualenv venv
source venv/bin/activate

# Create your Airflow project folder
mkdir airflow_project
cd airflow_project
You also need to export an additional environment variable, as mentioned in the November 21st announcement.
export SLUGIFY_USES_TEXT_UNIDECODE=yes
Finally, run the commands from the Getting Started section of the documentation, pasted below.
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install apache-airflow
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
# visit localhost:8080 in the browser and enable the example dag in the home page
Congratulations! You now have access to the Airflow UI at http://localhost:8080 and you are all set to begin this tutorial.
Note: The Airflow home folder will be used to store important files (configuration, logs, and database, among others).
Step 2 : Build your first DAG
A DAG is a Directed Acyclic Graph that represents the chaining of your workflow's tasks. Here is the first DAG you are going to build in this tutorial.
On this schematic, we see that the task upload_file_to_S3 may be executed only once dummy_start has been successful.
Note: Our ETL is composed of only the L (Load) step in this example.
As you can see in $AIRFLOW_HOME/airflow.cfg, the value of the dags_folder entry indicates that your DAG must be declared in the folder $AIRFLOW_HOME/dags. We will call the file in which we are going to implement our DAG upload_file_to_S3.py:
# Create the folder containing your DAGs definition
mkdir airflow_home/dags

# Create your DAG definition file
touch airflow_home/dags/upload_file_to_S3.py

# Then open this file with your favorite IDE
First, import the required operators from airflow.operators. Then, declare two tasks and attach them to your DAG my_dag via the dag parameter. Using the context manager allows you not to duplicate the dag parameter in each operator. Finally, set a dependency between them with >>.
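As a reference, here is a minimal sketch of what upload_file_to_S3.py might look like. The task names my_dag, dummy_start, and upload_file_to_S3 come from this tutorial; the operator choices (DummyOperator as the starting task, PythonOperator for the upload) and the scheduling arguments are assumptions made for illustration.

```python
# upload_file_to_S3.py -- a minimal sketch of the DAG described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def upload_file_to_s3_callable():
    # Placeholder: the actual upload helper is written in Step 3.
    pass


# The context manager attaches every task declared inside it to my_dag,
# so you do not have to pass dag=... to each operator.
with DAG(
    "my_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
) as dag:
    dummy_start = DummyOperator(task_id="dummy_start")
    upload_file_to_S3 = PythonOperator(
        task_id="upload_file_to_S3",
        python_callable=upload_file_to_s3_callable,
    )

    # upload_file_to_S3 runs only once dummy_start has succeeded
    dummy_start >> upload_file_to_S3
```

Once this file sits in your dags_folder, the scheduler picks it up and my_dag appears in the Airflow UI.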
Now that we have the spine of our DAG, let’s make it useful. To do so, we will write a helper that uploads a file from your machine to an S3 bucket thanks to boto3.
Step 3 : Use boto3 to upload your file to AWS S3
boto3 is a Python library that lets you communicate with AWS. In this tutorial, we will use it to upload a file from your local computer to your S3 bucket.
…
Read the full article on Sicara’s blog here.