A simple guide to start using Apache Airflow 2 on Google Cloud

Antonio Cachuan
Published in Apache Airflow · Mar 29, 2021
Photo by Chris Ried on Unsplash

If you are wondering how to start working with Apache Airflow for small developments or academic purposes, this guide will show you how. Deploying Airflow on a GCP Compute Engine instance (a self-managed deployment) can cost less than you think, while keeping all the advantages of using services like BigQuery or Dataflow.

This scenario assumes you need a stable Airflow instance for a Proof of Concept or as a learning environment.

Why not Cloud Composer?
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. I suggest going with it if you or your team require a full production or development environment. However, Composer requires a minimum of 3 nodes (Compute Engine instances) plus other GCP services, so the billing could be an obstacle if you are just starting your learning path on Airflow.

Important
Google Cloud offers $300 in credits for first-time users.

First, let's go over some concepts and then dive into a simple Airflow deployment on a Compute Engine instance.

Concepts

What is Apache Airflow?

Apache Airflow is an open-source platform, written in Python, to programmatically author, schedule, and monitor workflows.

DAG

A DAG graphically represented in the Airflow web UI (rectangles connected to each other)

A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

Tasks

A Task is a unit of work within a DAG; graphically, it is a node in the DAG. Examples include a PythonOperator, which executes a piece of Python code, or a BashOperator, which executes a Bash command.

What is new in Apache Airflow 2?

Airflow 2 was launched in December 2020 with a bunch of new functionality. Here are some of the most important changes:

  • Full REST API: a new stable REST API that implements CRUD operations, so you can, for example, trigger a DAG run from an external system (see the sketch right after this list).
  • New UI: a refreshed UI with a simple, modern look that also includes an auto-refresh feature.
  • Better Scheduler: a faster Scheduler, plus support for running multiple Schedulers at once.
  • Smart Sensors: a new type of sensor that can reduce the number of occupied workers by over 50%.
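As a quick taste of the new REST API, here is a hedged sketch of triggering a DAG run with curl. It assumes basic authentication is enabled for the API (auth_backend = airflow.api.auth.backend.basic_auth in airflow.cfg) and uses placeholder values for the host, DAG id, and credentials:

# Trigger a DAG run through the Airflow 2 stable REST API (placeholder host, DAG id, and credentials)
curl -X POST "http://COMPUTE-ENGINE-IP:8080/api/v1/dags/example_bash_operator/dagRuns" \
  --user "username:mypassword" \
  -H "Content-Type: application/json" \
  -d '{"conf": {}}'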

For more details and changes regarding authoring DAGs in Airflow 2.0, check out Tomasz Urbaszek’s article for the official Airflow publication, Astronomer’s post, or Anna Anisienia’s article on Towards Data Science.

Deploying Apache Airflow

Remember that this scenario assumes you need a stable Airflow instance for a Proof of Concept or a learning environment.

Important: The following guide is not recommended for production environments. I suggest you visit the official documentation for more details about deploying Airflow in production.

1. Create a Service Account

You’ll need to create a Service Account, so your Airflow instance can access the different GCP services in your project.

First, go to IAM & Admin then Service Accounts

Google Cloud Console entering the Service Accounts menu

Then enter the service account name and click on Create

Setting the name of the service account

Give minimum access to BigQuery with the role of BigQuery Job User and Dataflow with the role of Dataflow Worker.

Adding the BigQuery role to the service account
Adding the Dataflow role to the service account

Click on Done. After that, look for your new Service Account, click the three dots, and go to Manage keys.

Service account page

Click Add Key/Create new key/Done. This will download a JSON file.

Interface for creating a service account json key

Finally, keep this JSON file and rename it to key-file.json. It's the key you will use to work with BigQuery and Dataflow.

A snippet of the service account key in JSON format
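If you prefer the command line, roughly the same setup can be done with gcloud. This is only a sketch: the account name airflow-poc and the PROJECT_ID placeholder are assumptions you should replace with your own values.

# Create the service account (name is an assumption)
gcloud iam service-accounts create airflow-poc --display-name="airflow-poc"
# Grant the minimum roles used in this guide
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:airflow-poc@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:airflow-poc@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/dataflow.worker"
# Download a JSON key for the account
gcloud iam service-accounts keys create key-file.json \
  --iam-account=airflow-poc@PROJECT_ID.iam.gserviceaccount.com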

2. Create a Compute Engine instance

Let’s deploy a Debian instance with the minimum requirements for this case.

  • e2-standard-2 (2vCPU, 8GB memory)
  • Debian 10
  • 50 GB HDD

Additionally, allow HTTPS and HTTP traffic and select the Service Account you created.

Image showing the interface for creating a compute engine
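If you prefer to create the instance with gcloud, the command looks roughly like this; the instance name airflow-poc, the zone, and the service account e-mail are assumptions you should adjust:

# Create a Debian 10 instance matching the specs above (names and zone are placeholders)
gcloud compute instances create airflow-poc \
  --zone=us-central1-a \
  --machine-type=e2-standard-2 \
  --image-family=debian-10 \
  --image-project=debian-cloud \
  --boot-disk-size=50GB \
  --boot-disk-type=pd-standard \
  --service-account=airflow-poc@PROJECT_ID.iam.gserviceaccount.com \
  --scopes=cloud-platform \
  --tags=http-server,https-server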

3. Installing Airflow

On the console click on SSH to start a terminal.

Google Cloud Interface showing a compute engine instance

In the terminal, let's update the package index and install Python

sudo apt update
sudo apt -y upgrade
sudo apt-get install wget
sudo apt install -y python3-pip
Installing python on the terminal

I’ll use Miniconda to create a virtual environment

mkdir -p ~/miniconda3 
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Close your terminal, and open a new one.
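Alternatively, you can reload your shell configuration to pick up the changes conda init made without reopening the terminal:

# Reload bash configuration in the current session
source ~/.bashrc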

Create the virtual environment and activate it

mkdir airflow-medium
cd airflow-medium
pwd # important: note this path for the AIRFLOW_HOME variable
export AIRFLOW_HOME=/home/acachuan/airflow-medium
conda create --name airflow-medium python=3.8
conda activate airflow-medium
Installing miniconda

Install Airflow and extra libraries

AIRFLOW_VERSION=2.0.1
PYTHON_VERSION=3.8
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow[gcp]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
pip install pyspark==2.4.5
pip install cryptography==2.9.2
Compute Engine terminal
airflow version
Compute engine console

4. First-time setup

You only need to do this once: initialize the metadata database and register at least one admin user.

airflow db init
airflow users create -r Admin -u username -p mypassword -e example@mail.com -f yourname -l lastname
Installing Airflow using the terminal

5. First execution

Yes! We are close to having an Airflow instance on GCP. We just need to whitelist our IP for port 8080 with a firewall rule.

Go to Firewall/Create Firewall Rule

Google Cloud Console: accessing VPC network and then Firewall

Create the airflow-port rule

The interface for creating the firewall rule

Go to Compute Engine/VM instances and click on airflow-poc

The Google Cloud Console showing the Compute Engine created

Add the firewall rule's network tag to the instance

Opening the airflow port
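If you prefer the command line, a rough equivalent is below; the rule name, the airflow-port network tag, the zone, and the source IP are assumptions you should adjust:

# Open port 8080 only for your own IP (placeholder) and tie the rule to a network tag
gcloud compute firewall-rules create airflow-port \
  --network=default \
  --allow=tcp:8080 \
  --source-ranges=YOUR.PUBLIC.IP/32 \
  --target-tags=airflow-port
# Attach the tag to the instance so the rule applies to it
gcloud compute instances add-tags airflow-poc \
  --tags=airflow-port \
  --zone=us-central1-a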

Go back to the terminal and start the Web Server

airflow webserver -p 8080

Open another terminal and start the Scheduler

export AIRFLOW_HOME=/home/acachuan/airflow-medium 
cd airflow-medium
conda activate airflow-medium
airflow db init
airflow scheduler
Running the Airflow scheduler in the terminal

Go to your Google Cloud Console and copy the external IP to your clipboard.

Finally, in your browser go to http://COMPUTE-ENGINE-IP:8080 and log in with the user and password you created when the database was initialized.

Login on the Airflow web site

It’s done! Our Airflow 2 instance is running!

Airflow web site

6. Next executions

For future executions, we want Airflow to start right after the Compute Engine instance starts.

Create a Cloud Storage Bucket

The Google Cloud Storage user interface
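You can also create the bucket from the terminal; BUCKET-NAME is a placeholder for a globally unique name of your choice:

# Create the bucket that will hold the start script
gsutil mb gs://BUCKET-NAME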

Create a start script, upload it to the bucket, and keep it there as a backup

#!/bin/bash
# Adjust the paths below to your own home directory
export AIRFLOW_HOME=/home/antoniocachuan/airflow-medium
cd /home/antoniocachuan/airflow-medium
# Make conda available in this non-interactive shell before activating the environment
source /home/antoniocachuan/miniconda3/etc/profile.d/conda.sh
conda activate airflow-medium
# Start the scheduler and webserver in the background, logging to files
nohup airflow scheduler >> scheduler.log 2>&1 &
nohup airflow webserver -p 8080 >> webserver.log 2>&1 &
Airflow init script uploaded to Google Cloud Storage
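To upload the script to the bucket as a backup (assuming it is saved locally as airflow-start.sh and the bucket name is the placeholder above):

gsutil cp airflow-start.sh gs://BUCKET-NAME/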

Copy your start script from the bucket (you only need to do this once).

gsutil cp gs://BUCKET-NAME/airflow-start.sh .

Now, each time you need to start your server, just run it.
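For example, assuming the script was copied to the current directory as airflow-start.sh:

# Make the script executable and run it
chmod +x airflow-start.sh
./airflow-start.sh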

Starting Airflow on the terminal

7. Set up access to GCP resources

It’s time to upload our key-file.json to the instance and move it to a location such as:

/home/antoniocachuan/airflow-medium/secure/key-file.json
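One way to get the key from your local machine onto the instance is gcloud compute scp; the instance name, zone, and destination directory below are assumptions, and the secure/ directory must already exist on the instance:

# Copy the key file to the instance over SSH (placeholder instance name and zone)
gcloud compute scp key-file.json airflow-poc:/home/antoniocachuan/airflow-medium/secure/ --zone=us-central1-a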

Then set up the connection on the Airflow website

The Airflow web UI: clicking on Admin and then Connections

Complete the connection with the path to your key file and your GCP project ID, then click Save.

The user interface for Google Cloud Connection on the Airflow website.
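If you want to double-check from the VM terminal, the Airflow CLI can list the registered connections and the DAGs it can see:

airflow connections list
airflow dags list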

That's all! You have a basic Airflow environment ready to orchestrate processes on BigQuery or Dataflow.

Conclusions

Apache Airflow is a fantastic orchestration tool, and deploying it on GCP gives you the power to interact with services like BigQuery and Dataproc.

Caveats
As I said at the beginning, this article is for development or simple environments. As you develop more pipelines, Airflow will need more resources, which can lead to scaling problems.

On the other hand, since you only use a Compute Engine instance and don't need to keep the machine running all day, the billing is cheaper: around $25 per month in this case, versus roughly $300 minimum for Cloud Composer.

PS: if you have any questions or would like something clarified, ping me on LinkedIn. I like having a data conversation 😊

Antonio Cachuan

Google Cloud Professional Data Engineer (2x GCP). When code meets data, success is assured 🧡. Happy to share code and ideas 💡 linkedin.com/in/antoniocachuan/