Airflow on GCP (May 2020)

Jostein Leira · Published in Compendium · May 15, 2020

This is a complete guide to install Apache Airflow on a Google Cloud Platform (GCP) Virtual Machine (VM) from scratch.

An alternative is to use Cloud Composer, the managed version that Google offers.

This is an updated version of my post on installing Airflow on a GCP VM published in April 2019. A year has passed and I have picked up some new tricks since then. New versions of Airflow, Ubuntu, PostgreSQL, and Python have arrived, and I have decided to replace Nginx with Google’s load balancer.

Modules

  • Apache Airflow — version 1.10.10
  • PostgreSQL 10 — Managed Cloud SQL version
  • Python 3.7 — The latest version that Airflow works with.
  • Ubuntu 20.04 LTS
  • Google’s load balancer with managed HTTPS certificate
  • Systemd services — Automatic startup

Architecture

A sketch of the components used and their connections. Static IP, load balancer, HTTPS certificate, VM, and Cloud SQL.

The core of the system is the VM, on which we install Airflow. We use a Cloud SQL instance as our database.

The load balancer is only used as an HTTPS terminator, serving and automatically renewing certificates. If you prefer to use Nginx with Let’s Encrypt instead, see Installing Airflow on a GCP VM (April 2019).

To connect the Cloud load balancer to our virtual machine, the VM must reside in an instance group; a GCP load balancer can only be attached to an instance group, not directly to a specific VM.

Installation overview

  1. Network setup (virtual private cloud)
  2. Virtual Machine
  3. Firewall rules (allow ssh access)
  4. Instance group
  5. Install the required packages
  6. Installing the database (Cloud SQL)
  7. Configure Airflow
  8. Automatic startup
  9. Static IP address
  10. Load balancer

Network setup

If you already have your infrastructure in place or just want to use the default network, skip this part.

The reason for setting up a new Virtual Private Cloud (network) is to isolate our VM.

Screenshot of the Google console, showing VPC networks and the “Create VPC network” button.
Google Cloud Console — VPC network — Create VPC Network

In the GCP Console, choose VPC Networks and + Create VPC Network.

Partial screenshot of Create a VPC network form. Enter airflow-network in the Name field.
Create a VPC network — Enter a name

Name your new network airflow-network.

Partial screenshot of Create a VPC network form. New subnet. Name: airflow-subnet. Region: europe-west1. IP address range: 10.0.0.0/8
Create a VPC network — New subnet

I have used the names airflow-network and airflow-subnet. Names do not matter, as long as you are able to identify them later.

Select a region near your end-users. Be aware that there are two network tiers: Standard and Premium. Standard is optimized for cost and Premium for speed. Standard is unavailable in some regions. See https://cloud.google.com/network-tiers/docs/overview.

Screenshot of the newly created network and subnet.
The new airflow-network and airflow-subnet.
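If you prefer the command line, roughly the same network can be created with the gcloud CLI. This is only a sketch; adjust the region and IP range to match what you entered in the console:

gcloud compute networks create airflow-network --subnet-mode=custom
gcloud compute networks subnets create airflow-subnet \
    --network=airflow-network \
    --region=europe-west1 \
    --range=10.0.0.0/8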

Virtual Machine

Create a new Virtual Machine from the “Compute Engine” | “VM instances” menu with the following settings:

Name: airflow-vm
Region/Zone: europe-west1/europe-west1-b (same region as your network)
Machine type: g1-small (it is easy to change later if you need bigger)
Boot disk: Ubuntu 20.04 LTS minimal image
Disk size: 10 GB (you may increase disks later, but never shrink)
Firewall: Allow HTTP traffic (will change this later)
Networking: Select the airflow-network and airflow-subnet

Create an instance form
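For reference, roughly the same VM can be created with gcloud. This is a sketch: the Ubuntu 20.04 LTS minimal image family name and the http-server tag (what the console’s “Allow HTTP traffic” checkbox adds) are my assumptions.

gcloud compute instances create airflow-vm \
    --zone=europe-west1-b \
    --machine-type=g1-small \
    --image-family=ubuntu-minimal-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=10GB \
    --subnet=airflow-subnet \
    --tags=http-server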

Firewall rules

For administration and installation, I need SSH access to the new VM. Since we created our own custom VPC, we need to add a firewall rule to allow this access.

Choose “VPC network” | “Firewall rules” | “Create Firewall rule”.

Use:
Name: allow-ssh-access
Network: airflow-network
Targets: All instances in the network
Source IP-ranges: 0.0.0.0/0 (all)
Specified protocols and ports: tcp 22

Hit the “Create” button. You may later disable/enable this rule depending on access needs.

The plan is to serve Airflow on TCP port 8080. During setup and testing, I modify the existing firewall rule airflow-network-allow-http and add port 8080.

Partial screenshot of modifying existing firewall rule. Add port 8080.
Modify the existing firewall rule. Add TCP port 8080.

Create another firewall rule to allow traffic from the load balancer to the backend. Traffic from the load balancer may come from addresses in the IP ranges 130.211.0.0/22 and 35.191.0.0/16.

You should end up with the following rules:

Screenshot of new and modified firewall rules.
Resulting firewall rules for the airflow-network.

For now, both airflow-network-allow-http and allow-loadbalancer overlap, since both allow traffic to port 8080 on the VM. The idea is to disable airflow-network-allow-http when we are finished.
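For reference, the SSH rule, the modified HTTP rule, and the load-balancer rule can be expressed with gcloud roughly like this (a sketch; the rule names follow the screenshots above):

gcloud compute firewall-rules create allow-ssh-access \
    --network=airflow-network --allow=tcp:22 --source-ranges=0.0.0.0/0
gcloud compute firewall-rules update airflow-network-allow-http \
    --allow=tcp:80,tcp:8080
gcloud compute firewall-rules create allow-loadbalancer \
    --network=airflow-network --allow=tcp:8080 \
    --source-ranges=130.211.0.0/22,35.191.0.0/16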

Instance group

We create an unmanaged instance group and add our newly created VM. Select “Create instance group” under the menu element “Compute Engine” | “Instance groups”:

Enter
Name: airflow-instance-group
Region/zone: europe-west1/europe-west1-b
Network/subnet: airflow-network/airflow-subnet
Expand “Specify port name mapping” (see below)
VM instances: airflow-vm

Screenshot of “Create instance group” form.
Compute Engine | Instance groups | Create instance group

Expand the “Specify port name mapping” section and define the port name “http8080” with port number “8080”:

Partial screenshot of expanded form to specify port name.
Specify a new port name “http8080”.

This port name is used by the load balancer later.

Hit the “Create” button. You should end up with a new instance group containing one instance.

Screenshot of the result after creating a new instance group.
Result after creating a new instance group.
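The gcloud equivalent of the steps above looks roughly like this (a sketch):

gcloud compute instance-groups unmanaged create airflow-instance-group \
    --zone=europe-west1-b
gcloud compute instance-groups unmanaged add-instances airflow-instance-group \
    --zone=europe-west1-b --instances=airflow-vm
gcloud compute instance-groups set-named-ports airflow-instance-group \
    --zone=europe-west1-b --named-ports=http8080:8080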

Install the required packages

The commands below install the needed packages, create a Python virtual environment under /srv, create an airflow user that the server will run as, and set the owner and permissions.

sudo su
apt update
apt upgrade
apt install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt install python3.7 python3.7-venv python3.7-dev
adduser airflow --disabled-login --disabled-password --gecos "Airflow system user"
cd /srv
python3.7 -m venv airflow
cd airflow
source bin/activate
# With an activated virtual environment
pip install --upgrade pip
pip install wheel
pip install apache-airflow[postgres,crypto]==1.10.10
chown -R airflow:airflow .
chmod -R g+rwx .
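As a quick sanity check, still inside the activated virtual environment, you can confirm which Airflow version pip installed:

pip show apache-airflow
# Name: apache-airflow
# Version: 1.10.10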

Installing the database

We use a managed Cloud SQL database without a public IP address. To keep costs down, we choose the minimum resources (1 CPU, 3.75 GB RAM, 10 GB HDD).

We want to end up with:

  • database instance (server) with instance ID: airflow-db
  • database name: airflow
  • database user name: airflow-user
  • database password for the airflow-user: <db-password>
  • database IP address: <db-server-ip>

From the Google console, go to “Storage” | “SQL” | “Create new instance” and then “Choose PostgreSQL”.

Choose “Private IP” and associate it with the airflow-network. Hit “Allocate and connect” before hitting the “Create” button at the bottom of the form.

Wait for the new database instance to be created.

When the new database instance is up and running, you should create a new database airflow and a new user airflow-user. Enter the airflow-db instance and select “Databases” and then “Create database”.

Enter “airflow” in the popup and hit the create button.

Then create a new user. Select the menu item Users, then the “Create user account” button.

Enter the user name airflow-user and choose a new password, referred to as <db-password> below.
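For reference, roughly the same instance, database, and user can be created with gcloud. This is a sketch: creating an instance with a private IP needed the beta track at the time, the tier name is my assumption for 1 CPU / 3.75 GB RAM, and the private services connection (the “Allocate and connect” step) must already be in place.

gcloud beta sql instances create airflow-db \
    --database-version=POSTGRES_10 \
    --tier=db-custom-1-3840 \
    --region=europe-west1 \
    --storage-type=HDD --storage-size=10GB \
    --network=airflow-network --no-assign-ip
gcloud sql databases create airflow --instance=airflow-db
gcloud sql users create airflow-user --instance=airflow-db --password=<db-password>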

At this point, you should be able to connect to the new database. You need to locate the “Private IP address” from the database instance details overview pane.

$ psql -h <db-server-ip> -U airflow-user -d airflow
Password for user airflow-user: <db-password>
psql (12.2 (Ubuntu 12.2-4), server 11.6)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384)
Type "help" for help
airflow=>

Configure Airflow

Running airflow initdb will create the config files and a metadata database. Since we are using a PostgreSQL database, we have to modify the database connection string in the config file afterward and then rerun the airflow initdb command.

sudo su airflow
cd /srv/airflow
source bin/activate
export AIRFLOW_HOME=/srv/airflow
airflow initdb

Review the /srv/airflow/airflow.cfg file and change the following:

# This is not the complete airflow.cfg,
# but only the configuration values you need to change
sql_alchemy_conn = postgresql+psycopg2://airflow-user:<db-password>@<db-server-ip>/airflow
default_impersonation = airflow
load_examples = False
rbac = True
enable_proxy_fix = True

After editing the config file, initialize the PostgreSQL database (with the virtual environment activated):

airflow initdb

Add a new Airflow admin user (with the virtual environment activated); you will be prompted for a password:

airflow create_user -r Admin -u jon -e jon@example.com -f Jon -l Snow

You can now start the airflow webserver:

airflow webserver -p 8080

You may now access the webserver on http://<VM’s External IP>:8080
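A quick way to verify the webserver from the command line is Airflow’s /health endpoint, which is served without login (the output shown is approximate):

curl http://<VM’s External IP>:8080/health
# {"metadatabase": {"status": "healthy"}, "scheduler": {...}}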

You may start the scheduler with the command:

airflow scheduler

Automatic startup

Set up the Airflow webserver and scheduler with systemd to allow for automatic start on server boot.

[Unit]
Description=Airflow webserver daemon
After=network.target
[Service]
Environment="PATH=/srv/airflow/bin"
Environment="AIRFLOW_HOME=/srv/airflow"
User=airflow
Group=airflow
Type=simple
ExecStart=/srv/airflow/bin/airflow webserver -p 8080 --pid /srv/airflow/webserver.pid
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

Create a file airflow-webserver.service with the content listed above. Then run the following commands:

sudo cp airflow-webserver.service /lib/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl start airflow-webserver.service
sudo systemctl status airflow-webserver.service

Do the same with airflow-scheduler.service:

[Unit]
Description=Airflow scheduler daemon
After=network.target
[Service]
Environment="PATH=/srv/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
Environment="AIRFLOW_HOME=/srv/airflow"
User=airflow
Group=airflow
Type=simple
ExecStart=/srv/airflow/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target

and run the commands:

sudo cp airflow-scheduler.service /lib/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler.service
sudo systemctl status airflow-scheduler.service

You should then be able to start and stop the services with the systemctl command:

sudo systemctl start airflow-webserver
sudo systemctl stop airflow-webserver
sudo systemctl start airflow-scheduler
sudo systemctl stop airflow-scheduler

Reboot your server and check that both the Airflow scheduler and the webserver start automatically.

sudo reboot now
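If one of the services does not come back up after the reboot, the systemd journal is the place to look:

sudo journalctl -u airflow-webserver -n 50 --no-pager
sudo journalctl -u airflow-scheduler -n 50 --no-pager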

Static IP address

Reserve an external static IP address. This is the address users will connect to from the internet, and we need to set up a DNS A record pointing to it. The IP address will be the front end of the load balancer and is not directly attached to the server.

In the Google console go to “VPC Networks” | “External IP addresses” | “Reserve static address”. Enter a name (airflow-static-ip) and region (europe-west1). Do not attach the IP address to anything yet. We will attach it to the load balancer later.

Hit the “Reserve” button.
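The gcloud equivalent of the reservation, including printing the address you need for the DNS record (a sketch; note that a Premium-tier global HTTPS frontend would instead need a global address created with --global):

gcloud compute addresses create airflow-static-ip --region=europe-west1
gcloud compute addresses describe airflow-static-ip \
    --region=europe-west1 --format="value(address)"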

Create a DNS A-record pointing to the static IP address created above. This is done in your DNS provider’s setup. I’m using the DNS name airflow.leira.net:

You should, of course, use your own IP address and DNS name. Please note that the DNS A record must be in place before we can create an HTTPS certificate.

Check that the DNS record is in place by running:

ping airflow.leira.net
Pinging airflow.leira.net [35.206.141.120] with 32 bytes of data

You should not get any reply (the address is not attached to anything yet), but you should see the correct IP address being resolved.

Load Balancer

In the Google console, select “Network services” | “Load balancing” | “Create load balancer”.

Select “Start configuration” in the box “HTTP(S) Load Balancing”.

Select “From Internet to my VMs” in the first step:

Then give the load balancer a name of your choice, and click on the first bullet point “Backend configuration”, then “Backend services” | “Create a backend service”:

When you get to the health check drop-down, a form for creating a new health check is shown.

Named port: http8080
New backend:
Instance group: airflow-instance-group
Port numbers: 8080

We do not have to do anything in the “Host and path rules”, so skip this now.

Enter the “Frontend configuration”.

Choose a name and select:

Protocol: HTTPS
IP address: airflow-static-ip (the one we reserved above)
Certificate: Create a new certificate (again, before you do this, your domain name must have an A record pointing to the static IP address).

A new pop-up:

Enter a name, and choose “Create Google-managed certificate”.

Enter your DNS name pointing to the static IP address and hit the “Create” button.

Hit the “Done” button in the frontend configuration form.

Then hit the “Create” button in the “New HTTP(S) load balancer” form.

You should then end up with a load balancer like this:

Opening the newly created load balancer, we can see that the certificate is not ready yet. In my case, it took about 10 minutes before it was ready.
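For reference, the console wizard roughly corresponds to the following chain of gcloud resources. This is only a sketch: it assumes a global (Premium-tier) static IP for the frontend, airflow.example.com stands in for your own domain, and the /health request path is my choice (with RBAC enabled, the root path answers with a redirect to the login page, which a default health check would treat as unhealthy).

gcloud compute health-checks create http airflow-health-check \
    --port=8080 --request-path=/health
gcloud compute backend-services create airflow-backend \
    --protocol=HTTP --port-name=http8080 \
    --health-checks=airflow-health-check --global
gcloud compute backend-services add-backend airflow-backend \
    --instance-group=airflow-instance-group \
    --instance-group-zone=europe-west1-b --global
gcloud compute url-maps create airflow-lb --default-service=airflow-backend
gcloud compute ssl-certificates create airflow-cert --domains=airflow.example.com
gcloud compute target-https-proxies create airflow-https-proxy \
    --url-map=airflow-lb --ssl-certificates=airflow-cert
gcloud compute forwarding-rules create airflow-https-rule \
    --global --address=airflow-static-ip \
    --target-https-proxy=airflow-https-proxy --ports=443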

Conclusion

I hope this post will help you.

Still not solved: how to automatically redirect from HTTP (port 80) to HTTPS (port 443). It should be a setting in the load balancer, and I hope it will come in the future.

Please give me feedback.

Miscellaneous

I use the Environment setting in the systemd unit file to set the PATH used by the scheduler. I only use the BashOperator and DummyOperator, and set up my Python jobs with a path to the wanted Python version for my operators:

task = BashOperator(
    task_id='my_task',
    bash_command='/srv/<virtualenv>/bin/python3 src/transfer.py',
)

I do not need to run my operators with the same Python version or virtual environment. Each may use their own, or share.
