Airflow on GCP (May 2020)
This is a complete guide to install Apache Airflow on a Google Cloud Platform (GCP) Virtual Machine (VM) from scratch.
An alternative is to use Cloud Composer, the managed version that Google offers.
This is an updated version of my post on installing Airflow on a GCP VM published in April 2019. A year has passed, and I have picked up some new tricks since then. New versions of Airflow, Ubuntu, PostgreSQL, and Python have arrived, and I have decided to replace Nginx with Google’s load balancer.
Modules
- Apache Airflow — version 1.10.10
- PostgreSQL 10 — Managed Cloud SQL version
- Python 3.7 — The latest version that Airflow works with.
- Ubuntu 20.04 LTS
- Google’s load balancer with managed HTTPS certificate
- Systemd services — Automatic startup
Architecture
The core of the system is the VM, on which we install Airflow. We use a Cloud SQL instance as our database.
The load balancer is only used as an HTTPS terminator, serving and automatically renewing certificates. If you prefer to use Nginx with Let’s Encrypt instead, see Installing Airflow on a GCP VM (April 2019).
To be able to connect the Cloud load balancer with our virtual machine, the VM must reside in an instance group. You may only connect a GCP load balancer to an instance group, and not directly to a specific VM.
Installation overview
- Network setup (virtual private cloud)
- Virtual Machine
- Firewall rules (allow ssh access)
- Instance group
- Install the required packages
- Installing the database (Cloud SQL)
- Configure Airflow
- Automatic startup
- Static IP address
- Load balancer
Network setup
If you already have your infrastructure in place or just want to use the default network, skip this part.
The reason for setting up a new Virtual Private Cloud (network) is to isolate our VM.
In the GCP Console, choose VPC Networks and + Create VPC Network.
Name your new network airflow-network.
I have used the names airflow-network and airflow-subnet. Names do not matter, as long as you are able to identify them later.
Select a region near your end-users. Be aware that there are two network tiers: Standard and Premium. Standard is optimized for cost, Premium for speed. Standard is unavailable in some regions. See https://cloud.google.com/network-tiers/docs/overview.
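If you prefer the command line, the console steps above can be sketched with gcloud. The names and region match the ones used in this post; the subnet CIDR range is my own assumption, so adjust it to your needs:

```shell
# Create a custom-mode VPC (equivalent to "+ Create VPC Network")
gcloud compute networks create airflow-network --subnet-mode=custom

# Create the subnet in the chosen region; 10.0.0.0/24 is an example range
gcloud compute networks subnets create airflow-subnet \
    --network=airflow-network \
    --region=europe-west1 \
    --range=10.0.0.0/24
```

These commands require an authenticated gcloud SDK and an active project, so run them from Cloud Shell or a configured workstation.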
Virtual Machine
Create a new Virtual Machine from the “Compute Engine” | “VM instances” menu with the following settings:
Name: airflow-vm
Region/Zone: europe-west1/europe-west1-b (same region as your network)
Machine type: g1-small (it is easy to change later if you need bigger)
Boot disk: Ubuntu 20.04 LTS minimal image
Disk size: 10 GB (you may increase disks later, but never shrink)
Firewall: Allow HTTP traffic (will change this later)
Networking: Select the airflow-network and airflow-subnet
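As a rough gcloud equivalent of the settings above (the image family and flag values are my assumptions based on the console choices; verify them against your project):

```shell
gcloud compute instances create airflow-vm \
    --zone=europe-west1-b \
    --machine-type=g1-small \
    --image-family=ubuntu-minimal-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=10GB \
    --network=airflow-network \
    --subnet=airflow-subnet \
    --tags=http-server
```

The `http-server` tag corresponds to the "Allow HTTP traffic" checkbox in the console.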
Firewall rules
For administration and installation, I need SSH access to the new VM. Since we created our own custom VPC, we need to add a firewall rule to allow access.
Choose “VPC network” | “Firewall rules” | “Create Firewall rule”.
Use:
Name: allow-ssh-access
Network: airflow-network
Targets: All instances in the network
Source IP-ranges: 0.0.0.0/0 (all)
Specified protocols and ports: tcp 22
Hit the “Create” button. You may later disable/enable this rule depending on access needs.
The plan is to serve airflow on TCP port 8080. During setup and testing, I modify the existing firewall rule airflow-network-allow-http, and add port 8080.
Create another firewall rule to allow traffic from the load balancer to the backend. Traffic from the load balancer may come from addresses in the IP ranges 130.211.0.0/22 and 35.191.0.0/16.
You should end up with the following rules:
For now, both airflow-network-allow-http and allow-loadbalancer overlap, since both allow traffic to port 8080 on the VM. The idea is to disable airflow-network-allow-http when we are finished.
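The two rules described above can also be sketched with gcloud (rule names follow the ones used in this post):

```shell
# SSH from anywhere; disable or delete later if you do not need it
gcloud compute firewall-rules create allow-ssh-access \
    --network=airflow-network \
    --allow=tcp:22 \
    --source-ranges=0.0.0.0/0

# Health checks and traffic from Google's load balancer IP ranges
gcloud compute firewall-rules create allow-loadbalancer \
    --network=airflow-network \
    --allow=tcp:8080 \
    --source-ranges=130.211.0.0/22,35.191.0.0/16
```

The source ranges in the second rule are the documented Google load balancer and health-check ranges mentioned above.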
Instance group
We create an unmanaged instance group and add our newly created VM. Select “Create instance group” under the menu element “Compute Engine” | “Instance groups”:
Enter
Name: airflow-instance-group
Region/zone: europe-west1/europe-west1-b
Network/subnet: airflow-network/airflow-subnet
Expand “Specify port name mapping” (see below)
VM instances: airflow-vm
Expand the “Specify port name mapping” and define the port name “http8080” with port numbers “8080”:
This port name is used by the load balancer later.
Hit the “Create” button. You should end up with a new instance group containing one instance.
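The same instance group can be created from the command line; this is a sketch of the console steps above:

```shell
# Create an empty unmanaged instance group in the VM's zone
gcloud compute instance-groups unmanaged create airflow-instance-group \
    --zone=europe-west1-b

# Add the VM to the group
gcloud compute instance-groups unmanaged add-instances airflow-instance-group \
    --zone=europe-west1-b \
    --instances=airflow-vm

# Define the named port the load balancer will refer to
gcloud compute instance-groups set-named-ports airflow-instance-group \
    --zone=europe-west1-b \
    --named-ports=http8080:8080
```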
Install the required packages
The commands below install the needed packages, create a Python virtual environment under /srv, create an airflow user that the server will run as, and set the owner and permissions.
sudo su
apt update
apt upgrade
apt install software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt install python3.7 python3.7-venv python3.7-dev
adduser airflow --disabled-login --disabled-password --gecos "Airflow system user"
cd /srv
python3.7 -m venv airflow
cd airflow
source bin/activate

# With an activated virtual environment
pip install --upgrade pip
pip install wheel
pip install apache-airflow[postgres,crypto]==1.10.10

chown airflow.airflow . -R
chmod g+rwx . -R
Installing the database
We use a managed Cloud SQL database without a public IP address. To keep costs down, we choose the minimum resources (1 CPU, 3.75 GB RAM, 10 GB HDD).
We want to end up with:
database instance (server) with instance ID: airflow-db
database name: airflow
database user name: airflow-user
database password for the airflow-user: <db-password>
database IP address: <db-server-ip>
From the Google console, go to “Storage” | “SQL” | “Create new
instance” and then “Choose PostgreSQL”.
Choose “Private IP” and associate
it to the airflow-network. Hit “Allocate and connect”, before hitting the “Create” button at the bottom of the form.
Wait for the new server instance to be created.
When the new database instance is up and running, you should create a new database airflow and a new user airflow-user. Enter the airflow-db instance and select “Databases” and then “Create database”.
Enter “airflow” in the popup and hit the create button.
Then create a new user. Select the menu item Users, then the “Create user account” button.
Enter the user name airflow-user and choose a new password, referred to as <db-password> below.
At this point, you should be able to connect to the new database. You need to locate the “Private IP address” from the database instance details overview pane.
$ psql -h <db-server-ip> -U airflow-user -d airflow
Password for user airflow-user: <db-password>
psql (12.2 (Ubuntu 12.2-4), server 11.6)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384)
Type "help" for help

airflow=>
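If psql is not installed on the VM yet, a quick way to verify that the private IP is reachable on the PostgreSQL port is a plain TCP check. This is a sketch using only the standard library; replace <db-server-ip> with your instance's private address:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the Cloud SQL private IP on the default PostgreSQL port
# print(port_open("<db-server-ip>", 5432))
```

A True result only proves network reachability; authentication is still verified with psql as shown above.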
Configure Airflow
Running airflow initdb will create the config files and a metadata database. Since we are using a PostgreSQL database, we have to modify the database connection string in the config file afterward and then rerun the airflow initdb command.
sudo su airflow
cd /srv/airflow
source bin/activate
export AIRFLOW_HOME=/srv/airflow

airflow initdb
Review the /srv/airflow/airflow.cfg file and change the following:
# This is not the complete airflow.cfg
# but only show configurations you need to change
sql_alchemy_conn = postgresql+psycopg2://airflow-user:<db-password>@<db-server-ip>/airflow
default_impersonation = airflow
load_examples = False
rbac = True
enable_proxy_fix = True
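Two small gotchas with this config are worth a sketch. If your database password contains special characters, it must be URL-escaped inside sql_alchemy_conn. And since we installed the crypto extra, you may also want to set fernet_key in airflow.cfg; Fernet accepts any urlsafe-base64 encoding of 32 random bytes. Both can be produced with the standard library (the values below are illustrative, not from this setup):

```python
import base64
import os
from urllib.parse import quote_plus

def sqlalchemy_conn(user: str, password: str, host: str, db: str) -> str:
    """Build a sql_alchemy_conn value, URL-escaping the password."""
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}/{db}"

def fernet_key() -> str:
    """32 random bytes, urlsafe-base64 encoded: the format Fernet expects."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()

# '@' and ':' in the password are escaped so the URL stays parseable
print(sqlalchemy_conn("airflow-user", "p@ss:word", "10.1.2.3", "airflow"))
print(fernet_key())
```

If your password is plain alphanumeric, no escaping is needed and the string from the config above works as-is.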
After editing the config file, initialize the PostgreSQL database (with the virtual environment activated):
airflow initdb
Add a new airflow admin user (activated virtual environment):
airflow create_user -r Admin -u jon -e jon@example.com -f Jon -l Snow
You can now start the airflow webserver:
airflow webserver -p 8080
You may now access the webserver on http://<VM’s External IP>:8080
You may start the scheduler with the command:
airflow scheduler
Automatic startup
Set up the Airflow webserver and scheduler with systemd to allow for
automatic start on server boot.
[Unit]
Description=Airflow webserver daemon
After=network.target

[Service]
Environment="PATH=/srv/airflow/bin"
Environment="AIRFLOW_HOME=/srv/airflow"
User=airflow
Group=airflow
Type=simple
ExecStart=/srv/airflow/bin/airflow webserver -p 8080 --pid /srv/airflow/webserver.pid
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target
Create a file airflow-webserver.service with the content listed above. Then run the following commands:
sudo cp airflow-webserver.service /lib/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl start airflow-webserver.service
sudo systemctl status airflow-webserver.service
Do the same with airflow-scheduler.service:
[Unit]
Description=Airflow scheduler daemon
After=network.target

[Service]
Environment="PATH=/srv/airflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
Environment="AIRFLOW_HOME=/srv/airflow"
User=airflow
Group=airflow
Type=simple
ExecStart=/srv/airflow/bin/airflow scheduler
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
and run the commands:
sudo cp airflow-scheduler.service /lib/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler.service
sudo systemctl status airflow-scheduler.service
You should then be able to start and stop the services with the systemctl command:
sudo systemctl start airflow-webserver
sudo systemctl stop airflow-webserver

sudo systemctl start airflow-scheduler
sudo systemctl stop airflow-scheduler
Try to reboot your server and check that both the Airflow scheduler and the webserver start automatically.
sudo reboot now
Static IP address
Reserve an external static IP address. This will be where the users connect to from the internet, and we need to set up a DNS A record pointing to this IP address. This IP address will be the front end of the load balancer and is not directly attached to the server.
In the Google console go to “VPC Networks” | “External IP addresses” | “Reserve static address”. Enter a name (airflow-static-ip) and region (europe-west1). Do not attach the IP address to anything yet. We will attach it to the load balancer later.
Hit the “Reserve” button.
Create a DNS A-record pointing to the static IP address created above. This is done in your DNS provider’s setup. I’m using the DNS name airflow.leira.net:
You should, of course, use your own IP address and DNS name. Please note that the DNS A record must be in place before we can create an HTTPS certificate.
Check that the DNS record works by running:

ping airflow.leira.net

Pinging airflow.leira.net [35.206.141.120] with 32 bytes of data

You should not get any replies, since the address is not attached to anything yet, but you should see the correct IP address.
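The lookup can also be checked programmatically; this small sketch resolves a hostname with the standard library (the hostname in the comment is mine, so use your own):

```python
import socket

def resolve(hostname: str) -> str:
    """Return the IPv4 address the hostname resolves to."""
    return socket.gethostbyname(hostname)

# Example: resolve("airflow.leira.net") should return the reserved static IP
print(resolve("localhost"))
```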
Load Balancer
In the Google console, select “Network services” | “Load balancing” | “Create load balancer”.
Select “Start configuration” in the box “HTTP(S) Load Balancing”.
Select “From Internet to my VMs” in the first step:
Then give the load balancer a name of your choice, and click on the first bullet point “Backend configuration”, then “Backend services” | “Create a backend service”:
When you open the health check drop-down, another form is shown.
Named port: http8080
New backend:
Instance group: airflow-instance-group
Port numbers: 8080
We do not have to do anything in the “Host and path rules”, so skip this now.
Enter the “Frontend configuration”.
Choose a name and select:
Protocol: HTTPS
IP address: airflow-static-ip (that we created above)
Certificate: Create a new certificate (again, before you do this, your domain name must have an A record pointing to the static IP address).
A new pop-up:
Enter a name, and choose “Create Google-managed certificate”.
Enter your DNS name pointing to the static IP address and hit the “Create” button.
Hit the “Done” button in the frontend configuration form.
Then hit the “Create” button in the “New HTTP(S) load balancer” form.
You should then end up with a load balancer like this:
Entering the newly created load balancer, we can see that the certificate is not ready yet. In this case, it took about 10 minutes before it was ready.
Conclusion
I hope this post will help you.
Not solved: how to automatically redirect from HTTP (port 80) to HTTPS (port 443). It should be a setting in the load balancer, and I hope it will come in the future.
Please give me feedback.
Miscellaneous
I use the Environment setting in the systemd config file to set the path used by the scheduler. I only use the BashOperator and DummyOperator, and set up my Python jobs with a path to the wanted Python version for my operators:
task = BashOperator(
    task_id='my_task',
    bash_command='/srv/<virtualenv>/bin/python3 src/transfer.py',
    ...
)
I do not need to run my operators with the same Python version or virtual environment. Each may use their own, or share.
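For completeness, a minimal DAG file built around that pattern might look like the sketch below. This is my illustration for Airflow 1.10, not code from the post; the dag_id, schedule, and the <virtualenv> path are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

# Airflow 1.10-style DAG definition; adjust names and paths to your setup
with DAG(
    dag_id="transfer_example",
    schedule_interval="@daily",
    start_date=datetime(2020, 5, 1),
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")

    # The task runs under its own virtualenv's Python, not Airflow's
    transfer = BashOperator(
        task_id="my_task",
        bash_command="/srv/<virtualenv>/bin/python3 src/transfer.py",
    )

    start >> transfer
```

Place the file under your AIRFLOW_HOME dags folder; the scheduler picks it up on its next parse.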