Published in Compendium

Install Apache Airflow on a Google Cloud Platform Virtual Machine

Three Wind Turbines — Photo: Vera Kratochvil (CC0 Public Domain)

Installing Airflow from scratch is an alternative to the managed version Cloud Composer that Google offers. Here are my installation notes.

A new, updated version can be found here: Airflow on GCP (May 2020)

Modules

Ubuntu 18.04 LTS — already up and running, not covered
Python 3.6 — Default for Ubuntu 18.04 LTS
Apache Airflow 1.10.1
PostgreSQL 9.6 — Managed Cloud SQL version
Nginx 1.14.0 — Used as front-end and TLS termination
Let’s Encrypt HTTPS certificate
Systemd services — Automatic startup

Installing Airflow

The commands below install the needed packages, create a Python virtual environment under /srv, create an airflow user that the server will run as, and set ownership and permissions.
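Roughly, the steps could look like this (a sketch; the exact package list, the [postgres] extra, and the AIRFLOW_GPL_UNIDECODE workaround are my assumptions for Airflow 1.10.1 on Ubuntu 18.04):

```shell
sudo apt update
sudo apt install -y python3-venv python3-dev build-essential libpq-dev

# Dedicated system user that the server will run as
sudo adduser --system --group --home /srv/airflow airflow

# Python virtual environment under /srv, owned by the airflow user
sudo python3 -m venv /srv/airflow
sudo chown -R airflow:airflow /srv/airflow

# Install Airflow and the PostgreSQL driver inside the virtual environment
sudo su airflow
cd /srv/airflow
source bin/activate
export AIRFLOW_GPL_UNIDECODE=yes
pip install 'apache-airflow[postgres]==1.10.1'
```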

Installing Database

We use a managed Cloud SQL database without a public IP address. To keep costs down, choose the minimum resources (1 CPU, 3.75 GB RAM, 10 GB HDD).

We want to end up with:

  • database instance (server) with instance ID: airflow-db
  • database name: airflow
  • database user name: airflow-user
  • database password for the airflow-user: <db-password>
  • database IP address: <db-server-ip>

From the Google console, go to STORAGE | SQL | select “Create new instance” and then “Choose PostgreSQL”. Choose “Private IP” and associate to an existing network that your VM can access.

Database installation. Note the password and select an associated network that your VM can access.

After you finish the database wizard, you should create a new database “airflow” and a new user “airflow-user”.

Create the new “airflow” database in your airflow-db instance.
Create a new user account “airflow-user” and take note of the password.
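If you prefer the command line, the same database and user can be created with gcloud (a sketch; it assumes the airflow-db instance already exists and that the Cloud SDK is authenticated):

```shell
# Create the airflow database inside the airflow-db instance
gcloud sql databases create airflow --instance=airflow-db

# Create the airflow-user account; substitute your own password
gcloud sql users create airflow-user --instance=airflow-db --password=<db-password>
```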

You need to locate the “Private IP address” from the instance details overview pane.

At this point, you should be able to connect to the new database.

$ psql -h <db-server-ip> -U airflow-user -d airflow
Password for user airflow-user:
psql (10.6 (Ubuntu 10.6-0ubuntu0.18.04.1), server 9.6.10)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES128-GCM-SHA256, bits: 128, compression: off)
Type "help" for help.
airflow=> grant usage on schema public to "airflow-user";

Initialize the database and config file

Running airflow initdb will create the config files and a metadata database. Since we are using a PostgreSQL database, we have to modify the database connection string in the config file afterward and then rerun the airflow initdb command.

sudo su airflow
cd /srv/airflow
source bin/activate
export AIRFLOW_HOME=/srv/airflow
export AIRFLOW__WEBSERVER__RBAC=true
airflow initdb

Review the /srv/airflow/airflow.cfg file and change the connection string and impersonation user:
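The relevant lines in airflow.cfg might look like this (a sketch; the connection string format assumes the psycopg2 driver, and default_impersonation is the [core] option for the user tasks run as):

```ini
[core]
# Metadata database: the Cloud SQL PostgreSQL instance created earlier
sql_alchemy_conn = postgresql+psycopg2://airflow-user:<db-password>@<db-server-ip>:5432/airflow

# Run tasks as this operating system user unless a task overrides it
default_impersonation = airflow
```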

Review the file /srv/airflow/webserver_config.py and ensure the following is set:
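For the RBAC web UI with username/password login, the generated webserver_config.py should contain something like this (a sketch; AUTH_DB is Flask-AppBuilder’s database authentication backend):

```python
from airflow import configuration as conf
from flask_appbuilder.security.manager import AUTH_DB

# Reuse the Airflow metadata database for the web UI's user accounts
SQLALCHEMY_DATABASE_URI = conf.get('core', 'SQL_ALCHEMY_CONN')

# Authenticate users against accounts stored in the metadata database
AUTH_TYPE = AUTH_DB
```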

After editing the config files, initialize the PostgreSQL database (with the virtual environment activated):

airflow initdb

Add a new airflow admin user (activated virtual environment):

airflow create_user -r Admin -u jon -e jon@example.com -f Jon -l Snow

You can now start the airflow web server:

airflow webserver -p 8080

and scheduler:

airflow scheduler

Nginx and Let’s Encrypt certificates

To use Nginx as a web front-end with HTTPS termination and Let’s Encrypt certificates, your Airflow server needs to have:

  • a public IP address
  • a DNS entry for your server’s domain name (airflow.example.com) pointing to the public IP address
  • a firewall that allows ingress traffic from anywhere to ports 80 and 443

If all the above is true, then you can install the following repositories and packages:

sudo su
# Nginx as TLS terminator and reverse proxy for Airflow
apt install nginx
# Add the repositories and packages needed for certbot
apt update
apt install software-properties-common -y
add-apt-repository ppa:certbot/certbot -y
apt update
apt upgrade
apt install python-certbot-nginx -y

Run once on a new machine to create a new dhparam.pem file:

sudo su
cd /etc/ssl/private
openssl dhparam -out dhparam.pem 4096
chmod o-rwx dhparam.pem

Edit /etc/nginx/nginx.conf and add/edit the following in the SSL Settings in the http section:

ssl_session_timeout 10m;
ssl_session_cache shared:SSL:10m;
ssl_protocols TLSv1.2;
ssl_ciphers 'EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH';
ssl_dhparam /etc/ssl/private/dhparam.pem;
ssl_prefer_server_ciphers on;

Create a new Nginx config file for your airflow site:
sudo nano /etc/nginx/sites-available/airflow.example.com
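A minimal site configuration could look like this (a sketch; it only needs to serve plain HTTP at this point, since Certbot adds the HTTPS server section later, and the proxy headers are my choice of a common minimal set):

```nginx
server {
    listen 80;
    server_name airflow.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```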

This configuration depends on the Airflow web server listening on 127.0.0.1:8080. Enable the site by symlinking it into /etc/nginx/sites-enabled, then apply your configuration change by reloading Nginx:

sudo service nginx reload

Then run certbot in interactive mode:

sudo certbot --nginx

After Certbot has run successfully, you should end up with two server sections, one for HTTP and one for HTTPS. You need to add the location section to the HTTPS server block:
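The location section for the HTTPS server block could look like this (a sketch, mirroring the HTTP configuration above):

```nginx
location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```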

If successful, check the nginx config and reload nginx.

sudo nginx -t
sudo service nginx reload

You should now be able to access your web site with HTTPS. Check your test score here: https://www.ssllabs.com/ssltest/analyze.html?d=airflow.example.com&hideResults=on&latest

Systemd services

Set up the Airflow web server and scheduler with systemd to allow for automatic start on server boot.
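An airflow-webserver.service unit could look like this (a sketch; the paths assume the /srv/airflow virtual environment created earlier):

```ini
[Unit]
Description=Airflow webserver daemon
After=network.target

[Service]
User=airflow
Group=airflow
Environment=AIRFLOW_HOME=/srv/airflow
ExecStart=/srv/airflow/bin/airflow webserver -p 8080
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```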

Copy the airflow-webserver.service file to /lib/systemd/system:

sudo cp airflow-webserver.service /lib/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl start airflow-webserver.service
sudo systemctl status airflow-webserver.service

Do the same with airflow-scheduler.service:
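A matching airflow-scheduler.service could look like this (again a sketch; the Environment line setting PATH is what lets BashOperator tasks find the virtualenv’s python3):

```ini
[Unit]
Description=Airflow scheduler daemon
After=network.target

[Service]
User=airflow
Group=airflow
Environment=AIRFLOW_HOME=/srv/airflow
Environment="PATH=/srv/airflow/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/srv/airflow/bin/airflow scheduler
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```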

You should then be able to start and stop the services with the systemctl command:

sudo systemctl stop airflow-webserver
sudo systemctl stop airflow-scheduler
sudo systemctl stop nginx

I use the Environment directive in the service file to set the PATH used by the scheduler. I only use the BashOperator and DummyOperator, and set up my Python jobs with:

task = BashOperator(
    task_id='my_task',
    bash_command=f'/srv/<virtualenv>/bin/python3 src/transfer.py',
    ...
)

Misc

Please give me feedback. It is a complicated installation with many moving parts.

Notes/Problems

I had to create the directory /home/airflow. Even though I used the installation directory /srv/airflow, some service always wanted to check the /home/airflow directory, and I never found a way to make Airflow stop checking it.

You should add a firewall rule to only allow traffic from your own, known addresses.

Jostein Leira

Python software developer and Google Cloud Certified Professional Cloud Architect
