Installing Apache Airflow on Ubuntu/AWS

A key component of our Kraken Public Data Infrastructure, which automates ETL workflows for public water and street data, is a cloud-hosted instance of Apache Airflow.

To understand the significance of Airflow in building a data infrastructure, I recommend first reading this post by Maxime Beauchemin and his description of Data Engineering.

ARGO exists to build, operate, and maintain public data infrastructure. We currently do this for a coalition of California water utilities who administer water to over 22 million residents of Southern California.

Data infrastructure is curiously analogous to water infrastructure.

Consider that untreated water from ground wells or snowpack requires serious physical infrastructure in the form of pipes, aqueducts and treatment facilities to guide its flow into the taps of our homes and businesses.

Similarly, water usage data that comes in different shapes and sizes from the various water retailers needs to be refined before it can power a shared analytics platform.

Water pipes. Lots of them and big ones too! Put in place in the late 1920s to serve Southern California
Kraken explained by David Marulli, part of the ARGO core Team.

For us, this means automating a series of steps: securely Extracting water data from the source, Transforming this data by relying on a trusted community of data parsers, and then Loading the refined data into the SCUBA Database that powers the core suite of analytics we make available to the CaDC’s subscribing utilities.
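To make the three steps concrete, here is a minimal sketch of an Extract, Transform, Load run in plain Python. It is not our production pipeline: the CSV columns, the sample values and the water_usage table are all made up for illustration, and an in-memory SQLite database stands in for the real target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from one water retailer (made-up values).
RAW_CSV = """account,usage_gallons,month
A-1, 1200 ,2017-01
A-2,,2017-01
A-3,950,2017-01
"""


def extract(text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no usage value and normalise the types."""
    cleaned = []
    for row in rows:
        usage = row["usage_gallons"].strip()
        if not usage:
            continue  # skip records with a missing reading
        cleaned.append((row["account"], int(usage), row["month"]))
    return cleaned


def load(records, conn):
    """Load: write the refined records into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS water_usage "
        "(account TEXT, gallons INTEGER, month TEXT)"
    )
    conn.executemany("INSERT INTO water_usage VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(gallons) FROM water_usage").fetchone()[0]
print(total)
```

Airflow's job is to run steps like these on a schedule, in the right order, and to keep track of which ones succeeded.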

Airflow was originally built by Airbnb’s data engineering team and subsequently open sourced into Apache Airflow.

ARGO is one amongst many data organizations that use Airflow for core operations.

To that end, we wanted to give back to the community and ensure that installing this fine piece of software is accessible to more purpose-driven and public data organizations.

Data pipes via The Airflow command center GUI. We manage our data pipelines from here.

What follows is a complete step-by-step installation of Apache Airflow on AWS.

Full credit for putting these instructions together goes to our Public Data Warrior, Xia Wang, who was key to researching, testing, and implementing Airflow in its early stages.

Create an Ubuntu Server Instance on Amazon Web Services

We are not going into detail on how to create an AWS instance. Amazon’s instructions provide that.

We use Ubuntu Server 16.04 LTS. It is also important to create at least a t2.medium type AWS instance. Anything smaller is not recommended.

Install Python, pip, Airflow and dependencies

Install Python and pip

Assuming we have a clean-slate Ubuntu server, we first need to install Python and the Python package management tool pip.

To install Python 2.7 along with setuptools:

sudo apt-get install python-setuptools

To install pip:

sudo apt-get install python-pip

These two commands will give us the bare minimum to kickstart the airflow installation.

Note: if the default installed pip is not the up-to-date version, you may want to consider updating it:

sudo pip install --upgrade pip

Install relational database (postgres) and configure the database

Airflow ships with a sqlite database backend. But to be able to run the data pipeline from the webUI, we need a more powerful database backend, configured so that airflow has access to it. In our case we decided to install the postgresql database.

sudo apt-get install postgresql postgresql-contrib

So far as we know, the most recent versions of postgresql (8 and 9) don’t have compatibility issues with airflow.

Now that we’ve installed the postgresql database, we need to create a database for airflow, and grant access to the EC2 user. To create a database for airflow, we need to access the postgresql command line tool psql as postgres' default superuser postgres:

sudo -u postgres psql

We then get a psql prompt that looks like postgres=#. Here we can type SQL statements to create the airflow database, add a new user (ubuntu in our case) that is allowed to log in, and grant it privileges on the database.

CREATE DATABASE airflow;
CREATE ROLE ubuntu LOGIN;
GRANT ALL PRIVILEGES ON DATABASE airflow TO ubuntu;
ALTER ROLE ubuntu SUPERUSER;
ALTER ROLE ubuntu CREATEDB;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ubuntu;

Quit the prompt with \q, then connect to the new database as the ubuntu user:

psql -d airflow

Typing

\conninfo

at the prompt prints the connection information.

One last thing we need to configure for the postgresql database is the settings in pg_hba.conf. The query

SHOW hba_file;

returns the location of the pg_hba.conf file (it's likely in /etc/postgresql/9.*/main/). Open the file with a text editor (vi, emacs or nano), and change the IPv4 address to 0.0.0.0/0 and, if you don't want to use a password to connect to the database, the IPv4 connection method from md5 (password) to trust. We also need to edit the postgresql.conf file to open the listen address to all IP addresses:

listen_addresses = '*'

And we need to start the postgresql service:

sudo service postgresql start

And any time we modify the connection information, we need to reload the postgresql service for the modification to be recognized by the service:

sudo service postgresql reload

Install airflow and configure it

Before installing, we can set up the default airflow home directory:

export AIRFLOW_HOME=~/airflow

The next step is to install the system and python packages for airflow. A general tip: if an error message arises during the installation, note which package failed, install the dependency for that package, and try again.

First install the following dependencies:

  • sudo apt-get install libmysqlclient-dev (dependency for airflow[mysql] package)
  • sudo apt-get install libssl-dev (dependency for airflow[crypto] package)
  • sudo apt-get install libkrb5-dev (dependency for airflow[kerberos] package)
  • sudo apt-get install libsasl2-dev (dependency for airflow[hive] package)

After installing these dependencies, we can install airflow and its packages. (You can modify these packages depending on need. Celery and RabbitMQ are needed to use the Web-based GUI)

sudo pip install airflow[async,devel,celery,crypto,druid,gcp_api,jdbc,hdfs,hive,kerberos,ldap,password,postgres,qds,rabbitmq,s3,samba,slack]

Update 11/2018 (🙏 Andrew Stroup): installing airflow may fail with the error below. As the message says, setting SLUGIFY_USES_TEXT_UNIDECODE=yes in the environment of the install command avoids it:

raise RuntimeError("By default one of Airflow's dependencies installs a GPL "
RuntimeError: By default one of Airflow’s dependencies installs a GPL dependency (unidecode). To avoid this dependency set SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when you install or upgrade Airflow. To force installing the GPL version set AIRFLOW_GPL_UNIDECODE

After successfully installing airflow and packages, we start up Airflow’s database:

airflow initdb

This sets up the first-time configs. An airflow.cfg file is generated in the airflow home directory. We should open it with a text editor, and change some configurations in the [core] section:

  • for the executor, we should use CeleryExecutor instead of SequentialExecutor if we want to run the pipeline in the webUI:

executor = CeleryExecutor

  • for the backend DB connection, we should pass along the connection info of the postgresql database airflow we just created:

sql_alchemy_conn = postgresql+psycopg2://ubuntu@localhost:5432/airflow

If you don’t want the example dags to show up in the webUI, you can set the load_examples variable to False. Save and quit.
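If the shape of the sql_alchemy_conn value is unfamiliar, the sketch below pulls it apart with Python's standard library. It only illustrates how the URL is laid out (dialect, driver, user, host, port, database); it is not how airflow itself reads the setting. It is written for Python 3, where the module is urllib.parse (on the Python 2.7 installed above, the same function lives in urlparse). The describe_conn helper is our own name.

```python
from urllib.parse import urlsplit

# The connection string set in airflow.cfg above.
SQL_ALCHEMY_CONN = "postgresql+psycopg2://ubuntu@localhost:5432/airflow"


def describe_conn(url):
    """Split a SQLAlchemy-style URL into its named parts."""
    parts = urlsplit(url)
    # The scheme is "<dialect>+<driver>", e.g. postgresql+psycopg2.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "user": parts.username,
        "password": parts.password,  # None here: no password in the URL
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_conn(SQL_ALCHEMY_CONN)
for key in sorted(info):
    print(key, "=", info[key])
```

The user and database are the ubuntu role and the airflow database created in the postgres section. If you kept password authentication (md5) in pg_hba.conf, the URL would carry the password as ubuntu:password@localhost.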

And to prepare for the next steps, we also need to set up the broker_url and celery_result_backend in the [celery] section (the result backend can use the same URL as broker_url):

broker_url = amqp://guest:guest@localhost:5672//

celery_result_backend = amqp://guest:guest@localhost:5672//
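If you would rather script these edits than make them by hand, something like the sketch below could work. It is an illustration only: SAMPLE_CFG is a tiny stand-in with placeholder values, not the real generated airflow.cfg, and apply_settings is our own helper name. It assumes Python 3 (the module is called ConfigParser on Python 2.7).

```python
import configparser
import io

# A tiny stand-in for the generated airflow.cfg. The real file has many
# more keys, and the starting values here are placeholders.
SAMPLE_CFG = """\
[core]
executor = SequentialExecutor
sql_alchemy_conn = placeholder
load_examples = True

[celery]
broker_url = placeholder
celery_result_backend = placeholder
"""

# The values this guide sets, keyed by (section, option).
WANTED = {
    ("core", "executor"): "CeleryExecutor",
    ("core", "sql_alchemy_conn"): "postgresql+psycopg2://ubuntu@localhost:5432/airflow",
    ("core", "load_examples"): "False",
    ("celery", "broker_url"): "amqp://guest:guest@localhost:5672//",
    ("celery", "celery_result_backend"): "amqp://guest:guest@localhost:5672//",
}


def apply_settings(cfg_text, wanted):
    """Return cfg_text with the wanted options overwritten."""
    # RawConfigParser does no %-interpolation, so values pass through untouched.
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)
    for (section, option), value in wanted.items():
        parser.set(section, option, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = apply_settings(SAMPLE_CFG, WANTED)
print(updated)
```

One caveat: configparser drops comments when it writes a file back out, so the explanatory comments in the real airflow.cfg would be lost. Editing by hand keeps them.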

For the new configuration to take effect, we need to initialize the database again:

airflow initdb

If the previous steps were followed correctly, we can now call the airflow webserver, and access the webUI:

airflow webserver

To access the webserver, configure the security group of your EC2 instance and make sure the port 8080 (default airflow webUI port) is open to your computer. Open a web browser, copy and paste your EC2 instance IPv4 address, followed by :8080, and the webUI should pop up. However, we are only halfway through. Close the browser and the airflow webserver.
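If the page does not load, it helps to know whether the port is reachable at all before digging into airflow itself. Here is a small sketch of such a check, run from your own computer with Python 3. The port_is_open helper is our own name, and the host you pass is whatever IPv4 address your instance has.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example (replace the host with your EC2 instance IPv4 address):
# print(port_is_open("your-ec2-ipv4-address", 8080))
```

A False result while airflow webserver is running on the instance usually points at the security group rather than at airflow.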

Rabbitmq and Celery

Rabbitmq is the core component supporting airflow on distributed computing systems. To install rabbitmq, run the following command:

sudo apt-get install rabbitmq-server

And change the configuration file /etc/rabbitmq/rabbitmq-env.conf:

NODE_IP_ADDRESS=0.0.0.0

And start a rabbitmq service:

sudo service rabbitmq-server start

Celery is the python task queue library that airflow uses to talk to rabbitmq. However, at the time this installation was carried out, airflow 1.8 had compatibility issues with celery 4.0.2 due to the librabbitmq library. So make sure to install celery version 3 instead.

sudo pip install 'celery>=3.1.17,<4.0'

If you accidentally installed celery 4.0.2, you need to uninstall it before installing the lower version:

sudo pip uninstall celery

Otherwise, there will be a confusing error message when you call the airflow worker: Received and deleted unknown message. Wrong destination?!?
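To check which celery you ended up with before starting a worker, a sketch like the one below compares a version string against the range used in the pip command above. The helper names are ours, and the parsing is deliberately simple (it reads the leading digits of each dotted part), so treat it as a sanity check rather than a full version parser.

```python
import re


def parse_version(text):
    """Turn '3.1.25' into (3, 1, 25), stopping at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def celery_version_ok(version):
    """True when version falls inside 'celery>=3.1.17,<4.0'."""
    return (3, 1, 17) <= parse_version(version) < (4, 0)


for candidate in ("3.1.17", "3.1.25", "4.0.2"):
    print(candidate, celery_version_ok(candidate))

# On the server, the installed version can be read with:
#   import celery
#   print(celery_version_ok(celery.__version__))
```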

The airflow webUI

Now airflow and the webUI are ready to shine. Let’s see how to put dags (directed acyclic graphs: task workflows) in it and run them. We need to create a dags directory in the airflow home directory: mkdir dags. Write some test dags and put them in the dags directory. Then start the airflow components so the dags are loaded:

airflow webserver
airflow scheduler
airflow worker

For the airflow webUI to work, we need to start a webserver and click the run button for a dag. Under the hood, the run button triggers the scheduler to distribute the dag's tasks in a task queue (rabbitmq) and assign workers to carry them out. So we need all three airflow components (webserver, scheduler and worker) running.

Since we installed the scheduler and the worker on the same EC2 instance, we hit memory limitations and were not able to run all three components at once. Instead, we opened the airflow webserver and airflow scheduler first, clicked the run button for the test dag, closed the airflow webserver and opened the airflow worker. The scheduler assigned the tasks in the queue to the workers, and the workers carried out the tasks. The scheduler and the workers recorded their activities in their respective logs in the airflow home directory. After the workers finished the tasks, we terminated the workers and reopened the webserver, and the test dag in the webUI was marked successful.
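For reference, here is a sketch of what a small test dag might look like, wrapped in a helper that writes it into the dags directory. The dag is written against the Airflow 1.8 API used in this guide. The dag id, task ids, echo commands and the write_test_dag helper are all made up for illustration, and this is not one of our own pipelines.

```python
import os

# Source of a small test dag (Airflow 1.8 style). Everything named here
# (hello_kraken, say_hello, say_done) is illustrative.
TEST_DAG_SOURCE = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "hello_kraken",
    start_date=datetime(2017, 1, 1),
    schedule_interval=None,  # no schedule: runs only when triggered
)

say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)
say_done = BashOperator(task_id="say_done", bash_command="echo done", dag=dag)

# say_done runs only after say_hello has succeeded.
say_hello.set_downstream(say_done)
'''


def write_test_dag(airflow_home, filename="hello_kraken.py"):
    """Create <airflow_home>/dags if needed and write the test dag into it."""
    dags_dir = os.path.join(airflow_home, "dags")
    if not os.path.isdir(dags_dir):
        os.makedirs(dags_dir)
    path = os.path.join(dags_dir, filename)
    with open(path, "w") as handle:
        handle.write(TEST_DAG_SOURCE)
    return path


# Example, assuming AIRFLOW_HOME is ~/airflow as set earlier:
# print(write_test_dag(os.path.expanduser("~/airflow")))
```

The two tasks give the scheduler something to queue and the worker something to pick up, which is enough to watch the webserver, scheduler and worker hand work to each other.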

A word about a few useful optional arguments:

  • -D: this argument can make the process run as a Daemon in the background
  • -c: this argument controls the maximum number of workers that can be triggered by airflow worker. The default number can also be set in the airflow.cfg file. It is handy when there is working memory limitation on the server.

Application Configuration

Disable automatic catchup

By default, Airflow will try to backfill jobs that it may have missed. In our use case, this is not the desired behavior. To change this, open the airflow.cfg file, find the catchup_by_default variable and set its value to False.

Safety and Environment Variables Configuration

The following sections describe how to perform important steps for configuring the Airflow installation.

Enable Authentication

It is strongly recommended that you enable web authentication on your Airflow server. This is easily done in the following two steps.

  1. In the airflow.cfg file, find the [webserver] section and add the lines below.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
Note: The file contains a line with authenticate set to False. Be sure to remove that line.
  2. Navigate to the airflow directory and open a Python interpreter.
$ cd ~/airflow
$ python

Run the commands shown below to create a user.

import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'new_user_name'
user.email = 'new_user_email@example.com'
user.password = 'set_the_password'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()

Now when you access the server in your browser, you will first have to authenticate on a login page.

Add Postgres Connections

Before you add any of your connections, it is strongly recommended that you enable encryption so that your database passwords and API keys are not stored in plain text.

Enabling Encryption

Install required libraries and packages

sudo apt-get install gcc libffi-dev python-dev libssl-dev
sudo pip install cryptography
sudo pip install airflow[crypto]

Generate an encryption key

FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
echo $FERNET_KEY

Edit the ~/airflow/airflow.cfg file, replacing the placeholder value for the fernet_key with your key.

# Secret key to save connection passwords in the db
fernet_key = <YOUR_KEY_HERE>
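A Fernet key is 32 random bytes in URL-safe base64, which is why it shows up as a 44-character string. The sketch below uses only the standard library to show that shape and to sanity-check a value before pasting it into airflow.cfg. The helper names are ours, and this is not a replacement for Fernet.generate_key() from the command above.

```python
import base64
import os


def make_key_shaped_string():
    """32 random bytes, URL-safe base64 encoded: the shape of a Fernet key."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


def looks_like_fernet_key(key):
    """True if key decodes to exactly 32 bytes and re-encodes to itself."""
    try:
        raw = base64.urlsafe_b64decode(key.encode())
    except (ValueError, TypeError):
        return False
    return len(raw) == 32 and base64.urlsafe_b64encode(raw).decode() == key


sample = make_key_shaped_string()
print(len(sample), looks_like_fernet_key(sample))
print(looks_like_fernet_key("<YOUR_KEY_HERE>"))
```

A check like this catches the common mistake of leaving the placeholder in place or pasting a truncated key.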

Restart the Airflow server

cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p 8080 &

Set up a Postgres connection

Note: Before beginning, make sure to add the airflow Security Group on AWS to one of the Security Groups authorized to access the RDS instance you will be connecting to.

Adding a Postgres connection is easy. In the Web UI, click Admin -> Connections -> Create. Enter the settings for your connection as shown by the example below.

Conn Id: some-db
Conn Type: Postgres
Host: something-something.us-region-2.rds.amazonaws.com
Schema: something
Login: user
Password: ********
Port: 5432

Click Save.

That’s it. You should now see your database connection appear under Data Profiling -> Ad Hoc Query.

Set Airflow Variables and Environment Variables

Airflow Variables

In the Web UI, click Admin -> Variables. Create the following keys and add their corresponding values. The AWS user represented by the key will need read/write access to the bucket specified.

  • bucket_name
  • aws_access_key_id
  • aws_secret_access_key

Environment Variables

Make a directory within the home directory for parsing.

mkdir ~/parse

Save this as an environment variable by adding the following line to your .bashrc file:

export PARSE_DIR=~/parse

Run source ~/.bashrc to reload the shell.
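Code that runs under airflow can then pick the directory up from the environment. A minimal sketch follows, assuming a fallback of ~/parse when the variable is not set. Both the fallback and the get_parse_dir name are our own choices for illustration.

```python
import os


def get_parse_dir(environ=None):
    """Return the parsing directory from PARSE_DIR, falling back to ~/parse."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PARSE_DIR", "~/parse")
    # expanduser handles a literal "~" in case the shell did not expand it.
    return os.path.abspath(os.path.expanduser(raw))


print(get_parse_dir())
```

Keep in mind that processes started before .bashrc was reloaded will not see the new variable until they are restarted.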