A key component of our Kraken Public Data Infrastructure, which automates ETL workflows for public water and street data, is a cloud-hosted instance of Apache Airflow.
To understand the significance of Airflow in building a data infrastructure, I recommend first reading this post by Maxime Beauchemin and his description of data engineering.
The Rise of the Data Engineer
I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.
ARGO exists to build, operate, and maintain public data infrastructure. We currently do this for a coalition of California water utilities who administer water to over 22 million residents of Southern California.
Data infrastructure is curiously analogous to water infrastructure.
Consider that untreated water from ground wells or snowpack requires serious physical infrastructure in the form of pipes, aqueducts, and treatment facilities to guide its flow into the taps of our homes and businesses.
Similarly, water usage data that comes in different shapes and sizes from the various water retailers needs to be refined before it can power a shared analytics platform.
For us this means automating a series of steps: securely Extracting water data from the source, Transforming this data by relying on a trusted community of data parsers, and then Loading the refined data into the SCUBA Database that powers the core suite of analytics made available to the CaDC’s subscribing utilities.
Airflow was originally built by Airbnb’s data engineering team and subsequently open sourced into Apache Airflow.
ARGO is one amongst many data organizations that use Airflow for core operations.
To that end, we wanted to give back to the community and make installing this fine piece of software accessible to more purpose-driven and public data organizations.
What follows is a complete step-by-step installation of Apache Airflow on AWS.
Full credit for putting these instructions together goes to our Public Data Warrior, Xia Wang, who was key to researching, testing, and implementing Airflow in its early stages.
Create an Ubuntu Server Instance on Amazon Web Services
We are not going into detail on how to create an AWS instance. Amazon’s instructions provide that.
We use Ubuntu Server 16.04 LTS. It is also important to create at least a
t2.medium type AWS instance. Anything smaller is not recommended.
Install Python, pip, Airflow and dependencies
Install Python and pip
Assuming we have a clean-slate Ubuntu server, we first need to install
Python and the Python package management tool, pip.
To install Python 2.7:
sudo apt-get install python-setuptools
To install pip:
sudo apt-get install python-pip
These two commands will give us the bare minimum to kickstart the airflow installation.
Note: if the default installed
pip is not the up-to-date version, you may want to consider updating it:
sudo pip install --upgrade pip
Install relational database (postgres) and configure the database
Airflow is shipped with a sqlite database backend. But to be able to run the data pipeline on the webUI, we need to have a more powerful database backend, and configure the database so that airflow has access to it. In our case we decided to install the postgresql database.
sudo apt-get install postgresql postgresql-contrib
So far as we know, the most recent versions of postgresql (8 and 9) don’t have compatibility issues with airflow.
Now that we’ve installed the postgresql database, we need to create a database for airflow and grant access to the EC2 user. To create a database for airflow, we need to access the postgresql command line tool
psql as postgres' default superuser
sudo -u postgres psql
Then we will receive a psql prompt that looks like
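postgres=#. We can type in sql queries to add a new user (ubuntu in our case), create the airflow database, and grant the user privileges on it.
-- LOGIN lets the role connect; plain CREATE ROLE defaults to NOLOGIN
CREATE ROLE ubuntu LOGIN;
-- the database has to exist before privileges on it can be granted
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES ON DATABASE airflow TO ubuntu;
ALTER ROLE ubuntu SUPERUSER;
ALTER ROLE ubuntu CREATEDB;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ubuntu;
After quitting psql with \q, running
psql -d airflow
as the ubuntu user and then typing \conninfo at the prompt will tell us the connection information.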
One last thing we need to configure for the postgresql database is the settings in
pg_hba.conf. At the psql prompt, the query
SHOW hba_file;
will return the location of the
pg_hba.conf file (it's likely in
/etc/postgresql/9.*/main/). Open the file with a text editor (vi, emacs or nano), and change the ipv4 address to
0.0.0.0/0 and the ipv4 connection method from md5 (password) to
trust if you don't want to use a password to connect to the database. We also need to configure the
postgresql.conf file to open the listen address to all ip addresses:
listen_addresses = '*'
And we need to start a postgresql service
sudo service postgresql start
And any time we modify the connection information, we need to reload the postgresql service for the modification to be recognized by the service:
sudo service postgresql reload
Install airflow and configure it
Before installing, we can set up the default airflow home directory:
export AIRFLOW_HOME=~/airflow
Next step is to install the system and python packages for airflow. A general tip: if an error message arises during the installation, pay attention to which package failed the process, and try to install the dependency for that package, and try again.
First install the following dependencies:
sudo apt-get install libmysqlclient-dev  # dependency for the airflow[mysql] package
sudo apt-get install libssl-dev  # dependency for the airflow[crypto] package
sudo apt-get install libkrb5-dev  # dependency for the airflow[kerberos] package
sudo apt-get install libsasl2-dev  # dependency for the airflow[hive] package
After installing these dependencies, we can install airflow and its packages. (You can modify these packages depending on need. Celery and RabbitMQ are needed to use the Web-based GUI)
sudo pip install airflow[async,devel,celery,crypto,druid,gcp_api,jdbc,hdfs,hive,kerberos,ldap,password,postgres,qds,rabbitmq,s3,samba,slack]
Update 11/2018 (🙏 Andrew Stroup): the airflow installation may fail with this error:
RuntimeError: By default one of Airflow’s dependencies installs a GPL dependency (unidecode). To avoid this dependency set SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when you install or upgrade Airflow. To force installing the GPL version set AIRFLOW_GPL_UNIDECODE
If so, set one of the environment variables named in the message and run the install again.
After successfully installing airflow and packages, we initialize Airflow’s database:
airflow initdb
This sets up the first-time configs. An
airflow.cfg file is generated in the airflow home directory. We should open it with a text editor, and change some configurations in the [core] section:
- for the executor, we should use CeleryExecutor instead of SequentialExecutor if we want to run the pipeline in the webUI:
executor = CeleryExecutor
- for the backend DB connection, we should pass along the connection info of the postgresql database
airflow we just created:
sql_alchemy_conn = postgresql+psycopg2://ubuntu@localhost:5432/airflow
If you don’t want the example dags to show up in the webUI, you can set the
load_examples variable to
False. Save and quit.
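The three [core] changes above can also be applied programmatically. What follows is only a sketch, assuming Python 3 (for the standard-library configparser module) and the connection details used in this post; the function name apply_core_settings is made up for illustration and is not part of Airflow. Note that configparser drops comments when it writes the file back, so editing by hand is preferable if you want to keep them.

```python
import configparser
import io


def apply_core_settings(cfg_text, db_user="ubuntu", db_host="localhost",
                        db_port=5432, db_name="airflow"):
    """Return airflow.cfg text with the [core] values from this post applied.

    apply_core_settings is a hypothetical helper, not an Airflow function.
    """
    # interpolation=None: airflow.cfg values may contain '%' characters
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("core"):
        parser.add_section("core")

    # Run tasks through Celery instead of the default SequentialExecutor
    parser.set("core", "executor", "CeleryExecutor")
    # Point the metadata database at the postgresql database created earlier
    parser.set(
        "core",
        "sql_alchemy_conn",
        "postgresql+psycopg2://{}@{}:{}/{}".format(db_user, db_host, db_port, db_name),
    )
    # Hide the bundled example dags in the webUI
    parser.set("core", "load_examples", "False")

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    sample = "[core]\nexecutor = SequentialExecutor\nload_examples = True\n"
    print(apply_core_settings(sample))
```

To use it on a real file, read the contents of airflow.cfg, pass them through the function, and write the result to a copy first so the original can be compared.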
And to prepare for the next steps, we also need to set up the
broker_url and the
celery_result_backend in the [celery] section (the result backend can use the same URL as the broker):
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
For the configuration file to be loaded, we need to reset the database:
airflow resetdb
If the previous steps were followed correctly, we can now start the airflow webserver and access the webUI:
airflow webserver
To access the webserver, configure the security group of your EC2 instance and make sure the port 8080 (default airflow webUI port) is open to your computer. Open a web browser, copy and paste your EC2 instance ipv4 address, followed by
:8080, and the webUI should pop up. However, we are only halfway through. Close the browser and the airflow webserver.
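If the webUI does not load, it helps to know whether port 8080 is reachable at all before digging into Airflow itself. The sketch below uses only the Python 3 standard library; the function name port_is_open and the address in the usage example are placeholders, not values from this installation.

```python
import socket


def port_is_open(host, port=8080, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only placeholder; use your EC2 ipv4 address
    print(port_is_open("203.0.113.10", 8080))
```

A False result while the webserver is running usually points at the security group or a firewall rather than at Airflow.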
Rabbitmq and Celery
Rabbitmq is the core component supporting airflow on distributed computing systems. To install the rabbitmq, run the following command:
sudo apt-get install rabbitmq-server
And change the configuration file
And start a rabbitmq service:
sudo service rabbitmq-server start
Celery is the Python task queue library that airflow uses on top of rabbitmq. However, at the time this installation was carried out, airflow 1.8 had compatibility issues with celery 4.0.2 due to the librabbitmq library. So make sure to install celery version 3 instead.
sudo pip install 'celery>=3.1.17,<4.0'
If you accidentally installed celery 4.0.2, you need to uninstall it before installing the lower version:
sudo pip uninstall celery
Otherwise, there will be a confusing error message when you call the
airflow worker:
Received and deleted unknown message. Wrong destination?!?
The airflow webUI
Now airflow and the webUI are ready to shine. Let’s see how to put dags (directed acyclic graphs, which describe task workflows) in it and run them. We need to create a
dags directory in the airflow home directory:
mkdir dags. Write some test dags and put them in the
dags directory. Reload the dags:
For the airflow webUI to work, we need to start a webserver and click the run button for a dag. Under the hood, the run button will trigger the
scheduler to distribute the dag in a task queue (rabbitmq) and assign
workers to carry out the task. So we need to have all three airflow components (
webserver,
scheduler, and
worker) running. Since we installed the
webserver, the
scheduler, and the
worker on the same EC2 instance, we had memory limitations and were not able to run all three components at once. Instead, we opened up the
airflow webserver and
airflow scheduler first, clicked the run button for the test dag, closed the
airflow webserver, and opened the
airflow worker. The scheduler assigned the tasks in the queue to the workers, and the workers carried out the tasks. The scheduler and the workers recorded their activities in their respective logs in the airflow home directory. After the workers finished the task, we terminated the workers and reopened the webserver, and the test dag in the webUI was marked successful.
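To make the idea of a dag more concrete: a dag only records which tasks depend on which, and a valid run order follows from that. The sketch below illustrates this with the standard-library graphlib module (Python 3.9 or later). It is not Airflow code and does not use the Airflow API; the task names are hypothetical and simply mirror an extract, transform, load workflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names. Each key maps a task to the set of tasks
# that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_order(deps):
    """Return one valid execution order for the tasks in deps."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is why workflows must be acyclic
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(run_order(dependencies))
```

In Airflow the scheduler plays this role, with the extra work of placing each ready task on the queue for a worker to pick up.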
A word about a few useful optional arguments:
-D: this argument can make the process run as a Daemon in the background
-c: this argument controls the maximum number of workers that can be triggered by
airflow worker. The default number can also be set in the
airflow.cfg file. It is handy when there is a working memory limitation on the server.
Disable automatic catchup
By default, Airflow will try to backfill jobs that it may have missed. In our use case, this is not the desired behavior. To change this, open the
airflow.cfg file, find the
catchup_by_default variable and set its value to
False.
Safety and Environment Variables Configuration
The following sections describe how to perform important steps for configuring the Airflow installation.
It is strongly recommended that you enable web authentication on your Airflow server. This is easily done in the following two steps.
- In the
airflow.cfg file, find the
[webserver] section and add the lines below.
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
Note: The file contains a line with
authenticate = False. Be sure to remove that line.
- Navigate to the airflow directory and open a Python interpreter.
$ cd ~/airflow
$ python
Run the commands shown below to create a user.
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'new_user_name'
user.email = 'firstname.lastname@example.org'
user.password = 'set_the_password'
session = settings.Session()
session.add(user)  # stage the new user
session.commit()  # write it to the airflow database
session.close()
exit()
Now when you access the server in your browser, you will first have to authenticate on a login page.
Add Postgres Connections
Before you add any of your connections, it is strongly recommended that you enable encryption so that your database passwords and API keys are not stored in plain text.
Install required libraries and packages
sudo apt-get install gcc libffi-dev python-dev libssl-dev
sudo pip install cryptography
sudo pip install airflow[crypto]
Generate an encryption key
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
echo "$FERNET_KEY"
Open the
~/airflow/airflow.cfg file and replace the placeholder value for the
fernet_key with your key.
# Secret key to save connection passwords in the db
fernet_key = <YOUR_KEY_HERE>
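A mistyped or truncated key is easy to miss until Airflow fails to decrypt a connection. A Fernet key is 32 random bytes encoded as URL-safe base64, so its shape can be checked with the standard library alone. The sketch below assumes Python 3; the helper name looks_like_fernet_key is made up for illustration, and it checks only the format, not whether the key matches data that is already encrypted.

```python
import base64
import os


def looks_like_fernet_key(key):
    """Return True if key decodes from URL-safe base64 to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key)
    except ValueError:
        # binascii.Error (bad padding or characters) is a ValueError subclass
        return False
    return len(raw) == 32


if __name__ == "__main__":
    # Same construction Fernet.generate_key() uses: 32 random bytes, base64-encoded
    candidate = base64.urlsafe_b64encode(os.urandom(32)).decode()
    print(looks_like_fernet_key(candidate))
    print(looks_like_fernet_key("<YOUR_KEY_HERE>"))
```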
Restart the Airflow server
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p 8080 &
Set up a Postgres connection
Note: Before beginning, make sure to add the
airflow Security Group on AWS to one of the Security Groups authorized to access the RDS instance you will be connecting to.
Adding a Postgres connection is easy. In the Web UI, click Admin -> Connections -> Create. Enter the settings for your connection as shown by the example below.
Conn Id: some-db
Conn Type: Postgres
That’s it. You should now see your database connection appear under Data Profiling -> Ad Hoc Query.
Set Airflow Variables and Environment Variables
In the Web UI, click Admin -> Variables. Create the following keys and add their corresponding values. The AWS user represented by the key will need read/write access to the bucket specified.
Make a directory within the home directory for parsing.
Save this as an environment variable by adding a line for it to your
~/.bashrc file, then run
source ~/.bashrc to reload the shell.