Installing Apache Airflow on Ubuntu/AWS
A key component of our Kraken Public Data Infrastructure, which automates ETL workflows for public water and street data, is a cloud-hosted instance of Apache Airflow.
To understand why Airflow matters when building a data infrastructure, I recommend first reading this post authored by Maxime Beauchemin and his description of Data Engineering.
ARGO exists to build, operate, and maintain public data infrastructure. We currently do this for a coalition of California water utilities who administer water to over 22 million residents of Southern California.
Data infrastructure is curiously analogous to water infrastructure.
Consider that untreated water from ground wells or snowpack requires serious physical infrastructure in the form of pipes, aqueducts and treatment facilities to guide its flow into the taps of our homes and businesses.
Similarly, water usage data that comes in different shapes and sizes from the various water retailers needs to be refined before it can power a shared analytics platform.
For us, this means automating a series of steps: securely Extracting water data from the source, Transforming this data by relying on a trusted community of data parsers, and then Loading the refined data into the SCUBA Database, which powers the core suite of analytics made available to the CaDC’s subscribing utilities.
Airflow was originally built by Airbnb’s data engineering team and subsequently open sourced into Apache Airflow.
ARGO is one of many data organizations that use Airflow for core operations.
To that end, we wanted to give back to the community and make installing this fine piece of software accessible to more purpose-driven and public data organizations.
What follows is a complete step-by-step installation of Apache Airflow on AWS.
Full credit for putting these instructions together goes to our Public Data Warrior, Xia Wang, who was key to researching, testing, and implementing Airflow in its early stages.
Create an Ubuntu Server Instance on Amazon Web Services
We will not go into detail on how to create an AWS instance; Amazon’s instructions cover that.
We use an Ubuntu Server 16.04 LTS image. It is also important to create at least a t2.medium instance; anything smaller is not recommended.
Install Python, pip, Airflow and dependencies
Install Python and pip
Assuming we start from a clean Ubuntu server, we first need to install Python and the Python package management tool pip.
To install Python 2.7:
sudo apt-get install python-setuptools
To install pip:
sudo apt-get install python-pip
These two commands give us the bare minimum to kickstart the airflow installation.
Note: if the default installed pip is not up to date, you may want to upgrade it:
sudo pip install --upgrade pip
Install relational database (postgres) and configure the database
Airflow ships with a sqlite database backend. To run the data pipeline from the webUI, however, we need a more powerful database backend, and we need to configure it so that airflow has access to it. In our case we decided to install the postgresql database.
sudo apt-get install postgresql postgresql-contrib
So far as we know, the most recent versions of postgresql (8 and 9) don’t have compatibility issues with airflow.
Now that we’ve installed the postgresql database, we need to create a database for airflow and grant access to the EC2 user. To do so, we open the postgresql command line tool psql as postgres' default superuser postgres:
sudo -u postgres psql
We then get a psql prompt that looks like postgres=#. Here we can type SQL statements to add a new user (ubuntu in our case), create the airflow database, and grant the user privileges on it. The LOGIN attribute is what allows the new role to connect.
`CREATE ROLE ubuntu LOGIN;
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES on database airflow to ubuntu;
ALTER ROLE ubuntu SUPERUSER;
ALTER ROLE ubuntu CREATEDB;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ubuntu;`
After leaving the prompt with \q, we can connect to the new database as the ubuntu user:
psql -d airflow
Typing
\conninfo
at the prompt shows the connection information.
One last thing we need to configure for the postgresql database is pg_hba.conf. The query:
SHOW hba_file;
returns the location of the pg_hba.conf file (it's likely in /etc/postgresql/9.*/main/). Open the file with a text editor (vi, emacs or nano), change the ipv4 address to 0.0.0.0/0, and change the ipv4 connection method from md5 (password) to trust if you don't want to use a password to connect to the database. We also need to edit the postgresql.conf file to open the listen address to all ip addresses:
listen_addresses = '*'
Then we start the postgresql service:
sudo service postgresql start
Any time we modify the connection settings, we need to reload the postgresql service for the change to be recognized (a change to listen_addresses only takes effect after a full restart of the service):
sudo service postgresql reload
Install airflow and configure it
Before installing, we can set the default airflow home directory:
export AIRFLOW_HOME=~/airflow
The next step is to install the system and python packages for airflow. A general tip: if an error arises during the installation, note which package failed, install the dependency for that package, and try again.
First install the following dependencies:
sudo apt-get install libmysqlclient-dev
(dependency for airflow[mysql] package)
sudo apt-get install libssl-dev
(dependency for airflow[crypto] package)
sudo apt-get install libkrb5-dev
(dependency for airflow[kerberos] package)
sudo apt-get install libsasl2-dev
(dependency for airflow[hive] package)
After installing these dependencies, we can install airflow and its packages. (You can modify these packages depending on need. Celery and RabbitMQ are needed to use the Web-based GUI)
sudo pip install airflow[async,devel,celery,crypto,druid,gcp_api,jdbc,hdfs,hive,kerberos,ldap,password,postgres,qds,rabbitmq,s3,samba,slack]
Update 11/2018 (🙏 Andrew Stroup): while installing airflow you may run into this error:
RuntimeError: By default one of Airflow’s dependencies installs a GPL dependency (unidecode). To avoid this dependency set SLUGIFY_USES_TEXT_UNIDECODE=yes in your environment when you install or upgrade Airflow. To force installing the GPL version set AIRFLOW_GPL_UNIDECODE
After successfully installing airflow and its packages, we initialize Airflow’s database:
airflow initdb
This also sets up the first-time configs: an airflow.cfg file is generated in the airflow home directory. We should open it with a text editor and change some settings in the [core] section:
- for the executor, we should use CeleryExecutor instead of SequentialExecutor if we want to run the pipeline in the webUI:
executor = CeleryExecutor
- for the backend DB connection, we should pass along the connection info of the postgresql database airflow we just created:
sql_alchemy_conn = postgresql+psycopg2://ubuntu@localhost:5432/airflow
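To make the pieces of that connection string easier to read, here is a minimal sketch that splits it into its parts using only the Python standard library. It is written for Python 3 (not the Python 2.7 installed above), and the helper name describe_conn_uri is made up for this illustration; it is not part of Airflow or SQLAlchemy.

```python
from urllib.parse import urlsplit


def describe_conn_uri(uri):
    """Split a SQLAlchemy-style connection URI into named parts.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    # The scheme has the form "dialect+driver", e.g. "postgresql+psycopg2".
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "password": parts.password,  # None when no password is given
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    uri = "postgresql+psycopg2://ubuntu@localhost:5432/airflow"
    for name, value in describe_conn_uri(uri).items():
        print("{:<9} {}".format(name, value))
```

In the value used in this guide there is no password, because pg_hba.conf was switched to trust; with md5 authentication the password would sit between the user name and the @, separated by a colon.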
If you don’t want the example dags to show up in the webUI, you can set the load_examples variable to False. Save and quit.
To prepare for the next steps, we also need to set the broker_url and celery_result_backend in the [celery] section (the result backend can use the same value as broker_url):
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
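Since these edits are made by hand, it can help to double-check them before moving on. The following is a rough sketch, assuming Python 3 and the default ~/airflow/airflow.cfg location; the function name find_mismatches and the EXPECTED table are made up for this example and simply mirror the values used in this guide.

```python
import configparser
import os

# Settings this guide changes; adjust to match your own installation.
EXPECTED = {
    ("core", "executor"): "CeleryExecutor",
    ("core", "sql_alchemy_conn"): "postgresql+psycopg2://ubuntu@localhost:5432/airflow",
    ("celery", "broker_url"): "amqp://guest:guest@localhost:5672//",
    ("celery", "celery_result_backend"): "amqp://guest:guest@localhost:5672//",
}


def find_mismatches(cfg_path, expected=EXPECTED):
    """Return {(section, option): actual_value} for settings that differ.

    A missing section or option is reported with the value None.
    """
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates an option that appears twice in the same section.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read(cfg_path)  # a missing file is silently skipped
    mismatches = {}
    for (section, option), wanted in expected.items():
        actual = parser.get(section, option, fallback=None)
        if actual != wanted:
            mismatches[(section, option)] = actual
    return mismatches


if __name__ == "__main__":
    path = os.path.expanduser("~/airflow/airflow.cfg")
    for (section, option), actual in find_mismatches(path).items():
        print("[{}] {} is {!r}".format(section, option, actual))
```

The script only reads the file and never writes it, so the comments in airflow.cfg stay untouched.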
For the updated configuration to be loaded, we need to initialize the database again:
airflow initdb
If the previous steps were followed correctly, we can now call the airflow webserver, and access the webUI:
airflow webserver
To access the webserver, configure the security group of your EC2 instance and make sure port 8080 (the default airflow webUI port) is open to your computer. Open a web browser and enter your EC2 instance's ipv4 address followed by :8080, and the webUI should pop up. However, we are only halfway through. Close the browser and stop the airflow webserver.
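If the page does not load while the webserver is running, one quick thing to rule out is whether the port is reachable at all. Below is a small sketch using only the Python 3 standard library; port_is_open is a hypothetical helper name, not an Airflow command.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Run on the EC2 instance itself to see whether the webserver is listening.
    print("localhost:8080 reachable:", port_is_open("localhost", 8080))
```

Running the same check from your own computer against the instance's ipv4 address helps tell the two cases apart: if the port is open locally on the instance but not from outside, the security group is the likely culprit.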
Rabbitmq and Celery
Rabbitmq is the core component supporting airflow on distributed computing systems: it is the message broker that holds the task queue. To install rabbitmq, run the following command:
sudo apt-get install rabbitmq-server
Then edit the configuration file /etc/rabbitmq/rabbitmq-env.conf:
NODE_IP_ADDRESS=0.0.0.0
And start the rabbitmq service:
sudo service rabbitmq-server start
Celery is the python task queue that airflow uses to hand work to workers through rabbitmq. At the time this installation was carried out, however, airflow 1.8 had compatibility issues with celery 4.0.2 due to the librabbitmq library, so make sure to install celery version 3 instead.
sudo pip install 'celery>=3.1.17,<4.0'
If you accidentally installed celery 4.0.2, you need to uninstall it before installing the lower version:
sudo pip uninstall celery
Otherwise, you will get a confusing error message when you call airflow worker: Received and deleted unknown message. Wrong destination?!?
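To confirm that the installed celery falls inside the range pinned above, a version string can be checked against the same bounds. This is only a sketch: celery_version_supported is a made-up helper, and it compares just the leading numeric parts of the version.

```python
def celery_version_supported(version):
    """Return True if version satisfies >=3.1.17,<4.0 (the pin used above)."""
    numbers = []
    for piece in version.split("."):
        # Stop at the first non-numeric piece, e.g. "0rc1" or "post2".
        if not piece.isdigit():
            break
        numbers.append(int(piece))
    return (3, 1, 17) <= tuple(numbers) < (4, 0)


if __name__ == "__main__":
    for candidate in ("3.1.17", "3.1.25", "4.0.2"):
        print(candidate, celery_version_supported(candidate))
```

On the server, the string to pass in would be celery.__version__ after import celery.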
The airflow webUI
Now airflow and the webUI are ready to shine. Let’s see how to put dags (directed acyclic graphs, i.e. task workflows) in it and run them. We need to create a dags directory in the airflow home directory: mkdir dags. Write some test dags and put them in the dags directory. To load the dags, start the airflow components:
airflow webserver
airflow scheduler
airflow worker
For the airflow webUI to work, we need to start a webserver and click the run button for a dag. Under the hood, the run button triggers the scheduler to place the dag's tasks in a task queue (rabbitmq) and assign workers to carry them out. So we need all three airflow components (webserver, scheduler and worker) running.
Since we installed the scheduler and the worker on the same EC2 instance, we had memory limitations and were not able to run all three components at once. Instead, we started the airflow webserver and airflow scheduler first, clicked the run button for the test dag, then closed the airflow webserver and started the airflow worker. The scheduler assigned the tasks in the queue to the workers, and the workers carried out the tasks. The scheduler and the workers recorded their activities in their respective logs in the airflow home directory. After the workers finished the task, we terminated the workers and reopened the webserver, and the test dag in the webUI was marked successful.
A word about a few useful optional arguments:
-D: this argument makes the process run as a daemon in the background
-c: this argument controls the maximum number of workers that can be triggered by airflow worker. The default number can also be set in the airflow.cfg file. It is handy when working memory is limited on the server.
Application Configuration
Disable automatic catchup
By default, Airflow will try to backfill jobs that it may have missed. In our use case, this is not the desired behavior. To change this, open the airflow.cfg file, find the catchup_by_default variable and set its value to False.
Safety and Environment Variables Configuration
The following sections describe how to perform important steps for configuring the Airflow installation.
Enable Authentication
It is strongly recommended that you enable web authentication on your Airflow server. This is easily done in the following two steps.
- In the airflow.cfg file, find the [webserver] section and add the lines below.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
Note: The file contains a line with authenticate set to False. Be sure to remove that line.
- Navigate to the airflow directory and open a Python interpreter.
$ cd ~/airflow
$ python
Run the commands shown below to create a user.
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
# Wrap a new User record for the password auth backend
user = PasswordUser(models.User())
user.username = 'new_user_name'
user.email = 'new_user_email@example.com'
user.password = 'set_the_password'
# Save the new user to the Airflow metadata database
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
Now when you access the server in your browser, you will first have to authenticate on a login page.
Add Postgres Connections
Before you add any of your connections, it is strongly recommended that you enable encryption so that your database passwords and API keys are not stored in plain text.
Enabling Encryption
Install required libraries and packages
sudo apt-get install gcc libffi-dev python-dev libssl-dev
sudo pip install cryptography
sudo pip install airflow[crypto]
Generate an encryption key
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
echo $FERNET_KEY
Edit the ~/airflow/airflow.cfg file, replacing the placeholder value for the fernet_key with your key.
# Secret key to save connection passwords in the db
fernet_key = <YOUR_KEY_HERE>
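A pasted key that is truncated or still contains the placeholder is a common slip. A Fernet key is URL-safe base64 text that decodes to exactly 32 bytes, so its shape can be checked with the standard library alone. The sketch below assumes Python 3, and looks_like_fernet_key is a hypothetical helper name; it checks the format only, not whether the key matches data already encrypted in the database.

```python
import base64
import binascii


def looks_like_fernet_key(key):
    """Return True if key is URL-safe base64 that decodes to 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key.encode("ascii"))
    except (binascii.Error, ValueError):
        # Bad padding, or characters outside ASCII.
        return False
    return len(raw) == 32


if __name__ == "__main__":
    # The unedited placeholder from airflow.cfg fails the check.
    print(looks_like_fernet_key("<YOUR_KEY_HERE>"))
```

A key produced by the Fernet.generate_key() command above should pass this check.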
Restart the Airflow server
cat $AIRFLOW_HOME/airflow-webserver.pid | xargs kill -9
airflow webserver -p 8080 &
Set up a Postgres connection
Note: Before beginning, make sure the airflow Security Group on AWS is one of the Security Groups authorized to access the RDS instance you will be connecting to.
Adding a Postgres connection is easy. In the Web UI, click Admin -> Connections -> Create. Enter the settings for your connection as shown by the example below.
Conn Id: some-db
Conn Type: Postgres
Host: something-something.us-region-2.rds.amazonaws.com
Schema: something
Login: user
Password: ********
Port: 5432
Click Save.
That’s it. You should now see your database connection appear under Data Profiling -> Ad Hoc Query.
Set Airflow Variables and Environment Variables
Airflow Variables
In the Web UI, click Admin -> Variables. Create the following keys and add their corresponding values. The AWS user represented by the key will need read/write access to the bucket specified.
bucket_name
aws_access_key_id
aws_secret_access_key
Environment Variables
Make a directory within the home directory for parsing.
mkdir ~/parse
Save this as an environment variable by adding the following line to your .bashrc file:
export PARSE_DIR=~/parse
Run source ~/.bashrc to reload the shell.
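Python code that needs the parsing directory can then read it from the environment. This is a minimal sketch, assuming the code runs in a shell where PARSE_DIR has been exported; get_parse_dir is a hypothetical helper, and the ~/parse fallback simply mirrors the directory created above.

```python
import os


def get_parse_dir(default="~/parse"):
    """Return the parsing directory from PARSE_DIR, falling back to default."""
    path = os.environ.get("PARSE_DIR", default)
    # Expand a leading '~' in case the value was stored unexpanded.
    return os.path.abspath(os.path.expanduser(path))


if __name__ == "__main__":
    print(get_parse_dir())
```

Keep in mind that processes started outside an interactive shell, such as daemons, may not read .bashrc, in which case the fallback value is what gets used.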