Airflow on AWS EC2 instance with Ubuntu

Abraham Pabbathi
Jan 27, 2020


Airflow is fast becoming the de facto orchestration tool for many data engineering organizations. I, for one, am new to Airflow and wanted to set up an Airflow server and play around with it to understand what the buzz is all about. But little did I know how complicated the installation would be. After reading through multiple Medium posts and Stack Overflow answers I was finally able to set up a working Airflow server. Since none of those resources completely captured what I actually had to do to get the server up and running, I wanted to write this post to make it easy for folks who are in the same boat.

Without further ado, let me jump right into it. The use case I was pursuing was to set up an Airflow server on an AWS EC2 instance running Ubuntu 18.04 and use it to trigger Databricks jobs.

Step 1: Stand up the EC2 Instance

Log in to your AWS account and navigate to Services > EC2. Launch an instance and select the Ubuntu AMI. On the next screen you will be prompted to select an instance size. You want to go with a fairly big instance, as t2.micro doesn't cut it. I chose t2.medium and that seems to work just fine for small workloads. If you are installing a production version you may want to go even bigger, like m5a.xlarge. Next, ensure Auto-assign Public IP is set to Enable. Accept the defaults for the next few screens. When it asks you to set up the security group, add a rule that opens port 8080 to the public, as that's the port through which you connect to the Airflow server.
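If you prefer to script that last networking step, the same port-8080 rule can be added with the AWS CLI. This is just a sketch; the security group ID below is a placeholder for the group attached to your instance:

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 8080 \
    --cidr 0.0.0.0/0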

Step 2: Install Postgres Server on the EC2 Instance

By default Airflow uses SQLite to store its metadata, but if you want a fairly robust installation you will want to use a PostgreSQL database instead.

Next you need to ssh into the server to install and configure the PostgreSQL database. Here are the steps to follow:

sudo apt-get update
sudo apt-get install python-psycopg2
sudo apt-get install postgresql postgresql-contrib

Also create an OS user called 'airflow' to do the rest of the installation:

sudo adduser airflow
sudo usermod -aG sudo airflow
su - airflow

From here on, please make sure you are logged in as the airflow user. Going back to the ubuntu user will mess up the installation.

Once the PostgreSQL server is installed, it's time to create the database and the user that will access it.

sudo -u postgres psql

Execute the following commands to create the airflow database and user
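Based on the connection string used later in airflow.cfg (user airflow, password a1rfl0w, database airflow), the commands at the psql prompt look something like this:

CREATE USER airflow WITH PASSWORD 'a1rfl0w';
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
\q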

Next we need to change a few config settings so that the Airflow server can connect to the PostgreSQL database. Find the pg_hba.conf file and edit it:

sudo nano /etc/postgresql/10/main/pg_hba.conf

Find the line below

# IPv4 local connections:
host    all             all             127.0.0.1/32            md5

and change it to

# IPv4 local connections:
host    all             all             0.0.0.0/0               trust

Next open the following config file

sudo nano /etc/postgresql/10/main/postgresql.conf

Uncomment the listen_addresses line and change it from

listen_addresses = 'localhost'          # what IP address(es) to listen on

to

listen_addresses = '*'                  # what IP address(es) to listen on

Restart the PostgreSQL server to pick up the config changes:

sudo service postgresql restart

Step 3: Install Airflow server

Next we will install the Airflow server itself:

su - airflow
sudo apt-get install python3-pip
sudo python3 -m pip install apache-airflow[postgres,s3,aws,azure,gcp,slack]

You can test whether the Airflow installation was successful by executing the following commands:

airflow initdb
airflow webserver

Next, go to http://<ec2 public ip address>:8080 in your browser.

You should see the airflow UI.

Step 4: Connect Airflow to PostgreSQL

Next we will connect the Airflow server to the PostgreSQL database as its metadata store.

Open airflow.cfg in the Airflow home directory /home/airflow/airflow and look for the sql_alchemy_conn setting, which by default points at SQLite:

sql_alchemy_conn = sqlite:////home/airflow/airflow/airflow.db

Change it so that Airflow connects to your PostgreSQL database instead:

sql_alchemy_conn = postgresql+psycopg2://airflow:a1rfl0w@localhost:5432/airflow

Next, change the executor to LocalExecutor. The Airflow documentation explains the differences between the available executors.

# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, KubernetesExecutor
executor = LocalExecutor

Finally, you don't want all the example DAGs cluttering your UI, so set load_examples to False.

load_examples = False
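Since sql_alchemy_conn now points at PostgreSQL, re-run initdb so the metadata tables get created in the new database (the earlier run only created them in SQLite):

airflow initdb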

Step 5: Create DAGs

Next we will go to /home/airflow/airflow and create a directory called dags.

Then navigate to that directory and create three DAG files; the code for each is below.

hello_world.py

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    return 'Hello world!'


dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
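A quick way to sanity-check the DAG file (assuming the Airflow 1.10-style CLI used throughout this post) is to list the DAGs and run the hello task once outside the scheduler:

airflow list_dags
airflow test hello_world hello_task 2020-01-27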

databricks_trigger_job.py: make sure you set the right job_id from databricks

import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator

args = {
    'owner': 'airflow',
    'email': ['example@example.com'],
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(1)
}

dag = DAG(
    dag_id='databricks_trigger_job', default_args=args,
    schedule_interval=None)  # no schedule; trigger manually or via the API

job_run = DatabricksRunNowOperator(
    task_id='job_task',
    dag=dag,
    job_id=17160)

job_run

databricks_create_job.py: make sure you set the right notebook path

import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

args = {
    'owner': 'airflow',
    'email': ['example@example.com'],
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(1)
}

dag = DAG(dag_id='databricks_create_job', default_args=args,
          schedule_interval=None)

abe_cluster = {
    'spark_version': '6.2.x-scala2.11',
    'node_type_id': 'i3.xlarge',
    'aws_attributes': {'availability': 'ON_DEMAND'},
    'num_workers': 2
}

# The Runs Submit payload expects a 'new_cluster' (or 'existing_cluster_id') key
notebook_task_params = {
    'new_cluster': abe_cluster,
    'notebook_task': {
        'notebook_path': '/Users/xxxxxx/ETL Patterns/python/airflow-notebook',
    },
}

notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    dag=dag,
    json=notebook_task_params)

notebook_task
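One caveat: the Databricks operators live under airflow.contrib, and their dependencies are covered by the databricks extra, which was not part of the pip install earlier. If the import fails on your instance, installing the extra (same assumptions as that earlier install command) should sort it out:

sudo python3 -m pip install apache-airflow[databricks]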

Step 6: Set up the Airflow Webserver and Scheduler to start automatically

We are almost there. The final thing we need to do is ensure Airflow starts up when your EC2 instance starts. First create a systemd service file for the webserver:

sudo nano /etc/systemd/system/airflow-webserver.service

Paste the following into the file created above

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service
Wants=postgresql.service

[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Next we will create a similar file for the scheduler service:

sudo nano /etc/systemd/system/airflow-scheduler.service

Paste the following

[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service
Wants=postgresql.service

[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Next, enable and start both services:

sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-webserver
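You can confirm both services came up, and tail their logs if something looks off, with the usual systemd tooling:

sudo systemctl status airflow-webserver
sudo systemctl status airflow-scheduler
sudo journalctl -u airflow-scheduler -f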

Now restart the EC2 instance; the webserver and scheduler should come up automatically, with the Airflow UI reachable on port 8080 as before.

Finally, to make sure your Databricks DAGs can reach your Databricks workspace, you need to set up a Databricks connection. To accomplish this, go to Admin > Connections and configure the connection using a Databricks access token.
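The exact values depend on your workspace, but assuming you use the default databricks_default connection id that the contrib operators look for, the connection ends up with roughly these fields:

Conn Id: databricks_default
Host: https://<your-workspace>.cloud.databricks.com
Extra: {"token": "<your-databricks-access-token>"}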

Conclusion

In this post we saw how to set up a development Airflow server and create DAGs that trigger Databricks jobs. To set up a production Airflow server you may need to look into running multiple worker nodes with Celery. Hope this post was helpful. If it helped you, or if you got stuck at any point, please post a comment and I will try to improve this post. Thanks for reading.
