Apache Airflow

Installing Apache Airflow with Pip

A Comprehensive Guide and Analysis of Pros and Cons

Eric Flynn
5 min read · Apr 25, 2023

Introduction

Apache Airflow has become one of the most popular workflow orchestration tools available. It allows programmatic tasks to be authored, scheduled, and monitored as workflows. This article assumes the reader already has a working understanding of Apache Airflow and skips the introduction for conciseness. For a more detailed introduction, check out my article here.

This article is part one of a series dedicated to investigating each method available to install and run an Apache Airflow environment. These include pip, Docker, Kubernetes, and third-party managed services. Every method has its own pros and cons which will be weighed in each corresponding article.

This article will show the reader how to install and configure an Apache Airflow environment using pip/Python. Then the pros and cons of using this environment setup will be discussed.

Preparation

For this article, each of the deployment environments will be installed on a t2.micro instance running Amazon Linux 2023. The only change from the defaults is a modification to the security group to allow inbound traffic on port 8080. This allows the Airflow UI to be accessible over the internet.
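For reference, the same inbound rule can also be added with the AWS CLI; the security-group ID below is a placeholder for your own:

~$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8080 --cidr 0.0.0.0/0

Note that 0.0.0.0/0 opens the port to the entire internet, which is acceptable for a short-lived demo but should be narrowed for anything longer-lived.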

See below for the list of packages that will be utilized in this article.

python = 3.9.16
pip = 23.1
tree = 1.8.0
tmux = 3.2a

Configuring the Environment

Below, I will first show how I installed all the packages needed for the Airflow environment. Then I will configure the environment and explore the Airflow UI.

Setup

1. Install & Upgrade Necessary Packages
~$ sudo yum install python -y

~$ sudo yum install pip -y

~$ sudo yum install tmux -y

~$ pip install --upgrade pip
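The installed versions can be confirmed against the list above:

~$ python3 --version

~$ pip --version

~$ tree --version

~$ tmux -V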

2. Create Directories & Activate Virtual Environment

~$ mkdir airflow-pip && cd airflow-pip

[airflow-pip]$ python3 -m venv venv

[airflow-pip]$ source venv/bin/activate

(venv) [airflow-pip]$

3. Install Airflow, Export Home Directory, and Initialize Airflow Db

(venv) [airflow-pip]$ pip install apache-airflow

(venv) [airflow-pip]$ pwd
/home/ec2-user/airflow-pip

(venv) [airflow-pip]$ export AIRFLOW_HOME=/home/ec2-user/airflow-pip

(venv) [airflow-pip]$ airflow db init

Any warning messages that appear after the db init command can be ignored for now. The airflow-pip directory should now look like this:

(venv) [airflow-pip]$ tree -L 2
.
├── airflow.cfg
├── airflow.db
├── logs
│   └── scheduler
├── venv
│   ├── bin
│   ├── include
│   ├── lib
│   ├── lib64 -> lib
│   └── pyvenv.cfg
└── webserver_config.py
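As an optional sanity check, Airflow can confirm that the metadata database is reachable:

(venv) [airflow-pip]$ airflow db check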

4. Create Admin Role

(venv) [airflow-pip]$ airflow users create \
--role Admin \
--username admin \
--email admin \
--firstname admin \
--lastname admin \
--password admin

(venv) [airflow-pip]$ airflow users list

id | username | email | first_name | last_name | roles
===+==========+=======+============+===========+======
1  | admin    | admin | admin      | admin     | Admin

5. Run the Scheduler & Server in Tmux Sessions

  • Initiate a new tmux session with tmux new -s scheduler
  • Start the scheduler with the airflow scheduler command
  • Detach with Ctrl+b, then d
  • Repeat the same steps for the webserver instead of the scheduler (see the full sequence below)
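Since each new tmux session starts a fresh shell, it is safest to re-activate the virtual environment and re-export AIRFLOW_HOME inside each one. Putting it all together, the sequence looks roughly like this:

~$ tmux new -s scheduler
[airflow-pip]$ source venv/bin/activate
(venv) [airflow-pip]$ export AIRFLOW_HOME=/home/ec2-user/airflow-pip
(venv) [airflow-pip]$ airflow scheduler
# detach: Ctrl+b, then d

~$ tmux new -s webserver
[airflow-pip]$ source venv/bin/activate
(venv) [airflow-pip]$ export AIRFLOW_HOME=/home/ec2-user/airflow-pip
(venv) [airflow-pip]$ airflow webserver
# detach: Ctrl+b, then d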

Access the User Interface

Now that the scheduler and webserver are both running in their own dedicated tmux sessions, the Airflow environment setup is complete. It's time to access the user interface to confirm everything is running properly.

6. Locate Public IP & Navigate

The public IP of an EC2 instance can be found in the “Details” tab under instance settings in the AWS console. I will copy/paste this into my browser, followed by port 8080 (<publicIP>:8080).
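Alternatively, the public IP can be fetched from the instance itself via the EC2 metadata service. Amazon Linux 2023 requires IMDSv2 by default, so a session token is needed first:

~$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

~$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/public-ipv4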

7. Log in with Admin Credentials

Username: admin | Password: admin

8. Investigate the UI

Pros & Cons

Pros

The main benefit of installing Airflow with pip is simplicity. An administrator only needs a working knowledge of pip and how to run Python applications to get a development environment up and running in just five minutes.

It is also very easy to install additional provider packages into an environment of this type. Since everything runs directly on the host machine, provider packages can simply be pip-installed into the active virtual environment. This process is much more complicated for containerized environments, as will be discussed in the following parts of this series.
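For example, adding AWS support to this environment is a single command using the official Amazon provider package:

(venv) [airflow-pip]$ pip install apache-airflow-providers-amazon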

Cons

Right away, you can see multiple warning messages in the UI urging “Do not use **** in production”. These point to the main issue with installing Airflow via pip: it does not come production-ready. Below I will dive a little deeper to explain why.

The first warning suggests using a different database for the Airflow metadata db. The pip installation comes with SQLite by default, which stores all data in a single local file with no true database server. SQLite also doesn't support concurrent writes and cannot scale. It is pretty clear why SQLite is only suited for a development environment.

It is important to note that you can configure this environment to use a more production-ready database backend such as PostgreSQL or MySQL. This, however, involves additional configuration steps, which can be found here.
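As a rough sketch, and assuming a local PostgreSQL instance with a database, user, and password all named "airflow" (placeholders for your own values), switching the backend on a recent Airflow 2.x release looks something like this (older releases use AIRFLOW__CORE__SQL_ALCHEMY_CONN instead):

(venv) [airflow-pip]$ pip install psycopg2-binary

(venv) [airflow-pip]$ export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

(venv) [airflow-pip]$ airflow db init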

By default, the pip installation of Airflow comes with the SequentialExecutor, which can only run one task at a time. This is the only executor that can be used with SQLite, since SQLite doesn't support concurrency. In order to upgrade to a more advanced executor, the database must first be upgraded. Then, it is recommended to switch to the LocalExecutor, which requires additional configuration steps.
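Once a production-grade database is in place, the executor switch itself is a one-line change, either in airflow.cfg or via an environment variable:

(venv) [airflow-pip]$ export AIRFLOW__CORE__EXECUTOR=LocalExecutor

After restarting the scheduler, tasks can run in parallel on the host.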

Conclusion

Utilizing pip to install and manage an Airflow environment is a great choice for individuals who are just learning Airflow. It is extremely simple to set up and doesn't have many moving parts to manage. This can be a great starting point to help a beginner understand the inner workings of Airflow. Also, since it's very easy to install provider packages, this environment can serve as an ideal sandbox to test out features before merging them into a more advanced environment hosted on Docker or Kubernetes.

Where this method falls short is production-readiness. Many additional configuration steps are needed to bring it to a level where it could even be considered remotely production-ready.

In the next article in this series I will show how to set up an Airflow environment via Docker Compose. This method is a bit more complex, as it requires an understanding of containers; however, it comes much closer to production-ready out of the box than the pip installation. Thanks for reading, and don't forget to follow so you don't miss it! 😃
