Apache Airflow — Part 2

Setting up Airflow using an AWS EC2 Instance

Wairagu Wilberforce
Data Pulse for AI
6 min read · Feb 21, 2023


For the second part of this series, we are going to look at one of the simplest ways to set up Airflow and get you up to speed on building scalable data pipelines. The overall objective of this project will be to pull data from the World Bank API. Before getting into the nitty-gritty of the project, let’s first look at setting up Apache Airflow.

Some of the most common methods of setting up Airflow include:

1. Using Python’s official third-party software repository, PyPI: the most straightforward way to install Airflow from PyPI is by running the following command from your preferred local terminal:

pip install apache-airflow

Once installed, you can start the Airflow webserver and scheduler using the following commands:

airflow webserver -p 8080
airflow scheduler
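
Note that on a fresh installation, Airflow’s metadata database has to be initialized before the webserver and scheduler will start cleanly. A minimal sketch, assuming the default SQLite backend and the default ~/airflow home directory:

# optional: choose where Airflow keeps its config, logs and SQLite database
export AIRFLOW_HOME=~/airflow

# initialize the metadata database before starting any components
airflow db init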

2. Using Docker: You can also use Docker to set up Apache Airflow. This method involves building a Docker image and running it as a container, and it requires Docker to be installed on your system. Assuming you have a Dockerfile in your current directory, you can build an Airflow Docker image with the following command:

docker build -t my-airflow .

Once the image is created, you can start Airflow by running the following commands:

docker run -d -p 8080:8080 my-airflow webserver
docker run -d my-airflow scheduler
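
If you would rather not maintain your own Dockerfile, the Airflow project also publishes an official docker-compose setup. A rough sketch of that route, assuming Airflow 2.5.1 and a recent Docker Compose installation:

# download the official docker-compose.yaml for the 2.5.1 release
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.1/docker-compose.yaml'

# run database migrations and create the default user, then start all services
docker compose up airflow-init
docker compose up -d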

3. Using the Apache Airflow Helm Chart: Helm is a package manager for Kubernetes that allows you to install and manage applications on Kubernetes clusters. The official chart bootstraps an Airflow deployment on a Kubernetes cluster using the Helm package manager. To install Airflow this way, you need a Kubernetes cluster and Helm installed. You can install the Airflow Helm chart by running the following command:

helm install apache-airflow apache-airflow/airflow
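
Before that command will work, the official chart repository has to be registered with your local Helm client. A short sketch, assuming you want the deployment in a dedicated airflow namespace:

# register the official Apache Airflow chart repository and refresh the index
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# install the chart into its own namespace (created on the fly)
helm install apache-airflow apache-airflow/airflow --namespace airflow --create-namespace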

4. Using cloud providers such as AWS: AWS provides a variety of options for deploying Airflow that can be categorized as IaaS, PaaS, or SaaS offerings. The three main options for deploying Airflow on AWS are:

i. Using AWS EKS (Elastic Kubernetes Service): run Airflow on a managed Kubernetes cluster, typically via the Helm chart above.

ii. Using an EC2 instance (Elastic Compute Cloud): install and run Airflow directly on a virtual machine that you manage.

iii. Using AWS MWAA (Managed Workflows for Apache Airflow): a fully managed Airflow service where AWS handles the underlying infrastructure.

In this article, we’ll focus on option 4 (ii), where we’ll set up an Ubuntu instance and install Airflow on it. We’ll also look at some of the perks that come with this approach as well as some of its limitations.

Setting up an AWS EC2 Instance

For readers new to the AWS cloud, it’s a comprehensive cloud computing platform that includes infrastructure as a service (IaaS) and platform as a service (PaaS) offerings. In simple terms, it lets you access computing services over the internet from anywhere. These services include compute, storage, databases, analytics, application integration, and many more. Here’s detailed documentation on how to create your AWS account and get started with the cloud.

The AWS service primarily responsible for offering compute resources is AWS EC2, which stands for Elastic Compute Cloud. It’s a web-based service that provides resizable compute capacity; in effect, you get to build and host your applications on servers in Amazon’s data centers. Let’s get started with spinning up the Ubuntu instance that we’ll use to install and run Airflow.

1. From your AWS Management Console, search for EC2 and, once on the EC2 dashboard, click on the Launch instance button as shown:

2. Next, give your instance a name; in this case we’ll call it “airflow_instance”. Then select a 64-bit Ubuntu AMI image. To learn more about AMIs, visit this link.

3. For the instance type, select t2.medium from the drop-down list, which offers 2 vCPUs and 4 GB of RAM and is sufficient to run Airflow. This type is past the free tier limits and will incur a small hourly on-demand charge (a few US cents per hour, depending on the region).

4. Additionally, you should create a key pair to use when accessing your instance. Once created, the key pair file will automatically be downloaded to your machine with a .pem extension. Make sure to save it in your working directory, as you will need it later on when connecting to your instance via SSH.

5. Under the networking section, create a new security group and be sure to check all three boxes to allow SSH, HTTP, and HTTPS access. This will permit access to your instance from the IP range you choose in the adjacent drop-down list; for starters it’s best to leave it at Anywhere-IPv4. Here is more on security groups.

6. Finally, you can click on the Launch instance button and your EC2 instance will be created in a minute or two.

7. Once created, you can click on your instance and a brief summary of it will show up. Click on the Connect button and navigate to the SSH client tab, copy the SSH command shown, and open your preferred terminal. I’m going to use Git Bash to connect to the instance.

NB: Make sure you run the command from the same working directory where your key pair file is saved; otherwise, an error will appear.
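
For reference, the copied command will look roughly like the sketch below; the key file name and hostname here are placeholders, so use the values shown on your own Connect tab. On Linux, macOS, or Git Bash you may also need to restrict the key file’s permissions first:

# restrict the private key's permissions so SSH will accept it
chmod 400 airflow_keypair.pem

# connect as the default ubuntu user (placeholder hostname)
ssh -i "airflow_keypair.pem" ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com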

Installing and starting the Airflow server

Before starting the installation process, we first need to update the package index and install Python’s pip package manager on our Ubuntu instance using the following commands:

sudo apt-get update
sudo apt install python3-pip

Once this is in order, the next step is to install Airflow. We first check the Python version we are currently running using the following command:

python3 --version

Next, we set an environment variable holding the constraints URL that we will use to install Airflow, based on the Python version we are running.

Execute the following command, replacing {airflow_version} with the Airflow version you want to install and {python_version} with the Python version you are currently running. In my case, I’m installing Airflow 2.5.1 and I’m running Python 3.10.

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-{airflow_version}/constraints-{python_version}.txt"
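
For example, with Airflow 2.5.1 and Python 3.10 as mentioned above, the filled-in command looks like this:

CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-2.5.1/constraints-3.10.txt"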

Once that is done, we can now install Airflow using the following command:

pip install apache-airflow==2.5.1 --constraint "${CONSTRAINT_URL}"

Finally, once Airflow is installed, we can initialize the metadata database and spin up a standalone instance with a default user using the following commands:

airflow db init 

airflow standalone

The standalone command sets up all Airflow components and creates default user credentials for you to log in with. By default, the webserver runs on port 8080, which you can access from any web browser.
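
If you would rather not rely on the auto-generated standalone credentials, you can also create an admin user yourself and run the components individually. A rough sketch; the username, names, and email below are placeholders you should change:

# create an admin user (you will be prompted for a password)
airflow users create \
    --username admin \
    --firstname Jane \
    --lastname Doe \
    --role Admin \
    --email admin@example.com

# then start the webserver and scheduler, e.g. in two separate terminal sessions
airflow webserver -p 8080
airflow scheduler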

To access the Airflow UI from your browser, head to your instance details page in the Management Console, copy the Public IPv4 address into a new browser tab, and add “:8080” at the end of that URL. Enter the login credentials shown in your terminal and the following page appears, with example DAGs to get you started.
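
If the page does not load, the most likely culprit is that the security group from step 5 does not allow inbound traffic on port 8080. You can add a rule from the instance’s Security tab in the console, or with the AWS CLI; the sketch below uses a placeholder security group ID:

# open port 8080 to the world (replace the group ID with your instance's security group)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 8080 \
    --cidr 0.0.0.0/0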

NB: Always remember to stop or terminate running instances when not in use to minimize costs.

Limitations of using this approach

For production-level workflows, data pipelines need to be triggered either on demand or on a schedule, and schedules can range from every few minutes or hours to daily triggers. For instance, if your project demands a daily trigger, your instance will need to run around the clock, which is costly and inefficient for a single, self-managed server.

This method is therefore best suited for proof-of-concept work, to check whether your Airflow DAGs run successfully, and should not be used for production-level workflows.
