Airflow-2 Development Environment on GCP Cloud Shell

Zack Amirakulov · Published in Badal-io · Sep 16, 2021

TLDR

We will show you how to set up an automated and feature-rich Airflow 2 development environment on GCP Cloud Shell Code Editor in 10 minutes.

Feel free to skip to the “Running Airflow Development Environment on GCP Cloud Shell” section if you are impatient to get started.

Introduction

It has been a while since Apache Airflow 2.0 was released, and many developers have already started exploring it. The first question that comes up is how to install (or upgrade) and configure a local Airflow-2 development environment that just runs and lets data engineers focus on development without hassle. We are no exception here at Badal.

There are a number of articles and resources elaborating on how to quickly get started with Airflow-2, including the official Quick Start Guide provided by the Airflow community. These may be sufficient; however, we see two common problems our clients face:

  1. Enterprises may restrict the software that needs to be installed on developers’ laptops. Quite often developers cannot install the required dependencies to run Airflow locally or are even unable to install Docker.
  2. Airflow developers need a customized local environment where they can interactively develop, run, and test their workflows.

In this article, we are going to showcase how we run Apache Airflow-2 on the Google Cloud Platform (GCP) Cloud Shell service with Docker Compose, and how this solution addresses common challenges Airflow developers may encounter. Our goal was a local Airflow-2 setup that meets the following criteria:

  • A lightweight, OS-agnostic development environment with minimal prerequisites to start up.
  • Easy to deploy and operate — no admin knowledge required.
  • Be able to customize the environment with variables and external connections.
  • Be able to execute DAG tasks locally.
  • Be able to perform code changes in an IDE with immediate effects in Airflow Scheduler and the Airflow user interface.
  • Be able to run unit and integration tests within the same environment.
  • Integration with CI/CD pipelines to stage and deploy workflows into production environments.

We will walk through the quick initialization and installation of the Airflow-2 setup. The proposed development environment runs in Docker containers, so the solution can also be used locally on a personal computer (PC); however, we will focus on running an Airflow dev environment with Docker Compose on GCP Cloud Shell.

The proposed Airflow setup has minimal requirements: just install the latest stable versions of Git, Docker, and Docker Compose, and you are good to go. Detailed step-by-step guidelines for each operating system can be found in the project repository:

https://github.com/badal-io/airflow2-local-ci-cd

The project can also be integrated into an automated continuous integration/continuous delivery (CI/CD) process using GCP Cloud Build to test and deploy workflows into GCP Cloud Composer; however, that will be covered in Part II.

Proposed Development Environment

The proposed Airflow setup is based on the official docker image and docker-compose code that is provided on the Airflow documentation portal. We have customized it in a way that allows us to have a simplified local development environment with all the custom tools and integrations that make the development, testing, and deployment smooth and productive.

We are not going to discuss here what Airflow is and what its components are, as there are plenty of resources on the web explaining its architecture and key concepts. What is important to highlight is which Airflow components we are going to run in Docker:

The code deploys three Docker containers:

  • Airflow Scheduler
  • Airflow Web Server + Worker
  • Postgres DB

Software versions: these can always be changed in the deployment code. Below are the current versions the solution has been tested with:

  • Airflow: 2.1.1
  • Python: 3.8
  • Postgres: 13

Benefits of running an Airflow dev environment on Docker, in our view:

  • Your workspace files are always synchronized with the Docker containers, so with an IDE the development process becomes easier and faster.
  • Unit and Integration tests are run inside a container that is built from the same image as Airflow 2 instances.
  • A local PC environment is not affected by any dependencies or packages installed for Airflow.

The project is assembled and hosted at: https://github.com/badal-io/airflow2-local-ci-cd. We tried to keep the repository neat and clean (at least in the root directory) with an intuitive naming convention and folder structure:

.
├── ci-cd                    # CI/CD deployment configuration
├── dags                     # Airflow DAGs
├── data                     # Airflow data
├── docker                   # Docker configuration
├── gcp-cloud-shell          # Cloud Shell custom scripts
├── helpers                  # Backend scripts
├── logs                     # Airflow logs
├── plugins                  # Airflow plugins
├── tests                    # Tests
├── variables                # Variables for environments
├── .gitignore               # Git ignore rules
├── pre-commit-config.yaml   # Pre-commit hooks
├── LICENSE                  # Project license
├── README.md                # Readme guidelines
└── docker-compose.yaml      # Docker Compose deployment code

  • There are standard Airflow folders such as dags, plugins, data, and logs; these are the ones developers mostly work with.
  • The docker-compose.yaml file contains infrastructure code that deploys all Airflow containers and mounts necessary volumes from a local PC.
  • Since we use Docker-Compose, the Dockerfiles for local and CI/CD deployments are inside the docker folder.
  • All environment variables for the project are placed in the configuration files in the variables folder.
  • All kinds of tests are placed in the tests folder.
  • The helpers folder contains Python and Bash scripts that support automation and provisioning for the project behind the scenes.
  • GCP Cloud Shell customization file is placed in a separate folder gcp-cloud-shell.
  • The ci-cd folder contains code and configuration for a CI/CD pipeline that runs on the GCP Cloud Build service.

GCP Cloud Shell

Let’s discuss in more detail this alternative option that can be interesting for users who, for any reason, cannot have the above-mentioned Docker tools installed on their PCs.

The project has been successfully tested within the GCP Cloud Shell/Editor service. GCP Cloud Shell is an ephemeral cloud virtual machine accessible from anywhere through a web browser. We can manage our resources with its online terminal, which comes preloaded with utilities such as the gcloud command-line tool, kubectl, terraform, docker, and more. We can also develop, build, debug, and deploy cloud-based apps using the online Cloud Shell Editor. More details about the service can be found here.

GCP Cloud Shell Editor, source: https://cloud.google.com/blog/products/application-development/introducing-cloud-shell-editor

We found the following benefits when working with GCP Cloud Shell:

  • Development-ready environment, everything that we need is preinstalled.
  • Persistent disk storage and a $HOME directory that preserves our work.
  • Cloud IDE with version control and terminal — fully functional development environment.
  • Web preview that creates a URL pointing to a web server running on a local port (any port from 2000 to 65000).
  • It is Free!

Having said that, also consider the following limitations and caveats before working with Cloud Shell/Editor:

  • The virtual machine instance that backs your Cloud Shell session is not permanently allocated to a Cloud Shell session and terminates if the session is inactive for 20 minutes.
  • Once the session is terminated, any modifications that you made to it outside your $HOME directory are lost. So you have to re-run the Airflow initialization steps.
  • Cloud Shell sessions are capped at 12 hours, after which sessions are automatically terminated. You can use a new session immediately after.

Running Airflow Development Environment on GCP Cloud Shell

Now, let’s get started with our development environment. First, we need to make sure that the GCP Cloud Shell environment is customized, so there are no surprises during initialization or installation.

Step 1: Access GCP Cloud Shell from a browser using your GCP credentials: https://ide.cloud.google.com. It might take a couple of minutes to initialize.

Step 2: Open a terminal session (Menu > Terminal > New Terminal) and clone the repo using the Git command below, where <repository> is the forked repo you are expected to create from the project.

git clone <repository> 

Step 3: Now we can open the cloned folder with the Cloud Shell IDE. In the Cloud Shell user interface (UI), click on the Open Folder option and select the airflow2-local folder.

Opening a workspace in Cloud Shell IDE

You should see the folder on the left side of the GCP Cloud IDE.

Cloud Shell IDE

Step 4: In a Cloud Shell terminal window, navigate to the project directory and run cloud-shell-init.sh. This will customize the environment and install some prerequisites (mind the correct path to the file). Before executing the script, we need to make sure that we are inside the right directory and that the file is actually executable:

cd airflow2-local-ci-cd
sudo chmod +x ./helpers/scripts/cloud-shell-init.sh
sudo ./helpers/scripts/cloud-shell-init.sh

The successful output should look like this:

root@cloudshell:$ sudo ./helpers/scripts/cloud-shell-init.sh
###### - Cloud Shell has been successfully customized, you can proceed with Airflow deployment!

From now on, the Cloud Shell environment is ready for the Airflow-2 setup.

Customizing Airflow Environment Settings

Before we initialize and start the Airflow services, we also need to customize our future Airflow deployment, in particular:

  • Add Python dependencies to requirements-airflow.txt. The Python requirements will be installed during the image build process. For example:
docker/requirements-airflow.txt
  • We can add Airflow variables by listing them in docker-airflow-vars.json. The variables will be imported/updated in the Airflow database backend on every start-up (a Python sketch at the end of this list shows how these variables are read in a DAG).
variables/docker-airflow-vars.json
  • If we need environment variables in the container OS, we can add them to docker-env-vars. In fact, we can add Airflow variables here as well; this is an alternative to the previous method. The only difference is that the variable names must carry the AIRFLOW_VAR prefix, for example:
 AIRFLOW_VAR_{VARIABLE_NAME}

It is also possible to add Airflow configuration items as environment variables; in this case, the name must follow the pattern AIRFLOW__{SECTION}__{KEY}. For instance:

AIRFLOW__CORE__SQL_ALCHEMY_CONN=my_conn_string

An example of environment variables is below:

variables/docker-env-vars

Bear in mind that Airflow variables/config items added by this method will not be reflected in the Airflow database but will still be accessible. You can read more about this approach here.

  • Sometimes we need variables that contain sensitive information, such as API keys or passwords. Obviously, this type of content cannot be exposed in a public repo. For such cases, there is the docker-env-secrets file. All variables containing secrets can be listed there in the same fashion described above. The file is covered by .gitignore, so there is no public exposure; the secrets always stay in our local environment.
variables/docker-env-secrets
  • Another important step: if we are planning to work with GCP services, we must add our GCP project ID as an environment variable. Set it in the docker-env-vars file or the docker-env-secrets file:
GCP_PROJECT_ID='awesome-project-1'
  • Last but not least, if a custom Airflow configuration file is ready, we can uncomment the corresponding line in our Dockerfile to include it in the image. The configuration file will be added to the Airflow image upon the next docker-compose rebuild.
docker/Dockerfile
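
To illustrate how the settings above surface in DAG code, below is a minimal Python sketch. It assumes a variable named gcp_project_id in docker-airflow-vars.json and an AIRFLOW_VAR_MY_ENV_VAR entry in docker-env-vars; both names are illustrative placeholders, not part of the repository.

import os

from airflow.models import Variable

# Variables imported from variables/docker-airflow-vars.json live in the Airflow
# metadata database and are read with Variable.get().
# "gcp_project_id" is a hypothetical key used only for this example.
project_from_db = Variable.get("gcp_project_id", default_var=None)

# Variables declared with the AIRFLOW_VAR_ prefix in variables/docker-env-vars are
# resolved from the environment; Variable.get() picks them up as well, although
# they never appear in the Airflow database.
project_from_env = Variable.get("my_env_var", default_var=None)  # AIRFLOW_VAR_MY_ENV_VAR

# Plain environment variables, such as GCP_PROJECT_ID from docker-env-vars or
# docker-env-secrets, are read as in any other Python process.
gcp_project_id = os.environ.get("GCP_PROJECT_ID")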

Initializing and Starting Airflow-2

So, we have fulfilled all the requirements and are ready to move forward with the actual Airflow setup. For a first-time deployment, we need to run an initialization script that builds the necessary Docker images and prepares the database backend for Airflow.

We are going to walk through the steps of initializing and starting the Airflow-2 deployment without diving too deep into technical details; however, you can always look at the project code in the GitHub repository https://github.com/badal-io/airflow2-local-ci-cd, where we have tried to comment on each step as thoroughly as possible.

Step 1: Initialize and start the Airflow-2 deployment by executing the following script. It may take around 2–3 minutes until the setup is initialized and all services are started.

sudo ./helpers/scripts/init_airflow.sh

Step 2: To verify the setup, open a new terminal window and run the following command to make sure that all three containers (webserver, scheduler, postgres_db) are running and healthy:

docker ps

The correct output should look like this (output truncated):

root@cloudshell:~/$ docker ps
CONTAINER ID   IMAGE
cd04737c0db2   airflow-webserver
a5d89d49ab97   airflow-scheduler
8bf7667cd245   postgres:13

Step 3: No Step 3! Congratulations, Airflow-2 is Up and Running!

You can find useful commands for operating and maintaining the solution in the README in the project repository.

Accessing Web Interface

Now, we can access the Airflow web interface via Web Preview in GCP Cloud IDE.

  • On the right top corner there is an icon:
Web Preview
  • Click on it and choose Preview on port 8080. We can also change the port if we need something other than 8080.
Web Preview on a specific port
  • A new tab will open with the Airflow web user interface; go ahead and provide the default credentials: username: airflow, password: airflow.
Airflow Web UI

Deploying DAGs

We can now deploy DAGs into the dags folder, and we are going to do that using the Cloud Shell IDE.
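
If you do not have a DAG at hand yet, a minimal example like the one below can be uploaded to the dags folder. This is a generic sketch using only core Airflow operators; it is not one of the DAGs shown in the screenshots that follow.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A deliberately simple DAG used only to verify the local environment:
# one task that prints a message via the BashOperator.
with DAG(
    dag_id="hello_cloud_shell",
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["example"],
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow 2 on Cloud Shell'",
    )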

In the top right corner, click on More (three dots) and select Upload. Select the dags folder as the destination, then choose your files from the local PC and upload them.

Uploading files to Cloud Shell IDE

Now, we can verify our DAGs on the web UI. As seen in the screenshot below, the two DAGs have been deployed successfully.

Airflow Web UI — DAGs

Let’s run our DAG. We need to switch it out of the “Paused” state and trigger it. In a moment it should turn green, which means it is processing the workflow.

Run Airflow DAG

We can also see what is going on in the Airflow log (terminal window):

Airflow console log

Now, let’s deliberately break our DAG by making some changes (commenting) in the DAG’s code:

Breaking code in DAG

If we apply any changes to a DAG, they should be reflected immediately in the web UI, as the following example shows:

As we can see, we can make changes “on the go” and those will be applied immediately. We found this quite useful during the development process.

Running DAG tests

DAGs are specified as Python code, which is one of Airflow’s key principles. Since they can be treated like any other piece of code, we can plug our DAGs into typical software development lifecycles using source control and testing. There are several test runners available for Python; we will be using pytest in our examples. We have placed unit and integration tests in a dedicated directory, tests, which contains unit and integration folders for unit and integration tests respectively.

To run a test we call a wrapper script, airflow (tests/airflow). It simply creates a new ephemeral Airflow container, runs the given command (a test command in our case), prints the output, and finally terminates the container. That is all we need!
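
Before looking at how to invoke the tests, here is roughly what a DAG validation unit test under tests/unit might look like. This is a sketch only; it assumes the DAGs are mounted at /opt/airflow/dags inside the container, and the repository's actual tests may differ.

import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Load every DAG file from the folder mounted into the Airflow container.
    return DagBag(dag_folder="/opt/airflow/dags", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error or missing dependency in a DAG file shows up here.
    assert dag_bag.import_errors == {}


def test_dags_have_owners(dag_bag):
    # Enforce a simple convention: every DAG declares an owner in default_args.
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} has no owner"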

  • Unit Test

To run unit tests, navigate to the tests directory and run the following command (the quotes are mandatory):

./airflow "pytest <test directory>"

For example:

./airflow "pytest tests/unit"

The outcome should look like the output below; the unit tests for our DAGs passed successfully.

root@cloudshell:/root$ ./airflow "pytest tests/unit"
Creating airflow2-local-ci-cd_airflow-ops_run ... done
==================== test session starts =======================
platform linux -- Python 3.8.10, pytest-5.3.5, py-1.10.0, pluggy-0.13.1
rootdir: /opt/airflow
plugins: celery-4.4.7, anyio-3.2.0, airflow-0.0.3, testconfig-0.2.0
collected 5 items
tests/unit/test_dag_validation.py .....                         [100%]

====================== 5 passed in 2.64s ======================
root@cloudshell:/root$
  • Integration Test

To run integration tests, navigate to the tests directory and run the following command (the quotes are mandatory):

./airflow "pytest <test directory>"

For example:

./airflow "pytest --tc-file config.ini -v tests/integration"

The outcome should look like the output below; the integration tests for our DAGs passed successfully. A sketch of what such an integration test might look like follows the output.

root@cloudshell:/root$ ./airflow "pytest --tc-file config.ini -v tests/integration"
Creating airflow2-local-ci-cd_airflow-ops_run ... done

==================== test session starts ==================
platform linux -- Python 3.8.10, pytest-5.3.5, py-1.10.0, pluggy-0.13.1 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
plugins: celery-4.4.7, anyio-3.2.0, airflow-0.0.3, testconfig-0.2.0
collected 2 items
tests/integration/test_bg_operator.py::test_create_table PASSED [ 50%]
tests/integration/test_bg_operator.py::test_delete_table PASSED [100%]
===================== 2 passed in 2.63s =====================
root@cloudshell:/root$
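
For orientation, an integration test in the spirit of test_bg_operator.py could look like the sketch below. It assumes pytest-testconfig supplies a [gcp] section in config.ini with project and dataset keys, and that the container has GCP credentials available; the names and schema are illustrative and not taken from the repository.

from google.cloud import bigquery
from testconfig import config  # populated by pytest --tc-file config.ini


def test_create_and_delete_table():
    # Hypothetical config.ini layout:
    # [gcp]
    # project = awesome-project-1
    # dataset = airflow_integration
    project = config["gcp"]["project"]
    dataset = config["gcp"]["dataset"]
    table_id = f"{project}.{dataset}.it_smoke_table"

    client = bigquery.Client(project=project)
    schema = [bigquery.SchemaField("id", "INTEGER")]

    # Round-trip against the real BigQuery API: create, verify, then clean up.
    client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
    assert client.get_table(table_id) is not None
    client.delete_table(table_id, not_found_ok=True)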

Troubleshooting

Try the instructions below if you’re having issues with the environment. You can check whether your problem has been solved after each step. Also, you can have a look at the operations and maintenance commands available here.

  1. Restart Airflow containers by using docker-compose down/up commands.
docker-compose down
docker-compose up

2. Restart GCP Cloud Shell and try again.

3. Clean up all volumes and reinitialize Airflow again (from scratch setup).

docker-compose down --volumes --rmi all
sudo ./helpers/scripts/init_airflow.sh

4. Remove all containers, networks, images with the docker system prune command. Delete the Airflow directory and clone it again. Re-install Airflow (from scratch setup).

docker system prune -a --volumes
sudo ./helpers/scripts/init_airflow.sh

Caveats

  1. The setup was tested in the development environment and is not meant to be used in production.
  2. A GCP Cloud Shell session ends after 20 minutes of web interface inactivity. You have to reinitialize the whole Airflow setup if the session has ended.

Summary

We believe that an Airflow development environment should be easy to deploy and operate on any operating system or platform. That encourages developers to focus on development rather than on system installation and configuration. We have showcased a solution that allows developers to quickly deploy a local Airflow setup that has everything they need. We have also demonstrated a viable option, GCP's Cloud Shell offering, which has all the tools and prerequisites pre-installed and allows you to run a local Airflow dev environment in the cloud via a web browser.

If you have any questions or thoughts on how it could be improved, please leave a comment below! Let us know if this was helpful for you.

The project is publicly available here:

https://github.com/badal-io/airflow2-local-ci-cd

What is next? In Part II we are going to discuss the CI/CD workflow of this project and how it integrates with CI/CD pipelines for GCP Cloud Composer.
