Local testing with Google Cloud Composer — Apache Airflow
Introduction & motivation
This guide covers how to install Airflow locally for testing Airflow code, with a focus on users of Google Cloud Platform. To give a little context, I’d like to briefly describe Apache Airflow. From the Airflow project home page:
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
These workflows can contain almost anything you could imagine: Bash script tasks, Hive tasks, MySQL tasks, AWS S3 tasks, Google ML Engine tasks, Slack tasks, and many others. Airflow pipelines are set up as DAGs (Directed Acyclic Graphs) containing tasks that chain to other upstream or downstream tasks. Companies such as Robinhood and Lyft use Airflow to orchestrate their data pipelines, and their engineering blogs offer more in-depth discussion of Airflow design.
Beginning in March 2018, Google Cloud Platform began offering Apache Airflow under the name Cloud Composer. The nice value add is that you get a one-click deployment of Airflow on Kubernetes and no-fuss integration to your cloud project resources via service account. For testing, Google recommends having separate testing and production environments, but what if you wanted to develop and test locally? This guide shares how I set up my local Airflow development environment.
Note: As of this writing, Cloud Composer is on v1.9 (Release Notes). This matters because Airflow still has breaking changes between releases, especially in a few of the GCP operators, so testing against a different version may not yield the same results. I will therefore be installing Airflow 1.9.0 on Python 3 for this guide.
Setting up your local environment
Before we continue, Windows users will need to install a Linux OS of choice. Unfortunately, as of this writing, Airflow does not work well (if at all) on Windows. My personal preference is Ubuntu Desktop (so I could use Visual Studio Code) + Oracle VirtualBox. A vanilla install is fine, no special configurations needed.
Once in Linux, let’s make sure pip and venv are installed so we can create virtual environments.
sudo apt-get install python3-pip
sudo apt-get install python3-venv
Next, we’ll create an “airflow” virtual environment and activate it so we don’t create issues with the default Python installation.
python3 -m venv envs/airflow
source envs/airflow/bin/activate
Make sure your terminal prompt now shows (airflow), meaning the environment is active. Next, we’ll upgrade pip and install wheel to prevent installation errors.
pip install --upgrade pip
pip install wheel
When we try to install Airflow with the GCP extras on Python 3, we’ll run into problems: the Apache Beam dependency presently blocks installation on Python 3. So we’ll force an older release of the Dataflow SDK first.
pip install google-cloud-dataflow==2.2.0
Now we can install Apache Airflow v1.9.0 to match Google Cloud Composer.
pip install apache-airflow[gcp_api]==1.9.0
The installation works, but when you use Airflow with the GCP operators, you’ll see complaints about the missing pandas.tools module, which newer versions of Pandas no longer provide. Let’s install the minimum Pandas version that Airflow 1.9.0 supports.
pip install pandas==0.17.1
Ok, now we’re getting close. We need to initialize the Airflow metadata database, then start the local webserver so we can make two configuration changes that simplify our use of the GCP operators.
airflow initdb
airflow webserver -p 8000
Navigate to http://localhost:8000 > Admin > Connections. There are two connections of interest: bigquery_default and google_cloud_default. We’ll switch both of these to Conn Type “Google Cloud Platform” and type in a project id.
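If you’d rather script this than click through the UI, the same change can be made through Airflow’s metadata database via its ORM. This is a sketch, not an official CLI: it assumes the “airflow” virtualenv is active, airflow initdb has already run, and “your-project-id” is a placeholder for your real GCP project id.

```python
# set_gcp_connections.py -- optional alternative to editing connections in the UI.
# Assumes Airflow 1.9.0 is installed and the metadata DB is initialized.
import json

from airflow import settings
from airflow.models import Connection

session = settings.Session()
for conn_id in ('bigquery_default', 'google_cloud_default'):
    conn = session.query(Connection).filter(Connection.conn_id == conn_id).one()
    conn.conn_type = 'google_cloud_platform'
    # 'your-project-id' is a placeholder -- substitute your GCP project id.
    conn.extra = json.dumps(
        {'extra__google_cloud_platform__project': 'your-project-id'}
    )
session.commit()
```

Run it once inside the virtualenv; refreshing Admin > Connections in the webserver should then show both connections updated.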
The last step is obtaining the credentials to connect to GCP. You could generate a key.json file yourself from the Cloud Console, or use the Google Cloud SDK (recommended for a development environment). For the latter, follow this quick installation guide: https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu.
Once complete, simply run:
gcloud auth application-default login
This will create your json credentials file, placing it in a default location where Airflow will automatically look for it.
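On Linux, gcloud writes these Application Default Credentials to a well-known path that Google’s client libraries (and therefore Airflow’s GCP hooks) check automatically. A quick stdlib-only way to confirm the file is in place:

```python
import os

# Default Application Default Credentials path on Linux, created by
# `gcloud auth application-default login`.
cred_path = os.path.expanduser(
    "~/.config/gcloud/application_default_credentials.json"
)

if os.path.isfile(cred_path):
    print("Found application default credentials at", cred_path)
else:
    print("No credentials yet; run: gcloud auth application-default login")
```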
Ready to test
I won’t be covering how to write a DAG in this tutorial. For more on that, you may want to check out an example like https://medium.com/google-cloud/airflow-for-google-cloud-part-1-d7da9a048aa4
Once you write your first DAG, you can save it to ~/airflow/dags. Python files containing DAGs placed in this folder are automatically recognized by Airflow. This allows you to locally test your DAG before moving it to Cloud Composer. For example, if we have a DAG named my_dag and a task named my_task, testing is as simple as
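For reference, here is a minimal sketch of what such a file might look like, reusing the my_dag and my_task names from above with a plain BashOperator; the start_date is an arbitrary example.

```python
# ~/airflow/dags/my_dag.py -- a minimal DAG for exercising local testing.
# The dag_id and task_id match the example names used in this guide.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 2, 1),
}

dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')

my_task = BashOperator(
    task_id='my_task',
    bash_command='echo "Hello from local Airflow 1.9.0"',
    dag=dag,
)
```

In a real pipeline you would swap the BashOperator for GCP operators, which will pick up the google_cloud_default connection configured earlier.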
airflow test my_dag my_task 2019-02-13
At this point you can simply move your DAG(s) to your Cloud Composer environment using the Google Cloud SDK:
gsutil -m cp ~/airflow/dags/*.py gs://your-composer-bucket/dags/
Using this method you can quickly write and test Cloud Composer / Airflow pipelines without running two separate Composer environments.