Airflow Series — Setting Up for GCP Projects

Jonatas Junior
Jul 6 · 9 min read
Photo by JJ Ying on Unsplash

This is the first story in a series of hands-on trainings that teach a bit about Airflow. It explains how to structure a project so that you have a local Docker environment (using the CeleryExecutor) and the steps to get a service account from a GCP project ready to be used in your DAGs.

We are using an existing Docker image (see the git project) that I built, containing Airflow 1.10.10 with Python 3.7 and configured so that the Role-Based Access Control (RBAC) interface can be used.

Have Fun!

Information and Prerequisites

Project structure

The training was designed around one environment per project, assuming that all the DAGs you want to build relate to the same project; it is easier to organize things that way.

Let's look at the 01_setting_up example project:

project structure

Detailing the files

docker-composer.yml

The heart of our training: the compose file presented below is responsible for creating our containers (the four Airflow-related services plus their dependency containers).

Database services

This part is responsible for bringing up two database services:

  • Redis will be the in-memory database that backs the message broker (using the standard port).
  • Postgres will be the database responsible for storing Airflow metadata; we already set the database name, user, and password (using the standard port).
composer database services
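
For reference, here is a minimal sketch of what these two services might look like in docker-composer.yml (the image tags and credentials below are illustrative; the exact values live in the repository file):

redis:
    image: redis:5.0.5
    ports:
        - "6379:6379"

postgres:
    image: postgres:9.6
    environment:
        - POSTGRES_USER=airflow
        - POSTGRES_PASSWORD=airflow
        - POSTGRES_DB=airflow
    ports:
        - "5432:5432"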

Celery monitoring

Since we are opting to use the CeleryExecutor to run our Airflow tasks, we will use Flower as the tool for monitoring and administering the Celery cluster.

Observe that this service depends on the Redis service to start (which makes perfect sense, since without the broker database the workers cannot work properly).

composer flower service
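
Roughly, the Flower service looks like the sketch below. I am assuming the puckel-style entry point this image follows, where the flower command starts the monitoring UI on its default port 5555; the exact values may differ in the repository:

flower:
    image: buzz84/docker-airflow:latest
    restart: always
    depends_on:
        - redis
    environment:
        - EXECUTOR=Celery
    ports:
        - "5555:5555"
    command: flower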

Airflow containers

For our CeleryExecutor-based Airflow environment to work correctly, we split Airflow into three distinct services, which I will detail below:

webserver:

composer webserver airflow service
  • depends_on: the service depends on the Postgres service to start; as mentioned before, Postgres will be our metadata repository.
  • environment: defines the variables used when the service starts; these options really matter for the correct working of the services:
    AIRFLOW__WEBSERVER__RBAC=true, stating that we want to use the RBAC interface (added in Airflow 1.10 as an optional UI with more options and access control; we are going to use it because in the next Airflow version (2.0) it will be the only option available).
    LOAD_EX=n, starting Airflow without loading the sample DAGs it ships to help new adventurers (=y would load the sample DAGs).
    FERNET_KEY, mainly used to encrypt connections stored in Airflow (you can generate a new Fernet key to use here).
    EXECUTOR=Celery, determining that we want to use the Celery executor in our Airflow environment.
    POSTGRES_*, variables that set the user, password, database, host, and port of the Postgres service that Airflow will use as its metadata database (they must match the Postgres service configured previously).
    AIRFLOW__CORE__SQL_ALCHEMY_CONN, the SQLAlchemy connection string; note that it carries the same information as the Postgres parameters.
  • volumes: used to map folders or files from our local disk to volumes mounted inside the Docker container; this is really useful because any change in your local environment is instantly reflected in the container. For now we map:
    - the dags folder, mounted inside the airflow directory; it lets you develop in your local environment and see your DAGs reflect the changes.
    - the resources folder, mounted inside the airflow directory; it is good practice to have a resources folder where you keep the support files your DAGs need (SQL files, schemas, and so on).
    - the config_files folder, mounted inside the airflow directory; it will be explained in detail further on, but it contains the files used to set up the Airflow environment with the information your projects need.
    - the scripts folder, mounted inside the airflow directory; it holds the programs that support the Airflow environment setup.
    - the requirements.txt file, mounted in the root folder of the container; if the container's entry point detects this file, it installs the Python dependencies listed in it, which is very useful for adding new packages you want in your container.
  • ports: indicates the port used by the webserver and how it maps to your local machine; in this case the webserver uses port 8080 and maps to port 8080 on your machine. If you already have a service using that port, adjust it.
  • command: webserver, the command passed as a parameter to the entry-point program; in this case it starts the webserver.
  • healthcheck: a command that checks whether the webserver is already working; it is set to 3 retries with a 30-second timeout and 30-second intervals, after which the container is marked unhealthy. A minimal sketch of the whole service follows the list.
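
Putting these options together, a minimal sketch of the webserver service could look like this (the container paths, credentials, and the healthcheck test are illustrative assumptions; check the repository for the exact file):

webserver:
    image: buzz84/docker-airflow:latest
    restart: always
    depends_on:
        - postgres
    environment:
        - AIRFLOW__WEBSERVER__RBAC=true
        - LOAD_EX=n
        - FERNET_KEY=<your generated fernet key>
        - EXECUTOR=Celery
        - POSTGRES_USER=airflow
        - POSTGRES_PASSWORD=airflow
        - POSTGRES_DB=airflow
        - POSTGRES_HOST=postgres
        - POSTGRES_PORT=5432
        - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    volumes:
        - ./dags:/usr/local/airflow/dags
        - ./resources:/usr/local/airflow/resources
        - ./config_files:/usr/local/airflow/config_files
        - ./scripts:/usr/local/airflow/scripts
        - ./requirements.txt:/requirements.txt
    ports:
        - "8080:8080"
    command: webserver
    healthcheck:
        test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
        interval: 30s
        timeout: 30s
        retries: 3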

That was a really extensive explanation, but the next two services follow the same principles, with only a few differences.

scheduler and worker:

composer worker and scheduler services

The configuration of the worker and scheduler is almost the same as the webserver's, with very few changes:

  • depends_on: the dependencies change; the worker depends on the webserver, and the scheduler depends on the worker to start.
  • volumes: the config_files and scripts folders are no longer necessary for these services; we only use the webserver container to load the parameters (the metadata database is shared by all services).
  • ports and healthcheck are no longer necessary.
  • command: changes to match each of the services (worker and scheduler), as the sketch below shows.
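
A sketch of the two remaining services, showing only what changes (the same hedges as before apply; paths and values are illustrative):

worker:
    image: buzz84/docker-airflow:latest
    restart: always
    depends_on:
        - webserver
    # same environment variables as the webserver (RBAC, FERNET_KEY, EXECUTOR, POSTGRES_*, ...)
    volumes:
        - ./dags:/usr/local/airflow/dags
        - ./resources:/usr/local/airflow/resources
        - ./requirements.txt:/requirements.txt
    command: worker

scheduler:
    image: buzz84/docker-airflow:latest
    restart: always
    depends_on:
        - worker
    # same environment variables and volumes as the worker
    command: scheduler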

Config Files

The compose file helps us start the Docker containers, but after that we need to set up our Airflow environment, and it is a good idea to have these files to support us.

variables.json

This file holds the variables we want to load into our Airflow environment, which can then be retrieved and used from our DAGs. I usually keep a group of common variables shared by all DAGs (environment, resource folder path, email, Slack parameters, and so on) plus one variable per DAG stored as a dictionary (JSON); it is easy to parse inside any DAG and avoids making too many variable requests to Airflow.

airflow variables
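
As an illustration, a variables.json along these lines works with the import command we will run later: a flat JSON object where plain values become simple variables and nested objects are stored as JSON for a given DAG (the keys and values below are made up, not the ones from the repository):

{
    "environment": "dev",
    "resources_path": "/usr/local/airflow/resources",
    "alert_email": "data-team@example.com",
    "my_example_dag": {
        "schedule_interval": "@daily",
        "gcp_project": "my-gcp-project"
    }
}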

connections.json

The same concept as the variables file, but for connections. In our example we will create a connection named GCP_CONNECTION of type google_cloud_platform that points to a specific GCP project (this one is mine, you must use yours here) and to a service account file (likewise, it must be yours here).

airflow connections
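
The exact layout of this file depends on the loading script shown later, but the kind of information it carries looks like this sketch (the project and paths are placeholders; the extra__google_cloud_platform__* keys are the fields Airflow's google_cloud_platform connection type understands):

{
    "GCP_CONNECTION": {
        "conn_type": "google_cloud_platform",
        "extra": {
            "extra__google_cloud_platform__project": "my-gcp-project",
            "extra__google_cloud_platform__key_path": "/usr/local/airflow/config_files/gcp_service_account.json",
            "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform"
        }
    }
}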

pools.json

The same concept as the variables file, but for pools: it creates or updates pools, which are used to limit resource usage in Airflow. Tasks associated with a specific pool are run according to the capacity indicated by its slots.

airflow pools
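
Here is a sketch of a pools.json in the shape the airflow pool -i command accepts (the pool name, slots, and description are illustrative):

{
    "gcp_pool": {
        "slots": 5,
        "description": "Limits how many tasks hit GCP services at the same time"
    }
}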

users.json

Since we are using the RBAC user interface, access to the Airflow UI is controlled by username and password; this file helps us create users with specific roles.

airflow users
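
The exact schema again depends on the loading script, but the fields mirror what the airflow create_user command needs, roughly like this sketch (all credentials here are placeholders):

{
    "admin_user": {
        "username": "admin",
        "password": "change-me",
        "email": "admin@example.com",
        "firstname": "Admin",
        "lastname": "User",
        "role": "Admin"
    }
}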

gcp_service_account.json

This is the file used to authenticate with GCP services; it will be referenced in your connections or used by bash commands. It can be generated in the Google Cloud Console (see the Google documentation page).

(Some information was omitted for security reasons; use the values from your own GCP project instead.)

gcp service account (sample file)
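
For reference, a downloaded service account key has this general shape (every value below is a placeholder):

{
    "type": "service_account",
    "project_id": "my-gcp-project",
    "private_key_id": "<omitted>",
    "private_key": "-----BEGIN PRIVATE KEY-----\n<omitted>\n-----END PRIVATE KEY-----\n",
    "client_email": "airflow@my-gcp-project.iam.gserviceaccount.com",
    "client_id": "<omitted>",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "<omitted>"
}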

Script Files

Airflow already has a command-line tool that enables us to do a lot of things, like creating users and adding variables, pools, connections, and more.

Unfortunately, some of these commands are not as practical as others. For example, we can load the JSON files with variables and pools in batch, but users and connections have to be created one at a time.

The scripts below make these jobs easier for us; I will not explain each program, but I left the code here.

airflow create connections program
airflow create users program
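
These are not the exact programs from the repository, but the idea behind the connections script is roughly the sketch below, assuming the connections.json layout shown earlier (Airflow 1.10 exposes the Connection model and a session factory that make this straightforward):

import json
import sys

from airflow import settings
from airflow.models import Connection


def create_connections(path):
    """Create (or replace) Airflow connections from a JSON file."""
    with open(path) as handle:
        connections = json.load(handle)

    session = settings.Session()
    for conn_id, cfg in connections.items():
        # Drop any existing connection with the same id so the file stays the source of truth.
        session.query(Connection).filter(Connection.conn_id == conn_id).delete()
        session.add(
            Connection(
                conn_id=conn_id,
                conn_type=cfg.get("conn_type"),
                host=cfg.get("host"),
                extra=json.dumps(cfg.get("extra", {})),
            )
        )
    session.commit()


if __name__ == "__main__":
    create_connections(sys.argv[1])

The users script can follow the same pattern, looping over users.json and invoking the airflow create_user command for each entry.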

Running Containers

With the project structure and all the necessary files in place, it is time to run our docker-compose file, which will start the Docker containers with all the services we defined there.

Pull docker image

Assuming the prerequisites are covered, we need to pull the Docker image with the features required to bring up the containers. We do that by running the command below in our favorite terminal.

docker pull buzz84/docker-airflow:latest

It could take a while; I still need to work on a smaller image (sorry for that :( ).

Up Docker Containers

Once we have pulled the base image for our containers, we will use the docker-compose command to bring up our services based on the docker-composer.yml file. From your docker folder, execute:

docker-compose -f "docker-composer.yml" up -d

If everything goes well, the containers will be up in about 2 minutes and you can do your first test. Open your browser and access localhost:8080 to check that the webserver is reachable. It will request a username and password; do not worry, we are going to create the user next.

airflow login page

Setting Up Airflow

Create Users

This command uses the webserver container to execute our Python program, which creates the users defined in users.json:

docker exec -ti $(docker ps -q --filter name=webserver) python ./scripts/airflow_create_users.py ./configs/users.json

Now you are able to log in to the Airflow UI with your username and password. Congratulations!

Create Connections

Following the same steps as before, we execute the Python program that creates the connections defined in connections.json:

docker exec -ti $(docker ps -q --filter name=webserver) python ./scripts/airflow_create_connections.py ./configs/connections.json

Create Variables

We are going to create the variables using the airflow command-line tool and the file variables.json:

docker exec -ti $(docker ps -q --filter name=webserver) airflow variables -i ./configs/variables.json

Create Pools

We are going to create the pools using the airflow command-line tool and the file pools.json:

docker exec -ti $(docker ps -q --filter name=webserver) airflow pool -i ./configs/pools.json

Final Considerations

This way of using Docker inside your projects can be useful in several situations and can be applied to Docker setups for other purposes.

Never forget to update your config files. During the development of the DAGs for your project, you will need to adjust or create new variables, connections, or pools; keeping the files updated is good practice, and the next developer to work on the project will thank you.

Also consider automated scripts to start your containers and set up your environment; I already use them and they save a lot of time.

Coming soon

The next training will start from this point, and I will show you how to design DAGs that use GCP (Google Cloud Platform).
