This is the first story in a series of hands-on trainings to teach a bit about Airflow. It intends to explain how to structure a project so you have a local environment set up with Docker (using the CeleryExecutor), and covers the steps to have the service account from a GCP project prepared to be used in your DAGs.
Information and Prerequisites
The training was designed to explain the process of setting up the environment per project, assuming that all the DAGs you want to design are related to the same project; it is easier to organize them that way.
Let's look at the 01_setting_up example project:
Detailing the files
The heart of our training: the compose file presented below is responsible for creating our containers (all four services plus the dependency containers).
This part is responsible for bringing up two database services:
- Redis will be the in-memory database supporting the message broker service (using the standard port).
- Postgres will be the database responsible for storing the Airflow metadata; we are already setting the database name, user, and password (using the standard port).
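As a sketch of what this part of the compose file could look like (the exact image versions are assumptions, not necessarily the ones used in the example repository):

```yaml
version: '3'
services:
  redis:
    image: redis:5.0.5        # in-memory store backing the Celery message broker
    ports:
      - "6379:6379"           # standard Redis port

  postgres:
    image: postgres:9.6       # Airflow metadata database
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"           # standard Postgres port
```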
Since we opted to use the CeleryExecutor to run our Airflow tasks, we will use Flower as the tool for monitoring and administrating the Celery cluster.
Observe that this service depends on the Redis service to start (which makes perfect sense, since without the broker database it is not possible to have the workers working well).
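A minimal sketch of the Flower service, assuming the default Flower port (5555) and the same base image as the other Airflow services:

```yaml
  flower:
    image: buzz84/docker-airflow:latest
    depends_on:
      - redis                 # Flower cannot monitor the cluster without the broker
    environment:
      - EXECUTOR=Celery
    ports:
      - "5555:5555"           # Flower web UI
    command: flower
```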
For the correct working of our Airflow environment based on the CeleryExecutor, we separate it into three distinct services, which I intend to detail well:
- depends_on: this means that the service depends on the Postgres service to start; as mentioned before, Postgres will be our metadata repository.
- environment: defines variables that will be used when the service comes up, setting options that really matter for the correct working of the services:
— AIRFLOW__WEBSERVER__RBAC=true sets that we want to use the RBAC interface (added in Airflow 1.10 as an optional UI with more options and access control; we are going to use it because in the next Airflow version (2.0) it will be the only option available).
— LOAD_EX=n starts Airflow without loading the sample DAGs; Airflow ships some sample DAGs to support new adventurers (=y would show the sample DAGs).
— FERNET_KEY is mainly used to encrypt connections in Airflow (you can generate a new Fernet key to be used).
— EXECUTOR=Celery determines that we want to use Celery executors in our Airflow environment.
— POSTGRES_* are the variables that determine the user, password, database, host, and port of the Postgres service that Airflow will use as its metadata database (they must match the Postgres service configured previously).
— AIRFLOW__CORE__SQL_ALCHEMY_CONN sets the SQLAlchemy connection string; observe that it carries the same information as the Postgres parameters.
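Put together, the environment block of the webserver could look like the sketch below (the Fernet key is a placeholder; generate your own):

```yaml
    environment:
      - AIRFLOW__WEBSERVER__RBAC=true
      - LOAD_EX=n
      - FERNET_KEY=<your-generated-fernet-key>   # placeholder, generate your own
      - EXECUTOR=Celery
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
      - POSTGRES_HOST=postgres                   # the compose service name
      - POSTGRES_PORT=5432
      # same credentials as the Postgres service, in SQLAlchemy URI form
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
```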
- volumes: used to map folders or files from our local disk to volumes mounted inside our Docker container; it is really useful to have any change in your local environment instantly reflected in your Docker. For now we map:
— the dags folder, mounted inside the airflow directory; it lets you develop in your local environment and see your DAG reflecting the changes.
— the resources folder, mounted inside the airflow directory; it is good practice to have a resources folder to keep the support files your DAGs need to work well (SQL files, schemas, and others).
— the config_files folder, mounted inside the airflow directory; it will be explained in detail further on, but it contains the files that support setting up the Airflow environment with the information needed to work on your projects.
— the scripts folder, mounted inside the airflow directory; it holds the programs that support the Airflow environment setup.
— the requirements.txt file, mounted in the root folder of the container; if the container's entrypoint detects this file, it installs the Python dependencies listed there — very useful for adding new packages to your Docker environment.
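Assuming the airflow home directory inside the container is /usr/local/airflow (and that config_files is mounted as configs, which matches the paths used in the docker exec commands later), the volumes block could be sketched as:

```yaml
    volumes:
      - ./dags:/usr/local/airflow/dags
      - ./resources:/usr/local/airflow/resources
      - ./config_files:/usr/local/airflow/configs
      - ./scripts:/usr/local/airflow/scripts
      - ./requirements.txt:/requirements.txt   # picked up by the entrypoint
```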
- ports: indicates the port that the web service will use and how it maps to your local machine; in this case, the web service uses port 8080 and maps to 8080 on your machine. If you already have some service using that port, you will need to adjust this.
- command: webserver; this command will be passed as a parameter to the entrypoint program, in this case indicating the webserver.
- healthcheck: a command to check whether the webserver is already working; it is set to do 3 retries with a timeout of 30 seconds and intervals of 30 seconds, after which the container is marked as unhealthy.
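These last three settings could be sketched as below; the healthcheck test (checking for the webserver PID file) is an assumption based on common docker-airflow setups:

```yaml
    ports:
      - "8080:8080"             # webserver UI, adjust if 8080 is taken locally
    command: webserver
    healthcheck:
      # assumed check: the webserver writes a PID file once it is up
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3                # after 3 failed tries the container is unhealthy
```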
That was a really extensive explanation, but the next two services follow the same principles, with a few differences:
scheduler and worker:
The configuration of the worker and the scheduler is almost the same as the webserver's, with very few changes:
- depends_on: the dependencies change; the worker depends on the webserver and the scheduler depends on the worker to start.
- volumes: the config_files and scripts folders are no longer necessary for these services; only the webserver container is used to load the parameters (the metadata database is shared by all services).
- ports and healthcheck are no longer necessary.
- command: changes to match each of the services (worker and scheduler).
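Those differences could be sketched as follows (the environment block is abbreviated here; it repeats the webserver's):

```yaml
  worker:
    image: buzz84/docker-airflow:latest
    depends_on:
      - webserver
    environment:
      # same environment block as the webserver (RBAC, Fernet key, Postgres, ...)
      - EXECUTOR=Celery
    volumes:
      - ./dags:/usr/local/airflow/dags
      - ./resources:/usr/local/airflow/resources
      - ./requirements.txt:/requirements.txt
    command: worker              # no ports or healthcheck needed

  scheduler:
    image: buzz84/docker-airflow:latest
    depends_on:
      - worker
    environment:
      - EXECUTOR=Celery
    volumes:
      - ./dags:/usr/local/airflow/dags
      - ./resources:/usr/local/airflow/resources
      - ./requirements.txt:/requirements.txt
    command: scheduler
```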
The compose file helps start the Docker containers, but after that we still need to set up our Airflow environment, and it is a good idea to have the following files to support us.
Here we define the variables that we want to load into our Airflow environment, which can then be retrieved and used from our DAGs. I usually keep a group of common variables across all DAGs (such as environment, resource folder path, email, and Slack parameters), plus one variable per DAG as a dictionary (JSON); it is easy to parse inside any DAG and also avoids making too many variable requests to Airflow.
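The `airflow variables -i` command (used later in the training) expects a flat JSON object of key/value pairs; values that are themselves objects can be read back in a DAG with `Variable.get(key, deserialize_json=True)`. A sketch of such a file, with illustrative keys and values:

```json
{
  "environment": "dev",
  "resources_path": "/usr/local/airflow/resources",
  "alert_email": "you@example.com",
  "my_dag_settings": {
    "schedule": "0 6 * * *",
    "gcp_project": "your-gcp-project"
  }
}
```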
The same concept as the variables file, but for connections. In our example we will create a connection named GCP_CONNECTION, of type google_cloud_platform, that points to a specific GCP project (this one is mine, you must use yours here) and a service account file (likewise, it must be yours here).
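The format of this file is whatever the custom script consumes, so the sketch below is an assumption: a list of connection entries, using the extra field names Airflow 1.10 defines for google_cloud_platform connections:

```json
[
  {
    "conn_id": "GCP_CONNECTION",
    "conn_type": "google_cloud_platform",
    "extra": {
      "extra__google_cloud_platform__project": "your-gcp-project",
      "extra__google_cloud_platform__key_path": "/usr/local/airflow/configs/service_account.json"
    }
  }
]
```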
The same concept as the variables file, but for pools. It creates or updates pools, which are used to limit resource usage in Airflow: tasks related to a specific pool will be managed according to the capacity indicated in its slots.
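The `airflow pool -i` import command takes a JSON object keyed by pool name; a sketch with illustrative pool names and slot counts:

```json
{
  "gcp_pool": {
    "slots": 5,
    "description": "Limits concurrent tasks that call GCP services"
  },
  "heavy_sql_pool": {
    "slots": 2,
    "description": "Limits concurrent heavy SQL tasks"
  }
}
```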
Since we are using the RBAC user interface, access to the Airflow UI is controlled by username and password; this file will help us create users with specific roles.
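Again, the exact format depends on the custom script, so the field names below are assumptions: a list of user entries matching the fields the Airflow RBAC user model needs:

```json
[
  {
    "username": "admin",
    "password": "change-me",
    "firstname": "Ad",
    "lastname": "Min",
    "email": "admin@example.com",
    "role": "Admin"
  }
]
```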
This is the service account key file used to connect to the GCP services; it will be referred to in your connections or used by bash commands. This file can be generated in the Google Cloud Console, on the service accounts page.
(some information was omitted for security reasons; use values from your GCP project instead)
Airflow already has a command-line tool that enables us to do many things, like creating users and adding variables, pools, connections, and more.
Unfortunately, some of these commands are not as practical as others: for example, we are able to load JSON files with variables and pools in batch, but users and connections have to be created one at a time.
The scripts below are used to make these jobs easier for us. I will not explain each program, but I leave the code here.
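As a hedged sketch (not the author's actual code), airflow_create_users.py could work by reading the JSON file and calling the Airflow 1.10 `create_user` CLI once per entry; the field names assume the users.json format shown earlier:

```python
import json
import subprocess
import sys


def build_create_user_cmd(user):
    """Build the Airflow 1.10 CLI command that creates one RBAC user."""
    return [
        "airflow", "create_user",
        "-r", user["role"],
        "-u", user["username"],
        "-e", user["email"],
        "-f", user["firstname"],
        "-l", user["lastname"],
        "-p", user["password"],
    ]


def create_users(path):
    """Load a JSON list of users and create each one via the CLI."""
    with open(path) as f:
        users = json.load(f)
    for user in users:
        # The CLI only creates one user per call, hence the loop.
        subprocess.run(build_create_user_cmd(user), check=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    create_users(sys.argv[1])
```

The connections script would follow the same pattern, looping over the entries and shelling out to `airflow connections` (or using Airflow's Python API) per connection.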
After having the project structure with all the necessary files set up, it is time to run our docker-compose file, which will start the Docker containers with all the services we defined there.
Pull docker image
Assuming that you are fine with the prerequisites, we need to pull the Docker image with the necessary features to bring up the containers. We do that by running the command below in our favorite command-line tool.
docker pull buzz84/docker-airflow:latest
It can take a while; I still need to work on a smaller image (sorry for that :( ).
Up Docker Containers
Once we have pulled the base image for our containers, we will use the docker-compose command to bring up our services based on the docker-composer.yml file, so from your docker folder execute:
docker-compose -f "docker-composer.yml" up -d
If everything goes well, the containers will be up in about 2 minutes, and you can do your first test. Open your browser and access localhost:8080 to check whether the webserver is accessible. It will request a username and password; do not worry, we are going to create the user next.
This command will use the webserver container to execute our Python program that creates the users based on the users.json file:
docker exec -ti $(docker ps -q --filter name=webserver) python ./scripts/airflow_create_users.py ./configs/users.json
Now you are able to log in to the Airflow UI with your username and password. Congratulations!
Following the same steps as before, we execute the Python program that creates the connections based on the connections.json file:
docker exec -ti $(docker ps -q --filter name=webserver) python ./scripts/airflow_create_connections.py ./configs/connections.json
We are going to create the variables using the airflow command and the variables.json file:
docker exec -ti $(docker ps -q --filter name=webserver) airflow variables -i ./configs/variables.json
We are going to create the pools using the airflow command and the pools.json file:
docker exec -ti $(docker ps -q --filter name=webserver) airflow pool -i ./configs/pools.json
This structure of using Docker inside your projects can be useful in several situations and can be adapted to Docker for other purposes.
Never forget to update your config files: during the development of the DAGs for your project it will be necessary to adjust or create new variables, connections, or pools, and keeping the files updated is good practice. The next developer to work on the project will thank you.
Also think about automated scripts to start your containers and set up your environment; I have already used that approach and it saves a lot of time.
The next training will start from this point, and I will teach you how to design DAGs using GCP (Google Cloud Platform).