Setting up Airflow on Azure isn’t quite as easy as on Google Cloud, where Airflow exists as a managed service called “Cloud Composer”. Microsoft has, however, come up with a quick-start template to set up Airflow on Azure. This template sets up both a web app hosting an Airflow instance and the Postgres database backing it, making it easier to deploy to the cloud.
Unfortunately, this web app is purely based off puckel’s Docker image, with no specific configuration applied to it. There is no management of extra dependencies that might be needed, no setup of users and authorizations, and no mechanism for deploying DAGs.
One way to fix this is to extend puckel’s image and leverage a private container registry. Here is what we will be looking to add in terms of features:
- Admin login / Access authorization
- MsSQL support
- Custom DAG folders
- Azure-specific interfaces
Using a Private Azure Container Registry
It is possible to set up a web app with a container hosted in a private registry by providing three application settings covering the username, password, and URL of the private registry.
- dockerRegistryUsername: the username under ACR > Access keys
- dockerRegistryPassword: the password under ACR > Access keys
- dockerRegistryUrl: the login server under ACR > Access keys
We can therefore modify the ARM template azuredeploy.json provided as part of the quick-start to make use of an Azure Container Registry. For this, we first need to add a couple of parameters allowing for authentication with the Azure Container Registry.
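As a sketch, the additional entries in the template’s parameters section could look like the following (the parameter names mirror the application settings above and are an assumption about your template layout):

```json
"parameters": {
  "dockerRegistryUsername": {
    "type": "string",
    "metadata": { "description": "Username for the Azure Container Registry" }
  },
  "dockerRegistryPassword": {
    "type": "securestring",
    "metadata": { "description": "Password for the Azure Container Registry" }
  },
  "dockerRegistryUrl": {
    "type": "string",
    "metadata": { "description": "Login server URL of the Azure Container Registry" }
  }
}
```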
We then add them to the relevant app settings properties (e.g. DOCKER_REGISTRY_SERVER_URL), such as:
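A sketch of the corresponding appSettings block inside the web app’s siteConfig, wiring the parameters to the registry settings App Service expects:

```json
"appSettings": [
  {
    "name": "DOCKER_REGISTRY_SERVER_USERNAME",
    "value": "[parameters('dockerRegistryUsername')]"
  },
  {
    "name": "DOCKER_REGISTRY_SERVER_PASSWORD",
    "value": "[parameters('dockerRegistryPassword')]"
  },
  {
    "name": "DOCKER_REGISTRY_SERVER_URL",
    "value": "[parameters('dockerRegistryUrl')]"
  }
]
```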
Along with these authentication settings, we also need to define one more variable: the specific container image that we want to use. We need to define a variable in the following format:
- dockerContainer: foo.azurecr.io/airflow:latest
As with the docker registry variables, we need to add it to the parameters section:
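For instance (the default value reuses the example image name from above):

```json
"dockerContainer": {
  "type": "string",
  "defaultValue": "foo.azurecr.io/airflow:latest",
  "metadata": { "description": "Container image to deploy to the web app" }
}
```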
We will then need to reference the specific container as part of linuxFxVersion in the following way:
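In the web app’s siteConfig, linuxFxVersion takes a `DOCKER|<image>` value, which we can build from the parameter:

```json
"siteConfig": {
  "linuxFxVersion": "[concat('DOCKER|', parameters('dockerContainer'))]"
}
```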
Authentication and Authorization
One of the drawbacks of using the quick-start template, or puckel’s image directly, is that by default Airflow is set up as public. Airflow offers a few modules and configuration options that allow you to password-protect it.
In order to store connection details and passwords in a secure manner, rather than in plain text, Airflow needs to be set up with the crypto module.
Airflow needs to be provided with a fernet key in order to make use of the crypto package. The following website provides a tutorial on how to generate a fernet key for Airflow. The fernet key can be provided to Airflow by populating the environment variable AIRFLOW__CORE__FERNET_KEY.
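As a minimal sketch, a fernet key is simply 32 random bytes encoded in URL-safe base64, so one can be generated with the Python standard library alone (this is equivalent to `cryptography.fernet.Fernet.generate_key()`):

```python
import base64
import os

# A fernet key is 32 random bytes, URL-safe base64-encoded.
# Equivalent to cryptography.fernet.Fernet.generate_key(), without
# needing the cryptography package at key-generation time.
fernet_key = base64.urlsafe_b64encode(os.urandom(32)).decode()
print(fernet_key)  # export this value as AIRFLOW__CORE__FERNET_KEY
```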
Once the cryptography module is enabled within Airflow, it is also good to set up password authentication. Airflow has a specific module for this, which can be installed using the following pip command:
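For the Airflow 1.x releases that puckel’s image is based on, the password backend ships as the `password` extra:

```shell
pip install "apache-airflow[password]"
```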
It is also necessary to specify the authentication backend within airflow.cfg. It is possible to do this in different ways: either through the environment variable override AIRFLOW__WEBSERVER__AUTH_BACKEND or by modifying the airflow.cfg file. Since we are already creating a custom image, we might as well include a custom airflow.cfg file within it:
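The relevant fragment of airflow.cfg, using the Airflow 1.x password backend, would look like:

```ini
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
```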
After having secured credentials and defined an authentication backend, we need to set up a script to generate an admin login. This script would typically be launched from the Docker container’s entrypoint.sh.
The script adds a user to Airflow based on environment variables; these environment variables will also need to be passed through the ARM template azuredeploy.json. A more advanced script would check whether the specific user already exists and update its information accordingly.
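A minimal sketch of such a script, using the Airflow 1.x PasswordUser model (the environment variable names are assumptions; align them with whatever you pass through the template):

```python
import os


def read_admin_credentials(env=None):
    """Read the admin account details from environment variables.

    The variable names here are assumptions; they must match the app
    settings passed through the ARM template.
    """
    env = os.environ if env is None else env
    return {
        "username": env.get("AIRFLOW_ADMIN_USER", "admin"),
        "email": env.get("AIRFLOW_ADMIN_EMAIL", ""),
        "password": env.get("AIRFLOW_ADMIN_PASSWORD"),
    }


def create_admin_user():
    # Deferred imports: these modules only exist inside the Airflow container.
    from airflow import models, settings
    from airflow.contrib.auth.backends.password_auth import PasswordUser

    creds = read_admin_credentials()
    session = settings.Session()
    user = PasswordUser(models.User())
    user.username = creds["username"]
    user.email = creds["email"]
    user.password = creds["password"]
    session.add(user)
    session.commit()
    session.close()


if __name__ == "__main__":
    create_admin_user()
```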
After applying these changes and deploying the custom image, you should be greeted by a sign-in page for Airflow.
MsSQL Support
Azure offers scalable managed SQL Server instances, which prove to be a good choice for a data repository. Puckel’s base Airflow image, however, cannot interact with this database out of the box. In this specific case we are not looking at setting up a database to host Airflow’s metadata (although that is possible), but rather at using it as a target destination for some of the DAGs.
In order to take advantage of a connection to MsSQL, the MsSQL ODBC driver needs to be installed. The documentation on how to set up this driver for Linux is provided on Microsoft’s website. Since the latest puckel/docker-airflow image is based on “python:3.6-slim”, itself based on Debian 9 “stretch”, we can add these dependencies to the image by following the Debian 9 instructions on Microsoft’s website:
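Following those instructions, the Dockerfile additions look roughly like this (run as root, since puckel’s image switches to the airflow user; package versions may differ in your setup):

```dockerfile
USER root
RUN apt-get update && apt-get install -y curl gnupg2 apt-transport-https \
 && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
 && curl https://packages.microsoft.com/config/debian/9/prod.list \
      > /etc/apt/sources.list.d/mssql-release.list \
 && apt-get update \
 && ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev \
 && rm -rf /var/lib/apt/lists/*
USER airflow
```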
Once this is done, a few packages will need to be installed through pip:
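Something along the lines of:

```shell
pip install "apache-airflow[mssql]" pyodbc pymssql
```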
The airflow[mssql] package adds the operators and hooks to interact with MsSQL, while pyodbc and pymssql provide a database interface to MsSQL. After having installed the driver and the Python packages, we can now connect to the specified MsSQL server, using a SQLAlchemy connection string in the following format:
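As an illustrative sketch, such a connection string can be assembled like this (the server, database, and credential values are placeholders):

```python
from urllib.parse import quote_plus


def mssql_connection_string(user, password, host, database,
                            port=1433, driver="ODBC Driver 17 for SQL Server"):
    """Build a SQLAlchemy URI for MsSQL over pyodbc.

    quote_plus escapes characters such as '@' in credentials that
    would otherwise break the URI.
    """
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?driver={quote_plus(driver)}"
    )


# Example with placeholder values:
uri = mssql_connection_string("admin", "p@ssword",
                              "foo.database.windows.net", "mydb")
```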
“Mounting” a DAG folder
The Azure command line provides an interface to mount external file shares directly onto the web app. We are therefore able to override the contents of the DAG folder with one contained within an external file share:
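With the az CLI, the mount can be sketched as follows (resource names, share name, and access key are placeholders; /usr/local/airflow/dags is the DAG folder in puckel’s image):

```shell
az webapp config storage-account add \
  --resource-group my-rg \
  --name my-airflow-webapp \
  --custom-id dags \
  --storage-type AzureFiles \
  --account-name mystorageaccount \
  --share-name dags \
  --access-key "<storage-account-key>" \
  --mount-path /usr/local/airflow/dags
```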
In order to be able to import modules within the DAG folder, we need to add Airflow’s home directory to the Python path. This can be done in the Dockerfile with the following command:
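In puckel’s image AIRFLOW_HOME is /usr/local/airflow, so the Dockerfile line would be:

```dockerfile
ENV PYTHONPATH=/usr/local/airflow
```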
Azure Specific Interfaces
Airflow offers a set of hook, sensors and operators to interact with different elements from the azure ecosystem, be it Azure blob Storage (WASB), Cosmos db, Azure Container instance or the Azure data-lake. Each of these have specific connectors which can be installed individually but airflow also provides for installing all of them in a systematic manner:
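In recent Airflow 1.10 releases, the azure extra bundles the individual Azure connectors:

```shell
pip install "apache-airflow[azure]"
```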