Tutorial: Managed Airflow on Azure

How to get started with Apache Airflow using the Azure Data Factory Airflow service. How to upload DAGs and use Airflow git sync with GitHub and Azure DevOps.

DataFairy
5 min read · Aug 31, 2023

The current state of Airflow on Azure

Since the beginning of 2023, Azure has been offering Apache Airflow as a managed service in Data Factory. It was in preview for a while and is now generally available (GA). The service is currently based on a virtual machine managed by Azure and might be extended to a Kubernetes service in the future.

If you navigate to your ADF resource, you should find Airflow at the bottom of the menu:

Apache Airflow in ADF.

You can have multiple Airflow instances running with different configurations:

The current options are:

  • Different Airflow versions, from 1.10 to 2.4.3 (not always an option)
  • Different VM sizes (small and large)
  • Number of extra nodes
  • Different DAG sources (git sync and storage upload)

In the following I want to explain how you can add your DAGs to Apache Airflow on Azure Data Factory.

Uploading DAGs

Without git sync you need to upload files manually to your Airflow instance. This is quite tedious if you just want to debug, so I recommend developing locally and only uploading production pipelines.
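If you develop locally anyway, you can catch most upload problems before they ever reach ADF. Below is a minimal sketch of how you could validate a folder of DAG files with Airflow's DagBag before uploading. It assumes apache-airflow is installed locally (ideally pinned to the same version as your ADF instance, e.g. 2.4.3) and that the DAG files live in a local dags/ folder.

```python
# validate_dags.py: hedged sketch, check DAG files locally before uploading to ADF.
# Assumes: pip install apache-airflow==2.4.3 (match your ADF Airflow version)
# and that your DAG files live in ./dags.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="dags", include_examples=False)

if dag_bag.import_errors:
    # import_errors maps each broken file to the exception raised while parsing it.
    for path, error in dag_bag.import_errors.items():
        print(f"FAILED: {path}\n{error}\n")
    raise SystemExit(1)

print(f"OK: parsed {len(dag_bag.dags)} DAG(s): {list(dag_bag.dags)}")
```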

Import dag files to Airflow.

You will need to create a linked service to your storage account in Azure. Linked services are how ADF connects to other resources in Azure. Then choose the container and directory where your dag folder is located. Don’t choose the dag folder itself. After you click Import, wait for Airflow to refresh.

If your files are valid, you won’t see any error messages and you should see all the pipelines from the dag folder appear in the Airflow UI.
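For reference, here is a minimal sketch of what a DAG file inside that dag folder could look like. The DAG id, schedule and command are placeholders; the imports follow the Airflow 2.4.x API that ADF currently offers.

```python
# dags/hello_adf.py: minimal placeholder DAG for testing the upload flow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_adf",              # placeholder name
    start_date=datetime(2023, 8, 1),
    schedule_interval=None,          # trigger manually while testing
    catchup=False,
    tags=["example"],
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Managed Airflow on ADF'",
    )
```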

Using Git Sync

Public Github repository

Adding a public GitHub repository is straightforward. When creating a new Airflow instance, just fill in the following configuration and you should be up and running in no time.
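Since the exact values depend on your repository, here is a hedged sketch of the kind of configuration I mean, written as a Python dict purely for readability. The field names only approximate the labels in the ADF git sync form, and the URL is a placeholder.

```python
# Hedged sketch of the ADF git sync settings for a public GitHub repository.
# Field names approximate the portal labels; this is not an official API.
git_sync_public_github = {
    "git_service_type": "GitHub",
    "git_credential_type": "None (public repository)",
    "repository_url": "https://github.com/<your-org>/<your-repo>",  # placeholder
    "branch": "main",
}
```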

Private GitHub repository with token

Adding a private repository to Airflow is a bit trickier. What you don’t want to do is update the git configuration of an existing Airflow instance and add a token:

Currently this will fail in ADF.

If you update your current Airflow instance’s git sync properties you will probably get this error:

It’s either an authentication issue or an issue with updating Airflow.

Something went wrong with updating your instance. Another possibility is that your PAT doesn’t have enough permissions. The minimum required permissions are currently not documented anywhere. (Get to work, Microsoft, and update your Managed Airflow documentation!)

Anyway, if you want to add a private repository, you need to create a new Airflow instance. I used a PAT with full access to the repository. The most important thing is not to update an existing Airflow instance.
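For a private repository on a fresh instance, the configuration looks roughly like this. Again a hedged sketch with placeholder values; the field names approximate the portal form.

```python
# Hedged sketch: git sync settings for a *new* Airflow instance pointing at a
# private GitHub repository. Do not try to apply this to an existing instance.
git_sync_private_github = {
    "git_service_type": "GitHub",
    "git_credential_type": "Personal Access Token",
    "repository_url": "https://github.com/<your-org>/<your-private-repo>",  # placeholder
    "branch": "main",
    "username": "<your-github-username>",
    "personal_access_token": "<PAT with full access to the repository>",
}
```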

Azure DevOps repository with PAT

Disclaimer: The following method might not work for everyone yet. Managed Airflow is constantly changing, and Microsoft updates the service even without making announcements. Chances are therefore quite high that this method of integrating with Azure DevOps will work just fine for everyone in the near future.

Azure DevOps is the option called ADO in the selection:

In Azure DevOps you can create a full-access PAT for your entire repository.

Navigate to user settings in the top right corner.
Full access PAT.

Git username:

This is where I ran into timeout issues and server errors. At first I used my Azure DevOps git username, which didn’t work. After support calls and hours of testing, I tried something else: I used the Azure DevOps organization name as the Git username together with my PAT, and that finally worked.
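In other words, the combination that worked looks roughly like this (a hedged sketch with placeholder values; field names approximate the portal form):

```python
# Hedged sketch: git sync settings for an Azure DevOps (ADO) repository.
# Note the username: the Azure DevOps *organization* name, not the git username.
git_sync_azure_devops = {
    "git_service_type": "ADO",
    "git_credential_type": "Personal Access Token",
    "repository_url": "https://dev.azure.com/<organization>/<project>/_git/<repo>",  # placeholder
    "branch": "main",
    "username": "<organization>",      # the ADO organization name
    "personal_access_token": "<full-access ADO PAT>",
}
```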

Don’t be surprised if starting an Airflow instance in Data Factory with a DevOps repository takes a few hours and even throws a time-out error.

Timeout errors due to failed connection to Azure DevOps.

Service Principal Authentication:

Disclaimer: The same caveat as above applies: this method might not work for everyone yet, but chances are it will in the near future.

If the PAT doesn’t work for you, you might want to try using a Service Principal. For this you will need (see the sketch after this list):

  • Service Principal client id
  • Service Principal secret
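A hedged sketch of the Service Principal variant, with placeholder values and approximate field names:

```python
# Hedged sketch: git sync settings using a Service Principal instead of a PAT.
git_sync_service_principal = {
    "git_service_type": "ADO",
    "git_credential_type": "Service Principal",
    "repository_url": "https://dev.azure.com/<organization>/<project>/_git/<repo>",  # placeholder
    "branch": "main",
    "service_principal_client_id": "<application (client) id>",
    "service_principal_secret": "<client secret>",
}
```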

Considerations when using Azure Airflow service

  • Costs: Airflow in ADF is based on virtual machines of a fixed size. The costs therefore do not scale down to zero when you are not using the service. https://learn.microsoft.com/en-us/azure/data-factory/airflow-pricing
  • Options: Airflow in ADF comes with restricted customization options. Some settings are also hidden from the user (admin) views. You could be an admin and still get the following message:
Here the admin is Microsoft Azure/ADF.
  • Preview/GA: The ADF Airflow service is still very new and not all bugs and issues have been resolved. The available documentation currently consists of only three pages.
  • Versions: Upstream Apache Airflow releases are ahead of the Managed Airflow versions available in ADF. Astronomer.io offers versions up to 2.6.x, while ADF offers version 2.4.3.
  • Local vs cloud: It appears to be more efficient (cost and time) to develop DAGs in a local environment and then upload them to the cloud. The limited number of available Airflow versions could be an issue that will have to be addressed by Microsoft.

If you found this article useful, please follow me.
