The ultimate guide to Managed Airflow on Azure Data Factory

A guide on how to run Airflow in Azure Data Factory, use Git Sync and work with custom environments and private Python packages.

DataFairy
Towards Data Engineering
5 min read · Oct 12, 2023


Image courtesy of the author

Why setting up Airflow on ADF is a challenge

An Airflow instance running with Git Sync enabled and working

Managed Airflow in Azure Data Factory has been around since February 2023 as a preview service and hasn’t been generally available for long. There are still various issues with the service, and new features are being developed constantly. Microsoft’s documentation is slowly improving, but a couple of challenges remain in getting the service running properly.

For the last few weeks I have been stress-testing Managed Airflow on Azure. After a lot of trial and error I got it running with custom environments, Git Sync and custom Python packages. In this article I summarize what I have learned and how you can work around some of the issues I encountered.

Here are some of the issues I came across so far:

  • There is a lack of experience with this service in general. Microsoft support hasn’t gotten me very far either.
  • Features like Git Sync for anything but GitHub are not well documented, and it’s not always clear what the input configuration has to be.
  • Airflow instances can break when you enter the wrong configuration or try to install certain Python packages, and it’s not possible to restart them.
  • When the kubernetes Python package is added as a requirement, it becomes impossible to log in to your Airflow instance, because it redirects you to a different Azure tenant you have no access to.
  • Finding the following error in the Azure Diagnostics logs: You need to initialize the database. Please run `airflow db init`

Getting started

The current state of Managed Airflow on ADF

The current version of Managed Airflow on Azure has improved quite a bit from its preview version. Here are some of the features:

  • Ability to connect to Blob Storage for uploading DAGs and retrieving Python requirements files.
  • Ability to use Git Sync with different services such as GitHub and Azure DevOps.
  • Autoscaling.
  • Ability to use your own Python package and install it on Managed Airflow.
  • Ability to provide your own Python requirements (based on Python 3.8.17).

Managed Airflow configuration options
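For context, here is a minimal DAG you could drop into the synced DAGs folder to verify the setup works end to end (the dag_id, task and schedule here are illustrative, not part of any official sample):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def hello() -> None:
    # A trivial task body, just enough to confirm the scheduler picks it up.
    print("hello from Managed Airflow")


with DAG(
    dag_id="hello_managed_airflow",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule=None,   # trigger manually from the UI
    catchup=False,
) as dag:
    PythonOperator(task_id="hello", python_callable=hello)
```

If this DAG appears in the UI and runs, the storage or Git Sync side of your setup is working.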

Git Sync with Azure DevOps

One of the challenges with Git Sync in Managed Airflow is getting the configuration right. In my experience GitHub is the best-supported option so far, and it is the one mentioned in all the official documentation. Both public and private repositories are straightforward to sync with Airflow.

Azure DevOps was such a challenge that I reached out to Microsoft support, without much success. The breakthrough came when I experimented with the username: it worked once I used my organization’s name as the username.

So if this is your git clone url: https://OrgName@dev.azure.com/OrgName/ProjectName/_git/RepoName

Use the name before the @ as username: OrgName
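In other words, the username is simply the user-info part of the clone URL’s authority, before the @:

```python
from urllib.parse import urlparse

clone_url = "https://OrgName@dev.azure.com/OrgName/ProjectName/_git/RepoName"

# netloc is "OrgName@dev.azure.com"; the part before "@" is the
# username Git Sync expects for an Azure DevOps repository.
username = urlparse(clone_url).netloc.split("@")[0]
print(username)  # OrgName
```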

Custom environments

It’s possible to change the standard Python packages, install your own versions, and even add some others. I would recommend first running your custom setup locally with the Astro CLI or any other Airflow distribution that matches your Managed Airflow version.
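A minimal local test loop with the Astro CLI looks like this (shown as a sketch; pin your Airflow version and packages to match your Managed Airflow instance):

```shell
# Scaffold a local Airflow project (creates dags/, Dockerfile, requirements.txt)
astro dev init

# Add your custom packages to requirements.txt, then start Airflow locally
astro dev start

# Tear the local environment down when you are done
astro dev stop
```

If a package combination breaks locally, it will almost certainly break the managed instance too, and locally you can simply restart.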

When adding Python packages to an Airflow instance that has Git Sync enabled, make sure to remove the quotation marks from the package names when adding them in the UI, even though the UI tells you to include them(!). Separate the packages with commas.

Package names should not contain quotation marks

If you run into issues here, check your logs in Azure Data Factory (AzureDiagnostics). You might see an error saying: “package==xx.xx.xx” not found.
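For illustration (the package names and versions below are hypothetical), the UI input should look like the first line, not the second:

```
pandas==1.5.3, requests==2.31.0
"pandas==1.5.3", "requests==2.31.0"   <- quotation marks break the instance
```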

The Kubernetes issue

Managed Airflow on Azure and the kubernetes Python packages do not go together: you will be unable to log in to your Airflow instance. I don’t know what happens in the backend, but the service redirects you to a different tenant.

You can resolve the issue by removing the kubernetes and kubernetes-asyncio packages from your requirements.
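Concretely, if your requirements contain lines like the ones below (versions shown are hypothetical), drop them before updating the instance:

```
# Remove these before updating a Managed Airflow instance:
kubernetes==28.1.0
kubernetes-asyncio==24.2.3
```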

Custom Python packages

To get your own Python package running on Managed Airflow you can follow the guide below:

Sync a GitHub repository with Managed Airflow — Azure Data Factory | Microsoft Learn

My experience is that this guide is quite complete, but I would still add the following:

  • Uploading your package for the first time is a two-step process.
  • First make sure your package is uploaded or Git-synced before you install it; otherwise the instance might break.
  • Once that is done, add your package path to the requirements as described in the documentation and update your Airflow instance.
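As a sketch of the two steps (the layout and install path below are assumptions for illustration; the exact path depends on your linked storage or Git Sync setup, so follow the linked guide for your case):

```
# Step 1: upload or git-sync the package source alongside your DAGs, e.g.:
#   dags/
#   └── my_package/
#       ├── setup.py
#       └── my_package/__init__.py
#
# Step 2: only after the sync has completed, reference the package's
# path in your requirements and update the Airflow instance.
```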

db init issues

I found that this issue is caused either by a wrong configuration or by a flawed implementation of Managed Airflow. Try the same setup with a different Airflow instance size; until Microsoft fixes the backend issue, this is a temporary workaround.

Summary

In this article we looked at Managed Airflow in Microsoft Azure Data Factory. We went through the most important issues I encountered over the last few weeks while getting Airflow running with Git Sync, custom environments and private Python packages. For every issue I presented a solution that worked for me and might help you get to the next step of your Managed Airflow journey on Azure.

If you found this article useful, please follow me.


Senior Data Engineer, Azure Warrior, PhD in Theoretical Physics, The Netherlands. I write about Data Engineering, Machine Learning and DevOps on Azure.