Azure Data Factory’s orchestration problem — why we need Airflow

Trøyel · Published in Glitni · 7 min read · Jun 9, 2023

Azure Data Factory (ADF) is a data ingestion, orchestration and transformation tool and an integral part of Microsoft Fabric. This article focuses on the orchestration side of ADF and one of its core functional limitations: how its ForEach activity handles parallel work.

With the recent introduction of Managed Airflow, I believe that orchestration should, in many cases, be handled outside of ADF using Airflow or a similar dedicated orchestration tool.

Collage, Apache Foundation, Microsoft and Dall-E

TLDR: ADF does not offer true parallelization in for-each loops.

Understanding the problem with ADF’s ForEach Activity

Back in October 2019, I stumbled upon a problem with the copy activity running within a ForEach in Azure Data Factory. We were operating a nightly ingestion job involving hundreds of tables, processed across four parallel threads.

A closer look at the execution chart revealed a significant amount of wasted time, conspicuously represented in the highlighted area of the execution graph below:

Found this old gem from October 2019 in my message history, asking around whether anyone else had noticed this “bug”.

Initially, this only caused 15 minutes of wasted time in a nightly pipeline that had ample time before it needed to be complete.

Today, on a project of much larger scale, this delay has turned from a minor inconvenience into a performance bottleneck that delays data delivery to users.

Below we see around 1000 copy activities running in 50 concurrent threads and, as highlighted in yellow, the orchestration run loses momentum multiple times. There should be no such drop-off until the last 50 activities are running and gradually finishing.

1000 copy activities running in 50 parallel threads, wasted time highlighted

It turns out this behavior seems not to be a bug, but a design flaw:

Azure Data Factory ForEach activity does not properly run in parallel.

This inefficiency poses a challenge for scaling your data platform when using ADF as an orchestration tool. In this article, I will delve into the issue and show how a dedicated orchestration tool such as Managed Airflow, used together with ADF, can solve it.

ADF ForEach Parallelization Explained

In short, when you give ADF an input consisting of the array [a,b,c,d,e,f,g,h,i] and have it work on those items with a batch count of 4, it pre-allocates the items to batches in the following way:

Table 1: How ADF PRE-assigns ForEach input items to batches/threads

Assuming all tasks have the same running time, this is not an issue. However, that is often not the case. For example, let’s assume tasks “a” through “i” in Table 1 have the following running times:

Table 2: Example running times of tasks “a” through “i”

We can see that the total running time of the activity is determined by how the tasks happen to be batched. Unfortunately, in this case, Batch 1’s 250-second running time for loops 1 to 3 set the total running time. Batches 2 and 3 finished their assigned work in 60 seconds, while Batch 1 was still working on loop 1 with loops 2 and 3 waiting in its queue.

I had expected Azure Data Factory to handle ForEach loops by handing each task to whichever batch slot becomes free, rather than allocating all tasks before execution starts. That workflow is illustrated below:

Table 3: How ForEach batching should be done

Reproducible example using ADF ForEach

As an example, take 16 tasks that are to be run in parallel across 4 threads. In ADF this can be done by passing them into a ForEach activity that feeds each @item() to a Wait activity that waits for @item() seconds. The input is shown below:

LoopItems = [100,1,1,1,100,1,1,1,100,1,1,1,100,1,1,1]

This would give the following Gantt and execution chart:

4 batches runtime 6 minutes 46 seconds (406 seconds)

If, instead, I changed the inputs to:

LoopItems = [100,1,1,1,1,100,1,1,1,1,100,1,1,1,1,100]

We get:

4 batches of wait activity — optimally ordered runtime 2 minutes 6 seconds (126 seconds)

We see that the same operation has an optimal running time of 126 seconds and a worst-case running time of 406 seconds (roughly a 222% increase in running time), with the only difference being the order of the inputs.
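
To make the scheduling difference concrete, here is a small Python sketch comparing the two strategies on the LoopItems above. It assumes ADF pre-assigns item i to batch i mod 4, which is consistent with the charts above, and it ignores per-activity startup overhead (which is why the measured numbers, 406 and 126 seconds, are a little higher than the simulated ones):

import heapq

def static_batches(durations, batch_count):
    """Pre-assigned batches: item i goes to batch i % batch_count; each batch runs its items sequentially."""
    totals = [0] * batch_count
    for i, duration in enumerate(durations):
        totals[i % batch_count] += duration
    return max(totals)

def dynamic_queue(durations, workers):
    """Shared queue: whichever worker becomes free first takes the next item."""
    free_at = [0] * workers              # finish times of the workers (min-heap)
    heapq.heapify(free_at)
    finish = 0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest-free worker picks up the next item
        finish = max(finish, start + duration)
        heapq.heappush(free_at, start + duration)
    return finish

worst = [100, 1, 1, 1] * 4                                        # the first ordering above
best = [100, 1, 1, 1, 1, 100, 1, 1, 1, 1, 100, 1, 1, 1, 1, 100]   # the second ordering above

print(static_batches(worst, 4), static_batches(best, 4))  # 400 vs 103 seconds
print(dynamic_queue(worst, 4), dynamic_queue(best, 4))    # 106 vs 107 seconds: nearly identical

The pre-assigned strategy is hostage to its unluckiest batch, while the shared queue finishes in roughly the same time for both orderings.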

Using Airflow combined with ADF for the same task

On February 1, 2023, Azure released “Managed Airflow”, which can be created through the management plane of Azure Data Factory. Here Azure is late to the party: GCP launched its managed Airflow (Cloud Composer) back in May 2018, and Amazon’s MWAA (Managed Workflows for Apache Airflow) followed in November 2020.

Although there are emerging tools in the data orchestration landscape, such as Prefect and Dagster, Airflow remains the preferred tool for many organizations, largely owing to its abundance of connectors and plugins and its community.

For organizations already using Azure data services, Airflow is now an even more appealing choice, as we no longer have to handle the user authentication, provisioning, infrastructure and everything else that comes with running Airflow ourselves on Kubernetes. Additionally, we can anticipate easier integrations with Azure services and potentially re-use ADF linked services, dataset authentication or identities.

Back to the topic of ForEach loops — how can we utilize Airflow to improve running time with parallel ADF pipelines?

Example using Managed Airflow

Getting started with testing Airflow is very easy. The basic steps are outlined in the introduction article from Microsoft.

After creating a Managed Airflow instance through the ADF interface, the most important step is adding connection details (client ID and secret) for an app registration that has Contributor access to ADF, so that Airflow is allowed to connect.

Setting up ADF dataset in Airflow using Azure AD App registration
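
Connections in Managed Airflow are normally added through the Airflow UI, but as a sketch of which values go where, the same connection could also be expressed programmatically. The connection id, the placeholder values and, in particular, the extra-field key names below are assumptions; the exact keys depend on your apache-airflow-providers-microsoft-azure version, so check the provider documentation:

import json
import os

# Illustrative only: register an ADF connection via the AIRFLOW_CONN_<CONN_ID>
# environment-variable mechanism, using an Azure AD app registration.
os.environ["AIRFLOW_CONN_ADF_DEFAULT"] = json.dumps(
    {
        "conn_type": "azure_data_factory",
        "login": "<app-registration-client-id>",         # client ID
        "password": "<app-registration-client-secret>",  # client secret
        "extra": {
            # Assumed key names for tenant, subscription, resource group and
            # factory; verify against your provider version.
            "tenantId": "<tenant-id>",
            "subscriptionId": "<subscription-id>",
            "resource_group_name": "<resource-group>",
            "factory_name": "<data-factory-name>",
        },
    }
)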

Result timing with Airflow orchestrating the same ADF pipelines

Airflow Large instance: 2 minutes 26 seconds (20 seconds slower than the optimal ordering in ADF)

Using Airflow, the running time for the same pipelines is slightly slower than the optimal ADF run, but much faster than the worst run. This also gives us the added benefit of consistent running times irrespective of input ordering.

The DAG definition I used can be seen below, and the ADF pipeline can be found here:

The DAG used for running ADF wait pipeline activities
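
The original DAG was embedded as a gist and is not reproduced here; the sketch below shows roughly what such a DAG can look like using the AzureDataFactoryRunPipelineOperator from the Microsoft Azure provider. The connection id, pipeline name, parameter name and resource names are illustrative assumptions, not the exact values from the original:

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

# Same inputs as in the ForEach example above.
LOOP_ITEMS = [100, 1, 1, 1, 100, 1, 1, 1, 100, 1, 1, 1, 100, 1, 1, 1]

with DAG(
    dag_id="adf_wait_pipeline_fanout",
    start_date=datetime(2023, 6, 1),
    schedule=None,             # trigger manually
    catchup=False,
    max_active_tasks=4,        # mirror the 4 parallel threads from the ADF ForEach example
) as dag:
    for i, wait_seconds in enumerate(LOOP_ITEMS):
        AzureDataFactoryRunPipelineOperator(
            task_id=f"run_wait_pipeline_{i}",
            azure_data_factory_conn_id="adf_default",     # the connection set up earlier
            pipeline_name="pl_wait",                      # assumed ADF pipeline name
            resource_group_name="<resource-group>",
            factory_name="<data-factory-name>",
            parameters={"WaitSeconds": wait_seconds},     # assumed pipeline parameter
            wait_for_termination=True,                    # block until the pipeline run finishes
        )

Because the scheduler hands each task to whichever slot becomes free next, rather than pre-assigning batches, the total running time stays close to optimal regardless of how the inputs are ordered.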

Summary

When using the ADF ForEach activity with many activities of highly variable duration, there is a risk of wasted time and delayed data if the inputs happen to be ordered unevenly. This is a significant limitation in ADF’s functionality as an orchestration tool, and it can be alleviated by using a dedicated orchestration tool such as Airflow in combination with ADF pipelines.

In addition to solving this very concrete issue, Airflow provides a wealth of connectors and added flexibility, such as being able to use Python anywhere in an orchestration flow. ADF is a great tool, but even greater when combined with other tools.

As this discussion shows, there is certainly a need and purpose for a dedicated orchestration tool such as Managed Airflow in Azure. Airflow is a very mature and widely used tool. I hope the developers put in the effort to improve its integration with ADF, Microsoft Fabric and other Azure services, such as managed identities, key vaults, and the datasets and linked services in ADF.

Appendix: Other potential solutions to ADF’s ForEach problem

While there might be many workarounds for this issue, I am still quite baffled that it has not been addressed by the ADF product team.

Enric Vidal Martinez has made the same observation and submitted an Azure Data Factory idea for this to be fixed, but it has only received a few votes as of June 8, 2023. If you agree this is an issue, please join us in voting for it!

Manual intervention — sorting/bucketing of input

One simple and effective solution that I have used is to sort the inputs by expected running time. By utilizing run-time history/logs, you can avoid the biggest issues by making sure that the two or three heaviest runs do not randomly end up on the same thread. In the simplified example above, sorting by size would give the optimal running order, but that will not usually be the case.
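
As a rough sketch of the idea (the table names and durations below are made up; the expected durations would come from your own run logs), ordering the ForEach input longest-first spreads the heaviest items across the pre-assigned batches:

# Expected running times per table, e.g. derived from ingestion logs (values are made up).
expected_seconds = {
    "table_a": 250, "table_b": 40, "table_c": 10,
    "table_d": 10, "table_e": 5, "table_f": 5,
    "table_g": 5, "table_h": 5, "table_i": 5,
}

# Longest-processing-time-first ordering for the ForEach input array.
loop_items = sorted(expected_seconds, key=expected_seconds.get, reverse=True)
print(loop_items)  # ['table_a', 'table_b', 'table_c', 'table_d', ...]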

Using the pipeline concurrency option

ADF has a built-in setting for limiting pipeline execution concurrency. However, the issue remains that the pipelines still need to be fired simultaneously and their responses waited for.

ADF pipeline settings for putting concurrency limit — NB: it only allows queueing 50 items!

Furthermore, this must be set on a per-pipeline basis, which can cause problems if you follow best practices and have a generic re-usable pipeline that you want to execute with different concurrency limitations based on process and inputs.

Using this option, however, leads to other problems at scale: ADF refuses to queue more than 50 pipeline runs under pipeline concurrency control, which again requires building some kind of additional queue and retry logic to handle the resulting failures.


I'm a Data Platform Engineer from Norway, currently working on building data platforms ⚡ Co-founder of Glitni