This blog post describes how to avoid a common performance pitfall when using Azure Data Factory Mapping Data Flows. The same information exists in the public documentation (Data Flows Performance and Tuning Guide), but it is easy to miss in the ocean of documentation, so my goal here is to give the exact instructions in a succinct, easy-to-consume form. I would also like to credit my colleague Roshan Budathoki, who shared this nugget of wisdom to help one of my customers, which in turn led to this post for the benefit of others.
Azure Data Factory Mapping Data Flows use Apache Spark clusters behind the scenes to perform processing. With the default settings, each Data Flow Activity inside a pipeline spins up a new Spark cluster, which takes 3–5 minutes to start. If your pipeline consists of multiple sequential Data Flow activities, this startup time is paid once per activity and adds up, lengthening the overall execution time of the pipeline.
The solution is to set the TTL (Time to Live) property of the Integration Runtime used to execute the Data Flow Activity to an appropriate non-zero value. The cluster then stays alive for that period after a Data Flow finishes, so the next Data Flow Activity can reuse it instead of starting a new one. This applies not only to sequential Data Flow activities within a single pipeline but also to multiple pipelines run one after another (for example, running published pipelines during the development/testing phase, or for some other valid reason in production). On a side note, Data Flow Debug Mode also keeps a cluster alive and reuses it across multiple Data Flow activities, but that applies only when running non-published pipelines.
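To get a feel for how much time is at stake, here is a rough back-of-the-envelope sketch. It assumes every activity starts within the TTL window of the previous one and it ignores the shorter time a warm cluster still needs to pick up a job, so treat the numbers as an upper bound on the savings. The function name and the six-activity pipeline are hypothetical.

```python
def startup_overhead_minutes(num_activities, cold_start_minutes, ttl_enabled):
    """Total minutes spent on cold cluster startup for a sequential run.

    Assumption: with TTL enabled only the first activity pays the cold start;
    without TTL every Data Flow Activity pays it.
    """
    if num_activities <= 0:
        return 0
    cold_starts = 1 if ttl_enabled else num_activities
    return cold_starts * cold_start_minutes


if __name__ == "__main__":
    activities = 6  # hypothetical pipeline with six sequential Data Flow activities
    # 3 and 5 minutes are the startup range quoted in this post.
    for label, minutes in (("best case", 3), ("worst case", 5)):
        without_ttl = startup_overhead_minutes(activities, minutes, ttl_enabled=False)
        with_ttl = startup_overhead_minutes(activities, minutes, ttl_enabled=True)
        print(f"{label}: {without_ttl} min without TTL, about {with_ttl} min with TTL")
```

For a pipeline like this one, the startup overhead without TTL can easily exceed the time spent on actual processing.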
Step 1 — Create Integration Runtime with appropriate TTL for the Data Flows
It’s important to note that the TTL of AutoResolveIntegrationRuntime cannot be changed, so a separate Azure Integration Runtime needs to be created explicitly.
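If you prefer to look at the definition rather than the UI, the sketch below builds the JSON for such an Integration Runtime. The field names follow the Azure IR JSON as I understand it (`dataFlowProperties` with `timeToLive` in minutes), so please verify them against the JSON view of your own Integration Runtime. The name `DataFlowIRWithTTL`, the 10-minute TTL, and the compute size are illustrative assumptions, not recommendations.

```python
import json


def build_dataflow_ir(name, ttl_minutes, core_count=8, compute_type="General"):
    """Build an Azure IR definition (as a dict) with a non-zero Data Flow TTL.

    Assumption: this mirrors the IR JSON shape shown by ADF; check it against
    your factory before relying on it.
    """
    if ttl_minutes <= 0:
        raise ValueError("TTL must be a positive number of minutes to enable cluster reuse")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        "timeToLive": ttl_minutes,  # minutes the cluster stays warm
                    },
                }
            },
        },
    }


# Hypothetical IR name and TTL value, for illustration only.
print(json.dumps(build_dataflow_ir("DataFlowIRWithTTL", ttl_minutes=10), indent=2))
```

Keep in mind that a warm cluster is billed while it waits, so pick a TTL that covers the gap between your Data Flow activities and not much more.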
Step 2 — Update Integration Runtime setting on Data Flow Activities
Mapping Data Flows are invoked from ADF pipelines through the Data Flow Activity, and the Azure Integration Runtime (IR) setting lives on that activity. Specify the Integration Runtime created in Step 1 there, and make sure to apply this setting to every Data Flow Activity in your pipeline.
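Since it is easy to miss one activity, a small check over the exported pipeline JSON can help. The sketch below assumes Data Flow activities appear with type `ExecuteDataFlow` and reference the IR under `typeProperties.integrationRuntime.referenceName`, which is how the pipeline JSON looks to me; the helper name, the sample pipeline, and the IR name are hypothetical. It only inspects top-level activities and does not descend into containers such as ForEach.

```python
def activities_missing_ir(pipeline, ir_name):
    """Return names of top-level Data Flow activities not using the given IR."""
    missing = []
    for activity in pipeline.get("properties", {}).get("activities", []):
        if activity.get("type") != "ExecuteDataFlow":
            continue  # not a Data Flow Activity
        ir_ref = activity.get("typeProperties", {}).get("integrationRuntime", {})
        if ir_ref.get("referenceName") != ir_name:
            missing.append(activity.get("name"))
    return missing


# Hypothetical pipeline: the second activity still uses the default IR.
sample_pipeline = {
    "name": "SamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "LoadDimensions",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "integrationRuntime": {
                        "referenceName": "DataFlowIRWithTTL",
                        "type": "IntegrationRuntimeReference",
                    }
                },
            },
            {"name": "LoadFacts", "type": "ExecuteDataFlow", "typeProperties": {}},
        ]
    },
}

print(activities_missing_ir(sample_pipeline, "DataFlowIRWithTTL"))
```

Any activity name printed by the check still runs on a different Integration Runtime and will pay the full cluster startup time.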
- Data Flows Performance and Tuning Guide — https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance
- Data Flows Performance and Tuning Guide > Optimizing Integration Runtime — https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#ir
- Data Flows Debug Mode — https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-debug-mode