Azure Data Factory Mapping Data Flows: A Performance Pitfall to Avoid

Inderjit Rana
Sep 17 · 3 min read

The purpose of this blog post is to describe how to avoid a common performance pitfall when using Azure Data Factory Mapping Data Flows. The same information exists in the public documentation (Data Flows Performance and Tuning Guide), but it is easy to miss in the ocean of documentation, so my goal here is to give the exact instructions in a succinct and easy-to-consume manner. I would also like to credit my colleague Roshan Budathoki, who shared this nugget of wisdom to help one of my customers, which subsequently resulted in this post for the benefit of others.

Problem Statement

Azure Data Factory Mapping Data Flows use Apache Spark clusters behind the scenes to perform processing. With default settings, each Data Flow Activity inside a pipeline spins up a new Spark cluster, and each cluster takes 3–5 minutes to start. If your pipeline consists of multiple sequential Data Flow activities, this startup time adds up and negatively impacts the overall execution time of the pipeline.
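To get a feel for how quickly this adds up, here is a rough back-of-the-envelope sketch. The function name and the example of four activities are my own illustration, and the model is deliberately simplified: it assumes a reused cluster costs no additional startup time and that each activity starts before the previous cluster has been released, which may not hold exactly in practice.

```python
def startup_overhead_minutes(num_activities, startup_minutes, reuse_cluster):
    """Estimate total cluster startup wait for sequential Data Flow activities.

    Simplifying assumption: a reused (warm) cluster adds no startup time.
    """
    if num_activities <= 0:
        return 0
    # Without reuse every activity pays the cold start; with reuse only the first does.
    cold_starts = 1 if reuse_cluster else num_activities
    return cold_starts * startup_minutes


# Hypothetical pipeline with 4 sequential Data Flow activities,
# evaluated at both ends of the 3-5 minute startup range.
for minutes in (3, 5):
    without_reuse = startup_overhead_minutes(4, minutes, reuse_cluster=False)
    with_reuse = startup_overhead_minutes(4, minutes, reuse_cluster=True)
    print(f"{minutes} min startup: {without_reuse} min without reuse, "
          f"{with_reuse} min with reuse")
```

Under these assumptions, four sequential activities spend 12 to 20 minutes just waiting for clusters, compared with 3 to 5 minutes when the cluster is kept alive between activities.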

Solution Summary

The solution is to set the TTL (Time to Live) property of the Integration Runtime used to execute the Data Flow Activity to an appropriate non-zero value. Setting TTL to a non-zero value helps not only with sequential Data Flow activities within a single pipeline but also when you run multiple pipelines one after another (for example, running published pipelines during the development/testing phase, or for some other valid reason in Production). On a side note, Data Flow Debug Mode also keeps the same cluster and reuses it across multiple Data Flow activities, but that applies only when running non-published pipelines.

Step 1 — Create Integration Runtime with appropriate TTL for the Data Flows

It’s important to note that it is not possible to change the TTL for AutoResolveIntegrationRuntime, so another Integration Runtime needs to be explicitly created.

Create an Integration Runtime in Azure Data Factory (Manage > Integration Runtimes > New > Azure, Self-Hosted > Azure)
Set Azure Integration Runtime Time to live (TTL)
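If you prefer to review the setting as JSON rather than in the UI, the sketch below builds what I understand the definition of an Azure Integration Runtime with a Data Flow TTL to look like. The runtime name, core count, compute type, and the 10 minute TTL are placeholder values of my own, and the helper function is purely illustrative; please compare the field names against the JSON view of the Integration Runtime in your own factory before relying on them.

```python
import json


def build_data_flow_ir(name, ttl_minutes, core_count=8, compute_type="General"):
    """Build an Azure Integration Runtime definition with a Data Flow TTL.

    Assumption: field names mirror the JSON shown for an Azure IR in ADF.
    """
    if ttl_minutes <= 0:
        # A TTL of 0 releases the cluster right away, which defeats the purpose.
        raise ValueError("ttl_minutes must be greater than zero")
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        "timeToLive": ttl_minutes,  # minutes to keep the cluster warm
                    },
                }
            },
        },
    }


# "DataFlowIRWithTTL" is a hypothetical name chosen for this example.
print(json.dumps(build_data_flow_ir("DataFlowIRWithTTL", ttl_minutes=10), indent=2))
```

Keep in mind that a longer TTL keeps the cluster alive (and billed) for longer after the last activity finishes, so pick a value that matches the gaps between your Data Flow activities.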

Step 2 — Update Integration Runtime setting on Data Flow Activities

Mapping Data Flows are invoked from ADF pipelines using the Data Flow Activity, and it is the Data Flow Activity that carries the Azure Integration Runtime (IR) setting. Specify the Integration Runtime created in Step 1 there, and make sure to apply this setting on all Data Flow Activities in your pipeline.

Integration Runtime setting for Data Flow Activity in a Pipeline
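Since it is easy to miss one activity in a larger pipeline, a small check over the pipeline JSON can help. The sketch below assumes the pipeline JSON has been loaded into a Python dictionary and that Data Flow activities appear with the type ExecuteDataFlow and an integrationRuntime reference under typeProperties, which is how I understand the pipeline JSON to be laid out. The pipeline, activity, and runtime names are made up for the example, and the check only looks at top-level activities, not ones nested inside containers such as ForEach.

```python
def data_flow_activities_missing_ir(pipeline, expected_ir):
    """Return names of top-level Data Flow activities not using expected_ir."""
    missing = []
    for activity in pipeline.get("properties", {}).get("activities", []):
        if activity.get("type") != "ExecuteDataFlow":
            continue  # only Data Flow activities are relevant here
        ir_ref = activity.get("typeProperties", {}).get("integrationRuntime") or {}
        if ir_ref.get("referenceName") != expected_ir:
            # No explicit reference means the default auto-resolve runtime is used.
            missing.append(activity.get("name"))
    return missing


# Hypothetical pipeline: one activity uses the new runtime, one was forgotten.
sample_pipeline = {
    "name": "SamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "LoadCustomers",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "integrationRuntime": {
                        "referenceName": "DataFlowIRWithTTL",
                        "type": "IntegrationRuntimeReference",
                    }
                },
            },
            {"name": "TransformOrders", "type": "ExecuteDataFlow", "typeProperties": {}},
            {"name": "CopyRawFiles", "type": "Copy", "typeProperties": {}},
        ]
    },
}

print(data_flow_activities_missing_ir(sample_pipeline, "DataFlowIRWithTTL"))
# prints ['TransformOrders']
```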

Inderjit Rana

Written by

Senior Cloud Architect @Microsoft. Please feel free to connect with me on LinkedIn: https://www.linkedin.com/in/singhinderjit/

Microsoft Azure

Any language. Any platform. Our team is focused on making the world more amazing for developers and IT operations communities with the best that Microsoft Azure can provide. If you want to contribute in this journey with us, contact us at medium@microsoft.com
