Optimizing ADF Pipelines for Data Ingestion

Learnings from the field with Azure Data Factory

Archit Pandita
Hashmap, an NTT DATA Company
6 min read · Aug 5, 2021


Overview

When it comes to data engineering and data integration, everyone is focused on one thing: optimizing pipeline run time and making pipelines as productive as possible. I want to share some recent optimization learnings about Azure Data Factory (ADF) in a quick walk-through.

In this example, I chose to use the For-each activity in batch mode rather than the copy activity's parallelism setting in order to optimize pipeline time. I also built the pipeline to run the copy activity multiple times rather than once, so the two approaches could be compared. In addition, I have included time-analysis results to give some clarity to the head-to-head comparisons. Please note that the comparison is based on my API's data payloads, which ranged from 35KB to 14.7MB; in all, the files totaled 26.5MB. With that context in place, let's dive in!

Scenario

Here, I have a paginated API, and the ingested data is kept in Azure Blob Storage.

[Image: Design]
1. In this case, I have created a pipeline that ingests data from the API page by page and dumps it into blob storage. To be specific, a request is made for each page of data, and pages are requested until the last page is reached; the total number of pages is provided in the API's response. (A conceptual sketch of this loop follows the images below.)
[Image: For-each activity]
[Image: API ingestion]
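Outside of ADF, the ingestion loop above amounts to roughly the following. This is a minimal Python sketch under assumed names: the API URL, the totalPages response field, the connection string, and the container name are all placeholders, and in the actual pipeline this logic is expressed with ADF activities rather than code.

```python
import requests
from azure.storage.blob import BlobServiceClient

# Hypothetical endpoint and storage names, for illustration only.
API_URL = "https://example.com/api/data"
CONN_STR = "<storage-connection-string>"
RAW_CONTAINER = "raw"

blob_service = BlobServiceClient.from_connection_string(CONN_STR)

def save_page(page_number: int, content: bytes) -> None:
    """Write one page of API data to its own blob in the raw container."""
    blob_service.get_blob_client(RAW_CONTAINER, f"page_{page_number}.json") \
                .upload_blob(content, overwrite=True)

# Request the first page; the response is assumed to report the total page count.
first = requests.get(API_URL, params={"page": 1})
first.raise_for_status()
total_pages = first.json()["totalPages"]  # assumed field name
save_page(1, first.content)

# Keep requesting pages until the last one is reached.
for page in range(2, total_pages + 1):
    resp = requests.get(API_URL, params={"page": page})
    resp.raise_for_status()
    save_page(page, resp.content)
```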

2. All files are then copied from raw blob storage to standard blob storage (unprocessed files).

[Image: Copy raw to standard blob]

3. If all files, or files matching a wildcard/regex pattern, are to be copied from one or more locations, the method below is more efficient (see the sketch after the image).

[Image: Copy all files from src to test using wildcard]
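As a rough equivalent of that wildcard copy, the sketch below copies every blob whose name matches a pattern from the raw container to the standard container in a single pass. The connection string, container names, and pattern are assumptions; in the pipeline itself this is one copy activity with a wildcard path.

```python
import fnmatch
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-connection-string>"  # placeholder
blob_service = BlobServiceClient.from_connection_string(CONN_STR)
raw = blob_service.get_container_client("raw")            # assumed source container
standard = blob_service.get_container_client("standard")  # assumed sink container

# Copy every blob matching the wildcard in one pass (no per-file pipeline iteration).
# Assumes source and sink live in the same storage account; otherwise the source URL
# would need a SAS token for start_copy_from_url to read it.
for blob in raw.list_blobs():
    if fnmatch.fnmatch(blob.name, "page_*.json"):
        source_url = f"{raw.url}/{blob.name}"
        standard.get_blob_client(blob.name).start_copy_from_url(source_url)
```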

The Comparisons & Results

I have used the Copy activity with the API as the source and blob storage as the sink. An array is then passed to the For-each activity to ingest the data and store it.
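For example, the items array can simply be the list of page numbers derived from the API's reported total; a hypothetical sketch:

```python
# Hypothetical: build the For-each items array from the API's reported page count.
total_pages = 5  # e.g., read from the first API response
pages = list(range(1, total_pages + 1))  # -> [1, 2, 3, 4, 5]

# In ADF itself, roughly the same array can be produced with the expression
# @range(1, <total pages>) and passed to the For-each activity's Items setting.
```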

Comparison: Ingest different amounts of data and copy from raw to standard blob
Parameters: DIU=Auto, Parallelism=default vs DIU=Auto, Parallelism=2, For-each=Sequential

[Image: Pipeline comparison based on copy activity]

Comparison: Ingest different amounts of data and copy from raw to standard blob
Parameters: DIU=Auto, Parallelism=default vs DIU=Auto, Parallelism=5, For-each=Sequential

[Image: Pipeline comparison based on copy activity]

Comparison: Copy all files using different parallelism settings (Auto vs 2 vs 5)
Parameters: DIU=Auto, Parallelism=default vs DIU=4, Parallelism=2 vs DIU=4, Parallelism=5

[Image: Copy activity for all files]
[Image: Comparison of copy activity parallelism]
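For reference, both knobs used in these comparisons live on the copy activity itself. The fragment below is a sketch of where they are set, written as a Python dict mirroring the pipeline JSON; the activity name and the source/sink details are simplified placeholders.

```python
# Trimmed sketch of a Copy activity definition (a Python dict mirroring the pipeline JSON).
copy_activity = {
    "name": "CopyRawToStandard",            # placeholder name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "BlobSource"},   # simplified
        "sink": {"type": "BlobSink"},       # simplified
        "dataIntegrationUnits": 4,          # DIU; omit the property to let ADF choose (Auto)
        "parallelCopies": 2,                # degree of parallelism within the copy
    },
}
```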

Using the For-each activity with Batch rather than Sequential

  1. Run the pipeline for all pages (1–5).
  2. Ingest data from the API into raw blob storage in batches.
  3. Copy each file individually, in batches [1, 2, 3, 4, 5].

Comparison: Ingest multiple pages of data from the API to raw and then to standard blob, total data = 27.5MB, total files = 5
Parameters: For-each Sequential vs For-each Batch size

[Image: Comparison of For-each activity using batch]
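The sequential-versus-batch switch is a setting on the For-each activity. A minimal sketch of the relevant properties, again as a dict mirroring the pipeline JSON (the items expression and inner activities are placeholders):

```python
# Sketch of a For-each activity running its inner activities in parallel batches.
for_each_activity = {
    "name": "IngestPages",       # placeholder name
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,   # False => iterations run in parallel
        "batchCount": 5,         # max concurrent iterations when not sequential
        "items": {"value": "@pipeline().parameters.pages", "type": "Expression"},
        "activities": [
            # ... the copy (and any processing) activities executed per item ...
        ],
    },
}
```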

Copying all the files from one folder to the other with a single copy activity is the way to go here. It is more efficient than using a metadata lookup (e.g., the Get Metadata activity) to acquire a list of files and then moving them to another location with a copy inside a For-each with a batch size of 2, as contrasted in the sketch below. In general, if the objective is solely to copy the files, the copy-only method is the best option. Otherwise, if the files have to be processed by multiple activities, the batching method makes sense.

[Image: Copy files iteratively]
[Image: Comparison of copy activity vs iterative copy]
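For contrast with the single wildcard copy shown earlier, the iterative pattern looks roughly like the sketch below: list the files first, then copy them one at a time inside the loop (in ADF, a Get Metadata activity feeding a For-each). The extra per-file orchestration is what makes it slower when no processing is needed; names here are placeholders.

```python
from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-connection-string>"  # placeholder
svc = BlobServiceClient.from_connection_string(CONN_STR)
raw = svc.get_container_client("raw")            # assumed source container
standard = svc.get_container_client("standard")  # assumed destination container

# 1) "Get Metadata" step: list the files first.
file_names = [blob.name for blob in raw.list_blobs()]

# 2) "For-each" step: copy each file individually (one operation per file).
for name in file_names:
    standard.get_blob_client(name).start_copy_from_url(f"{raw.url}/{name}")
```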

Optimizing Cost

  1. Set the DIU manually to the lowest value that works (the default is Auto, with a minimum of 4). If the pipeline is not too heavy, the value can be lowered to 2, so that only two data integration units are used for the run. Fewer resources used means more cost savings.
  2. Publish and run rather than debug: use publish-and-trigger instead of debug runs. The default TTL for a Debug session is 60 minutes, and that does not include the run itself, so use it wisely. A more in-depth explanation can be found in Microsoft's documentation on ADF. (A sketch of triggering a published pipeline programmatically follows this list.)
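As one way to follow that advice, a published pipeline can be triggered programmatically instead of through a debug session. Here is a minimal sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline name, and parameter are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers for illustration.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
PIPELINE_NAME = "ingest_api_to_blob"  # hypothetical pipeline name

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Trigger a run of the published pipeline (no debug session involved).
run = client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    PIPELINE_NAME,
    parameters={"pages": [1, 2, 3, 4, 5]},  # assumed pipeline parameter
)
print("Started pipeline run:", run.run_id)
```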

Observation

Manually setting parallelism on the copy activity did not pay off. Typically, more threads increase data throughput, but the default/Auto setting adjusts itself to an even better-optimized value. In my case the data volume was not very large; however, if the data volume is large and resources are limited, explicitly setting the parallelism and DIU will be more effective.

Be cognizant of resource utilization: if the system is being overloaded by these processes, the setting will need to be capped at a higher or lower value as appropriate. If multiple files need to be copied from blob storage, it is better to use a single copy activity with a regex, wildcard, etc.

If the files actually need to be processed, batching is the better option. Running the For-each activity in batch mode is more effective because it executes the inner activities in parallel.

Closing Thoughts

Hopefully, this analysis provided some interesting points to think about regarding optimization and efficiency while building ADF pipelines, and adopting parallelism via batched For-each activities and wildcard-based copy activities might save you some cost as well.

(Results will likely vary based on your conditions and the size of the data.)

Ready to Accelerate Your Digital Transformation?

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with nearly every industry to solve the toughest data challenges, including cloud and data migrations, building new data apps, designing and architecting data pipelines, automating data processes, and more. We can help you shorten time-to-value!

We offer a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Additional Resources

Archit Pandita is a Data Engineer for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
