Parallel loads ensure Pipelines scale

Azure Data Factory is built for scale, but you still need to consider how your coding patterns will affect that.

Carl Follows
Ricoh Digital Services
4 min read · Apr 27, 2020

As with creating new functionality in any programming language, the first task is always to make sure the code produces the expected output (the unit test).
Next is to clear down the system and check it still works when the new component is incorporated into the whole (the integration test).

At this point the temptation is always to push the change into user testing and move on to delivering the next piece of functionality. Often forgotten is reviewing the solution to understand how it behaves at scale, when a serious amount of data is thrown at it. This may not even be a requirement for business-as-usual operation, perhaps only for the initial data migration.
But invariably, if it’s not considered in advance, the system is bound to fail at the worst possible moment.

Azure Data Factory is built for scale, but this can be limited by neglecting non-functional requirements and choosing the wrong coding pattern.

One important place to check is the For Each activity.

Sequential For Each

The For Each activity has a subtle but critical configuration: “Sequential”.
By default this is set to false, meaning that ADF will try to perform all the iterations at the same time.

This will be good for you if your code expects it and allows for it.
But if you’re new to Data Factory then this can come as a bit of a surprise.
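
Under the hood the checkbox maps to the isSequential property in the activity’s JSON, with batchCount capping how many iterations run at once when it’s false. A minimal sketch, assuming a hypothetical activity name and FileList pipeline parameter:

{
    "name": "ForEach File",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 20,
        "items": {
            "value": "@pipeline().parameters.FileList",
            "type": "Expression"
        },
        "activities": []
    }
}

Left at these defaults, ADF will run up to twenty iterations concurrently.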

Initial Approach

Consider a simple requirement: use ADF to move files to a directory which must be looked up in a configuration database.

An initial approach may be to loop over the list of files, call a procedure to determine where each file needs to go, set the result into a couple of variables and perform the copy activity.

For unit testing, set the Sequential flag so you can monitor how it all works.

Iterate over the files

Within the For Each activity, the lookup calls a procedure on the database and the output is set into variables…

Lookup configuration and move each file in turn
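
As a sketch of the pipeline JSON, assuming a hypothetical stored procedure and config dataset, and assuming the loop items expose a name property (the names GET Path and DataLakePath match the expression used later in this article):

{
    "name": "GET Path",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderStoredProcedureName": "[dbo].[GetDestinationPath]",
            "storedProcedureParameters": {
                "FileName": { "value": "@item().name", "type": "String" }
            }
        },
        "dataset": { "referenceName": "ConfigDatabase", "type": "DatasetReference" },
        "firstRowOnly": true
    }
},
{
    "name": "SET DataLakePath",
    "type": "SetVariable",
    "dependsOn": [
        { "activity": "GET Path", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "variableName": "DataLakePath",
        "value": {
            "value": "@activity('GET Path').output.firstRow.DataLakePath",
            "type": "Expression"
        }
    }
}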

Then within the copy activity those variables are passed into the parameters of a generic sink dataset…

Variables passed into dataset parameters
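
In JSON terms that looks something like the sketch below, where GenericSink and its DirectoryPath parameter are illustrative names; the generic dataset would declare DirectoryPath as a parameter and reference it as @dataset().DirectoryPath in its folder path:

{
    "name": "Copy File",
    "type": "Copy",
    "inputs": [
        { "referenceName": "SourceFile", "type": "DatasetReference" }
    ],
    "outputs": [
        {
            "referenceName": "GenericSink",
            "type": "DatasetReference",
            "parameters": {
                "DirectoryPath": {
                    "value": "@variables('DataLakePath')",
                    "type": "Expression"
                }
            }
        }
    ],
    "typeProperties": {
        "source": { "type": "BinarySource" },
        "sink": { "type": "BinarySink" }
    }
}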

Ready to Go?

This will work when unit testing a few files, but try to move hundreds of files and it will take an awfully long time (as you might well expect).

Remove Sequential

So the first thought is to make the activity parallel by removing the Sequential configuration. That’ll speed things up, won’t it?

Alas, the behavior is not what some might expect. What you need to appreciate is the scope of the variables: they are not scoped to within the For Each loop; rather, there is only a single instance of them for the entire pipeline execution. This means that when the pipeline is executing with multiple For Each items in parallel, the setting of the variables happens out of sync with the file copy activities.

What this means in practice is that files will likely go to the wrong place, and that’s bound to make you unhappy.
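
The pipeline JSON makes the scope obvious: variables are declared once, on the pipeline itself, so every parallel iteration’s Set Variable activity writes to the same slot. Something like:

"properties": {
    "variables": {
        "DataLakePath": {
            "type": "String"
        }
    }
}

One iteration can set DataLakePath and, before its copy activity reads it, another iteration can overwrite it.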

Make it Parallel

So how to enable parallel processing?

In this scenario we can see that the problem is in the use of variables.
Whilst I’d typically encourage using variables to make code clearer to monitor and maintain, in this case it’s preventing the solution from scaling.

So we need to replace the setting of variables with expressions within the subsequent tasks, i.e.

Expression passed into dataset parameters

Taking the output of the first activity and feeding it directly into the second:

@{activity('GET Path').output.firstRow.DataLakePath}
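
Applied to the sink dataset reference from earlier (same illustrative GenericSink names), the parameter takes the lookup output directly instead of a variable:

"outputs": [
    {
        "referenceName": "GenericSink",
        "type": "DatasetReference",
        "parameters": {
            "DirectoryPath": {
                "value": "@activity('GET Path').output.firstRow.DataLakePath",
                "type": "Expression"
            }
        }
    }
]

Because each iteration evaluates the expression against its own run of the GET Path lookup, there is no shared pipeline-level state left for parallel iterations to trample on.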

Now we're shifting

So a few subtle changes can make a big difference to how your solution performs. Yes, some of your expressions will become more complex, and you may need to repeat blocks of logic in multiple places.
But it’s better to have a well performing system than just beautiful code.

Remember that programming languages are all different, and a pattern that serves you well in one can cause serious headaches in another.
Hopefully reading this has removed one such headache for you.
