Continuous Integration & Delivery with Azure Data Factory

C:\Dave\Storey · Published in The Startup · Jun 22, 2020

Introduction

As Software Engineers, we all know that Continuous Integration and Continuous Delivery (CI/CD) are things we should strive for in the systems we build. There is nothing worse than having something we build sit locked away in Git, rotting, unable to see the light of day (or production) because our team has an archaic release process. Or even worse, when companies do not trust their engineering teams to have automated deployments for fear of breaking production environments.

I recently worked on a large-scale data ingestion project using the Azure cloud platform. Naturally, Azure Data Factory V2 (ADF) became our logical choice of technology, but the question arose: how could we provide a flexible platform that would allow data engineers to experiment, build and modify their pipelines, while at the same time providing a stable production system to feed their downstream services with new data and minimising downtime for deployments?

It turns out that this is actually not that hard, thanks to the tooling available in ADF and Azure, but there were some interesting technical challenges to overcome. So sit back, relax, grab yourself some popcorn or a cuppa, and join me on this voyage of discovery into CI/CD using Azure Data Factory.

As always I will summarise my findings at the end of this blog, so if you want the TL;DR version, please just skip to the end (I promise I won’t be offended 😄).

The Design

The design of the system wasn’t anything too radical; it followed the standard pattern of multiple environments, DevOps practices, quality gates and deployment technologies that I am sure a number of you are familiar with:

Simplified system design
  • There were three physical ADF instances (Dev, Test and Prod). For security purposes we put each instance in its own subscription, but separate resource groups would also suffice.
  • Data engineers would be granted access to only the Dev ADF instance. This would prevent anyone making changes directly on Test or Prod and circumventing the CI/CD process.
  • The Dev ADF instance had Git integration enabled, so all changes to ADF artifacts such as pipelines and triggers were persisted into source control.
  • Data engineers would work on git branches; once they were happy with their work they would raise a Pull Request to merge their changes into the master branch.
  • To ensure changes could not be made directly onto the master branch, we enabled branch policies in git.
  • Our release pipeline would perform automated deployment and testing of ADF artifacts to our environments.
  • The Test environment was essentially an integration environment: pipelines would be run against example datasets to ensure they functioned as expected, and only if these tests passed would the release to Prod take place.

Seems simple, right? Well, thankfully it actually is, because of the tools ADF provides you.

ADF L❤️ves Git

So the real “secret sauce” that made our whole solution possible was being able to connect ADF to Git. By doing this, we were able to enforce good DevOps practices and to have all our ADF changes automatically start our release pipeline.

Now I’m not going to write up how to set up the integration manually using the UI etc., because there are a lot of other great docs that go into this. I thought instead I would show the approach we took to achieve this using Terraform:
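What follows is a simplified sketch of that configuration rather than our exact code; the factory name, the AzDO organisation/project/repo names and the `environment` variable are illustrative placeholders:

```hcl
variable "environment" {
  type        = string
  description = "Which environment we are deploying: dev, test or prod"
}

data "azurerm_client_config" "current" {}

resource "azurerm_data_factory" "adf" {
  name                = "adf-ingestion-${var.environment}"   # illustrative naming convention
  location            = azurerm_resource_group.rg.location   # resource group defined elsewhere
  resource_group_name = azurerm_resource_group.rg.name

  # Only the dev instance gets Git integration; Test and Prod receive their
  # artifacts from the release pipeline instead.
  dynamic "vsts_configuration" {
    for_each = var.environment == "dev" ? [1] : []
    content {
      account_name    = "my-azdo-org"     # AzDO organisation (placeholder)
      project_name    = "data-platform"   # AzDO project (placeholder)
      repository_name = "data-factory"    # repo holding the ADF artifacts (placeholder)
      branch_name     = "master"          # the collaboration branch
      root_folder     = "/adf"            # keep ADF artifacts in a subfolder of the repo
      tenant_id       = data.azurerm_client_config.current.tenant_id
    }
  }
}
```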

Here are a few things to note about this code snippet:

  1. The use of dynamic for vsts_configuration is necessary because we only want to have the git integration enabled for our “dev” environment. For those not familiar with this syntax, it is a really great addition to Terraform 0.12 for having conditional blocks inside a resource 🙂
  2. As we were using a Service Connection inside AzDO for provisioning these resources, we had to grant the service principal (SP) the Data Factory Contributor role to be able to do this advanced configuration.
  3. The integration of ADF with Git is very handy if you ever need to tear down your infrastructure, because once you run your script again and the integration is recreated, magically all your pipelines will reappear in your new ADF instance!
  4. The full configuration options for the azurerm_data_factory resource can be found here.

So while the code above will provision an ADF instance with git integration when the configuration is set to deploy to the dev instance (this could also be achieved using Terraform workspaces), there are a few things you need to be aware of:

  • branch_name : this is the name of the “Collaboration branch”. I will go into this in more detail in the next section, but suffice it to say that this is the branch ADF uses as “the source of truth”, and it defaults to master.
  • root_folder : this can be insanely helpful if you are building a monorepo and want to store your ADF artifacts alongside other code artifacts. By specifying a folder here, ADF will use a subfolder inside the repo rather than pollute the root of the project.

There is a quirk of the Git integration feature of ADF that requires some elevated permissions. You may see the following error when trying to perform the integration via Terraform:

"The client 'xxx@xxx.com' with object id 'xxxxx' does not have authorization to perform action 'Microsoft.DataFactory/locations/configureFactoryRepo/action' over scope '/subscriptions/xxxxxx' or the scope is invalid"

If this happens, you can resolve it by creating a custom role with the necessary permission, and then granting that role at the subscription level. Note that, at the time of writing, this must be granted at the subscription level; hopefully this will be resolved soon.
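If you are managing roles with Terraform as well, the custom role and its assignment can look roughly like this (the role name and the service principal variable are placeholders):

```hcl
variable "deployment_sp_object_id" {
  type        = string
  description = "Object id of the service principal behind the AzDO service connection"
}

data "azurerm_subscription" "current" {}

resource "azurerm_role_definition" "adf_configure_repo" {
  name  = "ADF Configure Factory Repo"   # placeholder role name
  scope = data.azurerm_subscription.current.id

  permissions {
    actions = ["Microsoft.DataFactory/locations/configureFactoryRepo/action"]
  }

  # At the time of writing, the permission has to be granted at subscription scope
  assignable_scopes = [data.azurerm_subscription.current.id]
}

resource "azurerm_role_assignment" "deployment_sp_adf_repo" {
  scope              = data.azurerm_subscription.current.id
  role_definition_id = azurerm_role_definition.adf_configure_repo.role_definition_resource_id
  principal_id       = var.deployment_sp_object_id
}
```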

I just want to take this opportunity to point out that the azurerm provider lets you do a lot more than just provision an ADF instance. A colleague of mine has contributed lots of other cool additions to the provider. For example, you could potentially have all of your ADF pipelines created, deployed and managed via Terraform.
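As a taste of what that could look like (illustrative only; the exact arguments depend on your provider version), here is a trivial pipeline defined entirely in Terraform:

```hcl
# Illustrative only: a trivial single-activity pipeline managed by Terraform.
# Requires a reasonably recent azurerm provider version that supports activities_json.
resource "azurerm_data_factory_pipeline" "wait_example" {
  name                = "wait-example"
  resource_group_name = azurerm_resource_group.rg.name
  data_factory_name   = azurerm_data_factory.adf.name

  activities_json = <<JSON
[
  {
    "name": "WaitBriefly",
    "type": "Wait",
    "typeProperties": { "waitTimeInSeconds": 10 }
  }
]
JSON
}
```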

Collaboration The ADF Way

As I mentioned in the previous section, when ADF is integrated with Git there is the concept of a “collaboration branch”. But what does this mean? Let’s dig a little deeper into this topic and get a better understanding of collaboration in ADF. Please take a look at the following diagram, provided by Microsoft, which explains the process:

(Diagram from the Microsoft documentation: https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment)

As you can see, the collaboration branch is what ADF uses to determine its current state. So, using good git workflows, we should be working in branches, raising Pull Requests and reviewing each other’s code; when we merge our code into the collaboration branch, ADF will see these changes and update itself accordingly.

This is all well and good, and our development ADF instance now knows about all our changes, but how can we then get these changes rolled out to our other environments? 🤔

In ADF terminology this is known as “Publishing”, and it can only take place from the collaboration branch. When you click the “Publish” button, ADF identifies all changes made to the collaboration branch since the last time a Publish was performed and generates the ARM templates needed to represent the state of the ADF instance. These ARM templates are then pushed back into your git repo under a special branch named adf_publish, ready to be deployed to another instance.

Now that we have ARM templates being published onto a branch in git, the only thing left to do is create a release pipeline to push these templates to our other environments, et voilà! We can automate ALL THE THINGS!
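In its simplest form, that release step can be a single PowerShell command against the templates ADF generates onto adf_publish (ARMTemplateForFactory.json and ARMTemplateParametersForFactory.json); the resource group and factory names below are illustrative:

```powershell
# Assumes an authenticated Az context (e.g. the Azure PowerShell task in AzDO).
# The generated template exposes a 'factoryName' parameter, which we override
# so the same template can target the Test or Prod factory.
New-AzResourceGroupDeployment `
    -ResourceGroupName 'rg-data-test' `
    -TemplateFile 'ARMTemplateForFactory.json' `
    -TemplateParameterFile 'ARMTemplateParametersForFactory.json' `
    -factoryName 'adf-ingestion-test' `
    -Mode Incremental
```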

One big “but”…

As you can see, this is all great, but there is one giant caveat… someone still needs to click the Publish button to start the flow 😔 I can’t begin to tell you the number of times I was swearing under my breath because I couldn’t see my changes in the Test/Prod environment, only to have a teammate remind me that I hadn’t clicked “Publish”.

Sometimes… you just have to apply palm straight to face

Deployment Pitfalls

OK, so we have our release pipeline set up following the guide provided by Microsoft. Seems simple enough… and for the most part it is, but there are a few pitfalls along the way that I just want to flag up:

  • The adf_publish branch natively contains only the ARM-related resources created via the ADF publishing process. As we had other dependencies inside our source code repo, we decided instead to trigger deployments from master and use the adf_publish branch as a deployment artifact.
  • ADF triggers must be stopped before they are updated. Luckily there is a script supplied that can help you automate this (see the sketch after this list).
  • ADF does a good job of hoisting a number of configuration values up as ARM parameters, which can easily be overridden during deployment, but it is not foolproof. For some more complicated pipelines you may find that certain values you would expect to be hoisted as parameters are actually not. At this point you have two options:
  1. You can create a custom default parameterization template; however, be warned, the syntax is not the easiest to understand (I will likely write a future blog about this topic to make it easier to understand).
  2. You can use this handy hack found by a member of my team and configure ADF to pull parameter values directly from Key Vault.
  • When deploying ADF resources via ARM it is usually best to select an “Incremental” deployment. This means that new and existing resources will be created/updated, but old resources that are no longer part of the template will be left hanging around. For ADF this is usually quite safe, but it is worth remembering if you start to see orphaned/old resources hanging around in your subscriptions. More information about ARM deployment modes can be found here.
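To give a feel for the trigger handling mentioned above, here is a much-simplified sketch of the idea (the supplied pre-/post-deployment script does this more thoroughly); the resource names are illustrative:

```powershell
$rg  = 'rg-data-test'           # illustrative names
$adf = 'adf-ingestion-test'

# Stop any triggers that are currently running before the ARM deployment...
$startedTriggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf |
    Where-Object { $_.RuntimeState -eq 'Started' }

$startedTriggers | ForEach-Object {
    Stop-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $_.Name -Force
}

# ... deploy the ARM template here ...

# ...then start the same triggers again once the deployment has finished.
$startedTriggers | ForEach-Object {
    Start-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $_.Name -Force
}
```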

Integration Testing Azure Data Factory

So… we have an automated pipeline for deploying ADF logic/artifacts to our environments, but what about quality gates? A key part of the CI/CD process is adding quality gates to our pipelines; without them, we could be pushing broken code into production! How on Earth do we test our pipelines to make sure they work outside of Dev?

Testing ADF is quite straightforward thanks to the extensive set of cmdlets provided for interfacing with it via PowerShell. Below is a sample PowerShell script that will enable you to trigger a pipeline run, poll until it has completed and then, if it succeeds, perform some assertion logic on its output:
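Something along these lines (the pipeline, activity and parameter names are illustrative, and the assertion will obviously depend on what your pipeline produces):

```powershell
$rg        = 'rg-data-test'            # illustrative names
$adf       = 'adf-ingestion-test'
$testStart = (Get-Date).AddMinutes(-5)

# Kick off a run of the pipeline under test
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $adf `
    -PipelineName 'copy-example-dataset' `
    -Parameter @{ inputPath = 'integration-tests/input.csv' }

# Poll until the run reaches a terminal state
do {
    Start-Sleep -Seconds 30
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $adf -PipelineRunId $runId
} while ($run.Status -in @('Queued', 'InProgress'))

if ($run.Status -ne 'Succeeded') {
    throw "Pipeline run $runId finished with status '$($run.Status)': $($run.Message)"
}

# Pull back the activity runs and assert on the output of the copy activity
$activityRuns = Get-AzDataFactoryV2ActivityRun -ResourceGroupName $rg -DataFactoryName $adf `
    -PipelineRunId $runId -RunStartedAfter $testStart -RunStartedBefore (Get-Date).AddMinutes(5)

$copyOutput = ($activityRuns | Where-Object { $_.ActivityName -eq 'CopyExampleData' }).Output.ToString() |
    ConvertFrom-Json

if ($copyOutput.rowsCopied -lt 1) {
    throw "Expected at least one row to be copied, but got $($copyOutput.rowsCopied)"
}
```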

But what are we going to test?

This is a question that comes up all too frequently when dealing with integration tests. A few things to remember:

  • You are not testing ADF: Microsoft already has plenty of tests for the service itself; you only want to test your pipeline logic.
  • What does your pipeline do?: Hopefully nothing too complicated, so you may be able to simplify your testing to “Given x input file, do we get output that matches y?”
  • Pipelines can take a long time to run: Depending on the logic inside and the amount of work taking place on the ADF instance, your pipeline may take a few minutes to run. How fast does your CI/CD process need to be?
  • Identify test domain boundaries: Remember that there may be better ways to test things. For instance, if your pipeline calls out to an Azure Function, you should potentially think about testing that resource separately and/or mocking it during testing to produce consistent results.
  • Treat Test and Prod environments differently: When deploying to production the last thing you want to do is run an extensive testing suite and prevent production workloads from getting CPU time. Think carefully about what you need to test on production, I would potentially argue that all you need to do is perform some simple smoke tests to ensure the pipeline has been deployed successfully.

TL;DR Summary

  • Continuous Integration and Continuous Deployment of ADF is relatively simple to do because of the built-in integration with source control systems such as AzDO Git repositories or GitHub; once this integration is enabled it is a simple case of building a release pipeline.
  • Integration testing of ADF pipelines is possible via PowerShell cmdlets, but can be quite arduous, and I would definitely recommend that you keep these tests as simple as possible to avoid them becoming brittle as pipelines evolve.
  • People must remember to click the Publish button in ADF; if they don’t click it, then code will not be released to other environments. It is therefore important to think about your development process when using this workflow.
  • ADF uses a special adf_publish branch for storing deployment ARM templates. This can be used as a deployment artifact for your release pipeline (see the Deployment Pitfalls section).
  • Some work is required to swap out certain types of pipeline config values in ARM template parameters at deployment time, but there are workarounds available (see the Deployment Pitfalls section).
  • Pipeline triggers must be stopped before they are updated. This can lead to some unexpected race conditions if you are using file-based triggers. Example: a file-based trigger has to be stopped to deploy, and right during that window a new file lands in your storage account… uh oh!
  • Integration testing of ADF Pipelines can be quite painful and brittle unless you spend time thinking about what you need to test and keep your tests simple.
  • Coordination of deploying ADF artifacts and any dependencies requires some consideration. A good example is a pipeline that calls an Azure Function, how do you orchestrate the deployment of the pipeline if it requires an update to the Function first?
  • If you are deploying ADF with Git integration via Terraform, then the user/service principal performing the deployment needs to have permissions to perform the action configureFactoryRepo at the subscription level.
  • Reverting changes made to ADF is hard. If you deploy something that is broken, rolling back can be very time consuming and painful. One idea we did toy with was having 2 Production environments and having a blue/green deployment process where we would “switch over to” our new environment by enabling the triggers once tests were complete, and leaving the old version of our pipelines deployed to a separate instance of ADF with triggers left disabled.

Closing Remarks

Azure Data Factory V2 is a powerful and robust platform for performing ETL at scale. It provides a nice simple interface for engineers to drag and drop processes together and form complex logic, whilst hiding away a lot of the complexity of error handling etc.

Hopefully this blog post has shown you how simple it can be to get a CI/CD workflow set up and running with Azure Data Factory V2. It is not without its pitfalls, but I hope that in reading this I can help others navigate around the potential pain points.

In retrospect, if I were to undertake the same type of project again, I would probably investigate the use of the new Terraform resources for pipelines and triggers etc. rather than have to use ARM templates. The reasons for this are:

  • Reduce the need to know how to manipulate ARM template parameters (which can be surprisingly complicated).
  • If you are already using Terraform to deploy Infrastructure, this naturally slots into that flow.
  • Terraform does a much better job of managing state than incremental ARM template deployments.
  • It would also remove the need to remember to click that damned Publish button in the ADF web UI!!!!! 😠

