Continuous Deployment for Azure Data Factory

Otrek Wilke
Published in CodeX
9 min read, Sep 6, 2021

When using the low-code platform Azure Data Factory for ETL and other data engineering purposes, it is still a good idea to continuously integrate your work, especially when working in a team. Continuously deploying new versions of your Data Factory, without working directly in the production system, is also good practice for fast-evolving businesses. How? Let’s dive into it.

tl;dr

Use npm to create an ARM Template from the collaboration branch of a data factory, and a Bicep file to define the infrastructure. Build the infrastructure automatically and deploy the latest version of the collaboration branch to the various environments of your ADF using a build pipeline. Use a Release Pipeline to publish the collaboration branch.

CI/CD in general

Continuous Integration

CI is the process of continuously integrating changes into the current main version of your product, whether it is an app, a webpage, or a database. This comes with some implications. First, it is crucial to assert that the product functions the way it is expected to. Second, if it is a team project, everyone needs to contribute to the product as often as possible. The latter raises two issues of its own: team members need to be open about their work, and someone has to be responsible for integrating the changes.
Integration of changes and parallel development on multiple tasks can be addressed with git and the feature branch workflow or the git-flow workflow. With pull requests and a maintainer, or a group of maintainers, a certain level of quality can be assured in the main version.
To address the quality of the product, every developer should make sure that their code is tested. Test-driven development (TDD) is a good way to get there.

Continuous Deployment

When continuously integrating new features into a product, the next level is to deploy the latest version right after integration and building (in case a build is needed). It may be a good idea to continuously deploy only to the test or UAT environment, to spare users the inconvenience of too-rapid change.

Why use CI/CD in an Azure Data Factory? The reasons are similar to those for using CI/CD in a database project. When a team works on the data engineering part of a data-driven business, CI/CD helps to make new insights available to the business quickly while assuring the level of quality necessary to make the right decisions.
Developers should not work in a production environment, and it is easier to integrate small changes frequently than large new versions once in a while. However, bringing the latest changes over from the dev environment by hand becomes a tedious task when new versions are integrated multiple times a day. Here is how to automate this task.

The general workflow for CI/CD and prerequisites

Before getting started with CI and CD in Azure Data Factory you need to set up a few things.

  1. An Azure account. If you do not have a Microsoft account with an Azure subscription in which you can create Azure resources, go to www.azure.com and sign up. New accounts get free credit for the first 30 days (no ad).
  2. An Azure DevOps organization that is linked to the aforementioned Azure subscription with a git repository for the development environment of the data factory.
  3. Optional, but recommended: a local working environment with Azure CLI (az), git, and Visual Studio Code. An editor is more or less necessary, but you can use any editor you like.

How code should flow

  1. Create a main collaboration branch in the development Data Factory.
  2. As a developer create a new feature branch to make the changes you want.
  3. Create a pull request for your feature branch to the main branch.
  4. As the PR gets accepted and merged into the main branch a new build of the ARM Template for the Data Factory should be triggered and the current state of the main branch should be deployed to the various Data Factory environments you use.
  5. The current state of the Data Factory as defined in the main branch should be published to the development Data Factory.
  6. After approval, the new version should also be deployed to the next stage, whether it’s UAT or Production. In the following example, it’ll be published to Production right away.

Setting up the infrastructure

First, we need to create the necessary infrastructure, namely the data factory environments and the code repository.

Code repository first

The first step is to create a new Azure DevOps project with at least one repository. This has to be done manually or with an IaC solution other than Bicep. It is recommended to use a single repository for your data factory, connected to the development environment only, because the latest version from the main development branch will be deployed to all other environments.
You also need capacity to run the pipelines. Make sure you have at least the free tier of parallel jobs, or have a hosted agent or your own self-hosted build environment ready.

Set up the Dev Data Factory using Bicep

We start by using Bicep as the IaC tool to create the Azure Data Factory. Bicep can be used to define various parts of the data factory, but in this first demo only the factory itself and the git integration are defined within the Bicep file.

For more, please refer to the Microsoft.DataFactory factories resource reference.
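To make the shape of such a definition more concrete, here is a minimal Python sketch that builds roughly the ARM resource a Bicep factory definition with git integration corresponds to. The property names follow the Microsoft.DataFactory/factories resource schema as I understand it; the factory name, location, organization, project and repository are hypothetical placeholders, not the values used in this article.

```python
import json


def factory_resource(name, location, account, project, repository,
                     collaboration_branch="main", root_folder="/"):
    """Return an ARM-style definition of a data factory with Azure DevOps git integration."""
    return {
        "type": "Microsoft.DataFactory/factories",
        "apiVersion": "2018-06-01",
        "name": name,
        "location": location,
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "repoConfiguration": {
                # Azure DevOps git; a GitHub repository would use a different type.
                "type": "FactoryVSTSConfiguration",
                "accountName": account,
                "projectName": project,
                "repositoryName": repository,
                "collaborationBranch": collaboration_branch,
                "rootFolder": root_folder,
            }
        },
    }


# All values below are hypothetical placeholders.
dev_factory = factory_resource(
    name="adf-demo-dev",
    location="westeurope",
    account="my-devops-organization",
    project="my-project",
    repository="adf-demo",
)
print(json.dumps(dev_factory, indent=2))
```

Only the development factory gets a repoConfiguration in this setup; the other environments receive their content through the pipelines.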

Required files to build the ARM templates

Since node.js and npm are used to create the ARM Template for the data factory, a package.json file is needed to tell npm what to do. It introduces the dependency on the ADFUtilities npm package and defines the build script that creates the ARM Template.
It is also possible to add a validate script here, to validate the template before building.
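As a rough illustration, the following Python sketch generates a package.json of the kind described above. The package name and script path follow Microsoft's documentation for the ADFUtilities package; the version range is an assumption and should be checked against the current release.

```python
import json

# npm package that provides the validate and export commands for Data Factory.
ADF_UTILITIES = "@microsoft/azure-data-factory-utilities"

package = {
    "scripts": {
        # Entry point of the utilities; "validate" or "export" is passed as an argument.
        "build": f"node node_modules/{ADF_UTILITIES}/lib/index",
    },
    "dependencies": {
        ADF_UTILITIES: "^1.0.0",  # assumed version range
    },
}

package_json = json.dumps(package, indent=4)
print(package_json)
```

With this single build script, validation and export are selected by the first argument, for example `npm run build validate ...` or `npm run build export ...`.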

NPM package configuration file

Next, a publish_config.json file is needed to define the publish_branch of the created data factories.
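A minimal sketch of that file, again generated with Python. The key name publishBranch is the one Data Factory reads according to Microsoft's documentation; the branch name is a hypothetical example.

```python
import json

# Branch that Data Factory uses as its publish branch (hypothetical name).
publish_config = {"publishBranch": "factory/adf_publish"}

publish_config_json = json.dumps(publish_config, indent=4)
print(publish_config_json)
```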

Last but not least, an ARM Template parameters definition file is needed. This definition file can be used to configure various things about the data factory. This example includes a minimal version with just the factory name as a parameter.

The CI Part of the Azure Data Factory

Continuous Integration is the practice of every team member committing to the main branch as often as possible. It is sometimes summarized as "commit to main daily". The goal is to keep everybody and every working environment as up to date as possible.

As stated before, the feature branch flow is used in conjunction with pull requests so that every developer commits to the main version of the data factory as often as possible. When these PRs get merged, the new version of the Data Factory should be available to all environments of the factory, such as Development, Testing, UAT, and Production, or as many different environments as you need.

This is done with an Azure DevOps Pipeline, a build pipeline to be precise.

Start by creating an empty build pipeline and add the script included at the end of this article to create, build and deploy the Azure data factory. Let’s go through it step by step.

  1. Before the build steps, the script defines the trigger: the build pipeline starts for every commit to the main branch (you can change this to any other release branch if you like). It also selects the latest Ubuntu image for the build process and defines two variables, one for the working directory and one for the subscription to use.
  2. The pipeline starts with the steps in the build stage.
  3. In steps 1 and 2, node.js and npm are installed.
  4. Step 3 validates the artifacts using the ADFUtilities npm package.
  5. Step 4 creates the ARM templates to be deployed, again using the ADFUtilities npm package.
  6. Steps 5 and 6 use Bicep to create the data factory based on the ARM Template and the Bicep file.
  7. After the artifacts have been created in the artifacts folder, the development stage deploys them to the development data factory.
  8. Last but not least, the latest version of the main branch is also deployed to the production ADF. Note that the factory name to deploy to is defined via a variable in the stage, which refers to the parameter given in the arm_templates_parameters.json.
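The validate and export steps (steps 3 and 4 above) boil down to two npm calls that take the repository root folder and the resource ID of the factory. The sketch below only assembles these command lines, it does not run them. The command layout follows Microsoft's documentation for the ADFUtilities package; the subscription ID, resource group, factory name and output folder are hypothetical.

```python
import shlex


def factory_id(subscription_id, resource_group, factory_name):
    """Build the Azure resource ID of a data factory."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
    )


def build_commands(root_folder, fid, output_folder):
    """Return the npm calls for install, validate and export as argument lists."""
    return [
        ["npm", "install"],
        ["npm", "run", "build", "validate", root_folder, fid],
        ["npm", "run", "build", "export", root_folder, fid, output_folder],
    ]


# Hypothetical values for illustration only.
fid = factory_id("00000000-0000-0000-0000-000000000000", "rg-adf-demo", "adf-demo-dev")
commands = build_commands(".", fid, "ArmTemplate")
for command in commands:
    print(shlex.join(command))
```

In the pipeline itself these calls would run from the folder that contains package.json, and the export output folder becomes the build artifact.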

The CD Part for the Azure Data Factory

Now that we have the same main version in all our Data Factory environments, we may or may not want everything to be published to every factory as soon as the new version is available. Therefore a Release Pipeline is used.
The Release Pipeline does the publish part, which you would otherwise trigger manually by clicking the publish button in your Azure Data Factory development frontend. To remove the need to publish a new version manually even to the test environment, the release pipeline is also used to publish to the development data factory.
First, the newly built version of the Data Factory is downloaded as an artifact, then it is published to the Test/Dev environment. After approval in the Test/Dev stage, the version gets pushed to UAT and then Prod, or, as in this demo, directly to production.
If you prefer, you can have your new version published to Dev, UAT, and Prod in one go. This depends on how you set the triggers within the release pipeline.

Download the artifact, deploy to development and last but not least after approval to production

The release pipeline shown here consists of three steps:

  1. Download the artifacts
  2. Publish the current main branch to the development data factory
  3. Publish the current main branch to the production data factory

Tasks in the Pipeline

Download the artifacts

The first step is super simple. Only the previously created artifact needs to be specified here. Also, a continuous deployment trigger or a schedule might be specified here.

Download the artifacts

Deploy to Development and Production

The deployments to the development and to the production environment both consist of the same three tasks: two Azure PowerShell tasks and an ARM Template Deployment task.

First run the pre-deployment PowerShell script, while passing the parameters for the data factory name, the resource group name, and the template to use from the artifact.

Predeployment configuration setup

Next, run the ARM template deployment and make sure to override the necessary parameters.
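In the ARM template deployment task, parameter overrides are given as a single string of `-name value` pairs. The helper below is a small, hypothetical Python sketch of how such a string can be assembled. The parameter factoryName is the one Data Factory generates in its ARM template; the factory name itself is a placeholder.

```python
def override_parameters(params):
    """Format template parameter overrides as '-name "value"' pairs."""
    return " ".join(f'-{name} "{value}"' for name, value in params.items())


# Hypothetical target factory for the production stage.
overrides = override_parameters({"factoryName": "adf-demo-prod"})
print(overrides)
```

Any linked service or connection parameters exposed by the template would be overridden in the same way, one pair per parameter.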

Configuration for the actual deployment

Last but not least run the post-deployment script, again the parameters for data factory name, resource group, and template need to be given. Attention! The additional parameter preDeployment must be set to $false.
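The two PowerShell tasks differ only in their flags. The sketch below assembles the script arguments for both runs, assuming the parameter names of Microsoft's sample pre- and post-deployment script (armTemplate, ResourceGroupName, DataFactoryName, predeployment, deleteDeployment). The template path, artifact alias, resource group and factory name are hypothetical.

```python
def script_arguments(arm_template, resource_group, factory_name, pre_deployment):
    """Assemble the argument string for the pre-/post-deployment PowerShell script."""

    def ps_bool(flag):
        # PowerShell boolean literals
        return "$true" if flag else "$false"

    return " ".join([
        f'-armTemplate "{arm_template}"',
        f"-ResourceGroupName {resource_group}",
        f"-DataFactoryName {factory_name}",
        f"-predeployment {ps_bool(pre_deployment)}",
        # In Microsoft's sample usage, deleteDeployment is only enabled after the deployment.
        f"-deleteDeployment {ps_bool(not pre_deployment)}",
    ])


# Hypothetical artifact path and target names.
template = "$(System.DefaultWorkingDirectory)/_adf/ArmTemplate/ARMTemplateForFactory.json"
pre_args = script_arguments(template, "rg-adf-demo", "adf-demo-prod", pre_deployment=True)
post_args = script_arguments(template, "rg-adf-demo", "adf-demo-prod", pre_deployment=False)
print(pre_args)
print(post_args)
```

Generating both strings from one function makes it harder to forget flipping the predeployment flag in the second task.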

Configuration for the post deployment script

Wrap up

This article showed how to set up the development and deployment process for a data factory in a DevOps style, implementing continuous integration and continuous deployment. The example follows the way recommended by Microsoft. There are other ways, for example using Terraform. The given templates and examples can be extended for your specific application.
Be aware that you have to specify several parameters within the pipelines. Parameters in the release pipeline are specified in the frontend of Azure DevOps, which in my opinion is not ideal, since it breaks the paradigm of having everything as code.
If you like the article, leave a clap. If you would like to know more about Azure Data Factory, continuous integration, and data engineering, leave a comment or question, or subscribe.

Sources

https://en.wikipedia.org/wiki/Continuous_integration

https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment-improvements

The full pipeline
