How to Set Up CI/CD for Azure Data Factory Using Azure DevOps

BulletByte
12 min read · Sep 3, 2023


If you have a team of developers working on Azure Data Factory for data engineering workloads, implementing CI/CD provides an efficient and seamless approach to building, validating, and deploying ADF components across environments.

This post will provide a quick guide on how to get started with setting up two simple CI/CD pipelines in Azure DevOps using YAML for automatic build, validation, and deployment of Azure Data Factory artifacts (e.g. Linked Services, Triggers, Pipelines, Datasets) across environments.

Benefits of CI/CD for Azure Data Factory

Implementing CI/CD for Azure Data Factory brings several benefits, especially if you are collaborating with multiple data engineers in a big team:

  • Version control with Git enables source code protection, auditability, and traceability.
  • Eliminate human errors associated with manual deployments and improve developers’ productivity.
  • Ensure consistency across environments and make troubleshooting easier.
  • Development efforts can be easily scaled up as projects expand.
  • Reduce risks as changes can be automatically validated and gatekept before production deployment.

Azure Data Factory CI/CD Lifecycle

The diagram below describes the stages within the Azure Data Factory CI/CD lifecycle. More details on how to set up each stage will be described in the subsequent sections.

Azure Data Factory CI/CD Architecture

Pre-requisites

  1. At least two Azure Data Factory environments (Dev and Prod).
  2. An Azure DevOps Project with a repository created.
  3. A Service Connection with a Service Principal that has at least the Data Factory Contributor role on the target Data Factory resources.
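As a reference for point 3, the role assignment can be granted with the Azure CLI. The sketch below uses placeholder values; substitute your own service principal ID, subscription, resource group, and factory name:

```shell
# Grant the service principal the Data Factory Contributor role,
# scoped to a single Data Factory (all values are placeholders).
az role assignment create \
  --assignee "<service-principal-app-id>" \
  --role "Data Factory Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>"
```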

Development in ADF with Azure DevOps GIT

Azure Data Factory has two modes: Git mode and Live mode. In the development environment (Dev ADF), Git mode is where development happens. Follow the steps below to set up Git integration for the Dev ADF.

Step 1: In the Dev Azure Data Factory Studio, navigate to the Manage tab > select Git configuration under the Source control section > click Configure.

Configure Git under Management Hub in the ADF UI

Step 2: In the Repos settings, choose “Azure DevOps Git” as the repository type. Select your Azure DevOps Account, Project name and the existing Repository that you have created in your DevOps project. Use `main` as the Collaboration branch and the default `adf_publish` as the Publish branch. You should create a sub-folder `adf-code` under the Root folder for easy organisation of all your ADF artifact files.

Azure DevOps Git Repos Settings in ADF

ADF development process with Git integration:

  • Each developer creates their own Feature branch as a clone of the Collaboration branch (‘main’).
  • Once a developer has made and tested changes in the Feature branch, they create a Pull Request (PR) to merge into the Collaboration branch (‘main’).
  • When the PR is approved and completed, changes will be committed in the ‘main’ branch which will automatically trigger the first pipeline that will build and deploy the changes to the Live mode of the Dev ADF. More details on how to set up this build-and-deploy-dev pipeline are in the next section.
In the ADF Studio Git mode, developers can create feature branches to work on
A Pull Request is required to merge to the collaboration branch

Build and Deploy to Dev ADF

In the Build (CI) stage, an Azure DevOps pipeline automatically validates the ADF code and generates the ARM templates using the ADFUtilities package from npm. The same pipeline also deploys the ADF artifacts to the Live mode of the Dev ADF. Below are the steps to set up your first Azure DevOps pipeline to build and deploy ADF in the Dev environment.

Step 1: Add a `package.json` file to your DevOps repo; it contains the details needed to obtain the ADFUtilities package for ADF code validation later in the pipeline.

  1. In your DevOps repository, create another subfolder `ci-cd` and a `package.json` file
  2. Paste the script below into the `package.json` file.​
```json
{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^0.1.6"
  }
}
```
Create a sub-folder `ci-cd` in your repository to keep the package.json file
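If you want to try the same validation locally before wiring it into a pipeline, the ADFUtilities commands can be run from the `ci-cd` folder. The resource ID below is a placeholder; it mirrors the `customCommand` used later in the pipeline:

```shell
# From the ci-cd folder: install the package defined in package.json,
# then validate the ADF code (resource ID values are placeholders).
npm install
npm run build validate ../adf-code "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataFactory/factories/<factory-name>"
```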

Step 2: Create two variable groups, one for Dev and one for Prod. Variables will be referenced within our YAML pipeline files. Using variables prevents hard-coding of sensitive information and makes parameters easier to update and maintain.

  1. In Azure DevOps, go to Pipelines > Library > click + Variable group. In this example, the variable groups are named `adf-dev` and `adf-prod` respectively.
  2. For each variable group, add the following variables with the respective values of your target Data Factory: subscription_id, service_connection_name, resource_group_name, data_factory_name, location
  3. Save.
Create two variable groups to keep the parameters whose values are replaced across environments
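The same variable groups can also be created from the command line with the Azure DevOps CLI extension instead of the UI. This is a sketch with placeholder values, assuming the extension is installed:

```shell
# Requires the azure-devops extension: az extension add --name azure-devops
# All values below are placeholders; repeat with --name adf-prod for Prod.
az pipelines variable-group create \
  --name adf-dev \
  --variables subscription_id=<subscription-id> \
              service_connection_name=<service-connection> \
              resource_group_name=<resource-group> \
              data_factory_name=<factory-name> \
  --organization https://dev.azure.com/<org> \
  --project <project>
```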

Step 3: In Azure DevOps, go to Pipelines > create New pipeline and select Azure Repos Git (YAML). Select the Dev ADF repository, and then choose Starter pipeline to start building our first pipeline that will build and deploy artifacts to the Dev ADF. Repeat this step to create the production deployment pipeline in the next section.

Create a new pipeline using Azure Repos Git (YAML)
Choose Starter pipeline to start scripting

Step 4: Start by renaming the YAML file and its path. In this example, we store the pipeline YAML file in the same `ci-cd` folder as the `package.json` file and name it `adf-build-and-deploy-dev-pipeline.yml`. Modify the YAML file to turn off the default trigger to avoid any accidental pipeline run.

```yaml
trigger:
# - main
- none
```
Rename the YAML file path and turn off the default trigger

Step 5: Start building your pipeline YAML file for the build and deployment of the Dev ADF. This section describes each of the steps included in the DevOps pipeline; the final YAML file can be downloaded at the end of this blog post.

  1. Declare variables with variable group `adf-dev`, additional variables, and the default vmImage.
```yaml
pool:
  vmImage: ubuntu-latest

variables:
- group: adf-dev
- name: BuildAdfResourceId
  value: /subscriptions/$(subscription_id)/resourceGroups/$(resource_group_name)/providers/Microsoft.DataFactory/factories/$(data_factory_name)
- name: WorkspaceArmTemplateDirectory
  value: $(Pipeline.Workspace)/adf-artifact-ArmTemplate
```

2. Create a stage called `Build_Adf_Arm_Stage` (CI) which includes these steps:

  • Ensure the workspace is clean, with no code left over from previous runs.
  • Install Node.js and the npm packages listed in your package.json file. This enables the use of ADFUtilities for validating the code and generating the deployment templates.
  • Validate all ADF source code within the subfolder `adf-code` using the validate command.
  • Generate ARM templates from the ADF source code using the export command.
  • Publish the ARM templates as a pipeline artifact named `adf-artifact-ArmTemplate`.
```yaml
stages:
- stage: Build_Adf_Arm_Stage
  jobs:
  - job: Build_Adf_Arm_Template
    displayName: 'ADF - ARM template'
    workspace:
      clean: all
    steps:
    - checkout: self
      displayName: 'Checkout ADF repo'
      clean: true
      path: $(data_factory_name)

    # Install Node.js and the npm packages saved in your package.json file
    - task: NodeTool@0
      displayName: 'Install Node.js'
      inputs:
        versionSpec: '14.x'

    - task: Npm@1
      displayName: 'Install npm packages'
      inputs:
        command: 'install'
        workingDir: '$(Build.SourcesDirectory)/ci-cd/'
        verbose: true

    # ADF - validate all source code in adf-code
    - task: Npm@1
      displayName: 'Validate source code'
      inputs:
        command: 'custom'
        workingDir: '$(Build.SourcesDirectory)/ci-cd/'
        customCommand: 'run build validate $(Build.SourcesDirectory)/adf-code $(BuildAdfResourceId)'

    # ADF - generate the ARM templates
    - task: Npm@1
      displayName: 'Generate ARM template'
      inputs:
        command: 'custom'
        workingDir: '$(Build.SourcesDirectory)/ci-cd/'
        customCommand: 'run build export $(Build.SourcesDirectory)/adf-code $(BuildAdfResourceId) "ArmTemplate"'

    # Publish the ARM templates as a pipeline artifact
    - task: PublishPipelineArtifact@1
      displayName: 'Publish ARM template'
      inputs:
        targetPath: '$(Build.SourcesDirectory)/ci-cd/ArmTemplate'
        artifact: 'adf-artifact-ArmTemplate'
        publishLocation: 'pipeline'
```

3. Create a stage called `Deploy_Dev_Stage` (CD) which includes these steps:

  • Download the ARM template artifact that was published in the Build stage. It is good practice to list the files downloaded into the directory.
  • Turn off all ADF triggers so that no ADF pipelines can run while the deployment is in progress.
  • Deploy to the Dev ADF (Live mode). To successfully deploy Linked Services, make sure that the service principal or Managed Identity used has sufficient access to the Azure resources linked from ADF. You may include overrideParameters at this step to replace any default values with custom values.
  • After the deployment is completed, turn the stopped triggers back on so that ADF pipelines resume running on their configured schedules.
```yaml
- stage: Deploy_Dev_Stage
  displayName: Deploy Dev Stage
  dependsOn: Build_Adf_Arm_Stage
  jobs:
  - deployment: Deploy_Dev
    displayName: 'Deployment - DEV'
    environment: DEV
    strategy:
      runOnce:
        deploy:
          steps:
          - task: DownloadPipelineArtifact@2
            displayName: 'Download build artifacts - ADF ARM templates'
            inputs:
              artifactName: 'adf-artifact-ArmTemplate'
              targetPath: '$(WorkspaceArmTemplateDirectory)'

          - script: dir
            displayName: 'List files in workspace'
            workingDirectory: '$(WorkspaceArmTemplateDirectory)'

          - task: toggle-adf-trigger@2
            displayName: 'STOP ADF triggers before deployment'
            inputs:
              azureSubscription: '$(service_connection_name)'
              ResourceGroupName: '$(resource_group_name)'
              DatafactoryName: '$(data_factory_name)'
              TriggerFilter: '' # name of the trigger; leave empty to stop all triggers
              TriggerStatus: 'stop'

          - task: AzureResourceManagerTemplateDeployment@3
            displayName: 'Deploying to Dev RG task'
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: '$(service_connection_name)'
              subscriptionId: '$(subscription_id)'
              action: 'Create Or Update Resource Group'
              resourceGroupName: '$(resource_group_name)'
              location: '$(location)'
              templateLocation: 'Linked artifact'
              csmFile: '$(WorkspaceArmTemplateDirectory)/ARMTemplateForFactory.json'
              csmParametersFile: '$(WorkspaceArmTemplateDirectory)/ARMTemplateParametersForFactory.json'
              overrideParameters: '-factoryName "$(data_factory_name)"'
              deploymentMode: 'Incremental'

          - task: toggle-adf-trigger@2
            displayName: 'START ADF triggers after deployment'
            inputs:
              azureSubscription: '$(service_connection_name)'
              ResourceGroupName: '$(resource_group_name)'
              DatafactoryName: '$(data_factory_name)'
              TriggerFilter: '' # name of the trigger; leave empty to start all triggers
              TriggerStatus: 'start'
```

4. Update the trigger so the pipeline only runs when there are new changes in the `adf-code` folder of the `main` branch. Click Save.

```yaml
trigger:
  branches:
    include:
    - main
  paths:
    include:
    - adf-code
```

Step 6: Rename the DevOps pipeline.

Rename the DevOps pipeline as `adf-build-and-deploy-dev-pipeline`

Step 7: Trigger the DevOps pipeline. Go back to the ADF Studio, create a new feature branch, make some changes, and start a pull request to merge to the `main` branch. Observe that the DevOps pipeline runs automatically. Once the pipeline has successfully built and deployed the Dev ADF, check the changes you made in Live mode in the ADF Studio.

A successful pipeline run shows the completion of the two stages that build and deploy the Dev ADF

We have just completed building the CI/CD to build and deploy ADF in the development environment. With this, developers no longer need to remember to hit the “Publish” button in the ADF Studio UI — changes will be automatically validated and deployed from Git to Live mode as soon as the PR is completed in Dev.

Deploy to Prod ADF

Now that the ADF artifacts are deployed in the development environment, the next part is to create another DevOps pipeline that deploys the ADF components to a new environment: Production.

Step 1: Similar to the instructions in Step 3 of the previous section, create a new DevOps pipeline named `adf-deploy-prod-pipeline.yml` under the `ci-cd` subfolder of your repo.

Step 2: Start building the new pipeline for production deployment with the following steps:

  • Explicitly disable the default pipeline trigger. By default, an Azure DevOps pipeline is triggered by any change committed to the `main` branch, so you have to disable this explicitly.
  • Set the trigger condition for this new pipeline, which depends on: (1) the completion of the previous pipeline `adf-build-and-deploy-dev-pipeline`, and (2) the latest dev deployment pipeline run being tagged `Production`.
  • Set the variables and the variable group `adf-prod` to be used for this pipeline.
```yaml
trigger:
# - main
- none

resources:
  pipelines:
  - pipeline: 'ADF-Build-Dev-resource'
    project: 'Demo'
    source: 'adf-build-and-deploy-dev-pipeline'
    trigger:
      stages:
      - Deploy_Dev_Stage
      tags:
      - Production

pool:
  vmImage: ubuntu-latest

variables:
- group: adf-prod
- name: WorkspaceArmTemplateDirectory
  value: $(Pipeline.Workspace)/ADF-Build-Dev-resource/adf-artifact-ArmTemplate
```
Define pipeline triggers and variables for deploying to Prod ADF

Step 3: Continue to complete the pipeline YAML file. Create a stage called `Deploy_to_Prod`, which is very similar to the deploy-to-dev stage we created in the previous pipeline.

  • Download the ARM artifacts by referencing what was built by the previous pipeline. Again, it is good practice to list the files in the directory.
  • Stop all ADF triggers to make sure no ADF pipelines run accidentally while the deployment is in progress.
  • Deploy your ADF artifacts to the Production environment. The overrideParameters section is important here: include any ARM template parameters whose values need to be replaced as they move to production. More details are in the Bonus Tips below.
  • After the deployment, start all ADF triggers again.
```yaml
stages:
- stage: Deploy_to_Prod
  displayName: Deploy Prod Stage
  jobs:
  - deployment: Deploy_Prod
    displayName: 'Deployment - Prod'
    environment: PROD
    strategy:
      runOnce:
        deploy:
          steps:
          - download: 'ADF-Build-Dev-resource'

          - script: dir
            displayName: List files in Workspace
            workingDirectory: '$(WorkspaceArmTemplateDirectory)'

          - task: toggle-adf-trigger@2
            inputs:
              azureSubscription: '$(service_connection_name)'
              ResourceGroupName: '$(resource_group_name)'
              DatafactoryName: '$(data_factory_name)'
              TriggerFilter: ''
              TriggerStatus: 'stop'

          - task: AzureResourceManagerTemplateDeployment@3
            displayName: 'Deploying to Prod RG task'
            inputs:
              deploymentScope: 'Resource Group'
              azureResourceManagerConnection: '$(service_connection_name)'
              subscriptionId: '$(subscription_id)'
              action: 'Create Or Update Resource Group'
              resourceGroupName: '$(resource_group_name)'
              location: '$(location)'
              templateLocation: 'Linked artifact'
              csmFile: '$(WorkspaceArmTemplateDirectory)/ARMTemplateForFactory.json'
              csmParametersFile: '$(WorkspaceArmTemplateDirectory)/ARMTemplateParametersForFactory.json'
              overrideParameters: '-factoryName "$(data_factory_name)"'
              deploymentMode: 'Incremental'

          - task: toggle-adf-trigger@2
            inputs:
              azureSubscription: '$(service_connection_name)'
              ResourceGroupName: '$(resource_group_name)'
              DatafactoryName: '$(data_factory_name)'
              TriggerFilter: ''
              TriggerStatus: 'start'
```

Step 4: To trigger the Production deployment pipeline, go to the latest successful run of the `adf-build-and-deploy-dev-pipeline` and add a tag `Production` to it. Observe that the `adf-deploy-prod-pipeline` starts running automatically. Check that all your ADF artifacts have been added to the Production environment once the pipeline has run successfully.

Add a tag `Production` to the latest successful Dev pipeline run to trigger the Prod deployment pipeline
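Adding the tag can also be scripted rather than done in the UI. With the Azure DevOps CLI extension, the step looks roughly like this (the run ID, organization, and project are placeholders):

```shell
# Tag the latest successful dev run so the prod pipeline picks it up.
# Requires the azure-devops extension; all values are placeholders.
az pipelines runs tag add \
  --run-id <run-id> \
  --tags Production \
  --organization https://dev.azure.com/<org> \
  --project <project>
```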

Bonus Tips

1. Make use of overrideParameters and variable groups to automatically replace parameter values in the ARM templates as you deploy ADF artifacts across environments.

  • Refer to the `ARMTemplateParametersForFactory.json` file which can be found in the pipeline artifact to know which ADF parameters are available and can be replaced.
  • If there are multiple ADF parameters to be included, you can add them as multiple lines under overrideParameters using the example syntax below:
```yaml
overrideParameters: >-
  -factoryName "$(data_factory_name)"
  -linkedservice_storage_account_properties_typeProperties_url "$(adls-url)"
  -linkedservice_key_vault_properties_typeProperties_baseUrl "$(akv-url)"
```

2. Integrate Azure Key Vault to protect sensitive information in your ADF linked services’ connection strings as well as in Azure DevOps variables. It is fairly common for data engineers to create ADF linked services with hard-coded connection strings and credentials. Similarly, DevOps engineers may store sensitive information in variables used in Azure DevOps pipelines. We can avoid the security risk of exposing sensitive information about your organisation’s cloud resources by keeping this information as secrets in Azure Key Vault.
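On the DevOps side, one option is to pull secrets straight from Key Vault at runtime with the built-in `AzureKeyVault@2` task, which exposes each secret as a pipeline secret variable. A minimal sketch, assuming a vault named `my-adf-kv` that the service connection's principal is allowed to read:

```yaml
# Fetch Key Vault secrets as secret pipeline variables before the deploy steps.
# The vault name is illustrative; grant the service connection Get/List on secrets.
- task: AzureKeyVault@2
  displayName: 'Get secrets from Azure Key Vault'
  inputs:
    azureSubscription: '$(service_connection_name)'
    KeyVaultName: 'my-adf-kv'
    SecretsFilter: '*'   # or a comma-separated list of specific secret names
    RunAsPreJob: true
```

Secrets fetched this way can then be referenced like any other variable, for example inside overrideParameters, without ever appearing in the repo or pipeline definition.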

3. Implement Azure DevOps Git branch policies and approvals for the Prod environment. It is good practice to implement additional checks and approval processes for your CI/CD pipelines. This prevents unauthorized or unapproved changes to your Production ADF and is especially important when working in a big team.

4. Do not change the deployment mode! Always use ‘Incremental’ mode, as the scope of deployment affects the whole resource group and not just the Data Factory. If you want to synchronize the ADF Live mode with your Git repo (publish your repo as the source of truth), go to your collaboration branch in ADF > Manage Hub > Git configuration > select “Overwrite live mode”.

Force publish changes from Collaboration branch

5. It’s recommended to use Global Parameters in ADF. For CI/CD to work with global parameters, you can select the setting to “Include global parameters in the ARM template” under ADF Manage Hub > ARM template.

Include global parameters in ARM template to use in CI/CD

Conclusion

By implementing the complete CI/CD process for Azure Data Factory in Azure DevOps, we have successfully automated our ADF development and deployment process. This enables source code control, auditability, traceability, and consistency across your ADF environments, reducing the risk of unwanted changes reaching Production.

I hope this article has been helpful. You can get a free copy of the YAML files for both pipelines via the link to my GumRoad store below.

References

  1. Link Azure DevOps variable groups with Azure Key Vault
  2. Implement Azure DevOps Git Branch Policies
  3. Implement Pipeline Approvals and Checks for Azure DevOps Environment

Like this post? 👏 Check out more content on my blog or consider supporting me with a cuppa coffee. ☕️ Thank you 🙏

Originally published at https://bulletbyte.weebly.com.
