Implementing CI/CD for Azure Data Factory using Github Actions

Caspar van der Woerd · Auraidata
Jun 12, 2023 · 8 min read

Introduction

At one of our customers, we implemented an automated CI/CD pipeline using Github Actions that validates and deploys Azure Data Factory resources. In this blog, we share why CI/CD is important for ETL and how it can be implemented when working with Azure Data Factory and Github Actions. Code examples are provided 🎊.

Note: this article assumes a basic understanding of Azure Data Factory.

Why do we need CI/CD for our ETL jobs?

Extract, Transform and Load (ETL) is a process for extracting and integrating data from multiple sources into a single datastore for analysis. The pipelines we build for ETL are at the core of many analytics solutions as they automate the data flow which fuels dashboards and ML models. Therefore, it is important that ETL is tested and deployed properly to prevent introducing unwanted errors impacting downstream data products.

One way to achieve this is to set up different environments for testing and production where we can safely play around with test or production data without impacting the end-users of our data products. CI/CD involves the implementation of automated processes to test and deploy our solutions to these environments.

Problem

Let’s consider a (simplified) scenario that reflects the problem at our customer:

  • We have an Azure Data Factory (ADF) workspace which contains a pipeline that retrieves cocktail recipes using TheCocktailDB API🍹
  • The pipeline stores the recipes in an Azure Blob Storage
  • Authentication with the API is done using a key which is stored in an Azure Key Vault 🔒 (see the linked service example below)
The pipeline retrieves the API key from the Key Vault and copies a recipe to an Azure Blob Storage
  • For each of these resources (ADF, Blob Storage, Key Vault) we have both a development and a production version. These are nested in corresponding resource groups in Azure (rsg-demo-dev and rsg-demo-prd). Note that the CocktailDB API works with separate test and production keys.
  • A Github Repository is connected to the development ADF and stores all the definitions for the cocktail recipes pipeline.
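For illustration, the Key Vault connection is stored in the repository as an ADF linked service definition in the linkedService folder. Below is a minimal sketch of what such a file could look like; the name ls_keyvault and the vault URL are hypothetical placeholders, not taken from the actual repository:

{
  "name": "ls_keyvault",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://kvl-demo-dev.vault.azure.net/"
    }
  }
}

It is exactly this kind of environment-specific reference (kvl-demo-dev) that the deployment workflow described later rewrites when deploying to production.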

In the remainder of this article we present a solution and implementation that uses Github Actions to automatically deploy changes from the development ADF to the production ADF 🚀.

Azure resources used in our scenario

Solution

After the implementation of our solution, the workflow for development in our ADF is as shown below. The Github Actions workflows ensure that deployment to the live ADFs is done smoothly and safely. Next, we describe the details of the implementation and which additional validation checks are performed.

Workflow for development in ADF after implementing our solution.

Note: the proposed solution applies to the above scenario but can easily be adapted to solutions with different workflows or a different number of ADF workspaces.

Implementation

Repository

The Github Actions workflows required for our solution are implemented in the same repository that stores the ADF resources. The repository we use can be found here: [insert link here]. The structure is as follows:

.
├── .github
│   └── workflows
│       ├── deploy_adf.yml       # reusable workflow to deploy to an ADF environment
│       ├── merge_to_main.yml    # triggers deployments on merge to main
│       └── validate_adf.yml     # validates resources on pull request
├── dataset                      # contains ADF datasets
├── linkedService                # contains ADF linked services
├── pipeline                     # contains ADF pipelines
├── trigger                      # contains ADF triggers
├── package.json                 # npm package configuration
├── publish_config.json          # configuration file for ADF
├── env_mappings.json            # parameter mappings dev -> prd
└── README.md
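The file env_mappings.json contains the mappings used to rewrite dev references to their prd counterparts during deployment. It is a flat key-value object in which each key is a dev reference and each value its prd equivalent. A minimal sketch based on the resource names in our scenario (your own mappings will differ):

{
  "rsg-demo-dev": "rsg-demo-prd",
  "adf-aurai-dev": "adf-aurai-prd",
  "kvl-demo-dev": "kvl-demo-prd"
}

The deploy workflow shown later loops over these entries with jq and applies a global replace on the exported ARM template files.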

Prerequisites

  • To connect Github and Azure, we need to create an enterprise application with a corresponding Service Principal (SPN) that has the right access, as described here.
  • We want to avoid hard-coding sensitive information, so we store the client_id of our SPN, the subscription_id and the tenant_id as secrets in our Github repository (see the sketch after this list).
  • Make sure the main branch of your Github repository is a protected branch.
  • Optionally, restrict the access of users to the production ADF using RBAC in Azure. This prevents users from working directly in the production environment.
Azure specific references are stored as secrets in our Github Repository
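As an illustration of the second prerequisite, the secrets can be added with the Github CLI. A minimal sketch, where the secret names match the ones referenced in the workflows below and the values are placeholders:

# Store the Azure references as repository secrets using the Github CLI (values are placeholders)
gh secret set AZURE_TENANT_ID --body "<tenant-id>"
gh secret set AZURE_SUBSCRIPTION_ID --body "<subscription-id>"
gh secret set AZURE_DEV_CLIENT_ID --body "<client-id-of-dev-spn>"
gh secret set AZURE_PRD_CLIENT_ID --body "<client-id-of-prd-spn>"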

Validating ADF templates before merging to main

The workflow defined in validate_adf.yml prevents merging incorrect ADF resources to the main branch by making it a required status check. It uses the data-factory-validate-action to check the JSON-formatted ADF resources in the repository, similar to clicking Validate all in the ADF UI. The YAML file looks as follows:

name: ValidateADF

on:
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:

jobs:
  validate_adf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Validate Data Factory resources
        uses: Azure/data-factory-validate-action@v1.1.5
        with:
          path: ./
          id: /subscriptions/${{ secrets.AZURE_SUBSCRIPTION_ID }}/resourceGroups/rsg-demo-dev/providers/Microsoft.DataFactory/factories/adf-aurai-dev
The validate_adf workflow can be set as a required status check to prevent invalid ADF templates from being merged.
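Because the workflow also defines a workflow_dispatch trigger, it can be started manually as well, for example from a feature branch using the Github CLI. A hypothetical invocation (the branch name is a placeholder):

# Manually run the validation workflow against a feature branch
gh workflow run validate_adf.yml --ref feature/my-cocktail-pipeline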

Automating ADF resources deployment

Deployment of ADF resources is equivalent to using the publish button in the ADF UI to deploy changes to your ADF workspace (e.g. updating a trigger schedule). We want to be able to deploy both to dev and prd. Therefore, we use a reusable workflow deploy_adf.yml which can be called by other workflows and looks as follows:

name: Deploy ADF
run-name: Deploy ${{ inputs.data_factory }}

on:
  workflow_call:
    inputs:
      resource_group:
        description: 'Name of resource group that contains the ADF'
        required: true
        default: 'rsg-demo-dev'
        type: string
      data_factory:
        description: 'Name of ADF'
        required: true
        default: 'adf-aurai-dev'
        type: string
      pause_triggers:
        description: 'Pause all triggers before publishing.'
        required: false
        default: false
        type: boolean
      overwrite_references:
        description: 'Use env_mappings.json to overwrite environment specific references.'
        required: false
        default: false
        type: boolean
    secrets:
      client_id:
        description: 'Client ID of the SPN used to login to azure using OIDC'
        required: true
      tenant_id:
        description: 'Azure Active Directory Tenant ID'
        required: true
      subscription_id:
        description: 'Subscription ID of the subscription containing the ADF'
        required: true

permissions:
  id-token: write
  contents: read

jobs:
  deploy_adf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Pause all triggers
        if: ${{ inputs.pause_triggers }}
        run: |
          echo "Pausing all triggers"
          for file in ./trigger/*
          do
            jq '.properties.runtimeState="Stopped"' $file | tee trigger/temp.json >/dev/null
            mv trigger/temp.json $file
            echo "Set trigger runtimeState to 'Stopped' for $file"
          done
      - name: Export ARM Template
        id: export
        uses: Azure/data-factory-export-action@v1.0.2
        with:
          path: ./
          id: /subscriptions/${{ secrets.subscription_id }}/resourceGroups/${{ inputs.resource_group }}/providers/Microsoft.DataFactory/factories/${{ inputs.data_factory }}
      - name: Overwrite references in ARM Template
        if: ${{ inputs.overwrite_references }}
        run: |
          echo "Start overwriting references in ARMTemplateForFactory.json and ARMTemplateParametersForFactory.json using the mappings in env_mappings.json"
          jq -c 'to_entries[]' env_mappings.json | while read line; do
            old=$( jq -r '.key' <<< "$line" )
            new=$( jq -r '.value' <<< "$line" )
            echo "Replace $old by $new"
            sed -i "s|$old|$new|g" ${{ steps.export.outputs.arm-template-directory }}/ARMTemplateForFactory.json
            sed -i "s|$old|$new|g" ${{ steps.export.outputs.arm-template-directory }}/ARMTemplateParametersForFactory.json
          done
      - name: Publish ARM template
        uses: actions/upload-artifact@v2
        with:
          name: adf-artifact
          path: ${{ steps.export.outputs.arm-template-directory }}
          if-no-files-found: error
      - name: Copy ARM Template to root
        run: cp -a armTemplate/. .
      - name: AZ OIDC login
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.client_id }}
          tenant-id: ${{ secrets.tenant_id }}
          subscription-id: ${{ secrets.subscription_id }}
          enable-AzPSSession: true
      - name: Add datafactory extension to AZ CLI
        run: az extension add --name datafactory
      - name: Deploy resources
        uses: Azure/data-factory-deploy-action@v1.2.0
        with:
          resourceGroupName: ${{ inputs.resource_group }}
          dataFactoryName: ${{ inputs.data_factory }}
          armTemplateFile: ARMTemplateForFactory.json

The workflow calls a series of Github Actions to export the ARM template for the ADF resources, log in to Azure and deploy the exported ARM template to ADF. The workflow takes the following parameters:

  1. resource_group: Name of the Azure resource group containing the ADF.
  2. data_factory: Name of the ADF workspace.
  3. pause_triggers: If true, all triggers are paused, thereby pausing all pipeline schedules in the ADF workspace. We use this to pause all dev pipelines to save costs.
  4. overwrite_references: Whether dev references should be replaced with prd references. For example, in our case we want our dev ADF to connect to our dev key vault (kvl-demo-dev), whereas we want to connect to our production key vault in prd (kvl-demo-prd). By adding this mapping to env_mappings.json and setting this input to true, we can globally replace all occurrences of kvl-demo-dev with kvl-demo-prd (see the sketch after this list).
  5. In addition, the workflow takes the secret inputs client_id (client ID of the SPN used to log in to Azure using OIDC), tenant_id (ID of the Azure tenant) and subscription_id (ID of the subscription containing the ADF).
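To see what the overwrite_references step will do before it runs in CI, you can replay the same jq/sed logic locally on an exported ARM template. A rough sketch, assuming the template was exported to a local armTemplate/ folder (the directory name is a placeholder):

#!/usr/bin/env bash
# Replay the reference overwrite locally (assumes an exported template in ./armTemplate)
set -euo pipefail

TEMPLATE_DIR="armTemplate"  # hypothetical local export directory

jq -c 'to_entries[]' env_mappings.json | while read -r line; do
  old=$(jq -r '.key' <<< "$line")
  new=$(jq -r '.value' <<< "$line")
  echo "Replace $old by $new"
  sed -i "s|$old|$new|g" "$TEMPLATE_DIR/ARMTemplateForFactory.json"
  sed -i "s|$old|$new|g" "$TEMPLATE_DIR/ARMTemplateParametersForFactory.json"
done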

Kicking off the ADF deployments

Finally, the workflow merge_to_main.yml triggers when we push to the main branch (i.e. by merging a pull request). The script, shown below, implements the following steps:

  1. Trigger the deploy_adf.yml workflow for the dev environment while pausing all triggers (since we don’t need triggers running constantly in dev).
  2. If deployment to dev is successful, trigger the deploy_adf.yml workflow for production. For production, dev parameters are replaced by their prd equivalent.

For both steps it passes the corresponding inputs and the relevant secrets, which are stored in our Github repository.

name: MergeToMain

on:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  deploy_dev:
    name: DeployToDev
    uses: ./.github/workflows/deploy_adf.yml
    with:
      resource_group: 'rsg-demo-dev'
      data_factory: 'adf-aurai-dev'
      pause_triggers: true
      overwrite_references: false
    secrets:
      client_id: ${{ secrets.AZURE_DEV_CLIENT_ID }}
      tenant_id: ${{ secrets.AZURE_TENANT_ID }}
      subscription_id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  deploy_prd:
    needs: deploy_dev
    name: DeployToPrd
    uses: ./.github/workflows/deploy_adf.yml
    with:
      resource_group: 'rsg-demo-prd'
      data_factory: 'adf-aurai-prd'
      pause_triggers: false
      overwrite_references: true
    secrets:
      client_id: ${{ secrets.AZURE_PRD_CLIENT_ID }}
      tenant_id: ${{ secrets.AZURE_TENANT_ID }}
      subscription_id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
Merging to main triggers deployment to development and subsequently to production.
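After merging a pull request, the resulting runs can also be followed from the command line, for example with the Github CLI (a hypothetical invocation):

# List recent runs of the deployment workflow and follow one interactively
gh run list --workflow merge_to_main.yml --limit 5
gh run watch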

Conclusion

By implementing the above Github Actions workflows we now have a deployment pipeline for our Azure Data Factory resources. The solution enables us to test and develop pipelines in our development environment. Resources are automatically validated before changes can be merged, and deployments to the development and production environments automatically update environment-specific references.

This solution provides data engineers with a robust way to safely and efficiently deploy changes to Azure Data Factory resources, thereby saving time and preventing potential errors in downstream data products 🚀.

Aurai provides custom data solutions that help companies gain insights into their data. We engineer your company’s future through simplifying, organizing and automating data. Your time is maximized by receiving the automated knowledge effortlessly and enacting better processes on a foundation of relevant, reliable, and durable information. Interested in what Aurai can mean for your organisation? Don’t hesitate to contact us!
