Implementing CI/CD for Azure Data Factory using Github Actions

Caspar van der Woerd · Auraidata
Jun 12, 2023 · 8 min read

Introduction

At one of our customers, we implemented an automated CI/CD pipeline using Github Actions that validates and deploys Azure Data Factory resources. In this blog, we share why CI/CD is important for ETL and how it can be implemented when working with Azure Data Factory and Github Actions. Code examples are provided 🎊.

Note: this article assumes a basic understanding of Azure Data Factory.

Why do we need CI/CD for our ETL jobs?

Extract, Transform and Load (ETL) is a process for extracting and integrating data from multiple sources into a single datastore for analysis. The pipelines we build for ETL are at the core of many analytics solutions as they automate the data flow which fuels dashboards and ML models. Therefore, it is important that ETL is tested and deployed properly to prevent introducing unwanted errors impacting downstream data products.

One way to achieve this is to set up different environments for testing and production where we can safely play around with test or production data without impacting the end-users of our data products. CI/CD involves the implementation of automated processes to test and deploy our solutions to these environments.

Problem

Let’s consider a (simplified) scenario that reflects the problem at our customer:

  • We have an Azure Data Factory (ADF) workspace which contains a pipeline that retrieves cocktail recipes using TheCocktailDB API🍹
  • The pipeline stores the recipes in an Azure Blob Storage
  • Authentication with the API is done using a key which is stored in an Azure Key Vault 🔒 (see the linked service example below)
The pipeline retrieves the API key from the Key Vault and copies a recipe to an Azure Blob Storage
  • For each of these resources (ADF, Blob Storage, Key Vault) we have both a development and a production version. These are nested in corresponding resource groups in Azure (rsg-demo-dev and rsg-demo-prd). Note that the CocktailDB API works with separate test and production keys.
  • A Github Repository is connected to the development ADF and stores all the definitions for the cocktail recipes pipeline.
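For illustration, the Key Vault connection is stored in the repository as an ADF linked service definition in the linkedService folder. Below is a minimal sketch of what such a file could look like; the name ls_keyvault and the vault URL are hypothetical placeholders, not taken from the actual repository:

{
  "name": "ls_keyvault",
  "properties": {
    "type": "AzureKeyVault",
    "typeProperties": {
      "baseUrl": "https://kvl-demo-dev.vault.azure.net/"
    }
  }
}

It is exactly this kind of environment-specific reference (kvl-demo-dev) that the deployment workflow described later rewrites when deploying to production.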

In the remainder of this article we present a solution and implementation that uses Github Actions to automatically deploy changes from the development ADF to the production ADF 🚀.

Azure resources used in our scenario

Solution

After the implementation of our solution, the workflow for development in our ADF is as shown below. The Github Actions workflows ensure that deployment to the live ADFs is done smoothly and safely. Next, we describe the details of the implementation and which additional validation checks are performed.

Workflow for development in ADF after implementing our solution.

Note: the proposed solution applies to the above scenario but can easily be adapted to solutions with different workflows or a different number of ADF workspaces.

Implementation

Repository

The Github Actions workflows required for our solution are implemented in the same repository that stores the ADF resources. The repository we use can be found here: [insert link here]. The structure is as follows:

.
├── .github
│   └── workflows
│       ├── deploy_adf.yml       # reusable workflow to deploy to an ADF environment
│       ├── merge_to_main.yml    # triggers deployments on merge to main
│       └── validate_adf.yml     # validates resources on pull request
├── dataset                      # contains ADF datasets
├── linkedService                # contains ADF linked services
├── pipeline                     # contains ADF pipelines
├── trigger                      # contains ADF triggers
├── package.json                 # npm package configuration
├── publish_config.json          # configuration file for ADF
├── env_mappings.json            # parameter mappings dev -> prd
└── README.md
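The file env_mappings.json contains the mappings used to rewrite dev references to their prd counterparts during deployment. It is a flat key-value object in which each key is a dev reference and each value its prd equivalent. A minimal sketch based on the resource names in our scenario (your own mappings will differ):

{
  "rsg-demo-dev": "rsg-demo-prd",
  "adf-aurai-dev": "adf-aurai-prd",
  "kvl-demo-dev": "kvl-demo-prd"
}

The deploy workflow shown later loops over these entries with jq and applies a global replace on the exported ARM template files.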

Prerequisites

  • To connect Github and Azure, we need to create an enterprise application with a corresponding Service Principal (SPN) that has the right access, as described here.
  • We want to avoid hard-coding sensitive information, so we store the client_id of our SPN, the subscription_id and the tenant_id as secrets in our Github repository (see the sketch after this list).
  • Make sure the main branch of your Github repository is a protected branch.
  • Optionally, restrict the access of users to the production ADF using RBAC in Azure. This prevents users from working directly in the production environment.
Azure specific references are stored as secrets in our Github Repository
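As an illustration of the second prerequisite, the secrets can be added with the Github CLI. A minimal sketch, where the secret names match the ones referenced in the workflows below and the values are placeholders:

# Store the Azure references as repository secrets using the Github CLI (values are placeholders)
gh secret set AZURE_TENANT_ID --body "<tenant-id>"
gh secret set AZURE_SUBSCRIPTION_ID --body "<subscription-id>"
gh secret set AZURE_DEV_CLIENT_ID --body "<client-id-of-dev-spn>"
gh secret set AZURE_PRD_CLIENT_ID --body "<client-id-of-prd-spn>"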

Validating ADF templates before merging to main

The workflow defined in validate_adf.yml prevents merging incorrect ADF resources to the main branch by making it a required status check. It uses the data-factory-validate-action to check the JSON-formatted ADF resources in the repository, similar to clicking Validate all in the ADF UI. The YAML file looks as follows:

name: ValidateADF

on:
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:

jobs:
  validate_adf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Validate Data Factory resources
        uses: Azure/data-factory-validate-action@v1.1.5
        with:
          path: ./
          id: /subscriptions/${{ secrets.AZURE_SUBSCRIPTION_ID }}/resourceGroups/rsg-demo-dev/providers/Microsoft.DataFactory/factories/adf-aurai-dev
The validate_adf workflow can be set as a required status check to prevent invalid ADF templates from being merged.
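Because the workflow also defines a workflow_dispatch trigger, it can be started manually as well, for example from a feature branch using the Github CLI. A hypothetical invocation (the branch name is a placeholder):

# Manually run the validation workflow against a feature branch
gh workflow run validate_adf.yml --ref feature/my-cocktail-pipeline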

Automating ADF resources deployment

Deployment of ADF resources is equivalent to using the publish button in the ADF UI to deploy changes to your ADF workspace (e.g. updating a trigger schedule). We want to be able to deploy both to dev and prd. Therefore, we use a reusable workflow deploy_adf.yml which can be called by other workflows and looks as follows:

name: Deploy ADF
run-name: Deploy ${{ inputs.data_factory }}

on:
  workflow_call:
    inputs:
      resource_group:
        description: 'Name of resource group that contains the ADF'
        required: true
        default: 'rsg-demo-dev'
        type: string
      data_factory:
        description: 'Name of ADF'
        required: true
        default: 'adf-aurai-dev'
        type: string
      pause_triggers:
        description: 'Pause all triggers before publishing.'
        required: false
        default: false
        type: boolean
      overwrite_references:
        description: 'Use env_mappings.json to overwrite environment specific references.'
        required: false
        default: false
        type: boolean
    secrets:
      client_id:
        description: 'Client ID of the SPN used to login to azure using OIDC'
        required: true
      tenant_id:
        description: 'Azure Active Directory Tenant ID'
        required: true
      subscription_id:
        description: 'Subscription ID of the subscription containing the ADF'
        required: true

permissions:
  id-token: write
  contents: read

jobs:
  deploy_adf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Pause all triggers
        if: ${{ inputs.pause_triggers }}
        run: |
          echo "Pausing all triggers"
          for file in ./trigger/*
          do
            jq '.properties.runtimeState="Stopped"' $file | tee trigger/temp.json >/dev/null
            mv trigger/temp.json $file
            echo "Set trigger runtimeState to 'Stopped' for $file"
          done
      - name: Export ARM Template
        id: export
        uses: Azure/data-factory-export-action@v1.0.2
        with:
          path: ./
          id: /subscriptions/${{ secrets.subscription_id }}/resourceGroups/${{ inputs.resource_group }}/providers/Microsoft.DataFactory/factories/${{ inputs.data_factory }}
      - name: Overwrite references in ARM Template
        if: ${{ inputs.overwrite_references }}
        run: |
          echo "Start overwriting references in ARMTemplateForFactory.json and ARMTemplateParametersForFactory.json using the mappings in env_mappings.json"
          jq -c 'to_entries[]' env_mappings.json | while read line; do
            old=$( jq -r '.key' <<< "$line" )
            new=$( jq -r '.value' <<< "$line" )
            echo "Replace $old by $new"
            sed -i "s|$old|$new|g" ${{ steps.export.outputs.arm-template-directory }}/ARMTemplateForFactory.json
            sed -i "s|$old|$new|g" ${{ steps.export.outputs.arm-template-directory }}/ARMTemplateParametersForFactory.json
          done
      - name: Publish ARM template
        uses: actions/upload-artifact@v2
        with:
          name: adf-artifact
          path: ${{ steps.export.outputs.arm-template-directory }}
          if-no-files-found: error
      - name: Copy ARM Template to root
        run: cp -a armTemplate/. .
      - name: AZ OIDC login
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.client_id }}
          tenant-id: ${{ secrets.tenant_id }}
          subscription-id: ${{ secrets.subscription_id }}
          enable-AzPSSession: true
      - name: Add datafactory extension to AZ CLI
        run: az extension add --name datafactory
      - name: Deploy resources
        uses: Azure/data-factory-deploy-action@v1.2.0
        with:
          resourceGroupName: ${{ inputs.resource_group }}
          dataFactoryName: ${{ inputs.data_factory }}
          armTemplateFile: ARMTemplateForFactory.json

The workflow calls a series of Github Actions to export the ARM template for the ADF resources, log in to Azure and deploy the exported ARM template to ADF. The workflow takes the following parameters:

  1. resource_group: Name of the Azure resource group containing the ADF.
  2. data_factory: Name of the ADF workspace.
  3. pause_triggers: If true, all triggers are paused, thereby pausing all pipeline schedules in the ADF workspace. We use this to pause all dev pipelines to save costs.
  4. overwrite_references: Whether dev references should be replaced with prd references. For example, in our case we want our dev ADF to connect to our dev key vault (kvl-demo-dev), whereas we want to connect to our production key vault in prd (kvl-demo-prd). By adding this mapping to env_mappings.json and setting this input to true, we can globally replace all occurrences of kvl-demo-dev with kvl-demo-prd (see the sketch after this list).
  5. In addition, the workflow takes the secret inputs client_id (client ID of the SPN used to log in to Azure using OIDC), tenant_id (ID of the Azure tenant) and subscription_id (ID of the subscription containing the ADF).
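To see what the overwrite_references step will do before it runs in CI, you can replay the same jq/sed logic locally on an exported ARM template. A rough sketch, assuming the template was exported to a local armTemplate/ folder (the directory name is a placeholder):

#!/usr/bin/env bash
# Replay the reference overwrite locally (assumes an exported template in ./armTemplate)
set -euo pipefail

TEMPLATE_DIR="armTemplate"  # hypothetical local export directory

jq -c 'to_entries[]' env_mappings.json | while read -r line; do
  old=$(jq -r '.key' <<< "$line")
  new=$(jq -r '.value' <<< "$line")
  echo "Replace $old by $new"
  sed -i "s|$old|$new|g" "$TEMPLATE_DIR/ARMTemplateForFactory.json"
  sed -i "s|$old|$new|g" "$TEMPLATE_DIR/ARMTemplateParametersForFactory.json"
done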

Kicking off the ADF deployments

Finally, the workflow merge_to_main.yml triggers when we push to the main branch (i.e. by merging a pull request). The script, shown below, implements the following steps:

  1. Trigger the deploy_adf.yml workflow for the dev environment while pausing all triggers (since we don’t need triggers running constantly in dev).
  2. If deployment to dev is successful, trigger the deploy_adf.yml workflow for production. For production, dev parameters are replaced by their prd equivalent.

For both steps it passes the corresponding inputs and the relevant secrets, which are stored in our Github repository.

name: MergeToMain

on:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  deploy_dev:
    name: DeployToDev
    uses: ./.github/workflows/deploy_adf.yml
    with:
      resource_group: 'rsg-demo-dev'
      data_factory: 'adf-aurai-dev'
      pause_triggers: true
      overwrite_references: false
    secrets:
      client_id: ${{ secrets.AZURE_DEV_CLIENT_ID }}
      tenant_id: ${{ secrets.AZURE_TENANT_ID }}
      subscription_id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  deploy_prd:
    needs: deploy_dev
    name: DeployToPrd
    uses: ./.github/workflows/deploy_adf.yml
    with:
      resource_group: 'rsg-demo-prd'
      data_factory: 'adf-aurai-prd'
      pause_triggers: false
      overwrite_references: true
    secrets:
      client_id: ${{ secrets.AZURE_PRD_CLIENT_ID }}
      tenant_id: ${{ secrets.AZURE_TENANT_ID }}
      subscription_id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
Merging to main triggers deployment to development and subsequently to production.
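After merging a pull request, the resulting runs can also be followed from the command line, for example with the Github CLI (a hypothetical invocation):

# List recent runs of the deployment workflow and follow one interactively
gh run list --workflow merge_to_main.yml --limit 5
gh run watch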

Conclusion

By implementing the above Github Actions workflows we now have a deployment pipeline for our Azure Data Factory resources. The solution enables us to test and develop pipelines in our development environment. Resources are automatically validated before changes can be merged, and deployments to the development and production environments automatically update environment-specific references.

This solution provides data engineers with a robust way to safely and efficiently deploy changes to Azure Data Factory resources, thereby saving time and preventing potential errors in downstream data products 🚀.

Aurai provides custom data solutions that help companies gain insights into their data. We engineer your company’s future through simplifying, organizing and automating data. Your time is maximized by receiving the automated knowledge effortlessly and enacting better processes on a foundation of relevant, reliable, and durable information. Interested in what Aurai can mean for your organisation? Don’t hesitate to contact us!
