Automating Data Orchestration with Azure Data Factory: A Comprehensive Guide

Jil Modi
Simform Engineering
5 min read · Dec 1, 2023

Streamlining Data Workflows: An Azure Data Factory Journey

In this blog post, we’ll explore Azure Data Factory, its capabilities, and its potential impact on data operations. We’ll focus on how to automate data orchestration with Azure Data Factory, working from a high-level diagram. Join us as we unravel the process, step by step, with insights into how Azure Data Factory can streamline data workflows and operations.

High-level diagram of data orchestration

Data Pipeline

The requirement is a daily process: files land in Azure Blob Storage, and the data in these files must be extracted and loaded into different tables.

This diagram captures the essence of that journey, showing the flow from file storage to loaded tables and turning an abstract requirement into a visual roadmap.

Prerequisites

  1. Active Azure Subscription: If you are signing up for Azure for the first time, a free trial subscription is available. This subscription provides access to Azure services and resources, including Blob Storage and Azure Data Factory.
  2. Azure Blob Storage Account: Create an Azure Blob Storage account within your Azure subscription. This account serves as the repository for storing various types of data, including the data files to be loaded.
  3. Azure Data Factory: Ensure access to Azure Data Factory, a cloud-based data integration service used for orchestrating and automating data movement and transformation workflows. It plays a significant role in this data-loading process.

Files are transferred from the source platform to Azure Blob Storage through a .NET API and organized there into distinct folders.

Create the final pipeline, which extracts data from Blob Storage and loads it into the designated target tables.

This pipeline is integrated with two additional pipelines: ‘execute pipeline1,’ which retrieves the child items within each folder and populates the staging tables, and ‘Bradlee_sp_calling,’ which executes the stored procedures that migrate the data.
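As a reference point, here is a minimal sketch of what the orchestrator pipeline’s JSON could look like. The pipeline name ‘blob_to_tables_orchestrator’ and the referenced child pipeline name ‘pipeline1’ are placeholders, and the dependency that runs ‘Bradlee_sp_calling’ only after ‘execute pipeline1’ succeeds is an assumption based on the flow described above.

{
  "name": "blob_to_tables_orchestrator",
  "properties": {
    "activities": [
      {
        "name": "execute pipeline1",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "pipeline1", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "Bradlee_sp_calling",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "execute pipeline1", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "Bradlee_sp_calling", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}

Setting waitOnCompletion to true makes the orchestrator wait for each child pipeline to finish, so the stored-procedure step only runs against fully loaded staging tables.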

The ‘Get Metadata’ activity retrieves folder names from Blob Storage. The ‘For Each’ loop iterates through these folders, extracting individual folder names. Subsequently, a child pipeline is invoked to retrieve files from specific folders.

To collect the child items, we specify ‘Child items’ in the ‘Get Metadata’ activity’s field list, enabling the retrieval of all folder names within the Blob Storage structure.
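In JSON form, the relevant entries in the pipeline’s activities array could look roughly like the sketch below; the dataset name ‘blob_root_folder’ is a placeholder for a dataset pointing at the container root.

{
  "name": "Get Metadata1",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "blob_root_folder", "type": "DatasetReference" },
    "fieldList": [ "childItems" ]
  }
},
{
  "name": "ForEach1",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
    "activities": []
  }
}

Inside the loop, @item().name resolves to the current folder name.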

Within the ‘For Each’ activity, the next step is to invoke a child pipeline responsible for fetching files. The child pipeline follows the same pattern, using a ‘Get Metadata’ activity and a ‘For Each’ loop, this time to retrieve the files inside each folder.

Within this child pipeline, I’ve established a parameter called ‘parent_folder,’ populated with the folder name passed in from the first pipeline. This parameter plays a crucial role: it is used within the child pipeline’s ‘For Each’ loop, ensuring seamless integration and continuity of the process.
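A minimal sketch of that parameter declaration is shown below; the child pipeline name ‘child_pipeline_fetch_files’ is a placeholder.

{
  "name": "child_pipeline_fetch_files",
  "properties": {
    "parameters": {
      "parent_folder": { "type": "string" }
    },
    "activities": []
  }
}

Anywhere the folder path is needed, the value is referenced as @pipeline().parameters.parent_folder, exactly as the path-update expression later in this post does.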

Within the ‘For Each’ loop, I’ve integrated an ‘If Condition’ activity to discern the item type. When the item is a file, the subsequent activities are triggered; otherwise, no further action is taken.
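The check itself can be written against the current loop item; ‘Get Metadata’ reports each child item’s type as either ‘File’ or ‘Folder,’ so a minimal condition expression looks like this:

@equals(item().type, 'File')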

Within the ‘If Condition’ activity’s true branch, I added the following activities:

The ‘Copy_data_from_files_to_table’ activity transfers data from the files into the staging tables. Once a file has been processed, the ‘moving_files_to_archieve’ activity relocates it to the ‘archive’ folder within a specific directory in Blob Storage.

Following this, the ‘update_path_in_api_log_table’ lookup activity updates the file’s new path in the log table for accurate tracking and logging. Finally, the ‘delete_source_files’ activity removes the file from its initial location in the primary storage, concluding this data processing workflow.
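A trimmed sketch of the copy step is shown below. The dataset names ‘blob_source_file’ (a parameterized delimited-text dataset, sketched in the next section) and ‘staging_table’ are placeholders, and the PostgreSQL sink type is an assumption based on the ‘public’ schema used in the path-update query later on.

{
  "name": "Copy_data_from_files_to_table",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "blob_source_file",
      "type": "DatasetReference",
      "parameters": {
        "folder_path": { "value": "@pipeline().parameters.parent_folder", "type": "Expression" },
        "file_name": { "value": "@item().name", "type": "Expression" }
      }
    }
  ],
  "outputs": [
    { "referenceName": "staging_table", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": { "type": "AzureBlobStorageReadSettings" },
      "formatSettings": { "type": "DelimitedTextReadSettings" }
    },
    "sink": { "type": "AzurePostgreSqlSink" }
  }
}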

Three connections back these activities:

  1. For connecting to the source
  2. For connecting to the target
  3. For moving files to the archive folder location
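As an illustration of the source connection, a parameterized delimited-text dataset over Blob Storage might look like the sketch below. The dataset and linked service names are placeholders, the ‘seed-bradlee’ container comes from the storage URL used in the path-update expression further down, and the delimiter and header settings would need to match the actual files.

{
  "name": "blob_source_file",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "AzureBlobStorage_ls", "type": "LinkedServiceReference" },
    "parameters": {
      "folder_path": { "type": "string" },
      "file_name": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "seed-bradlee",
        "folderPath": { "value": "@dataset().folder_path", "type": "Expression" },
        "fileName": { "value": "@dataset().file_name", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}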

The ‘update_path_in_api_log_table’ lookup activity runs the expression below to record the file’s new, archived path in the log table. The trailing ‘select 1 as dummy;’ gives the lookup a result set to return, since the Lookup activity is designed for queries rather than plain UPDATE statements.

@concat(
    'update public.seed_api_log set file_url = ''https://stageblobchq.blob.core.windows.net/seed-bradlee/',
    pipeline().parameters.parent_folder,
    '/archive/',
    concat(substring(item().name, 0, lastIndexOf(item().name, '.')), '.csv'),
    ''' where file_url = ''https://stageblobchq.blob.core.windows.net/seed-bradlee/',
    pipeline().parameters.parent_folder,
    '/',
    item().name,
    ''';',
    'select 1 as dummy;')

When incorporating a child pipeline within a ‘For Each’ loop, it’s crucial to pass in the values derived from the metadata, specifically the folder names, so that files from the respective folders are processed exactly as the workflow requires.
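In JSON form, that parameter assignment on the ‘Execute Pipeline’ activity inside the loop could look like this (the activity and child pipeline names are the same placeholders used earlier):

{
  "name": "Execute child pipeline",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "child_pipeline_fetch_files", "type": "PipelineReference" },
    "waitOnCompletion": true,
    "parameters": {
      "parent_folder": { "value": "@item().name", "type": "Expression" }
    }
  }
}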

Having completed the migration from files to staging tables, the next phase is migrating data from the staging tables to the target tables. This pivotal step executes the stored procedures embedded within the ‘bradlee_sp_calling’ pipeline; within these procedures lie the migration scripts that transfer the data to our target tables.
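As one possible way to wire this up, assuming the database is PostgreSQL (suggested by the ‘public’ schema in the earlier query, but not confirmed), a Lookup activity can invoke each migration routine. The function name ‘migrate_staging_to_target’ and the dataset name ‘postgres_generic’ are hypothetical.

{
  "name": "call migration procedure",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzurePostgreSqlSource",
      "query": "select public.migrate_staging_to_target();"
    },
    "dataset": { "referenceName": "postgres_generic", "type": "DatasetReference" },
    "firstRowOnly": true
  }
}

If the target were Azure SQL Database or SQL Server instead, the native Stored Procedure activity would be the more direct choice.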

Now, let’s automate this pipeline using an Azure trigger. The first question that comes to mind might be: ‘How do I create a trigger in Azure?’ Don’t worry, I’ll explain that. In Azure Data Factory, you’ll find the ‘Manage’ icon. Click on it, then select ‘Triggers’ and click on ‘New.’

Create a new trigger:
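Since the requirement is a daily load, a schedule trigger is the natural fit. Below is a minimal sketch of such a trigger; the 6:00 AM UTC start time is an arbitrary example, and ‘blob_to_tables_orchestrator’ is the placeholder pipeline name used earlier.

{
  "name": "daily_blob_load_trigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2023-12-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "blob_to_tables_orchestrator", "type": "PipelineReference" }
      }
    ]
  }
}

Remember to publish the factory and start the trigger; triggers do not fire while they are in the ‘Stopped’ state.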

Congratulations, we have completed all the steps. We are now masters of data orchestration in Azure Data Factory! :-)

Conclusion

Azure Data Factory’s robust capabilities streamline data workflows, offering efficiency from storage to seamless data loading. Leverage its power to transform operations and foster innovation in data orchestration.

For more updates on the latest tools and technologies, follow the Simform Engineering blog.

Follow Us: Twitter | LinkedIn
