Data Factory 101: A Guide to the Key Concepts with an Example

Supradeep
7 min read · Nov 13, 2023

Azure Data Factory is Azure’s cloud ETL service for scale-out, serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and an all-in-one dashboard for monitoring and management.

We will look into the following features offered by Data Factory.

Linking code to a code repository: Data Factory is a SaaS offering from Azure, but the code behind the workspace can still be hosted in code repositories such as GitHub or Azure DevOps.

Some of the important features of Data Factory to start with:

Integration Runtimes act as a bridge between activities and linked services within Azure Data Factory pipelines.
— They provide the necessary compute environment for executing data integration tasks.
— Integration Runtimes allow you to perform actions close to the target data store or compute service, optimizing performance while adhering to security and compliance requirements. Below are the integration runtime types:

- Azure Integration Runtime — Supports connecting to data stores and compute services with publicly accessible endpoints (and to private networks when a managed virtual network is enabled). It covers Data Flow execution, data movement, and activity dispatch.
- Self-hosted Integration Runtime — Designed for scenarios where you need to integrate with resources in your own network (on-premises or on virtual machines). It handles data movement and activity dispatch.
- Azure-SSIS Integration Runtime — Specifically for executing SSIS packages within an Azure-managed environment.
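
As a quick sketch of how these pieces fit together, a self-hosted integration runtime is just another factory resource, and a linked service opts into it through a connectVia reference. The names below (SelfHostedIR, OnPremSqlServer) are invented for illustration:

```json
{
  "name": "SelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runs on a machine inside our own network"
  }
}
```

```json
{
  "name": "OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    },
    "typeProperties": {
      "connectionString": "Server=<server>;Database=<database>;Integrated Security=True"
    }
  }
}
```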

Linked Services are a way to connect to external data sources and destinations in Azure Data Factory. They are used to define the connection properties and credentials required to connect to the data source or destination. Once a Linked Service is created, it can be used in a pipeline to read from or write to the data source or destination. For example, you can create a Linked Service for an Azure SQL Database, and then use that Linked Service in a pipeline to copy data from the database to an Azure Blob Storage account.
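
For instance, the linked service for that Azure SQL Database is a small JSON definition holding the connection details; the connection string below is a placeholder, and in practice you would usually pull the secret from Azure Key Vault:

```json
{
  "name": "AzureSqlDatabaseLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "description": "Connection to the reporting database",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True"
    }
  }
}
```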

Triggers are used to schedule the execution of a pipeline in Azure Data Factory. They can be used to run a pipeline on a specific schedule, or to trigger the pipeline based on an event. For example, you can create a trigger that runs a pipeline every day at 12:00 PM, or a trigger that runs a pipeline when a new file is added to an Azure Blob Storage account. You can also configure triggers to pass parameters to the pipeline, which can be used to control the behavior of the pipeline at runtime. To create a trigger, you need to define trigger properties such as the start time, end time, recurrence, and event-based conditions.
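
As a rough sketch, a schedule trigger that runs a pipeline every day at 12:00 PM and passes it a folder name could look like this (the pipeline and parameter names are made up for the example):

```json
{
  "name": "DailyNoonTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2023-11-13T12:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyFilesPipeline",
          "type": "PipelineReference"
        },
        "parameters": {
          "outputFolder": "daily-loads"
        }
      }
    ]
  }
}
```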

Global parameters are a way to define and manage values that can be used across multiple pipelines in Azure Data Factory. They store values that are common to multiple pipelines, such as connection strings, file paths, and other configuration settings. By using global parameters, you avoid duplicating the same values across multiple pipelines, which makes them easier to manage and update.

For example, you can create a global parameter called sqlConnectionString that stores the connection string for an Azure SQL Database. You can then use this parameter in multiple data flow pipelines to read from or write to the database. If you need to update the connection string, you can simply update the global parameter, and all pipelines that use the parameter will automatically use the updated value.

To create a global parameter, you need to define properties such as its name, data type, and value. Once a global parameter is created, it can be used in any pipeline within the same Azure Data Factory instance.
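
Global parameters are defined at the factory level and read in pipeline expressions through pipeline().globalParameters. A minimal sketch, assuming a parameter named sqlConnectionString, might look like the factory-level snippet and the Set Variable activity below:

```json
{
  "globalParameters": {
    "sqlConnectionString": {
      "type": "string",
      "value": "Server=tcp:<server>.database.windows.net,1433;Database=<database>"
    }
  }
}
```

```json
{
  "name": "ReadConnectionString",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "connString",
    "value": "@pipeline().globalParameters.sqlConnectionString"
  }
}
```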

Pipelines: A pipeline in Azure Data Factory is a logical grouping of activities that together perform a task. A pipeline can contain one or more activities, which define the actions to perform on your data. For example, you might use a copy activity to copy data from SQL Server to Azure Blob Storage, and then use a data flow activity to process and transform the data in Blob Storage, on top of which business intelligence reporting solutions are built.
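
A sketch of such a pipeline, with a Copy activity followed by a data flow activity, could look like this (the dataset, data flow, and activity names are invented for the example):

```json
{
  "name": "SqlToBlobPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSqlToBlob",
        "type": "Copy",
        "inputs": [ { "referenceName": "SqlSourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingBlobDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      },
      {
        "name": "TransformForReporting",
        "type": "ExecuteDataFlow",
        "dependsOn": [
          { "activity": "CopyFromSqlToBlob", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "dataFlow": { "referenceName": "ReportingDataFlow", "type": "DataFlowReference" }
        }
      }
    ]
  }
}
```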

Datasets: A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob Storage from which the activity should read the data.

Before you create a dataset, you must create a linked service to link your data store to the service. Think of it this way: the dataset represents the structure of the data within the linked data store, and the linked service defines the connection to the data source.

Datasets can be created using either the UI or JSON. The JSON definition of a dataset includes details such as its name, description, linked service reference, schema, and type-specific properties like the file location and format.
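
For example, a delimited-text dataset pointing at a CSV file in Blob Storage could be defined roughly like this (the linked service, container, and file names are placeholders):

```json
{
  "name": "MoviesCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "description": "CSV file holding movie titles and ratings",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "folderPath": "movies",
        "fileName": "movies.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```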

Triggers: In Azure Data Factory, triggers are used to schedule pipeline runs. A trigger is a processing unit that determines when to begin or invoke an end-to-end pipeline execution in Azure Data Factory.

There are three main types of triggers in Azure Data Factory:

  • Schedule trigger: Executes the pipeline on a wall-clock schedule.
  • Tumbling window trigger: Executes the pipeline on a periodic interval and retains the pipeline state.
  • Event-based trigger: Responds to a blob-related event, as sketched below.
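
For example, an event-based trigger that fires whenever a new CSV file lands in a container could be sketched as follows (the storage account resource ID and pipeline name are placeholders):

```json
{
  "name": "NewCsvFileTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "blobPathBeginsWith": "/input/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyFilesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```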

Example: Here’s a simple scenario to start with:

We will copy a file from one container folder to another using the Copy activity. This will give us an overall idea of how to work with Data Factory.

The first thing to create is the linked service, which gives Data Factory access to other Azure services to read and write data.

Here we are connecting the Data Factory to the storage account where the files are stored.
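
A minimal sketch of that storage linked service, assuming account-key authentication (the account name and key are placeholders):

```json
{
  "name": "AzureBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "description": "Storage account holding the source and destination containers",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
    }
  }
}
```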

The next step is to create datasets that define the format of the data and the path associated with it: one dataset to use as the source and another as the sink. Note that, since we are transferring data within the same storage account in this example, both datasets use the same storage account linked service. There is also an option to use dynamic names passed from the pipeline, as shown in the sketch below.
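
As a sketch of that dynamic-name option, the sink dataset can declare its own parameter and use it in an expression for the file name (the dataset and parameter names here are invented):

```json
{
  "name": "SinkBlobDataset",
  "properties": {
    "type": "DelimitedText",
    "parameters": {
      "outputFileName": { "type": "string", "defaultValue": "output.csv" }
    },
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "output",
        "folderPath": "copied",
        "fileName": {
          "value": "@dataset().outputFileName",
          "type": "Expression"
        }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```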

Now, add a new pipeline, search for the Copy data activity in the activities column, and drag it onto the canvas. Here, we will use the source dataset created in the previous step. Similarly, use the sink dataset to set up the output path.

Clicking the Add trigger button gives the option to create a new trigger, which will run the pipeline at a preset time and day.

To run the pipeline, click the Debug button. If an activity runtime is set up, it will prompt you to choose between the activity runtime and the debug runtime. If private endpoints are set up and used inside an enterprise cloud, the activity runtime is the more suitable option; otherwise, the debug runtime works for most cases.

Alternatively, the Monitor section lets you watch current and historical debug/triggered runs, as shown below.

Data flow is used to modify the source data as per the requirements.

Let’s add a source file containing the data we need to transform.

After this, let's add a Select schema modifier to keep only the title and rating columns from the source data. In this step, there is also an option to rename columns.

We will then add a Filter transformation to keep only rows with a rating above 8.

Add the filter condition and click Refresh.

Add the sink with the folder path, and set the file name option to write to a single file by providing the file name in the text box.
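
Pulled together, the mapping data flow behind these steps is stored as JSON with an embedded script. A rough, abbreviated sketch (the names and scriptLines are paraphrased, and the single-file sink settings are omitted for brevity) might look like this:

```json
{
  "name": "FilterTopRatedMovies",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "moviesSource", "dataset": { "referenceName": "MoviesCsvDataset", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "topRatedSink", "dataset": { "referenceName": "SinkBlobDataset", "type": "DatasetReference" } }
      ],
      "transformations": [
        { "name": "selectColumns" },
        { "name": "filterTopRated" }
      ],
      "scriptLines": [
        "source(allowSchemaDrift: true, validateSchema: false) ~> moviesSource",
        "moviesSource select(mapColumn(title, rating)) ~> selectColumns",
        "selectColumns filter(toInteger(rating) > 8) ~> filterTopRated",
        "filterTopRated sink(allowSchemaDrift: true, validateSchema: false) ~> topRatedSink"
      ]
    }
  }
}
```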

Add this data flow to the pipeline we created earlier to copy the files.

Now click the Add trigger button to run the pipeline on the integration runtime, and we can see the status in the Monitor section.

On the left is the Monitor view showing the activity completed successfully, and on the right is the file created as a result; we can see it contains only records with a rating above 8.

In conclusion, Azure Data Factory is a powerful tool that can help you streamline your data integration and processing workflows. By leveraging its metadata-driven pipelines, you can build dynamic ETL processes that handle both full and incremental loads with ease.
