Azure Data Factory

Ayşegül Yiğit
Plumbers Of Data Science
Dec 19, 2022

Most analytics systems need a way to kick off large-scale data movement or run it on a regular schedule. One of the best services available to meet this need is Azure Data Factory (ADF). ADF is a cloud-based data integration service that manages the movement of data between different data stores and compute resources, as well as its transformation.

In summary, Azure Data Factory is a cloud-based ETL and data integration service that lets you create data-driven workflows to orchestrate data movement and transform data at scale.

In this series we will follow selected topics from Abdullah Kise’s course “DP-203T00: Data Engineering on Microsoft Azure”. I would also like to thank Abdullah Kise for his insightful advice.

Competencies of Azure Data Factory

1. Orchestration: The ability to give commands to other services.

As an analogy for ADF, consider a symphony orchestra. At the center of the orchestra stands the conductor. The conductor does not play an instrument but guides the members of the orchestra through the entire piece they are performing. The musicians use their skills to produce particular sounds at particular stages of the symphony, so each of them may learn only certain parts of the music. The conductor directs the whole piece and is therefore aware of the entire score being performed, and uses special arm gestures to tell the musicians how the piece should be played.

ADF takes a similar approach. Although it has native functionality to ingest and transform data, it often instructs another service to perform the actual work on its behalf, for example asking Databricks to execute a transformation query. In that case it is Databricks, not ADF, that does the work; ADF merely orchestrates the execution of the query and then provides the pipelines to move the data on to the next step or destination.
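
To make the conductor analogy concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK: a pipeline whose only activity asks a Databricks workspace to run a notebook. The notebook path, the linked service name AzureDatabricksLS and the resource placeholders are assumptions for illustration, not settings from this article.

```python
# Minimal sketch: ADF as conductor, Databricks as the musician doing the work.
# Assumes an existing data factory and a Databricks linked service named "AzureDatabricksLS".
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, DatabricksNotebookActivity, LinkedServiceReference
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run_notebook = DatabricksNotebookActivity(
    name="TransformWithDatabricks",
    notebook_path="/Shared/transform_auto_mpg",   # hypothetical notebook
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"
    ),
)

# ADF only orchestrates: the notebook itself executes on the Databricks cluster.
pipeline = PipelineResource(activities=[run_notebook])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "OrchestrateDatabricks", pipeline
)
```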

2. ETL: The ability to define data integration processes.

Data integration primarily involves collecting data from one or more sources. Optionally, it includes a process in which the data is cleaned and transformed, or enriched and prepared with additional data. Finally, the combined data is stored in a data platform service suited to the type of analysis you want to perform. Azure Data Factory can automate this process in a model known as Extract, Transform, and Load (ETL).
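
As a rough sketch of such an Extract, Transform, and Load chain, the example below links a copy step and a transformation step so the transformation only starts once the copy has succeeded. All names (the RawIn and StagedOut datasets, the notebook path, the Databricks linked service) are assumed placeholders; the snippet only builds the pipeline object to show the shape of the chain.

```python
# Sketch of an ETL-style chain: copy (extract/load to staging) -> transform.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatabricksNotebookActivity,
    DatasetReference, LinkedServiceReference, ActivityDependency,
    BlobSource, BlobSink,
)

extract = CopyActivity(
    name="ExtractToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawIn")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagedOut")],
    source=BlobSource(),
    sink=BlobSink(),
)

transform = DatabricksNotebookActivity(
    name="TransformStaged",
    notebook_path="/Shared/clean_and_enrich",      # hypothetical notebook
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"
    ),
    # Run only after the extract/copy step has succeeded.
    depends_on=[ActivityDependency(activity="ExtractToStaging",
                                   dependency_conditions=["Succeeded"])],
)

etl_pipeline = PipelineResource(activities=[extract, transform])
```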

3. Ability to repeat things according to rules.

You can create loops and conditional logic, for example ForEach, Until and If Condition constructs, and run logic inside them.
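
In SDK terms these constructs are control activities such as ForEachActivity, IfConditionActivity and UntilActivity. The sketch below only constructs a ForEach loop and an If condition, with placeholder expressions and a Wait activity standing in for real work.

```python
# Sketch of control-flow activities: a ForEach loop and an If condition.
from azure.mgmt.datafactory.models import (
    ForEachActivity, IfConditionActivity, WaitActivity, Expression,
    PipelineResource, ParameterSpecification,
)

# Loop over an array parameter of the pipeline, one item at a time.
loop = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@pipeline().parameters.fileNames"),
    is_sequential=True,
    activities=[WaitActivity(name="PretendWork", wait_time_in_seconds=1)],
)

# Branch on a boolean expression evaluated at run time (placeholder condition).
branch = IfConditionActivity(
    name="OnlyOnWeekdays",
    expression=Expression(value="@less(dayOfWeek(utcnow()), 6)"),
    if_true_activities=[WaitActivity(name="WeekdayPath", wait_time_in_seconds=1)],
)

pipeline = PipelineResource(
    parameters={"fileNames": ParameterSpecification(type="Array")},
    activities=[loop, branch],
)
```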

4. Ability to copy data.

To do this you need to define datasets. For example, when you want to copy data from a source to a destination, datasets are used to describe the format and location of the data on each side.
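
As a hedged sketch, a dataset describing a delimited text file on an assumed ADLS Gen2 linked service named AdlsGen2LS could look like this; the container, folder and formatting options are placeholders.

```python
# Sketch: a dataset describing the format and location of delimited text data.
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobFSLocation, LinkedServiceReference
)

csv_dataset = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AdlsGen2LS"),  # assumed connection
    location=AzureBlobFSLocation(file_system="landing",   # placeholder container
                                 folder_path="incoming",
                                 file_name="cars.csv"),
    column_delimiter=",",          # formatting options live on the dataset
    first_row_as_header=True,
))
```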

Azure Data Factory Components

• Pipelines can be created.

• Alerts and schedules can be set up for activities.

• Connection information can be defined with Linked Services.

Control Flow: It works much like the Control Flow level we know from SSIS. It is an orchestration of activities that includes chaining activities in a sequence and defining parameters or passing arguments at the Control Flow level.

Parameters: Parameters are defined at the Control Flow level. Arguments for the defined parameters are passed during execution, either from the run context created by a trigger or from a manually executed pipeline run. The activities inside the Control Flow then run according to the parameter values (a short sketch follows below).

Integration Runtime: Data Factory has an Integration Runtime, the compute infrastructure that bridges activities and linked service objects.
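
For example, here is a minimal sketch of a pipeline parameter and the argument passed for it at run time, using the azure-mgmt-datafactory Python SDK; the parameter name targetFolder, the Wait activity and the resource names are placeholder assumptions.

```python
# Sketch: define a pipeline parameter and pass an argument when the pipeline runs.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, WaitActivity
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

pipeline = PipelineResource(
    parameters={"targetFolder": ParameterSpecification(type="String",
                                                       default_value="landing")},
    # Activities reference the parameter with @pipeline().parameters.targetFolder
    activities=[WaitActivity(name="Placeholder", wait_time_in_seconds=1)],
)
adf_client.pipelines.create_or_update("<resource-group>", "<data-factory-name>",
                                      "ParamDemoPipeline", pipeline)

# The argument is supplied by the run context (trigger or manual run):
adf_client.pipelines.create_run("<resource-group>", "<data-factory-name>",
                                "ParamDemoPipeline",
                                parameters={"targetFolder": "landing/2022-12-19"})
```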

Copy Operation with Wizard in Azure Data Factory (Web to Disk)

Wizard

We will copy the data using the Copy Data tool in the user interface.

Source Data Store

We set our source to HTTP because we will transfer the data from the web to the disk.

New Connection

We paste the address of the data we will pull from the web into the Base URL field.

Link: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
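
For reference, the HTTP connection the wizard creates here could be defined through the Python SDK roughly as follows; the linked service name HttpAutoMpgLS and the resource placeholders are assumptions.

```python
# Sketch: an anonymous HTTP linked service pointing at the UCI archive base URL.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import LinkedServiceResource, HttpLinkedService

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

http_ls = LinkedServiceResource(properties=HttpLinkedService(
    url="https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/",
    authentication_type="Anonymous",
))
adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "HttpAutoMpgLS", http_ls
)
```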

Destination Data Store

We choose Azure Data Lake Storage Gen2 because that is where we created our target disk.

Source Connection

We define the connection information.
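
A hedged SDK sketch of the equivalent ADLS Gen2 linked service; the storage account URL and key are placeholders.

```python
# Sketch: a linked service for the Azure Data Lake Storage Gen2 destination.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import LinkedServiceResource, AzureBlobFSLinkedService

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

adls_ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://<storage-account>.dfs.core.windows.net",  # placeholder account
    account_key="<storage-account-key>",
))
adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "AdlsGen2LS", adls_ls
)
```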

Destination

We define where the destination dataset will be created.
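
In SDK terms, the wizard’s source and destination settings correspond to two datasets. The sketch below registers them as binary datasets so the file is copied as-is; the dataset names, container and folder are assumptions carried over from the earlier sketches.

```python
# Sketch: register the source (HTTP) and destination (ADLS Gen2) datasets.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, BinaryDataset, HttpServerLocation, AzureBlobFSLocation,
    LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Source: the auto-mpg.data file, relative to the HTTP linked service's base URL.
source_ds = DatasetResource(properties=BinaryDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="HttpAutoMpgLS"),
    location=HttpServerLocation(relative_url="auto-mpg.data"),
))
adf_client.datasets.create_or_update(rg, factory, "HttpSourceDS", source_ds)

# Destination: a container and folder in the data lake (names are placeholders).
sink_ds = DatasetResource(properties=BinaryDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AdlsGen2LS"),
    location=AzureBlobFSLocation(file_system="autompg", folder_path="raw",
                                 file_name="auto-mpg.data"),
))
adf_client.datasets.create_or_update(rg, factory, "AdlsSinkDS", sink_ds)
```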

Deployment

Deployment of the pipeline has been completed.
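
The pipeline the wizard deploys boils down to a single Copy activity between those two datasets. A hedged SDK equivalent (the wizard may also generate extra store and format settings that are omitted here):

```python
# Sketch: a pipeline with one Copy activity, equivalent to what the wizard deploys.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BinarySource, BinarySink
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

copy_web_to_lake = CopyActivity(
    name="CopyAutoMpgToLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="HttpSourceDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="AdlsSinkDS")],
    source=BinarySource(),   # read the file as-is over HTTP
    sink=BinarySink(),       # write it unchanged into ADLS Gen2
)

adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "CopyPipeline_WebToDisk",
    PipelineResource(activities=[copy_web_to_lake]),
)
```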

Pipeline Tab

In the Pipeline tab we can see the pipeline created by the wizard above.

Trigger

We run the pipeline by clicking the Trigger button.
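
Programmatically, the same trigger is a create_run call followed by polling the run status; resource names are placeholders.

```python
# Sketch: trigger the pipeline and check the run status.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "CopyPipeline_WebToDisk", parameters={}
)

time.sleep(30)  # give the run a moment before checking
status = adf_client.pipeline_runs.get(
    "<resource-group>", "<data-factory-name>", run.run_id
)
print(f"Pipeline run {run.run_id}: {status.status}")
```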

In the destination you can see the container on our data lake disk, now holding the data we pulled from the web.
