Merge Multiple Datasets with Azure Data Factory

Ayşegül Yiğit
Plumbers Of Data Science
7 min read · Jan 4, 2023

Create data-driven workflows to orchestrate data movement and transform data at scale with Azure Data Factory (ADF), a cloud-based ETL and data integration service.

In short, Azure Data Factory lets you move data between different compute resources and data stores. You can build and schedule data-driven workflows that ingest data from a wide range of data stores, and you can use compute services such as Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Azure SQL Database to build complex ETL processes that transform data visually.

In this article, after a brief overview of Azure Data Factory, I will walk through how a join is performed on the platform.

Azure Data Factory Components

Azure Data Factory is made up of components that let you build workflows for copying, ingesting, and transforming data. Pipelines group one or more activities; linked services define the connections to your data stores and compute services; and datasets represent the data that those activities read and write. You can also add triggers to run pipelines automatically on a schedule or in response to changing conditions.
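
To see how these pieces fit together, here is a minimal sketch of the JSON that ADF generates behind the authoring UI for a pipeline with a single copy activity. The names ExamplePipeline, CopyProducts, ProductDataset, and StagingDataset are hypothetical placeholders, not part of this article's scenario:

```json
{
  "name": "ExamplePipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyProducts",
        "type": "Copy",
        "inputs":  [ { "referenceName": "ProductDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

Datasets, linked services, and triggers are separate JSON resources that the pipeline only references by name, which is why you can define them once and reuse them across pipelines.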

To get a better feel for the Azure Data Factory environment, the image below maps the SSIS components familiar from traditional ETL work to their ADF counterparts.

ADF vs. SSIS

Using an Inner Join

With that general introduction out of the way, let's move on to an example of the join operation on the platform.

1) Creating the Data Flow

To create a Mapping Data Flow, right-click the Data flows section and choose "New data flow".

Creating Data Flow

2) Naming the Data Flow

After you create the flow, the Properties pane opens on the right side of the screen. Name the data flow there (ProductDF).
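
At this point the data flow exists but is still empty. Its definition, which you can inspect from the authoring UI, is roughly the skeleton below; depending on the service version you may see a single script string instead of scriptLines:

```json
{
  "name": "ProductDF",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [],
      "sinks": [],
      "transformations": [],
      "scriptLines": []
    }
  }
}
```

The sources, join, and sink we add in the next steps all end up in this definition.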

Data Flow Naming Stage

3) Adding the Source

Add your source by clicking the "Add Source" box on the data flow canvas.

Add Source

4) Naming the Source and Defining the Dataset

After naming the source (Product) in the "Output stream name" field, move on to defining the dataset.

Click the "New" button next to the Dataset field to point the source at the file (Product.csv) that you previously uploaded to the storage container.

Choosing Storage Blob

After clicking "New", specify where the source data lives on the right side of the screen. In this scenario, the Product.csv file is stored in Azure Blob Storage.

Note: You can also point to datasets defined in other stores, such as Azure Cosmos DB and Azure SQL Database.

Click the Azure Blob Storage tile and select the format of the dataset. Since our data is a CSV file, we choose "DelimitedText" and continue.

Selecting the Dataset Format

In the Set properties pane, give the dataset a name and select the previously defined linked service.

As the last step, click the browse icon to the right of the "File path" field to specify where the file is located.

Since the file (Product.csv) lives in the alvedonustur container, we continue by selecting that container.

This brings us to the location of the file (Product.csv). Select the file and confirm. You can then preview it in the "Data preview" tab.
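
For reference, the resulting dataset definition looks roughly like the sketch below. The dataset name ProductCsv and the linked service name AzureBlobStorageLS are placeholders for whatever names you chose, and the delimiter and header settings are the usual defaults for a CSV file:

```json
{
  "name": "ProductCsv",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "alvedonustur",
        "fileName": "Product.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}
```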

Data Preview

Define ProductCategory, the second source to be joined, in the same way we defined the Product source, and then move on to the join.

5) Join Operation

First, click the + sign at the bottom right of the Product source to choose the operation you want to perform. To combine the Product and ProductCategory sources, select Join.

Join Task

The Left stream field is filled in by default because we added the join from the Product source. In the Right stream field, select ProductCategory, the other source we want to join.

Join Settings

In the join settings, choose Inner as the join type and define ProductCategoryID, the column common to both sources, as the join condition.

Join Conditions
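
Under the hood, this join is captured in the data flow's script. A rough sketch of the relevant scriptLines (an excerpt of the data flow's typeProperties) is shown below; the transformation name Join1 and the exact set of generated options will vary, and the designer only stream-qualifies column names (Product@…, ProductCategory@…) when they are ambiguous:

```json
{
  "scriptLines": [
    "source(allowSchemaDrift: true, validateSchema: false) ~> Product",
    "source(allowSchemaDrift: true, validateSchema: false) ~> ProductCategory",
    "Product, ProductCategory join(Product@ProductCategoryID == ProductCategory@ProductCategoryID,",
    "     joinType:'inner',",
    "     broadcast: 'auto') ~> Join1"
  ]
}
```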

In the Data preview tab, we can see that the two sources are inner joined on the ProductCategoryID field and the data comes through without any problems.

Data Preview

6) Creating the Destination (Sink)

To define where the output will be stored, add a Sink to the data flow by clicking the + icon at the bottom right of the Join transformation.

Sink

To choose the storage location, click the "New" button next to the Dataset field and specify the container path.

The steps here are the same as when we defined the source datasets, except for the last one. Because we want to write the output as a new CSV file, at the container step we only specify and confirm the alvedonustur container without selecting a file.

By default, Azure Data Factory names the new dataset DelimitedText5. If you want to rename it, click "Open" to the right of the Dataset field and change the name there.
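
The sink dataset's JSON is essentially the same as the source dataset's, except that its location points only at the container, with no file name. A rough sketch, again with AzureBlobStorageLS as a placeholder linked service name:

```json
{
  "name": "DelimitedText5",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "alvedonustur"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

Because no single file name is given, the underlying Spark job writes the output as one or more part files; the sink's Settings tab offers file name options (for example, output to a single file) if you need a fixed name.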

In the Mapping section, turn on Auto mapping so that the columns coming out of the join are matched to the sink's output columns.

Mapping
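
With Auto mapping switched on, the sink's script line carries no explicit column mapping; switching it off would add a mapColumn(...) list instead. A hedged sketch of the generated sink line, where the sink name sink1 and the exact options may differ:

```json
{
  "scriptLines": [
    "Join1 sink(allowSchemaDrift: true,",
    "     validateSchema: false,",
    "     skipDuplicateMapInputs: true,",
    "     skipDuplicateMapOutputs: true) ~> sink1"
  ]
}
```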

In the Data preview tab, we can see that the data flows through without any problems.

Data Preview

7) Creating the Pipeline

Right-click the Pipelines section, select "New pipeline", and give the pipeline a name (ProductJoinPipeline) in the Properties pane on the right side of the screen.

Pipeline

To run the ProductDF flow from the pipeline, drag the Data flow activity from the Move & transform group in the Activities pane onto the pipeline canvas. (Alternatively, you can drag the data flow we created in the previous step directly onto the canvas.)

Settings

To choose which data flow to run, select the data flow you created (ProductDF) in the Data flow field on the Settings tab.
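
The pipeline that results from these settings is, roughly, the JSON below. The activity name RunProductDF and the compute sizing are placeholders; the UI picks its own defaults:

```json
{
  "name": "ProductJoinPipeline",
  "properties": {
    "activities": [
      {
        "name": "RunProductDF",
        "type": "ExecuteDataFlow",
        "typeProperties": {
          "dataflow": {
            "referenceName": "ProductDF",
            "type": "DataFlowReference"
          },
          "compute": {
            "computeType": "General",
            "coreCount": 8
          }
        }
      }
    ]
  }
}
```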

8) Publishing and Running the Pipeline

Publish the pipeline by clicking the Publish button in the upper left corner of the screen.

Publish

Once publishing is complete, you can run the pipeline. Click the Debug button at the top of the screen; when the debug run finishes, the Output tab shows that the data flow ran successfully.

Debug

9) Previewing the Output Dataset

In the image below, you can see the joined output stored as a CSV file in the container.

To save the file locally, click the Download button at the top and download it in CSV format.

Excel
