Merge Multiple Datasets with Azure Data Factory
Create data-driven workflows to organize data movement and convert data at scale with Azure Data Factory (ADF), a cloud-based ETL data integration solution.
In conclusion, Azure Data Factory facilitates the transfer of data across different computational resources and data repositories. You can construct and schedule data-driven workflows and access data from various data storage. Additionally, you can use compute services like Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Azure SQL Datasets to build intricate ETL processes that graphically change data.
In this article, after giving general information about Azure Data Factory, I will explain how the join process progresses on the platform.
Azure Data Factory Components
Azure Data Factory consists of many components that allow you to create workflows such as copying, importing, and transforming data. You can create pipelines to execute one or more activities, access data sources or services through linked services, and then create a specific pipeline after creating a pipeline. We can add triggers to run our processes automatically at times or according to changing scenarios.
To better understand the Azure Data Factory environment, we can see the equivalent of SSIS components, which we are familiar with from ETL processes, compared to ADF components in the image below.
Using Inner Join
After making a general introduction to Azure Data Factory, let’s move on to the example of the “Join” operation on the platform.
1) Creating Data Flow
Right-click on the data flows section and create a new flow from the “new data flow” tab to create a Mapping Data Flow.
2) Data Flow Naming Stage
After creating a new flow, the “Properties” section will open on the right side of the screen. Complete the naming of the data flow (ProductDF) here.
3) Add Source
Get your source by clicking the add source field on the data flow level.
4) Resource Naming and Defining Dataset
After naming the source (Product) in the Output stream name field, proceed to the data set definition stage.
You can reach your source by clicking the “New” button next to the dataset section of the file (product.csv) that you previously published to the container field.
After clicking the “New” button, you need to specify your path on the right side of the screen, where you will get the resource. In this scenario, our dataset product.csv file is located in Azure Blob Storage.
Note: Azure Cosmos, Azure SQL Database, etc. You can also access the datasets that you have defined in different areas such as.
Click on the Azure Blob Storage area and select the format of the dataset. In our example, since our dataset is CSV, we select the “DelimitedText” field and continue.
After naming the Set Properties field, select the previously defined linked service field.
To define where the file is located, click on the file sign on the right side of the “file path” section as the last step.
Since the file (Product.csv) is in the alvedonustur container, we continue by clicking on the alvedonustur container.
The root of the file (Product.csv) has been accessed. At this stage, select the file and confirm. You can preview your file in the “Data Preview” tab.
We can create the ProductCategory file, which is the 2nd resource to be joined, as we defined our Product resource, and proceed to our join process.
5) Join Operation
First of all, you need to define what operation you will do by clicking the + sign on the bottom right of the Product resource created. To combine the Product and ProductCategory resources, we must choose to join.
The left stream field comes as default because we choose from the product source. In the Right Stream field, we need to select the ProductCategory field, which is another source we will join.
In the join conditions, we must specify that we want to inner join both sources and define the ProductCategoryID column common to both sources as the join condition.
In the data preview tab, we can see that the data is coming without any problems by inner joining according to the ProductCategoryID field in both sources.
6) Stage of Creating the Destination Task (Sink)
To determine in which area the output will be stored, the “sink” task should be added to the data flow area by clicking the + icon at the bottom right of the Join task.
To select the stored area, we must specify the path of the container after clicking the “new” button next to the dataset section.
The stage sources here are the same as the defined datasets, except for the last stage. Since we want to save it as a new CSV file when it comes to the container stage, we only need to specify and confirm the alvedonustur container without selecting any dataset.
By default, Azure DF defined the name of the resource as DelimitedText5. If you want to change its name, you can change the name of your source in the “open” tab on the right side of the dataset field.
In the Mapping section, first of all, the columns should be matched between the two sources by opening the Auto Mapping field.
In the Data Preview tab, we can see that our data is coming in without any problems.
7) Pipeline Creation Stage
Right-click on the Pipeline section and select new pipeline and give a new name(ProductJoinPipeline) to the pipeline area from the properties section on the right side of the screen.
To run the ProductDF flow after creating the pipeline, drag and drop it to the Data flow task pipeline area under the Move&Transform field in the Activities section. (Optionally, you can also drag and drop the data flow task we created in the previous stage to the pipeline area.)
To select the data flow you will run, define the data flow (ProductDF) you have created in the Data flow field under the Settings tab.
8) Pipeline Publish and Run Phase
Publish it by clicking the publish button in the upper left corner of the pipeline you created.
After your publish process is complete, you can run the pipeline. For this, we need to click on the debug button in the upper corner of the screen. After the debug process is completed, you can see that the data flow is running smoothly from the Output tab.
9) Dataset Preview and Debug
You can see the file, which is subjected to the join process, stored as CSV in the container area in the image below.
To download the file to your desktop, you can also save it in CSV format by clicking the Download button at the top.