Dimension Design on Azure Data Factory

Chamathka Maddugoda
Get Started with Azure Data Factory
7 min read · Mar 30, 2022

Azure Data Factory (ADF) is the ETL service available on the Azure portal, and it lets us create workflows much like the ones we design in SQL Server Integration Services (SSIS). However, workflow design in ADF differs slightly from that of SSIS.

Connection Setup

First and foremost, to get started, create an ADF resource on the Azure portal and open it. You will find four tabs, namely Home, Author, Monitor and Manage. Click on the Manage tab to set up the connections to other services such as databases and storage.

Scenario: We will be using Blob Storage as the source, where the source files are uploaded, and an Azure SQL database as the target in which we create our dimensions.

Setup Source Connection

In this case, we can create connections to the source and the destination by clicking on the 'New' icon under Linked services in the Manage tab.

Select Azure Blob Storage and provide the credential details.

Setup Target Connection

Similar to the source connection setup, click 'New', select Azure SQL Database and provide the credentials.

Workflow Design

Now that we are done with setting up the connections, let's dive into workflow design. Switch from the Manage tab to the Author tab in order to proceed. There you will basically observe four sections:

  1. Pipeline
  2. Dataset
  3. Data flow
  4. Power Query

Dataset

First, we should create connections that refer to the source file and the target table. This can be done by adding a new dataset under the Dataset section. There we can select the type of dataset, similar to how we set up the linked services to the SQL database and Blob Storage. Now we can access the tables and files within those storage locations and add them as datasets.

Next, we get to select the file format of the dataset. If it is a file uploaded to Blob Storage, you will be presented with a list of formats from which you can select the type of file you uploaded (most commonly CSV or Excel).

If it is a table in the SQL database, you can select the linked service you created earlier in the Manage tab and pick the relevant table from the list of tables you have created in that specific database.

Note: You should already have a table created in the SQL database. If you haven't created one, you can create a table with a simple SQL query on your database.
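For example, a minimal product dimension table could look like the sketch below. The table and column names (DimProduct, ProductCode and so on) are hypothetical, so adjust them to your own model; ProductKey is the auto-generated surrogate key we will rely on later.

```sql
-- Hypothetical target dimension table on the Azure SQL database
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key, auto-generated
    ProductCode NVARCHAR(50)      NOT NULL,             -- business key coming from the source file
    ProductName NVARCHAR(200)     NULL,
    Category    NVARCHAR(100)     NULL,
    ListPrice   DECIMAL(18, 2)    NULL
);
```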

Once you've set up the connections to the datasets, they will appear as follows.

Dataflow

Now let's check the most important section: Dataflow. You can add a new data flow by clicking on the Dataflow section, just as in the Dataset section.

Dataflow design for dimensions

The dataflow design for dimensions should have two sources selected, namely the source and the target. The source is the source file, as its literal meaning suggests. The other one is the target table in the database. Now you might be wondering why we select the target table as a source. The reason is this: ideally, we should never truncate dimensions.

Why shouldn't we truncate dimension tables?

Dimensions have an auto-generated key field called the surrogate key, which is allocated by mapping to the business key. So if we truncated a dimension table, a specific business key could be allocated a different surrogate key after the reload. However, we refer to this surrogate key in the facts, and hence, if we carry several months of data, the fact results for earlier months might become inconsistent with the current month due to this conflict. Dashboards created over earlier months would then see different surrogate keys for the same business key, which might result in incompatibility or even wrong visualizations when a range of months is concerned. Therefore, we should never truncate dimension tables in an incremental load.
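A quick way to see the problem, using the hypothetical DimProduct table from earlier: TRUNCATE TABLE resets the IDENTITY seed, so the same business key can come back with a different surrogate key if the reload order changes.

```sql
-- Load 1: 'P100' gets ProductKey = 1, 'P200' gets ProductKey = 2
INSERT INTO dbo.DimProduct (ProductCode, ProductName) VALUES ('P100', 'Bike');
INSERT INTO dbo.DimProduct (ProductCode, ProductName) VALUES ('P200', 'Helmet');

-- Fact rows loaded now reference ProductKey = 1 for 'P100'.

TRUNCATE TABLE dbo.DimProduct;  -- resets the IDENTITY seed back to 1

-- If the next load arrives in a different order, 'P100' becomes ProductKey = 2,
-- while the old fact rows still point to ProductKey = 1.
INSERT INTO dbo.DimProduct (ProductCode, ProductName) VALUES ('P200', 'Helmet');
INSERT INTO dbo.DimProduct (ProductCode, ProductName) VALUES ('P100', 'Bike');
```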

Alright, back to the topic. Since we should not truncate dimensions, we should look up and update them instead. In order to compare the incoming records with the data that is currently available in the database, we need a reference to the target table, and hence we select the target table as a source as well.

First and foremost, we should click Add Source twice and select the source file in Blob Storage and the target table in the SQL database.

Optionally, we can add a schema modifier with a Select transformation and rename all the columns returned from the source file as S_columnName so that they can be clearly identified at the merge. This can be done by clicking on the small plus icon that appears next to the source containing the actual source file and choosing the Select transformation from the options.

Now double-click the Select transformation and rename the columns as intended. However, this step is completely optional.

Then we can add a Join transformation and left join the result of the Select transformation to the target table, mapping them on the business key.

Now we should check whether the relevant business key (in this case ProductCode) is already available in the target table. If so, we can go ahead and update the existing row, and if not, we can simply insert it since it is a new record. We can use a Conditional Split to achieve this.

You will get two branches when you add the Conditional Split. We can rename the branches and insert our condition; typically the condition checks whether the business key coming from the target side of the left join is null, meaning no matching row was found.
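In SQL terms, the split simply separates the rows where the left join found a match from those where it did not. The sketch below assumes the incoming file has been staged into a hypothetical dbo.StagingProduct table, purely for illustration:

```sql
-- Rows whose business key already exists in the dimension -> update branch
SELECT s.*
FROM dbo.StagingProduct AS s
INNER JOIN dbo.DimProduct AS d
    ON d.ProductCode = s.ProductCode;

-- Rows with no match in the dimension -> insert branch
SELECT s.*
FROM dbo.StagingProduct AS s
LEFT JOIN dbo.DimProduct AS d
    ON d.ProductCode = s.ProductCode
WHERE d.ProductCode IS NULL;
```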

When the condition is satisfied, we can update the existing row with the new data. But unlike in SSIS, before directly updating a table in ADF it is required to have an Alter Row transformation. Therefore, let's add an Alter Row transformation and pass a true condition.

Now we can update the changed data in our target table. We can select the Sink option for that.

Now we should set the update method to 'Allow update' and select the key column that is used for the update. In this case it is ProductCode.

Now we can map the result to the target in order to perform the update.

On the other branch, where the condition is not met, we can directly add a Sink destination and configure the mapping.
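For intuition only, the overall lookup/update/insert pattern this dataflow implements is roughly what the following T-SQL MERGE would do against the hypothetical DimProduct and StagingProduct tables. This is just a sketch of the logic, not something ADF generates:

```sql
-- Rough T-SQL equivalent of the dataflow logic, for intuition only
MERGE dbo.DimProduct AS target
USING dbo.StagingProduct AS source
    ON target.ProductCode = source.ProductCode          -- join on the business key
WHEN MATCHED THEN                                        -- update branch
    UPDATE SET target.ProductName = source.ProductName,
               target.Category    = source.Category,
               target.ListPrice   = source.ListPrice
WHEN NOT MATCHED BY TARGET THEN                          -- insert branch
    INSERT (ProductCode, ProductName, Category, ListPrice)
    VALUES (source.ProductCode, source.ProductName,
            source.Category, source.ListPrice);
```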

Now we are done with the workflow design, and the overall workflow should look as below. :)

Cool. Now we are done with the main section. Let's check how to use a pipeline to run the workflow.

Pipeline

Similar to the other sections, a new pipeline can be created by clicking on the 'New pipeline' option in the Pipeline section.

For our purpose, there are two options in ADF pipelines: we can either copy data or use a dataflow.

Copy data is used when we need to copy source data directly to the target table without any condition or transformation. Typically, this is used for loading data from the source into landing tables.
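As a point of contrast with the dimension table above, a landing table is usually just a 1:1 mirror of the source file that can safely be truncated and reloaded on every run. Again, the names below are hypothetical:

```sql
-- Hypothetical landing table: no surrogate key, safe to truncate on each load
CREATE TABLE dbo.LandingProduct
(
    ProductCode NVARCHAR(50)  NULL,
    ProductName NVARCHAR(200) NULL,
    Category    NVARCHAR(100) NULL,
    ListPrice   NVARCHAR(50)  NULL   -- kept as text here; typed later during transformation
);

TRUNCATE TABLE dbo.LandingProduct;   -- run before each copy into the landing table
```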

Since we already have a dataflow created in the Dataflow section, let's select the Data flow option.

All we have to do now is to move to the dataflow settings and select the dataflow we designed.

Now we are ready to debug the dataflow. But keep in mind that, since we selected the Data flow option, we should first turn on debug mode. If we had selected Copy data, this would not be a problem. Still, it's not a huge deal: you just have to toggle the Data flow debug button.

Cheers! We are done with the dimension design on Azure Data Factory. Thank you for sticking around till the very end, and I hope you got something out of this blog post!
