Replicating on-premise data pipeline on Azure Data Factory

Abdulkhader Sakivelu · Published in DataPebbles · Jan 20, 2023 · 13 min read

In this blog, we will go over how to replicate an on-premise data engineering pipeline on the Azure cloud using Azure services.

On-Premise Big data pipeline:

Let us go over a generic example of a data engineering pipeline in an on-premise setup, taking Cloudera’s distribution of Hadoop as an example.

Every pipeline starts with sourcing the data, which can be done in two ways: you either connect to the data source and extract the data, or you pick up the data file from a landing folder. In either scenario, a lot of configuration and some level of coding are required to collect, load and parse the data. Then we move on to the transformation part, where we write the code to transform the data and load it into a table, which can be either a staging table or a final table in a data warehouse (or a data lake). This part can be complex, as it requires a good amount of coding experience, the knowledge to write code in MapReduce or Apache Spark, and infrastructure that can handle such large volumes of data efficiently. It also adds the burden of maintaining the required dependencies and libraries. Once the data has been processed and stored in a meaningful way, it can be used to draw insights.
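To make this concrete, below is a minimal PySpark sketch of the kind of hand-written transformation step described above; the paths, column names and Hive table are assumptions used purely for illustration.

```python
# Minimal sketch of an on-premise Spark ETL step (illustrative names only)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("book_reviews_etl")
         .enableHiveSupport()
         .getOrCreate())

# Extract: pick up files dropped into a landing folder on HDFS
reviews = spark.read.parquet("hdfs:///landing/book_reviews/")

# Transform: keep only reviews that came from verified purchases
verified = reviews.filter(reviews.verified_purchase == "Y")

# Load: write into a staging (or final) Hive table in the warehouse
verified.write.mode("overwrite").saveAsTable("staging.book_reviews_verified")
```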

To orchestrate the entire process, we typically end up using a cron scheduler to run the jobs, which also requires some level of Unix shell scripting.

We have only discussed a scenario where we were processing and writing data in Hadoop. But in a production environment at a large organisation, you will have sources and targets of different types, and each of them will require a different connector or a different approach to sending and receiving data. All of that requires a lot of code. Maintaining the code, versioning it and implementing CI/CD adds more complexity and requires the involvement of people with expertise in different areas of software engineering. That is why the entire process can be slow.

Challenges:

We can list the main challenges below:

→ Setting up, scaling and managing the infrastructure.

→ Managing software and library dependencies.

→ Writing quality code that utilises multiple frameworks and integrating them with each other.

→ Setting up CI/CD pipelines.

→ Orchestrating and automating the process.

When we look at the other side of the spectrum, however, Azure Data Factory is one solution with which we can achieve all of the above without having to be too tech-savvy.

But before that, let us refresh a few key concepts and the benefits of Azure Data Factory.

What Is Azure Data Factory?

Azure Data Factory is a fully managed, serverless data integration service that enables us to set up a data pipeline from scratch without having to write code. We can design and implement an optimised ETL/ELT workflow within a short period of time without having to concentrate on technical implementations.

The key benefits of Azure Data Factory are:

Connect and Collect:

The variety of data sources is always growing and the setup required to pull data from these sources can be complex to implement. ADF offers more than 100 connectors which enable us to simply connect to the source and get the data.

Transform and enrich:

After the data has been loaded into the cloud store, complex transformations can be performed visually, without writing code, by using ADF mapping data flows. Data flows enable us to execute transformations by leveraging the capabilities of Apache Spark without having to understand Spark clusters or Spark programming.

CI/CD and publish:

Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product.

Orchestrate, Manage and Monitor pipelines :

ADF gives us the flexibility to follow a horses-for-courses approach, where we can easily integrate other Azure services such as Azure Databricks, Azure Synapse Analytics notebooks and Azure HDInsight within a single pipeline by mapping the flow of execution. This lets us design an optimal pipeline for our big data workloads.

Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

Top-level concepts:

An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of the below key components.

  • Pipelines
  • Activities
  • Datasets
  • Linked services
  • Data Flows
  • Integration Runtimes

These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.
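To illustrate how these components relate, the sketch below creates a linked service, two datasets and a pipeline programmatically with the azure-mgmt-datafactory Python SDK instead of the UI. All resource, container and file names are placeholders, and the copy activity is used only to show how a pipeline ties the pieces together, not as the method we follow later in this article.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobFSLinkedService, LinkedServiceReference,
    DatasetResource, ParquetDataset, AzureBlobFSLocation, DatasetReference,
    PipelineResource, CopyActivity, ParquetSource, ParquetSink,
)

sub_id, rg, df = "<subscription-id>", "<resource-group>", "<data-factory-name>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Linked service: the connection information for a data store (here ADLS Gen2;
# with no explicit credential the factory's managed identity is used).
ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://<storage-account>.dfs.core.windows.net"))
adf.linked_services.create_or_update(rg, df, "ls_adls", ls)

# Datasets: named views over data reachable through the linked service.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="ls_adls")
for ds_name, container in [("ds_books_raw", "source"), ("ds_books_copy", "staging")]:
    ds = DatasetResource(properties=ParquetDataset(
        linked_service_name=ls_ref,
        location=AzureBlobFSLocation(file_system=container, file_name="books.parquet")))
    adf.datasets.create_or_update(rg, df, ds_name, ds)

# Pipeline: a logical grouping of activities; here a single copy activity
# that moves the Parquet file from one dataset to the other.
copy = CopyActivity(
    name="copy_books",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_books_raw")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_books_copy")],
    source=ParquetSource(), sink=ParquetSink())
adf.pipelines.create_or_update(rg, df, "pl_demo", PipelineResource(activities=[copy]))
```

The ADF Studio UI we use in the next section produces equivalent JSON definitions for these same components behind the scenes.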

Setting up a Pipeline in Azure Data Factory:

Now that’s a lot of theory. Let’s move on to some practical scenarios and implement a small pipeline in ADF. I have signed up for a one-month free trial account on Azure and will be using it here.

In this pipeline we will try to achieve the below:

  • Upload a file to a container and read it from Azure Data Factory as a source.
  • Perform a simple transformation and write it to a table in a dedicated SQL pool in Azure Synapse Analytics.
  • Create a simple graph comparing two columns from the table.

Assuming you have an Azure account, let’s start by first creating a resource group. You can select the region of your choice; I went with the one closest to me.

Create and define the Source:

Now, from the home page, create a storage account under the resource group we created, in the same location. Keep the default configuration, but make sure to enable “hierarchical namespace” in the advanced settings tab. This makes the storage account an Azure Data Lake Storage Gen2 account. Wait until you get a message saying the deployment is successful.
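If you prefer scripting to the portal, the same storage account can also be created with the azure-mgmt-storage Python SDK. A minimal sketch, with placeholder names, is shown below; setting is_hns_enabled=True is what makes the account ADLS Gen2.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

storage = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = storage.storage_accounts.begin_create(
    "<resource-group>", "<storage-account-name>",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="<region>",
        is_hns_enabled=True,  # hierarchical namespace -> Azure Data Lake Storage Gen2
    ),
)
account = poller.result()  # blocks until the deployment is successful
```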

We will be using the same storage account throughout this article. Go to the “All resources” page and select the storage account you created. Click on “Containers”, create a container named “source”, and upload a sample file from your local machine using the GUI. I will be using a small dataset about books and their reviews in Parquet format.
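The same container creation and upload can be scripted with the azure-storage-blob SDK if you prefer; the account and file names below are assumptions matching this walkthrough.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient("https://<storage-account-name>.blob.core.windows.net",
                            credential=DefaultAzureCredential())

# Create the "source" container and upload the sample Parquet file
container = service.create_container("source")
with open("book_reviews.parquet", "rb") as f:
    container.upload_blob(name="book_reviews.parquet", data=f, overwrite=True)
```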

We have set up the data required for the current task. Now let’s create an Azure Data Factory resource. From the home page, click on “Create a resource” and search for Azure Data Factory. Use the same resource group and location along with the default configuration. Once the Data Factory resource is created, go to the “All resources” page and select the data factory resource you created. Click on “Launch studio” as shown below.

This will open the Azure Data Factory Studio portal in a new tab. But before we move further, let us give the data factory resource permission to access the storage account. From the “All resources” page, navigate to the storage account you created and follow the three steps highlighted below to add a role assignment.

Search and select “Storage Blob Data Contributor” and click on “Next”.

Now follow the steps highlighted below. Select “Managed identity”, and in the side panel that pops up, select your subscription and choose Data Factory as the managed identity type; your data factory resource name should appear in the selection section. Select the resource and click on “Review + assign”.

Now the data factory has read, write and delete access to the storage account. You can also restrict access at the container level if required, but for the current exercise we will grant access at the storage account level.
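For reference, the same role assignment can be scripted with the azure-mgmt-authorization SDK. The sketch below is only an outline under assumptions: you would supply your own subscription, storage account name and the object ID of the data factory’s system-assigned managed identity.

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

sub_id = "<subscription-id>"
auth = AuthorizationManagementClient(DefaultAzureCredential(), sub_id)

# Scope: the storage account the data factory needs to read from and write to
scope = (f"/subscriptions/{sub_id}/resourceGroups/<resource-group>"
         "/providers/Microsoft.Storage/storageAccounts/<storage-account-name>")

auth.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # each role assignment needs a new GUID as its name
    RoleAssignmentCreateParameters(
        # built-in "Storage Blob Data Contributor" role definition
        role_definition_id=(f"/subscriptions/{sub_id}/providers/Microsoft.Authorization"
                            "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"),
        principal_id="<data-factory-managed-identity-object-id>",
        principal_type="ServicePrincipal",
    ),
)
```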

Now, let’s jump over to the Azure Data Factory Studio tab, where we will be using the UI to develop our pipeline. Let’s start by first connecting the data factory to the storage account using a linked service. Click on Manage → Linked services → New. Then, in the side panel, search and select “Azure Data Lake Storage Gen2” and click on “Continue”.

Then, in the details window, assign a name to the linked service or leave the default. Choose “AutoResolveIntegrationRuntime” and select “System Assigned Managed Identity” as the authentication type. Select your storage account and click on “Test connection”. This should give a connection succeeded message; then click on “Create”.

Now that we have established the connection, let us import the sample file we uploaded earlier. Follow the steps in the below picture in the same order.

Then search and select “Azure Data Lake Storage Gen2” and click on “Continue”.

Then select the file format of your source file and continue. I have selected “Parquet”.

On the “Set properties” page, select the linked service you created and use the browse icon to navigate to your file as shown below. If you want to use multiple files, you can use wildcards in the path to identify them. Click on “OK”.

Now a dataset for your file will be created and it should be open in the canvas. Assign a name to the dataset in the properties section. You can preview the data quickly by using the “preview data” option.

Now that we have set up the source data we need to set up the target where we will be writing the data after performing a simple transformation. I will be using a dedicated SQL pool in Azure Synapse Analytics and writing the results to the table.

Note: We haven’t saved our work in Azure Data Factory yet, so don’t close the tab until you have published your progress.

Create and define the target:

Let us create an Azure Synapse workspace. Switch to a new tab and open the home page of your Azure account. Click on “Create a resource”, then search for and select “Azure Synapse Analytics”. Fill in the required details and assign a unique name to the workspace. A workspace requires an Azure Data Lake Storage Gen2 container to be associated with it to store workspace data; we can use our existing storage account and create a new container in it.

Click on “Next: Security” and provide the username and password for the workspace.

Click on “Review + create” → “Create”. Wait for the deployment to complete, which should take a couple of minutes. Once it is completed, click “Go to resource group” and select the workspace that was just created.

Now, from the workspace portal, open Synapse Studio.

From Synapse Studio, let us create a dedicated SQL pool to host our table. Move to the “Manage” portal from home, select “SQL pools” and click on “New”. Give the pool a name and adjust the slider for the performance level. I have opted for the lowest performance level, as it is enough for this exercise. Leave the rest as default and click on “Review + create” → “Create”.

Wait for the pool to be created and for its status to become active.

Now move to the Develop portal and create a new SQL script as shown below.

Now rename the script and add a CREATE TABLE statement for the target table. Ensure that you are connected to the correct pool and execute the query. Once the table is created, run a SELECT query on it; it should return zero records. Click on “Publish all” to save and publish your work.

NOTE: While creating the table, make sure that the column data types match the data you will be loading into it.
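If you would rather script this step, the same DDL and sanity check can be run from Python with pyodbc against the dedicated SQL pool endpoint. The server, database, table and column definitions below are placeholders; adapt them to the schema of your Parquet file.

```python
import pyodbc

# Connect to the dedicated SQL pool with the SQL admin credentials set on the workspace
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace-name>.sql.azuresynapse.net;"
    "DATABASE=<dedicated-pool-name>;"
    "UID=<sql-admin-user>;PWD=<sql-admin-password>",
    autocommit=True,
)
cur = conn.cursor()

# Target table: column types should match the Parquet data being loaded
cur.execute("""
    CREATE TABLE dbo.book_reviews_verified (
        book_title        NVARCHAR(400),
        review_rating     INT,
        verified_purchase NVARCHAR(5)
    )
""")

# Sanity check: the table exists and is empty before the pipeline runs
cur.execute("SELECT COUNT(*) FROM dbo.book_reviews_verified")
print(cur.fetchone()[0])  # expected: 0
```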

Now let’s go back to the Data Factory studio and create a linked service for the dedicated pool we created. Click on Manage → Linked services → New. Then, in the side panel, search and select “Azure Synapse Analytics” and click on “Continue”. In the properties, provide a name and ensure the integration runtime is set to “AutoResolveIntegrationRuntime”. Select your subscription, the Synapse Analytics workspace as the server, and the dedicated SQL pool we created as the database.

Select SQL Authentication as the authentication type and provide the username and password used while creating the Synapse workspace. Test the connection and, once it succeeds, click on “Create”.

Setting up the data flow:

Now that we have configured both the source and the target, let us create a mapping data flow that uses the two, with a simple filter condition. Go to the Author portal → Data flows → New data flow.

This opens up the canvas. Give the data flow a name in the right-side panel. Click on “Add source” in the canvas. In the bottom panel, select the source dataset you created and disable the “Allow schema drift” option.

In the canvas, click on the ‘+’ icon and select the ‘Filter’ transformation from the list. Provide a name for it and the filter condition. The sample book ratings dataset I am using contains a field that indicates whether the review came from a verified purchase; I have added a condition to select only the verified ones.

Now click on the ‘+’ symbol next to the Filter transformation and select ‘Sink’ from the list. In the bottom panel, provide a name for the sink and disable the schema drift option.

Click on ‘New’ to add a new dataset, select ‘Azure Synapse Analytics’ as the type and click Continue. Then select the linked service for the SQL pool and the table you created, and click OK.

Click on ‘Publish all’ and publish your work so far. Now let's create a pipeline and add the data flow we created to that pipeline.

Setting up the pipeline:

From the same portal, click on Pipelines → New pipeline.

This opens up the pipeline canvas. Provide a name to the pipeline and drag the data flow object you created onto the canvas.

Now switch to the Settings tab and set the logging level to ‘None’ (for this exercise). For staging, select your storage account and use the browse option to select a container. I am reusing the container linked to the Azure Synapse workspace.

Click on the Validate icon to ensure everything is OK. We now have a simple pipeline with a single data flow activity. We have the option to link this data flow activity to another data flow or to other Azure services such as Synapse notebooks, Azure Databricks notebooks and Azure HDInsight, and to add dependency conditions to map them. But for the sake of simplicity, I will stop here.

Now we can run this job by adding a trigger. We can either run it immediately or configure a schedule to run it at a specific time. Let’s trigger it now; but before that, publish the pipeline by clicking on ‘Publish all’, and then trigger it.

Now monitor the pipeline using the ‘Monitor’ hub. It will take a couple of minutes for the pipeline to complete, as it takes time for the cluster to start up. Keep refreshing until the pipeline has completed.
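If you prefer to trigger and monitor runs programmatically, a minimal sketch with the azure-mgmt-datafactory SDK is shown below; the resource and pipeline names are placeholders for the ones you created above.

```python
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

rg, df = "<resource-group>", "<data-factory-name>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Trigger the published pipeline now
run = adf.pipelines.create_run(rg, df, "<pipeline-name>")

# Poll the run status until the data flow cluster has started and the run finishes
while True:
    status = adf.pipeline_runs.get(rg, df, run.run_id).status
    print(status)
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```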

Now that the pipeline has completed, verify that the data was loaded into the target table by running a SELECT query.

Conclusion:

That wraps up this exercise, in which we created a simple ETL pipeline using Azure Data Factory. This is just a small example, however, and Data Factory is capable of a lot more.

When we compare the complexity of authoring and managing a data pipeline on on-premise infrastructure with doing so in the cloud (Azure, in our example), we come to realise that one doesn’t have to be a technical wizard to set up a scalable pipeline. This gives people from other backgrounds, such as database administration or business intelligence, an opportunity to update their skill set and move into data engineering, and it gives data engineers an opportunity to collect and organise data more efficiently, which in turn provides more business value.

For any questions or suggestions, please reach out at abdulkhader.sakivelu@datapebbles.com.
