Azure Data Factory as an ETL Tool and its Use Cases

Azure Data Factory

Nowadays, many developers are taking an interest in data engineering or data science. The era of the software developer is not going to end any time soon, but with the ever-increasing demand for consistent electronic content creation, transactional data generation, and streams of log data, data science and data engineering will surely be in demand.

Many companies know that they have ample data with them, but are not sure how to use it. Although there are many ETL tools in the market, how does one know which is the best among them? Which one will provide the optimal result at a lower cost? Which one will help us manage all our data in-house and extract accurate information from it?
There are many such questions that a company has to work through before deciding on the tool it will use for its data!

One of the most popular ETL tools is Azure Data Factory.

Azure Data Factory allows us to:
· Copy data from many supported sources, both on-premises and in the cloud
· Transform the data (see the components described below)
· Publish the copied and transformed data, sending it to a destination data storage or analytics engine
· Monitor the data flow using a rich graphical interface

Azure Data Factory is a cloud-based ETL and data integration service that allows us to create data-driven workflows for moving data and transforming it at scale. It is somewhat similar to Talend, but it has more features and is more powerful.
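
As a minimal sketch of how you might work with the service programmatically, the Python snippet below authenticates with Azure and provisions a data factory using the azure-identity and azure-mgmt-datafactory packages. The subscription ID, resource group, factory name, and region are placeholder values, and exact constructor signatures can vary slightly between SDK versions.

```python
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values -- replace with your own subscription and resource details
subscription_id = "<your-subscription-id>"
resource_group = "my-resource-group"
factory_name = "my-data-factory"

# Authenticate with whatever credential is available (CLI login, managed identity, ...)
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the data factory in a given region
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(f"Factory '{factory.name}' provisioned, state: {factory.provisioning_state}")
```

The later sketches in this article reuse `adf_client`, `resource_group`, and `factory_name` from this setup.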

Let us first understand the basic components of Azure Data Factory:

Fig 1.1 Figure showing three different activities
  • Activity: An activity is a logical operation or action that we perform on our data (a small code sketch follows this list). Some of the most important activities in Azure Data Factory are:
    - Copy Activity: This activity copies data from a source data store to a destination (sink) data store for further processing
    - Lookup Activity: This activity reads data from a dataset in any data source supported by Azure Data Factory, typically so it can be used by a subsequent activity
    - Validation Activity: This activity ensures that the pipeline continues execution only once the attached dataset reference passes the necessary validation, or a specified timeout is reached
    - Get Metadata Activity: This activity retrieves the metadata of any data in Azure Data Factory
    - Conditional (If Condition) Activity: This activity checks a given condition; it executes one set of activities when the condition evaluates to true and another set when it evaluates to false
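
As a rough illustration, this is approximately what a copy activity looks like when defined through the same Python SDK. The dataset names `InputDataset` and `OutputDataset` are hypothetical and are assumed to already exist in the factory (a dataset sketch appears further below).

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink
)

# Hypothetical datasets assumed to be already registered in the factory
source_ref = DatasetReference(reference_name="InputDataset")
sink_ref = DatasetReference(reference_name="OutputDataset")

# A copy activity that moves data from a blob source to a blob sink
copy_activity = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[source_ref],
    outputs=[sink_ref],
    source=BlobSource(),
    sink=BlobSink(),
)
```

Depending on the SDK version, reference models such as `DatasetReference` may also require an explicit `type` argument.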
Fig 2.2 Figure showing a linked service
  • Linked Services: Linked services are a very important component that links your data stores and compute resources to Azure Data Factory or a Synapse workspace. They are much like connection strings: they define the connection information the service needs to connect to external resources
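
For example, a linked service pointing at an Azure Blob Storage account could be registered roughly like this, reusing the client from the setup sketch; the connection string and service name are placeholders.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService, SecureString
)

# Placeholder connection string for the storage account
conn_string = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
)

# Wrap the linked service definition and register it in the factory
# (adf_client, resource_group, factory_name come from the setup sketch above)
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(connection_string=conn_string)
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "StorageLinkedService", blob_ls
)
```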
Fig 3.3 Figure showing a pipeline
  • Pipelines: A pipeline is a logical grouping of activities that together perform a specific job. The pipeline is basically the complete execution of the job carried out by its individual activities. Pipelines are very useful when you want to trigger the whole process, from copying data to transforming it, without triggering the different activities separately
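
Sketching this with the same SDK, a pipeline simply groups activities (here just the hypothetical copy activity from the earlier sketch) and can then be run on demand:

```python
from azure.mgmt.datafactory.models import PipelineResource

# Group the activities into a single pipeline and register it in the factory
# (copy_activity, adf_client, etc. come from the earlier sketches)
pipeline = PipelineResource(activities=[copy_activity], parameters={})
adf_client.pipelines.create_or_update(
    resource_group, factory_name, "CopyPipeline", pipeline
)

# Kick off an on-demand run of the whole pipeline and check its status
run = adf_client.pipelines.create_run(
    resource_group, factory_name, "CopyPipeline", parameters={}
)
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(f"Pipeline run {run.run_id} is {status.status}")
```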
Fig 4.4 Figure showing one of the datasets
  • Datasets: Datasets are named views of data that simply point to or reference the data you want to use in your activities as inputs and outputs
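
As a sketch, the hypothetical `InputDataset` referenced earlier could be defined as a named view over a file in blob storage; the container, folder path, and file name below are placeholders.

```python
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference
)

# Point at the linked service registered earlier
ls_ref = LinkedServiceReference(reference_name="StorageLinkedService")

# A dataset is just a named pointer to data -- here, a CSV file in blob storage
blob_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=ls_ref,
        folder_path="input-container/raw",
        file_name="data.csv",
    )
)
adf_client.datasets.create_or_update(
    resource_group, factory_name, "InputDataset", blob_dataset
)
```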
Fig 5.5 Figure showing one of the dataflows
  • Mapping Dataflows: Mapping data flows are visually designed data transformations in Azure Data Factory. They help you design data transformation logic without actually coding or writing scripts
  • Triggers: Triggers in ADF are another way to execute a pipeline run. A trigger represents a unit of processing that determines when a pipeline execution needs to be kicked off
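
For instance, a schedule trigger that kicks off the hypothetical `CopyPipeline` every 15 minutes could be sketched as follows; again, the names are placeholders and method names such as `begin_start` can differ between SDK versions.

```python
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference
)

# Run the pipeline every 15 minutes, starting a few minutes from now
recurrence = ScheduleTriggerRecurrence(
    frequency="Minute",
    interval=15,
    start_time=datetime.utcnow() + timedelta(minutes=5),
)
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(reference_name="CopyPipeline"),
                parameters={},
            )
        ],
    )
)
adf_client.triggers.create_or_update(
    resource_group, factory_name, "Every15MinutesTrigger", trigger
)

# Triggers are created in a stopped state; they must be started before they fire
adf_client.triggers.begin_start(
    resource_group, factory_name, "Every15MinutesTrigger"
).result()
```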

There are some features that distinguish Azure Data Factory from other tools.
· It has the ability to run SSIS (SQL Server Integration Services) packages
· As a fully managed PaaS product, it auto-scales based on the given workload
· It allows pipelines to be scheduled to run as frequently as once per minute
· It bridges on-premises & Azure Cloud seamlessly through a gateway
· It can handle big data volumes
· It can connect & work together with other compute services (Azure Batch, HDInsights) to even run truly big data computations during ETL
· It supports both pre- and post-load transformations
· It integrates with about 80 data sources, including SaaS platforms, SQL and NoSQL databases, generic protocols, and various file types
· It supports around 20 cloud and on-premises data warehouse and database destinations
· Pricing for Azure Data Factory’s data pipelines is calculated based on the number of pipeline orchestration runs, compute hours for data flow execution and debugging, and the number of Data Factory operations such as pipeline monitoring, so we pay only for the resources we actually use, no less and no more!

The various use cases of Azure Data Factory are as follows:
· Used in supporting the data migration for advanced analytics projects.
· Can be used to rework the ETL process from SQL Server Integration Services (SSIS) to extract data
· Can be used as a solution for getting data from a client’s server, or online data, to an Azure Data Lake. We create pipelines to orchestrate the data flow from source to target
· It can be used for carrying out various data integration processes and is also one of the most popular ETL tools
· It can be used to integrate data from different ERP systems and load it into Azure Synapse for reporting; it also supports analysis and reporting through Power BI
· The ADF tools are impressively well-integrated, allowing quick development of ETL, big data, data warehousing, and machine learning solutions with the flexibility to grow and adapt to changing or enhanced requirements

Data Factory offers us the possibility to easily integrate cloud data with on-premises data. It is unique in combining ease of use with the ability to transform and enrich complex data. It delivers data integration that is scalable, highly available, and low in cost. Today this service is a crucial building block in any data platform and in machine learning tasks.

Keep learning and keep growing and also keep exploring more!

All the very best!

For more interesting and informative articles and tips, follow me on Medium and LinkedIn
