An Introduction to Azure Data Factory

Alan De Biaggi
Unitfly
Mar 5, 2018

To make use of all the data flowing around us, we need to understand its core meaning, then gather that data and analyze it. We have already described decision-making once you have the data. To ensure the data is transferred and organized, and the whole process scheduled and monitored, we'll introduce Azure Data Factory.

Azure Data Factory (ADF) is, by definition, a cloud integration tool. To be more specific, ADF is a data movement and data transformation tool. Azure Data Factory can also be used as a hybrid data integration tool, connecting both to on-premises sources and to sources in the cloud.

With the new approach (v2), Azure Data Factory can be very useful for data movement and data integration. We dare to say that Data Factory is a substitute for the failed implementation of Microsoft BizTalk Services in Azure. Maybe it's a logical answer to Microsoft BizTalk Server: BizTalk is an enterprise integration tool with many functionalities and powerful features that come at a high price, while Data Factory is a more affordable pay-per-use service. The pay-per-use model allows the customer to take a lean and agile approach: start small, then scale up. Every activity can be controlled and monitored, including how many resources were used and how much will eventually be charged. For example, data movement from a SQL database to a web service would cost $0.125 to $0.25 per hour. From time to time we will draw a parallel with Microsoft BizTalk Server as one of the best Microsoft integration tools.

Using Azure Batch and Azure Data Factory together for processing large-scale datasets is a powerful combination that will be described in one of the next posts.

By design it is more extract-and-load (EL) with transformation after loading (ELT) than traditional extract-transform-load (ETL). It can be used in whatever way we like, and that's something we will focus on.

Azure Data Factory, like any other integration tool, connects to the source, collects the data, usually does something clever with it, and sends the processed data to a destination. The final touch is monitoring all the processes and transfers. Let's review these steps:

  • Connect and collect — connecting to various types of data sources: on-premises or in the cloud; structured, unstructured, or semi-structured; at different time intervals. Azure Data Factory has built-in connectors, or you can develop your own (see the connection sketch after this list).
  • Transform and enrich — processing and preparing data for the destination. Data can be transformed using compute services in Azure such as HDInsight Hadoop, Spark, Data Lake Analytics, Machine Learning, or a custom service.
  • Publish — transferring processed, business-ready data in a consumable form into a database, a service, or an analytics engine. From that point the data can bring business value to a business user through any business intelligence tool.
  • Monitor — monitoring scheduled activities and processes. Support is crucial in integration processes, as you always want to know whether the magic happened or not.
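
To make the "connect" step concrete, here is a minimal sketch that authenticates against Azure and creates a data factory with the Python SDK (azure-mgmt-datafactory). The subscription, service principal, resource group, and factory names are placeholders, and exact signatures vary slightly between SDK versions:

    # Minimal sketch: authenticate and create a data factory with the Python SDK.
    # pip install azure-mgmt-datafactory azure-common
    from azure.common.credentials import ServicePrincipalCredentials
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory

    subscription_id = '<subscription-id>'          # placeholder
    credentials = ServicePrincipalCredentials(
        client_id='<application-id>',              # placeholder
        secret='<client-secret>',                  # placeholder
        tenant='<tenant-id>')                      # placeholder

    adf_client = DataFactoryManagementClient(credentials, subscription_id)

    rg_name = 'adf-demo-rg'        # hypothetical resource group, must exist
    df_name = 'adf-demo-factory'   # hypothetical factory name, globally unique

    # Create (or update) the data factory itself
    df = adf_client.factories.create_or_update(
        rg_name, df_name, Factory(location='westeurope'))
    print(df.provisioning_state)

The later sketches in this post reuse adf_client, rg_name, and df_name from here.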

Architecture

An Azure Data Factory instance can have one or more pipelines.

Pipeline & Activity

A pipeline is a logical grouping of activities that together perform a task. Within that grouping, the individual tasks can still be logically separated. Pipelines can be scheduled and executed via triggers, or run manually on demand. A pipeline is like an orchestration or a workflow, and activities are like shapes in a BizTalk orchestration. Activities come in four types (a pipeline sketch follows the list):

  • Data movement activities — the classical data movement activity; copies data from a source to a destination.
  • Data transformation activities — Data Factory supports a range of out-of-the-box transformation activities that can be executed on a compute environment.
  • Control activities — introduced in Data Factory v2; their role is to handle conditions and loops inside the pipeline.
  • Custom activities — with a custom activity you can perform a data movement or data transformation activity, or basically anything you want. In a custom activity, the only limit is your imagination.
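
As a sketch of the most common case, the following builds a pipeline with a single copy (data movement) activity, runs it on demand, and polls the run status (the monitoring step). It reuses adf_client, rg_name, and df_name from the first sketch; the dataset names are hypothetical and are created in the Dataset section below:

    from azure.mgmt.datafactory.models import (
        CopyActivity, BlobSource, BlobSink, DatasetReference, PipelineResource)

    # Copy from one blob dataset to another; 'InputDataset' and
    # 'OutputDataset' are hypothetical names created in the Dataset section.
    copy = CopyActivity(
        name='CopyBlobToBlob',
        inputs=[DatasetReference(reference_name='InputDataset')],
        outputs=[DatasetReference(reference_name='OutputDataset')],
        source=BlobSource(),
        sink=BlobSink())

    adf_client.pipelines.create_or_update(
        rg_name, df_name, 'CopyPipeline', PipelineResource(activities=[copy]))

    # Run on demand, then check the run status
    run = adf_client.pipelines.create_run(
        rg_name, df_name, 'CopyPipeline', parameters={})
    print(adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status)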

Dataset

A dataset is similar to a schema in BizTalk, except that a dataset can also carry information about the path to a specific resource. A dataset represents a structure, a named view that points to or references the data you want to use in your activities as inputs or outputs. Datasets are, for example, files, folders, tables, and documents.
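
As an illustration, here is a hedged sketch of a blob dataset pointing at a folder and file in Azure Blob Storage. The container, folder, and file names are hypothetical, and the linked service it references is created in the next section:

    from azure.mgmt.datafactory.models import (
        AzureBlobDataset, DatasetResource, LinkedServiceReference)

    # A dataset is a named view over the data; it holds no data itself.
    blob_ds = AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            reference_name='AzureStorageLinkedService'),
        folder_path='adfdemo/input',   # hypothetical container/folder
        file_name='orders.csv')        # hypothetical file

    adf_client.datasets.create_or_update(
        rg_name, df_name, 'InputDataset', DatasetResource(properties=blob_ds))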

Linked service

A linked service is used as the link between a data store or a compute resource and the data factory, much like a connection string. Linked services exist, for example, for (a sketch follows the list):

  • HDInsight
  • Azure SQL Database
  • Azure Blob Storage
  • Machine Learning
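
A hedged sketch of a storage linked service, with a placeholder account name and key:

    from azure.mgmt.datafactory.models import (
        AzureStorageLinkedService, LinkedServiceResource, SecureString)

    # The connection string is wrapped in a SecureString so the service
    # does not echo it back on reads.
    conn = SecureString(
        value='DefaultEndpointsProtocol=https;'
              'AccountName=<account>;AccountKey=<key>')  # placeholders

    adf_client.linked_services.create_or_update(
        rg_name, df_name, 'AzureStorageLinkedService',
        LinkedServiceResource(properties=AzureStorageLinkedService(
            connection_string=conn)))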

New capabilities in ADF version 2

Azure Data Factory version 2 brings two main new capabilities:

  • Control flow and scale
  • Deploy and run SQL Server Integration Services (SSIS) packages in Azure

Control Flow and scale

To make activities and pipelines more flexible and reusable, these capabilities were added to the pipeline:

Control flow

  • Chaining activities in a sequence within a pipeline
  • Branching activities within a pipeline
  • Parameters — parameters can be defined at the pipeline level, and arguments can be passed while you're invoking the pipeline on demand or from a trigger. Activities can consume the arguments that are passed to the pipeline (see the sketch after this list).
  • Custom state passing — activity outputs, including state, can be consumed by a subsequent activity in the pipeline.
  • Looping containers: ForEach and Until
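
As a hedged sketch of parameters and looping, the following defines a pipeline with an array parameter and a ForEach container that runs the earlier CopyPipeline once per element. The pipeline and parameter names are hypothetical; '@pipeline().parameters.folders' is ADF v2 expression syntax:

    from azure.mgmt.datafactory.models import (
        PipelineResource, ParameterSpecification, ForEachActivity,
        ExecutePipelineActivity, PipelineReference, Expression)

    # Run 'CopyPipeline' once per element of the 'folders' array parameter.
    run_copy = ExecutePipelineActivity(
        name='RunCopy',
        pipeline=PipelineReference(reference_name='CopyPipeline'))

    loop = ForEachActivity(
        name='ForEachFolder',
        items=Expression(value='@pipeline().parameters.folders'),
        activities=[run_copy])

    adf_client.pipelines.create_or_update(
        rg_name, df_name, 'LoopPipeline',
        PipelineResource(
            parameters={'folders': ParameterSpecification(type='Array')},
            activities=[loop]))

    # Arguments are passed when the pipeline is invoked on demand
    adf_client.pipelines.create_run(
        rg_name, df_name, 'LoopPipeline',
        parameters={'folders': ['input/2018/01', 'input/2018/02']})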

Trigger-based flows

  • Pipelines can be triggered on demand or on wall-clock time (see the trigger sketch below).
  • Delta flows — use parameters and define your high-water mark for delta copies when moving dimension or reference tables from a relational store, either on-premises or in the cloud, into the data lake.
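
A hedged sketch of a wall-clock (schedule) trigger for the earlier CopyPipeline; the trigger name and the hourly recurrence are illustrative:

    from datetime import datetime, timedelta
    from azure.mgmt.datafactory.models import (
        TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
        TriggerPipelineReference, PipelineReference)

    # Fire 'CopyPipeline' every hour, starting a few minutes from now.
    trigger = ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency='Hour',
            interval=1,
            start_time=datetime.utcnow() + timedelta(minutes=15)),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name='CopyPipeline'))])

    adf_client.triggers.create_or_update(
        rg_name, df_name, 'HourlyTrigger', TriggerResource(properties=trigger))
    adf_client.triggers.start(rg_name, df_name, 'HourlyTrigger')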

Deploy SSIS packages to Azure

Data Factory can provision an Azure-SSIS Integration Runtime, so SSIS workloads can now be executed inside Azure Data Factory.
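
A hedged sketch of provisioning a small Azure-SSIS integration runtime with the same SDK; the node size and counts are illustrative, and a production setup would also configure an SSISDB catalog through ssis_properties:

    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, ManagedIntegrationRuntime,
        IntegrationRuntimeComputeProperties)

    # A minimal managed (Azure-SSIS) runtime: one small node,
    # one package execution at a time.
    ssis_ir = ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location='WestEurope',
            node_size='Standard_D2_v3',
            number_of_nodes=1,
            max_parallel_executions_per_node=1))

    adf_client.integration_runtimes.create_or_update(
        rg_name, df_name, 'DemoSsisIr',
        IntegrationRuntimeResource(properties=ssis_ir))

    # Starting the runtime is a long-running operation
    adf_client.integration_runtimes.start(rg_name, df_name, 'DemoSsisIr')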

Summary

We’ve noticed that Microsoft is putting an extra amount of effort to improve functionalities and usage of Integration tools in the cloud. Azure Data Factory is a great example of how a market analysis together with a customer survey lead to a great product / service. With Azure Data Factory version 2, there is even more flexibility and usage for a potential customer. We are using Azure Data Factory as it allows an agile and lean approach for a low price. We can also have a proof of concept ready for you in a jiffy and you wont have to spend a lot. As we are using Microsoft BizTalk Server quite often we appreciate direction and strategy of Azure Data Factory. Whether you need a simple data movement and synchronization or a complex orchestration, Azure Data Factory is a product with a lot of perspective. True, you’ll have to spend a little more time on a code, but it has an engine with a lot of potential. That’s a big strength of Azure Data Factory, we have recognized that and made Azure Data Factory part of our M-Files Migration Service.

Originally published at unitfly.com.
