Published in Azure Tutorials

Azure Data Factory Data Flow vs Azure Databricks

Image Reference : https://hevodata.com/learn/azure-data-factory-vs-databricks/

Introduction to Azure Data Factory and Azure Databricks

  • Azure Data Factory

Azure Data Factory (ADF) is a data integration service that runs ETL processes and orchestrates data movement at scale.

  • Azure Databricks

Azure Databricks provides a unified, collaborative platform where data engineers and data scientists can perform ETL and build machine learning models, with visualizations and dashboards.

With the basics of Azure Data Factory and Azure Databricks covered, let's compare the two in detail.

- Flexibility of Usage

With Databricks, we can use Python, Scala, R, or SQL in notebooks, all running on Apache Spark, to perform data engineering and data science activities.

ADF, by contrast, provides a drag-and-drop graphical user interface (GUI) for creating and maintaining data pipelines visually, which lets teams deliver pipelines faster.

- Ease of Coding

Although Azure Data Factory simplifies building ETL pipelines through its GUI tools, developers have less flexibility because they cannot modify the backend code.

Databricks, on the other hand, takes a programmatic approach that gives developers the flexibility to fine-tune code and optimize performance.

The biggest drawback of Databricks is that you must write code. Most BI developers are used to graphical ETL tools such as SSIS or Informatica, so switching to code involves a learning curve.

- Data Processing

Businesses typically use batch or stream processing when working with large volumes of data. Batch processing handles data in bulk, while stream processing handles either live (real-time) data or archived data (less than twelve hours old), depending on the application.
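The difference between the two modes can be illustrated with a minimal plain-Python sketch. It does not use any ADF or Spark API; the record shape and the function names `batch_totals` and `streaming_totals` are made up for illustration. The batch function sees the whole dataset at once, while the streaming function updates running totals and emits a result as each record arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (sensor id, reading).
Event = Tuple[str, float]


def batch_totals(events: List[Event]) -> Dict[str, float]:
    """Batch: the full dataset is available, so aggregate it in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Streaming: keep running totals and emit a snapshot after each event."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
        yield dict(totals)


events = [("a", 1.0), ("b", 2.0), ("a", 3.0)]

# One result, available only after all the data has been read.
batch_result = batch_totals(events)

# One result per incoming event, available while data is still arriving.
stream_snapshots = list(streaming_totals(iter(events)))
```

Once the stream has been fully consumed, the last snapshot matches the batch result. What changes is when intermediate answers become available.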

Image Reference : https://docs.microsoft.com/en-us/azure/data-factory/v1/data-factory-data-processing-using-batch

Both Azure Data Factory and Azure Databricks support batch processing and archive streaming, but Azure Data Factory does not support live streaming. Databricks supports both live and archive streaming through the Spark API.

Image Reference : https://databricks.com/blog/2018/07/19/simplify-streaming-stock-data-analysis-using-databricks-delta.html

- Cost

Azure Data Factory Data Flow costs more than Azure Databricks when processing big data. Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control, flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.
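To reason about the cost difference for a specific workload, it can help to write down how each service is billed. The sketch below assumes that data flows are charged per vCore-hour of the managed cluster, and that Azure Databricks is charged for the underlying virtual machines plus Databricks Units (DBUs). The function and parameter names are hypothetical, and every rate is an input to look up on the current Azure pricing pages, not a real price.

```python
def adf_data_flow_cost(vcores: int, hours: float, rate_per_vcore_hour: float) -> float:
    """Estimated data flow cost: cluster vCores x run time x vCore-hour rate."""
    return vcores * hours * rate_per_vcore_hour


def databricks_cost(
    nodes: int,
    hours: float,
    vm_rate_per_node_hour: float,
    dbu_per_node_hour: float,
    rate_per_dbu: float,
) -> float:
    """Estimated Databricks cost: each node pays for its VM and for its DBUs."""
    per_node_hour = vm_rate_per_node_hour + dbu_per_node_hour * rate_per_dbu
    return nodes * hours * per_node_hour
```

Plugging in the cluster sizes, run times, and current rates for a given job yields two comparable figures. The outcome depends entirely on those inputs, so the comparison is worth repeating per workload.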

Azure Databricks is based on Apache Spark and provides in-memory compute with language support for Scala, R, Python, and SQL. Data transformation and engineering can be done in notebooks with statements in different languages. That makes it a flexible technology for including advanced analytics and machine learning in the data transformation process. You can also run each step of the process individually in a notebook, so step-by-step debugging is easy. The same steps are visible during job execution, so it is easy to see if your job stops.

Azure Databricks clusters can be configured in a variety of ways, in both the number and the type of compute nodes. Choosing the right cluster configuration is an art form, but you can get quite close by setting the cluster to scale automatically within thresholds you define, based on the workload. A cluster can also be set to terminate automatically after it has been inactive for a certain time. When used with ADF, the cluster starts up when activities start, and parameters can be passed in and out from ADF. Azure Databricks is closely integrated with other Azure services, including Active Directory, Key Vault, and data storage options such as Blob Storage, Data Lake Storage, and SQL.
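As a rough illustration of the autoscaling and auto-termination settings, the sketch below builds a cluster specification as a Python dictionary. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`, `spark_version`, `node_type_id`) follow the Databricks Clusters API as I understand it. The helper `build_cluster_spec` is hypothetical, and the runtime version and node type are placeholders to replace with values available in your workspace.

```python
import json
from typing import Any, Dict


def build_cluster_spec(
    name: str,
    min_workers: int,
    max_workers: int,
    idle_minutes: int,
    spark_version: str,
    node_type_id: str,
) -> Dict[str, Any]:
    """Build a cluster spec that autoscales and shuts down when idle."""
    if min_workers < 1 or max_workers < min_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if idle_minutes < 0:
        raise ValueError("idle_minutes must not be negative")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Worker count floats between these bounds depending on the workload.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Terminate the cluster after this many minutes of inactivity.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec(
    name="etl-cluster",                # placeholder name
    min_workers=2,
    max_workers=8,
    idle_minutes=30,
    spark_version="13.3.x-scala2.12",  # placeholder runtime version
    node_type_id="Standard_DS3_v2",    # placeholder VM size
)
payload = json.dumps(spec, indent=2)
```

The resulting JSON is the kind of body a cluster-creation request would carry. Keeping the bounds in one small function makes it easier to review how far a cluster is allowed to grow and how long it may sit idle.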

Conclusion

Businesses continuously anticipate the growing demands of big data analytics as they look for new opportunities. With rising cloud adoption, organizations often face a dilemma when choosing between Azure Data Factory and Azure Databricks. If a company wants a no-code or low-code ETL pipeline for data integration, ADF is the better choice. Databricks, on the other hand, provides a unified analytics platform that integrates various ecosystems for BI reporting, data science, and machine learning, including MLflow.
