10 reasons to use Google Cloud Data Fusion for data integration

Sanjeev Garikapati
Google Cloud - Community
6 min read · Aug 6, 2023

The advent of groundbreaking technologies & the generation of unbounded data have enabled businesses to make informed decisions like never before. A plethora of data tools & technologies has evolved over time to empower users with varied skill sets. Of those, graphical UI data processing tools play a special role in letting users focus more on the business & less on the technical side of the spectrum.

Cloud Data Fusion (henceforth referred to as CDF in this article), the managed version of open-source CDAP, is GCP’s native solution that enables code-free deployment of ETL/ELT data pipelines. Given what CDF can offer, I believe it hasn’t garnered the traction it deserves & remains underrated, especially when organizations require visual point & click solutions. So, based on my experience, I want to enumerate 10 advantages of Data Fusion and a few considerations that can help your tool evaluation process.

Let’s dive in right away.

Photo by Roberto Nickson on Unsplash
1. Simplified & centralized management of data pipelines

  • Code-free UI based configurations enable not just engineers but also business users to perform end-to-end data processing intuitively.
Multiple UI based functionalities
  • Plenty of drag & drop connectors spanning GCP, Azure, AWS, on-prem services, SaaS & legacy systems make data integrations seamless. The full list of 100s of plugins can be accessed here.
  • Fully managed design & runtime environments empower users to focus more on business and less on operations.
  • Configurable & browsable UI based plugin properties, in-line plugin documentation, and click-away access to not just simple but also complex transformations such as “speech to text” make CDF very user friendly. (The following images are for reference.)
1. Intuitive UI based configurations 2. Detailed documentation 3. Easy access to complex transformations

2. Performant & scalable

A powerful Apache Spark execution environment paired with point & click CDAP combines the best of 2 open-source technologies.

  • Decoupled design-time and run-time execution environments make CDF highly scalable, performant and resilient.
  • Transformation Pushdown, which pushes some transformations down to BigQuery, makes it more performant & flexible.

3. Support for wide variety of use-cases

  • Efficiently supports not just batch workloads but also real-time use cases with replication & data streams. Supports both structured & unstructured data.
  • With 100s of plugins, CDF caters to diverse use-cases such as data integration, aggregation, cleansing, conditional logic, control flow, etc.
  • Provides multiple options to build appropriate solutions by leveraging features like multi-table, multi-file, multi-object, schema evolution, etc.
  • Great flexibility to build custom plugins or execute your own code in JavaScript, PySpark, Python, etc. Supports multiple destinations.
  • Support for simple to complex transformations and data processing such as “Speech to text”, pseudonymization, masking, file encoding, etc.
A sample flow with a complex plugin and 2 destinations

4. Economical and native availability with no lock-in

Native availability with no up-front contracts provides flexibility & acceleration with no lock-in, in contrast to many SaaS tools.

  • Separate costs for the CDF development and execution environments make it economical & very adaptable to various workloads.
  • The development environment is available in 3 editions, with an hourly starting price as low as $0.35 (at the time of this article).
  • Configurable policies such as “Max Idle time”, auto-scaling, skip-delete, and master-node & worker-node configurations make run-time Dataproc clusters very affordable and scalable to suit diverse workloads.
  • Unlimited user access to connectors & plugins across editions makes it more affordable and expandable, without hierarchical pricing models.
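
To put the hourly starting price in perspective, here is a rough back-of-the-envelope sketch. It assumes the instance runs around the clock for a 30-day month at the $0.35 hourly rate quoted above, and it ignores run-time Dataproc cluster costs, which are billed separately. The function name is purely illustrative.

```python
# Rough monthly estimate for an always-on development instance.
# Assumption: $0.35/hour (the starting price quoted in this article)
# and a 30-day month; Dataproc run-time costs are NOT included.
HOURLY_RATE_USD = 0.35


def monthly_instance_cost(hourly_rate: float, days: int = 30) -> float:
    """Return the cost in USD of running an instance 24 hours a day for `days` days."""
    return round(hourly_rate * 24 * days, 2)


print(f"Approx. monthly cost: ${monthly_instance_cost(HOURLY_RATE_USD):.2f}")
```

Actual pricing varies by edition and region, so treat the figure as an order-of-magnitude estimate only.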

5. Enterprise-grade security

Private instances, VPC Service Controls integration, Private Google Access, and VPC & network-layer controls can make Data Fusion fully private, allowing it to be part of the enterprise network.

  • Namespaces, role-based access, encrypted passwords and IAM integration make it more secure.

6. Hybrid and multi-cloud

  • Being powered by 2 popular 100% open-source technologies, Apache Spark & CDAP, allows CDF to be completely multi-cloud with no lock-in.
  • Configurable run-times such as GCP Dataproc, Hadoop and Amazon EMR make it truly hybrid & able to run across multiple clouds.
Some of the execution environment options

7. Governance

  • Time-variant & UI-based Lineage: Dataplex lineage integration, powered by versioned UI based lineage at both pipeline and field level, enables effective data lifecycle traceability and detailed impact analysis.
1. Pipeline Lineage 2. Field level Lineage 3. Impact analysis
  • Monitoring: Provides a decent interface to track job status, errors & warnings, and the number of records inserted & processed.
  • Logging: Cloud Logging integration, powered by detailed Spark job level, Dataproc and CDF logging, aids efficient root-cause analysis.

8. Reusability

Macros and run-time parameters make reusability a core feature of CDF, since virtually everything can be passed at runtime.

  • Argument setters and parameterized plugins like BigQuery Execute make CDF pipelines and logic thoroughly reusable across ELT & ETL use-cases.
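
To make the idea of macros concrete, here is a small conceptual sketch in Python. It is not CDAP’s implementation; it only mimics how a `${name}` placeholder in a plugin property could be replaced by a run-time argument. The property names, argument keys and the `resolve_macros` helper are all hypothetical.

```python
import re

# Matches the ${name} placeholder style used for macros in plugin properties.
MACRO = re.compile(r"\$\{([^}]+)\}")


def resolve_macros(properties: dict, runtime_args: dict) -> dict:
    """Replace every ${key} in the property values with its run-time argument.

    Illustrative only: a real pipeline resolves macros inside CDF at run time.
    """
    def lookup(match: re.Match) -> str:
        key = match.group(1)
        if key not in runtime_args:
            raise KeyError(f"no runtime argument supplied for macro '{key}'")
        return str(runtime_args[key])

    return {name: MACRO.sub(lookup, value) for name, value in properties.items()}


# Hypothetical sink configuration reused across runs with different arguments.
sink_properties = {"dataset": "${target_dataset}", "table": "sales_${run_date}"}
resolved = resolve_macros(
    sink_properties,
    {"target_dataset": "analytics", "run_date": "20230806"},
)
print(resolved)  # {'dataset': 'analytics', 'table': 'sales_20230806'}
```

The same pipeline definition can then serve many tables or dates simply by changing the arguments passed at run time.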

9. Native GCP integration

Native and deep integration with almost all GCP services makes CDF one of the best integration partners on Google Cloud.

  • Integration across the GCP data cloud through Dataplex facilitates end-to-end lineage across components like CDF, Cloud Composer and BigQuery.
  • Built-in service-account security, browsable data assets and native UI based support for services like BigQuery enable us to manage complex upsert, partitioning and clustering use cases at the click of a button.

10. Operational & orchestration support

  • REST APIs, pipeline triggers, time-based schedules, and integration with Cloud Composer & Dataform make CDF pipeline orchestration effective.
  • Manage pipelines through GitHub: Recent GitHub support for pipelines enables effective collaboration, auditing and restoring of pipelines.
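
As a rough sketch of REST-based orchestration, the snippet below builds (but does not send) a request that would start a deployed batch pipeline with run-time arguments. The URL pattern follows the CDAP lifecycle convention for batch pipelines (`DataPipelineWorkflow`) as I understand it, so please verify it against the current documentation. The endpoint, pipeline name, token and argument are hypothetical placeholders.

```python
import json
import urllib.request


def build_start_request(api_endpoint: str, pipeline: str, token: str,
                        runtime_args: dict, namespace: str = "default"):
    """Build a POST request that starts a batch pipeline with run-time arguments."""
    url = (f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
           f"/apps/{pipeline}/workflows/DataPipelineWorkflow/start")
    body = json.dumps(runtime_args).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are placeholders, not real resources.
req = build_start_request(
    api_endpoint="https://example.invalid/api",   # assumed instance API endpoint
    pipeline="daily_sales_load",                  # hypothetical pipeline name
    token="ACCESS_TOKEN",                         # e.g. an OAuth access token
    runtime_args={"run_date": "20230806"},
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

A scheduler such as Cloud Composer could issue the same kind of call, passing different run-time arguments per run.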

Let me close the loop by also talking about some of the considerations on CDF.

  1. Limited community support: As I mentioned at the start of the article, CDF is not very widely adopted, so community support is limited. (Hope this article mitigates it to some extent 😀). But you have great product team support covered by SLAs.
  2. Development environment can’t be paused: Though the runtime is ephemeral, the development environment can only be deleted, not paused, and so has to run continuously. Still, it’s relatively economical.
  3. No support for UI based bulk import of pipelines: Pipelines cannot be bulk imported through the UI, so reinstating an instance is tedious. This can be mitigated with GitHub-managed pipelines.
  4. Cold start in case of ephemeral clusters: We need to account for a couple of minutes of cold start for ephemeral clusters. So persistent clusters, which come at a cost, are better for real-time processing.

Hope this article has given you a perspective on the advantages & shortcomings to weigh while considering Data Fusion for your data integration needs.

Sanjeev Garikapati
Google Cloud - Community

A data nerd passionate about helping clients make the most of their data. I love to learn emerging technologies and apply them to business cases.