10 reasons to use Google Cloud Data Fusion for data integration
The advent of groundbreaking technologies & the generation of unbounded data have enabled businesses to make informed decisions like never before. A plethora of data tools & technologies has evolved over time to empower users of varied skill sets. Of those, graphical UI data processing tools play a special role in letting users focus more on business & less on the technical side of the spectrum.
Cloud Data Fusion (henceforth referred to as CDF in this article), the managed version of the open-source CDAP platform, is GCP’s native solution for code-free deployment of ETL/ELT data pipelines. Given what CDF can offer, I believe it hasn’t garnered the traction it deserves & remains underrated, especially when organizations require visual point & click solutions. So, based on my experience, I want to enumerate 10 advantages of Data Fusion and a few considerations that can help your tool evaluation process.
Let’s dive right in.
1. Simplified & centralized management of data pipelines
- Code-free, UI-based configuration enables not just engineers but also business users to perform end-to-end data processing intuitively.
- Plenty of drag & drop connectors spanning GCP, Azure, AWS, on-prem services, SaaS & legacy systems make data integration seamless. The full list of hundreds of plugins can be accessed here.
- Fully managed design & runtime environments empower users to focus more on business and less on operations.
- Configurable & browsable UI-based plugin properties, in-line plugin documentation, and one-click access to transformations ranging from the simple to complex “speech to text” conversions make CDF very user friendly.
2. Performant & scalable
A powerful Apache Spark execution environment paired with point & click CDAP brings together the best of 2 open-source technologies.
- Decoupled design-time and runtime execution environments make CDF highly scalable, performant and resilient.
- Transformation Pushdown, which delegates eligible transformations such as joins and aggregations to BigQuery, makes it even more performant & flexible (a sketch of toggling it follows).
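For illustration, pushdown can also be toggled outside the UI by editing a pipeline spec exported from the Studio. This is a minimal sketch, assuming a CDAP 6.x-style `pushdownEnabled` flag in the pipeline config; the file name and exact field names are illustrative and may vary by version.

```python
import json

# Pipeline spec previously exported from the CDF Studio UI;
# "my_pipeline.json" is a hypothetical file name.
with open("my_pipeline.json") as f:
    pipeline = json.load(f)

# Assumption: recent CDAP releases expose a pushdownEnabled flag in the
# pipeline config; exact field names may differ in your version.
pipeline["config"]["pushdownEnabled"] = True

with open("my_pipeline_pushdown.json", "w") as f:
    json.dump(pipeline, f, indent=2)

# Re-import the modified JSON through the Studio UI or the REST API.
```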
3. Support for a wide variety of use-cases
- Efficiently supports not just batch workloads but also real-time use cases through replication & streaming pipelines, and handles structured & unstructured data alike.
- With hundreds of plugins, CDF caters to diverse use-cases such as data integration, aggregation, cleansing, conditional logic, control flow, etc.
- Provides multiple options to build the right solution by leveraging features like multi-table, multi-file and multi-object handling, schema evolution, etc.
- Great flexibility to build custom plugins or execute your own code in JavaScript, PySpark, Python, etc., with support for multiple destinations (see the Python Evaluator sketch after this list).
- Support for simple to complex transformations and data processing such as “speech to text”, pseudonymization, masking, file encoding, etc.
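To illustrate the “execute your own code” option, below is a minimal sketch of a Python Evaluator transform body. CDF calls a transform(record, emitter, context) function for each record; the email-masking logic here is purely illustrative.

```python
# Body of a CDF "Python Evaluator" transform; it is pasted into the
# plugin's properties in the Studio rather than run standalone.
def transform(record, emitter, context):
    """Mask the email field of each incoming record (illustrative)."""
    email = record.get('email')
    if email and '@' in email:
        user, domain = email.split('@', 1)
        # Keep the first character of the local part, mask the rest.
        record['email'] = user[0] + '***@' + domain
    emitter.emit(record)
```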
4. Economical & natively available with no lock-in
Native availability with no up-front contracts provides flexibility & acceleration without lock-in, unlike many SaaS tools.
- Segregated pricing for the CDF development and execution environments makes it economical & adaptable to various workloads.
- The development environment is available in 3 editions, with hourly pricing starting as low as $0.35 (at the time of this article).
- Configurable policies such as “Max idle time”, autoscaling, skip-delete, and master-node & worker-node configurations make runtime Dataproc clusters very affordable and scalable to suit diverse workloads.
- Unlimited user access to connectors & plugins across editions makes it more affordable and expandable, without tiered pricing models.
5. Enterprise-grade security
Private instances, VPC Service Controls integration, Private Google Access, and VPC & network-layer controls can make Data Fusion fully private, allowing it to operate as part of the enterprise network.
- Namespaces, role-based access, encrypted passwords & IAM integration make it more secure.
6. Hybrid and multi-cloud
- Being powered by 2 popular, 100% open-source technologies, Apache Spark & CDAP, makes CDF completely multi-cloud with no lock-in.
- Configurable runtimes such as GCP Dataproc, Hadoop & Amazon EMR make it truly hybrid, able to run across multiple clouds.
7. Governance
- Time-variant & UI-based lineage: Dataplex lineage integration, powered by versioned UI-based lineage at both pipeline and field level, enables effective data-lifecycle traceability and detailed impact analysis.
- Monitoring: provides a decent interface to track job status, errors & warnings, and the number of records inserted & processed (a REST-based sketch follows this list).
- Logging: Cloud Logging integration, powered by detailed Spark job-level, Dataproc and CDF logs, makes root-cause analysis efficient.
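The run information shown in the UI is also available programmatically. Here is a minimal sketch assuming illustrative values for the instance API endpoint and pipeline name; batch pipelines run as the `DataPipelineWorkflow` program in CDAP.

```python
import subprocess
import requests

# Illustrative values: replace with your instance's API endpoint,
# namespace and pipeline name.
ENDPOINT = "https://my-instance-dot-usw1.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PIPELINE = "my-pipeline"

# Authenticate with the caller's gcloud credentials.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
headers = {"Authorization": f"Bearer {token}"}

# Batch pipelines are CDAP workflows named DataPipelineWorkflow.
url = (f"{ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
       f"/workflows/DataPipelineWorkflow/runs")
for run in requests.get(url, headers=headers).json():
    print(run["runid"], run["status"], run.get("start"), run.get("end"))
```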
8. Reusability
Macros and runtime parameters make reusability a core feature of CDF; virtually everything can be passed at runtime.
- Argument setters and parameterized plugins like BigQuery Execute make CDF pipelines and logic thoroughly reusable across ELT & ETL use-cases (see the sketch below).
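As an example of macros in action, a pipeline whose source table is a macro such as ${table} can be started with different runtime arguments on every run. A minimal sketch using the CDAP REST start call; the endpoint, pipeline name and ${table} macro are illustrative.

```python
import subprocess
import requests

# Illustrative values for the instance endpoint and pipeline name.
ENDPOINT = "https://my-instance-dot-usw1.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PIPELINE = "my-reusable-pipeline"  # uses a ${table} macro in its source

token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

# Runtime arguments resolve the pipeline's macros for this run only,
# so one deployed pipeline can serve many tables.
resp = requests.post(
    f"{ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
    f"/workflows/DataPipelineWorkflow/start",
    headers={"Authorization": f"Bearer {token}"},
    json={"table": "sales_2023"},
)
resp.raise_for_status()
```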
9. Native GCP integration
Native, deep integration with almost all GCP services makes CDF one of the best integration partners on Google Cloud.
- Integration across the GCP data cloud through Dataplex facilitates end-to-end lineage across components like CDF, Cloud Composer & BigQuery.
- Built-in service-account security, browsable data assets, and native UI-based support for services like BigQuery let us manage complex upsert, partitioning and clustering use-cases with the click of a button.
10. Operational & orchestration support
- REST APIs, pipeline triggers, time-based schedules, and integration with Cloud Composer & Dataform make CDF pipeline orchestration effective (a Composer sketch follows this list).
- Manage pipelines through GitHub: the recently added GitHub support for pipelines makes collaboration, auditing and restoration of pipelines effective.
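For Cloud Composer, the Google provider package for Airflow ships a dedicated operator for starting CDF pipelines from a DAG. A minimal sketch; the instance, region, pipeline name and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)

# Illustrative names: replace with your instance, region and pipeline.
with DAG(
    dag_id="cdf_daily_load",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    start_pipeline = CloudDataFusionStartPipelineOperator(
        task_id="start_cdf_pipeline",
        instance_name="my-cdf-instance",
        location="us-west1",
        pipeline_name="my-pipeline",
        # Runtime arguments resolve any macros in the pipeline.
        runtime_args={"table": "sales_2023"},
    )
```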
Let me close the loop by also covering some considerations around CDF.
- Limited community support: as I mentioned at the start of the article, CDF is not very widely adopted, so community support is limited (hope this article mitigates that to some extent 😀). You do, however, get great product-team support covered by SLAs.
- Development environment can’t be paused: though the runtime is ephemeral, the development environment can only be deleted, not paused, and so has to run continuously. Still, it is relatively economical.
- No support for UI-based bulk import of pipelines: pipelines cannot be bulk imported through the UI, so reinstating an instance is tedious. This can be mitigated with GitHub-managed pipelines or the REST API (see the sketch after this list).
- Cold start in case of ephemeral clusters: we need to account for a couple of minutes of cold-start time for ephemeral clusters, so persistent clusters, which come at a cost, are better suited for real-time processing.
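On the bulk import point, the CDAP REST API can fill the gap: list the deployed apps in a namespace and save each one’s detail, which includes its configuration, for later re-import. A minimal sketch; the endpoint and file paths are illustrative.

```python
import json
import subprocess
import requests

# Illustrative values: replace with your instance's API endpoint.
ENDPOINT = "https://my-instance-dot-usw1.datafusion.googleusercontent.com/api"
NAMESPACE = "default"

token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()
headers = {"Authorization": f"Bearer {token}"}

# List every deployed app (pipeline) in the namespace ...
apps = requests.get(
    f"{ENDPOINT}/v3/namespaces/{NAMESPACE}/apps", headers=headers
).json()

# ... and save each app's detail so pipelines can be restored into a
# fresh instance later.
for app in apps:
    detail = requests.get(
        f"{ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{app['name']}",
        headers=headers,
    ).json()
    with open(f"{app['name']}.json", "w") as f:
        json.dump(detail, f, indent=2)
```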
Hope this article has given you a perspective on the advantages & shortcomings to weigh while considering Data Fusion for your data integration needs.