Data Fusion Basics

Nilesh Jaiswal
Google Cloud - Community
4 min read · Dec 21, 2022

Fully managed, cloud-native data integration at any scale.

Cloud Data Fusion is a GUI-based data integration service for building and managing data pipelines. It is based on CDAP, an open-source framework for building data analytics applications for on-premises and cloud sources. It provides a wide variety of out-of-the-box connectors to sources on GCP, other public clouds, and on-premises systems.

Let's look at a few of the challenges that we often observe with ETL processing:

  • Disparate data assets in cloud and on-prem
  • Integration requires deep technical expertise and expensive resources
  • Slow business decision making
  • Repetition and silos

Data Fusion addresses the above challenges by offering:

  • Graphical, code-free interface that provides simplicity and enables a non-technical audience
  • Unified view over all data
  • Hundreds of built-in cloud and on-premises connectors
  • Standardization via extensibility and reusability

Data Fusion has two main focus areas for simplifying ETL processing:

  1. Build a data pipeline without writing any code: because Data Fusion is built on top of the open-source CDAP project, it already comes with more than 100 connectors, and the list is constantly growing. Building a pipeline between a source and a sink requires only a few clicks.
  2. Do transformations without writing any code: Data Fusion comes with a set of built-in transformations that you can seamlessly apply to your data.

Beyond the capability to create code-free, GUI-based pipelines, Data Fusion also provides features for visual data profiling and preparation, simple orchestration, and granular lineage for pipelines.

Data Fusion Capabilities

Now we will go through some basic concepts of Data Fusion to understand it better.

Cloud Data Fusion instance

A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion.

You can create multiple instances in a single Google Cloud project and specify the Google Cloud region in which to create them.

Based on your requirements and cost constraints, you can create a Developer, Basic, or Enterprise instance.

Each Cloud Data Fusion instance is a unique, independent deployment that contains a set of services handling pipeline lifecycle management, orchestration, coordination, and metadata management. These services run on long-running resources in a tenant project.
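As a quick illustration, here is a minimal sketch of creating a Basic-edition instance through the Cloud Data Fusion REST API from Python; the project, region, and instance names are placeholders, and you would normally do this from the console or gcloud instead.

    import google.auth
    import google.auth.transport.requests
    import requests

    PROJECT = "my-project"     # placeholder
    LOCATION = "us-central1"   # placeholder
    INSTANCE = "my-instance"   # placeholder

    # Obtain an OAuth2 access token from Application Default Credentials.
    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    creds.refresh(google.auth.transport.requests.Request())

    # POST .../instances?instanceId=... creates the instance; "type" selects
    # the edition: DEVELOPER, BASIC, or ENTERPRISE.
    url = (f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
           f"/locations/{LOCATION}/instances?instanceId={INSTANCE}")
    resp = requests.post(url,
                         headers={"Authorization": f"Bearer {creds.token}"},
                         json={"type": "BASIC"})
    resp.raise_for_status()
    print(resp.json()["name"])  # name of the long-running create operation

Instance creation is asynchronous, so the response is a long-running operation that you can poll until the instance reaches the RUNNING state.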

Execution environment

Cloud Data Fusion creates ephemeral execution environments to run pipelines, whether you run them manually, on a time schedule, or through a pipeline-state trigger. Cloud Data Fusion supports Dataproc as an execution environment, in which you can choose to run pipelines as MapReduce, Spark, or Spark Streaming programs.

(Figure: Data Fusion execution environment)

Pipeline

A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources. Building pipelines lets you create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs.

Pipeline node

In the Studio page of the Cloud Data Fusion UI, pipelines are represented as a series of nodes arranged in a directed acyclic graph (DAG), forming a one-way flow. Nodes represent the various actions that you can take with your pipelines, such as reading from sources, performing data transformations, and writing output to sinks. You can develop data pipelines in the Cloud Data Fusion UI by connecting together sources, transformations, sinks, and other nodes.
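To make the DAG idea concrete, the snippet below is an illustrative, heavily trimmed skeleton of the JSON spec you get when you export a batch pipeline from Studio, written as a Python dict. The stage names are placeholders, the plugin properties are left empty, and real exports carry many more fields.

    # Illustrative skeleton of an exported batch pipeline spec (trimmed).
    pipeline = {
        "name": "gcs_to_bq",  # placeholder pipeline name
        "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
        "config": {
            # Nodes of the DAG: each stage wraps a plugin.
            "stages": [
                {"name": "Read",
                 "plugin": {"name": "GCSFile", "type": "batchsource",
                            "properties": {}}},   # real specs set paths, format, ...
                {"name": "Wrangle",
                 "plugin": {"name": "Wrangler", "type": "transform",
                            "properties": {}}},   # ... directives, schema, ...
                {"name": "Write",
                 "plugin": {"name": "BigQueryTable", "type": "batchsink",
                            "properties": {}}},   # ... dataset, table, ...
            ],
            # Edges of the DAG: a one-way flow between stage names.
            "connections": [
                {"from": "Read", "to": "Wrangle"},
                {"from": "Wrangle", "to": "Write"},
            ],
        },
    }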

Plugin

A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion. Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions.
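Under the hood, plugins are packaged as CDAP artifacts, so one way to see what an instance has available is to list the artifacts in a namespace over the instance's CDAP REST API. This is a sketch: the base URL comes from your instance's apiEndpoint field, and the token is assumed to be a valid OAuth2 access token (for example from gcloud auth print-access-token).

    import requests

    API = "https://<instance-endpoint>/api"  # placeholder: the instance's apiEndpoint
    TOKEN = "<access-token>"                 # placeholder

    # List the plugin artifacts visible in the 'default' namespace.
    resp = requests.get(f"{API}/v3/namespaces/default/artifacts",
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    for artifact in resp.json():
        print(artifact["name"], artifact["version"])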

Compute profile

A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the physical execution environment of a pipeline.
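For example, profiles can be listed per namespace through the CDAP REST API, and a specific profile can be selected at run time with the system.profile.name runtime argument. A sketch, with the same placeholder endpoint and token as above:

    import requests

    API = "https://<instance-endpoint>/api"  # placeholder: the instance's apiEndpoint
    TOKEN = "<access-token>"                 # placeholder

    # List the compute profiles defined in the 'default' namespace.
    resp = requests.get(f"{API}/v3/namespaces/default/profiles",
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    for profile in resp.json():
        print(profile["name"])

    # When starting a pipeline, pass {"system.profile.name": "USER:<profile>"}
    # as a runtime argument to run it on that profile instead of the default.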

User Interface

(Figure: Data Fusion user interface)

Namespaces

You can use namespaces to partition a Cloud Data Fusion instance to achieve application and data isolation in your design and execution environments.
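Namespaces are normally managed from the UI, but as a sketch, the CDAP REST API also exposes them; the namespace name and description below are assumptions.

    import requests

    API = "https://<instance-endpoint>/api"  # placeholder: the instance's apiEndpoint
    TOKEN = "<access-token>"                 # placeholder

    # Create a 'dev' namespace to isolate development pipelines.
    resp = requests.put(f"{API}/v3/namespaces/dev",
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        json={"description": "isolated dev pipelines"})
    resp.raise_for_status()
    print(resp.text)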

Hub

In the Cloud Data Fusion UI, you can click Hub to browse plugins, sample pipelines, and other integrations.

Pipeline execution

(Figure: a sample Data Fusion pipeline)

The sample pipeline above is deployed in a namespace called 'Data Pipeline-Batch'. It has multiple pipeline nodes, each performing a specific action to read, write, or transform the data. Once the pipeline is ready, it can be run on a Dataproc cluster; a minimal sketch of triggering such a run through the REST API follows.
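Putting it together, here is a sketch of starting a deployed batch pipeline and polling its runs through the CDAP REST API. The pipeline name and namespace are assumptions; batch pipelines expose their runs through a workflow named DataPipelineWorkflow.

    import time
    import requests

    API = "https://<instance-endpoint>/api"  # placeholder: the instance's apiEndpoint
    TOKEN = "<access-token>"                 # placeholder
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}
    BASE = (f"{API}/v3/namespaces/default/apps/gcs_to_bq"
            "/workflows/DataPipelineWorkflow")

    # Kick off a run; runtime arguments (if any) go in the JSON body.
    requests.post(f"{BASE}/start", headers=HEADERS, json={}).raise_for_status()

    # Poll the most recent run until it reaches a terminal state.
    while True:
        runs = requests.get(f"{BASE}/runs", headers=HEADERS).json()
        status = runs[0]["status"] if runs else "STARTING"
        print(status)
        if status in ("COMPLETED", "FAILED", "KILLED"):
            break
        time.sleep(30)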

Hope you found this article helpful! Happy reading!
