Published in gft-engineering

My journey with Google Cloud Data Fusion

Photo by Joshua Sortino on Unsplash

Why Data Fusion?

What would be your choice?

  • when you need a fully managed Google Cloud Platform product for building ETL pipelines?
  • when minimizing time to market matters?
  • when you need to ingest data from diverse data sources?
  • when you have both batch and real-time processing needs?
  • last but not least, when your data engineering team needs something easy to maintain?

How about Apache NiFi?

Apache NiFi is an open-source data ingestion tool for processing and distributing data flows between diverse systems. Its user-friendly interface supports drag-and-drop integration. However, NiFi has some drawbacks. First, it is not cloud-native, and as a data pipeline becomes more complex it may need more maintenance, which can be a pain point. Second, compared with other ETL/ELT platforms, NiFi is primarily a data ingestion tool that focuses on moving data quickly from source to destination. Third, building custom processors, when they are required, can be a real headache.

How about Dataflow?

Maybe Cloud Data Fusion!

In this article, I will share my journey with Google Cloud Data Fusion. Initially, we considered this product to be the best choice for satisfying our client’s requirements. However, is it truly the best option? It is worth doing a deep dive into the details to find out.

What is Data Fusion?

Photo by Google Cloud
Dataset-level lineage
Field-level lineage

Get started with Data Fusion

Permissions

Before implementing the data pipeline, it is vital to grant all the necessary permissions if you are not the project owner!

  • Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com exists and has at least the Editor role.
  • Under IAM & admin > Service Accounts, check that the Data Fusion service account (service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com) has the Service Account User or Cloud Data Fusion API Service Agent role. The correct role must be set (see the image below) because it is required to launch Dataproc clusters when creating the instance.
  • Once the Data Fusion instance is created, navigate to IAM & admin > IAM and assign the Data Fusion service account the “Cloud Data Fusion API Service Agent” role.
Service account
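
For readers who prefer the command line to the console, the role assignments above can be sketched as a small Python helper that only prints the corresponding gcloud commands. This is a rough sketch, not the setup we actually ran: the project ID and project number are placeholders, and the role IDs (roles/editor, roles/datafusion.serviceAgent) are my mapping of the role names shown in the console, so verify them in your own project before running anything.

```python
# Sketch only: builds gcloud command strings, it does not execute them.
# "my-project" and "123456789012" are placeholder values (assumptions).


def compute_service_account(project_number: str) -> str:
    """Email of the Compute Engine default service account."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def data_fusion_service_account(project_number: str) -> str:
    """Email of the Data Fusion service account."""
    return f"service-{project_number}@gcp-sa-datafusion.iam.gserviceaccount.com"


def binding_command(project_id: str, member_email: str, role: str) -> str:
    """One 'gcloud projects add-iam-policy-binding' command as a string."""
    return (
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member=serviceAccount:{member_email} --role={role}"
    )


def permission_commands(project_id: str, project_number: str) -> list[str]:
    """Commands for the two role assignments described in the list above."""
    return [
        # Assumed role ID for the console's "Editor" role.
        binding_command(
            project_id, compute_service_account(project_number), "roles/editor"
        ),
        # Assumed role ID for "Cloud Data Fusion API Service Agent".
        binding_command(
            project_id,
            data_fusion_service_account(project_number),
            "roles/datafusion.serviceAgent",
        ),
    ]


if __name__ == "__main__":
    for command in permission_commands("my-project", "123456789012"):
        print(command)
```

Printing the commands instead of running them keeps the sketch safe to try: you can review each binding before pasting it into a shell.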

Batch Processing

First, create the Data Fusion instance. Three editions are available: Developer, Basic, and Enterprise. For details on the differences and pricing, see https://cloud.google.com/data-fusion/pricing.
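
We created our instance through the console, but the same choice of edition can be expressed as a request to the Data Fusion REST API. The sketch below only builds the URL and JSON body and sends nothing. It assumes the v1 instances.create endpoint with an instanceId query parameter and a type field taking DEVELOPER, BASIC, or ENTERPRISE; the project, location, and instance names are placeholders.

```python
import json
from urllib.parse import urlencode

# Edition names from the article mapped to the API's instance "type" values
# (assumed enum names; check the API reference before relying on them).
EDITIONS = {"developer": "DEVELOPER", "basic": "BASIC", "enterprise": "ENTERPRISE"}


def create_instance_request(
    project_id: str, location: str, instance_id: str, edition: str
) -> tuple[str, str]:
    """Return (url, json_body) for an instance-creation POST request."""
    key = edition.lower()
    if key not in EDITIONS:
        raise ValueError(f"unknown edition: {edition!r}")
    url = (
        "https://datafusion.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/instances?"
        + urlencode({"instanceId": instance_id})
    )
    body = json.dumps({"type": EDITIONS[key]})
    return url, body


if __name__ == "__main__":
    # Placeholder identifiers, not values from our project.
    url, body = create_instance_request(
        "my-project", "europe-west1", "demo-instance", "Basic"
    )
    print("POST", url)
    print(body)
```

A real call would also need an OAuth access token in the Authorization header, which is left out here on purpose.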

Data Fusion Web UI Home Page
Configuration of connections
Import the data from the Sample Buckets
  1. Some columns show a two-colored bar above them that indicates how complete the data in the column is. The red portion of the bar shows the proportion of null values in that column.
  2. You can also apply transformations from the CLI (command-line interface). As you start typing a command, autocomplete suggests matching commands.
  3. After all transformations are applied, click Create a Pipeline. A selection box will prompt you to create a batch pipeline or a real-time pipeline.
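
To make the completeness bar from step 1 concrete, here is a minimal sketch of the kind of calculation behind it: the share of null values in one column. It is not how Data Fusion computes the bar internally. The sample rows are invented, and treating empty strings as null is my assumption.

```python
# Invented sample data; None and "" both count as missing (assumption).
SAMPLE_ROWS = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "email": None},
    {"name": "Cleo", "email": "cleo@example.com"},
    {"name": "Dev", "email": "dev@example.com"},
]


def null_fraction(rows: list[dict], column: str) -> float:
    """Fraction of rows in which `column` is missing, None, or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


if __name__ == "__main__":
    for column in ("name", "email"):
        print(f"{column}: {null_fraction(SAMPLE_ROWS, column):.0%} null")
```

In this sample, one of the four email values is missing, so a quarter of that column's bar would be red while the name column's bar would show no red.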

Real-Time Processing

The steps for building a real-time pipeline are very similar. Here, we will build a pipeline in which Dataflow publishes data to Pub/Sub. We create a connection to Pub/Sub in Data Fusion, use Wrangler to transform the data, and finally store the results in GCS.
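
The shape of this pipeline (messages in, a transformation step, files out) can be simulated in a few lines of plain Python. This is only a toy model and not Data Fusion code: the message fields, the transformation, the output object names, and the idea of grouping messages into small batches before writing are all assumptions made for illustration.

```python
import json


def transform(record: dict) -> dict:
    """Stand-in for the Wrangler step: trim names, upper-case country codes."""
    return {"user": record["user"].strip(), "country": record["country"].upper()}


def micro_batches(messages: list[str], batch_size: int):
    """Yield consecutive slices of at most `batch_size` messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]


def run_pipeline(messages: list[str], batch_size: int = 2) -> dict[str, str]:
    """Return a dict that plays the role of the GCS bucket: name -> content."""
    sink = {}
    for index, batch in enumerate(micro_batches(messages, batch_size)):
        rows = [transform(json.loads(message)) for message in batch]
        # One newline-delimited JSON object per batch (hypothetical naming).
        sink[f"output/batch-{index:04d}.json"] = "\n".join(
            json.dumps(row, sort_keys=True) for row in rows
        )
    return sink


if __name__ == "__main__":
    incoming = [
        '{"user": " ana ", "country": "es"}',
        '{"user": "ben", "country": "de"}',
        '{"user": "jan", "country": "pl"}',
    ]
    for name, content in run_pipeline(incoming).items():
        print(name)
        print(content)
```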

Creating the ephemeral cluster
Pipeline running

Learnings from my journey

Advantages

  • Fully managed and cloud-native
From GCP Next’19
  • Code-free pipeline development
  • Diverse plugins and hybrid enablement
  • Efficient building of batch processing pipelines

Disadvantages

  • Real-time processing performance
  • Considerable cost
  • Wasted resources
  • Black box


GFT is driving the digital transformation of the world’s leading companies. On here, our tech communities from all around the globe share their tips, tricks & insights with other developers.
