What is Google Cloud Data Fusion?
Cloud Data Fusion is the brand new, fully managed data engineering product from Google Cloud. It helps users efficiently build and manage ETL/ELT data pipelines. Data Fusion intends to shift the focus away from code, where data engineers can spend days or weeks building connectors from a source to a sink, and toward insights and action. Built on top of the open-source project CDAP, it offers a convenient user interface for building data pipelines in a ‘drag and drop’ manner.
Note: this is a cross-post from the Fourcast blog. Read more there.
Data Fusion comes at a time when companies struggle to deal with huge amounts of data spread across many data sources and to fuse it into a central data warehouse. The key challenges of integrating all this data are as follows:
Data Fusion addresses these challenges by making it extremely easy to move data around, with two main focuses:
- Build data pipelines without writing any code: as Data Fusion is built on top of the open-source CDAP project, it already comes with more than 100 connectors, and that number keeps growing. Building a pipeline between a source and a sink therefore requires only a few clicks.
- Apply transformations without writing any code: Data Fusion comes with a set of built-in transformations that you can seamlessly apply to your data.
The following screenshot shows the interface with a simple pipeline. The first step is the connector to the raw database. Next comes a wrangling step that transforms a set of columns. Finally, the data is sent to two sinks: BigQuery for analytics and Cloud Storage for backup.
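To make the shape of that pipeline concrete, here is a minimal Python sketch that models it as a small graph of stages and connections and works out a valid run order. This is not the Data Fusion or CDAP API: the stage names, the `type` labels, and the `execution_order` and `sinks` helpers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical, simplified description of the pipeline in the screenshot.
# Field names are illustrative assumptions, not an actual export schema.
pipeline = {
    "stages": [
        {"name": "raw_database", "type": "source"},
        {"name": "wrangler", "type": "transform"},
        {"name": "bigquery", "type": "sink"},
        {"name": "cloud_storage", "type": "sink"},
    ],
    "connections": [
        {"from": "raw_database", "to": "wrangler"},
        {"from": "wrangler", "to": "bigquery"},
        {"from": "wrangler", "to": "cloud_storage"},
    ],
}


def execution_order(pipeline):
    """Return stage names so that every stage comes after its inputs (Kahn's algorithm)."""
    names = [stage["name"] for stage in pipeline["stages"]]
    downstream = defaultdict(list)
    pending_inputs = {name: 0 for name in names}
    for connection in pipeline["connections"]:
        downstream[connection["from"]].append(connection["to"])
        pending_inputs[connection["to"]] += 1

    # Stages with no inputs (the sources) can run first.
    ready = deque(name for name in names if pending_inputs[name] == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in downstream[stage]:
            pending_inputs[nxt] -= 1
            if pending_inputs[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(names):
        raise ValueError("pipeline contains a cycle")
    return order


def sinks(pipeline):
    """Return the names of all sink stages."""
    return [stage["name"] for stage in pipeline["stages"] if stage["type"] == "sink"]


if __name__ == "__main__":
    print(execution_order(pipeline))
    print(sinks(pipeline))
```

The point of the sketch is that the wrangling step fans out to both sinks, so one transformation feeds the analytics copy and the backup copy alike. In Data Fusion itself you draw this graph in the UI rather than writing it by hand.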
Other relevant features of Data Fusion, as described by one of the early adopters:
- Open-source: as mentioned above, it’s built on top of CDAP and therefore benefits from a large community that keeps developing new connectors.
- Accessible: thanks to the user interface, Data Fusion does not require you to have any kind of coding background.
- Metadata: search integrated datasets by technical and business metadata. Track lineage for all integrated datasets at the dataset and field level.
- Flexible: if you can’t do something through the UI, Data Fusion is extensible and you can add your own code to it.
- GCP-native: fully managed, GCP-native architecture unlocks the scalability, reliability, security and privacy guarantees of Google Cloud.
Below is a list of business challenges where Data Fusion will excel:
Data Fusion provides a fabric that lets users fuse many of the different technologies and products available on GCP in a much easier, more accessible, secure and efficient manner, as shown in the following chart.
Data Fusion is the new backbone for data analytics and, in the months to come, will become a major game-changer for data engineering. Our engineers at Fourcast are already familiar with this new GCP product and will be glad to give you a demo.
Just send us a message at info@fourcast.io and we will get back to you.
For more blog posts about the latest Google Cloud & analytics technologies, check out the blog of Fourcast, a premier Google Cloud partner!