Dataplex — An intelligent Data Fabric | Data Governance at Scale| Google Cloud | Part — 1 | Overview

Nishit Kamdar
Google Cloud - Community
6 min read · Sep 24, 2023

Background:

Data continues to be one of the most important assets of any enterprise. It is essential for making informed decisions, improving efficiency, and providing a competitive edge.

Over the last decade, trends such as big data, digital transformation, smartphones and wearables, IoT devices and AI have accelerated, and they are pushing the boundaries of data analytics architectures.

Data volumes are growing massively. Data is stored across multiple repositories and multiple clouds, and it is used from edge to cloud, from applications to BI and AI.

With such unprecedented growth of data, one of the biggest challenges enterprises face is “How do I govern it at scale?”

Challenges with the legacy approach:

With such a diverse, distributed, and complex data landscape, legacy data governance techniques often fail or become too cumbersome to be effective.

The data architectures of today look like the picture above. Data assets are stored across a multitude of storage systems (object stores, OLTP databases, lakes, warehouses, marts, NoSQL DBs, graph DBs, vector DBs, etc.) rather than in one data lake or one large data warehouse.

The legacy approach of managing data at the level of an individual product or service therefore often leads to silos, inconsistent security, redundant metadata and disjoint governance models. What is needed is a unified model for data governance that works across all the solution components.

Enter Data Fabric and Dataplex!

To solve these data governance challenges of the modern data landscape, the industry has pivoted towards a Data Fabric model for data governance (as described by Gartner in this paper).

Data Fabric is an architecture that facilitates the end-to-end integration and management of various data assets, using intelligent and automated systems to manage, monitor, and govern your data.

Dataplex

Google Cloud Dataplex is an intelligent Data Fabric that provides a way to centrally manage, monitor, and govern your data across data lakes, data warehouses and data marts, and make this data securely accessible to a variety of analytics and data science tools.

Dataplex provides an integrated analytics experience, bringing together the best of Google Cloud and open source tools, so you can rapidly curate, secure, integrate, and analyze data at scale. With built-in data intelligence using Google Artificial Intelligence (AI) and machine learning (ML) capabilities and a flexible consumption model, you can now spend less time wrestling with infrastructure and more time focused on driving business outcomes.

Dataplex Conceptual Architecture:

Dataplex is designed as a “Single pane of glass for management” for all your data across GCP.

As shown in the picture below, it integrates with the various underlying data storage systems and provides a governance layer on top to manage, secure and govern all the data assets centrally. This unifies the operating model for implementing data governance at scale across your data estate.

Dataplex — Key Features:

1. Data organization and life cycle management

One of the core tenets of Dataplex is letting you organize and manage your data in a way that makes sense for your business, without data movement or duplication. Dataplex provides logical constructs such as lakes, data zones and assets to abstract away the underlying storage systems. This logical organization becomes the foundation for setting policies around data access, security, lifecycle management, and so on.
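To make the lake, zone and asset hierarchy concrete, here is a minimal sketch in plain Python. It is not the Dataplex API: the class names, the `asset_paths` helper and the storage URIs are all hypothetical, chosen only to show that an asset is a pointer to existing storage rather than a copy of the data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A reference to existing storage; nothing is moved or duplicated."""
    name: str
    resource: str  # hypothetical URI of a bucket or dataset


@dataclass
class Zone:
    name: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    name: str
    zones: List[Zone] = field(default_factory=list)

    def asset_paths(self) -> List[str]:
        """Return a lake/zone/asset path for every attached asset."""
        return [
            f"{self.name}/{zone.name}/{asset.name}"
            for zone in self.zones
            for asset in zone.assets
        ]


# Illustrative names only; none of these resources exist.
sales_lake = Lake(
    name="sales",
    zones=[
        Zone("raw", [Asset("orders-landing", "gs://example-bucket/orders")]),
        Zone("curated", [Asset("orders-table", "bigquery://example-project/sales/orders")]),
    ],
)

print(sales_lake.asset_paths())
```

Because every asset has a stable logical path, policies can be attached to a path instead of to each storage system separately.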

This Dataplex organization can be applied to the central data lake pattern as well as to the data mesh architecture pattern.

Central DataLake Pattern:

Dataplex Lake and Zones can be mapped to the Data assets in the Landing, Structured and Refined zones (also known as Bronze, Silver and Gold Layers) of the Central Data Lake Architecture pattern. You can also attach data across multiple projects under the same zone.

DataMesh Architecture Pattern:

The data mesh architecture, first proposed in this paper by Zhamak Dehghani, describes a modern data stack that moves away from a monolithic data lake or data warehouse architecture to a distributed, domain-specific architecture. It enables autonomy of data ownership and provides agility through decentralized, domain-aware data management, while retaining the ability to centrally govern and monitor data across domains.

Dataplex provides a data management platform to easily build independent data domains within a data mesh that spans your organization while still maintaining central controls for governing and monitoring the data across domains.

For example, you can create a lake per department within your organization (Customer, Operations, Sales, etc.) and create data zones that map to data readiness and usage (e.g. Sales domain zones: Raw Data, Offline Sales, Online Sales, etc.).
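The department example above can be sketched as a simple mapping from domain lakes to zones. This is an illustration in plain Python rather than Dataplex configuration. The Sales zones are taken from the example, while the zones listed for Customer and Operations are placeholders invented for this sketch, and `central_inventory` is a hypothetical helper.

```python
from typing import Dict, List

# One lake per business domain, zones per data readiness or usage.
# Only the Sales zones come from the article; the others are placeholders.
MESH: Dict[str, List[str]] = {
    "Customer": ["Raw Data"],
    "Operations": ["Raw Data"],
    "Sales": ["Raw Data", "Offline Sales", "Online Sales"],
}


def central_inventory(mesh: Dict[str, List[str]]) -> List[str]:
    """Flatten every domain's zones into one list a central team could monitor."""
    return [f"{domain}/{zone}" for domain, zones in mesh.items() for zone in zones]


for path in central_inventory(MESH):
    print(path)
```

Each domain team owns its own entry in the mapping, while the flattened inventory stands in for the central view across domains.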

2. Data discovery and Catalog

Dataplex automates data discovery, classification, and metadata enrichment of structured, semi-structured, and unstructured data, stored in Google Cloud and beyond, with built-in data intelligence. It manages technical, operational, and business metadata in a Data Catalog. Users can search, find, and understand their data with a built-in faceted-search interface that uses the same search technology as Gmail.
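As a rough illustration of what faceted search over catalog metadata means, the sketch below filters a few made-up entries by keyword and by facet values. It is a toy model in plain Python, not the Data Catalog search API; the entries, field names and the `search` function are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical catalog entries; the field names are illustrative only.
ENTRIES: List[Dict[str, str]] = [
    {"name": "orders_raw", "system": "cloud_storage", "type": "fileset",
     "description": "Landing files for online orders"},
    {"name": "orders_clean", "system": "bigquery", "type": "table",
     "description": "Deduplicated orders"},
    {"name": "customers_clean", "system": "bigquery", "type": "table",
     "description": "Customer master data"},
]


def search(keyword: str, **facets: str) -> List[str]:
    """Return names of entries matching the keyword and every facet filter."""
    keyword = keyword.lower()
    results = []
    for entry in ENTRIES:
        text = f"{entry['name']} {entry['description']}".lower()
        if keyword in text and all(entry.get(k) == v for k, v in facets.items()):
            results.append(entry["name"])
    return results


print(search("orders"))
print(search("orders", system="bigquery"))
```

The keyword narrows by content and each facet narrows by a metadata attribute, which is the general idea behind a faceted-search interface.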

3. Centralized security and governance

Dataplex enables central policy management, monitoring, and auditing for data authorization and classification across data silos. It facilitates distributed data ownership based on business domains, with global monitoring and governance.
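One way to picture central policy management on top of the logical organization is hierarchical inheritance: a grant on a lake applies to its zones and assets. The sketch below models that idea in plain Python under this assumption. It does not call Google Cloud IAM or Dataplex, and the principals, role names and `effective_roles` helper are hypothetical.

```python
from typing import Dict, Set

# Hypothetical grants keyed by resource path: "lake", "lake/zone" or "lake/zone/asset".
GRANTS: Dict[str, Dict[str, Set[str]]] = {
    "sales": {"analyst@example.com": {"reader"}},
    "sales/raw": {"ingest-bot@example.com": {"writer"}},
}


def effective_roles(principal: str, resource_path: str) -> Set[str]:
    """Union of roles granted on the resource and on every ancestor path."""
    parts = resource_path.split("/")
    roles: Set[str] = set()
    for depth in range(1, len(parts) + 1):
        ancestor = "/".join(parts[:depth])
        roles |= GRANTS.get(ancestor, {}).get(principal, set())
    return roles


# The analyst inherits "reader" from the lake; the bot is scoped to one zone.
print(effective_roles("analyst@example.com", "sales/raw/orders-landing"))
print(effective_roles("ingest-bot@example.com", "sales/curated/orders-table"))
```

Setting a policy once at the lake or zone level, rather than on each bucket or dataset, is what keeps governance manageable as the number of assets grows.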

4. Built-in Profiling, Data quality and Lineage

Dataplex automates data quality checks across distributed data and enables access to that data. You can use automatically captured data lineage to better understand data, trace dependencies, and troubleshoot data issues.
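Tracing dependencies through lineage amounts to walking a graph of "derived from" relationships. The following is a minimal sketch of that walk in plain Python. The table names and the `LINEAGE` mapping are invented for illustration, and the code does not use any Dataplex lineage API.

```python
from typing import Dict, List, Set

# Hypothetical lineage edges: each table maps to the sources it was derived from.
LINEAGE: Dict[str, List[str]] = {
    "sales_report": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}


def upstream(table: str) -> List[str]:
    """Return every direct and transitive upstream source of a table, sorted."""
    seen: Set[str] = set()
    stack = list(LINEAGE.get(table, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        stack.extend(LINEAGE.get(current, []))
    return sorted(seen)


print(upstream("sales_report"))
```

When a report looks wrong, a walk like this narrows the investigation to the tables that actually feed it.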

5. Serverless data exploration

Interactively query fully governed, high-quality data using a serverless data exploration workbench with one-click access to Spark SQL scripts and Jupyter notebooks. It enables users to collaborate across teams with built-in publishing, sharing, and search features, and features one-click scheduling from the workbench.

We will deep-dive into each of the above areas as part of this series.

Conclusion:

Google Cloud Dataplex, with its intelligent data fabric, can help enterprises deploy a unified operating model for data governance that covers the entire lifecycle: data discovery, organization, cataloging, enrichment, quality and security. This allows organizations to manage their data assets effectively at scale.

For more details please visit:

Part 2 of the blog series, on Data Organization (Lakes and Zones), can be found here.

Data and Artificial Intelligence specialist at Google. This blog is based on “My experiences from the field”. Views are solely mine.