Data Governance with Dataplex

Murari Ramuka
Google Cloud - Community
Apr 19, 2023

Gone are the days when a data governance process was nice to have; it is now mandatory for every enterprise. A robust data governance process enables a data-driven culture in the organization by providing the right data to the right user at the right time.

What is Dataplex?

Dataplex is an intelligent data fabric, available on Google Cloud, that enables organizations to centrally discover, manage, monitor, and govern data across data lakes, data warehouses, and data stores, with consistent controls that let users access trusted data and run analyses.

What are the Benefits of Dataplex?

Simplified data discovery

· Automates data discovery, classification, and metadata enrichment of structured, semi-structured, and unstructured data, stored in Google Cloud and beyond, with built-in data intelligence.

Data organization and life cycle management

· Logically organize your data that spans multiple storage services into business-specific domains using Dataplex lakes and data zones.

Centralized security and governance

· Enable central policy management, monitoring, and auditing for data authorization and classification, across data silos.

Built-in data quality and lineage

· Automate data quality across distributed data and enable access to data you can trust. Use automatically captured data lineage to better understand your data, trace dependencies, and effectively troubleshoot data issues.

Serverless data exploration

· Interactively query fully governed, high-quality data using a serverless data exploration workbench with one-click access to Spark SQL scripts and Jupyter notebooks.

A Dataplex use case

Let’s take a use case where company XYZ has a vision to set up a data mesh landscape for its organization. It wants to give business and operational users the right data so that they can adopt a data-driven culture.

To solve this use case, XYZ needs a domain-based architecture with one domain per function (for example Sales, Finance, Marketing, and Operations). Dataplex enables this domain-based architecture and lets XYZ add the respective assets to each domain. With Dataplex, business users can view the datasets they need and make decisions based on them. Segregating the entire data estate into domains such as Sales, Production, and Manufacturing makes the data more visible and helps produce more effective insights. Under the hood, Dataplex takes care of central access control along with tag templates, cataloguing, and data masking. With this approach, any organization can make data available to its end users while applying the right governance and eliminating data duplication.
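The domain-based layout described above can be sketched as a small data model. This is only an illustration in plain Python, not the Dataplex API: the class names `Lake`, `Zone`, and `Asset`, the helper `assets_by_domain`, and every resource name are hypothetical. It assumes one lake per business domain, with zones grouping the assets that belong to that domain.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Asset:
    """A pointer to existing storage, such as a bucket or a dataset."""
    name: str
    resource: str  # hypothetical resource identifier, for illustration only


@dataclass
class Zone:
    """A logical grouping of assets inside a lake."""
    name: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Lake:
    """One lake per business domain (assumption made for this sketch)."""
    domain: str
    zones: List[Zone] = field(default_factory=list)

    def all_assets(self) -> List[Asset]:
        # Flatten the assets of every zone, preserving zone order.
        return [asset for zone in self.zones for asset in zone.assets]


def assets_by_domain(lakes: List[Lake]) -> Dict[str, List[str]]:
    """Return the asset names a business user would see under each domain."""
    return {lake.domain: [a.name for a in lake.all_assets()] for lake in lakes}


# Hypothetical domains for company XYZ.
sales = Lake(
    domain="Sales",
    zones=[
        Zone("raw", [Asset("orders_landing", "gs://xyz-sales-landing")]),
        Zone("curated", [Asset("orders", "bigquery:xyz.sales.orders")]),
    ],
)
finance = Lake(
    domain="Finance",
    zones=[Zone("curated", [Asset("ledger", "bigquery:xyz.finance.ledger")])],
)

catalog = assets_by_domain([sales, finance])
```

The point of the sketch is that the underlying storage is referenced, not copied: each domain only holds pointers to assets, which is what lets the data stay in place while being organized by business function.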

An important value-add that Dataplex offers enterprises is ensuring that high-quality data is easily discoverable and accessible for analysis, across multiple silos, to a growing number of people and tools within the organization. Without it, enterprises are often forced to make trade-offs: moving and duplicating data across silos to enable different analytic use cases, or keeping their data distributed but limiting decision-making flexibility.

Dataplex is an intelligent data fabric that provides a way to centrally manage, monitor, and control your data across data lakes, data warehouses, and data marts, and to make that data available to a variety of analytics and data science tools.

Dataplex provides an integrated analytics environment that combines the best of Google Cloud and open source tools so you can quickly manage, secure, integrate and analyze data at scale. With built-in data intelligence using Google’s artificial intelligence (AI) and machine learning (ML) capabilities, and a flexible consumption model, you can now spend less time fighting infrastructure and more time focused on driving business results.

Dataplex has a built-in feature named Data Catalog, a fully managed, scalable metadata service. Dataplex gives the organization the data governance capabilities it needs to handle its data-related issues.

Data governance is a critical aspect of data management. It ensures that data is accurate, secure, and used appropriately. Data Catalog provides a central location for managing data governance policies and procedures, ensuring that data is managed in accordance with relevant regulations and best practices.

Advanced data catalog features:

An advanced data catalog does much more than manage data or make it available. A modern data catalog provides all of the above features and more, making it an even more powerful tool for any business aiming to become a truly data-driven organization.

Data lineage:

Data lineage is another important aspect of data management, and one for which a modern data catalog should enable a seamless user experience. Data lineage refers to the historical record of a data asset, from its origin to its current state, including any transformations and processes it has undergone.

A modern data catalog helps trace data lineage and enables organizations to understand where their data came from, how it was transformed, and who used it. Data lineage is important for data quality and compliance purposes because it allows data professionals to understand how the data was processed and to identify potential problems.
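To make the idea of tracing lineage concrete, here is a minimal sketch that stores lineage as a list of (source, target, process) edges and walks it backwards to find everything an asset depends on. It is plain Python and does not use any Dataplex or Data Catalog API; the asset names, job names, and the functions `build_parent_index` and `upstream` are all hypothetical.

```python
from collections import defaultdict, deque

# Each edge says: `process` read `source` and wrote `target`.
# All names below are made up for illustration.
LINEAGE_EDGES = [
    ("raw_orders", "clean_orders", "dedupe_job"),
    ("clean_orders", "daily_revenue", "aggregate_job"),
    ("raw_customers", "daily_revenue", "aggregate_job"),
    ("daily_revenue", "revenue_dashboard", "export_job"),
]


def build_parent_index(edges):
    """Map each target asset to the set of assets it was derived from."""
    parents = defaultdict(set)
    for source, target, _process in edges:
        parents[target].add(source)
    return parents


def upstream(asset, edges):
    """Return every asset that `asset` depends on, directly or indirectly."""
    parents = build_parent_index(edges)
    seen = set()
    queue = deque([asset])
    # Breadth-first walk from the asset back towards its origins.
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


dashboard_sources = upstream("revenue_dashboard", LINEAGE_EDGES)
```

When a number on the dashboard looks wrong, a walk like this narrows the investigation to the assets that actually feed it, which is the troubleshooting value lineage provides.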

Enterprises have data that is distributed across data lakes, data warehouses, and data marts. Dataplex allows you to discover, manage and unify this data without any data movement, organize it according to your business needs and centrally manage, monitor and control it. Dataplex helps you standardize and unify metadata, security policies, governance, classification and data lifecycle management across this distributed data.

Securing your Data Lake with Dataplex:

The Dataplex security model allows you to manage who has access to perform the following tasks:

· Administering a lake (creating and attaching assets, zones, and additional lakes)

· Accessing data connected to a lake through the mapping asset (Google Cloud resources such as Cloud Storage buckets and BigQuery datasets)

· Accessing metadata about the data connected to a lake

An administrator for a lake controls access to Dataplex resources (lakes, zones, and assets) by granting basic and predefined roles.
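The three kinds of access listed above can be pictured as a simple role-to-permission lookup. The sketch below is a conceptual model only: the role labels (`lake_admin`, `data_reader`, `metadata_reader`), the `Action` enum, the `is_allowed` helper, and the user addresses are hypothetical and are not the actual Dataplex IAM role names or API. Which permissions each role bundles together is also an assumption made for the example.

```python
from enum import Enum, auto


class Action(Enum):
    ADMINISTER_LAKE = auto()  # create and attach assets, zones, lakes
    READ_DATA = auto()        # access data connected to the lake
    READ_METADATA = auto()    # access metadata about that data


# Hypothetical roles; the real predefined roles are defined by Google Cloud IAM.
ROLE_PERMISSIONS = {
    "lake_admin": {Action.ADMINISTER_LAKE, Action.READ_DATA, Action.READ_METADATA},
    "data_reader": {Action.READ_DATA, Action.READ_METADATA},
    "metadata_reader": {Action.READ_METADATA},
}


def is_allowed(bindings, user, action):
    """Check whether any role granted to `user` includes `action`."""
    roles = bindings.get(user, set())
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


# Role grants made by the lake administrator (illustrative users).
bindings = {
    "alice@xyz.example": {"lake_admin"},
    "bob@xyz.example": {"metadata_reader"},
}

bob_can_read_data = is_allowed(bindings, "bob@xyz.example", Action.READ_DATA)
```

Keeping metadata access separate from data access, as in this model, lets a user discover that a dataset exists without being able to read its contents.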

Dataplex Labs : https://github.com/GoogleCloudPlatform/dataplex-labs/tree/main/dataplex-quickstart-labs

Thanks for taking the time to read this blog. Stay tuned for part 2.

Follow me on LinkedIn: https://www.linkedin.com/in/murari-ramuka-98a440a/
