Data Governance on GCP: Quick Bites

Sneha Choudhary
Google Cloud - Community
6 min read · Dec 11, 2022

Let’s start with a quick understanding of Data Governance.

Data Governance is all about interrogating data. It is about knowing the who, what, how, when, where and why of data, in order to extract true value from all the information collected and stored across the business. This perspective makes data governance a business program or strategy, one that requires stakeholders from all business areas to be involved in putting appropriate policies in place.

“What are we aiming to achieve with Data Governance? Which of the above interrogatives do we need to respond to?” Our answers to these questions will swiftly lead us to the appropriate tools and services to use.

Just in case you are curious about why we should bother with data governance, the next paragraph is meant especially for you, my friend!

Data is at the heart of every business, and every data-driven enterprise intends to embed data in all decision making. I know it’s cliché to say that, but data is the most important business driver today and an irreplaceable corporate asset. For organisations small, medium or large, governing data builds user trust, increases brand value and reduces the chances of compliance violations. Ultimately, all this helps save enterprises from substantial fines and loss of business.

We have an unprecedented amount of data these days: more data, more power and, hence, greater responsibility. That is exactly where the need for Data Governance arises!

We will use a set of driving principles to classify the various services available on Google Cloud Platform. These principles apply at various stages of the data life cycle to help keep data compliant with end user goals.

Our driving principles are:

  • Data discovery, management and classification
  • Data encryption and protection
  • Data lineage tracking and profiling
  • Data quality management
  • Data access management
  • Data auditing and democratization

Here is how GCP services map to one or more of the principles mentioned above.

Data discovery, management and classification

This assessment helps us know what data assets we have and classify them appropriately. As an outcome, enterprises can make informed decisions on applying correct data governance policies and procedures.

Data Catalog helps with centralized metadata management and data discovery. It is also fully managed and easy to scale.

Google’s Cloud Data Loss Prevention (DLP) applies data masking and tokenization techniques to help obfuscate sensitive information in the data.
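
To make the two techniques concrete, here is a minimal sketch in plain Python. It does not call the Cloud DLP API; the email pattern, the demo key and the function names are all assumptions made up for illustration. Masking hides most of a value, while tokenization swaps it for a deterministic surrogate so that records can still be joined on it.

```python
import hashlib
import hmac
import re

# Hypothetical pattern and key, for illustration only. Cloud DLP relies on its
# own built-in detectors and on managed keys rather than anything shown here.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_KEY = b"demo-key-not-for-production"


def mask(value, visible=2, mask_char="*"):
    """Keep the first `visible` characters and replace the rest with `mask_char`."""
    return value[:visible] + mask_char * max(len(value) - visible, 0)


def tokenize(value, key=SECRET_KEY):
    """Return a deterministic surrogate token for `value` using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def obfuscate_emails(text, strategy=mask):
    """Apply `strategy` (mask or tokenize) to every email address found in `text`."""
    return EMAIL_PATTERN.sub(lambda match: strategy(match.group(0)), text)
```

Calling `obfuscate_emails("Contact alice@example.com", strategy=tokenize)` would replace the address with a `tok_...` surrogate, and the same address always maps to the same token as long as the key stays the same.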

Together, Google Cloud and its partners like Collibra & Informatica provide unified governance across multiple cloud platforms, which enables companies to stay compliant while also improving business outcomes.

Data encryption and protection

Perimeter security of the cloud is a must, but it is not enough to protect data as it moves through the various stages of the data pipeline.

Google Cloud Platform encrypts data by default, at rest as well as in transit. However, customers can also use Cloud KMS to create, import, and manage their own keys.

For security and risk management, Google Cloud offers a centralized threat detection service: Security Command Center. It can not only identify security and compliance threats, but also help prevent and remediate them with actionable recommendations.

Data Loss Prevention (DLP) integration can help classify and mask or tokenize sensitive data elements for better data management, while also minimizing the risk of compliance violations.

Data lineage tracking and profiling

Whether it comes to compliance needs, cost control or even data quality, tracking lineage is paramount. It allows an enterprise to put a process in place to trace each data asset from the moment it enters the enterprise. The primary feature of lineage tracking is identifying how data has changed over time as it moves through the data pipeline: whether it has been optimized or improved, and who was involved in the processing.

Data Fusion provides end-to-end data lineage tracking at both the dataset level and the field level. This helps with impact analysis and with meeting governance and compliance needs.
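
As a rough illustration of what impact analysis on dataset-level lineage means, the sketch below walks a small lineage graph in plain Python. It is not how Data Fusion stores or exposes lineage; the dataset names and the `downstream_of` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
LINEAGE = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["mart.daily_revenue", "mart.customer_ltv"],
    "mart.daily_revenue": ["dashboard.finance"],
    "mart.customer_ltv": [],
    "dashboard.finance": [],
}


def downstream_of(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order
```

Asking for `downstream_of("raw.orders", LINEAGE)` lists everything that could be affected by a change to the raw orders data, which is the question impact analysis tries to answer.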

BigQuery also supports data profiling which, when turned on, uses Cloud DLP to automatically scan all tables and columns and identify the location of sensitive data. It then creates data profiles at the table, column, and/or project levels.

Aside from the above, several other GCP services, including but not limited to Pub/Sub, Audit Logs, Cloud Logging, Dataflow, Dataplex and Data Catalog, can be used to create a robust architecture for data lineage tracking.

[Figure: Products and Services on GCP]

Data quality management

Data quality measures ensure that the data across the organization adheres to the desired standards. Data quality is about introducing controls and having checks in place for validating accuracy, completeness, consistency and timeliness of data to assist with quality monitoring and reporting.
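
To give a feel for such checks, here is a minimal sketch of two of them, completeness and timeliness, in plain Python. It is independent of any GCP service; the sample records, column names and thresholds are assumptions invented for this example.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Return the fraction of rows whose `column` is neither missing nor empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def is_timely(rows, column, max_age, now):
    """Return True if the newest timestamp in `column` is within `max_age` of `now`."""
    timestamps = [row[column] for row in rows if row.get(column) is not None]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age


# Hypothetical sample records, invented for this example.
NOW = datetime(2022, 12, 11, 12, 0, tzinfo=timezone.utc)
ROWS = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": NOW - timedelta(hours=2)},
    {"customer_id": 2, "email": "", "updated_at": NOW - timedelta(hours=30)},
    {"customer_id": 3, "email": "c@example.com", "updated_at": NOW - timedelta(hours=5)},
    {"customer_id": 4, "email": None, "updated_at": None},
]
```

A monitoring job could compare `completeness(ROWS, "email")` against an agreed threshold and raise an alert when `is_timely(ROWS, "updated_at", timedelta(hours=24), NOW)` turns false.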

Dataprep by Trifacta is a powerful, serverless and scalable service that provides an intelligent visual interface to explore data. It is also intelligent enough to suggest the next transformation at every UI input, which makes it practically a no-code/low-code solution.

GCP’s Dataplex uses an intelligent data fabric to automatically discover, classify, manage and monitor data. This service enables centralized security and governance features, while allowing for distributed, domain-based data ownership.

Several Google partners, such as Informatica, work together with Google to provide excellent data quality capabilities and tools across domains for better analysis.

Data access management

Applying access controls gives administrators and other related actors the visibility they need to streamline management of cloud resources. This helps prevent exposure of sensitive data to unauthorized parties (internal or external) and avert security threats and data breaches.

Identity and Access Management (IAM) provides a simple unified interface for managing access control across all Google Cloud resources. It helps manage access rights by grouping resource permissions into roles, organizing identities into groups and granting roles to authenticated principals (users, groups or service accounts).
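
The role-based model described above can be sketched as a toy in plain Python. This is not the IAM API: the permission sets are an abbreviated, illustrative subset of what the real roles contain, and the principals, group membership table and `has_permission` helper are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Abbreviated, illustrative subset; the real predefined roles carry more permissions.
ROLE_PERMISSIONS = {
    "roles/bigquery.dataViewer": {"bigquery.tables.get", "bigquery.tables.getData"},
    "roles/bigquery.dataEditor": {
        "bigquery.tables.get",
        "bigquery.tables.getData",
        "bigquery.tables.updateData",
    },
}


@dataclass
class Binding:
    """One policy binding: a role granted to a set of principals."""
    role: str
    members: set[str]


# Hypothetical principals and group membership, invented for this example.
GROUP_MEMBERS = {"group:analysts@example.com": {"user:ana@example.com"}}
BINDINGS = [
    Binding("roles/bigquery.dataViewer", {"group:analysts@example.com"}),
    Binding("roles/bigquery.dataEditor", {"user:etl-owner@example.com"}),
]


def has_permission(principal, permission, bindings, group_members):
    """Return True if any binding grants `permission` to `principal`, directly or via a group."""
    for binding in bindings:
        if permission not in ROLE_PERMISSIONS.get(binding.role, set()):
            continue
        for member in binding.members:
            if member == principal:
                return True
            if member.startswith("group:") and principal in group_members.get(member, set()):
                return True
    return False
```

Granting the viewer role to the analysts group rather than to each person is the design choice the paragraph points at: membership changes in one place and the policy itself stays small.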

BigQuery also allows fine-grained access control policies to be created to manage permissions on projects, datasets and tables. It supports row-level and column-level security for more granular management.
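
For row-level security, BigQuery uses a `CREATE ROW ACCESS POLICY` DDL statement. The small helper below only assembles such a statement as text; the policy name, table path, grantee and filter are hypothetical, and the helper does no input validation, so treat it as a sketch rather than production code.

```python
def row_access_policy_ddl(policy_name, table, grantees, filter_expression):
    """Build a BigQuery CREATE ROW ACCESS POLICY statement as a string."""
    grantee_list = ", ".join(f'"{grantee}"' for grantee in grantees)
    return (
        f"CREATE ROW ACCESS POLICY {policy_name}\n"
        f"ON `{table}`\n"
        f"GRANT TO ({grantee_list})\n"
        f"FILTER USING ({filter_expression});"
    )


# Hypothetical names, invented for this example.
EXAMPLE_DDL = row_access_policy_ddl(
    policy_name="emea_only",
    table="my_project.sales.orders",
    grantees=["group:emea-analysts@example.com"],
    filter_expression='region = "EMEA"',
)
```

With a policy like this in place, members of the granted group would only see rows where the filter evaluates to true when they query the table.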

Data auditing and democratization

Having audit controls and regular audit checks (internal or external) in place acts as an important guardrail for identifying security threats and system vulnerabilities. Audit records are also a compliance requirement in many industries.

Cloud Logging is Google’s fully managed, real-time log management solution that collects, stores, searches and analyzes log data from across the platform, all in one place. It is also integrated with the Cloud Monitoring and Error Reporting services, which further assist with troubleshooting needs.
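
When searching audit records in Cloud Logging, queries are written as filter expressions. The helper below only builds such a filter string for Data Access audit logs; the project ID and the helper name are hypothetical, and the exact fields worth filtering on will depend on the service being audited.

```python
from datetime import datetime, timezone


def data_access_audit_filter(project_id, resource_type, since):
    """Build a Cloud Logging filter for Data Access audit logs.

    `since` must be a timezone-aware datetime; it is rendered in UTC.
    """
    log_name = f"projects/{project_id}/logs/cloudaudit.googleapis.com%2Fdata_access"
    since_utc = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [
        f'logName="{log_name}"',
        f'resource.type="{resource_type}"',
        f'timestamp>="{since_utc}"',
    ]
    return " AND ".join(clauses)


# Hypothetical project ID, invented for this example.
EXAMPLE_FILTER = data_access_audit_filter(
    project_id="my-project",
    resource_type="bigquery_dataset",
    since=datetime(2022, 12, 1, tzinfo=timezone.utc),
)
```

The resulting string could be pasted into the Logs Explorer query box or passed as the filter argument of a logging client.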

A safe and secure means of data exchange is the need of the hour for many data-driven organizations that are inclined towards data democratization. This need is well met by GCP’s Analytics Hub, powered by BigQuery, which enables organizations to share their data and analytics assets in the form of curated exchanges, without moving the data.

[Figure: Products and Services on GCP (continued)]

Every organization is different, and each might have its own legal, economic and/or technical challenges. Organizations might also be at different levels of the Data Governance maturity model; hence, they need to adapt their governance practices accordingly and opt for a suite of services that best meets their business goals.

Before we wrap up, I wanted to mention that this post attempts to explore some (… not all!) powerful and easy-to-use GCP services to help one get started with Data Governance.

Happy Learning, and …

In case you are interested in exploring GCP’s Database Migration Service via REST calls, here is a quick link to my previous article on the same.
