Dataplex: Unified Data Governance, Central Data Management & Enhanced Data Quality

Vasu Mittal
Google Cloud - Community
7 min read · May 30, 2024

Dataplex is an intelligent data fabric that helps you unify distributed data and automate data management and governance across that data to power analytics at scale.

Data Catalog is a metadata management service within Dataplex.

What problems does Dataplex solve?

  1. Data in Silos: Data is fragmented, with no single view of the data universe in GCP.
  2. Data Security: Having different security models across GCP services is an extra operational overhead.
  3. Tooling & Usage: Using various GCP tools and services to query data makes it a tedious task.
  4. Data Standardization: Lack of consistent metadata, loss in data quality, and unstable data governance strategy.
  5. Data Governance: Having different models and unpredictable pricing is definitely difficult to manage.

What can we achieve using Dataplex?

Dataplex allows you to do the following:

  1. Data Mesh: Build a domain-specific data mesh across data that is stored in multiple Google Cloud projects, without any data movement.
  2. Data Governance: Consistently govern and monitor data with a single set of permissions. Centrally manage, monitor, and govern data.
  3. Data Catalog: Discover and curate metadata across various data silos using data catalog capabilities.
  4. Data Security: Securely query metadata by using BigQuery and open source tools, such as SparkSQL, Presto, and HiveQL.
  5. Data Quality: Run data quality and data lifecycle management tasks, including serverless Spark tasks.
  6. Data Usage: Explore data using fully managed, serverless Spark environments with simple access to notebooks and SparkSQL queries.

What does Data Governance look like in an ideal world?

Data Governance is a trinity that encompasses the following three data frameworks:

  1. Data Quality: Standardization, Quality Checks & Data Lifecycle
  2. Data Security: Access, Protection & PII Management
  3. Data Compliance: Lineage, Classification & Auditing

You can’t build a successful data environment without each of these pillars linked together in harmony.

Dataplex Terminology

Lake: A logical construct representing a data domain or business unit. For example, to organize data based on group usage, you can set up a lake for each department (for example, Retail, Sales, Finance, IT etc.).

Zone: A subdomain within a lake, which is useful to categorize data by the following:

  • Stage: For example, landing, raw, curated data analytics, and curated data science.
  • Usage: For example, data contract.
  • Restrictions: For example, security controls and user access levels.

Zones are of two types: Raw and Curated.

  • Raw zone: Contains data that is in its raw format and not subject to strict type-checking.
  • Curated zone: Contains data that is cleaned, formatted, and ready for analytics. The data is columnar, Hive-partitioned, and stored in Parquet, Avro, or ORC files, or in BigQuery tables.

Asset: Maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone.

Entity: Represents metadata for structured and semi-structured data (table) and unstructured data (fileset).

Getting Started with Dataplex

Creating a Lake

In Dataplex, a lake is the highest organizational domain that represents a specific data area or business unit. For example, you can create a lake for each department or data domain in your organization, so that you can organize and provide data for specific user groups.

Let’s create a lake named “Product Catalog”:

  1. In the Google Cloud Console, in the Navigation menu, navigate to Analytics > Dataplex or search for “Dataplex” in the search bar. If prompted “Welcome to new Dataplex Experience”, click Close.
  2. Under Manage lakes, click “Manage”.
  3. Click “+Create lake”.
  4. Enter the required information to create the new lake: enter the display name as “Product Catalog”, leave the ID at its default value, and select an appropriate region from the list. Leave the other fields at their default values.
  5. Click Create.

Creating a Zone

After you create a lake, you can add zones to the lake. Zones are subdomains within a lake that you can use to categorize data further. For example, you can categorize data by stage, usage, or restrictions. Also, we saw above that Zones are of two types: Raw and Curated.

Let’s try to add a “Raw Zone” for working with files in a Cloud Storage bucket:

  1. On the Manage tab, click on the name of your lake(in our case it is “Product Catalog”).
  2. Click “+Add zone”.
  3. Enter the required information to create the new zone: enter the display name as “Product Raw Zone”, leave the ID at its default value, select the type as “Raw Zone”, and select the data locations as “Regional”. Leave the other fields at their default values. For example, the “Enable metadata discovery” option under Discovery Settings is enabled by default and allows authorized users to discover the data in the zone.
  4. Click Create.

Attach an asset to a zone:

Data stored in Cloud Storage buckets or BigQuery datasets can be attached as assets to zones within a Dataplex lake.

For this demo, let’s attach a Cloud Storage bucket to the zone that we have just created:

  1. On the Zones tab, click on the name of your zone(in our case it is “Product Raw Zone”).
  2. On the Assets tab, click “+ADD ASSETS”.
  3. Click “+ADD AN ASSET”.
  4. Enter the required information to attach the new asset. For example: select the type as “Storage Bucket”, enter the display name as “Product Catalog Bucket”, and leave the ID at its default value.
  5. For Bucket, click Browse. (Note: You can attach an existing Cloud Storage bucket or create a new one without leaving Dataplex. In the next steps, you create a new Cloud Storage bucket and attach it to the zone.)
  6. Click “+Create new bucket”.
  7. Provide your project ID as the bucket name and click “Continue”.
  8. For “Location type”, select “Region”. Leave other default values.
  9. Click “Create”. (Note: If prompted “public access will be prevented”, click Confirm.)
  10. Click “Select” to select the bucket you just created, and then click “Continue”.
  11. For “Discovery settings”, select “Inherit” to inherit the Discovery settings from the zone level, and then click “Continue”.
  12. Click “Submit”.

And all set!
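For reference, the same attachment can be sketched with the gcloud CLI. This assumes the PROJECT_ID and REGION variables from the CLI section later in this post are set, and it uses hypothetical IDs (a lake ID of productcatalog and a zone ID of product-raw-zone); substitute the actual IDs that Dataplex generated for your lake and zone:

gcloud dataplex assets create product-catalog-bucket \
--location=$REGION \
--lake=productcatalog \
--zone=product-raw-zone \
--display-name="Product Catalog Bucket" \
--resource-type=STORAGE_BUCKET \
--resource-name=projects/$PROJECT_ID/buckets/$PROJECT_ID \
--discovery-enabled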

Deleting Assets, Zones and Lakes

To delete a lake, you must first detach the assets and then delete the zones.

Detach an Asset

Detaching an asset makes the Cloud Storage bucket no longer accessible or discoverable through the lake in Dataplex.

  1. On the left menu, click on “Manage” tab, and then click on the name of your lake(in our case it is “Product Catalog”).
  2. On the “Zones” tab, click on the name of your zone(in our case it is “Product Raw Zone”).
  3. On the “Assets” tab, enable the checkbox to the left of the asset name(in our case it is “Product Catalog Bucket”).
  4. Click “Delete assets”.
  5. Click “Delete” to confirm.

Delete a Zone

  1. On the left menu, click on “Manage” tab, and then click on the name of your lake(in our case it is “Product Catalog”).
  2. On the “Zones” tab, enable the checkbox to the left of the zone name(in our case it is “Product Raw Zone”).
  3. Click “Delete zone”.
  4. Click “Delete” to confirm.

Delete the Lake

  1. On the left menu, click on “Manage” tab, and then click on the name of your lake(i.e “Product Catalog” in our case).
  2. At the top of the page, click “Delete”.
  3. Confirm deletion by typing “delete” into the text box.
  4. Click “Delete lake” to confirm.

Dataplex APIs

We can also perform all of the above tasks (creating a lake, creating a zone, and attaching an asset) using the command-line interface. Let’s see that in action:

  1. To enable the Dataplex API, run the following command in Cloud Shell:

gcloud services enable \
dataplex.googleapis.com

  2. Now let’s create a variable called PROJECT_ID:

export PROJECT_ID=$(gcloud config get-value project)

  3. Also, let’s create a variable called REGION. Set it to the region you selected earlier (for example, us-central1):

export REGION=""
gcloud config set compute/region $REGION
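As an optional check, you can confirm that the Dataplex API is enabled and that the variables are set:

gcloud services list --enabled --filter="name:dataplex.googleapis.com"
echo $PROJECT_ID $REGION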

Creating a Lake

Let’s now create the same lake, “Product Catalog” (which we created above using the GCP Console), using the CLI:

gcloud dataplex lakes create productcatalog \
--location=$REGION \
--display-name="Product Catalog" \
--description="Product Catalog Repo"
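To verify that the lake was created and check its status, you can describe it:

gcloud dataplex lakes describe productcatalog --location=$REGION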

Creating a Zone

By now, we know that zones are of two types: Raw and Curated. We have already experimented with a raw zone using the GCP Console above. Now, let’s create a curated zone named “Product Curated Zone”:

gcloud dataplex zones create product-curated-zone \
--location=$REGION \
--lake=productcatalog \
--display-name="Product Curated Zone" \
--resource-location-type=SINGLE_REGION \
--type=CURATED \
--discovery-enabled \
--discovery-schedule="0 * * * *"
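You can list the zones in the lake to confirm the new zone exists:

gcloud dataplex zones list --lake=productcatalog --location=$REGION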

Attaching an Asset

While using the GCP Console, we attached a GCS bucket to the raw zone that we created. Now, using the CLI, let’s see how we can do the same for a BigQuery dataset.

Let’s first create a BigQuery dataset named “products”:

bq mk --location=$REGION --dataset products

Now let’s attach the dataset “products” to the “Product Curated Zone” that we created above:

gcloud dataplex assets create product-curated-dataset \
--location=$REGION \
--lake=productcatalog \
--zone=product-curated-zone \
--display-name="Product Curated Dataset" \
--resource-type=BIGQUERY_DATASET \
--resource-name=projects/$PROJECT_ID/datasets/products \
--discovery-enabled
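To confirm the attachment, you can list the assets in the zone:

gcloud dataplex assets list --lake=productcatalog --zone=product-curated-zone --location=$REGION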

Deleting Assets, Zones & Lakes

To detach the BigQuery dataset from the zone, execute the below command:

gcloud dataplex assets delete product-curated-dataset --location=$REGION --zone=product-curated-zone --lake=productcatalog

We can delete the zone by executing the below command:

gcloud dataplex zones delete product-curated-zone --location=$REGION --lake=productcatalog

Finally, we can delete the lake by executing the below command:

gcloud dataplex lakes delete productcatalog --location=$REGION
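As a final check, listing the lakes in the region should confirm that the cleanup succeeded:

gcloud dataplex lakes list --location=$REGION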

Keep Learning, Keep Growing!!!
