Dataplex: An Intelligent Data Fabric | Data Governance at Scale | Google Cloud | Part 2 | Organization: Lakes and Zones

Published in Google Cloud - Community, Sep 25, 2023

This is Part 2 of the Dataplex series. It focuses on how to set up a Dataplex organization using lakes and zones. Part 1 provides an overview of the Dataplex service.

Dataplex organization

One of the core tenets of Dataplex is letting you organize and manage your data in a way that makes sense for your business, without data movement or duplication. Dataplex provides logical constructs such as lakes, data zones and assets to abstract away the underlying storage systems. This logical organization becomes the foundation for setting policies around data access, security, lifecycle management, and so on.

Terminology:

Before we begin creating the Dataplex organization, let’s look at the key terms.

Lake:

A Dataplex lake is a logical metadata abstraction on top of your assets (structured and unstructured) representing a data domain or business unit. For example, to organize data based on group usage, you can set up one lake per department (such as Retail, Sales or Finance).

Zone:

A zone is a sub-domain within a lake, useful for categorizing data by stage (for example, landing, raw, curated) or under an equivalent tiered naming scheme (for example, Bronze, Silver and Gold).

Zones are of two types, raw and curated.

Raw zone: Raw zones store structured, semi-structured, and unstructured data in any format from external sources. This is useful for staging raw data before performing any transformations. Data can be stored in Cloud Storage buckets or BigQuery datasets. There are no restrictions on the type of data that can be stored in raw zones.
Curated zone: Curated zones store structured data. Data can be stored in Cloud Storage buckets or BigQuery datasets. Supported formats for Cloud Storage buckets include Parquet, Avro, and ORC. This is useful for staging data that requires processing before it’s used for analysis, or for serving data that is ready for analysis. BigQuery tables must conform to a well-defined schema and Hive-style partitions.

Asset:

An asset maps to data stored in either a Cloud Storage bucket or a BigQuery dataset. You can map data stored in separate Google Cloud projects as assets into a single zone.
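To make the lake, zone and asset hierarchy concrete, here is a minimal Python sketch of it as plain data classes. This is only an illustrative model, not the Dataplex API: the class names, the `file_format` field (a bucket can of course hold more than one format), the project IDs and the bucket and dataset names are all assumptions made for the example. The one rule it enforces is the one stated above, that Cloud Storage data in a curated zone should be Parquet, Avro or ORC.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ZoneType(Enum):
    RAW = "RAW"
    CURATED = "CURATED"


class ResourceType(Enum):
    STORAGE_BUCKET = "STORAGE_BUCKET"
    BIGQUERY_DATASET = "BIGQUERY_DATASET"


# Cloud Storage formats described above as supported in curated zones.
CURATED_GCS_FORMATS = {"parquet", "avro", "orc"}


@dataclass
class Asset:
    name: str
    resource_type: ResourceType
    project: str                       # assets in one zone may live in different projects
    file_format: Optional[str] = None  # simplification: one format per bucket


@dataclass
class Zone:
    name: str
    zone_type: ZoneType
    assets: List[Asset] = field(default_factory=list)

    def add_asset(self, asset: Asset) -> None:
        is_bucket = asset.resource_type is ResourceType.STORAGE_BUCKET
        # Raw zones accept anything; curated zones restrict bucket formats.
        if self.zone_type is ZoneType.CURATED and is_bucket:
            if (asset.file_format or "").lower() not in CURATED_GCS_FORMATS:
                raise ValueError(
                    f"{asset.name}: curated zones expect Parquet, Avro or ORC"
                )
        self.assets.append(asset)


@dataclass
class Lake:
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)

    def add_zone(self, zone: Zone) -> Zone:
        self.zones[zone.name] = zone
        return zone


# Hypothetical example: one lake for a domain, zones by stage.
lake = Lake("Customer Domain")
raw = lake.add_zone(Zone("Raw", ZoneType.RAW))
curated = lake.add_zone(Zone("Curated", ZoneType.CURATED))

raw.add_asset(Asset("customer-raw-data", ResourceType.STORAGE_BUCKET, "project-a", "csv"))
curated.add_asset(Asset("customer_curated", ResourceType.BIGQUERY_DATASET, "project-b"))

try:
    # A CSV bucket does not fit the curated-zone format rule.
    curated.add_asset(Asset("customer-exports", ResourceType.STORAGE_BUCKET, "project-a", "csv"))
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

Note that nothing in this model copies or moves data: a lake and its zones only hold references to buckets and datasets, which is the point of the logical organization.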

Creating a Dataplex Organization:

A Dataplex organization is a logical mapping of the actual data platform architecture. It is therefore important to have a view of the as-is platform setup before mapping it to Dataplex lakes, zones and assets.

We will be using the following Consumer Banking example for setting up our Dataplex organization.

The Consumer Banking platform is organized as a data mesh architecture with 4 domain lakes and a central operations lake. Each lake has 2 to 3 zones, with data assets spread across GCS buckets and BigQuery datasets.

Let’s get started!

Step 1.0— Create a Lake:

We will start by creating the Customer Domain lake.

From the Navigation menu, under Analytics, select Dataplex, click Create new lake, and enter the Customer Domain lake details.

Click Create at the bottom of the page. The lake is then listed under Manage.

Step 2.0 — Create Zones:

The Customer domain has three data zones: Raw, Curated and Data Products.

Click the newly created Customer Domain lake, then click Add zone.

Select the Raw zone type for the Raw Data zone, and the Curated zone type for the Curated and Data Products zones.

Step 2.1 — Data Discovery:

Expand the Discovery Settings options.

The Discovery settings capture the job configuration for automatic object scanning and metadata collection for the zone. Dataplex fully automates and manages metadata harvesting through its discovery process.

Click Enable Metadata Discovery, and enable the JSON and CSV options if the GCS bucket contains files of these types.

You can also set the frequency and schedule of the discovery job.

Create the zone and repeat this task for all three zones. Once the lake and zones are created, the Customer Domain lake lists all three zones.

Step 3.0 — Create Assets:

With the Customer Domain Lake and Zones set up, let’s now map the actual data assets to it.

Click the Customer Domain lake, open the Raw zone, and click Add Asset.

Raw data is usually stored in GCS buckets and curated data in BigQuery datasets.

For raw data, select Storage Bucket as the type and specify the path to the Customer Raw Data GCS bucket.

For Discovery settings, select Inherit to keep the zone-level discovery settings, or Override to exclude some objects in the bucket for this asset, and then click Submit. The raw asset will be created within the Raw Data zone.

Click the asset in the Raw zone to see all the details of the mapped data. The External URL shows the GCS bucket name and, thanks to automatic metadata harvesting, Dataplex now has all the details of the data assets.
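The inherit-or-override choice can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not Dataplex's actual implementation: the `DiscoverySettings` class, its field names, the `tmp/*` pattern and the object paths are all hypothetical, and the glob matching via `fnmatch` only approximates how exclusion patterns might behave.

```python
import fnmatch
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class DiscoverySettings:
    enabled: bool = False
    schedule: str = ""                      # cron-style schedule for the discovery job
    exclude_patterns: Tuple[str, ...] = ()  # glob patterns for objects to skip


def effective_discovery(
    zone: DiscoverySettings, asset_override: Optional[DiscoverySettings]
) -> DiscoverySettings:
    """Inherit the zone-level settings unless the asset overrides them."""
    return zone if asset_override is None else asset_override


def is_scanned(settings: DiscoverySettings, object_path: str) -> bool:
    """Return True if discovery would consider this object (illustrative only)."""
    if not settings.enabled:
        return False
    return not any(fnmatch.fnmatch(object_path, p) for p in settings.exclude_patterns)


# Zone-level settings: discovery enabled, hourly schedule (assumed values).
zone_settings = DiscoverySettings(enabled=True, schedule="0 * * * *")

# Asset A inherits; asset B overrides to skip a scratch folder.
inherited = effective_discovery(zone_settings, None)
overridden = effective_discovery(
    zone_settings, replace(zone_settings, exclude_patterns=("tmp/*",))
)

print(is_scanned(inherited, "tmp/scratch.csv"))
print(is_scanned(overridden, "tmp/scratch.csv"))
print(is_scanned(overridden, "customers/2023/part-0.csv"))
```

The override starts from a copy of the zone settings, so the asset keeps the zone's schedule and only changes what it needs to.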

Repeat the same process to register data assets for the other two zones.

Repeat the same process for all the remaining Consumer Banking lakes to complete the creation of the entire data mesh on Dataplex.

Yay!! You have successfully created a full hierarchy of the Customer Domain Lake in Dataplex!!!

Explore the Dataplex Organization:

With the entire organization set up and assets linked, we should now be able to search and discover all the assets we have linked to the Dataplex organization.

Once the discovery job finishes, all the metadata is loaded into Dataplex’s integrated Data Catalog service.

Go to Search and type Customer in the search bar:

Dataplex shows all the assets related to the “Customer” search string across the lakes, zones and data assets we have configured.

Click one of the tables in the search results.

This opens a BigQuery-like view inside Dataplex that provides all the details of the BigQuery table, such as dataset, schema, lineage, data quality and data profile, all within Dataplex!

You can also navigate to BigQuery by clicking the “Open in BigQuery” link above.

GCloud Commands

Along with the UI, Dataplex also provides gcloud commands to create lakes, zones and assets. This is useful if you want to script the setup so that it can be replicated across multiple environments.
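As a rough sketch of such a script, the Python snippet below assembles the gcloud command lines for one lake, its three zones and one raw asset, and prints them instead of running them. The project ID, location, resource IDs, bucket name and schedule are all hypothetical. The flags are written as I recall them from the gcloud reference, so treat them as assumptions and confirm them with `gcloud dataplex zones create --help` (and the equivalent for lakes and assets) before use.

```python
import shlex

# Hypothetical values; replace with your own environment's settings.
PROJECT = "my-project"
LOCATION = "us-central1"
LAKE = "customer-domain"
SCHEDULE = "0 * * * *"  # hourly discovery, cron format

# Zone IDs and types, following the Customer Domain example.
ZONES = [
    ("raw-zone", "RAW"),
    ("curated-zone", "CURATED"),
    ("data-products-zone", "CURATED"),
]


def lake_cmd(lake, location, display_name):
    return ["gcloud", "dataplex", "lakes", "create", lake,
            f"--location={location}", f"--display-name={display_name}"]


def zone_cmd(zone, lake, location, zone_type, schedule=None):
    cmd = ["gcloud", "dataplex", "zones", "create", zone,
           f"--lake={lake}", f"--location={location}",
           f"--type={zone_type}",
           "--resource-location-type=SINGLE_REGION"]
    if schedule:
        # Turn on metadata discovery with the given schedule.
        cmd += ["--discovery-enabled", f"--discovery-schedule={schedule}"]
    return cmd


def asset_cmd(asset, zone, lake, location, resource_type, resource_name):
    return ["gcloud", "dataplex", "assets", "create", asset,
            f"--lake={lake}", f"--zone={zone}", f"--location={location}",
            f"--resource-type={resource_type}",
            f"--resource-name={resource_name}",
            "--discovery-enabled"]


commands = [lake_cmd(LAKE, LOCATION, "Customer Domain")]
commands += [zone_cmd(z, LAKE, LOCATION, t, SCHEDULE) for z, t in ZONES]
commands.append(
    asset_cmd("customer-raw-data", "raw-zone", LAKE, LOCATION,
              "STORAGE_BUCKET", f"projects/{PROJECT}/buckets/customer-raw-data")
)

# Print shell-quoted commands for review; nothing is executed here.
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as an argument list makes it easy to pass to `subprocess.run` once the output has been reviewed, and changing `LOCATION` or `PROJECT` is enough to target another environment.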

Conclusion:

Google Cloud Dataplex provides an easy-to-use UI to create a logical organization aligned to the data platform architecture. The key point is that it does this without moving or duplicating any of the platform's data assets.

Dataplex also manages and fully automates ingestion of all the metadata into its built-in Data Catalog service, which we will dive into in the next blog post of this series. Stay tuned!

For more details please visit:

https://cloud.google.com/dataplex