Dataplex — Data Catalog | Auto Discovery and Metadata Harvesting | Part — 3.1

Nishit Kamdar
Google Cloud - Community
Oct 17, 2023

This is Part 3.1 of the Dataplex blog series, focused on Data Catalog. Part 1 and Part 2 provide an overview of the Dataplex service.

Background:

Data Catalog, previously an independent GCP service, is now integrated into the Dataplex platform to provide a fully managed data discovery and metadata management service that allows you to quickly discover, manage, and understand your data in Google Cloud.

Data Catalog uses the same search technology that supports Gmail and Google Drive, allowing you to quickly find data by table name, column name, or business metadata in tags using various filters.

A data catalog makes it easy for data scientists, data analysts, and other users to find and analyze datasets. By identifying, characterizing, and categorizing datasets, a data catalog maintains an organized inventory of data assets. It offers meaningful context that enables data consumers to search for and understand the right dataset in order to derive business value from it.

Business stakeholders such as data and business analysts use data catalogs extensively to locate and understand business data. Data catalogs also speed up collaboration by automating data management. The best-known data catalogs rely on crucial elements like data governance and data discovery that support a successful data strategy.

Let’s look at the key features of Dataplex — Data Catalog in detail:

1.0 Automated Metadata Harvesting:

Metadata harvesting is the automated collection of metadata descriptions from different sources to create useful aggregations of metadata and related services.

As covered in Part 2 of the blog series, while creating a zone or registering an asset in Dataplex, the following page is provisioned to capture the Discovery settings.

Based on the discovery settings and schedule, the Data Catalog harvester process extracts, standardizes, and indexes all the metadata of the data asset automatically to create a unified, searchable repository in the Data Catalog.
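The same Discovery settings can also be applied from the command line. A minimal sketch, assuming the lake and zone names used later in this series:

```shell
# Sketch: enable hourly metadata discovery on an existing Dataplex zone.
# The lake/zone names follow this series' examples; adjust for your setup.
LOCATION="us-central1"
LAKE_NM="oda-lake"
ZONE_NM="oda-raw-zone"
DISCOVERY_SCHEDULE="0 * * * *"   # cron syntax: top of every hour

gcloud dataplex zones update "$ZONE_NM" \
  --location="$LOCATION" \
  --lake="$LAKE_NM" \
  --discovery-enabled \
  --discovery-schedule="$DISCOVERY_SCHEDULE"
```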

Let’s see Automated Discovery in action.

1.1 Register BigQuery Assets

1.1.1 Create a “crimes” BigQuery Dataset and Table and Register it

# Assumes PROJECT_ID is already set (see the variables in 1.1.2)
bq --location=us-central1 mk \
  --dataset \
  $PROJECT_ID:oda_crimes_staging_ds

# The source table reference needs backticks because the project ID
# contains hyphens; single quotes keep the shell from interpreting them
bq --location=us-central1 query \
  --use_legacy_sql=false \
  'CREATE OR REPLACE TABLE oda_crimes_staging_ds.crimes_staging AS
   SELECT * FROM `bigquery-public-data.chicago_crime.crime`'

Reload the BigQuery UI and you should see the table created. Query the table:
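For example, a quick aggregate over the staged table (primary_type is a column in the public chicago_crime.crime table; your counts will differ as the source is refreshed):

```shell
# Sketch: spot-check the staged crimes table with a small aggregate query.
QUERY='SELECT primary_type, COUNT(*) AS cnt
FROM oda_crimes_staging_ds.crimes_staging
GROUP BY primary_type
ORDER BY cnt DESC
LIMIT 5'

bq query --use_legacy_sql=false "$QUERY"
```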

1.1.2 Register the crimes BigQuery Dataset as an asset into the RAW zone

PROJECT_ID=$(gcloud config list --format "value(core.project)" 2>/dev/null)
PROJECT_NBR=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")
UMSA_FQN="lab-sa@${PROJECT_ID}.iam.gserviceaccount.com"
LOCATION="us-central1"
METASTORE_NM="lab-dpms-$PROJECT_NBR"
LAKE_NM="oda-lake"
DATA_RAW_ZONE_NM="oda-raw-zone"
DATA_RAW_SENSITIVE_ZONE_NM="oda-raw-sensitive-zone"
DATA_CURATED_ZONE_NM="oda-curated-zone"
DATA_PRODUCT_ZONE_NM="oda-product-zone"
MISC_RAW_ZONE_NM="oda-misc-zone"
CRIMES_ASSET="chicago-crimes"
CRIMES_STAGING_DS="oda_crimes_staging_ds"
gcloud dataplex assets create $CRIMES_ASSET \
--location=$LOCATION \
--lake=$LAKE_NM \
--zone=$DATA_RAW_ZONE_NM \
--resource-type=BIGQUERY_DATASET \
--resource-name=projects/$PROJECT_ID/datasets/$CRIMES_STAGING_DS \
--discovery-enabled \
--discovery-schedule="0 * * * *" \
--display-name 'Chicago Crimes'

Once the Discovery job finishes, the above command automatically updates the asset details in Data Catalog. Go to Dataplex and verify:

Within the oda-raw-zone, Dataplex shows “Chicago Crimes” as one of the assets, of type BigQuery dataset.

Click on the Entities tab and it shows the BigQuery table under the new crimes dataset we created.

Click on the table and it shows all the technical metadata about the table that Dataplex has discovered and stored in the Data Catalog automatically.
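The harvested metadata can also be inspected from the CLI. A sketch, assuming the entity ID matches the table name (use the list command first to confirm the IDs in your environment):

```shell
# Sketch: inspect discovered entities and their schemas without the UI.
LOCATION="us-central1"
LAKE_NM="oda-lake"
ZONE_NM="oda-raw-zone"

# List the entities that discovery has registered in the zone
gcloud dataplex entities list \
  --location="$LOCATION" --lake="$LAKE_NM" --zone="$ZONE_NM"

# Show one entity including its discovered schema
# (entity ID assumed to match the table name)
gcloud dataplex entities describe crimes_staging \
  --location="$LOCATION" --lake="$LAKE_NM" --zone="$ZONE_NM" \
  --view=SCHEMA
```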

Let’s now register a GCS bucket in Dataplex.

1.2 Register Google Cloud Storage Assets

I have a GCS bucket which consists of the following datasets. Each of the folders has .csv data files in it.

Let’s register this bucket with the oda-raw-zone inside Dataplex as “Miscellaneous Datasets”.

gcloud dataplex assets create misc-datasets \
--location=$LOCATION \
--lake=$LAKE_NM \
--zone=$DATA_RAW_ZONE_NM \
--resource-type=STORAGE_BUCKET \
--resource-name=projects/$PROJECT_ID/buckets/raw-data-$PROJECT_NBR \
--discovery-enabled \
--discovery-schedule="0 * * * *" \
--display-name 'Miscellaneous Datasets'

1.2.1 Review the assets registered in the Dataplex UI

Navigate to Dataplex UI -> Manage -> ODA-LAKE -> ODA-RAW-ZONE and check that “Miscellaneous Datasets” is available.

Click on “Miscellaneous Datasets” and it shows all the technical metadata of the GCS bucket, including its location on GCS.
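The same asset details are available from the CLI. A sketch using the names from this series:

```shell
# Sketch: describe the registered GCS asset from the command line.
LOCATION="us-central1"
LAKE_NM="oda-lake"
ZONE_NM="oda-raw-zone"
ASSET_NM="misc-datasets"

gcloud dataplex assets describe "$ASSET_NM" \
  --location="$LOCATION" \
  --lake="$LAKE_NM" \
  --zone="$ZONE_NM"
```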

2.0 Automated Schema inference and external table creation

Along with pulling all the technical metadata of the GCS bucket, Data Catalog also infers the schema of the CSV files within the bucket and automatically creates external tables in the Metastore and BigQuery.

On the “Miscellaneous Datasets” page, click the “Entities” tab.

You will see each of the data files in the bucket listed as a table.

But these were just .csv files on GCS — how did they become a table?

Dataplex automatically infers the schema for objects in Cloud Storage and creates external table definitions in the Dataproc Metastore Service (Hive Metastore) and BigQuery.

Click on icecream_sales_forecasting. It shows the details of the external table registered in BigQuery and the Metastore.

Click on the Schema and Column Tags to see the inferred schema.

Let’s also check that the same table is available in BigQuery. Go to BigQuery and look for the table in the oda_raw_zone dataset.
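From the CLI, this check can be sketched as below. Dataplex publishes zone entities to a BigQuery dataset named after the zone, with hyphens converted to underscores (assumed here to be oda_raw_zone):

```shell
# Sketch: confirm the discovered external table is visible in BigQuery.
# Dataplex publishes zone entities to a dataset named after the zone,
# with hyphens converted to underscores.
ZONE_NM="oda-raw-zone"
BQ_DATASET=$(echo "$ZONE_NM" | tr '-' '_')   # -> oda_raw_zone

bq ls "$BQ_DATASET"
bq show --format=prettyjson "$BQ_DATASET.icecream_sales_forecasting"
```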

To summarize, Data Catalog’s Automated Discovery and Metadata Harvesting provides the following features:

1. Asset discovery

2. Schema inference for objects in Cloud Storage

3. External table definitions, based on the inferred schema, in the Dataproc Metastore Service (Hive Metastore)

4. External table definitions, based on the inferred schema, in BigQuery

5. Tables registered as Dataplex zone-level entities

6. Tables cataloged in Data Catalog and made searchable

3.0 Search

Dataplex provides an intuitive search interface to find assets by system, lake, zone, project, and more. Go to Dataplex → Search.

It also provides faceted search to enable attribute-based filtering.
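The same index can be queried from the CLI. A sketch using Data Catalog’s search syntax, which supports qualified predicates such as name:, column:, and type= (the column name below comes from the crimes table used earlier):

```shell
# Sketch: search the catalog from the command line.
PROJECT_ID=$(gcloud config get-value project 2>/dev/null)

# Find assets whose name contains "crimes"
gcloud data-catalog search "name:crimes" \
  --include-project-ids="$PROJECT_ID" \
  --limit=10

# Find tables that contain a column named primary_type
gcloud data-catalog search "column:primary_type type=table" \
  --include-project-ids="$PROJECT_ID"
```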

Conclusion:

Data Catalog is integrated into the Dataplex service and provides fully automated metadata discovery and harvesting, schema inference, and external table creation out of the box.

It also provides an easy-to-use search platform for business users and stewards to quickly find registered data assets across multiple data lakes and zones in the organization’s data estate.

For more details please visit:

https://cloud.google.com/dataplex

We will look at Data Catalog’s tagging features in the next post of this series.

Nishit Kamdar
Google Cloud - Community

Data and Artificial Intelligence specialist at Google. This blog is based on “My experiences from the field”. Views are solely mine.