Google Cloud Dataplex API: Automate your Data Lakes and Zones

Christian Silva
Google Cloud - Community
3 min read · Nov 21, 2022

With the rise in popularity of Data Mesh and Data Fabric, Google Cloud Dataplex is now an essential component: a single pane of glass to discover and govern all the data across lakes, data warehouses, and data marts.

Dataplex at a glance

However, a customer asked me: can I call an API to create all my lakes, zones, and assets in bulk? And can I automate this process, so that, for example, any time a new lake or zone is created, a new GCS bucket or BigQuery dataset is created along with it?

This is why I love working with customers: there is always a new challenge! The situation is as follows: imagine having thousands of buckets and datasets across multiple projects that you would like to attach to your Dataplex zones:

Imagine this ×1000!

This is where Dataplex’s CLI and REST APIs come in. In this post, we are going to get started with the latter.

Getting Started

In the following example, we will look up a GCS bucket's region and create a new lake in that region if one doesn't already exist.

The first step is to install the Python client library using pip (pro tip: you can do all of the following in a Jupyter notebook via Vertex AI Workbench).

pip install google-cloud-dataplex

The next step is to define a Python dict that will hold our base variables. Once we have defined it, we can initialise the environment:

#### Initialize the environment
from google.cloud import storage
from google.cloud import dataplex_v1
import os
# initialize the common variables
dataplex_dict = {
    "project": "{PROJECT_ID}",
    "region": "none",
    "gcs_bucket_name": "{SOURCE_BUCKET}",
    "zone_type": "RAW",
    "zone_location_type": "SINGLE_REGION",
    "zone_id": "{ZONE_ID}",
    "asset_type": "STORAGE_BUCKET",
    "asset_id": "{ASSET_ID}",
    "asset_name": "projects/{PROJECT_ID}/buckets/{SOURCE_BUCKET}",
    "bq_dataset": "none"
}
# authenticate using service account key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'
# obtain the bucket that we will attach to the zone
storage_client = storage.Client()
bucket = storage_client.get_bucket(dataplex_dict["gcs_bucket_name"])
# update the dict's region. This will be used to create a lake if one doesn't exist
dataplex_dict["region"] = bucket.location.lower()
# Create a dataplex client
dataplex_client = dataplex_v1.DataplexServiceClient()

Now that we have initialised our environment, we can get started with creating a new Lake if it doesn’t exist in the region:

### Step 1: Dataplex lake creation
# Set up the parent URL with format: projects/{PROJECT_ID}/locations/{REGION}
parent = "projects/" + dataplex_dict["project"] + "/locations/" + dataplex_dict["region"]
# Set up the lake_id for the new lake
lake_id = dataplex_dict["region"] + "-lake"
# List the lakes in this location and check whether ours already exists
lakes = dataplex_client.list_lakes(
    request=dataplex_v1.ListLakesRequest(parent=parent)
)
if parent + "/lakes/" + lake_id not in [lake.name for lake in lakes]:
    # Initialize request argument(s)
    request = dataplex_v1.CreateLakeRequest(
        parent=parent,
        lake_id=lake_id,
    )
    operation = dataplex_client.create_lake(request=request)
    print("Waiting for operation to complete...")
    # Handle the response
    response = operation.result()
    print(response)

Notice that the client we have created is a synchronous one. Therefore, we need to wait until the lake has been created before moving on to the next step. You can leverage the asynchronous client to cover async use cases.
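For example, here is a minimal sketch of the same lake creation using the asynchronous client (dataplex_v1.DataplexServiceAsyncClient), reusing the parent and lake_id values built above:

import asyncio
from google.cloud import dataplex_v1

async def create_lake_async(parent: str, lake_id: str):
    # The async client exposes the same surface, but calls are awaitable
    client = dataplex_v1.DataplexServiceAsyncClient()
    operation = await client.create_lake(
        request=dataplex_v1.CreateLakeRequest(parent=parent, lake_id=lake_id)
    )
    # Awaiting the long-running operation doesn't block other coroutines
    return await operation.result()

# response = asyncio.run(create_lake_async(parent, lake_id))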

### Step 2: Create a zone in the lake
# zone
zone = dataplex_v1.Zone()
zone.type_ = dataplex_dict["zone_type"]
zone.resource_spec.location_type = dataplex_dict["zone_location_type"]
zone.discovery_spec.enabled = True
zone_id = dataplex_dict["zone_id"]
zone_parent = parent + "/lakes/" + lake_id
request = dataplex_v1.CreateZoneRequest(
    parent=zone_parent,
    zone_id=zone_id,
    zone=zone,
)
# Make the request
operation = dataplex_client.create_zone(request=request)
print("Waiting for operation to complete...")
# Handle the response
response = operation.result()
print(response)

Finally, we will attach our GCS bucket and BQ dataset as assets to the zone!

### Step 3: Add an asset to the zone
asset_parent = zone_parent + "/zones/" + zone_id
# bucket asset
asset = dataplex_v1.Asset()
asset.resource_spec.type_ = dataplex_dict["asset_type"]
asset.resource_spec.name = dataplex_dict["asset_name"]
asset_id = dataplex_dict["asset_id"]
request = dataplex_v1.CreateAssetRequest(
    parent=asset_parent,
    asset_id=asset_id,
    asset=asset,
)
# Make the request
operation = dataplex_client.create_asset(request=request)
print("Waiting for operation to complete...")
# Handle the response
response = operation.result()
print(response)
### Step 4: Attach a BQ dataset to the zone
# create a BQ asset
asset = dataplex_v1.Asset()
asset.resource_spec.type_ = "BIGQUERY_DATASET"
asset.resource_spec.name = "projects/" + dataplex_dict["project"] + "/datasets/product_dataplex_eu"
asset_id = "product-transactions-bq"
asset_parent = zone_parent + "/zones/" + zone_id
request = dataplex_v1.CreateAssetRequest(
    parent=asset_parent,
    asset_id=asset_id,
    asset=asset,
)
# Make the request
operation = dataplex_client.create_asset(request=request)
print("Waiting for operation to complete...")
# Handle the response
response = operation.result()
print(response)

From here, the use cases are endless.
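For instance, going back to the original ask of attaching thousands of buckets: here is a rough sketch that loops over every bucket in a project and attaches each one to a zone in its region, reusing the storage_client and dataplex_client from earlier. The "<region>-lake" naming and the use of the bucket name as the asset ID are assumptions for illustration:

# Hypothetical bulk sketch: attach every bucket in a project to a zone,
# assuming a "<region>-lake" lake and a "{ZONE_ID}" zone already exist per region
project_id = "{PROJECT_ID}"   # placeholder, as in the snippets above
zone_id = "{ZONE_ID}"         # placeholder, as in the snippets above
for bucket in storage_client.list_buckets():
    region = bucket.location.lower()
    asset_parent = ("projects/" + project_id + "/locations/" + region +
                    "/lakes/" + region + "-lake/zones/" + zone_id)
    asset = dataplex_v1.Asset()
    asset.resource_spec.type_ = "STORAGE_BUCKET"
    asset.resource_spec.name = "projects/" + project_id + "/buckets/" + bucket.name
    request = dataplex_v1.CreateAssetRequest(
        parent=asset_parent,
        asset_id=bucket.name,  # assumes the bucket name is a valid asset ID
        asset=asset,
    )
    # Wait for each asset to be created; this could be parallelised
    dataplex_client.create_asset(request=request).result()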

Conclusion

Dataplex provides a REST API to automate the creation of all your lakes and zones at scale. Here is the documentation reference for you to get creative!

Happy automating.
