Google Cloud Dataplex - Part 2 - Dataplex gCloud CLI
Introduction
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and progressing to gradually more complex tasks. This is part 2 of the series, aimed at L100 practitioners who would like to get started with Dataplex.
Here’s what the series looks like so far:
Part 1 : Google Cloud Dataplex -Part 1-Lakes, Zones, Assets and discovery
Part 2 : Google Cloud Dataplex — Part 2-gCloud CLI for Dataplex (Current post)
In this post, I will demonstrate how the gCloud CLI works with Dataplex. The last post explained how to create a Dataplex lake, zone and asset with Google Cloud Storage (GCS) objects using the Google Cloud console. This post will be heavy on gCloud CLI commands.
gCloud CLI on Google Cloud Shell
As a first step, let us enable required Google Cloud APIs using Google Cloud Shell.
Open Google Cloud Shell by clicking the “terminal” icon in the top-right corner, just before the “bell / notification” icon.
This opens the Google Cloud Shell in the bottom pane.
We need to enable Dataproc Metastore, Dataplex and BigQuery APIs.
Type the following commands to enable these services :
gcloud services enable metastore.googleapis.com
gcloud services enable dataplex.googleapis.com
gcloud services enable bigquery.googleapis.com
You may want to set the project ID in Cloud Shell before you proceed:
gcloud config set project <YOUR_PROJECT_ID>
Or, you could run each of the three API enable commands with the project ID appended:
gcloud services enable bigquery.googleapis.com --project=<PROJ_ID>
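As a side note, gcloud services enable accepts multiple services in a single call, so the three commands above can be combined; you can then confirm the APIs are active by filtering the enabled-services list:

```shell
# Enable all three APIs in one call (equivalent to the three separate commands).
gcloud services enable \
  metastore.googleapis.com \
  dataplex.googleapis.com \
  bigquery.googleapis.com \
  --project=<PROJ_ID>

# Confirm that the Dataplex API now shows up among the enabled services.
gcloud services list --enabled \
  --project=<PROJ_ID> \
  --filter="name:dataplex.googleapis.com"
```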
Create a Dataproc Metastore service using gCloud CLI
Now that the services are enabled, we need to create a Google Cloud Dataproc Metastore service so that the Dataplex metadata can be accessed. Type the following gCloud command in Cloud Shell to create a gRPC-enabled Dataproc Metastore service.
If you have already created the service following Part 1 of this series, you can skip this step (it took around 15 minutes to complete for me).
gcloud beta metastore services create dpms-mesh-1 \
--location=us-central1 \
--hive-metastore-version=3.1.2 \
--endpoint-protocol=GRPC
Once the Dataproc Metastore service “dpms-mesh-1” is created, we will move forward by creating a Dataplex lake in the next section.
You could also run the following gCloud CLI command in Cloud Shell to check whether the Dataproc Metastore service was created successfully:
gcloud metastore services describe dpms-mesh-1 \
--project <YOUR_PROJECT_ID> \
--location us-central1 \
--format "value(endpointUri)"
This should return the API endpoint of the metastore service.
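If you plan to script against the metastore, the same describe command can capture the endpoint in a shell variable. This is a small sketch; the service name and region match the ones used above:

```shell
# Store the metastore's gRPC endpoint in a variable for reuse in later commands.
ENDPOINT=$(gcloud metastore services describe dpms-mesh-1 \
  --project=<YOUR_PROJECT_ID> \
  --location=us-central1 \
  --format="value(endpointUri)")
echo "Metastore endpoint: ${ENDPOINT}"
```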
Create a Dataplex Lake using gCloud CLI
If you are not sure what a Dataplex Lake is, please read Part 1 of this series. We will create a Dataplex lake using the following gCloud CLI command. If you already created a Dataplex lake in Part 1, feel free to skip this step.
gcloud dataplex lakes create marketing1 \
--project <YOUR_PROJECT_ID> \
--location=us-central1 \
--metastore-service=projects/<YOUR_PROJECT_ID>/locations/us-central1/services/dpms-mesh-1
Once the command returns, you should see that the Dataplex lake has been created:
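You can also verify the lake from the CLI; once provisioning finishes, the lake's state should be reported as ACTIVE:

```shell
# Describe the lake and print only its state field.
gcloud dataplex lakes describe marketing1 \
  --project=<YOUR_PROJECT_ID> \
  --location=us-central1 \
  --format="value(state)"
```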
Load Data Into BigQuery using gCloud CLI
So far, we have used gCloud CLI to create the Dataplex lake. Now, we will create a Google BigQuery dataset using gCloud CLI and load some test data into the dataset for Dataplex to discover the schema of the BigQuery table.
Let’s first create a BigQuery dataset within our project using the following bq command in Cloud Shell:
bq mk -d --location=us-central1 --project_id=<PROJ_ID> baby_names
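To confirm the dataset was created in the right location, you can describe it with bq show:

```shell
# Print the dataset's metadata (location, creation time, access controls) as JSON.
bq show --format=prettyjson <PROJ_ID>:baby_names
```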
Once the BigQuery dataset is created, we will download the US Social Security Administration’s baby names dataset from the following URL:
https://www.ssa.gov/OACT/babynames/names.zip
Let’s use the curl command in Cloud Shell to download the zip file:
curl -O https://www.ssa.gov/OACT/babynames/names.zip
If you extract ‘names.zip’, it produces one file of baby names per year, from 1880 to 2021.
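The extraction and a quick peek at the 2010 file can be done directly in Cloud Shell; each line is a comma-separated record of name, gender and count:

```shell
# Extract the per-year files into the current directory.
unzip -o names.zip

# Preview the first few records of the 2010 file (format: name,gender,count).
head -n 5 yob2010.txt
```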
Let us load the file ‘yob2010.txt’ into a BigQuery table in the baby_names dataset. First, let’s create a table with the right schema:
bq mk --schema=name:string,gender:string,count:integer --table <PROJ_ID>:baby_names.names2010
This will create an empty table ‘names2010’ in BigQuery with the supplied schema:
Once the table is successfully created, we can load ‘yob2010.txt’ into it by running the following command in Cloud Shell:
bq load --project_id=<PROJ_ID> <PROJ_ID>:baby_names.names2010 yob2010.txt
This will create a BigQuery load job, and the file will be loaded into the table we created:
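A quick sanity check on the load is to count the rows with a bq query:

```shell
# Count the rows loaded into the new table (standard SQL).
bq query --project_id=<PROJ_ID> --use_legacy_sql=false \
  'SELECT COUNT(*) AS total_names FROM baby_names.names2010'
```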
So, with all the steps above, we have successfully created a BigQuery dataset, created a table with a custom schema and loaded a flat file into the table, using CLI commands. In the next section, we will create Dataplex zones and assets to discover and harvest the metadata using Dataplex and gCloud CLI commands.
Dataplex asset creation using gCloud CLI
We already have our Dataplex Lake ‘marketing1’ created.
Let’s create the Dataplex zone first, as a curated zone, considering that the dataset is already cleaned and sanitized and is loaded into BigQuery.
gcloud dataplex zones create marketing1curated \
--lake=marketing1 \
--location=us-central1 \
--project=<PROJ_ID> \
--resource-location-type=SINGLE_REGION \
--type=CURATED
It will be worthwhile to go back and check the steps that were followed to create a RAW Dataplex zone in Part 1 of the tutorial, here.
Once the zone gets created, confirm using the console that it was created within the lake ‘marketing1’:
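Alternatively, the zone can be verified from the CLI as well:

```shell
# Describe the curated zone; its state field should read ACTIVE once ready.
gcloud dataplex zones describe marketing1curated \
  --lake=marketing1 \
  --location=us-central1 \
  --project=<PROJ_ID>
```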
Once the zone is created successfully, let’s add the BigQuery dataset as an asset in the zone. We will set the resource type to a BigQuery dataset within the zone that we just created:
gcloud dataplex assets create babynamestab \
--location=us-central1 \
--lake=marketing1 \
--zone=marketing1curated \
--resource-type=BIGQUERY_DATASET \
--resource-name=projects/<PROJ_ID>/datasets/baby_names \
--discovery-enabled \
--project=<PROJ_ID>
See how the asset ‘resource-type’ and ‘resource-name’ have been specified. The ‘discovery-enabled’ option runs the discovery job. We could also use the ‘discovery-schedule’ option, which takes a CRON expression to run discovery jobs on a preset schedule.
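As a hypothetical variant, the asset could be updated to run discovery on a recurring schedule instead; the cron expression below (every 12 hours) is just an example value:

```shell
# Switch the asset's discovery from on-demand to a recurring 12-hour schedule.
gcloud dataplex assets update babynamestab \
  --location=us-central1 \
  --lake=marketing1 \
  --zone=marketing1curated \
  --project=<PROJ_ID> \
  --discovery-schedule="0 */12 * * *"
```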
It is similar to the steps that we have followed using the console in Part 1 of the series. This creates the Dataplex asset ‘babynamestab’ as shown below :
Once the asset is created, we can click the ‘Discover’ menu on the left and see that the BigQuery table has been discovered by Dataplex:
Notice the hierarchy: Dataplex Lake (marketing1) > Dataplex Curated Zone (marketing1curated) > Dataplex discovered table (names2010).
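The discovered metadata can also be listed from the CLI; assuming your gcloud version ships the Dataplex metadata commands, the entities in the zone can be enumerated like this:

```shell
# List the metadata entities (discovered tables) in the curated zone.
gcloud dataplex entities list \
  --lake=marketing1 \
  --zone=marketing1curated \
  --location=us-central1 \
  --project=<PROJ_ID>
```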
So, to summarize, in this Part 2 of the Dataplex series, we tested gCloud CLI commands for Dataplex, creating lakes, zones and assets. We also took on a new data source with BigQuery, since in Part 1 we used Google Cloud Storage as the data source in a raw Dataplex zone.
In Part 3, we’ll go deeper into Dataplex security and understand the nuances of securing tables discovered by Dataplex.