Google Cloud Dataplex - Part 2 - Dataplex gCloud CLI

Introduction

Disclosure: All opinions expressed in this article are my own and do not represent those of my current or any previous employers.

I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and gradually moving to more complex tasks. This is Part 2 of the series, aimed at L100 practitioners who would like to get started with Dataplex.

Here’s what the series looks like so far :

Part 1 : Google Cloud Dataplex -Part 1-Lakes, Zones, Assets and discovery

Part 2 : Google Cloud Dataplex — Part 2-gCloud CLI for Dataplex (Current post)

In this post, I will demonstrate how the gCloud CLI works with Dataplex. The last post explained how to create a Dataplex lake, zone and asset with Google Cloud Storage (GCS) objects using the Google Cloud console. This post will be heavy on gCloud CLI commands.

gCloud CLI on Google Cloud Shell

As a first step, let us enable required Google Cloud APIs using Google Cloud Shell.

Open Google Cloud Shell by clicking the “terminal” icon in the top-right corner, just before the “bell / notification” icon.

Open Google Cloud Shell

This opens Google Cloud Shell in the bottom pane.

We need to enable the Dataproc Metastore, Dataplex and BigQuery APIs.

Type the following commands to enable these services :

gcloud services enable metastore.googleapis.com
gcloud services enable dataplex.googleapis.com
gcloud services enable bigquery.googleapis.com

You may decide to set the project id in Cloud Shell before you proceed :

gcloud config set project <YOUR_PROJECT_ID>

Alternatively, you could append the project ID to each of the three API enable commands :

gcloud services enable bigquery.googleapis.com --project=<PROJ_ID>
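As a possible shortcut, `gcloud services enable` accepts several service names in a single invocation, so the three calls can be folded into one. The sketch below wraps that in a small shell function. The function name `enable_dataplex_apis` is my own illustration and is not part of the gcloud CLI.

```shell
# Hypothetical helper (the name is illustrative, not a gcloud feature):
# enables the three APIs used in this post with one gcloud invocation.
enable_dataplex_apis() {
  local project_id="$1"   # the project ID you would otherwise pass as <PROJ_ID>
  gcloud services enable \
    metastore.googleapis.com \
    dataplex.googleapis.com \
    bigquery.googleapis.com \
    --project="${project_id}"
}

# Usage: enable_dataplex_apis <PROJ_ID>
```

Defining a function rather than running the command directly keeps the project ID in one place if you need to repeat the step for another project.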

Create a Dataproc Metastore service using gCloud CLI

Now that the services are enabled, we need to create a Google Cloud Dataproc Metastore service so that the Dataplex metadata can be accessed. Type the following gCloud command in Cloud Shell to create a gRPC-enabled Dataproc Metastore service.

If you have already created the service by following Part 1 of this series, you can skip this step. (This step took around 15 minutes to complete for me.)

gcloud beta metastore services create dpms-mesh-1 \
--location=us-central1 \
--hive-metastore-version=3.1.2 \
--endpoint-protocol=GRPC

Once the Dataproc Metastore service “dpms-mesh-1” is created, we will move forward by creating a Dataplex lake in the next section.

Check Dataproc Metastore Service got created

You could also run the following gCloud CLI command in Cloud Shell to check whether the Dataproc Metastore service was created successfully :

gcloud metastore services describe dpms-mesh-1 \
--project <YOUR_PROJECT_ID> \
--location us-central1 \
--format "value(endpointUri)"

This should return the API endpoint of the metastore service.
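Since service creation can take several minutes, it may help to poll until the service is ready before moving on. The following is a minimal sketch, assuming the `describe` output exposes a `state` field that reads `ACTIVE` once the service is usable. The function name, the attempt count and the delay are my own illustrative choices.

```shell
# Hypothetical helper: polls the metastore service until its state is ACTIVE.
# Assumes the describe output exposes a "state" field.
wait_for_metastore() {
  local service="$1" location="$2" project_id="$3"
  local max_attempts="${4:-40}" delay_seconds="${5:-30}"
  local attempt state
  for attempt in $(seq 1 "${max_attempts}"); do
    state="$(gcloud metastore services describe "${service}" \
      --project "${project_id}" \
      --location "${location}" \
      --format "value(state)")"
    echo "attempt ${attempt}: state=${state}"
    if [ "${state}" = "ACTIVE" ]; then
      return 0
    fi
    sleep "${delay_seconds}"
  done
  echo "service ${service} did not become ACTIVE in time" >&2
  return 1
}

# Usage: wait_for_metastore dpms-mesh-1 us-central1 <YOUR_PROJECT_ID>
```

With the defaults, the loop gives up after roughly 20 minutes, which leaves some headroom over the 15 minutes the creation took for me.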

Create a Dataplex Lake using gCloud CLI

If you are not sure what a Dataplex lake is, please read Part 1 of this series. We will create a Dataplex lake using the following gCloud CLI command. If you have already created a Dataplex lake in Part 1, feel free to skip this step.

gcloud dataplex lakes create marketing1 \
--project <YOUR_PROJECT_ID> \
--location=us-central1 \
--metastore-service=projects/<YOUR_PROJECT_ID>/locations/us-central1/services/dpms-mesh-1

Once the command returns, you should see that the Dataplex lake is created :

Dataplex Lake created using gCloud CLI command
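The `--metastore-service` value in the lake creation command follows the pattern projects/PROJECT/locations/LOCATION/services/SERVICE. A small sketch for building that string, so the project ID is typed only once, could look like this. The helper name `metastore_resource_name` is hypothetical.

```shell
# Hypothetical helper: prints the full resource name of a Dataproc Metastore
# service, in the form passed to --metastore-service.
metastore_resource_name() {
  local project_id="$1" location="$2" service="$3"
  printf 'projects/%s/locations/%s/services/%s\n' \
    "${project_id}" "${location}" "${service}"
}

# Usage:
#   gcloud dataplex lakes create marketing1 \
#     --project <YOUR_PROJECT_ID> \
#     --location=us-central1 \
#     --metastore-service="$(metastore_resource_name <YOUR_PROJECT_ID> us-central1 dpms-mesh-1)"
```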

Load Data Into BigQuery using gCloud CLI

So far, we have used gCloud CLI to create the Dataplex lake. Now, we will create a Google BigQuery dataset using gCloud CLI and load some test data into the dataset for Dataplex to discover the schema of the BigQuery table.

Let’s first create a BigQuery dataset within our project using the following bq command (the bq tool ships with the Google Cloud CLI) in Cloud Shell :

bq mk -d --location=us-central1 --project_id=<PROJ_ID> baby_names
Create BigQuery Dataset

Once the BigQuery dataset is created, we will download the US Social Security Administration’s dataset from the following URL :

https://www.ssa.gov/OACT/babynames/names.zip

Let’s use the curl command to download the zip file, from Cloud Shell :

curl -O https://www.ssa.gov/OACT/babynames/names.zip

If you extract the ‘names.zip’ file, it produces many baby-name files, one per year, from 1880 to 2021.
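The extraction itself can be done from Cloud Shell. The sketch below pulls a single year's file out of the archive and does a quick sanity check on it, assuming each line has the form name,gender,count (the layout that the table schema in this post relies on). Both function names are my own, and the check is only a convenience, not a required step.

```shell
# Hypothetical helpers for the downloaded archive.

# Extracts a single year's file (for example yob2010.txt) from names.zip
# in the current directory, overwriting any earlier copy.
extract_year_file() {
  local year="$1"
  unzip -o names.zip "yob${year}.txt"
}

# Succeeds only if every line has exactly three comma-separated fields
# and the third field is a whole number. A trailing carriage return is
# ignored in case the file uses Windows line endings.
check_year_file() {
  local file="$1"
  awk -F',' '
    { sub(/\r$/, "") }
    NF != 3 || $3 !~ /^[0-9]+$/ { bad = 1 }
    END { exit bad + 0 }
  ' "${file}"
}

# Usage:
#   extract_year_file 2010
#   check_year_file yob2010.txt && echo "file looks loadable"
```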

Let us load the file ‘yob2010.txt’ in a BigQuery table in the baby_names dataset. First, let’s create a table with the right schema :

bq mk --schema=name:string,gender:string,count:integer --table <PROJ_ID>:baby_names.names2010

This will create an empty table ‘names2010’ in BigQuery with the schema supplied :

BigQuery table created using gCloud CLI

Once the table is successfully created, we can load ‘yob2010.txt’ into it by running the following bq command in Cloud Shell :

bq load --project_id=<PROJ_ID> <PROJ_ID>:baby_names.names2010 yob2010.txt

This will create a bq job and the file will be loaded in the table that we created :

Inserted records into BigQuery table using gCloud CLI
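To confirm from the command line that every record arrived, one option is to compare the number of lines in the local file with a row count from the table. This is a rough sketch that assumes one record per line and that `bq` prints a header line followed by the count when `--format=csv` is used. The function name `verify_row_count` is hypothetical.

```shell
# Hypothetical helper: compares the number of lines in the local file with
# the number of rows in the BigQuery table. Assumes one record per line.
verify_row_count() {
  local project_id="$1" dataset="$2" table="$3" file="$4"
  local sql local_rows table_rows
  # Standard SQL needs backticks around a project-qualified table name.
  sql="$(printf 'SELECT COUNT(*) FROM `%s.%s.%s`' \
    "${project_id}" "${dataset}" "${table}")"
  # grep -c '' counts lines, including a final line without a newline.
  local_rows="$(grep -c '' "${file}")"
  # With --format=csv the last line of the output is the count itself.
  table_rows="$(bq --quiet --project_id="${project_id}" --format=csv query \
    --use_legacy_sql=false "${sql}" | tail -n 1 | tr -d '[:space:]')"
  echo "local=${local_rows} table=${table_rows}"
  [ "${local_rows}" = "${table_rows}" ]
}

# Usage: verify_row_count <PROJ_ID> baby_names names2010 yob2010.txt
```

The function returns a non-zero status when the two counts differ, so it can be chained with `&&` in a longer script.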

So, with all the steps above, we have successfully created a BigQuery dataset, created a table with a custom schema and loaded a flat file into the table, all from the command line. In the next section, we will create Dataplex zones and assets to discover and harvest the metadata using Dataplex and gCloud CLI commands.

Dataplex asset creation using gCloud CLI

We already have our Dataplex Lake ‘marketing1’ created.

Let’s create the Dataplex zone first. We will make it a curated zone, since the dataset is already cleaned and sanitized and is loaded into BigQuery.

gcloud dataplex zones create marketing1curated \
--lake=marketing1 \
--location=us-central1 \
--project=<PROJ_ID> \
--resource-location-type=SINGLE_REGION \
--type=CURATED

It is worthwhile to go back and check the steps that were followed to create a RAW Dataplex zone in Part 1 of this series.

Once the zone is created, confirm in the console that it sits within the lake ‘marketing1’ :

Dataplex zone created with gCloud CLI

Once the zone is created successfully, let’s add the BigQuery dataset as an asset in the zone. We will select the resource type as a BigQuery dataset within the zone that we just created :

gcloud dataplex assets create babynamestab \
--location=us-central1 \
--lake=marketing1 \
--zone=marketing1curated \
--resource-type=BIGQUERY_DATASET \
--resource-name=projects/<PROJ_ID>/datasets/baby_names \
--discovery-enabled \
--project=<PROJ_ID>

Note how the asset’s ‘resource-type’ and ‘resource-name’ have been set. The ‘discovery-enabled’ option turns on discovery for the asset. We could also have used the ‘discovery-schedule’ option, which takes a cron expression, to run discovery jobs on a preset schedule.
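As an illustration of the scheduling option, the asset creation command could be wrapped as follows. This is a sketch of an alternative form of the same command, not an additional step. The function name is hypothetical and the cron value in the usage line is only an example. The lake, zone, location and dataset names are the ones used in this post.

```shell
# Hypothetical helper: creates the BigQuery dataset asset with discovery
# running on a cron schedule supplied by the caller.
create_asset_with_schedule() {
  local project_id="$1" asset="$2" cron="$3"
  gcloud dataplex assets create "${asset}" \
    --location=us-central1 \
    --lake=marketing1 \
    --zone=marketing1curated \
    --resource-type=BIGQUERY_DATASET \
    --resource-name="projects/${project_id}/datasets/baby_names" \
    --discovery-enabled \
    --discovery-schedule="${cron}" \
    --project="${project_id}"
}

# Usage (example schedule: every day at 02:00):
#   create_asset_with_schedule <PROJ_ID> babynamestab "0 2 * * *"
```

Quoting the cron expression matters here, because an unquoted `*` would be expanded by the shell into file names.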

Creating the asset this way is similar to the steps that we followed using the console in Part 1 of the series. It creates the Dataplex asset ‘babynamestab’ as shown below :

Dataplex asset created using gCloud CLI

Once the asset is created, we can click ‘Discover’ in the left-hand menu and see that the BigQuery table has been discovered by Dataplex :

Dataplex discovers the BigQuery table

Notice the hierarchy of a Dataplex Lake (marketing1) > Dataplex Curated Zone (marketing1curated) > Dataplex discovered table (names2010)
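Part of this hierarchy can also be inspected from Cloud Shell. The sketch below lists the zones in a lake and the assets in a zone. It covers only the lake > zone > asset levels, and the discovered table itself is easier to see in the console. The function name `show_lake_hierarchy` is my own.

```shell
# Hypothetical helper: lists the zones of a lake and the assets of one zone.
show_lake_hierarchy() {
  local project_id="$1" location="$2" lake="$3" zone="$4"
  echo "Zones in lake ${lake}:"
  gcloud dataplex zones list \
    --project="${project_id}" \
    --location="${location}" \
    --lake="${lake}"
  echo "Assets in zone ${zone}:"
  gcloud dataplex assets list \
    --project="${project_id}" \
    --location="${location}" \
    --lake="${lake}" \
    --zone="${zone}"
}

# Usage: show_lake_hierarchy <PROJ_ID> us-central1 marketing1 marketing1curated
```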

So, to summarize, in this Part 2 of the Dataplex series, we used gCloud CLI commands to create Dataplex lakes, zones and assets. We also used a new data source, BigQuery, since Part 1 used Google Cloud Storage as the data source in a raw Dataplex zone.

In Part 3, we’ll go deeper into Dataplex security and understand the nuances of securing tables discovered by Dataplex.

Useful Links :

https://cloud.google.com/sdk/gcloud/reference/dataplex

Diptiman Raichaudhuri

Cloud Data platform specialist. Working closely with developers to design and build data platforms on public cloud using open source tools.