Google Cloud Dataplex - Part 2 - Dataplex gCloud CLI

Introduction

Diptiman Raichaudhuri
Google Cloud - Community
6 min read · Jun 16, 2022


Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and eventually moving on to gradually more complex tasks. This is Part 2 of the series, aimed at L100 practitioners who would like to get started with Dataplex.

Here’s how the series looks so far:

Part 1: Google Cloud Dataplex - Part 1 - Lakes, Zones, Assets and discovery

Part 2: Google Cloud Dataplex - Part 2 - gCloud CLI for Dataplex (current post)

In this post, I will demonstrate how the gCloud CLI works with Dataplex. The last post explained how to create a Dataplex lake, zone and asset with Google Cloud Storage (GCS) objects using the Google Cloud console. This post will be heavy on gCloud CLI commands.

gCloud CLI on Google Cloud Shell

As a first step, let us enable the required Google Cloud APIs using Google Cloud Shell.

Open Google Cloud Shell by clicking the “terminal” icon in the top-right corner, just before the “bell / notification” icon.

Open Google Cloud Shell

This opens Google Cloud Shell in the bottom pane.

We need to enable the Dataproc Metastore, Dataplex and BigQuery APIs.

Type the following commands to enable these services:
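These are the public service names for Dataproc Metastore, Dataplex and BigQuery:

gcloud services enable metastore.googleapis.com
gcloud services enable dataplex.googleapis.com
gcloud services enable bigquery.googleapis.com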

You may want to set the project ID in Cloud Shell before you proceed:
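For example (YOUR_PROJECT_ID is a placeholder for your own project ID):

gcloud config set project YOUR_PROJECT_ID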

Alternatively, you can run the three API enable commands by appending the project ID to each command:
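Along these lines, using the --project flag (gcloud services enable also accepts all three services in one invocation):

gcloud services enable metastore.googleapis.com dataplex.googleapis.com bigquery.googleapis.com --project=YOUR_PROJECT_ID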

Create a Dataproc Metastore service using gCloud CLI

Now that the services are enabled, we need to create a Google Cloud Dataproc Metastore so that the Dataplex metadata can be accessed. Type the following gCloud command in Cloud Shell to create a gRPC-enabled Dataproc Metastore service.
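A sketch of the create command; the service name ‘dpms-mesh-1’ is the one used in this series, while the region, tier and Hive Metastore version shown here are assumptions you should adapt to your setup:

# Create a gRPC-enabled Dataproc Metastore service
# (region, tier and Hive version below are assumptions)
gcloud metastore services create dpms-mesh-1 \
    --location=us-central1 \
    --tier=DEVELOPER \
    --endpoint-protocol=GRPC \
    --hive-metastore-version=3.1.2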

If you have already created the service by following Part 1 of this series, you can skip this step. (This step took around 15 minutes to complete for me.)

Once the Dataproc Metastore service “dpms-mesh-1” is created, we will move forward by creating a Dataplex lake in the next section.

Check that the Dataproc Metastore service was created

You can also run the following gCloud CLI command in Cloud Shell to check whether the Dataproc Metastore service was created successfully:
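For example, assuming the service was created in us-central1:

gcloud metastore services describe dpms-mesh-1 \
    --location=us-central1 \
    --format="value(endpointUri)"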

This should return the API endpoint of the metastore service.

Create a Dataplex Lake using gCloud CLI

I hope you know what a Dataplex lake is; if not, please read Part 1 of this series. We will create a Dataplex lake using the following gCloud CLI command. If you already created a Dataplex lake in Part 1, feel free to skip this step.
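A sketch of the lake creation, assuming the same region as the metastore; the --metastore-service flag attaches the Dataproc Metastore service we created above (YOUR_PROJECT_ID is a placeholder):

gcloud dataplex lakes create marketing1 \
    --location=us-central1 \
    --metastore-service=projects/YOUR_PROJECT_ID/locations/us-central1/services/dpms-mesh-1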

Once the command returns, you should see that the Dataplex lake has been created:

Dataplex Lake created using gCloud CLI command

Load Data Into BigQuery using gCloud CLI

So far, we have used the gCloud CLI to create the Dataplex lake. Now we will create a BigQuery dataset and load some test data into it, so that Dataplex can discover the schema of the BigQuery table.

Let’s first create a BigQuery dataset within our project by running the following command in Cloud Shell (bq is part of the Google Cloud SDK):
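A minimal sketch; the dataset name ‘baby_names’ is the one used later in this post, and the US multi-region location is an assumption:

bq --location=US mk --dataset baby_names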

Create BigQuery Dataset

Once the BigQuery dataset is created, we will download the US Social Security Administration’s baby names dataset.

Let’s use the curl command to download the zip file from Cloud Shell:
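A sketch of the download and extraction steps; the URL shown is the SSA’s published location for the national baby names archive:

# Download the SSA national baby names archive and extract it into ./names
curl -O https://www.ssa.gov/oact/babynames/names.zip
unzip names.zip -d names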

If you extract the ‘names.zip’ file, it will produce many files of baby names, one per year, from 1880 to 2021.

Let us load the file ‘yob2010.txt’ into a BigQuery table in the baby_names dataset. First, let’s create a table with the right schema:
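A sketch of the table creation; the three-column schema (name, gender, count) matches the layout of the yob files, though the exact column names are my own choice:

bq mk --table baby_names.names2010 name:STRING,gender:STRING,count:INTEGER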

This will create an empty table ‘names2010’ in BigQuery with the supplied schema:

BigQuery table created using gCloud CLI

Once the table is successfully created, we can load ‘yob2010.txt’ into it by running the following command in Cloud Shell:
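A sketch of the load command, assuming the archive was extracted into a ‘names’ directory and the table schema created above:

bq load --source_format=CSV baby_names.names2010 ./names/yob2010.txt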

This will create a BigQuery load job, and the file will be loaded into the table we created:

Inserted records into BigQuery table using gCloud CLI

So, with all the steps above, we have successfully created a BigQuery dataset, created a table with a custom schema and loaded a flat file into that table, all from the command line. In the next section, we will create a Dataplex zone and asset to discover and harvest the metadata using Dataplex and gCloud CLI commands.

Dataplex asset creation using gCloud CLI

We already have our Dataplex Lake ‘marketing1’ created.

Let’s create the Dataplex zone first. We will create it as a curated zone, since the dataset is already cleaned and sanitized and is loaded into BigQuery.
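A sketch of the zone creation; the zone name ‘marketing1curated’ is the one that appears below, while the region and the single-region resource location are assumptions:

gcloud dataplex zones create marketing1curated \
    --lake=marketing1 \
    --location=us-central1 \
    --type=CURATED \
    --resource-location-type=SINGLE_REGION \
    --discovery-enabled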

It will be worthwhile to go back and review the steps that were followed to create a RAW Dataplex zone in Part 1 of this series.

Once the zone is created, confirm in the console that it was created within the lake ‘marketing1’:

Dataplex zone created with gCloud CLI

Once the zone is created successfully, let’s add the BigQuery dataset as an asset in the zone, setting the resource type to BigQuery dataset within the zone that we just created:
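A sketch of the asset creation; the asset name ‘babynamestab’ is the one shown below, and YOUR_PROJECT_ID is a placeholder:

gcloud dataplex assets create babynamestab \
    --lake=marketing1 \
    --zone=marketing1curated \
    --location=us-central1 \
    --resource-type=BIGQUERY_DATASET \
    --resource-name=projects/YOUR_PROJECT_ID/datasets/baby_names \
    --discovery-enabled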

See how the asset ‘resource-type’ and ‘resource-name’ have been specified. The ‘discovery-enabled’ option runs the discovery job. We could also have used the ‘discovery-schedule’ option, which takes a CRON expression to run discovery jobs on a preset schedule.

This is similar to the steps we followed using the console in Part 1 of the series. The command creates the Dataplex asset ‘babynamestab’, as shown below:

Dataplex asset created using gCloud CLI

Once the asset is created, we can click ‘Discover’ in the left-hand menu and see that the BigQuery table has been discovered by Dataplex:

Dataplex discovers the BigQuery table

Notice the hierarchy: Dataplex lake (marketing1) > Dataplex curated zone (marketing1curated) > Dataplex discovered table (names2010).

So, to summarize: in this Part 2 of the Dataplex series, we used gCloud CLI commands to create Dataplex lakes, zones and assets. We also worked with a new data source, BigQuery, whereas in Part 1 we used Google Cloud Storage as the data source in a raw Dataplex zone.

In Part 3, we’ll go deeper into Dataplex security and understand the nuances of securing tables discovered by Dataplex.

Useful Links:

https://cloud.google.com/sdk/gcloud/reference/dataplex
