Google Cloud Dataplex Explore: Introduction to the “in-house” Exploration Workbench

Anish Nair
Google Cloud - Community
8 min read · Feb 1, 2023

Disclaimer: The opinions expressed in this article are my own, and do not represent views of my current or any previous employers.

Thanks to my colleague Madhu Jain for her contribution towards this blog.

Background

Dataplex Explore, or the Exploration Workbench, is an important addition to Dataplex’s functionality. In general, teams intending to perform data exploration end up spending a lot of time just enabling it. This could involve activities like:

  • commissioning compute/infrastructure to access the data
  • checking the quality and completeness of the data
  • making sure metadata is managed

and many more.

In general, accomplishing all of these activities means integrating multiple tools. This becomes a hectic task and may leave us with little time to actually perform analytics and derive insights.

Presenting Dataplex Explore

This is where Dataplex Explore, or the Exploration Workbench, comes to the rescue. It gives the user a simplified yet unified platform to:

  • Interactively query data across GCS, BigQuery and BigLake (to some extent), using standard SQL syntax
  • Analyse data using plain SQL or Notebooks
  • Share these artefacts and collaborate with wider teams
  • Operationalise your work by scheduling your SQL scripts and Notebooks

All this and more, without worrying about commissioning infrastructure, checking data quality or managing metadata.

The Exploration Workbench runs on serverless Spark compute, which manages the required infrastructure through auto-scaling. From a data consumer’s perspective, therefore, the Dataplex Exploration Workbench (Explore) becomes their best companion.

Some Basic Definitions

Setting aside the terminology common to Dataplex as a whole, such as Lakes, Zones and Assets, we will focus on a few terms that are especially important from the Exploration Workbench (Explore) perspective.

  1. Environment: It provides serverless compute resources to run Spark SQL and Notebooks on a Dataplex lake. Environments can be authorised for use by specific users (principals) using IAM. Environments are created with certain compute configurations, including the number of nodes available to the environment, the maximum number of nodes allowed while auto-scaling, and the auto-shutdown behaviour. We can also install specific software packages required for a particular environment.
  2. Session: A session is created when an authorised user runs queries or uses notebooks to analyse data. While doing so, the user selects the environment on which they want to run their operations, and the session created inherits that environment’s configuration. Please note that one user can have only one active session per environment.
  3. Nodes: Nodes are the backbone of the environments created. They specify the compute strength of an environment. In general, one node = 4 DCUs (Data Compute Units) = 4 vCPUs and 16 GB RAM.
  4. Default environment [warm-pool cluster]: It’s a single-node environment with fast startup enabled, which provides a pre-provisioned session, thus cutting down session start-up time.
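For completeness, environments can also be created programmatically rather than through the console. Below is a minimal sketch using the google-cloud-dataplex Python client; the project, location, lake and environment names are placeholders, and the field names are my assumptions based on the dataplex_v1 client, so please verify them against the API reference.

# A minimal sketch (not an official recipe): creating an Explore environment
# programmatically. All names are placeholders; field names are assumptions
# to be verified against the dataplex_v1 API reference.
import datetime

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

environment = dataplex_v1.Environment(
    display_name="analytics-env",
    infrastructure_spec=dataplex_v1.Environment.InfrastructureSpec(
        compute=dataplex_v1.Environment.InfrastructureSpec.ComputeResources(
            node_count=1,        # nodes available to the environment
            max_node_count=3,    # upper bound while auto-scaling
        ),
        os_image=dataplex_v1.Environment.InfrastructureSpec.OsImageRuntime(
            image_version="1.0",
        ),
    ),
    session_spec=dataplex_v1.Environment.SessionSpec(
        max_idle_duration=datetime.timedelta(minutes=10),  # auto-shutdown when idle
        enable_fast_startup=False,
    ),
)

operation = client.create_environment(
    parent="projects/my-project/locations/us-central1/lakes/my-lake",
    environment_id="analytics-env",
    environment=environment,
)
print(operation.result())  # waits for the long-running operation to finish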

Prerequisites

To leverage the offerings and features of Dataplex Explore (Exploration Workbench), there are two important prerequisites:

  1. Have an environment mapped to the Dataplex Lake.
  2. Have a Dataproc Metastore mapped to the Dataplex Lake.

At a high level, we will see environment creation in the next section. For the Dataproc Metastore (DPMS) setup, there are guidelines that can be followed by referring here and here.

Dataplex Explore In [Brief] Action

A complete walkthrough of each of the access methods will be part of another blog.

With the background and the prerequisites covered, let’s move forward to understand the different ways in which we can leverage Dataplex Explore. Primarily, this can be done in three ways:

  1. Ad-hoc Query using SQL Workbench/Editor
  2. Saved Queries
  3. Notebooks

Let’s take a brief look at each of these access methods.

Please note: to perform the activities mentioned below, the Dataplex Lake, Zones and Assets should already be in place. This amazing blog can be referred to for that setup.

Ad-hoc Query using SQL Workbench/Editor: The SQL editor can be used to perform ad-hoc analysis on the assets under the Dataplex lake. The SQL commands (familiar to most analysts) can vary from a simple select to join operations, aggregations and queries with analytical functions. A sample query is sketched after the navigation steps below.

Basic navigation would be:

Navigate to Google Cloud Console and click on the “Navigation Menu” (≡):

Go To “Dataplex” → “Explore” → Click on “+” on the SQL Editor → this opens the SQL Workbench

Running an Ad-hoc query on Dataplex Explore SQL Editor
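As an illustration, here is the kind of query you might run; the zone, table and column names are hypothetical. The SQL string itself is exactly what you could paste into the SQL editor; it is shown here being executed through spark.sql() from a notebook session, to stay consistent with the PySpark examples later in this post.

# A hedged sketch: zone/table/column names are hypothetical.
# The SQL string is what you could paste into the Explore SQL editor;
# here it is run through spark.sql() from a notebook session instead.
adhoc_sql = """
    SELECT c.region,
           COUNT(*)      AS order_count,
           SUM(o.amount) AS total_amount
    FROM my_zone.orders o
    JOIN my_zone.customers c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2023-01-01'
    GROUP BY c.region
    ORDER BY total_amount DESC
"""

result_df = spark.sql(adhoc_sql)  # `spark` is provided by the Explore session
result_df.show(20, truncate=False)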

Saved Queries: In the real world, analytical queries might have to be executed repeatedly. There might also be scenarios where we have to share these SQL artefacts and collaborate on them with a wider team. This is made possible using “Saved Queries”.

Basic navigation would be:

Navigate to Google Cloud Console and click on the “Navigation Menu” (≡):

Go To “Dataplex” → “Explore” → Click on “+” on the SQL Editor → this opens the SQL Workbench

In the editor, write any query which needs to be used on a regular basis.

Now click “Save” → “Save Script” → provide a name for the script → click “Save”

The saved script will be visible under the “Spark SQL Scripts” option

Creating a Saved Query

These saved queries can be scheduled to run at regular intervals, and the scheduling can be done from the same window, which is very convenient. Saved queries can also be shared with a wider team, with governed access control using IAM. For those who prefer automation over the UI, a programmatic sketch of scheduling is shown below.
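Scheduling can also be expressed as a Dataplex task via the google-cloud-dataplex Python client. The sketch below points at a SQL file stored in GCS rather than a script saved in the Explore UI, and the field names, paths and service account are all assumptions on my part; verify them against the tasks API reference before use.

# A hedged sketch: scheduling a Spark SQL script as a recurring Dataplex task.
# All names/paths are placeholders and the field names are assumptions
# to be verified against the dataplex_v1 tasks reference.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

task = dataplex_v1.Task(
    description="Daily run of the orders-by-region SQL",
    trigger_spec=dataplex_v1.Task.TriggerSpec(
        type_=dataplex_v1.Task.TriggerSpec.Type.RECURRING,
        schedule="0 6 * * *",  # every day at 06:00, cron syntax
    ),
    execution_spec=dataplex_v1.Task.ExecutionSpec(
        service_account="scheduler-sa@my-project.iam.gserviceaccount.com",
    ),
    spark=dataplex_v1.Task.SparkTaskConfig(
        sql_script_file="gs://my-bucket/scripts/orders_by_region.sql",
    ),
)

operation = client.create_task(
    parent="projects/my-project/locations/us-central1/lakes/my-lake",
    task_id="orders-by-region-daily",
    task=task,
)
print(operation.result())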

Notebooks: The notebook environment helps analyse data across the Dataplex lake, BigQuery native tables and BigLake tables. We can perform activities ranging from analytics (ad-hoc queries using joins, aggregations etc.) to advanced analytics (machine learning) to business intelligence (visualisation). This is a very helpful approach, as the notebook as a whole can be shared across teams for better collaboration.

The notebook can be used following standard PySpark conventions.

Basic Navigation would be:

Navigate to Google Cloud Console and click on the “Navigation Menu” (≡):

Go To “Dataplex” → “Explore”

Expand the dataplex lake → click “Notebooks” → click “Create Notebook”

A form opens up where we can provide the name of the Notebook and click “Create Notebook”

Once the notebook is created, it’s available to be worked on. As a quick illustration, a sketch of a typical notebook cell is shown below.
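This is a minimal sketch, assuming hypothetical zone/table/column names and that matplotlib is available (it can be added via the environment’s Python packages if not): read a lake table registered in the metastore, aggregate it, and pull a small result into pandas for a chart.

# A minimal sketch of a notebook cell (hypothetical zone/table/column names).
# `spark` is the Spark session attached to the chosen environment.
import matplotlib.pyplot as plt  # assumed available, or installed via the environment

orders = spark.table("my_zone.orders")   # table registered through DPMS

daily = (
    orders.groupBy("order_date")
          .count()
          .orderBy("order_date")
          .limit(30)                     # keep the chart small
          .toPandas()
)

daily.plot(x="order_date", y="count", kind="line", title="Orders per day")
plt.show()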

The “Preview” option helps us view the code and the last output (if the code has been executed) of a notebook, without opening a new session.

Notebook Preview Option

Some Cool Features

Now that we have an idea of the capabilities and usage of the workbench, let’s check out a few notable features that can be leveraged:

  • Dataplex Explore is a unified, simplified platform to run interactive Spark SQL queries on your underlying enterprise data across GCS, BigQuery and BigLake (to some extent).
  • It works on fully managed, serverless Apache Spark sessions, with automatic session creation and management of the user session lifecycle.
  • It’s a full-fledged platform for issuing ad-hoc queries to analyse data and, at the same time, can create and support data analytics and machine learning pipelines via its integrated notebook offering.
  • The artefacts created (notebooks and saved SQL scripts) can be shared and collaborated on effectively within your organisation, with governed access control using IAM.
  • Different environments with varying compute capacity can be pre-configured. These environments can have their own sets of software packages installed, based on the team or use case they serve, and can have governed access control using IAM.
  • By default, a user session (based on a chosen environment) stays active only for a pre-defined idle duration (minimum 10 minutes), after which it is deactivated. As an alternative, the warm-pool cluster (aka the default environment) can be used to keep sessions always active (for a maximum of 10 hours).
  • Notebooks come with a “Preview” option, which lets users view the content of a previously executed notebook, along with the output of every cell, without actually creating a session or executing any statement.
  • Within the SQL editor, a Query History tab is available. This is again very useful, as all queries run in the SQL editor window are persisted along with their outputs. The outputs are saved to an auto-created Google Cloud Storage bucket, so the Query History tab can be used to look up previously executed queries and their outputs without re-running them.

Features Yet To Be In Place

With all the good things discussed, there are still some areas that need to be addressed. A few of them are listed below for reference.

  • Direct BigLake Table Support: Currently, BigLake tables are not directly supported by Dataplex Explore. A BigLake table will still appear in the Dataplex Catalog (Search option), where its metadata can be looked up. However, we can still access and analyse BigLake tables via the notebook offering of Dataplex Explore. In this case, treat the BigLake table as a BigQuery table and access it using one of the following methods within the notebook:

a. Use the magic command [%%bigquery] (depending on the notebook image, the google.cloud.bigquery IPython extension may need to be loaded first with %load_ext google.cloud.bigquery)

%%bigquery
SELECT * FROM <bq_dataset>.<biglake_table>

b. Using Spark Reader[PySpark]

# Read the BigLake table through the Spark-BigQuery connector,
# addressing it like a native BigQuery table.
df_bq_bigtable_query = spark.read \
    .format('bigquery') \
    .load('<project_id>.<bq_dataset>.<biglake_table>')

df_bq_bigtable_query.show()
  • DPMS-BigQuery Metadata Issue: Dataplex Explore leverages DPMS (backed by Hive Metastore) to get metadata for all underlying assets, covering data across Google Cloud Storage, BigQuery and BigLake. For BigQuery assets added to Dataplex, the way metadata is held by Hive Metastore differs from how it is maintained internally by BigQuery, and this may cause a few issues, listed below:

a. If a BigQuery table name contains an uppercase character, that table might not appear under Dataplex Explore.

b. Similarly, if a BigQuery table has a column name containing an uppercase character, that column might not appear when accessing the asset via Dataplex Explore.

c. Any hyphen [-] in a table name is automatically converted into an underscore [_] in Dataplex Explore. This is not restricted to BigQuery; it applies to other asset types as well.

A workaround for points (a) and (b) above is to access the tables via a notebook, as sketched below.
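As a concrete, purely hypothetical example of that workaround, a table with mixed-case identifiers can be addressed by its original BigQuery name through the Spark-BigQuery connector inside a notebook (project, dataset, table and column names below are made up):

# Hypothetical example of the workaround: address the table by its original,
# case-sensitive BigQuery name through the Spark-BigQuery connector.
df_mixed_case = spark.read \
    .format('bigquery') \
    .load('my-project.sales_dw.CustomerOrders')

df_mixed_case.select('Region', 'OrderAmount').show()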

Conclusion

The Dataplex Exploration Workbench (Explore) is a great platform for all data consumers who want to focus on generating better insights from underlying enterprise datasets, without worrying about supporting integrations like environment commissioning, metadata management and so on.

The Dataplex Exploration Workbench being a new offering, we are sure to see a lot more growth in its functionality and capabilities. Until then, stay focused and keep learning.

References:

https://cloud.google.com/dataplex/docs/explore-data

https://cloud.google.com/blog/products/data-analytics/dataplex-provides-spark-powered-data-exploration-experience
