Google Cloud Dataplex, Part 1: Lakes, Zones, Assets and Discovery
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
I intend to publish a series of posts on Google Cloud Dataplex, starting from the very basics and eventually executing tasks that become gradually more complex. This is Part 1 of the series, aimed at L100 practitioners who would like to get started with Dataplex.
A data fabric, according to Gartner (here), is a data platform design concept in which data discovery, data curation and data processing are handled continuously over discovered metadata.
While enterprise data engineering teams create cross-LoB (Line of Business) data applications and data products, the business needs of senior decision makers often transcend LoBs, requiring insights extracted from multiple LoBs.
An intelligent data fabric like Google Cloud Dataplex provides the ability to discover disparate data sources, harvest their metadata, and run data curation jobs using popular frameworks like Apache Spark.
On top of discovery, Dataplex also acts as the central data governance component for modern data platforms. In subsequent articles, I will show how Dataplex makes it easy to accomplish tasks such as data quality checks and managed data security policies.
With the advent and subsequent widespread adoption of ELT (Extract-Load-Transform) as a pattern for running analytics workloads on cloud, the importance of products like Dataplex gets accentuated. Today, many enterprise customers across the globe copy massive datasets into the cloud, extracted from diverse data sources such as ERP systems, OLTP RDBMS systems, real-time streaming systems, as well as partner organization data. More often than not, the copied data is kept in designated raw zones before subsequent transformations are applied to it.
Cloud providers such as Google Cloud, AWS and Azure, provide durable storage of such extracted data on Data Lake raw zones, where files are kept in object stores.
For the cloud data lake raw zone, transformation of raw data is mostly done using high-performance distributed data processing frameworks like Apache Spark, Apache Beam and Apache Flink (for streaming systems).
ELT is also prevalent where raw data is kept within a cloud data warehouse, in a designated raw zone; good examples are Google BigQuery and Snowflake.
There are powerful "T"ransformation frameworks, like dbt and Dataform, which allow authoring sophisticated data curation, data quality and data freshness checks on raw data kept in cloud data warehouses.
For both of the above cases, a Data Lake and a Cloud Data Warehouse, data curation and subsequent data transformation fundamentally rely on data discovery: harvesting metadata and making decisions based on it.
In this blog series, the data discovery, data curation and data processing aspects of Google Cloud Dataplex will be covered. Part 1 focuses on ingesting structured CSV files into Google Cloud Storage (GCS) and the subsequent discovery of metadata in the form of tables.
Dataplex Data Discovery
To understand the steps needed to execute Dataplex data discovery, let’s first simulate a business scenario. “sales.csv” and “customers.csv” are two simulated examples of OLTP data being copied into a Google Cloud data lake and stored as objects in Google Cloud Storage.
Step-1 : Simulated data creation
Create a small sales dataset and save it as “sales.csv”.
Similarly, create a customer dataset and save it as “customers.csv”.
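Since the original blog shows the file contents as screenshots, here is a minimal sketch of what the two files could look like. The column names and values are my own illustrative assumptions, not the article's actual data; any CSVs with consistent headers will work for the discovery walkthrough.

```shell
# Create a hypothetical sales.csv; columns are illustrative assumptions.
cat > sales.csv <<'EOF'
sale_id,customer_id,product,quantity,sale_amount,sale_date
1,101,widget,2,19.98,2023-01-05
2,102,gadget,1,49.99,2023-01-06
3,101,widget,5,49.95,2023-01-07
EOF

# Create a hypothetical customers.csv sharing the customer_id key.
cat > customers.csv <<'EOF'
customer_id,name,city,signup_date
101,Alice,Austin,2022-11-01
102,Bob,Boston,2022-12-15
EOF
```

The header row in each file matters: we will later tell Dataplex to treat the first row as a header so it can infer column names.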
Step-2 : Google Cloud storage and prefix
Select the Google Cloud project where team experiments are done and enable the Google Cloud Storage API, if not enabled already. Upload the files created above to Google Cloud Storage. First, let’s create a bucket named “rawoltpsales”. Within this bucket, let’s create a folder named “sales”. Similarly, create another bucket, “rawoltpcustomers”, and create a folder “customers” inside this bucket as well.
Both the buckets will have the following configurations :
name : rawoltpsales / rawoltpcustomers
Location Type : Region ( us-central1)
Default Storage Class : Standard
Access Control: Uniform
Data Encryption : Google-managed key
Once the buckets are created, create one folder within each bucket, named “sales” and “customers” respectively.
Now, upload “sales.csv” to the “rawoltpsales/sales” folder and “customers.csv” to the “rawoltpcustomers/customers” folder, respectively.
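The console steps above can also be scripted with the gcloud CLI. This is a sketch: the bucket names, region and settings match the article, while the project ID is a placeholder you would replace with your own.

```shell
# Placeholder project ID; replace with your own project.
PROJECT_ID=my-project

# Create the two regional buckets with uniform access control
# and the default (Google-managed) encryption key.
gcloud storage buckets create gs://rawoltpsales gs://rawoltpcustomers \
  --project="$PROJECT_ID" \
  --location=us-central1 \
  --uniform-bucket-level-access

# Upload each CSV under its own prefix ("folder"), so that Dataplex
# later discovers each prefix as a separate table.
gcloud storage cp sales.csv gs://rawoltpsales/sales/
gcloud storage cp customers.csv gs://rawoltpcustomers/customers/
```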
Step-3 : Dataplex lakes, zones and assets
Now that the OLTP simulated data is copied to the respective GCS bucket / folders, data discovery can start using Dataplex.
Dataplex organizes data stored in Google Cloud Storage and BigQuery into “lakes” and “zones”. Refer to the Dataplex documentation.
A Dataplex lake most commonly maps to a Data Mesh domain. Loosely, we can correlate a lake with an LoB of a business, which ingests and processes data. For our example, it could be marketing, which uses sales and customers data to build a “Data Product”.
Within a Dataplex lake, zones logically segregate raw vs. curated data.
Within zones, Dataplex organizes structured and unstructured data as “Assets”. An asset maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone within a lake. You can attach existing Cloud Storage buckets or BigQuery datasets to be managed from within the lake.
Assets follow Hive-style conventions, so a folder (also known as a “prefix”) within a bucket represents an entity, or a group of entities with a similar schema. This is why a folder has been created inside each bucket, one for sales and one for customers, so that Dataplex discovers these assets as two separate tables. Each folder could host many Hive-style partitioned files, and Dataplex will discover those files and create a single table, as long as the files share the same schema.
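For example, a Hive-style partitioned layout under the sales prefix might look like the following (the partition column and dates are illustrative); Dataplex would still discover a single sales table for the prefix, as long as the schema is consistent across the files:

```
gs://rawoltpsales/sales/dt=2023-01-05/part-0000.csv
gs://rawoltpsales/sales/dt=2023-01-06/part-0000.csv
gs://rawoltpsales/sales/dt=2023-01-07/part-0000.csv
```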
Step-4 : Create a Dataplex lake
Now that we have defined lakes, zones and assets, let’s create these constructs one by one.
In order to access Dataplex metadata via the Hive Metastore in Spark queries, a Dataproc Metastore service instance needs to be associated with the Dataplex lake. So, a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher) should be created first.
In the search bar of Google Cloud Console, search for “Dataproc”.
Ensure you have the Dataproc Metastore API enabled on your Google Cloud project.
Create the Dataproc Metastore service
Assign a service name, select the Developer service tier, and choose a VPC network; for simplicity, I’ve selected the default network.
Leave the rest as default, but select gRPC as the endpoint protocol:
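The same Dataproc Metastore can be created from the CLI. This is a sketch under the article's settings (Developer tier, Hive Metastore 3.1.2, gRPC endpoint, default network); the service name is my own assumption.

```shell
# Create a gRPC-enabled Dataproc Metastore in the Developer tier.
# "marketing1-dpms" is a hypothetical service name; adjust as needed.
gcloud metastore services create marketing1-dpms \
  --location=us-central1 \
  --tier=DEVELOPER \
  --hive-metastore-version=3.1.2 \
  --endpoint-protocol=GRPC \
  --network=default
```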
It takes a while to create the service (mine took around 15 minutes).
Now, we can create the Dataplex lake. Search for “Dataplex” within the Google Cloud console and enable the Google Cloud Dataplex API, if not enabled already. More often than not, a lake and a domain will have a 1:1 mapping. Data products within a domain will be built with the lake as the unified storage abstraction.
Let’s call the Dataplex lake “marketing1”
While creating the lake, also select the Dataproc metastore service created in the earlier step.
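Equivalently, the lake can be created with one gcloud command. The metastore service path below assumes the hypothetical service name from the previous step and a placeholder project ID.

```shell
# Placeholder project ID; replace with your own project.
PROJECT_ID=my-project

# Create the Dataplex lake and attach the Dataproc Metastore service
# created earlier, so Spark can query Dataplex metadata via Hive Metastore.
gcloud dataplex lakes create marketing1 \
  --location=us-central1 \
  --metastore-service="projects/$PROJECT_ID/locations/us-central1/services/marketing1-dpms"
```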
The lake will be created within a couple of minutes. Once created, click on “marketing1” to observe the details.
While there are multiple tabs which provide separate functionality, let us focus on “Zones”. Within a lake, zones host assets. An asset is the logical equivalent of either a BigQuery table or a set of Google Cloud Storage files under a folder. A zone can be either a “raw zone” or a “curated zone”. These tags define the level of enrichment/curation present in the dataset. While raw files, like CSV and JSON, will be stored in a raw zone, curated and efficiently compressed file formats like Parquet, ORC and Avro will be part of the curated zone. Dataplex does not allow users to create CSV files within a “curated zone”.
Since our dataset consists of only two CSV files, let us click on “Add Zone” and create a zone named “marketing1raw”.
Also, enable discovery settings by selecting “Enable metadata discovery” and “Enable csv parsing options”. Since our dataset has a header, let us insert “1” in the “Header Rows” textbox. Leave the rest of the fields with their default values.
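The zone, including the discovery and CSV header settings described above, can also be created from the CLI. This is a sketch; names and region match the article.

```shell
# Create a RAW zone in the lake, with metadata discovery enabled and
# the first row of CSV files treated as a header.
gcloud dataplex zones create marketing1raw \
  --lake=marketing1 \
  --location=us-central1 \
  --type=RAW \
  --resource-location-type=SINGLE_REGION \
  --discovery-enabled \
  --csv-header-rows=1
```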
Once the zone is created, click on it and then click “Add an asset”.
An asset maps physical data stored in BigQuery or Google Cloud Storage to Dataplex. In this example, we create two assets, one for “rawoltpsales” and another for “rawoltpcustomers”.
Select the Type as Storage Bucket and select the “rawoltpsales” bucket created earlier and then click “Done”.
Click continue and select Inherit for the Discovery Settings and click continue and Submit.
Similarly, following the exact same steps, click on “Add Assets” and create another asset, “rawoltpcustomers”. While creating this asset, select the “rawoltpcustomers” GCS bucket.
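Both assets can also be attached from the CLI, with discovery settings inherited from the zone. This sketch uses a placeholder project ID and reuses each bucket's name as the asset ID.

```shell
# Placeholder project ID; replace with your own project.
PROJECT_ID=my-project

# Attach each GCS bucket as a storage-bucket asset in the raw zone;
# discovery settings are inherited from the zone by default.
for BUCKET in rawoltpsales rawoltpcustomers; do
  gcloud dataplex assets create "$BUCKET" \
    --lake=marketing1 \
    --zone=marketing1raw \
    --location=us-central1 \
    --resource-type=STORAGE_BUCKET \
    --resource-name="projects/$PROJECT_ID/buckets/$BUCKET"
done
```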
Wait till the discovery status changes from “In Progress” to “Scheduled” for both assets.
Once the Dataplex discovery job is scheduled and complete, let us check the “Discover” link in the left-hand menu.
We can see that the “marketing1” lake has a “marketing1raw” zone with two discovered tables: customers and sales.
For more accurate search results, use Dataplex-specific filters, such as lake and data zone names. The top 50 items per facet are displayed on the filters list. You can find any additional items using the search box.
Each entry contains detailed technical and operational metadata.
Clicking the sales table shows its configuration; notice that the source bucket is mentioned as well. Dataplex has read the sales table from the “gs://rawoltpsales/sales” prefix and has created one table for the prefix.
Click on the “Open in BigQuery” link, and you will be able to query the table. This is possible since each Dataplex data zone maps to a dataset in BigQuery, or a database in Dataproc Metastore, where metadata information is automatically made available.
Notice how “marketing1raw” appears as a dataset within BigQuery.
Also note that Dataplex has inferred the schema, with probable types for each column.
You can run the following simple query to get results from the table.
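For instance, the discovered table can be queried from the bq CLI (or the BigQuery console). This assumes the zone has surfaced as the “marketing1raw” dataset, as described above.

```shell
# Query the discovered sales table through BigQuery.
# "marketing1raw" is the dataset Dataplex created for the zone.
bq query --use_legacy_sql=false \
  'SELECT * FROM marketing1raw.sales LIMIT 10'
```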
With Dataplex, you no longer need to create external tables in BigQuery. Moreover, you can also query these tables using Spark SQL from a Jupyter notebook (I will explain with examples in a later post), since Dataplex provides a unified metadata layer.
Now, if you go back to the Dataplex zone “marketing1raw”, you will see that the data discovery dashboard mentions that two tables were detected.
In the next post, I will explain how security and IAM work with Dataplex entities, and in subsequent posts I will cover data curation using custom Spark jobs, data exploration (currently in Preview) and data quality checks.
Have a look at the following documentation to understand more on Dataplex :
Discover Data : https://cloud.google.com/dataplex/docs/discover-data
Best Practices : https://cloud.google.com/dataplex/docs/best-practices