Start splashing in the data lake: with Dataplex (2)

Mandar Chaphalkar
Google Cloud - Community
4 min read · Jun 12, 2022

In my previous blog, I discussed how Dataplex can simplify data lake management. Creating the lake, defining the zones, adding assets and exploring the data can all be done with a few clicks!

Once the data zones are in place, it is imperative to define the properties of each zone. For example, if the raw zone is in GCS: what are the recommended or acceptable file formats from the data sources? Should there be any in-flight transformations before the data lands in the raw zone? How long should the data be stored? Should full history be maintained, or only deltas? And so on. These considerations typically differ from use case to use case and with the choice of technology.

In this blog, I discuss two key use cases: one addressing data management and the other data quality.

Scene 1: Aligning data (file) formats to zone-specific formats

One of the most common patterns when processing data between zones is file format transformation. For example, data sources push formats such as CSV and JSON, and these files end up in the raw zone. For data engineering and analytics workloads, one of the most common formats for processing, storing and consuming data is Parquet. So the question is: can the JSON-to-Parquet format conversion be handled within the data lake, and with just a few clicks?

The answer is YES! Let us see how this can be done.

Step 1: Create a task

From the Dataplex console > Click Process > Create Task

Step 2: Select the data preparation task

Click Convert to Curated formats

Step 3a: Select the Dataplex lake

The Dataflow transformation template is auto-populated.

Step 3b: Enter the source & target asset paths

Enter the target format for the data, in this case PARQUET.

Step 3c: Specify the optional parameters

Compression format (SNAPPY, GZIP), action on existing destination files (overwrite, fail, skip, etc.), workers, machine type, network, subnet, and so on.

Step 4: Set the schedule

And click Create.

The data pipeline gets associated with the Dataplex lake and is ready to be executed!

Scene 2: Validate Data Quality

Now that we have started loading data into the lake, the question is: how do we ensure data quality? Dataplex enables a data quality (DQ) validation task with a few simple steps. Under the hood, it uses Dataproc Serverless to execute the DQ task.

Step 1: Create a task

From the Dataplex console > Click Process > Create Task > Click Check Data Quality

Step 2a: Select the Dataplex data lake

Step 2b: Configure the task

Select the YAML file with the DQ rules defined, and set the target BigQuery dataset and table for storing the validation output.

A sample YAML file with DQ rules is shown below.
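As an illustration, a minimal, hypothetical rules file in the CloudDQ-style YAML that the Dataplex DQ task consumes could look like the sketch below. The rule names, the column and the entity URI are placeholders of my own, and the full schema (rule types, row filters, rule dimensions, URI formats) is described in the CloudDQ reference.

```yaml
# Minimal, hypothetical CloudDQ-style rules file.
# All names and the entity URI are placeholders, not values from this post.
rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL
  VALID_EMAIL:
    rule_type: REGEX
    params:
      pattern: "^[^@]+@[^@]+$"

row_filters:
  NONE:
    filter_sql_expr: "True"

rule_bindings:
  CUSTOMER_EMAIL_VALID:
    # Assumed Dataplex entity URI format; point it at your own lake/zone/entity.
    entity_uri: dataplex://projects/<project-id>/locations/<region>/lakes/<lake-id>/zones/<zone-id>/entities/<entity-id>
    column_id: email
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE
      - VALID_EMAIL
```

Each rule binding ties one or more rules to a column of an entity in the lake, and every binding is evaluated on each run of the task.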

Step 2c: Set the service account and network config

Step 3: Set the schedule

And click Create.

The output of every validation run gets stored in a BigQuery table. The validation report is available with key attributes such as the Dataplex data lake, run info, the Dataplex asset/table, the table and column validated, the number of rows validated, and success/failure statistics.
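Purely as an illustration, a single record of that report could carry information along the lines of the sketch below. The field names are placeholders chosen for readability, not the actual schema of the output table, which depends on the version of the DQ task.

```yaml
# Hypothetical shape of one validation record; field names are placeholders,
# not the real output table schema.
lake: sales-lake                # Dataplex data lake the task ran against
run_id: 2022-06-12T10-30-00     # run info for this execution
asset: customers_raw            # Dataplex asset / table validated
column: email                   # column the rule binding was applied to
rows_validated: 100000
success_count: 99950
failed_count: 50
```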

In conclusion, Dataplex addresses some of the key data management and data quality capabilities in a simple way. This is enabled via seamless integration with GCP-native tools such as Dataflow, Dataproc Serverless and BigQuery, amongst others.

To explore some more interesting features, watch this space!

Data Analytics Specialist at Google | *Views, opinions are personal*