Simplify data lake management with Dataplex (1)

Mandar Chaphalkar
Google Cloud - Community
3 min read · May 30, 2022

IT functions spend a lot of time defining, designing and implementing data lakes. The entire life cycle can be tedious and demands considerable time and effort, and the resulting setup may still be incomplete or contain errors.

So how does Google Cloud Platform solve this problem?

Google Cloud has recently launched Dataplex, a service that addresses key data governance capabilities such as data management, metadata, data quality, data security and data lifecycle management.

And all of this in the simplest possible manner!

In this blog we will focus on the management part. Let us see how.

Step 1: Create a data lake.

From the Navigation menu, under Analytics, select Dataplex > Create new lake.

Click Create at the bottom of the page, and the lake is listed under Manage.

Step 2: Define the data zones. One of the most important steps in setting up a data lake is defining the data zones. Most of us are familiar with standard zones such as Raw, Curated and Consumption. The definitions vary somewhat with the industry, the organisation and the use cases they are intended to serve.

Click on the newly created data lake > Click add zone.

Repeat this task for each intended zone, such as Raw and Curated.
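Dataplex itself offers two zone types, raw and curated, so a conceptual layer such as Consumption has to be mapped onto one of them. The sketch below shows one possible mapping and builds a create-zone request per layer as a plain dictionary. The zone IDs, the lake name and the choice to treat Consumption as a curated zone are assumptions made for illustration, as is the single-region location setting.

```python
from enum import Enum


class ZoneType(Enum):
    """The two zone types Dataplex offers."""
    RAW = "RAW"
    CURATED = "CURATED"


# Hypothetical mapping from an organisation's conceptual layers to zone types.
LAYER_TO_ZONE_TYPE = {
    "raw": ZoneType.RAW,
    "curated": ZoneType.CURATED,
    "consumption": ZoneType.CURATED,  # assumption: modelled as a curated zone
}


def build_create_zone_request(lake_name: str, layer: str) -> dict:
    """Assemble a create-zone request for one layer (no API call is made)."""
    zone_type = LAYER_TO_ZONE_TYPE[layer]
    return {
        "parent": lake_name,
        "zone_id": f"{layer}-zone",
        "zone": {
            "type": zone_type.value,
            # Assumed: the data for this zone sits in a single region.
            "resource_spec": {"location_type": "SINGLE_REGION"},
        },
    }


# Hypothetical lake name, for illustration only.
lake_name = "projects/my-project/locations/us-central1/lakes/sales-lake"
zone_requests = [build_create_zone_request(lake_name, layer)
                 for layer in ("raw", "curated")]
for req in zone_requests:
    print(req["zone_id"], req["zone"]["type"])
```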

Step 3: Add assets, which essentially associates each zone with a physical storage location. In this scenario we associate the Raw zone with a GCS bucket that has been created specifically to store the raw data.

Click on the Raw Zone > Click Add Asset > Click Storage bucket > Select the desired bucket for raw data > Review assets > Job done!

Note that a BigQuery dataset can also be selected as the physical storage.

The new asset gets listed under the raw zone.
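To make the zone-to-storage association concrete, here is a minimal sketch that assembles a create-asset request for either a Cloud Storage bucket or a BigQuery dataset. It assumes the asset's resource specification takes a resource name of the form `projects/<project>/buckets/<bucket>` or `projects/<project>/datasets/<dataset>` plus a matching type. The function name, the `kind` parameter and all IDs are hypothetical, and no API call is made.

```python
def build_create_asset_request(zone_name: str, asset_id: str, project: str,
                               resource_id: str, kind: str = "bucket") -> dict:
    """Assemble a create-asset request as a plain dict (no API call is made)."""
    if kind == "bucket":
        resource_type, collection = "STORAGE_BUCKET", "buckets"
    elif kind == "dataset":
        resource_type, collection = "BIGQUERY_DATASET", "datasets"
    else:
        raise ValueError(f"unsupported asset kind: {kind!r}")
    return {
        "parent": zone_name,
        "asset_id": asset_id,
        "asset": {
            "resource_spec": {
                # Points the asset at the physical storage behind the zone.
                "name": f"projects/{project}/{collection}/{resource_id}",
                "type": resource_type,
            },
        },
    }


# Hypothetical names, for illustration only.
raw_zone = ("projects/my-project/locations/us-central1"
            "/lakes/sales-lake/zones/raw-zone")
asset_request = build_create_asset_request(
    raw_zone, "raw-bucket", "my-project", "my-raw-data-bucket")
print(asset_request["asset"]["resource_spec"])
```

Passing `kind="dataset"` with a dataset ID would produce the BigQuery variant of the same request.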

So how do we browse the files in the raw bucket? Data Catalog comes to the rescue here. Dataplex integrates natively with Data Catalog, and the association is automatic!

Step 4: Explore the lake and start using it! Now that all the data lake components are in place, let us browse them.

Click on Explore (Preview) and expand the list under the data lake that has been created.

All the zones and objects, down to the schema details, are listed in this view.
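The lake > zone > asset hierarchy shown in this view is also encoded in the resource names. As a rough sketch, assuming asset names follow the pattern `projects/<p>/locations/<l>/lakes/<lake>/zones/<zone>/assets/<asset>`, the snippet below groups a list of such names into an indented tree. The names are hypothetical, and schema details are not covered here.

```python
from collections import defaultdict


def render_lake_tree(asset_names: list[str]) -> str:
    """Group full asset names by lake and zone and render an indented tree."""
    tree = defaultdict(lambda: defaultdict(list))
    for name in asset_names:
        parts = name.split("/")
        # Expected: projects/P/locations/L/lakes/LAKE/zones/ZONE/assets/ASSET
        lake, zone, asset = parts[5], parts[7], parts[9]
        tree[lake][zone].append(asset)
    lines = []
    for lake in sorted(tree):
        lines.append(lake)
        for zone in sorted(tree[lake]):
            lines.append(f"  {zone}")
            lines.extend(f"    {asset}" for asset in sorted(tree[lake][zone]))
    return "\n".join(lines)


# Hypothetical asset names, for illustration only.
base = "projects/my-project/locations/us-central1/lakes/sales-lake"
names = [
    f"{base}/zones/raw-zone/assets/raw-bucket",
    f"{base}/zones/curated-zone/assets/curated-dataset",
]
print(render_lake_tree(names))
```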

In conclusion, traditional ways of building data lakes introduced many redundancies and silos, leading to inefficiencies and cost overheads. With Google Cloud’s Dataplex service, it is possible to define and manage a data lake in a few simple steps.

We will explore some of the other cool features, such as data quality and data life cycle management, next in this series of blogs. Stay tuned!

Mandar Chaphalkar
Google Cloud - Community

Data Analytics Specialist at Google | *Views, opinions are personal*