Data Mesh Enabler: GCP Dataplex and BigLake Integration

Jash Radia
Google Cloud - Community
10 min read · Jan 11, 2023

Introduction

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

Before jumping right into how to automatically create and manage BigLake tables on GCP through Dataplex, let us first understand briefly what a BigLake table is.

These days, data is generally stored in data warehouses (like BigQuery) and/or in data lakes (like GCS, AWS S3, or ADLS). This split is what gave rise to the term “Lakehouse architecture” in data engineering projects. Because our data is spread across multiple platforms, there was no common place to access or even govern it. Managing multiple access policies in different places and hitting different services to pull data from different sources is a headache that no one wants.

This is where BigLake comes into play. It unifies different data sources in the backend and provides a single interface to query your data (through BigQuery) and also govern it (through BigQuery and Dataplex) without actually changing where the data is stored physically. BigLake tables are essentially a type of table that can be queried directly in BigQuery. To a certain degree, they are similar to an external table, but they offer additional features like column-level security, row-level security, data masking, and data exchange with Analytics Hub, no matter where your data is stored originally.
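For a bit of context before we let Dataplex automate it, a BigLake table can also be created by hand in BigQuery with a CREATE EXTERNAL TABLE ... WITH CONNECTION statement. The snippet below is only a minimal sketch using the Python client: the project, dataset, connection, and bucket names are placeholders I made up, not resources from this article.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Minimal BigLake DDL: an external table bound to a Cloud resource connection,
# which is what lets BigQuery enforce row/column-level security on GCS data.
ddl = """
CREATE EXTERNAL TABLE `my-project.my_dataset.employee`
WITH CONNECTION `my-project.us-east1.my-connection`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/employee/*.csv']
)
"""
client.query(ddl).result()  # wait for the DDL job to finish
```

The rest of this article is about getting Dataplex to do the equivalent of this for us, automatically, for everything it discovers in a bucket.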

BigLake became generally available in July 2022, but at that time there was no integration between BigLake tables and Dataplex. Dataplex is an intelligent data fabric that helps you unify distributed data and automate data management and governance across data lakes, data marts, and data warehouses to enable analytics at scale. It makes implementing a Data Mesh architecture possible: we can separate our data assets into logical groups of lakes and zones. When we create an asset in Dataplex pointing to a GCS bucket, it automatically discovers all the data inside that bucket and creates the corresponding tables in BigQuery. Until very recently, Dataplex could only discover data from GCS and create non-BigLake external BigQuery tables, but since December 20, 2022, there is an option to automatically create BigLake tables in BigQuery while the data stays physically stored in GCS buckets.

Putting Dataplex and BigLake together

Although the offerings of Dataplex and BigLake may appear to be similar, they serve entirely different purposes.

BigLake is a storage runtime that extends BigQuery’s data management capabilities to object stores. Because of this, users interact directly with tables rather than files and can apply fine-grained security.

Dataplex, on the other hand, solves data governance and management problems. It creates an inventory of data, classifies it, and secures it. From Dataplex’s point of view, BigLake tables are no different from BigQuery tables and can be managed by Dataplex for governance.

Different colours in Data Fabric represent data from different projects. Lines represent Data Sharing

In this article, we are going to:

  1. Create a Dataplex managed bucket asset and auto-discover BigLake tables.
  2. See what happens under the hood and what components are created.
  3. Provide access to another user in a different project using Dataplex and Govern it.
  4. Assign a policy tag for column-level security.

Create a Dataplex managed bucket asset and auto-discover BigLake tables:

Prerequisites:

  1. Make sure that you have enabled these APIs: BigQuery Connection API, Cloud Dataplex API, BigQuery Data Policy API, and Google Cloud Data Catalog API.
  2. Prepare yourself to work with many interconnected services at once, like GCS, Dataplex Lakes, Zones, Assets, BigQuery, BigLake tables, Policy Taxonomies and Tags, IAM roles, Service Accounts, and Projects.

Create a bucket (in the us-east1 region or anywhere else) and upload a file named sample.csv (shown below) inside a folder called “employee”.

sample.csv file contents
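If you prefer to script this step, here is a small sketch with the google-cloud-storage client. The project, bucket name, and the CSV columns (firstName, lastName, email, phoneNumber) are assumptions for illustration; any simple CSV with a header row will do.

```python
from google.cloud import storage

client = storage.Client(project="my-project")  # placeholder project
bucket = client.create_bucket("my-dataplex-demo-bucket",  # placeholder bucket name
                              location="us-east1")

# Assumed sample.csv contents; the real file just needs a header and a few rows.
csv_text = (
    "firstName,lastName,email,phoneNumber\n"
    "Jane,Doe,jane.doe@example.com,555-0100\n"
    "John,Smith,john.smith@example.com,555-0101\n"
)
bucket.blob("employee/sample.csv").upload_from_string(csv_text,
                                                      content_type="text/csv")
```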

Go to Dataplex->Manage and create a sample lake and zone in the same region as the bucket (us-east1 in my case, or whichever region you used). Ensure that the regions are the same. I kept it simple by using the names “sample-lake” and “sample-zone”. Also make sure that the zone is a raw zone, since a curated zone doesn’t support CSV files.

Once done, create an asset in “sample-zone”, choose “Storage bucket”, and select the bucket where you uploaded the data. Make sure to select the “Upgrade to Managed” option. This is the latest feature, and it is what ensures the corresponding BigQuery table is created as a BigLake table and not as a simple non-BigLake external table.

Creating the Dataplex asset from a bucket
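The same lake, zone, and managed asset can also be created programmatically with the google-cloud-dataplex client. Treat the sketch below as an approximation rather than a recipe: the project and resource names are placeholders, and the field names (especially read_access_mode, which I believe maps to the “Upgrade to Managed” option) reflect my reading of the v1 API, so check them against the current library.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/us-east1"  # placeholder project

# 1. Lake
lake = client.create_lake(
    parent=parent, lake_id="sample-lake",
    lake=dataplex_v1.Lake(display_name="sample-lake"),
).result()

# 2. Raw zone (a curated zone would not accept CSV files)
zone = dataplex_v1.Zone(
    type_=dataplex_v1.Zone.Type.RAW,
    resource_spec=dataplex_v1.Zone.ResourceSpec(
        location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION),
    discovery_spec=dataplex_v1.Zone.DiscoverySpec(enabled=True),
)
zone = client.create_zone(parent=lake.name, zone_id="sample-zone", zone=zone).result()

# 3. Managed bucket asset -> discovered tables should come out as BigLake tables
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        name="projects/my-project/buckets/my-dataplex-demo-bucket",
        type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
        # Assumed equivalent of the console's "Upgrade to Managed" toggle.
        read_access_mode=dataplex_v1.Asset.ResourceSpec.AccessMode.MANAGED,
    ),
    discovery_spec=dataplex_v1.Asset.DiscoverySpec(enabled=True),
)
client.create_asset(parent=zone.name, asset_id="employee-asset", asset=asset).result()
```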

In the next screen, you can update the CSV discovery settings if required, but I am not making any changes here. Discovery settings can be updated even at the zone level. For now, simply inherit the discovery settings. Just sit back and see the magic happen! We can’t do anything but wait at this point.

After some self-contemplation, click on the asset to check out the discovery logs. Hit refresh if you don’t see anything.

Discovery logs for the asset creating the table called “employee”

Now, let’s go to Dataplex->Search. Select “sample-lake” from the “Lakes and Zones” filter and check if the newly created table is visible there.

Newly discovered table in Dataplex

You can also click to check the schema detected for the table.

But how do we know if it’s a BigLake table? Simply click on the “Open in BigQuery” button. This will show the schema of the table inside BigQuery, and you will be able to see “BigLake” type on the screen right beside the table name.

BigLake table created in BigQuery
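You can also verify this from code: a BigLake table is still an external table under the hood, but its external configuration carries a connection ID, which a plain external table would not have. A small sketch, assuming the dataset and table names from this walkthrough and a placeholder project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project
table = client.get_table("my-project.sample_zone.employee")

print(table.table_type)                                    # "EXTERNAL"
# Non-empty for a BigLake table: the Cloud resource connection it is bound to.
print(table.external_data_configuration.connection_id)
```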

You can also upgrade your existing lakes to managed to automatically convert all the external tables corresponding to that asset into BigLake tables.

Let us see what happened behind the scenes

Process of auto-discovery of GCS data and creation of BigLake tables
  1. Dataplex automatically discovered all the folders and data in the bucket when the asset was created.
  2. A dataset was created in BigQuery with the name of the Dataplex zone. Verify it from the Explorer tab. Notice that the name of the zone was slightly changed from “sample-zone” to “sample_zone” to follow BigQuery’s naming conventions.
Dataset created in BigQuery

3. An external connection was created in BigQuery with the same name as your zone, in the same region. Observe that the connection type is “BigLake and remote functions”.

External Connection us-east1.sample-zone is automatically created

4. An IAM role binding was created on the bucket from which we created this asset. From the above snapshot, note the “Service account id”. This service account needs to be given permission to discover the data in the GCS bucket. To verify this, go to your GCS bucket->Permissions->View by Principals. You should find a corresponding entry for this service account.

Service account role mapping automatically created
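Both of these generated pieces can be inspected from code as well. The sketch below lists the BigQuery connections in the region and prints the bucket’s IAM bindings so you can spot the delegate service account; the client calls are standard, but the project, bucket, and the exact bindings you see are specific to your setup.

```python
from google.cloud import bigquery_connection_v1, storage

# Step 3 check: the auto-created "BigLake and remote functions" connection.
conn_client = bigquery_connection_v1.ConnectionServiceClient()
for conn in conn_client.list_connections(
        parent="projects/my-project/locations/us-east1"):   # placeholder project
    print(conn.name, conn.cloud_resource.service_account_id)

# Step 4 check: that service account's role binding on the source bucket.
bucket = storage.Client(project="my-project").bucket("my-dataplex-demo-bucket")
policy = bucket.get_iam_policy(requested_policy_version=3)
for binding in policy.bindings:
    print(binding["role"], sorted(binding["members"]))
```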

5. After all this, a BigLake table is finally created inside your zone’s dataset, using the external connection and the IAM role mapping created above. Let’s query the data to see how it looks.

Result of querying the employee table
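The same query from code, assuming the auto-created sample_zone dataset and a placeholder project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
rows = client.query("SELECT * FROM `my-project.sample_zone.employee`").result()
for row in rows:
    print(dict(row))  # one dict per CSV row discovered by Dataplex
```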

Going back to Dataplex and providing access to another user

Let us go back to Dataplex. We can allow another user from a completely different project to access this data. (One of the use cases of “data mesh”)

For this, go to Dataplex->Secure and navigate to your zone. We can assign permissions at the lake, zone, and asset levels, and permissions are automatically inherited by child resources unless they are explicitly denied. For now, let’s assign the permission directly at the lake level and see if that works (since we only have one asset inside the lake). By default, this list should be empty.

Click on “Grant access” and add the user you want to give permission to. If you don’t have another user, you can instead grant access to a service account you control and verify from there. For now, I will add another user (whose account I can access) from another project. You can also see that the lake permissions are automatically propagated down to the zones and assets.

Assigning viewer/reader roles to principal
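Under the hood this console grant is just an IAM binding on the lake resource. A rough sketch with the Dataplex client is below; the lake path, the role (roles/dataplex.dataReader), and the consumer principal are assumptions you would swap for your own.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
lake = "projects/my-project/locations/us-east1/lakes/sample-lake"  # placeholder

# Read-modify-write of the lake's IAM policy; zones and assets inherit it.
policy = client.get_iam_policy(request={"resource": lake})
policy.bindings.add(role="roles/dataplex.dataReader",
                    members=["user:consumer@example.com"])  # assumed consumer
client.set_iam_policy(request={"resource": lake, "policy": policy})
```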

After doing this, you will be able to see the asset in your consumer account’s Dataplex console and will also be able to query it with BigQuery.

Let us assign a policy tag on this table

This is great, but we still have not used one of the main features of BigLake tables. That is fine-grained access control, so let’s do that, shall we?

Go to Dataplex->Policy Tags and notice that it redirects you to BigQuery! You will also see a message saying, “You have been redirected to Policy Tags in Bigquery. Policy Tags has moved from Dataplex to Bigquery. In the future, we will remove the link to Policy Tags from Dataplex”

That means we will have to create a policy tag taxonomy on BigQuery first, and then it will be applied to the BigLake table.

Click on the “Create Taxonomy” button and create a sample taxonomy with a sample tag, as shown below. Make sure that your region is the same (us-east1 in my case).

sample taxonomy

You can make it complicated by adding some subtags as well, but for now, let’s keep it simple. Once created, don’t forget to select the radio button saying “Enforce access control”. This step is essential.
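Since the console now redirects you to BigQuery anyway, the programmatic route goes through the Data Catalog policy-tag API. A minimal sketch with placeholder names and a single “PII” tag; activating the fine-grained access control policy type is, as far as I can tell, the API equivalent of the “Enforce access control” toggle.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.PolicyTagManagerClient()
parent = "projects/my-project/locations/us-east1"  # same region as the table

taxonomy = client.create_taxonomy(
    parent=parent,
    taxonomy=datacatalog_v1.Taxonomy(
        display_name="sample-taxonomy",
        # Programmatic counterpart of "Enforce access control".
        activated_policy_types=[
            datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
        ],
    ),
)
policy_tag = client.create_policy_tag(
    parent=taxonomy.name,
    policy_tag=datacatalog_v1.PolicyTag(display_name="PII"),
)
print(policy_tag.name)  # keep this; we need it when tagging columns
```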

Okay, so a policy tag taxonomy has been created. We should go back to Dataplex and apply it to our table. If you go to Dataplex->Search and type in “employee”, you will get this result.

2 outputs for employee

We have 2 outputs. Weird, right?

Well, this is actually expected. If you look closer, only one of these tables belongs to the “sample-lake” and “sample-zone” that we created. The other one is simply the BigQuery dataset’s table. While applying fine-grained access control, we have to select the entry that has System = BigQuery. After this, we apply the tag we created to two columns: email and phoneNumber. Another option is to apply the tag directly from the BigQuery console.

Option for fine-grain security in Dataplex if source is BigQuery

I am assuming this inconsistency will soon go away as fine-grained access control is fully launched in Dataplex. For now, it is a little confusing to select 1 out of 2 tables with the same name while applying policies.
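Applying the tag from code sidesteps the duplicate-entry confusion entirely: you attach the policy tag to the two columns on the BigQuery table itself. A sketch, assuming the (hypothetical) policy tag resource name printed by the previous snippet and the column names from the assumed schema:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
table = client.get_table("my-project.sample_zone.employee")

# Assumed policy tag resource name from the taxonomy created above.
POLICY_TAG = "projects/my-project/locations/us-east1/taxonomies/123/policyTags/456"
SENSITIVE = {"email", "phoneNumber"}

new_schema = []
for field in table.schema:
    if field.name in SENSITIVE:
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[POLICY_TAG]),
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])  # patch only the schema
```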

Now if you select * on the table in BigQuery, you will get this error message (from the producer or consumer account):

Error message on select *

Which makes sense. Now let’s select only the first and last name columns from the table. This works well.

Selecting required columns
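The behaviour is easy to reproduce from code: without the Fine-Grained Reader role on the policy tag, selecting the protected columns raises a 403, while the open columns remain readable. The column names below follow the schema I assumed earlier.

```python
from google.api_core.exceptions import Forbidden
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

try:
    client.query("SELECT * FROM `my-project.sample_zone.employee`").result()
except Forbidden as exc:
    print("Blocked by column-level security:", exc.message)

# Columns without the policy tag are still readable.
rows = client.query(
    "SELECT firstName, lastName FROM `my-project.sample_zone.employee`"
).result()
print([dict(r) for r in rows])
```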

From the consumer account, go to Dataplex->Search, search for the employee BigQuery table, and open its schema and column tags. It will show this result.

Asset schema from consumer user

Notice that we cannot see exactly which policy tag is applied, since we have opened it from the consumer account.

Common Troubleshooting:

  1. Before applying policy tags to columns for fine-grained access, ensure that you have shared the data assets, zones, and lakes with the consumer. If you do not, then you might run into this error:

This happens because, if you remember, when we created the policy tag, we created the taxonomy from the BigQuery console, so it had no connection to Dataplex. Our Dataplex service account was not the owner of that taxonomy, so it cannot apply an IAM policy with it. As a workaround, share the assets first and then apply the policy tags, as I did above.

2. Make sure that the region is the same across all resources. Dataplex shares data across projects without moving it, but it does not work across regions.

3. Do not forget to enable all the APIs mentioned in the prerequisite.

4. Do not delete a taxonomy that is already attached to a table. This will cause tags to turn into hidden policy tags, which can be confusing. Always remove the tag from the table and then delete the taxonomy or tag.

  5. If you run into permission issues while creating an asset from your Storage bucket, ensure that the Dataplex service account has the Dataplex Service Agent role on your bucket (see the sketch below).
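A sketch of that grant with the storage client. The service-agent email format (service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) and the role ID roles/dataplex.serviceAgent are my assumptions here, so double-check both for your project.

```python
from google.cloud import storage

bucket = storage.Client(project="my-project").bucket("my-dataplex-demo-bucket")
policy = bucket.get_iam_policy(requested_policy_version=3)

# Assumed Dataplex service agent identity; substitute your project number.
member = "serviceAccount:service-123456789@gcp-sa-dataplex.iam.gserviceaccount.com"
policy.bindings.append({"role": "roles/dataplex.serviceAgent", "members": {member}})
bucket.set_iam_policy(policy)
```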

Concluding my observations:

The ability to automatically discover data in GCS buckets is absolutely amazing, and storing it as a BigLake table is the cherry on top because of additional features like fine-grained security. But I think this integration will go through a few iterations before it is completely seamless. For example, if we create a BigLake table directly inside BigQuery, we get the option to choose a storage layer like AWS S3 or ADLS as well, whereas Dataplex auto-discovery for BigLake is currently limited to GCS. The inconsistency between where we create policy tags and where we apply or propagate them (as mentioned in troubleshooting point 1) is also something that might be refined. Once that happens, this integration could be a really solid way to implement the Data Fabric layer for your Data Mesh architecture.

This is my first Medium blog, so I am still figuring out how to write well, but feel free to let me know your thoughts or questions 😊

Resources:

  1. https://cloud.google.com/blog/products/data-analytics/unify-data-lakes-and-warehouses-with-biglake-now-generally-available
  2. https://cloud.google.com/dataplex
  3. https://martinfowler.com/articles/data-mesh-principles.html
  4. https://cloud.google.com/bigquery/docs/biglake-intro
