GCP BigLake introduction

Neil Kolban
Google Cloud - Community
Apr 15, 2022

BigLake is the name Google has given to an underlying data access engine that provides access to data stored either in BigQuery or in structured formats on Google Cloud Storage (GCS). The concept is that data owners store their data in data warehouses (for example BigQuery) or in data lakes (for example GCS). Because the data is stored in distinct places, the access patterns and properties differ. For example, there are things one can do with data stored in a data warehouse that one can’t do with data stored in a data lake. BigLake provides a platform on which unification and commonality can be achieved by providing a single interface and semantics for accessing the data, irrespective of where it physically resides or what formats are used to store it.

Google made the initial release of BigLake available in April 2022. You should check the documentation regularly to see what has been added or changed.

The following describes my perspective on BigLake. It starts with the data being stored in both BigQuery and GCS. Within GCS, the data is expected to be in a structured format such as CSV, JSON, Parquet, ORC or Avro.

Next, we can define a mapping to the underlying data to expose it through BigLake. The access mapping takes place through BigQuery. Let’s unpack that notion. Imagine you have tables that are owned by BigQuery. In this case, there is nothing further to do, as you can already access the data in those tables. Now imagine the alternative case where the data is stored as objects in GCS. What we do is create what appears to be a BigQuery table but point its source of data at the underlying GCS storage. Once done, we can query or retrieve data from this BigQuery table, and BigLake takes care of retrieving and serving up the data. Pausing here, we see that we can access the data through the BigQuery interface without needing to know where it actually resides. Whether the data is in a native BigQuery table or stored as objects in Cloud Storage is hidden from us, the consumers.

Various data processing systems such as Apache Spark have connectors for BigQuery. This means that we can surface our BigLake-fronted data to anything that can consume from the BigQuery interface.

If you are skilled in BigQuery, you may be scratching your head right about now. You may be saying “Doesn’t BigQuery already have this concept in the form of External Tables?” and you would be absolutely correct. The idea of External Tables has been with BigQuery for quite some time. Just like BigLake, External Tables provide a BigQuery interface onto underlying data that is stored outside of BigQuery, including GCS, Drive, Google Sheets and more. What BigLake currently provides is an alternative interface that is designed to offer a unified data warehouse / data lake experience. In the initial set of BigLake functions, we are given an immediate bonus that is impossible with BigQuery External Tables: the ability to apply finer-grained security at the row and column level. To make that notion crystal clear, we can now have structured data stored in GCS such that access to specific rows or columns can be constrained to authorized requests. This capability was previously only available for data stored in native formats and storage managed exclusively by BigQuery.

Another distinction from External Tables relates to permissions. To query a BigQuery table mapped to an external source, the identity submitting the query would normally need read permission on the underlying data source. For example, an External Table mapped to a GCS object requires that the identity executing the query have both the table permissions and the corresponding GCS object permissions. With BigLake, we instead authorize the BigLake connection to have the GCS object permissions, and the identity running the query is then allowed or denied based exclusively on BigQuery permissions.
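To make that delegation model concrete, here is a minimal, hedged sketch of what the consumer-side grant could look like. The dataset name biglake matches the walkthrough below, while the analyst identity is purely a placeholder; the point is that the user receives only BigQuery-level access and no Storage permissions at all.

-- Hypothetical example: give an analyst read access to the dataset that
-- fronts the GCS data. No Cloud Storage permission is granted to this user;
-- the BigLake connection's service account reads the objects on their behalf.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA biglake
TO "user:analyst@example.com";

-- The analyst still needs permission to run query jobs in a project
-- (for example the BigQuery Job User role), but needs nothing on the bucket.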

Let us now walk through some usage of BigLake and see the story in action.

1. Create a GCP project and make a note of its project number. We will use it as a unique id for our GCS bucket.

2. Enable the following GCP APIs (if not already enabled):

  • BigQuery Connection API

3. Create a GCS bucket called [PROJECT_NUMBER]-biglake in the us-east1 region.

4. Create a local file that contains CSV data. In this example we call our file data.csv and it contains:

country,product,price
US,phone,100
JP,tablet,300
UK,laptop,200

Upload this file into the GCS bucket:

5. Open the BigQuery workbench in the GCP Console

6. Click + ADD DATA and then External data source

This will bring up a panel on the right.

7. Fill in the external data source panel

8. Open the connection details for the newly added external connection

Make a note of the generated service account (Service Account ID).

9. Give the service account permission to read from the GCS bucket by granting it the Storage Object Viewer role.

10. Create a BigQuery dataset. In this example we call it biglake:
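If you prefer SQL to the console, a minimal equivalent is sketched below; it assumes the dataset name biglake and the us-east1 location used elsewhere in this walkthrough.

-- Create the dataset that will hold the BigLake table
CREATE SCHEMA IF NOT EXISTS biglake
OPTIONS (location = 'us-east1');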

11. Create a BigLake table within the new dataset:

We select the dataset and then from its menu, select Create table:

In the panel that appears, we select:

  • The source of the data should be Google Cloud Storage
  • The GCS object from which the data should be retrieved
  • The file format being CSV
  • The name of the BigQuery table we wish to create. In this example, matching the SQL used later, we call it datatable
  • The check box for creating a BigLake-based table, together with the connection id to use. This is the core concept.

After creating the table, we can see its definition. Note that it is tagged BigLake.
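For reference, the same table can also be created with DDL instead of the console. The sketch below is hedged: us-east1.biglake-connection stands in for whatever connection id you created earlier, and the bucket path follows the naming convention from step 3.

-- Create a BigLake table over the CSV object in GCS
CREATE EXTERNAL TABLE biglake.datatable (
  country STRING,
  product STRING,
  price INT64
)
WITH CONNECTION `us-east1.biglake-connection`
OPTIONS (
  format = 'CSV',
  uris = ['gs://[PROJECT_NUMBER]-biglake/data.csv'],
  skip_leading_rows = 1  -- the first line of data.csv is a header row
);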

12. Run a simple query

If we now run a query from within the BigQuery workbench, we see that data is returned. If we were to change the CSV file in GCS, we would find that a subsequent query returns the latest data.
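For example, a query such as the following (using the table name from this walkthrough) returns all three rows of data.csv at this point:

-- Query the BigLake table exactly as if it were a native BigQuery table
SELECT *
FROM biglake.datatable;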

13. Apply a row-level security policy.

We want to test row level security. To do this, we run the following SQL statement:

CREATE OR REPLACE ROW ACCESS POLICY mypolicy ON biglake.datatable
GRANT TO ("user:[YOUR IDENTITY]")
FILTER USING (country = "US");

You will have to substitute your own identity in the GRANT TO portion.

14. Re-run the data access

This time we see that we are constrained to only see the rows that we are entitled to see. This demonstrates that row level security is now in effect.
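Concretely, with the policy above in place, the same query from step 12 should now return only the US row:

SELECT *
FROM biglake.datatable;

-- Expected result with the policy in place (only rows where country = "US"):
-- country | product | price
-- US      | phone   | 100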

This is not an option available to us with BigQuery External Tables; we can only achieve this using the BigLake technology.

If we were to add a new row in our CSV file in GCS that allowed us to query it, it would indeed be returned:

So far we have shown accessing a BigLake table through BigQuery queries but we can also access a BigLake table through other technologies such as Spark. Here for example is a Spark fragment that queries a BigLake table:

from pyspark.sql import SparkSession

# Create a Spark session with the BigQuery connector jar on the classpath
spark = SparkSession.builder \
    .appName('BigLake Query') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.2.jar') \
    .getOrCreate()

# Read the BigLake table through the BigQuery connector and show its rows
df = spark.read \
    .format('bigquery') \
    .load('biglake.datatable')
df.show()

And finally … a video walking us through some of the items discussed in this article.
