Extend Cloud Storage filtering with BigQuery Object Table

guillaume blaquiere
Google Cloud - Community
5 min read · Mar 21


Cloud Storage is very popular and one of the first services you use when you move to the cloud or when you want to import/export data to/from Google Cloud.
It’s a convenient, serverless, limitlessly scalable object storage service.

Cloud Storage Limitations

Serverless services always come with limitations because of opinionated choices made by the development team.

The most popular “limitation” (or misunderstood feature) is folder management, because folders simply do not exist!
It’s a common misunderstanding among newcomers: Cloud Storage is a blob storage system with a flat namespace (its items are named Objects, not Files), not a block storage system with a hierarchical file system (like your own computer). More explanation here.

Another annoying limitation is the filtering capability. You can only filter on the object’s path prefix: no wildcard path filtering, no date range, no extension/content-type selection,…
It’s really painful when you want to use the CLI to change the storage class of objects older than 90 days, or to select only the objects with a given extension.

How to extend this Cloud Storage filtering limitation?

Filter your files as simply as a SQL query

The filtering by prefix comes from the underlying solution used by Cloud Storage: all the stored objects are indexed in a database, and Bigtable is used for that.

If you are familiar with Bigtable, you know that the row key definition is very important because you can only query on it, and only on its prefix.
Exactly as you do with Cloud Storage.

To work around this object filtering limitation, the answer comes from another very popular service on Google Cloud: BigQuery.

Yes, there isn’t an immediate connection between Cloud Storage and BigQuery, but a brand new feature named “Object Table” allows listing Cloud Storage object metadata directly in BigQuery.

And because the Cloud Storage object metadata is accessible from BigQuery, your file (object) selection becomes as easy as a SQL query!

Let’s make this BigQuery connection!

BigQuery Object Table

An Object Table is a table with the capacity to read a Cloud Storage bucket. To achieve that, three main operations are required:

  • Create a BigQuery Connection
  • Grant the BigQuery Connection the permission to read the target bucket(s)
  • Create the BigQuery table based on the BigQuery connection and the bucket(s) to read

BigQuery Connection creation

The BigLake feature offers the ability to create a connection between BigQuery and a data lake, like Cloud Storage.

From the command line, you can do it like this

bq mk --connection --location=us \
--connection_type=CLOUD_RESOURCE medium_conn

Pay attention to the location: it must be the same as your Cloud Storage bucket location. Use US or EU if you have multi-region buckets in those parts of the world.

Permission on BigQuery Connection

It’s the trickiest part. When you create a BigQuery connection, a service account is automatically provisioned. Therefore, we have to retrieve this service account and grant it the role roles/storage.objectViewer so it can read the bucket metadata.

You can get the service account like this

bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId

jq simplifies the JSON parsing and value extraction!
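As a quick sanity check, here is the same jq extraction applied to a mocked response (the service account value below is invented; only the cloudResource.serviceAccountId field path matches the real bq show output):

```shell
# Mocked `bq show --format json --connection` output; the serviceAccountId
# value is hypothetical, the JSON field path matches the real response.
mock='{"cloudResource":{"serviceAccountId":"connection-sa@example.iam.gserviceaccount.com"}}'
echo "$mock" | jq -r .cloudResource.serviceAccountId
```

The -r flag strips the JSON quotes so the value can be passed directly to gcloud.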

To grant the service account access to the Cloud Storage bucket, you can combine the role grant and the service account extraction, like this

#Ideally for least privilege enforcement. But seems not to be working for now.
gcloud storage buckets add-iam-policy-binding \
--member=serviceAccount:$(bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId) \
--role=roles/storage.objectViewer <YOUR BUCKET>

#Grant at project level instead
gcloud projects add-iam-policy-binding \
--member=serviceAccount:$(bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId) \
--role=roles/storage.objectViewer <PROJECT_ID>

Of course, you can grant the service account at the project/folder/organization level if required. The project level can be a good option if your table reads from several buckets (a tradeoff between gcloud CLI repetition and the least privilege principle).

Note: you will have to wait 2–3 minutes for the permission grant to propagate. If you use it too quickly, you will get permission issues (even if the BigQuery error message mentions SLAs and an outage on BigQuery!)
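If you script the whole setup, you can absorb this propagation delay with a small retry loop instead of a fixed sleep. A minimal sketch (retry, RETRY_MAX, and RETRY_DELAY are hypothetical helpers, not bq or gcloud features):

```shell
# Hypothetical retry helper: re-run a command until it succeeds, to absorb
# the 2-3 minute IAM propagation delay. RETRY_MAX and RETRY_DELAY are made-up knobs.
retry() {
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "${RETRY_MAX:-10}" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-30}"
  done
}

# Usage sketch: retry bq query --use_legacy_sql=false 'SELECT 1'
```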

Object Table Creation

Finally, you can create your BigQuery external table based on your connection and the bucket(s) to read

#First create the dataset if it does not exist
bq mk --dataset --location=us medium_dataset

#Create the table
bq mk --table \
--external_table_definition=gs://<YOUR BUCKET>/*@us.medium_conn \
--object_metadata=SIMPLE \
medium_dataset.my-object-table

I used the bq command line to create the table instead of the DDL so I could script this code and add it to a bash script if required!
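For reference, the DDL equivalent should look like this (same dataset, connection, and bucket placeholders as above):

```sql
CREATE EXTERNAL TABLE `medium_dataset.my-object-table`
WITH CONNECTION `us.medium_conn`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://<YOUR BUCKET>/*']
);
```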

Extend command line capability

With your Object Table, you can now query the objects through different dimensions for a powerful search

SELECT uri FROM `medium_dataset.my-object-table`
WHERE size > 100000
ORDER BY updated

But a SQL query is not really useful when you want to perform a CLI operation (manual or automated) on the objects.
One of the great parts of the Object Table feature is the capacity to query it with the bq CLI and pipe the result into operations on the objects.

Here are some examples that use the same pattern.
jq is the JSON tool used to simplify the formatting and data selection. If you don’t have it, install it!

  1. Run the SQL query with bq CLI, and request the result in JSON format
  2. Use jq to extract the object’s uri
  3. For each object, run the gcloud storage command, thanks to xargs
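To make the pattern concrete before touching real objects, here is a mock of steps 2 and 3, where a hard-coded JSON payload stands in for the bq query result and echo stands in for the destructive gcloud command:

```shell
# Mocked step 1 output: what `bq query --format=json` would return.
json='[{"uri":"gs://my-bucket/a.pdf"},{"uri":"gs://my-bucket/b.pdf"}]'

# Steps 2 and 3: jq extracts the uris, xargs appends them to the command.
# `echo` is a stand-in for the real `gcloud storage` operation.
echo "$json" | jq -r '.[].uri' | xargs echo gcloud storage rm
```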

Remove objects with PDF extension

Here, we have to select all the objects with the .pdf extension. You could also select the objects with the application/pdf content-type.

The query is the following

SELECT uri FROM `medium_dataset.my-object-table`
WHERE uri LIKE "%.pdf"

Combine that with the bq CLI, jq, xargs and gcloud storage

bq query --format=json --use_legacy_sql=false \
'SELECT uri FROM `medium_dataset.my-object-table`
WHERE uri LIKE "%.pdf"' \
| jq -r .[].uri | xargs gcloud storage rm
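The content-type alternative mentioned above filters on the content_type column that object tables expose alongside uri, size, and updated:

```sql
SELECT uri FROM `medium_dataset.my-object-table`
WHERE content_type = 'application/pdf'
```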

Change storage class of objects older than 90 days

This time, we have to select all the objects last updated more than 90 days ago.

The query is the following

SELECT uri FROM `medium_dataset.my-object-table` 
WHERE updated < DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)

Combine that with the bq CLI, jq, xargs and gcloud storage

bq query --format=json --use_legacy_sql=false \
'SELECT uri FROM `medium_dataset.my-object-table`
WHERE updated < DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)' \
| jq -r .[].uri | xargs \
gcloud storage objects update --storage-class=nearline
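Since these pipelines end in bulk or destructive operations, a cheap dry run is to prefix the final command with echo so the generated command line is printed instead of executed (shown here with a mocked uri list in place of the bq|jq output):

```shell
# Dry-run sketch: printf stands in for the real bq|jq output; the `echo`
# prefix prints the generated gcloud command instead of executing it.
printf 'gs://my-bucket/old1.log\ngs://my-bucket/old2.log\n' \
  | xargs echo gcloud storage objects update --storage-class=nearline
```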

Limitless combination

Cloud Storage is a limitless storage service.
BigQuery is a limitless data warehouse query engine.

The combination of both unlocks new features and creates an advanced object query solution!

By assembling bash pipes to link the CLI commands, you can achieve powerful and limitless queries on Cloud Storage to solve all your usual issues.
Use it limitlessly!


GDE cloud platform, Group Data Architect @Carrefour, speaker, writer and polyglot developer, Google Cloud platform 3x certified, serverless addict and Go fan.