BigQuery Authorised View verification workflow

Ismael Yuste
Google Cloud - Community
3 min readNov 15, 2020
Photo by Paul Skorupskas on Unsplash

TL;DR: Verify your Views in a BigQuery dataset, to make sure the Authorised Views are going to work without disrupting your ETL.

Customer wants to verify all the Views on a BigQuery Data Warehouse (DW), to assess whether all are authorised against the incumbent datasets inside or outside the DW.

The BigQuery API has not implemented a method for this already, so we choose to develop a Python solution using the BigQuery Objects gathering information on project, dataset, table level to get to the point.

Use Case 1 (New dataset added to the DW):

A new dataset is added to the DW.

A process is launched, and gathers the needed information.

Review the dataset, looking for Views.

Review each view, and gather the SQL code.

Get external data sources from the SQL.

Get the incumbent dataset/s and verify the View has been authorised against the dataset.

Case not, send a notification (email, slack…) to the origin and destination dataset owners so they can take action on it.

The dataset is verified as authorised and is added to the catalog system.

Use Case 2 (periodical check on existing datasets looking for changes):

Recurrent verification of the Use Case 1.

Batch verification to be launched every (day, week…) for all the datasets and views on the DW.

A report is created, throwing the exceptions, and individual notifications are sent to dataset owners.

Solution:

Step 1 — On a given project list all datasets. This step will gather all the datasets on a given project. Project = client.project (contains the current project).

Running the code, the output will be something like this:

Datasets in project bq-sme:dataset1dataset2

References:

https://cloud.google.com/bigquery/docs/listing-datasets#python

https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/list

Gathering the list of datasets will move us to the next step.

Step 2 — On a given dataset, list all views. This step will gather all the views on a given dataset.

Running the code, the output will be something like this:

Got dataset 'bq-sme.dataset1' with friendly_name 'None'.Description: sample dataset 1Labels:type: sampleTables:Table1 - TABLEView1 - VIEWView2 - VIEW

References:

https://cloud.google.com/bigquery/docs/dataset-metadata#python

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

This sample code will retrieve all the tables, so you can get only the type VIEW for this sample.

Step 3 — Get View metadata gathering the SQL definition of the View. This step will gather metadata on a given view, retrieving the corresponding SQL metadata.

The metadata will tell us if there are external datasets that are used in this view.

Running the code, the output will be something like this:

View at bq-sme:dataset1.View1View Query:SELECT station_id FROM `bq-sme.dataset2.ny_citibike`

References:

https://cloud.google.com/bigquery/docs/view-metadata#python

https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get

As we can see, there is a View (Table Type VIEW) on a dataset, pointing to a another dataset, so this is an example of a View who needs authorisation.

Step 4 — Verify if a Dataset contains the authorisation for a View in another dataset. Get the dataset metadata, to view the access property and verify if a given view is on the list for authorised views.

Running the code, the output will be something like this:

Got dataset 'bq-sme.dataset2' with friendly_name 'None'.Description: sample dataset 2Labels:type: sampleAccess:[<AccessEntry: role=WRITER, specialGroup=projectWriters>, <AccessEntry: role=OWNER, specialGroup=projectOwners>, <AccessEntry: role=OWNER, userByEmail=email@domain.com>, <AccessEntry: role=READER, specialGroup=projectReaders>, <AccessEntry: role=None, view={u'projectId': u'bq-sme', u'tableId': u'View1', u'datasetId': u'dataset1'}>]

References:

https://cloud.google.com/bigquery/docs/dataset-metadata#python

https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/get

https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#Dataset

With this response, you can capture a view, authorised from the original dataset, and close the loop.

If no authorisation is found, you can send a notification to the dataset owner, in order to fix the issue.

Scalability:

  • Deploying the code on a scalable task manager as Google App Engine (GAE), will make atomic requests for a given dataset or view, and can manage throughput of request and manage potentially 100s or 1000s of datasets and Views.
  • Also, Cloud Functions or GKE or Cloud Run, can make this process in a automated way for any new view on BigQuery and run a nightly run for the day verification.

Evolution:

The process described in this post can be extended to Table ACL on Bigquery, and potentially can verify if a given table is shared with the incumbent group or user.

--

--

Ismael Yuste
Google Cloud - Community

Strategic Cloud Engineer @google . Cloud. Big Data. Gsuite. GIS.