A step-by-step guide to Row- and Column-Level Access Policies in BigQuery

Google’s data warehouse offers powerful ways to restrict access to the information contained in tables and views, along both vertical and horizontal dimensions. Let’s discover how to set this up.

Pierre-Louis Bescond
Roquette Data & Adv. Analytics
7 min read · Mar 5, 2023


Photo by Martin Olsen on Unsplash

This article was originally published in Towards Data Science.

Managing a data warehouse implies that multiple users, with different roles and rights, should be able to query the data they are looking for.

You might decide to build one specific table for each role, but this would be time- and resource-consuming, not to mention difficult to maintain over time.

In this tutorial, we will create an “EMPLOYEES” table containing various levels of information, from the “Country” where each employee is located… to his/her salary… and consider 3 roles:

Illustration of the 3 roles — Created with www.freepik.com sources
  • Human Resources: they should access all data for their daily work, without any restrictions.
  • Data Scientists: they may only access selected, anonymized columns for modeling purposes.
  • Data Citizens: most of the columns should be hidden from them, allowing only high-level queries and analyses.

Creating a new project in Google Cloud Platform

I will assume that you already have a GCP account. If not, you can easily open one and get $300 of credits for 30 days. If you already have one, this example will cost you next to nothing.

You should start by creating a new project to isolate your work.

Click on the project selector in the top left and then “New Project”.

Make sure you can create projects under an organization (like your company); otherwise, you will not be able to enforce some of the policies:

Creating a new project linked with an organization in GCP

Note: The project created here is “row-columns-policies”.

Creating a fake employees information table

“salaries.csv” content displayed with BigQuery

I have created and uploaded a fake table, available here on my Google Drive, containing, for each of the 8,600 employees:

  • First Name
  • Last Name
  • Department
  • Country
  • Email
  • Salary
  • Currency
  • Salary Positioning
  • Last Performance Evaluation
  • Seniority
  • Commute Time

Let’s jump into “BigQuery”, click on “Add Data”, then “Local Storage”, keeping in mind two important points:

  1. A table must be stored in a dataset: we will create one called “EMPLOYEES” (see the second screenshot). Pay attention to the region you select: you will have to use the same one when defining policies.
  2. As the schema of the CSV file is straightforward, we can choose the “Auto-Detect” schema option to relieve us from declaring each column’s type individually. A SQL-only alternative to both steps is sketched after the screenshots below.
Creating a new table in BigQuery
Creating a new dataset in BigQuery
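
If you prefer scripting these two steps, here is a minimal GoogleSQL sketch. It assumes your CSV has first been staged in a Cloud Storage bucket (the gs:// path below is hypothetical), since the LOAD DATA statement cannot read local files; adapt the location to your own region.

-- Create the dataset, pinning the region (it must match the taxonomy region later)
CREATE SCHEMA IF NOT EXISTS `row-columns-policies.EMPLOYEES`
OPTIONS (location = 'EU');

-- Load the CSV, letting BigQuery auto-detect the schema (hypothetical bucket path)
LOAD DATA INTO `row-columns-policies.EMPLOYEES.SALARIES`
FROM FILES (
  format = 'CSV',
  uris = ['gs://your-bucket/salaries.csv'],
  skip_leading_rows = 1
);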

Our “SALARIES” table is now properly built (1st screenshot below) and a simple SQL query (2nd screenshot) will reveal the corresponding data:

“SALARIES” table schema in BigQuery
SELECT *
FROM `row-columns-policies.EMPLOYEES.SALARIES`
ORDER BY Country, Last_Name, First_Name
LIMIT 30
SQL exploration result in BigQuery

Creating a “tag policy” taxonomy to restrict access

From the BigQuery menu, let’s switch to “Policy tags” (if asked, enable the “Google Cloud Data Catalog” and/or “BigQuery Data Policy” APIs):

After clicking on “Create Taxonomy”, we define three levels of employee information:

  • High: critical information like the Salary
  • Medium: information like “Last Performance” or “Salary Positioning” that can be helpful to Data Scientists in their modeling tasks
  • Identification: information that establishes a direct connection with an employee, with two sub-levels:
    + Identification > Names
    + Identification > Email

Note: the two sub-tags below the “Identification” tag (Names & Emails) will allow us to use different masking strategies later on.

Creating a new taxonomy in BigQuery/Dataplex

Note: As stated before, make sure you are using the same region as the one you chose for the dataset.

Once the taxonomy is created, the most crucial part is to enforce it. You might face an issue here if your project is not attached to an organization (see the related Google documentation).

Enforcing a taxonomy in BigQuery/Dataplex

Applying Tag policies to the table columns

Tag policies can be applied through the “Dataplex” section of GCP. The search engine will help you identify the “SALARIES” table quickly:

Thanks to the “SCHEMA AND COLUMN TAGS” tab, we can easily assign a policy tag to each of the sensitive columns:

Applying policy tags in Dataplex

Adding a “viewer” principal to the project

Illustration of the 3 roles — Created with www.freepik.com sources

Let’s assume that a Data Citizen wants to access this project and conduct a high-level analysis of the employees’ distribution across countries.

We go back to the IAM section of the project and add a principal with a “Viewer” role:

Assigning a “Viewer” role in Google IAM
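
As a side note, if you only need to share the dataset rather than the whole project, BigQuery also supports granting roles directly in SQL. A minimal sketch, assuming a hypothetical user email:

-- Dataset-scoped alternative to the project-level “Viewer” role
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `row-columns-policies.EMPLOYEES`
TO 'user:data-citizen@example.com';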

Once connected, this principal will immediately get warnings from BigQuery indicating that access to some columns is restricted:

Reviewing SALARIES table with a “viewer” role in BigQuery

Indeed, as we only have minimal rights, just 3 columns are visible in the preview section:

Previewing SALARIES table with a “viewer” role in BigQuery

But it still allows us to conduct the analysis we want and get a breakdown of the number of employees per country and department (the query is sketched below the screenshot):

SQL query on SALARIES table with a “viewer” role in BigQuery
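
The query behind this screenshot looks something like the following; a minimal sketch, assuming the column names listed in the table description earlier:

-- Aggregating only untagged columns, which a plain “Viewer” is allowed to read
SELECT Country, Department, COUNT(*) AS Nb_Employees
FROM `row-columns-policies.EMPLOYEES.SALARIES`
GROUP BY Country, Department
ORDER BY Country, Department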

Defining a “Masked Reader” role and the corresponding masking strategies

Illustration of the 3 roles — Created with www.freepik.com sources

Now let’s assume that Human Resources asked a Data Scientist to perform some statistical analysis regarding the employees.

She needs to access some of the columns but not necessarily in their original form.

We start by assigning the “Masked Reader” role to this new principal:

Assigning a “Masked Reader” role in Google IAM

We then define different masking strategies to transform each column according to our needs. For example:

  • First and Last Names should be converted to NULL
  • Emails should be “hashed” to keep a unique identifier that cannot be linked back to the original employee
  • Salary should be converted to “0” (the default masking strategy for integers)
Steps to define a Masking Rule in Google BigQuery
A set of Masking Rules in Google BigQuery

Finally, we need to grant access to the information labeled “Medium” by adding this new principal as an authorized viewer of this category.

Adding a principal as a viewer of a masking policy in Google BigQuery

As expected, when connected as a Data Scientist with the “Masked Reader” role, the result of the SQL query below properly implements the rules we defined:

SELECT * EXCEPT(First_Name, Last_Name)
FROM `row-columns-policies.EMPLOYEES.SALARIES`
Result of masking policies on Masked Reader role in Google BigQuery
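
As a side note, the “hash” masking rule relies on SHA-256, so its effect on the Email column can be approximated in plain SQL. This is only a conceptual sketch; the real masking is applied transparently, without changing the Data Scientist’s query:

-- Conceptual equivalent of the “hash” masking rule on the Email column
SELECT TO_HEX(SHA256(Email)) AS Masked_Email
FROM `row-columns-policies.EMPLOYEES.SALARIES`
LIMIT 5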

Adding a row policy to the table access management

Now let’s assume that, as a Global HR Manager, we need to grant the local Canadian HR team access to the table… without disclosing figures from other countries.

We can set up a row-level security policy assigned to this user:

CREATE ROW ACCESS POLICY Canadian_filter
ON `row-columns-policies.EMPLOYEES.SALARIES`
GRANT TO ('user:user@domain.com')
FILTER USING (Country = 'Canada');

It’s as if any query executed by this principal were automatically extended with a “WHERE Country = 'Canada'” clause:

Result of row-level security in Google BigQuery
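
For completeness, a row access policy can be removed with standard DDL when it is no longer needed:

-- Remove this single policy…
DROP ROW ACCESS POLICY Canadian_filter
ON `row-columns-policies.EMPLOYEES.SALARIES`;

-- …or all row access policies on the table at once
DROP ALL ROW ACCESS POLICIES
ON `row-columns-policies.EMPLOYEES.SALARIES`;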

It is a very effective way to restrict the information available along one or multiple dimensions!

And this last example closes the different use cases I wanted to explore in this article.

As a reminder, column masking and row-level policies offer a great way to filter data directly in your data platform. Only one table is managed in the data warehouse, allowing different solutions and/or users to query it seamlessly, without compromising confidentiality rules.

🍒 Cherry on the cake: these features are free, so there is absolutely no reason not to leverage them!

As usual, I tried to capture all required steps, but do not hesitate to reach out to me should any instructions be missing from this tutorial!

(Thanks to our Lead Data Engineer, Ilyes Touzene, for proofreading this article!)
