Share Data with Confidence: Cell-level Access Controls in BigQuery and Data Studio

How to put your data on PAPER: Product Account Permission Enforced Reporting

Rich Kadel
Sep 3, 2018 · 14 min read

Co-authored by Arif Kasim, fellow Googler

Working with some of Google’s largest customers in Ads and Cloud, our roles give us the flexibility to use almost any platform, API or language at our disposal to address the customer’s business critical needs. When we’re talking about “data” (and we’re almost always talking about data), some of the most common and obvious ways of delivering our solutions include CSV files, Google Sheets (sometimes dynamically updated via API), and more recently, Data Studio dashboards.

What scares me the most is the first premise: Our teams are brought in to help solve business critical problems, and yet, the typical methods for sharing results are by “copy” (e.g., a CSV attachment) or by a shared document. In these scenarios, the data owner (the customer) has no practical control over who has a copy of their data, or how widely it is re-shared; and everyone sees the same, full set of results.

We developed an approach to managing access controls, not only at the document level, but for individual results, using a method we call “Product Account Permission Enforced Reporting” (PAPER). PAPER applies individual user permissions when retrieving results, by row or column, before returning them to the user or displaying them in data analytics and visualization tools like Data Studio, Colab, and Google Sheets.

The biggest benefit — and, often, most underappreciated — is that the data is delivered via BigQuery tables. Our most recent solution delivers 28 terabytes of data, accessible by hundreds of users from many different organizations, with various individual data permissions. These users have the full power of BigQuery SQL to pull detailed results, join these results with other first or third-party data, and aggregate results into summary reports; yet the results are always filtered to the individual’s data permissions. Users can even copy and tailor their own Data Studio reports, Colab notebooks, or Google Sheets, rather than being hemmed in by read-only access to a default dashboard. Direct access to this data in BigQuery also opens opportunities for advanced data analytics, including Machine Learning (ML) with TensorFlow (commonly developed with Colab notebooks) and newer ML tools like BigQuery ML.

The Problem

Customer data is at risk once it leaves a product’s “walled garden”

Most users interact with products like Google Ads and Google Marketing Platform (GMP) through their web user interfaces, and the UIs are custom built to enforce user logins and access controls based on user permissions. But Google is also well known for offering users the flexibility to export their data. Our Ads products are no exception. Configuration data for Ad campaigns and most of the resulting Ads event data are considered “customer data,” and customers regularly pull this data out of the product, in forms including:

  • Spreadsheet reports
Data security challenges once transferred out of a product

The problem is, once someone has extracted this data, those data security controls in the UI are no longer in play. As the data owner, the customer grants specific users access to their accounts in the product, and specific permissions within those accounts, such as to certain Ad campaigns and not others. Agencies may manage accounts for multiple customers, and product account permissions enforce that separation. None of these controls apply to the report someone just downloaded, and then emailed around. And if user permissions change, the product can’t reach back and revoke access to that emailed spreadsheet if required.

The customer is now responsible for enforcing a separate security policy, granting and revoking user access to individual documents, or to BigQuery tables, and the only way to deliver separate data based on different user permissions is to create even more documents and tables, with different (potentially overlapping) sets of data.

The Vision

Use BigQuery security features to filter results based on the user’s original data permissions

When we empower our customers with the ability to leverage powerful data management and visualization tools, outside the source products, that data and those products become more valuable. Our customers know their key business drivers, and can choose the best ways to slice and dice their data, gain unique insights, and trigger market-based decisions, limited only by their imagination.

We want to ensure that the customer’s data is protected even after being exported from the product. User permissions should still be honored, even if the data is being used in many different environments.

BigQuery includes the security features we needed to both protect the data and automatically filter results by user because BigQuery uses the same Google Sign-in process used by most Google products (and many third party products too!). When you access BigQuery from a product like Google Sheets (via Apps Script) or Colab (or Data Studio, as long as the BigQuery connector is configured to use “Viewers Credentials”), your account credentials are securely passed down to BigQuery as part of that request.

This means the vision of PAPER can be achieved using many off-the-shelf data analytics and visualization solutions, with a BigQuery backend, and each user-facing tool automatically presents only the data the user has permissions to retrieve. Applications are not granted access to the protected data, so there is no way the application can access or expose any privileged data beyond what the individual user is permitted to see.

Putting it on PAPER

PAPER securely maps Product Account Permissions to results using native SQL

Solutions secured by PAPER generally follow an architecture like the one depicted below. Although it’s possible to configure the BigQuery security layer by hand via the BigQuery console UI, humans make mistakes, which can put user privacy and data security at risk. Use scripts or other code, with BigQuery tools and APIs, as a repeatable, testable method to secure your data the right way, every time.

The PAPER architecture and data flow

In the pictured architecture, BigQuery is the central data store for all product data, and user permissions data is similarly stored in what we call the “permissions table”. If any table is already in BigQuery (such as Google Marketing Platform and Google Ads data exported via the BigQuery Data Transfer Service), there is (usually) no need to move or copy the data. For external data, we automate data extraction from the product, for example, via product APIs, and then ingest that data into BigQuery.

Generate the permissions table

The process for extracting the permissions data is likely to be product specific, and may require cooperation from the product developers. PAPER assumes a strong one-way encryption of the user’s account identifier. We recommend you extract and transform the user data inside a protected enclave, and encrypt the user account identifiers before any kind of data transfer or storage.

Take, for example, the following set of a product’s user accounts and permissions, as managed by the product itself:

The account identifier has to match the value that will be returned by the BigQuery SESSION_USER() function (“the email address of the user that is running the query”) when users later try to retrieve data. BigQuery also has standards-based cryptographic hashing functions, which you can use directly or as part of a custom transformation expression. (You could add a cryptographic salt, for example.) Keep in mind you must be able to replicate the exact transformation, both when producing the permissions table and when users run their queries.

For this simple example, assume the product’s account information can be queried using BigQuery Standard SQL-like syntax. The query normalizes and encrypts the user ID, and saves the user’s list of accessible account IDs in an array, storing one record per user.
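
Here is a minimal sketch of such a query. The source table (product.user_account_permissions) and its columns (user_email, account_id) are hypothetical placeholders for your product’s own permissions data, and the real query might write its output differently (for example, via a query job’s destination table rather than DDL); the essential parts are the normalize-and-hash of the identifier and the ARRAY_AGG of permitted account IDs:

-- Sketch only: the source table and column names are hypothetical
-- placeholders for your own product permissions data.
CREATE OR REPLACE TABLE `project-id.unfiltered.account_permissions` AS
SELECT
  -- Normalize and one-way encrypt (hash) the account identifier. The same
  -- expression must later be applied to SESSION_USER() in the filtering view.
  TO_HEX(SHA256(LOWER(user_email))) AS hashed_user,
  -- One record per user, holding the full list of permitted account IDs.
  ARRAY_AGG(DISTINCT account_id) AS account_ids
FROM
  `project-id.product.user_account_permissions`
GROUP BY
  hashed_user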

Refresh the permissions data as frequently as possible (such as every 15 minutes) so permissions changes in the product will be enforced in the external queries, analytics, and visualization tools without a long delay.

Protect your source data

Before you save the permissions table or any unfiltered product data to a BigQuery table, make sure the IAM roles of users of your Google Project (project-id in our examples) limit most users’ permissions. The Project “Owner” and “Editor” roles (primitive roles “owner” and “editor”), and “BigQuery Admin” (if used), should be granted only to a handful of trusted administrators. These users will have access to the unfiltered data and permissions tables (albeit with only encrypted account identifiers).

Permissions on BigQuery data are managed at the dataset level (controlling access to a collection of tables or views). Assuming we are storing all unfiltered data and permissions tables in the same dataset (which we’ll refer to as unfiltered in our examples), here are full table specifications for notional tables to protect:

project-id.unfiltered.account_permissions
project-id.unfiltered.account_report
project-id.unfiltered.account_summary
project-id.unfiltered.account_events

We use automation scripts to ensure the security and privacy features of PAPER are always implemented correctly and consistently. In Python implementations, we’ve used the Google Cloud Client Library for Python, and for Bash implementations, we use the gcloud, gsutil, and bq command line tools.

The examples below are in Bash and assume a Linux environment similar to that of a standard Google Compute Engine (GCE) instance, which includes a recent version of Bash. Assume we’ve already authenticated with “gcloud auth login” and set the default project to “project-id”. To keep the examples short and easy to read, some error handling and checks have been removed.

First we create the unfiltered dataset. We use the lock_down_dataset.sh script to remove default permissions, and restrict access to only administrators (Project Owners and Editors).

export PROJECT_ID='project-id'
export UNFILTERED_DATASET_ID='unfiltered'
bq mk --dataset "${UNFILTERED_DATASET_ID}"
./lock_down_dataset.sh "${UNFILTERED_DATASET_ID}"

The dataset share settings should look like this:

Locked down dataset

Implement User-Specific Filtering in an Authorized View

We need a second dataset to hold Authorized Views that do the primary job of filtering data based on user privileges. (Note that we call the dataset “hidden_views” because we will hide this dataset as well, in a later step.) Users granted access to this dataset will be able to query the views, even though they can’t query the unfiltered data directly.

export HIDDEN_VIEWS_DATASET_ID='hidden_views'
bq mk --dataset "${HIDDEN_VIEWS_DATASET_ID}"
./lock_down_dataset.sh "${HIDDEN_VIEWS_DATASET_ID}"

The filter_by_account_id.sql.template below shows one of the simplest queries to filter results based on a user’s list of permitted “account_ids”. We have to explicitly share the unfiltered dataset with the view itself, to “authorize” the view to query the unfiltered tables. The share_with_authorized_view.sh script automates this configuration. (Note that we are not showing how the unfiltered tables were created.)
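
The template itself is not reproduced here, but a minimal sketch of the kind of query it contains looks like the following, assuming the unfiltered account_report table has an account_id column. (The real template is parameterized; note the PERMISSIONS_TABLE_ID and PERMISSIONS_TTL variables exported below, which this sketch simply hard-codes or omits.)

-- Sketch of the filtering view, created as
-- `project-id.hidden_views.account_report`.
SELECT
  report.*
FROM
  `project-id.unfiltered.account_report` AS report
WHERE
  report.account_id IN (
    SELECT account_id
    FROM
      `project-id.unfiltered.account_permissions`,
      -- Flatten each user's array of permitted account IDs.
      UNNEST(account_ids) AS account_id
    WHERE
      -- The same normalize-and-hash transformation used to generate the
      -- permissions table, now applied to the querying user.
      hashed_user = TO_HEX(SHA256(LOWER(SESSION_USER())))
  )

Because the view evaluates SESSION_USER() at query time, two users querying the same view can get different rows, and a user with no matching permissions record simply gets no results.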

export PERMISSIONS_TABLE_ID='account_permissions'
export PERMISSIONS_TTL='3 DAY'
./make_and_authorize_view account_report \
filter_by_account_id \
"${UNFILTERED_DATASET_ID}" \
"${HIDDEN_VIEWS_DATASET_ID}"

At this point, the unfiltered data is protected, and you could safely grant users direct access to the hidden_views dataset. User queries would only return the results they are authorized to see. But they would also be able to see the view’s query. They would see exactly what you do to look up a user’s permissions and how you implement filtering.

Hide the Permissions Filter Logic in another Authorized View

We recommend hiding the implementation from non-privileged users by wrapping each view in hidden_views with a very simple pass-through query, implemented by the following template:
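
The template is not reproduced here, but a minimal sketch of the wrapper (for the notional account_report view from the previous step) is just a pass-through select:

-- Sketch of the pass-through wrapper view, created as
-- `project-id.filtered_views.account_report`. Non-privileged users can see
-- this trivial query, but not the filtering logic it wraps.
SELECT
  *
FROM
  `project-id.hidden_views.account_report`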

The wrapper views are saved in a third dataset, filtered_views, named to highlight to users that their results are supposed to be filtered. We need to make the dataset, but we don’t need to lock it down, since results are filtered based on each user’s product account permissions, and any user that doesn’t have a product account will simply get no results. The steps to create the dataset and then make the wrapper view are shown below:

export FILTERED_VIEWS_DATASET_ID='filtered_views'
bq mk --dataset "${FILTERED_VIEWS_DATASET_ID}"
./make_and_authorize_view account_report \
wrap_hidden_view \
"${HIDDEN_VIEWS_DATASET_ID}" \
"${FILTERED_VIEWS_DATASET_ID}"

The final configuration of datasets and views, for one table (account_report, in this example) should now look something like the view from the BigQuery console shown here. The unfiltered account_report table is mirrored by the PAPER-enabled version in filtered_views. The account_events and account_summary tables are yet to be filtered, but both the unfiltered and hidden_views datasets are (by design) restricted to privileged administrators.

Share and Use the Filtered Views

There’s one more step to make these views usable: We need to share the data with users! Not only do you need to share the filtered_views dataset, but users also need permissions to run queries, which means they need to be granted, at a minimum, the “BigQuery Job User” IAM role in some project. (We typically grant this role in the same project that hosts the BigQuery datasets, but there are valid use cases for using a different project.) With the standard minimum permissions, your users will only see the filtered_views dataset and the wrapper view query (as well as the full schema), as shown below:

Standard user’s view of PAPER data in BigQuery

We also likely want to go beyond just giving users access to the data. Most of our users are not likely to write their own SQL. So we configure data analytics and visualization tools to reference the filtered_views, make sure those tools are configured to use each individual viewer’s credentials, and then share the tools with the same users that have access to the dataset.

We need to make sure BigQuery uses the data viewer’s credentials to filter the data to that user’s permissions, and not the document owner’s credentials. This distinction is clear in Data Studio data source documents, for example. In Data Studio, you create a data source, select the Google BigQuery Connector, and select the BigQuery project, dataset, and table (or view); then you can toggle between using “Viewer’s Credentials” or “Owner’s Credentials”. We strongly advise using only Viewer’s Credentials for any sensitive or business critical information.

Data Studio is great for distributing access to PAPER-protected datasets because the actual data retrieved by each user is unique to that viewer. Users can even copy view-only Data Studio reports, or data sources, and create tailored versions that use the same PAPER-protected BigQuery authorized views, so when they re-share their copies, the data is still filtered and the underlying data is still completely safe from unauthorized users.

Colab documents shared in view-only mode — and then used in “playground mode” — provide a protected data enclave similar to Data Studio. In “playground mode”, users can run Colab cells that execute BigQuery SQL, and the output is visible only to that user. Be cautious of granting edit privileges, or even using your own Colab documents shared with others. Unless you explicitly turn on “playground mode”, your results will be visible to other users that might not have the same data privileges.

Google Sheets can include Apps Script to run BigQuery SQL, but there is no concept like “playground mode”, so results are always visible. If you want to offer users Google Sheets as an option for dynamically filling a spreadsheet with permission-filtered data from a PAPER-protected dataset, consider making copies of the Apps Script-enabled Google Sheet, one for each user.

Aggregate or Automate Sharing

We’ve highlighted just some of the ways to share PAPER-protected results, but keep in mind, every time you share one of these documents, you also need to add that user to the filtered_views dataset and to the project as a BigQuery Job User.

We have two recommendations for simplifying the process of synchronizing these permissions.

  1. Use Google Groups. You can use a Google Group alias in place of individual user emails, so you can grant the Group the BigQuery permissions, share the filtered_views dataset with it, and share all documents that use the dataset with that Group. Then just add users to the Google Group in one step. (We usually suggest configuring the Google Group to be used only as an access control list, by removing all other Google Group features: certainly remove the ability to join or request to join the group, but also remove the ability to post topics or send group emails.)

Monitor BigQuery Logs for Unintended Usage

PAPER gives your users a wider array of tools for data analysis, including direct access to run BigQuery queries, but this also introduces a potential risk, particularly when dealing with large datasets. Users might — even unintentionally — create a new dashboard configuration or query that scans too much data, which can increase BigQuery costs over time.

GCP gives you monitoring tools and APIs to identify and correct problems before incurring unexpectedly high costs. We recommend a two-stage safety net:

  1. Implement alert notifications on your BigQuery logs to notify your admins if the “bytes billed” metric for a user’s query exceeds a threshold that indicates a very expensive query. Stackdriver Logging APIs can trigger Cloud Functions via Pub/Sub, based on criteria you specify. (You could also monitor for unintended use cases, such as a user consuming your BigQuery quota to query large datasets in other Google Cloud projects.) A complementary, batch-style check is sketched below.
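
As a complement to real-time alerts, you could also periodically run a query like the hedged sketch below over audit logs exported to BigQuery. It assumes you have configured a Stackdriver Logging sink that exports BigQuery data-access audit logs to a dataset (here called audit_logs, a hypothetical name), and the field paths follow the legacy AuditData export format, so they may differ in your setup:

-- Sketch only: `audit_logs` is a hypothetical dataset fed by a Stackdriver
-- Logging export of BigQuery data-access audit logs; field paths assume the
-- legacy AuditData export format.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent
      .job.jobStatistics.totalBilledBytes) AS total_billed_bytes
FROM
  `project-id.audit_logs.cloudaudit_googleapis_com_data_access_*`
WHERE
  -- Limit the scan to today's shard of the exported audit log tables.
  _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY
  user_email
HAVING
  -- Flag users whose queries billed more than 1 TB today; tune the
  -- threshold to your own cost tolerance.
  total_billed_bytes > 1024 * 1024 * 1024 * 1024
ORDER BY
  total_billed_bytes DESC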

Advanced Features Also Supported

The example queries and scripts represent realistic but simple versions of a PAPER implementation. We have been able to extend the design to address more complex requirements. Here are two notable cases:

  1. BigQuery supports partitioning tables by Date, as well as breaking up a table into multiple tables with a common prefix (known as Wildcard Tables), and you can even combine both features to effectively create two-dimensional partitioning. For large tables approaching a terabyte or more, effective use of partitions can make a significant difference in BigQuery costs and query performance. The BigQuery engine has to be able to recognize the partitions required for a query, and there are ways to inadvertently confuse BigQuery, causing it to scan all partitions instead of the desired subset. The details are outside the scope of this article, but the good news is, all of these features still work from a PAPER filtered view; the sketch below illustrates the date-partitioning case.
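
As one hedged illustration of the partitioning case: if the unfiltered events table is ingestion-time partitioned, the filtering view in hidden_views can expose the partition pseudocolumn under an alias (pseudocolumns can’t be selected through a view directly), so user queries that filter on the alias can still prune partitions:

-- Sketch: a filtering view over an ingestion-time partitioned
-- `account_events` table that also exposes the partition pseudocolumn.
SELECT
  events.*,
  -- Alias the pseudocolumn so queries against the view can filter
  -- (and prune) by partition.
  events._PARTITIONTIME AS partition_time
FROM
  `project-id.unfiltered.account_events` AS events
WHERE
  events.account_id IN (
    SELECT account_id
    FROM
      `project-id.unfiltered.account_permissions`,
      UNNEST(account_ids) AS account_id
    WHERE
      hashed_user = TO_HEX(SHA256(LOWER(SESSION_USER())))
  )

A user query against the view can then include a filter such as WHERE partition_time = TIMESTAMP('2018-09-01') to limit the scan to a single day of data.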

Key features and benefits, versus other approaches

PAPER gives data owners confidence their data is protected, while giving users a set of powerful tools, anchored by the full capabilities of BigQuery, to analyze and visualize data, including very large datasets. Compared to other similar solutions, PAPER has clear advantages, as shown in this figure:

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

