How To Control Access To BigQuery At Row Level With Groups

Introduction

BigQuery is a serverless, highly scalable data warehouse system that provides enormous flexibility for varied use cases. BigQuery has a security model based on Google Cloud’s IAM capability, which lets administrators control access to datasets by roles, groups and individual users.

In some use cases, however, it may be necessary to control access to data at a more granular level than the dataset. For example, different users might need access to different subsets of the rows in a table. For that situation, BigQuery offers Authorized Views, which let you grant a view (rather than an IAM entity such as a user or group) access to a dataset. That view can implement any necessary filters and be shared with end users, who will only be able to see the filtered data.

A common (and recommended) way to set up access control in GCP is to add users to groups and grant permissions to those groups. Often, organizations already do this for other resources and keep group membership updated in their G Suite Admin or corporate LDAP system. This article explains how to integrate this group membership data with Authorized Views, via Google Groups and the G Suite Admin API (or the Cloud Identity API), in order to create a row-level access control scheme based on groups.

Architecture

In order to synchronize group membership, we’ll use the G Suite Admin Directory API (or the Cloud Identity API) together with the BigQuery API. A Google Cloud Function will be scheduled to perform the periodic synchronization, populating a BigQuery table with all group memberships. An Authorized View is then created to join group membership with the business data tables. The following diagram shows how it works:

Prerequisites

You will need a GCP project with the BigQuery, Cloud Functions, Cloud Scheduler and Admin SDK (or Cloud Identity) APIs enabled.

You will also need a G Suite or Cloud Identity organization, with administrative access to both the organization and the project.

Setup

In order to automate the process and get access to the admin APIs, the first thing you need to do is to create a Service Account with permission to impersonate a G Suite administrator user, which in GCP parlance is called “Google Apps Domain-wide Delegation”.

Start by creating a service account:

  1. Open the Service accounts page. If prompted, select a project.
  2. Click Create service account.
  3. In the Create service account window, type a name for the service account, and select Enable Google Apps Domain-wide Delegation. Then click Save.

Then, you need to authorize this service account to access your G Suite/Cloud Identity domain, by following these steps:

  1. Go to your G Suite domain’s Admin console.
  2. Select Security from the list of controls. If you don’t see Security listed, select More controls from the gray bar at the bottom of the page, then select Security from the list of controls.
  3. Select Advanced settings from the list of options.
  4. Select Manage API client access in the Authentication section.
  5. In the Client name field enter the service account’s Client ID.
  6. In the One or More API Scopes field, enter one of these scopes: https://www.googleapis.com/auth/admin.directory.group for the G Suite Admin Directory API (Google Groups), or https://www.googleapis.com/auth/cloud-identity.groups.readonly for the Cloud Identity API.
  7. Click the Authorize button.

This is where we start to use the command line, so let’s define some variables that we’ll need:

export PROJECT_ID= # your project id
export REGION= # your region, e.g. us-central1
export SERVICE_ACCOUNT_USERNAME= # service account you created above
export SERVICE_ACCOUNT=$SERVICE_ACCOUNT_USERNAME@$PROJECT_ID.iam.gserviceaccount.com

The service account also needs permission to use BigQuery and issue authentication tokens. You can grant these roles with the following commands:

gcloud projects add-iam-policy-binding $PROJECT_ID \
 --member serviceAccount:$SERVICE_ACCOUNT \
 --role roles/bigquery.user
gcloud projects add-iam-policy-binding $PROJECT_ID \
 --member serviceAccount:$SERVICE_ACCOUNT \
 --role roles/iam.serviceAccountTokenCreator

Now we get to the actual code that will perform all the magic. First, check out the code:

git clone git@github.com:GoogleCloudPlatform/professional-services.git

Next, open main.py with a text editor and update the values of the following variables to reflect your environment:

DOMAIN = # Your G Suite/Cloud Identity domain, e.g. 'cbcloudtest.com'
ADMIN_EMAIL = # The email of a domain administrator, e.g. 'admin@cbcloudtest.com'
DATASET = # The name of a dataset that will be created to hold the user mapping table, e.g. 'bq_iam'
GROUPS_USERS_TABLE_NAME = # Name to give the user mapping table, e.g. 'groups_users'

The code uses the G Suite API by default. If you’re using Cloud Identity, there are a few lines of code you need to change in the script group_sync.py. Open the file and look for comments that start with “For Cloud Identity”, which indicate the lines that need to be changed.
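To make the flow concrete, here is a minimal sketch of what the G Suite variant of the synchronization does. It is illustrative only (the real logic lives in group_sync.py): it assumes a downloaded service account key file, whereas the deployed function uses its runtime service account with domain-wide delegation, and it omits pagination and table creation.

# Illustrative sketch only; see group_sync.py in the repository for the real code.
from google.oauth2 import service_account
from googleapiclient.discovery import build
from google.cloud import bigquery

DOMAIN = 'example.com'                      # illustrative values
ADMIN_EMAIL = 'admin@example.com'
DATASET = 'bq_iam'
GROUPS_USERS_TABLE_NAME = 'groups_users'
SCOPES = ['https://www.googleapis.com/auth/admin.directory.group']

# Delegated credentials: the service account impersonates a domain admin.
creds = service_account.Credentials.from_service_account_file(
    'service-account-key.json', scopes=SCOPES, subject=ADMIN_EMAIL)
directory = build('admin', 'directory_v1', credentials=creds)

# Flatten every group membership in the domain into (group, user_id) rows.
rows = []
for group in directory.groups().list(domain=DOMAIN).execute().get('groups', []):
    members = directory.members().list(groupKey=group['email']).execute()
    for member in members.get('members', []):
        rows.append({'group': group['email'], 'user_id': member.get('email')})

# Stream the rows into the membership table (assumed to already exist).
bq = bigquery.Client()
errors = bq.insert_rows_json(f'{DATASET}.{GROUPS_USERS_TABLE_NAME}', rows)
if errors:
    raise RuntimeError(errors)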

The next step is deploying the code as a Cloud Function, by running the following:

gcloud beta functions deploy sync_groups \
 --project $PROJECT_ID \
 --runtime python37 \
 --trigger-http \
 --timeout 300 \
 --service-account $SERVICE_ACCOUNT

At this point, you can manually run the function with gcloud functions call sync_groups. After it finishes, you should be able to see the new dataset and table populated with the data in BigQuery.
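If you prefer to check from code rather than the console, a quick look at the new table (assuming the example names bq_iam and groups_users used above) could look like this:

# Print a few rows of the synchronized membership table to verify the sync.
from google.cloud import bigquery

client = bigquery.Client()
for row in client.query('SELECT * FROM `bq_iam.groups_users` LIMIT 10').result():
    print(dict(row))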

When you’re satisfied with the results, you can schedule the function to synchronize periodically. You can run the following command to schedule it, adapting the --schedule parameter to your needs according to the crontab format:

gcloud beta scheduler jobs create http group_sync \
 --project $PROJECT_ID \
 --schedule="11 * * * *" \
 --uri=https://$REGION-$PROJECT_ID.cloudfunctions.net/sync_groups \
 --description="Synchronizes group membership between G Suite and BigQuery"

Creating the filtered views

Now that you have your group membership data in BigQuery, you can start joining it with your own business tables to filter those tables appropriately for each user.

Let’s say you have a customer table, private.customers, with columns such as customer_id, customer_name and country, and you want to limit the rows a given analyst can access to their own geographic region (identified here by country codes such as 'br', 'ca' and 'us').

Furthermore, assume the group membership table synchronized earlier, bq_iam.groups_users, contains a user_id column with each member’s email address and a group column with the group name (for example analysts_br, analysts_ca and analysts_us).

In order to apply the appropriate filter for each user, you will need the SESSION_USER function, which returns the email address of the Google account running the query.
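For instance, a trivial query run through the Python client (any BigQuery interface behaves the same way) shows what the function returns:

# SESSION_USER() evaluates to the email address of whoever runs the query.
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query('SELECT SESSION_USER() AS me').result()
print(next(iter(rows))['me'])  # e.g. 'alice@example.com'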

You could create an authorized view with the following query (in Standard SQL):

SELECT c.customer_name, c.customer_id
FROM private.customers c
JOIN bq_iam.groups_users g
  ON SESSION_USER() = g.user_id
WHERE c.country IN (
  CASE
    WHEN g.`group` = 'analysts_br' THEN 'br'
    WHEN g.`group` = 'analysts_ca' THEN 'ca'
    WHEN g.`group` = 'analysts_us' THEN 'us'
  END
)

You can then grant users access to this view, instead of the original table, and they will only be able to see the data they are authorized to see.
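In practice that involves two steps: creating the view in a dataset that end users are allowed to query, and authorizing the view on the dataset that holds the private table. A rough sketch of both steps with the BigQuery Python client (all project, dataset and table names below are placeholders, and the view query is the one shown above):

from google.cloud import bigquery

client = bigquery.Client(project='my-project')

# 1. Create the filtered view in a dataset that end users can query.
view = bigquery.Table('my-project.shared_views.customers_by_region')
view.view_query = """
SELECT c.customer_name, c.customer_id
FROM `my-project.private.customers` c
JOIN `my-project.bq_iam.groups_users` g
  ON SESSION_USER() = g.user_id
WHERE c.country IN (
  CASE
    WHEN g.`group` = 'analysts_br' THEN 'br'
    WHEN g.`group` = 'analysts_ca' THEN 'ca'
    WHEN g.`group` = 'analysts_us' THEN 'us'
  END
)
"""
view = client.create_table(view)

# 2. Authorize the view on the private dataset so it can read the source
#    table on behalf of users who cannot access that table directly.
private_dataset = client.get_dataset('my-project.private')
entries = list(private_dataset.access_entries)
entries.append(bigquery.AccessEntry(
    role=None,
    entity_type='view',
    entity_id={
        'projectId': 'my-project',
        'datasetId': 'shared_views',
        'tableId': 'customers_by_region',
    },
))
private_dataset.access_entries = entries
client.update_dataset(private_dataset, ['access_entries'])

After that, you share the dataset containing the view (shared_views in this sketch) with your analyst groups as usual; they query the view and never touch the underlying table.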

Limitations

There are a few aspects that are important to keep in mind when using this technique.

First, it only works for native BigQuery tables, with their data in BigQuery storage. For federated sources, i.e. those that pull data from GCS or other databases, the user executing the query needs to have access to the underlying resource, which would defeat the purpose in this case.

Another important limitation is that, because the SESSION_USER function is non-deterministic, BigQuery will not cache results for queries on these views, which could have an impact on costs.

Related work

Also check out the PAPER article, which lays out a comprehensive set of techniques to control access to BigQuery data, given a mapping of user permissions.


Special thanks to “Woogie” Wolgemuth for contributing the authentication code, Jake Ferriero for reviewing both the code and the article, and Edu Marreto for article input.