Using VPC Service Controls to isolate data analytics use cases in Google Cloud
VPC Service Controls (VPC-SC) is an incredibly important security control to protect Google Cloud APIs. Oversimplifying, VPC-SC allows you to define security perimeters around the projects in your organization, and it blocks API calls that would cross a perimeter unless exceptions, such as ingress rules, are defined.
Normally, VPC-SC should not be used for fine-grained access control, which is the domain of Cloud IAM, but in this article I will describe a use case that makes for a perfect exception to the rule.
The Challenge: isolating analytics use cases on BigQuery
BigQuery allows for granular access to datasets, which makes it easy for data owners to grant users access only to the datasets needed for their business use case. Depending on the organization, the type of data stored in its BigQuery data lake, and the policies regulating access, some customers may want to prevent users from correlating datasets that are accessed for different use cases.
This figure illustrates a very basic example:
Details of the picture:
- Our analyst is performing two different analyses, each of which requires querying and correlating a number of datasets, with minimal overlap between use cases
- Query 1 only queries datasets that have been authorized for analysis 1, and should be allowed
- Query 2 similarly only queries datasets that are authorized as part of analysis 2, and should equally be allowed
- Query 3 correlates datasets that belong to separate analysis use cases, and should not be allowed
Why this requirement? There are a number of reasons for an organization to limit correlations, for example: correlating two datasets that have been anonymized by aggregating data, using different aggregation strategies, may allow the extraction of information with a granularity that is not allowed by the company’s policies.
Why is this a challenge? The default access control for Cloud resources is based on Cloud IAM policies. These policies grant a principal (e.g. a service account, a user, or a group) some permissions (e.g. bigquery.tables.create) on certain resources (e.g. a project). A principal that has permissions on multiple resources can use them at the same time, and a BigQuery job that queries two tables (A, B) and aggregates their data will basically need the following permissions:
- bigquery.jobs.create on the project that runs the job
- bigquery.tables.getData on table A
- bigquery.tables.getData on table B
These permissions are evaluated independently, and therefore we cannot add a condition to bigquery.tables.getData on table A that depends on whatever else is in the query.
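To make the independence of these checks concrete, here is a toy model in Python. It is only a sketch and not how Cloud IAM is implemented: the `Binding` class, the resource names, and the assumption that a query needs bigquery.jobs.create on the project running the job plus bigquery.tables.getData on each table are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """One (principal, permission, resource) grant. Hypothetical model, not the IAM API."""
    principal: str
    permission: str
    resource: str


USER = "user:user1@example.com"

# Hypothetical grants: the user may run jobs and read tables A and B.
BINDINGS = {
    Binding(USER, "bigquery.jobs.create", "project:billing"),
    Binding(USER, "bigquery.tables.getData", "table:A"),
    Binding(USER, "bigquery.tables.getData", "table:B"),
}


def has_permission(principal: str, permission: str, resource: str) -> bool:
    """Each check looks at a single resource and knows nothing about the others."""
    return Binding(principal, permission, resource) in BINDINGS


def can_run_query(principal: str, billing_project: str, tables: list[str]) -> bool:
    """A query is allowed when every individual check passes on its own."""
    checks = [has_permission(principal, "bigquery.jobs.create", billing_project)]
    checks += [has_permission(principal, "bigquery.tables.getData", t) for t in tables]
    return all(checks)
```

In this model, once table A and table B are each readable, a query joining them is readable too: no single check can say "A is allowed only if B is not in the same query".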
The Solution
This architecture builds on the following Google Cloud Platform components:
- BigQuery, data warehouse and data analytics platform
- Data exchange, an Analytics Hub feature to share datasets across projects
- Cloud Workstations, managed development and analysis environments running inside customers’ projects on GCP
- VPC Service Controls, security perimeter to control access to GCP APIs based on context
I will not be able to go into the details of the individual products in this post; please use the links above to read more if needed. In particular, check how VPC-SC ingress rules, Analytics Hub exchange subscriptions, and Cloud Workstations work.
The goal of this architecture is to allow one user (user1@example.com) to work on two separate analytics projects, Project A and Project B, without being able to directly correlate data between the two. The user will access the data from analytics environments deployed as Cloud Workstations, running in the GCP projects Data analysis project A and Data analysis project B.
The Data View Projects A and B each host a data exchange and a linked dataset created from a subscription to that exchange. Effectively, BigQuery users will see a BigQuery dataset in Data View Project A, but this dataset is directly linked to the one in the Data Lake project; the data exchange is used here to avoid duplicating the data between projects. user1@example.com does not have any access to the Data Lake Project.
The creation of the linked datasets and data exchange can be performed as part of the analytics environment set-up; user1@example.com only needs IAM permissions on the linked BigQuery dataset in the corresponding Data View project.
Summary of the architecture components:
Projects containing the data:
- Data Lake project: the project containing the datasets that have been vetted and approved for external consumption
- Data View Project A: a project created to enable the analysis project A, it contains the linked datasets, created through a subscription to a data exchange, that contain the data needed by the analysis project A
- Data View Project B: same as project A, for project B
VPC-SC Perimeter: It is built around all the projects containing data, restricting every API.
- The analysis projects can be within a VPC-SC perimeter as well, as long as it is a separate perimeter from this one
Ingress rules:
- Ingress rule A: it allows user1@example.com, from the project Data analysis project A, to access BigQuery API resources in the project Data View Project A
- Ingress rule B: it allows user1@example.com, from the project Data analysis project B, to access BigQuery API resources in the project Data View Project B
IAM policies:
- user1@example.com has the IAM roles to access the BigQuery datasets in the projects:
  - Data View Project A
  - Data View Project B
  - Not any dataset in Data Lake Project
Workflows:
user1@example.com logs into workstation A
- API calls executed from the workstation will show as coming from Data analysis project A
- user1@example.com will be able to query the linked dataset in Data View Project A
  - It is required to use Data View Project A as the billing project for BigQuery
- But user1@example.com will not be able to query the data in Data View Project B
user1@example.com logs into workstation B
- From there, API calls will show as coming from Data analysis project B
- user1@example.com will be able to query the data in Data View Project B
  - It is required to use Data View Project B as the billing project for BigQuery
- But user1@example.com will not be able to query the data in Data View Project A
Example of ingress rule:
ingressPolicies:
- ingressFrom:
    identities:
    - user:user1@example.com
    sources:
    - resource: projects/[# Data analysis project A]
  ingressTo:
    operations:
    - methodSelectors:
      - method: '*'
      serviceName: bigquery.googleapis.com
    - methodSelectors:
      - method: '*'
      serviceName: analyticshub.googleapis.com
    resources:
    - projects/[# Data view project A]
Ingress rules are exceptions to the VPC-SC perimeter default behavior, which allows only API calls from networks or resources within the same perimeter as the target resource being accessed. In this case, the ingress rule allows API calls from outside the perimeter as long as:
- The user is authenticated as user1@example.com
- The API call originates from the project “Data analysis project A”, which hosts the VPC network to which the Cloud Workstation is attached
- The API is either bigquery.googleapis.com or analyticshub.googleapis.com
- The target resources are within the project “Data view project A”
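The four conditions can be sketched as a small Python model, which also shows why a query that touches both Data View projects is rejected from either workstation. This is a simplified illustration, not Google's evaluation logic: the `IngressRule` class, the project identifiers, and the assumption that every project touched by a request must be covered by a matching rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Toy representation of one ingress rule (hypothetical, for illustration only)."""
    identity: str
    source_project: str
    services: frozenset
    target_projects: frozenset


SERVICES = frozenset({"bigquery.googleapis.com", "analyticshub.googleapis.com"})
USER = "user:user1@example.com"

# Hypothetical project identifiers standing in for the projects in this article.
RULE_A = IngressRule(USER, "analysis-a", SERVICES, frozenset({"view-a"}))
RULE_B = IngressRule(USER, "analysis-b", SERVICES, frozenset({"view-b"}))
RULES = [RULE_A, RULE_B]


def rule_matches(rule, identity, source_project, service, target_project):
    """All four conditions must hold for the rule to admit the call."""
    return (
        identity == rule.identity
        and source_project == rule.source_project
        and service in rule.services
        and target_project in rule.target_projects
    )


def request_allowed(rules, identity, source_project, service, target_projects):
    """Assumption: each project touched by the request needs a matching rule."""
    return all(
        any(rule_matches(r, identity, source_project, service, t) for r in rules)
        for t in target_projects
    )
```

Under these assumptions, a call from "analysis-a" is admitted for "view-a" alone, but a query that also reads "view-b" fails, because the only rule covering "view-b" requires the call to come from "analysis-b".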
This architecture has been tested with a focus on the security requirements, not from the data architecture perspective. To build a solution for production, you will need to review the architecture of Analytics Hub and the other data components and align them with best practices.
This solution addresses a requirement that is specific to GCP organizations hosting sensitive and personal data. It is a complex architecture because it is a complex requirement, which cannot be solved using Cloud IAM policies alone.
When using this solution in production, it is highly recommended to use automation to create the analysis environment for new use cases: the Cloud Workstation project, the Data Views project, and the ingress rules. Without automation, I cannot recommend using this solution, since the likelihood of an error is too high.
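As a starting point for that automation, the sketch below generates one ingress rule per use case with the same structure as the example rule in this article. The function names, the use-case dictionaries, and the project numbers are hypothetical; a real pipeline would also create the projects and apply the rules to the perimeter, which is not shown here.

```python
import json

# Services opened by each rule, taken from the example ingress rule in this article.
SERVICES = ["bigquery.googleapis.com", "analyticshub.googleapis.com"]


def build_ingress_rule(user_email: str, analysis_project: str, view_project: str) -> dict:
    """Build one ingress rule: a user, from one analysis project, to one data view project."""
    return {
        "ingressFrom": {
            "identities": [f"user:{user_email}"],
            "sources": [{"resource": f"projects/{analysis_project}"}],
        },
        "ingressTo": {
            "operations": [
                {"serviceName": service, "methodSelectors": [{"method": "*"}]}
                for service in SERVICES
            ],
            "resources": [f"projects/{view_project}"],
        },
    }


def build_ingress_policies(use_cases: list[dict]) -> list[dict]:
    """One rule per use case, so no rule ever pairs analysis project A with view project B."""
    return [
        build_ingress_rule(uc["user"], uc["analysis_project"], uc["view_project"])
        for uc in use_cases
    ]


# Hypothetical use cases; the project numbers are placeholders.
USE_CASES = [
    {"user": "user1@example.com", "analysis_project": "111111111111", "view_project": "222222222222"},
    {"user": "user1@example.com", "analysis_project": "333333333333", "view_project": "444444444444"},
]

if __name__ == "__main__":
    print(json.dumps({"ingressPolicies": build_ingress_policies(USE_CASES)}, indent=2))
```

Generating the rules from a single list of use cases keeps the pairing between each analysis project and its data view project in one place, which is where a manual edit is most likely to go wrong.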
Currently (September 2024), VPC-SC has a feature in preview that allows user groups, as opposed to individual users, as parameters in ingress/egress rules. Once this feature is generally available, it will greatly simplify the management of the ingress/egress rules in this architecture, which for now requires individual users to be added to the policies.