Ensuring data localization compliance on data movement between BigQuery regions

Oscar Pulido
6 min read · Aug 8, 2023

Global organizations need to analyze data coming from the different jurisdictions they operate in to generate important insights for their decision-making processes. On the other hand, data protection laws like the EU’s GDPR (General Data Protection Regulation), or India’s PDPB (Personal Data Protection Bill) include data sovereignty regulations that need to be carefully considered.

While many regulations have data residency requirements that restrict moving PII data out of a given jurisdiction, this generally does not apply to anonymized or aggregated data, allowing it to be analyzed globally.

This post will introduce some opinionated best practices to facilitate cross-border transfers in a privacy-safe and compliant way.

For this approach we will set aside important infrastructure and networking considerations to focus on the data movement process, including resource hierarchy, the required data access controls, and PII detection to mitigate the risk of breaching regulations.

Projects and Datasets

A robust hierarchy that uses multiple folders and projects to organize the data facilitates the enforcement of organization policies and operations (billing, logging, monitoring), and simplifies the implementation of data access controls and PII inspection. In the following example we only consider environment, region, and data product, but a production-ready deployment should also consider hierarchies such as data domains or data marts, based on the defined analytics strategy (e.g. a DWH, a Data Mesh, or even a Lakehouse).

Projects and Datasets for cross-region data sharing
  1. Source Dataset: one dataset per jurisdiction. Its shape depends on the data warehouse model in place; it may contain denormalized data, a star schema with fact and dimension tables, a data mart, or any other analytics schema. You can also think of it as the existing dataset holding the source data to be analyzed globally.
  2. Isolated Dataset: one dataset per jurisdiction that holds locally aggregated, PII-free, or anonymized data. Prefer new tables over views/authorized views, both to show clear evidence that no raw data is leaving the perimeter and to be able to use VPC Service Controls.
  3. Destination Dataset: one global dataset, in the selected location, in which analytical users run global analytics. It is the destination for the data coming from all the Isolated Datasets in the different regions.

Source Dataset, Isolated Dataset, and Destination Dataset should each reside in a separate project, which provides the following advantages:

  • An additional IAM layer for security.
  • The ability to use VPC Service Controls to prevent the local analytical dataset from being accessible through the API from outside the region.
  • Preventing global analytical users/service accounts from reading non-shareable data.
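
To make this layout concrete, here is a minimal sketch that creates the three datasets with the google-cloud-bigquery Python client. The project IDs, dataset names, and locations are hypothetical placeholders for a single source jurisdiction (EU) and a global analytics location (US).

```python
from google.cloud import bigquery

# Hypothetical project IDs, dataset names, and locations for illustration only.
DATASETS = [
    ("acme-emea-dwh-prod", "source_dataset", "europe-west3"),        # Source Dataset (existing DWH)
    ("acme-emea-isolated-prod", "isolated_dataset", "europe-west3"), # Isolated Dataset (aggregated / PII-free)
    ("acme-global-analytics-prod", "destination_dataset", "US"),     # Destination Dataset (global analytics)
]

for project_id, dataset_name, location in DATASETS:
    client = bigquery.Client(project=project_id)
    dataset = bigquery.Dataset(f"{project_id}.{dataset_name}")
    dataset.location = location  # data residency: a BigQuery dataset never leaves its location
    client.create_dataset(dataset, exists_ok=True)
    print(f"Created {project_id}.{dataset_name} in {location}")
```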

Access Control Strategy

The end-to-end process of preparing globally shareable aggregated data should rely on two types of service accounts:

  1. Source Service Account: One service account per jurisdiction that reads local analytical data and prepares local aggregate data.
  2. Sharing Service Account: One common service account that can read the minimum necessary local aggregated data in all jurisdictions.
Service Accounts & required accesses

As the Source Dataset is part of the existing data warehouse model, existing users/processes will keep reading from and writing to it. Beyond those, no other service accounts or user accounts should be given permission to read or write data in the Source Dataset, the Isolated Dataset, or the Destination Dataset. Global analytics users should impersonate service accounts rather than hold direct permissions, preventing end users from joining shared data with other local analytical data.

It is important to prevent the Sharing Service Account from reading non-aggregated/non-anonymized data (Source Dataset) in the local jurisdiction.
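
Below is a minimal sketch of the dataset-level grants described above, using the BigQuery Python client; the service account emails and dataset IDs are hypothetical, and the exact roles will depend on how the cross-region copy is executed. VPC Service Controls and impersonation restrictions would sit on top of these grants.

```python
from google.cloud import bigquery

# Hypothetical service account emails and dataset IDs for illustration only.
SOURCE_SA = "sa-emea-source@acme-emea-isolated-prod.iam.gserviceaccount.com"
SHARING_SA = "sa-global-sharing@acme-global-analytics-prod.iam.gserviceaccount.com"

def grant(dataset_id: str, role: str, member: str) -> None:
    """Append a dataset-level access entry for a service account."""
    client = bigquery.Client()
    dataset = client.get_dataset(dataset_id)
    entries = list(dataset.access_entries)
    entries.append(bigquery.AccessEntry(role=role, entity_type="userByEmail", entity_id=member))
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])

# Source Service Account: reads local analytical data, writes the aggregated result.
grant("acme-emea-dwh-prod.source_dataset", "READER", SOURCE_SA)
grant("acme-emea-isolated-prod.isolated_dataset", "WRITER", SOURCE_SA)

# Sharing Service Account: reads only the Isolated Dataset, writes to the global Destination Dataset.
grant("acme-emea-isolated-prod.isolated_dataset", "READER", SHARING_SA)
grant("acme-global-analytics-prod.destination_dataset", "WRITER", SHARING_SA)
```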

Cross-region Data Movement

Data is moved in two steps. Data that is ready to be moved should sit in a dedicated/isolated BigQuery dataset or a dedicated/isolated GCS bucket, so the first step is to move the data from the source BigQuery dataset to that isolated location. The second step is the actual cross-region data movement.

Isolating data in the source region

Data should arrive at the Isolated Dataset already de-identified or aggregated. It can be de-identified in the Source Dataset beforehand, or de-identified as it is moved to the Isolated Dataset.

Assuming your source data is already de-identified in the BigQuery Source Dataset, you can move it to an isolated BigQuery dataset or to an isolated GCS bucket before taking it out of the region. Here are some options to do that:

  1. The bq extract command is the simplest way to extract data from BigQuery tables to Cloud Storage.
  2. You can also use Composer/Airflow and the BigQueryToCloudStorageOperator Airflow operator to move data from BigQuery to Cloud Storage (see the sketch after this list).
  3. Table snapshots or table clones can also be used to move data between the Source Dataset and the Isolated Dataset within BigQuery without physically duplicating the data, reducing cost.
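
Here is a minimal Composer/Airflow sketch of option 2; in recent Google provider versions the operator is exposed as BigQueryToGCSOperator. The DAG, table, and bucket names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# Hypothetical table and bucket names for illustration only.
with DAG(
    dag_id="export_isolated_dataset_to_gcs",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export_aggregates = BigQueryToGCSOperator(
        task_id="export_aggregates",
        source_project_dataset_table="acme-emea-isolated-prod.isolated_dataset.orders_agg",
        destination_cloud_storage_uris=["gs://acme-emea-isolated-export/orders_agg/*.avro"],
        export_format="AVRO",  # export stays in the same region as the source dataset and bucket
    )
```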

If data is to be de-identified as it is moved, then an ETL approach (Dataflow/Dataproc) will be necessary to move it into the Isolated Dataset.
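
As an illustration only, here is a minimal Apache Beam (Dataflow) sketch that reads from an assumed source table, drops assumed PII columns, and writes to a pre-created table in the Isolated Dataset. A production pipeline would more likely apply Cloud DLP de-identification (masking, tokenization) rather than simply dropping columns.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical table names; the pipeline's temp_location (GCS) must be in the source region.
SOURCE_TABLE = "acme-emea-dwh-prod:source_dataset.customer_orders"
ISOLATED_TABLE = "acme-emea-isolated-prod:isolated_dataset.orders_agg"

def drop_pii(row):
    # Keep only non-identifying columns; the column names are assumptions.
    return {"country": row["country"], "order_total": row["order_total"]}

options = PipelineOptions()  # set --runner=DataflowRunner, --region, --temp_location, etc.

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadSource" >> beam.io.ReadFromBigQuery(table=SOURCE_TABLE)
        | "DropPII" >> beam.Map(drop_pii)
        | "WriteIsolated" >> beam.io.WriteToBigQuery(
            ISOLATED_TABLE,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table pre-created in the Isolated Dataset
        )
    )
```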

Landing the data in GCS is preferred in order to leverage the Storage Transfer Service or gsutil rsync for the following step.

Moving Data between regions

As of now (Nov 2023) there is no GA option to read data from BigQuery in one region and write it directly to BigQuery in a different region; GCS needs to be used for staging in at least one of the regions. We can move the data using no-code options like the Storage Transfer Service (STS) or gsutil rsync, or with an ETL approach (Dataflow/Dataproc):

  1. Use gsutil rsync for data sizes under 1TB; it uses the default CMEK key set on the GCS source and destination buckets.
  2. Use STS for data sizes above 1TB; it scales to larger data sizes and supports transferring data to and from CMEK-protected buckets (see the sketch below).
  3. Both Dataproc and Dataflow ETL options support setting a temporary bucket when moving data between different BigQuery regions.
GCS temporary bucket location to move data across regions; applies to the ETL options only. gsutil rsync and STS always move data from and to GCS.
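
For the STS option, here is a minimal sketch using the Storage Transfer Service Python client; the project ID, bucket names, and start date are hypothetical, with the sink bucket living in the destination region.

```python
from google.cloud import storage_transfer

# Hypothetical project ID and bucket names for illustration only.
client = storage_transfer.StorageTransferServiceClient()

request = storage_transfer.CreateTransferJobRequest(
    transfer_job={
        "project_id": "acme-global-analytics-prod",
        "description": "Move aggregated EMEA exports to the global staging bucket",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "schedule": {"schedule_start_date": {"year": 2023, "month": 8, "day": 8}},
        "transfer_spec": {
            "gcs_data_source": {"bucket_name": "acme-emea-isolated-export"},      # source region
            "gcs_data_sink": {"bucket_name": "acme-global-analytics-staging"},    # destination region
        },
    }
)

job = client.create_transfer_job(request)
print(f"Created transfer job: {job.name}")
```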

There are two offerings, in preview as of Nov 2023, that allow you to move data between regions directly in BigQuery:

  1. BigQuery cross-region dataset copy lets you copy an entire dataset across regions without an ETL pipeline and without moving data out of BigQuery. However, there are several limitations: you cannot copy views, UDFs, external tables, or CMEK-encrypted tables; appending data in the destination dataset is not supported; and the minimum interval between copy jobs is 12 hours (see the sketch after this list).
  2. If the use case calls for keeping a read-only replica in the destination region, then cross-region dataset replication could be considered, as it is a simple setup. However, it is important to review its limitations, as this option is intended for additional geo-redundancy, not specifically for cross-region data sharing.
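
For the first option, here is a minimal sketch that schedules a cross-region dataset copy through the BigQuery Data Transfer Service Python client; the project and dataset names are hypothetical.

```python
from google.cloud import bigquery_datatransfer

# Hypothetical project and dataset names for illustration only.
client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="destination_dataset",
    display_name="emea_isolated_to_global_copy",
    data_source_id="cross_region_copy",  # dataset-copy data source of the Data Transfer Service
    params={
        "source_project_id": "acme-emea-isolated-prod",
        "source_dataset_id": "isolated_dataset",
    },
    schedule="every 24 hours",  # remember the 12-hour minimum interval between copy jobs
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("acme-global-analytics-prod"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```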

In these two cases, the presented organization hierarchy and access control strategy should also be applied to guarantee jurisdictional data isolation and secure data movement.

Mitigating cross-region PII data exfiltration risk

The DLP (Data Loss Prevention) service is a fundamental tool for ensuring PII data is not transferred outside the local jurisdiction. Inspection jobs and inspection templates can be used to publish tags in Data Catalog at the table level, or at the column level using a Dataflow job.

DLP inspection jobs that inspect sampled data in the Isolated Dataset can run in batch to improve cost-effectiveness, as in the sketch below.
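
Here is a minimal sketch of such a batch inspection job with the DLP Python client, sampling rows from an assumed table in the Isolated Dataset and publishing findings to Data Catalog. The project, table, infoTypes, and sampling percentage are hypothetical choices.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Hypothetical project, dataset, and table names for illustration only.
inspect_job = {
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "acme-emea-isolated-prod",
                "dataset_id": "isolated_dataset",
                "table_id": "orders_agg",
            },
            # Sample a small percentage of rows to keep the batch job cost-effective.
            "rows_limit_percent": 10,
            "sample_method": dlp_v2.BigQueryOptions.SampleMethod.RANDOM_START,
        }
    },
    "inspect_config": {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}, {"name": "PERSON_NAME"}],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    "actions": [
        # Publish findings as tags in Data Catalog so downstream pipelines can check them.
        {"publish_findings_to_cloud_data_catalog": {}},
    ],
}

job = dlp.create_dlp_job(
    request={
        "parent": "projects/acme-emea-isolated-prod/locations/europe-west3",
        "inspect_job": inspect_job,
    }
)
print(f"Started DLP inspection job: {job.name}")
```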

To avoid delaying data movement, inspection jobs can run in parallel with the cross-region replication jobs, ensuring PII data is identified no later than it is moved, so you can stop a data movement or delete already-transferred data based on alerts.

Once DLP identifies sensitive data, a policy tag can be automatically created to further restrict access depending on the content.

If the movement is orchestrated with Composer, a Data Catalog Airflow operator can obtain entry details, including tags and their values, to be used as a control or validation step in the transfer pipeline, as sketched below.
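
A minimal sketch of such a validation step: a function that looks up the Data Catalog entry for an assumed isolated table and fails if any DLP-related tag is attached. The linked resource and the tag-template check are assumptions; in Composer this could be wrapped in a PythonOperator placed upstream of the transfer task.

```python
from google.cloud import datacatalog_v1

# Hypothetical BigQuery table in the Isolated Dataset, expressed as a Data Catalog linked resource.
LINKED_RESOURCE = (
    "//bigquery.googleapis.com/projects/acme-emea-isolated-prod"
    "/datasets/isolated_dataset/tables/orders_agg"
)

def fail_if_pii_tagged() -> None:
    """Raise if any DLP-related tag is attached to the table's Data Catalog entry."""
    client = datacatalog_v1.DataCatalogClient()
    entry = client.lookup_entry(request={"linked_resource": LINKED_RESOURCE})
    for tag in client.list_tags(parent=entry.name):
        # Assumption: DLP findings are published under a tag template whose path contains "dlp".
        if "dlp" in tag.template.lower():
            raise ValueError(f"DLP findings tagged on {entry.name}: {tag.template}")

# In Composer, wrap this in a PythonOperator and gate the transfer on it, e.g.:
# validate_no_pii >> trigger_cross_region_transfer
```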

PII validation step on Orchestrated data movement

Conclusion

At the time of writing, STS, gsutil rsync, and bq extract are no-code options to isolate and replicate data across regions which, in conjunction with the presented access control strategy, hierarchy design, and DLP-based PII identification process, provide jurisdictional data isolation and ensure data is moved in a secure way.

Thanks to Himal Dwarakanath, Daryus Medora and Julianne Cuneo who collaborated on this story.
