How to Migrate Your BigQuery Dataset to a New Location / Project / Org

Moving a BigQuery dataset to a new location, project, or organization is essentially a small migration, and it is worth treating it as one.

Samet Karadag
Google Cloud - Community
6 min read · Jun 11, 2023


If you just want to change the region of a dataset within the same project, the easy path is cross-region dataset replication.

To migrate or copy a dataset to a different project or organization, use the BigQuery Data Transfer Service (BQDTS) dataset copy feature. It tackles the hardest part, moving the tables and their data, in an automated and fast way.

Easy path — Use BigQuery cross regional replica

1- Set up a cross-region replica of your dataset

2- Promote the replica to be the primary

3- Drop the replica in the previous region

You can do this using SQL:

# Create a replica of the dataset in the new location.
ALTER SCHEMA test
ADD REPLICA `us` OPTIONS(location = 'us');

# Assign the cross-region secondary replica as the primary.
ALTER SCHEMA test SET OPTIONS(primary_replica = 'us');

# Drop the replica in the old region.
ALTER SCHEMA test DROP REPLICA IF EXISTS `us-central1`;
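
One caveat before promoting: the replica has to finish its initial copy first. A minimal readiness check from the bq CLI could look like the sketch below; it assumes the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view (and its creation_complete column) is available to you, and the dataset name `test` plus the region qualifier are placeholders:

# Poll the replica status before promoting it (placeholder dataset/region).
# Assumes INFORMATION_SCHEMA.SCHEMATA_REPLICAS exposes a creation_complete flag.
bq query --nouse_legacy_sql \
  "SELECT replica_name, location, creation_complete
   FROM \`region-us\`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
   WHERE schema_name = 'test'"

Only run the SET OPTIONS(primary_replica = ...) statement once the new replica reports it is fully created.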

Copy/migrate to another GCP project / Org

To copy or migrate a dataset to another project or organization, use BigQuery Data Transfer Service -> Dataset Copy. The flow is a swap operation, similar to what you would do to change a table's structure, only done with BQDTS and a temp dataset instead of SQL.

One important thing to consider is that BQDTS only copies tables, so you need to recreate views, stored procedures, materialized views and external tables yourself on top of the transferred tables.

The process is as follows:

1- Create a temp dataset in the desired location

2- Transfer dataset to the temp dataset via BQ Data Transfer Service

3- Back up views, external tables, stored procedures and materialized views as SQL

4- Check consistency (number of tables and rows)

5- Delete the old dataset

6- Create the dataset in the new location with the same name

7- Transfer temp dataset to the new dataset with the exact name in the new location.

8- Create Views, External Tables, Stored Procedures, Materialized Views in the new dataset from backup

9- Check Consistency again

10- Drop the temp dataset

First things first: stop all write operations to your dataset to ensure data consistency. If you are not sure what writes to your dataset, check the audit logs to get the details of those operations.
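
Besides the audit logs, one quick way to spot recent writers is the job history in INFORMATION_SCHEMA.JOBS_BY_PROJECT. A minimal sketch, assuming your jobs run in the region given in the qualifier below (the region qualifier and dataset name are placeholders):

# List jobs from the last 7 days that wrote into the source dataset.
# Region qualifier and dataset name are placeholders; adjust for your project.
bq query --nouse_legacy_sql \
  "SELECT user_email, job_id, statement_type, creation_time
   FROM \`region-us\`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
   WHERE destination_table.dataset_id = 'source_dataset_name'
     AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
   ORDER BY creation_time DESC"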

Next, transfer your dataset to a temp dataset in the desired location. You can create this temp dataset in advance or through the BQ Data Transfers page.

Reach Data Transfers via the left-hand menu.

Here you can create a new dataset in the desired location.

Since the goal is just to change the location, the transfer can be scheduled on-demand and run one time only.

Fill in your source dataset and project details, and the configuration is done.

Next, run the transfer with the button at the top right.

As of writing this doc, DTS does not copy views, procedures, external tables or materialized views. Create these in the temp dataset yourself, if there are any.
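
To back these objects up (step 3 above), one convenient option is dumping their stored DDL before anything is deleted. This is a minimal sketch, assuming the ddl columns of INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.ROUTINES cover the objects you care about; project and dataset names are placeholders:

# Back up the DDL of views, materialized views and external tables.
bq query --nouse_legacy_sql --format=prettyjson --max_rows=1000 \
  "SELECT table_name, table_type, ddl
   FROM \`project_id.source_dataset_name\`.INFORMATION_SCHEMA.TABLES
   WHERE table_type <> 'BASE TABLE'" > dataset_objects_backup.json

# Back up the DDL of stored procedures and user-defined functions.
bq query --nouse_legacy_sql --format=prettyjson --max_rows=1000 \
  "SELECT routine_name, ddl
   FROM \`project_id.source_dataset_name\`.INFORMATION_SCHEMA.ROUTINES" > dataset_routines_backup.json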

After the transfer finishes successfully:

  • Compare the row counts to be sure that everything was copied successfully (a lightweight way to do this is sketched after this list)
  • Check the views, procedures and other objects
  • Delete the old dataset so you can recreate a dataset with the exact same name in the desired location
  • Transfer the temp dataset to the new dataset with the exact name in the new location
  • Create views and procedures if there are any
  • Check consistency again
  • Drop the temp dataset
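
For the row-count comparison, a lightweight option is the table metadata exposed through the __TABLES__ meta-table, which avoids running COUNT(*) on every table (the script below does it with COUNT(*) instead). A minimal sketch with placeholder project and dataset names:

# Row counts and sizes for every table in the source dataset (placeholder names).
bq query --nouse_legacy_sql --max_rows=1000 \
  "SELECT table_id, row_count, size_bytes
   FROM \`project_id.source_dataset_name.__TABLES__\`
   ORDER BY table_id"

# Run the same query against the temp dataset and diff the two outputs.
bq query --nouse_legacy_sql --max_rows=1000 \
  "SELECT table_id, row_count, size_bytes
   FROM \`project_id.temp_source_dataset_name.__TABLES__\`
   ORDER BY table_id"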

If you want to do this via a script, here is roughly how it looks (the script is just to give you an idea; it is not recommended to use it as-is):

The dataset drop commands are intentionally commented out; it is better to execute the flow manually, keep control of each step, and issue the drop commands only when you are sure everything is OK.

#!/bin/bash

# Set variables
PROJECT_ID="project_id"
OLD_DATASET="source_dataset_name"
NEW_LOCATION="us-central1"
NEW_DATASET="new_dataset_name" # use a new name if renaming, or keep the same name if only the region changes
TRANSFER_JOB_NAME="transfer_job"

# Step 1: Create a temporary dataset in the desired location
bq --project_id="$PROJECT_ID" mk --location="$NEW_LOCATION" --dataset "temp_$OLD_DATASET"

# Ensure that BQ transfer service api is enabled
# Step 2: Transfer dataset to the temp dataset via BQ Data Transfer Service
transfer_output=$(bq mk --transfer_config --project_id="$PROJECT_ID" \
  --data_source=cross_region_copy \
  --target_dataset="temp_$OLD_DATASET" \
  --display_name="Transfer $OLD_DATASET" \
  --params='{"source_dataset_id":"'"$OLD_DATASET"'","source_project_id":"'"$PROJECT_ID"'","overwrite_destination_table":"true"}' \
  --service_account_name="bq-admin@$PROJECT_ID.iam.gserviceaccount.com")

echo $transfer_output

transfer_config_id=$(echo "$transfer_output" | grep -oE 'projects/[^ ]+')

transfer_config_id_cleaned=$(echo "$transfer_config_id" | sed "s/'$//")

# Step 3: Wait for the transfer job to complete
while true; do
  JOB_STATUS=$(bq ls --transfer_run --run_attempt='LATEST' "$transfer_config_id_cleaned" | grep SUCCEEDED)
  if [[ "$JOB_STATUS" == *"SUCCEEDED"* ]]; then
    break
  else
    sleep 10
  fi
done

# Step 4: Check the consistency
SOURCE_TABLES=$(bq --project_id="$PROJECT_ID" ls --max_results=1000 "$OLD_DATASET" | awk '{print $1}' | tail -n +3)
TARGET_TABLES=$(bq --project_id="$PROJECT_ID" ls --max_results=1000 "temp_$OLD_DATASET" | awk '{print $1}' | tail -n +3)





# Step 5: Copy views from source dataset to target dataset
# (selects objects of type VIEW and repoints their queries at the temp dataset)
for VIEW in $(bq --project_id="$PROJECT_ID" ls --max_results=1000 --format=json "$OLD_DATASET" | jq -r '.[] | select(.type == "VIEW") | .tableReference.tableId'); do
  VIEW_QUERY=$(bq --project_id="$PROJECT_ID" show --format=prettyjson "$OLD_DATASET.$VIEW" | jq -r '.view.query')
  VIEW_QUERY_cleaned=$(echo "$VIEW_QUERY" | sed "s/$OLD_DATASET\./temp_$OLD_DATASET./g")
  echo "$VIEW - $VIEW_QUERY_cleaned"
  bq --project_id="$PROJECT_ID" mk --use_legacy_sql=false --view "$VIEW_QUERY_cleaned" "temp_$OLD_DATASET.$VIEW"
done

# Check if the number of tables is the same
if [[ $(echo "$SOURCE_TABLES" | wc -l) -eq $(echo "$TARGET_TABLES" | wc -l) ]]; then
  echo "Number of tables is consistent between source and target datasets."
else
  echo "Number of tables is NOT consistent between source and target datasets."
fi

# Check row counts with COUNT(*) queries
for TABLE in $SOURCE_TABLES; do
  SOURCE_ROWS=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "SELECT COUNT(*) FROM \`$PROJECT_ID.$OLD_DATASET.$TABLE\`" | tail -n +3)
  TARGET_ROWS=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "SELECT COUNT(*) FROM \`$PROJECT_ID.temp_$OLD_DATASET.$TABLE\`" | tail -n +3)

  if [[ "$SOURCE_ROWS" == "$TARGET_ROWS" ]]; then
    echo "Row count is consistent for table $TABLE."
  else
    echo "Row count is NOT consistent for table $TABLE."
  fi
done

# Step 6: Copy procedures from source dataset to target dataset
# CAUTION: the grep below assumes procedure names start with "p_"; adapt per your needs.
# Each routine is recreated from its DDL, repointed at the temp dataset.
for PROCEDURE in $(bq --project_id="$PROJECT_ID" ls --routines --max_results=1000 --format=json "$OLD_DATASET" | jq -r '.[].routineReference.routineId' | grep "^p_"); do
  PROCEDURE_DDL=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql --format=json \
    "SELECT ddl FROM \`$PROJECT_ID.$OLD_DATASET\`.INFORMATION_SCHEMA.ROUTINES WHERE routine_name = '$PROCEDURE'" | jq -r '.[0].ddl')
  PROCEDURE_DDL_cleaned=$(echo "$PROCEDURE_DDL" | sed "s/$OLD_DATASET\./temp_$OLD_DATASET./g")
  bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "$PROCEDURE_DDL_cleaned"
done

# Step 7: Delete the old dataset
# Intentionally commented out; run it manually once you have verified the copy.
# NOTE: if NEW_DATASET reuses the same name, the old dataset must be removed before Step 8.
# bq --project_id="$PROJECT_ID" rm -r -f "$OLD_DATASET"

# Step 8: Create the dataset in the new location with the same name
bq --project_id="$PROJECT_ID" mk --location="$NEW_LOCATION" --dataset "$NEW_DATASET"

# Step 9: Transfer temp dataset to the new dataset with the exact name in the new location

transfer_output=$(bq mk --transfer_config --project_id="$PROJECT_ID" \
  --data_source=cross_region_copy \
  --target_dataset="$NEW_DATASET" \
  --display_name="Transfer $NEW_DATASET" \
  --params='{"source_dataset_id":"'"temp_$OLD_DATASET"'","source_project_id":"'"$PROJECT_ID"'","overwrite_destination_table":"true"}' \
  --service_account_name="bq-admin@$PROJECT_ID.iam.gserviceaccount.com")

transfer_config_id=$(echo "$transfer_output" | grep -oE 'projects/[^ ]+')

transfer_config_id_cleaned=$(echo "$transfer_config_id" | sed "s/'$//")

echo $transfer_output

# Step 10: Wait for the transfer job to complete

while true; do
  JOB_STATUS=$(bq ls --transfer_run --run_attempt='LATEST' "$transfer_config_id_cleaned" | grep SUCCEEDED)
  if [[ "$JOB_STATUS" == *"SUCCEEDED"* ]]; then
    break
  else
    sleep 10
  fi
done

# Step 11: Copy views from the temp dataset to the new dataset
for VIEW in $(bq --project_id="$PROJECT_ID" ls --max_results=1000 --format=json "temp_$OLD_DATASET" | jq -r '.[] | select(.type == "VIEW") | .tableReference.tableId'); do
  VIEW_QUERY=$(bq --project_id="$PROJECT_ID" show --format=prettyjson "temp_$OLD_DATASET.$VIEW" | jq -r '.view.query')
  VIEW_QUERY_cleaned=$(echo "$VIEW_QUERY" | sed "s/temp_$OLD_DATASET\./$NEW_DATASET./g")
  echo "$VIEW - $VIEW_QUERY_cleaned"
  bq --project_id="$PROJECT_ID" mk --use_legacy_sql=false --view "$VIEW_QUERY_cleaned" "$NEW_DATASET.$VIEW"
done

# Step 12: Copy procedures from the temp dataset to the new dataset
# CAUTION: the grep below assumes procedure names start with "p_"; adapt per your needs.
for PROCEDURE in $(bq --project_id="$PROJECT_ID" ls --routines --max_results=1000 --format=json "temp_$OLD_DATASET" | jq -r '.[].routineReference.routineId' | grep "^p_"); do
  PROCEDURE_DDL=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql --format=json \
    "SELECT ddl FROM \`$PROJECT_ID.temp_$OLD_DATASET\`.INFORMATION_SCHEMA.ROUTINES WHERE routine_name = '$PROCEDURE'" | jq -r '.[0].ddl')
  PROCEDURE_DDL_cleaned=$(echo "$PROCEDURE_DDL" | sed "s/temp_$OLD_DATASET\./$NEW_DATASET./g")
  bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "$PROCEDURE_DDL_cleaned"
done

# Check if the number of tables is the same (temp vs. new dataset)
SOURCE_TABLES=$(bq --project_id="$PROJECT_ID" ls --max_results=1000 "temp_$OLD_DATASET" | awk '{print $1}' | tail -n +3)
TARGET_TABLES=$(bq --project_id="$PROJECT_ID" ls --max_results=1000 "$NEW_DATASET" | awk '{print $1}' | tail -n +3)

if [[ $(echo "$SOURCE_TABLES" | wc -l) -eq $(echo "$TARGET_TABLES" | wc -l) ]]; then
  echo "Number of tables is consistent between temp and new datasets."
else
  echo "Number of tables is NOT consistent between temp and new datasets."
fi

# Check row counts with COUNT(*) queries (temp vs. new dataset)
for TABLE in $SOURCE_TABLES; do
  SOURCE_ROWS=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "SELECT COUNT(*) FROM \`$PROJECT_ID.temp_$OLD_DATASET.$TABLE\`" | tail -n +3)
  TARGET_ROWS=$(bq --project_id="$PROJECT_ID" query --nouse_legacy_sql "SELECT COUNT(*) FROM \`$PROJECT_ID.$NEW_DATASET.$TABLE\`" | tail -n +3)

  if [[ "$SOURCE_ROWS" == "$TARGET_ROWS" ]]; then
    echo "Row count is consistent for table $TABLE."
  else
    echo "Row count is NOT consistent for table $TABLE."
  fi
done

# Step 13: Drop the temp dataset
# Intentionally commented out; run it manually once you have verified the new dataset.
# bq --project_id="$PROJECT_ID" rm -r -f "temp_$OLD_DATASET"

I hope you find this informative, please let me know your thoughts in the comments.
