Breaking data silos with BigQuery Omni cross-cloud joins

Sushil Kumar
Google Cloud - Community
Oct 18, 2023

BigQuery has been the sharpest arrow in Google Cloud's data quiver. It's one of the most sophisticated pieces of technology in the portfolio, pushing the limits of compute, storage and network. Over the last year, GCP has been very aggressive in pushing out features that help break data silos and increase the adoption of BigQuery. Two notable products in this direction have been BigLake and BigQuery Omni.

BigQuery also recently announced BigQuery Omni cross-cloud joins, and in this post we'll see how to use BigLake and BigQuery Omni to join data across BigQuery and S3 buckets without copying the data from S3 into GCP.

So let’s get started.

Table of Contents

  1. Introduction to BigLake
  2. Introduction to BigQuery Omni
  3. Getting the dataset
  4. Creating a BigLake table for data stored in S3
  5. Creating a BigQuery internal table
  6. Joining the two tables and getting results
Source: https://datixinc.com/blog/dont-let-data-silos-ruin-your-business/

Introduction to BigLake

BigLake provides a unified storage engine and fine-grained security across different formats (BigQuery internal storage, open table formats like Iceberg) and different cloud object stores (AWS S3, Azure Blob Storage). BigLake also provides various connectors to read and write data in BigLake tables.

Source: https://cloud.google.com/bigquery/docs/biglake-intro

Introduction to BigQuery Omni

BigQuery Omni lets you use the same BigQuery interface to run analytics on data stored across multiple clouds. BigQuery achieves this by running BigQuery compute clusters in AWS and Azure regions, so that the data can be queried where it is stored.

Source: https://cloud.google.com/bigquery/docs/omni-introduction

By combining these two products, we can break data silos and bring together data from multiple clouds.

Getting the dataset

The dataset that we'll use in this post is the Airbnb dataset, which is freely available. We'll use a subset (Listings and Reviews) of the data for the Amsterdam region. We'll load the reviews data into a BigQuery internal table and the listings data into an AWS S3 bucket.

The original data comes as gzipped CSV files, and we'll use an online utility to convert them to Parquet. We're using Parquet because it reduces the file size, but CSV would work directly as well.

Once we have data loaded and all set up, we’ll see how we can query and join the data across multiple clouds.

Creating a BigLake table for data stored in an S3 bucket

We’ll start by putting some Parquet data in an S3 bucket. If you don’t have an AWS account, you can sign up for a trial account and create an S3 bucket.

I have a bucket named kaysush-bigquery-store in the us-east-1 AWS region. I will upload the listings.parquet file from the previous step to the path listings/listing.parquet in this bucket.
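If you have the AWS CLI configured for the account that owns the bucket, the upload is a one-liner (a sketch; the local file name comes from the conversion step above):

# Upload the converted Parquet file to the path used in this post
aws s3 cp listings.parquet s3://kaysush-bigquery-store/listings/listing.parquet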

You’ll also need to create a policy and a role that allow BigQuery Omni to read data from the S3 bucket. You can follow the documentation here to configure the policy and the role. My policy definition looks like the one below.


"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::kaysush-bigquery-store"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::kaysush-bigquery-store",
"arn:aws:s3:::kaysush-bigquery-store/*"
]
}
]
}

And the role's trust policy looks like this.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:aud": "00000"
        }
      }
    }
  ]
}
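For reference, if you prefer the AWS CLI over the console, creating the role and attaching the permissions policy could look roughly like this (a sketch; the JSON file names are illustrative, and the aud value in the trust policy still has to be replaced with the connection's identity created in the next step):

# Create the role using the trust policy shown above (saved as trust-policy.json)
aws iam create-role \
  --role-name bigquery-omni-read-only \
  --assume-role-policy-document file://trust-policy.json

# Attach the S3 read permissions policy (saved as s3-read-policy.json) as an inline policy
aws iam put-role-policy \
  --role-name bigquery-omni-read-only \
  --policy-name bigquery-omni-s3-read \
  --policy-document file://s3-read-policy.json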

Next let us configure the external connection.

bq mk --connection --connection_type='AWS' \
--iam_role_id=arn:aws:iam::AWS_ACCOUNT_ID:role/bigquery-omni-read-only \
--location=aws-us-east-1 \
bigquery-omni-s3-us-east-1

This will return an identity value, which we'll have to add to our AWS role's trust policy (as the accounts.google.com:aud condition value) to ensure the connection is trusted. You can follow the steps here to do it.
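If you need to look the identity up again later, bq show can print the connection details (a sketch; replace PROJECT_ID with your project):

# Prints the connection metadata, including the identity to put in the role's trust policy
bq show --format=prettyjson --connection PROJECT_ID.aws-us-east-1.bigquery-omni-s3-us-east-1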

Once our connection is set up, we will create an external table for the Airbnb listings.

Let us create the dataset first.

bq --location=aws-us-east-1 mk \
--dataset \
PROJECT_ID:bq_aws_dataset

Next, we'll create an external table definition, based on which we'll create the external table.

bq mkdef  \
--source_format=PARQUET \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
--metadata_cache_mode=AUTOMATIC \
S3_PATH > table_def
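With the values used in this walkthrough, the command would look roughly like this (assuming the bucket and object path from the earlier step):

# Generate the external table definition for the listings file in S3
bq mkdef \
  --source_format=PARQUET \
  --connection_id=aws-us-east-1.bigquery-omni-s3-us-east-1 \
  --metadata_cache_mode=AUTOMATIC \
  s3://kaysush-bigquery-store/listings/listing.parquet > table_def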

This will generate a file called table_def with the necessary information to create an external table.

{
  "connectionId": "AWS_LOCATION.CONNECTION_NAME",
  "metadataCacheMode": "AUTOMATIC",
  "parquetOptions": {
    "enableListInference": false,
    "enumAsString": false
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "S3_PATH"
  ]
}

One thing to note, however, is that if you use this file as-is, BigQuery will complain about a missing maxStaleness parameter, which is required since we are using AUTOMATIC for metadataCacheMode.

bq mk --external_table_definition=table_def bq_aws_dataset.aws_listings

You get the following error.

BigQuery error in mk operation: maxStaleness must be specified when MetadataCacheMode is AUTOMATIC

But if you try to add --max_staleness as a flag in the call to bq mkdef, you'll get the following error.

FATAL Flags parsing error: Unknown command line flag 'max_staleness'
Run 'bq.py help' to get help

I also tried updating the bq CLI, but to no avail. There seems to be some disconnect with the docs, since I was able to create the external table from the UI.
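If you want to avoid the UI, one likely workaround (a sketch based on the documented CREATE EXTERNAL TABLE syntax, which I haven't verified in this exact setup) is to create the table with SQL DDL, where max_staleness is available as a table option:

# Create the BigLake table over S3 with DDL instead of bq mk
bq query --location=aws-us-east-1 --nouse_legacy_sql '
CREATE EXTERNAL TABLE bq_aws_dataset.aws_listings
WITH CONNECTION `aws-us-east-1.bigquery-omni-s3-us-east-1`
OPTIONS (
  format = "PARQUET",
  uris = ["s3://kaysush-bigquery-store/listings/listing.parquet"],
  metadata_cache_mode = "AUTOMATIC",
  max_staleness = INTERVAL 4 HOUR
)'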

To keep things moving, let's create the external table via the UI.

Creating the external table over the AWS S3 data, reusing the connection created in the previous step

Hit Create Table and your table should be created. You can verify it by running a query on it.
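For example, a quick look at a few columns (illustrative; the column names come from the listings file):

# Sanity check on the S3-backed table
bq query --location=aws-us-east-1 --nouse_legacy_sql '
SELECT id, name, description
FROM bq_aws_dataset.aws_listings
LIMIT 10'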

Query over S3 data

So it looks like we are able to query our S3 listings data via BigQuery Omni without copying the data over to the GCP side.

Creating a BigQuery internal table

Now we'll create a BigQuery internal table with the Airbnb reviews data.

First, I'll create a separate dataset to make it clear that this is an internal table. I'll keep the dataset's location the same as that of the S3 bucket and the previously created dataset. In the next section, I'll also see what happens if we try to join tables in different regions.

bq --location=aws-us-east-1 mk \
--dataset \
PROJECT_ID:bq_internal_dataset

It turns out that when I tried to create the internal table by uploading the Parquet reviews file, I got an exception: BigQuery: Media upload not authorized.

Moving my bq_internal_dataset to us-east4 worked. I also found a forum thread where this problem and its solution were mentioned.
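If you'd rather do this from the CLI than the console, recreating the dataset in us-east4 and loading the file looks roughly like this (a sketch; reviews.parquet is the converted reviews file from earlier):

# Remove the misplaced dataset and recreate it in us-east4
bq rm -r -f -d PROJECT_ID:bq_internal_dataset
bq --location=us-east4 mk --dataset PROJECT_ID:bq_internal_dataset

# Load the local Parquet file into an internal table
bq load --source_format=PARQUET \
  bq_internal_dataset.internal_reviews \
  ./reviews.parquet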

As expected, we are able to query the reviews.

Querying the BQ internal review table.

Joining the two tables and getting results

Now, for the moment of truth, we'll try to join these two tables and fetch the following attributes:

  1. ID of the listing
  2. Name of the listing
  3. Description of the listing
  4. Number of reviews

The query is not very complicated, but it goes to show the capabilities of cross-cloud joins.

WITH
  review_counts AS (
    SELECT
      listing_id,
      COUNT(*) AS review_counts
    FROM
      `neural-land-324105.bq_internal_dataset.internal_reviews`
    GROUP BY
      listing_id )
SELECT
  listing.id AS listing_id,
  listing.name AS name,
  listing.description AS description,
  review.review_counts
FROM
  `neural-land-324105.bq_aws_dataset.aws_listings` AS listing
INNER JOIN
  review_counts AS review
ON
  listing.id = review.listing_id

However, we hit another roadblock: since these two tables are in different regions, we got the following error.

Not found: Dataset neural-land-324105:bq_aws_dataset was not found in location us-east4

Although the BigQuery Omni documentation states that aws-us-east-1 and us-east4 are colocated, we are unable to join tables across these two regions.
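To double-check where each dataset actually lives, bq show prints the dataset metadata, including its location (a sketch):

# The "location" field shows where each dataset resides
bq show --format=prettyjson PROJECT_ID:bq_aws_dataset
bq show --format=prettyjson PROJECT_ID:bq_internal_dataset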

I tried deleting and recreating bq_aws_dataset in the us-east4 region, but when I tried creating the external table there I got the following error.

The selected source requires a data set in the aws-ap-northeast-2, aws-eu-west-1, aws-us-east-1 or aws-us-west-2 location. Please choose a valid data set.

Note: It turns out that cross-cloud joins were in Preview, and I had to request access for my project by filling in a Google Form. Once it was enabled for my project, I could run this query without error.

Joining Listing table in AWS to Reviews table in GCP

Conclusion

As more and more companies take up a multi-cloud approach, getting the most value out of the data has traditionally meant moving all of it to the one cloud provider where the analytics stack is set up. With BigQuery Omni cross-cloud joins, we no longer need to move the data; we can query it where it resides. This works for both AWS and Azure (refer to the documentation for supported regions).

This is a great feature for teams with a multi-cloud setup, or teams that prefer to keep their service infrastructure in one provider (say AWS or Azure) but their data and analytics in GCP.
