Breaking data silos with BigQuery Omni cross-cloud joins
BigQuery has been the silver arrow in Google Cloud’s data arsenal. It is one of the most sophisticated pieces of technology in the platform, pushing the limits of compute, storage, and networking. Over the last year, GCP has been very aggressive in shipping features that help break data silos and increase the adoption of BigQuery. Two notable products in this direction have been BigLake and BigQuery Omni.
BigQuery also recently announced BigQuery Omni cross-cloud joins, and in this post we’ll see how we can use BigLake and BigQuery Omni to join data across BigQuery and S3 buckets without copying the data from S3 into GCP.
So let’s get started.
Table of Contents
- Introduction to BigLake
- Introduction to BigQuery Omni
- Getting the dataset
- Creating a BigLake table for data stored in S3
- Creating BigQuery Internal Table
- Joining the two tables and getting results
Introduction to BigLake
BigLake provides a unified storage engine and fine-grained security across different formats (BigQuery internal storage, open table formats like Iceberg) and different cloud object stores (AWS S3, Azure Blob Storage). BigLake also provides various connectors to read and write data in BigLake tables.
Introduction to BigQuery Omni
BigQuery Omni lets you use the same BigQuery interface to run analytics on data stored across multiple clouds. BigQuery achieves this by running BigQuery compute clusters in AWS and Azure regions, so that the data can be queried where it is stored.
Combining the power of these two products, we can break data silos and join data from multiple clouds.
Getting the dataset
The dataset we’ll use in this post is the Airbnb dataset, which is freely available. We’ll use a subset (listings and reviews) of the data for the Amsterdam region. We’ll load the reviews data into a BigQuery internal table on GCP, and the listings data into an AWS S3 bucket.
The original data comes as gzipped CSV files, and we’ll use this online utility to convert them to Parquet. We are using Parquet as it reduces the file size, but CSV would work directly as well.
Once we have data loaded and all set up, we’ll see how we can query and join the data across multiple clouds.
Creating a BigLake table for data stored in an S3 bucket
We’ll start by putting some Parquet data in an S3 bucket. If you don’t have an AWS account, you can sign up for a trial account and create an S3 bucket.
I have a bucket named kaysush-bigquery-store in the AWS us-east-1 region. I will upload the listings.parquet file from the previous step to the path listings/listing.parquet in this bucket.
You’ll also need to create a policy and a role that allow BigQuery Omni to read data from the S3 bucket. You can follow the documentation here to configure the policy and the role. My policy definition looks like this.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::kaysush-bigquery-store"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::kaysush-bigquery-store",
"arn:aws:s3:::kaysush-bigquery-store/*"
]
}
]
}
And the role’s trust policy looks like this.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "accounts.google.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"accounts.google.com:aud": "00000"
}
}
}
]
}
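For reference, the role and its policies above could also be created from the AWS CLI, roughly like this (a sketch; the role, policy, and file names are my own choices, and the JSON documents are assumed to be saved locally):

```shell
# Create the role with the trust policy shown above
aws iam create-role \
  --role-name bigquery-omni-read-only \
  --assume-role-policy-document file://trust-policy.json

# Attach the S3 read policy shown above as an inline policy
aws iam put-role-policy \
  --role-name bigquery-omni-read-only \
  --policy-name bigquery-omni-s3-read \
  --policy-document file://s3-read-policy.json
```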
Next let us configure the external connection.
bq mk --connection --connection_type='AWS' \
--iam_role_id=arn:aws:iam::AWS_ACCOUNT_ID:role/bigquery-omni-read-only \
--location=aws-us-east-1 \
bigquery-omni-s3-us-east-1
This will return an Identity value, which we’ll have to add to our AWS role to ensure the connection is trusted. You can follow the steps here to do it.
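If you need to retrieve that identity again later, bq show on the connection should print it (a sketch; substitute your own project ID):

```shell
bq show --connection PROJECT_ID.aws-us-east-1.bigquery-omni-s3-us-east-1
```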
Once our connection is set up, we will create an external table for the Airbnb listings.
Let us create the dataset first.
bq --location=aws-us-east-1 mk \
--dataset \
PROJECT_ID:bq_aws_dataset
Next we’ll create an external table definition, based on which we’ll create the external table.
bq mkdef \
--source_format=PARQUET \
--connection_id=AWS_LOCATION.CONNECTION_NAME \
--metadata_cache_mode=AUTOMATIC \
S3_PATH > table_def
This will generate a file called table_def with the necessary information to create an external table.
{
"connectionId": "AWS_LOCATION.CONNECTION_NAME",
"metadataCacheMode": "AUTOMATIC",
"parquetOptions": {
"enableListInference": false,
"enumAsString": false
},
"sourceFormat": "PARQUET",
"sourceUris": [
"S3_PATH"
]
}
One thing to note, however: if you use this file as-is, bq will complain about a missing maxStaleness parameter, which is required since we are using AUTOMATIC for metadataCacheMode.
bq mk --external_table_definition=table_def bq_aws_dataset.aws_listings
You get the following error.
BigQuery error in mk operation: maxStaleness must be specified when MetadataCacheMode is AUTOMATIC
But if you try to add --max_staleness as a flag to the bq mkdef call, you’ll get the following error.
FATAL Flags parsing error: Unknown command line flag 'max_staleness'
Run 'bq.py help' to get help
I also tried updating the bq CLI, but to no avail. So there seems to be some disconnect between the docs and the CLI, as I was able to create the external table from the UI.
To proceed, let us create the external table via the UI.
Hit Create Table, and your table should be created. You can verify it by running a query on it.
So it looks like we are able to query our S3 listings data via BigQuery Omni without copying the data over to the GCP side.
Creating BigQuery Internal table
Now we’ll create a BigQuery internal table with the Airbnb reviews data.
First, I’ll create a separate dataset to demarcate that this is an internal table. I’ll keep the dataset location the same as that of the S3 bucket and the previously created dataset. In the next section, I’ll also see what happens if we try to join tables in different regions.
bq --location=aws-us-east-1 mk \
--dataset \
PROJECT_ID:bq_internal_dataset
It turns out that when I tried to create an internal table by uploading the Parquet reviews file, I got the exception Bigquery: Media upload not authorized.
Apparently, moving my bq_internal_dataset to us-east4 worked. I also found a forum thread where this problem and its solution were mentioned.
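For reference, loading the Parquet file from the command line would look something like this (a sketch; the table name is my own choice):

```shell
bq load --source_format=PARQUET \
  bq_internal_dataset.internal_reviews \
  ./reviews.parquet
```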
As expected we are able to query the reviews.
Joining the two tables and getting results
Now for the moment of truth: we’ll try to join these two tables and fetch the following attributes
- ID of the listing
- Name of the listing
- Description of the listing
- Number of reviews
The query is not very complicated, but it goes to show the capabilities of cross-cloud joins.
WITH
review_counts AS (
SELECT
listing_id,
COUNT(*) AS review_counts
FROM
`neural-land-324105.bq_internal_dataset.internal_reviews`
GROUP BY
listing_id )
SELECT
listing.id AS listing_id,
listing.name AS name,
listing.description AS description,
review.review_counts
FROM
`neural-land-324105.bq_aws_dataset.aws_listings` AS listing
INNER JOIN
review_counts AS review
ON
listing.id = review.listing_id
However, we hit another roadblock: since these two tables are in different regions, we got the following error.
Not found: Dataset neural-land-324105:bq_aws_dataset was not found in location us-east4
Although the BigQuery Omni documentation states that aws-us-east-1 and us-east4 are colocated, we are unable to join tables from these two regions.
I tried deleting and recreating bq_aws_dataset in the us-east4 region, but when I tried creating the external table I got the following error.
The selected source requires a data set in the aws-ap-northeast-2, aws-eu-west-1, aws-us-east-1 or aws-us-west-2 location. Please choose a valid data set.
Note: It turns out that cross-cloud joins were in Preview, and I had to request access for my project by filling in a Google Form. Once the feature was enabled for my project, I could run this query without error.
Conclusion
As more and more companies adopt a multi-cloud approach, getting the most value out of our data has traditionally meant moving all of it to the one cloud provider where our analytics stack is set up. With BigQuery Omni cross-cloud joins we no longer need to move the data, as we can query it where it resides. This works for both AWS and Azure (refer to the documentation for supported regions).
This is a great feature for teams with a multi-cloud setup, or for teams that prefer to keep their service infrastructure in one provider (say AWS or Azure) but their data and analytics in GCP.