Art of Possible: Federated Queries in Snowflake with Google BigQuery

In the era of big data, enterprises are often flooded with vast volumes of data from multiple sources, creating data silos. These sources might be spread across different databases, file systems, and even different geographical locations. The challenge of efficiently querying this data without moving it around has led to the need for federated queries. A federated query allows for the retrieval of data from multiple heterogeneous data sources. Under regulatory regimes like GDPR, a federated query lets you keep data in its original location and still query it seamlessly. This allows for real-time or near-real-time data access without the need for ETL processes that move the data. While this helps with compliance, it comes with its own performance challenges, complexity, and cost.

Snowflake breaks those silos and allows you to query data in those systems using the Snowpark feature called External Network Location (in private preview at the time of writing). So if you have a data set in the Google ecosystem, such as BigQuery or Sheets, you can query it and join it with the data that exists in Snowflake. The following architecture diagram shows that when a user runs a query in Snowflake, part of the query runs in Google BigQuery through an external network location over SSL (encrypted). Let's dive in:

Federated Query in Snowflake

1. Create a security integration in Snowflake using Google OAuth. To get the OAuth client ID and secret, you first need to create credentials in GCP.

USE ROLE ACCOUNTADMIN;
USE DEMODB.EXTFUNC;
USE WAREHOUSE DEMO_WH;

-- using Google OAuth
-- you need the Google client ID and client secret from the Google API Credentials page
-- download the JSON file; it will be used in step 3
-- please update OAUTH_CLIENT_ID and OAUTH_CLIENT_SECRET

CREATE OR REPLACE SECURITY INTEGRATION google_bq_oauth
TYPE = API_AUTHENTICATION
AUTH_TYPE = OAUTH2
OAUTH_CLIENT_ID = 'YOUR CLIENT ID.apps.googleusercontent.com'
OAUTH_CLIENT_SECRET = 'YOUR CLIENT SECRET'
OAUTH_TOKEN_ENDPOINT = 'https://oauth2.googleapis.com/token'
OAUTH_AUTHORIZATION_ENDPOINT = 'https://accounts.google.com/o/oauth2/auth'
OAUTH_ALLOWED_SCOPES = ('https://www.googleapis.com/auth/cloud-platform')
ENABLED = TRUE;

GRANT USAGE ON INTEGRATION google_bq_oauth TO ROLE sysadmin;

2. Create a network rule that allows Snowflake egress only to the required Google domains.

USE ROLE sysadmin;
USE DEMODB.EXTFUNC;
USE WAREHOUSE DEMO_WH;

-- create a network rule so only these sites are allowed
CREATE OR REPLACE NETWORK RULE google_access_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('drive.google.com',
'sheets.googleapis.com',
'bigquery.googleapis.com');
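
As a quick sanity check (my addition, not required for the setup), you can inspect the objects created so far with SHOW and DESC:

-- verify the security integration and the network rule
SHOW SECURITY INTEGRATIONS LIKE 'GOOGLE_BQ_OAUTH';
DESC SECURITY INTEGRATION google_bq_oauth;
SHOW NETWORK RULES LIKE 'GOOGLE_ACCESS_RULE';
DESC NETWORK RULE google_access_rule;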

3. Configure the OAuth token and API key secrets, then create an external access integration that can be used in a UDF.

-- https://developers.google.com/sheets/api/quickstart/python
-- to get the refresh token, run a local Python program (see the sketch after this step)
-- use scope https://www.googleapis.com/auth/bigquery


CREATE OR REPLACE SECRET google_bq_oauth_token
TYPE = oauth2
API_AUTHENTICATION = google_bq_oauth
OAUTH_REFRESH_TOKEN = 'YOUR REFRESH TOKEN';

CREATE OR REPLACE SECRET google_bq_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'YOUR API KEY';

-- create external access integration with the network rule created above
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION google_bq_access_int
ALLOWED_NETWORK_RULES = (google_access_rule)
ALLOWED_AUTHENTICATION_SECRETS =(google_bq_oauth_token,google_bq_api_key)
ENABLED = true;
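
The refresh token referenced above comes from a one-time OAuth consent flow run on your local machine. Here is a minimal Python sketch using the google-auth-oauthlib package, assuming you saved the client-credentials JSON downloaded in step 1 as credentials.json (the file name and the BigQuery scope are assumptions):

# one-time local helper to obtain a Google OAuth refresh token
# requires: pip install google-auth-oauthlib
from google_auth_oauthlib.flow import InstalledAppFlow

# 'credentials.json' is the client-secrets file downloaded in step 1 (assumed name)
flow = InstalledAppFlow.from_client_secrets_file(
    'credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery'])

# opens a browser for consent, then captures the redirect locally
creds = flow.run_local_server(port=0)

# paste this value into OAUTH_REFRESH_TOKEN above
print(creds.refresh_token)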

Finally, create a Python UDF using Snowpark that calls the BigQuery REST API to run the query in Google BigQuery and return the result set as JSON.

CREATE OR REPLACE FUNCTION run_bq_query(projid string, sql string)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'get_data'
EXTERNAL_ACCESS_INTEGRATIONS = (google_bq_access_int)
PACKAGES = ('snowflake-snowpark-python','requests')
SECRETS = ('cred' = google_bq_oauth_token, 'apikey' = google_bq_api_key)
AS
$$
import _snowflake
import requests
import json

session = requests.Session()
baseurl = 'https://bigquery.googleapis.com/bigquery/v2/projects/'
key = _snowflake.get_generic_secret_string('apikey')

def get_data(projid, sql):
    # fetch a fresh OAuth access token from the Snowflake secret
    token = _snowflake.get_oauth_access_token('cred')
    url = baseurl + projid + '/queries/'
    data = {'query': sql}
    response = session.post(url, json=data,
                            headers={'Authorization': 'Bearer ' + token})
    # error handling required in case the query fails
    return response.json()['rows']
$$;

The two arguments are your BigQuery project ID and the SQL query text.
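
The UDF comment notes that error handling is still required; here is one hedged way to harden get_data (the raise_for_status call and the json.dumps fallback are my additions, not part of the original):

def get_data(projid, sql):
    token = _snowflake.get_oauth_access_token('cred')
    response = session.post(baseurl + projid + '/queries/',
                            json={'query': sql},
                            headers={'Authorization': 'Bearer ' + token})
    response.raise_for_status()          # surface HTTP-level failures
    body = response.json()
    if 'rows' not in body:               # e.g. zero rows or a query error
        return json.dumps(body.get('errors', body))
    return json.dumps(body['rows'])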

That’s all you need to do for configuration and code! Now, you can run any query in your BQ project from Snowflake. In the following example, I have a table named "dept" in the dataset "test", in the project "optical-mode-334123" in BQ.

select run_bq_query('optical-mode-xxxxxx',
'select * from test.dept where location_id=1700');
-- You will see the result in json
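
For context, the BigQuery jobs.query response wraps each row in an f (fields) array of v (value) objects, which is why the query below indexes value:f[n].v. Illustratively (values made up):

"rows": [
  { "f": [ { "v": "10" }, { "v": "Administration" }, { "v": "1700" } ] },
  { "f": [ { "v": "20" }, { "v": "Marketing" },      { "v": "1800" } ] }
]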

-- use the following to format the output from JSON to tabular form
-- test before you use it; my dept table has three columns
select
value:f[0].v::number as deptno,
value:f[1].v::varchar as dname,
value:f[2].v::number as locid
from
table(flatten(input =>parse_json(
run_bq_query('optical-mode-xxxxxx',
'select * from test.dept where location_id=1700')) ));
Output of BQ Query

Furthermore, you can run a federated query in Google BigQuery from within Snowflake and join its output with other data in Snowflake on the fly, using a Snowflake warehouse. You have to be mindful of the amount of data these queries pull out of GCP, as it incurs egress cost, especially for frequent queries and/or large data sets. Secondly, complex or long-running queries can be expensive too, as they consume compute in both places (BQ and Snowflake).

-- Snowflake federated query
select e.first_name,e.last_name, e.job_id,d.dname,l.city,l.state_province
from datarepo.oradata.emp e ,
(
select
value:f[0].v::number as deptno,
value:f[1].v::varchar as dname,
value:f[2].v::number as locid
from
table(flatten(input =>parse_json(
run_bq_query('optical-mode-xxxxxx',
'select * from test.dept where location_id=1700')) ))) d,
datarepo.oradata.locations l
where e.department_id=d.deptno
and d.locid=l.location_id;

-- another example, where joins occur in both BQ and Snowflake
-- emp, departments, and regions in Snowflake; locations and countries in BQ
SELECT e.first_name,e.last_name, e.job_id,d.department_name,bq.city,
bq.state, bq.country, r.region_name
FROM datarepo.oradata.emp e , datarepo.oradata.departments d,
( /* federated query that runs in BQ */
SELECT
value:f[0].v::number as locid,
value:f[1].v::varchar as city,
value:f[2].v::varchar as state,
value:f[3].v::varchar as country,
value:f[4].v::number as regid
FROM
table(flatten(input =>parse_json(
run_bq_query('optical-mode-xxxxxx',
'SELECT location_id,
ifnull(city,'''') city,
ifnull(STATE_PROVINCE,''''),
ifnull(COUNTRY_NAME,'''') ,
region_id
FROM test.locations as l
JOIN test.countries as c
ON l.COUNTRY_ID = c.COUNTRY_ID;')) ))) bq,
datarepo.oradata.regions r
WHERE e.department_id=d.department_id
AND bq.locid=d.location_id
AND bq.regid=r.region_id;
Example of Federated Query running in Google BQ

Similarly, you can also access Google Sheets using this method; see below. You need to create a security integration, secrets, a network rule, and an external access integration as described above (a sketch follows). Note that the Sheets API returns plain JSON arrays, so the flatten in the final query indexes value[0] rather than BigQuery's f/v structure.
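
A minimal sketch of those Sheets-side objects, modeled on the BigQuery setup above. The object names match what the UDF below expects; the read-only spreadsheets scope and reuse of the same GCP client credentials are my assumptions:

-- security integration with the Sheets scope (same client ID/secret as step 1)
CREATE OR REPLACE SECURITY INTEGRATION google_sheet_oauth
TYPE = API_AUTHENTICATION
AUTH_TYPE = OAUTH2
OAUTH_CLIENT_ID = 'YOUR CLIENT ID.apps.googleusercontent.com'
OAUTH_CLIENT_SECRET = 'YOUR CLIENT SECRET'
OAUTH_TOKEN_ENDPOINT = 'https://oauth2.googleapis.com/token'
OAUTH_AUTHORIZATION_ENDPOINT = 'https://accounts.google.com/o/oauth2/auth'
OAUTH_ALLOWED_SCOPES = ('https://www.googleapis.com/auth/spreadsheets.readonly')
ENABLED = TRUE;

-- network rule limited to the Sheets endpoint
CREATE OR REPLACE NETWORK RULE google_sheet_access_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('sheets.googleapis.com');

-- refresh token obtained with the Sheets scope (same local flow as before)
CREATE OR REPLACE SECRET google_sheet_oauth_token
TYPE = oauth2
API_AUTHENTICATION = google_sheet_oauth
OAUTH_REFRESH_TOKEN = 'YOUR REFRESH TOKEN';

CREATE OR REPLACE SECRET google_sheet_api_key
TYPE = GENERIC_STRING
SECRET_STRING = 'YOUR API KEY';

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION google_sheet_access_int
ALLOWED_NETWORK_RULES = (google_sheet_access_rule)
ALLOWED_AUTHENTICATION_SECRETS = (google_sheet_oauth_token, google_sheet_api_key)
ENABLED = true;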

CREATE OR REPLACE FUNCTION get_google_sheet(sheetid string, range string)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'get_data'
EXTERNAL_ACCESS_INTEGRATIONS = (google_sheet_access_int)
PACKAGES = ('snowflake-snowpark-python','requests')
SECRETS = ('cred' = google_sheet_oauth_token, 'apikey' = google_sheet_api_key)
AS
$$
import _snowflake
import requests
import json

session = requests.Session()
baseurl = 'https://sheets.googleapis.com/v4/spreadsheets/'

def get_data(sheetid, range):
    # fetch a fresh OAuth access token from the Snowflake secret
    token = _snowflake.get_oauth_access_token('cred')
    url = baseurl + sheetid + '/values/' + range
    response = session.get(url, headers={'Authorization': 'Bearer ' + token})
    return response.json()['values']
$$;

select e.first_name,e.last_name, e.job_id,d.dname,l.city,l.state_province
from datarepo.oradata.emp e ,
(
select
value[0]::number as deptno,
value[1]::varchar as dname,
value[2]::number as locid
from
table(flatten(input =>parse_json(
get_google_sheet('1Ne5mlHUvcaa44D8-nLKmn-ETGYBW8EwoRSL13smSnWA',
'dept!a2:c28')) ))) d,
datarepo.oradata.locations l
where e.department_id=d.deptno
and d.locid=l.location_id;

Conclusion:

Although federated queries can be costly and complex, some use cases may require them, and you must assess whether they make sense for you. You can run federated queries in Snowflake with the help of the Snowpark External Network Location feature, which allows different database systems to be connected and queried at the same time, provided those systems support connecting with OAuth. This gives users the capability to do complex analytics and to query data sets that live outside Snowflake. In addition, federated queries can be used to provide near-real-time analytics. Each source system remains responsible for securely storing and accessing its data in accordance with its own security regulations, so federated queries also provide enhanced security for data retrieval. Snowflake can help you unlock the value of those disparate data sets.

Good luck!

Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer (Snowflake).

Disclosure: 50% of this article was created using ChatGPT4 and Azure Open AI.
