Tutorial: How to run SQL queries with Presto on Google BigQuery
This tutorial was originally authored by Rohan Pednekar.
Presto has evolved into a unified SQL engine on top of cloud data lakes, handling both interactive queries and batch workloads across multiple data sources. This tutorial shows how to run SQL queries with Presto (running on Kubernetes) against Google BigQuery.
Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
Step 1: Set up a Presto cluster with Kubernetes
Set up your own Presto cluster on Kubernetes using these instructions.
Step 2: Set up a Google BigQuery project on Google Cloud Platform
Create a Google BigQuery project from the Google Cloud Console and make sure it’s up and running with a dataset and tables, as described here.
The screen below shows a Google BigQuery project with the table “Flights”.
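If you prefer the command line, the dataset and table can also be created with the bq tool from the Google Cloud SDK; the dataset name below follows the examples used later in this tutorial, and the CSV file is a placeholder:
bq mk --dataset <your_project_id>:rohan88
bq load --autodetect --source_format=CSV rohan88.flights ./flights.csv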
Step 3: Set up a key and download the Google BigQuery credential JSON file
To authenticate the BigQuery connector to access the BigQuery tables, create a credential key and download it in JSON format.
Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.
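If you want to test the key locally with Google client libraries first, you can point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at it (the path below is a placeholder):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/bigquery-credential.json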
A sample credential file should look like this:
{
  "type": "service_account",
  "project_id": "poised-journey-315406",
  "private_key_id": "5e66dd1787bb1werwerd5ddf9a75908b7dfaf84c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwgKozSEK84b\ntNDXrwaTGbP8ZEddTSzMZQxcX7j3t4LQK98OO53i8Qgk/fEy2qaFuU2yM8NVxdSr\n/qRpsTL/TtDi8pTER0fPzdgYnbfXeR1Ybkft7+SgEiE95jzJCD/1+We1ew++JzAf\nZBNvwr4J35t15KjQHQSa5P1daG/JufsxytY82fW02JjTa/dtrTMULAFOSK2OVoyg\nZ4feVdxA2TdM9E36Er3fGZBQHc1rzAys4MEGjrNMfyJuHobmAsx9F/N5s4Cs5Q/1\neR7KWhac6BzegPtTw2dF9bpccuZRXl/mKie8EUcFD1xbXjum3NqMp4Gf7wxYgwkx\n0P+90aE7AgMBAAECggEAImgvy5tm9JYdmNVzbMYacOGWwjILAl1K88n02s/x09j6\nktHJygUeGmp2hnY6e11leuhiVcQ3XpesCwcQNjrbRpf1ajUOTFwSb7vfj7nrDZvl\n4jfVl1b6+yMQxAFw4MtDLD6l6ljKSQwhgCjY/Gc8yQY2qSd+Pu08zRc64x+IhQMn\nne1x0DZ2I8JNIoVqfgZd0LBZ6OTAuyQwLQtD3KqtX9IdddXVfGR6/vIvdT4Jo3en\nBVHLENq5b8Ex7YxnT49NEXfVPwlCZpAKUwlYBr0lvP2WsZakNCKnwMgtUKooIaoC\nSBxXrkmwQoLA0DuLO2B7Bhqkv/7zxeJnkFtKVWyckQKBgQC4GBIlbe0IVpquP/7a\njvnZUmEuvevvqs92KNSzCjrO5wxEgK5Tqx2koYBHhlTPvu7tkA9yBVyj1iuG+joe\n5WOKc0A7dWlPxLUxQ6DsYzNW0GTWHLzW0/YWaTY+GWzyoZIhVgL0OjRLbn5T7UNR\n25opELheTHvC/uSkwA6zM92zywKBgQC3PWZTY6q7caNeMg83nIr59+oYNKnhVnFa\nlzT9Yrl9tOI1qWAKW1/kFucIL2/sAfNtQ1td+EKb7YRby4WbowY3kALlqyqkR6Gt\nr2dPIc1wfL/l+L76IP0fJO4g8SIy+C3Ig2m5IktZIQMU780s0LAQ6Vzc7jEV1LSb\nxPXRWVd6UQKBgQCqrlaUsVhktLbw+5B0Xr8zSHel+Jw5NyrmKHEcFk3z6q+rC4uV\nMz9mlf3zUo5rlmC7jSdk1afQlw8ANBuS7abehIB3ICKlvIEpzcPzpv3AbbIv+bDz\nlM3CdYW/CZ/DTR3JHo/ak+RMU4N4mLAjwvEpRcFKXKsaXWzres2mRF43BQKBgQCY\nEf+60usdVqjjAp54Y5U+8E05u3MEzI2URgq3Ati4B4b4S9GlpsGE9LDVrTCwZ8oS\n8qR/7wmwiEShPd1rFbeSIxUUb6Ia5ku6behJ1t69LPrBK1erE/edgjOR6SydqjOs\nxcrW1yw7EteQ55aaS7LixhjITXE1Eeq1n5b2H7QmkQKBgBaZuraIt/yGxduCovpD\nevXZpe0M2yyc1hvv/sEHh0nUm5vScvV6u+oiuRnACaAySboIN3wcvDCIJhFkL3Wy\nbCsOWDtqaaH3XOquMJtmrpHkXYwo2HsuM3+g2gAeKECM5knzt4/I2AX7odH/e1dS\n0jlJKzpFpvpt4vh2aSLOxxmv\n-----END PRIVATE KEY-----\n",
  "client_email": "bigquery@poised-journey-678678.iam.gserviceaccount.com",
  "client_id": "11488612345677453667",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bigquery%40poised-journey-315406.iam.gserviceaccount.com"
}
Pro tip: Before moving to the next step, try using your downloaded credential JSON file with a third-party SQL tool such as DBeaver to access your BigQuery table. This confirms that your credentials have valid access rights and helps isolate any issue with the credentials themselves.
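Another way to sanity-check the key is with Google’s own command-line tools, assuming the Google Cloud SDK is installed (the project id below is a placeholder):
gcloud auth activate-service-account --key-file=bigquery-credential.json
bq ls --project_id=<your_project_id>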
Step 4: Configure Presto Catalog for Google BigQuery Connector
To configure the BigQuery connector, create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup. Do this by editing the catalog config map so the change is reflected in the deployment:
kubectl edit configmap presto-catalog -n <cluster_name> -o yaml
Following are the catalog properties that need to be added:
connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>
bigquery.credentials-file=path/to/bigquery-credentials.json
Following are sample entries for the catalog YAML file:
bigquery.properties: |
  connector.name=bigquery
  bigquery.project-id=poised-journey-317806
  bigquery.credentials-file=/opt/presto-server/etc/bigquery-credential.json
Step 5: Configure Presto Coordinator and workers with Google BigQuery credential file
To configure the coordinator and workers with the credential file:
1. Load the content of the credential file as bigquery-credential.json in the Presto coordinator’s configmap:
kubectl edit configmap presto-coordinator-etc -n <cluster_name> -o yaml
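For reference, a minimal sketch of the new entry under the configmap’s data section, assuming the key name matches the subPath used in the volumeMounts below; only the first two fields are shown, and the rest of the credential file from Step 3 follows in the same way:
data:
  bigquery-credential.json: |
    {
      "type": "service_account",
      "project_id": "<your_project_id>",
      ...
    }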
2. Add a new section of volumeMounts for the credential file in the coordinator’s deployment file:
kubectl edit deployment presto-coordinator -n <cluster_name>
Following is a sample configuration that you can append at the end of the volumeMounts section in your coordinator’s deployment file:
volumeMounts:
  - mountPath: /opt/presto-server/etc/bigquery-credential.json
    name: presto-coordinator-etc-vol
    subPath: bigquery-credential.json
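If your coordinator deployment does not already define the matching volume, it would also need an entry like the following under the volumes section; the configmap name is assumed to be the one edited in step 1 above:
volumes:
  - name: presto-coordinator-etc-vol
    configMap:
      name: presto-coordinator-etc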
3. Load the content of the credential file as bigquery-credential.json in the Presto worker’s configmap:
kubectl edit configmap presto-worker-etc -n <cluster_name> -o yaml
4. Add a new section of volumeMounts for the credential file in the worker’s deployment file:
kubectl edit deployment presto-worker -n <cluster_name>
Following is a sample configuration that you can append at the end of the volumeMounts section in your worker’s deployment file:
volumeMounts:
  - mountPath: /opt/presto-server/etc/bigquery-credential.json
    name: presto-worker-etc-vol
    subPath: bigquery-credential.json
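To verify that the credential file is mounted, you can exec into a coordinator or worker pod and check that the file exists (the pod name below is a placeholder):
kubectl exec -it <presto-coordinator-pod> -n <cluster_name> -- ls /opt/presto-server/etc/bigquery-credential.json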
Step 6: Set up a database connection with Apache Superset
Create your database connection URL to query from Superset using the syntax below:
presto://<username>:<password>@bq.rohan1.dev.app:443/<catalog_name>
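For example, with the bigquery catalog configured above, the URL might look like this (the username and password are placeholders for your Presto credentials):
presto://presto_user:presto_password@bq.rohan1.dev.app:443/bigquery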
Step 7: Check for available datasets, schemas, and tables
After successfully connecting to the database with Superset, run the following queries to make sure the bigquery catalog is picked up, then run show schemas and show tables to explore the available data.
show catalogs;
show schemas from bigquery;
show tables from bigquery.rohan88;
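To inspect the columns of a table before querying it, you can also run a describe statement against the same catalog and schema:
describe bigquery.rohan88.flights;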
Step 8: Run SQL query from Apache Superset to access BigQuery table
Once you access your database schema, you can run SQL queries against the tables as shown below.
select * from catalog.schema.table;
select * from bigquery.rohan88.flights LIMIT 1;
You can run similar queries from the Presto CLI as well. Here is another example of running SQL queries on a different BigQuery dataset from the Presto CLI:
$ ./presto-cli.jar --server https://<presto.cluster.url> --catalog bigquery --schema <schema_name> --user <presto_username> --password
The following example shows how you can join a Google BigQuery table with a Hive table in S3 and run SQL queries.
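As a minimal sketch, assuming a Hive catalog named hive with an airports table stored in S3 (the schema, table, and column names below are hypothetical), such a join might look like this:
select f.origin, a.airport_name, count(*) as total_flights
from bigquery.rohan88.flights f
join hive.aviation.airports a on f.origin = a.iata_code
group by f.origin, a.airport_name
order by total_flights desc;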
We hope you found this helpful!