How Data Engineers Can Use Python to Schedule BigQuery Queries
BigQuery’s documentation explains how to use Python to schedule queries from a service account, but it does not emphasize why this is an important, if often overlooked, step in automating and sustaining a data pipeline.
Service accounts are preferable to personal accounts because anyone on the team with the corresponding IAM role can use them. Even if someone leaves the organization, their work can still be accessed, edited and scheduled with ease.
Below, I’ll walk through scheduling queries with Python and handling common pitfalls I’ve run into in both Python and SQL.
Authenticate and Initialize Data Transfer
Before you proceed, make sure you’ve authenticated with the credentials associated with your GCP project. You’ll also need to enable the BigQuery Data Transfer API for that project.
Next, install the BigQuery data transfer library with a simple pip install.
pip install google-cloud-bigquery-datatransfer
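With the library installed, authenticating as a service account might look something like the sketch below. It assumes you have downloaded a JSON key file for the service account; the helper names resolve_key_path and make_transfer_client are my own for illustration, not part of the library. The Google imports sit inside the function so the path helper can be used on its own.

```python
import os

# GOOGLE_APPLICATION_CREDENTIALS is the environment variable Google's client
# libraries read by default to locate a key file.
KEY_ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def resolve_key_path(explicit_path=None):
    """Return the key file path, preferring an explicit argument over the environment."""
    path = explicit_path or os.environ.get(KEY_ENV_VAR)
    if not path:
        raise ValueError(f"No key file given and {KEY_ENV_VAR} is not set")
    return path


def make_transfer_client(key_path=None):
    """Build a Data Transfer client authenticated as the service account.

    Hypothetical helper: assumes a JSON key file exists at the resolved path.
    """
    # Imported lazily so resolve_key_path works without the Google libraries.
    from google.cloud import bigquery_datatransfer
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file(
        resolve_key_path(key_path)
    )
    return bigquery_datatransfer.DataTransferServiceClient(credentials=credentials)
```

Calling make_transfer_client("path/to/key.json") (a placeholder path) would then return a client that acts as the service account rather than as your personal user.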
After you’ve authenticated and installed the necessary packages, you can set your project parameters.
from google.cloud import bigquery_datatransfer
# Import logging to catch and report…