Integrating BigQuery and Cloud Functions using Remote functions to create cost-effective data pipelines

Alex Feldman
5 min read · Jul 24, 2023


In this article, I would like to look at the BigQuery Remote function feature. A Remote function lets us create a direct integration between BigQuery and Cloud Functions. This direct integration, in turn, allows us to build flexible data pipelines that combine SQL and Python (or other languages) steps in a single process.

Photo by Sebastian Pena Lambarri on Unsplash

Approach differences

Let's take a typical task: loading and processing data from a billing system such as Stripe or Zuora. We need to periodically load the raw data via API requests and then process it in BigQuery.

My old approach

Previously, I ran the Cloud function Python script with Cloud Scheduler and then processed the data in BigQuery with stored procedures, running them independently via the scheduled queries tool.

I changed this approach after I began using Dataform because, among other outstanding features, Dataform contains an excellent orchestration tool (workflows), which removes many of the inconveniences of running a sequence of queries in bare BigQuery.

My new approach

In this approach, the Cloud function is part of a single data pipeline, scheduled and run by a Dataform workflow, and the BigQuery Remote function allows us to call the Cloud function directly from a SQL query.
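To make this concrete, here is a minimal sketch of what such a pipeline step might look like in Dataform, assuming a SQLX file of type "operations" (named, say, billing_etl_call) that invokes the billing_etl Remote function we will create later in this article; the rolling 24-hour window is also an assumption:

config {
  type: "operations"
}

-- Call the Remote function (created later in this article) from plain SQL;
-- the Cloud function behind it loads the raw billing data into BigQuery.
SELECT `my_project.my_dataset.billing_etl`(
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY),
  CURRENT_TIMESTAMP()
) AS result

Any downstream SQLX model in the same repository can then list billing_etl_call in its dependencies, so Dataform runs the SQL transformations only after the Python extraction step finishes.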

Let's consider this step in more detail: executing the Cloud function script by invoking the Remote function from a BigQuery SQL query.

So, the Cloud function's task is to request the billing system API, extract the data, and upload it to BigQuery tables. We will not focus on creating the Cloud function itself, only on calling it and forming its response, but you can find a script example here.

To create and run the Remote function, we need to:

  1. Have a Cloud function with a Python script.
  2. Create the new external connection source (BigLake and Remote function type) in BigQuery.
  3. Grant permissions for the connection service account.
  4. Create the BigQuery Remote function.
  5. Create the SQL query that invokes the Remote function.

Let's delve into it further.

Create a connection and grant permissions

To create the new external connection, go to the BigQuery page, click Add data, and then click Connections to external data sources. Select BigLake and remote functions (Cloud Resource) as a connection type. In the Connection ID field, enter a name for your connection (for example, cloud_func_connection). Choose your data location (for example, us-east1). Click Create Connection.

If you prefer the Cloud Shell, use a command like this:

bq mk --connection --location=us-east1 --project_id=my_project \
--connection_type=CLOUD_RESOURCE cloud_func_connection

We will be able to use this connection not only to invoke this specific Cloud function but also for any future functions we may need to call.

After creating a connection, we need to open the connection info pane and copy the service account ID (for example, hdfk-123123412345-7ci3@gcp-sa-bigquery-condel.iam.gserviceaccount.com).

To grant permissions to the service account, go to the IAM & Admin page and click Grant Access. In the dialog that opens, enter the service account ID in the New principals field and select the Cloud Functions Invoker and Cloud Run Invoker roles. Click Save.

Create the BigQuery Remote function

To create the Remote function, we need to know the trigger URL, which we can find on the Trigger tab of the Cloud function (for example, https://us-east1-my_project.cloudfunctions.net/billing_etl_func).

To create the BigQuery Remote function, use the following query:

CREATE FUNCTION `my_project.my_dataset.billing_etl`(start_time TIMESTAMP, 
end_time TIMESTAMP)
RETURNS STRING
REMOTE WITH CONNECTION `my_project.us-east1.cloud_func_connection`
OPTIONS
(endpoint = 'https://us-east1-my_project.cloudfunctions.net/billing_etl_func')

Our Remote function has two TIMESTAMP arguments, start_time and end_time, which define the upload time range. It returns a string with the execution status, which the Cloud function sends back in a specific JSON format.

The Remote function is ready to work, but we need to make sure the Cloud function is also prepared to execute: it has to accept the arguments and send responses in the appropriate format.

Cloud function code features

Let's look into the Cloud function source code.

import functions_framework
import json

...
...

@functions_framework.http
def main(request):
    # Parsing arguments
    request_json = request.get_json()
    start_time = request_json['calls'][0][0]
    end_time = request_json['calls'][0][1]

    try:
        ...
        # Our function code
        ...
        # Create the successful response
        return json.dumps({"replies": ["Job is finished"]})

    except Exception as e:
        # Create the failed response
        return json.dumps({"errorMessage": str(e)})

The main function code has to handle two things. The first is how to read the input arguments. We send two values when calling the Remote function (start_time and end_time), but on the Cloud function side, the main function receives a JSON object that contains the encoded argument list in the calls field. The BigQuery request may look like this:

{
  "requestId": "124ab1c",
  "caller": "//bigquery.googleapis.com/projects/my_project/jobs/my_project:us-east1.bquxjob_5b4c112c_17961fafeaf",
  "sessionUser": "user@company.com",
  "userDefinedContext": {},
  "calls": [
    ["2023-07-09 14:15:00", "2023-07-09 18:15:00"]
  ]
}

We use the Flask get_json method, available on the request object that the Functions Framework library provides, to parse the argument values.

The second question is how to create the response object properly. The SQL query expects the function to respond with a specific JSON object; if the response format is not appropriate, the query produces an error. In our case, the response object may look like this for successful and failed results:

# Successful response
{
  "replies": ["Job is finished"]
}

# Failed response
{
  "errorMessage": "The function finished with status: 'error'"
}

Using the try-except statement and the json library, we form the response object for any execution outcome.
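In our pipeline, the Remote function is called once with literal arguments, so calls contains a single argument list and a single-element replies array is enough. Note that if the function were instead selected over a table, BigQuery could batch several rows into one HTTP request, and the Cloud function would be expected to return one element in replies per element of calls. A hypothetical example (the load_ranges table and its columns are assumptions, not part of this pipeline):

-- Hypothetical: one argument list per row; BigQuery may batch several rows
-- into a single HTTP request, so `calls` (and `replies`) can contain
-- several elements.
SELECT
  `my_project.my_dataset.billing_etl`(range_start, range_end) AS result
FROM
  `my_project.my_dataset.load_ranges`

If you need to guarantee one row per request in such a case, the Remote function can also be created with max_batching_rows = 1 in its OPTIONS clause.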

Suppose the Cloud function finishes with an error because of a timeout or unavailable source data. In that case, the Cloud function retries the execution several times (5 by default) and sends the failed response only after all retries.

Call the Remote function

So, everything is ready for running the SQL script with the Remote function:

SELECT `my_project.my_dataset.billing_etl`('2023-07-15 00:00:00', '2023-07-16 00:00:00') as result

The script calls the Remote function to upload the data for July 15, 2023.
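In a scheduled run, we would typically compute the time range instead of hard-coding it. A possible variant, assuming we want to load the previous full day (the window choice is just an illustration):

SELECT `my_project.my_dataset.billing_etl`(
  TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)),
  TIMESTAMP(CURRENT_DATE())
) AS result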

Conclusions

In this article, we explored using the Remote function in a BigQuery SQL query to call a Cloud function, as well as the benefits of a pipeline-building approach where a single Dataform pipeline covers both SQL and Python code.

There are some great services available in GCP that can assist with data pipeline creation and orchestration. However, many of these services can be quite costly. If you are looking for affordable options, a combination of services such as Dataform (on top of BigQuery), Cloud Functions, Pub/Sub, and Cloud Storage (within their performance limitations) can be a smart choice. Dataform takes a central place here as the leading platform for code generation, dependency management (including dependencies between other data pipelines), and orchestration. This approach allows us to use SQL as much as possible for data transformations and to insert steps that cannot be implemented without Python into a single chain using Remote functions.
