Serverless data ingestion from Google Sheets to BigQuery using Google Cloud Functions
Summary
This article offers a simpler alternative to a typical scalable data pipeline that extracts data from Google Sheets and loads it into BigQuery. No Docker image. No EC2. No ECS. No artifact registry. No workflow orchestration. Just Python code, a Cloud Function, and Cloud Scheduler.
Target Audience
- Data engineers who do not want to spend too much time setting up a data pipeline.
- Teams whose data owners prefer to maintain data sources only in spreadsheets.
- Google Cloud enthusiasts who want hands-on experience with Cloud Functions.
Outline
- Grant GCP service account access to Google Sheets (and download service account key)
- Create a Google Cloud Function to read data from Google Sheets and load dataframe to BigQuery
- Deploy Cloud Function and test it using Cloud Shell
- Set a Google Cloud Scheduler to trigger cloud function regularly
As data engineers, we have to deal with many different types of data sources, and spreadsheets are among the most common.
1. Grant GCP service account access to Google Sheets (and download the service account key)
If your Google Sheet is public, you can copy the URL and skip part 1. Otherwise, open your spreadsheet and click on the “Share” button.
Instead of sharing with a real person, we will share the sheet with a Google service account that has access to Cloud Storage, Cloud Functions, and Cloud Scheduler. To get the service account email, go to the service accounts page in the GCP console.
To save time, I will use the default compute service account. Copy the service account email from this page.
If you haven’t downloaded the key file for the chosen service account yet, don’t close the service accounts page. Click on the three dots below “Actions” and select “Manage keys”.
Next, click on “Add key” and then “Create new key”.
Last, in the pop-up window, select JSON as the key type. The key file will be downloaded automatically after you click on “Create”.
Please store this key file carefully.
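Before uploading the key anywhere, it can be worth checking that the file is what you expect. Here is a minimal sketch; the field names checked are the standard ones found in every GCP service-account key file, but the helper itself is illustrative and not part of any Google library:

```python
import json

# Fields present in every GCP service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email", "token_uri"}


def validate_key_info(key: dict) -> dict:
    """Check that a parsed key file contains the standard fields."""
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("not a service-account key")
    return key


# Typical usage after downloading the key:
# with open("my_service_account_credentials.json") as f:
#     key = validate_key_info(json.load(f))
```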
2. Create a Google Cloud Function to read data from Google Sheets and load a dataframe to BigQuery
Here comes the most exciting part of the article. Let’s open the Cloud Functions page and click on “Create function”.
Leave everything as default. Enter a function name and change the region if necessary. Take note of the URL generated after you enter the function name; you will need it later.
Next, scroll down to the “Runtime, build, connections and security settings” section. In my case, the default memory and CPU allocation is enough; if your data is large, adjust them accordingly. Also, if you are not using the default service account, choose your service account here.
Click “Next” to continue. You will see the default Cloud Function configuration in the picture below. Change the runtime to Python.
On the left panel, you will see the default source code main.py and the dependency file requirements.txt. To keep things simple, let’s upload the service account key file here for authentication. (Alternatively, you can store the key file in Secret Manager and retrieve it with the client library.) I will key in my_service_account_credentials.json as the service account key file name, which will later be used as a parameter in my functions.
In part 1 above, we downloaded a JSON key. Paste its content here.
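As an aside, if you prefer the Secret Manager route mentioned above instead of pasting the key into the source, the access pattern looks roughly like the sketch below. The secret name sa-key and project my-project are hypothetical, and the commented-out client calls assume the google-cloud-secret-manager package; only the resource-name helper is runnable as-is:

```python
def secret_version_path(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name that the Secret Manager API expects."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


# Inside the Cloud Function you would then do something like:
# import json
# from google.cloud import secretmanager
# client = secretmanager.SecretManagerServiceClient()
# response = client.access_secret_version(
#     request={"name": secret_version_path("my-project", "sa-key")}
# )
# credentials_json = json.loads(response.payload.data.decode("utf-8"))
```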
Next, paste the dependencies into the existing requirements.txt.
# ===========================================
# requirements.txt
# ===========================================
functions-framework==3.*
pandas
gspread
pandas_gbq
google-auth
Last, paste the code below to overwrite the existing main.py.
# ===========================================
# main.py
# ===========================================
import functions_framework
# ----------- insert code here --------------
import json

import gspread
import pandas as pd
import pandas_gbq
from google.oauth2 import service_account

KEY_PATH = "./my_service_account_credentials.json"
BIGQUERY_TABLE_ID = "dev.my_csv"  # change to your "<dataset>.<table>" id
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]
URL = "<paste-your-google-sheets-url-here>"


def gsheets2df(key_path: str, gsheets_url: str) -> pd.DataFrame:
    """Extract data from the first worksheet of a Google Sheet."""
    with open(key_path) as file:
        credentials_json = json.load(file)
    credentials = service_account.Credentials.from_service_account_info(
        credentials_json
    )
    creds_with_scope = credentials.with_scopes(SCOPES)
    gspread_client = gspread.authorize(creds_with_scope)
    spreadsheet = gspread_client.open_by_url(gsheets_url)
    worksheet = spreadsheet.get_worksheet(0)  # first worksheet
    records_data = worksheet.get_all_records()
    records_df = pd.DataFrame.from_dict(records_data)
    return records_df


def df2bq(
    df: pd.DataFrame,
    bigquery_table_id: str,
    # key_path: str,
    project_id_gc: str = "<paste-gcp-project-id-here>",
) -> None:
    """Save the dataframe to BigQuery."""
    pandas_gbq.to_gbq(
        dataframe=df,
        destination_table=bigquery_table_id,
        project_id=project_id_gc,
        progress_bar=False,
        if_exists="append",  # append rows if the table already exists
        chunksize=None,
        api_method="load_csv",
        location="US",  # change to your BigQuery region
        # credentials=service_account.Credentials.from_service_account_file(key_path),
    )
    print("Dataframe saved.")
# ----------- end ------------


@functions_framework.http
def hello_http(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <https://flask.palletsprojects.com/en/1.1.x/api/#incoming-request-data>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <https://flask.palletsprojects.com/en/1.1.x/api/#flask.make_response>.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    # ----------- insert code here -------------
    df = gsheets2df(key_path=KEY_PATH, gsheets_url=URL)
    df2bq(
        df=df,
        bigquery_table_id=BIGQUERY_TABLE_ID,
        # key_path=KEY_PATH,
    )
    # ------------- end -------------------

    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(name)
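Since the function touches live Sheets and BigQuery, it helps to exercise the request-parsing branch of hello_http locally first. Below is a sketch with a stand-in request object; FakeRequest and greeting are hypothetical names for illustration and mimic, but are not part of, the Functions Framework:

```python
class FakeRequest:
    """Minimal stand-in for flask.Request (illustrative only)."""

    def __init__(self, json_body=None, args=None):
        self._json = json_body
        self.args = args or {}

    def get_json(self, silent=True):
        return self._json


def greeting(request) -> str:
    """Mirror of the name-parsing logic at the end of hello_http."""
    request_json = request.get_json(silent=True)
    request_args = request.args
    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(name)


print(greeting(FakeRequest(json_body={"name": "BigQuery"})))  # Hello BigQuery!
print(greeting(FakeRequest()))                                # Hello World!
```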
Please remember to fill in your BigQuery dataset and table IDs (together, in the form "<dataset>.<table>") and your GCP project ID (a string). If you do not know your project ID, you can find it in the project selector at the top of the GCP console.
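Because the table ID packs the dataset and table names into a single string, a tiny helper can catch a malformed value before the load runs. This check is purely illustrative; pandas_gbq does not require it:

```python
def split_table_id(table_id: str) -> tuple:
    """Split a "<dataset>.<table>" id into its two parts; raise if malformed."""
    parts = table_id.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<dataset>.<table>', got {table_id!r}")
    return parts[0], parts[1]


print(split_table_id("dev.my_csv"))  # ('dev', 'my_csv')
```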
3. Deploy Cloud Function and test it using Cloud Shell
We are ready to deploy the function now. Just click on “deploy” at the bottom left corner.
After it’s deployed, you will see the page below. Click on “testing” and scroll down to look for the CLI command.
Copy the CLI test command then click on “run in cloud shell” above the command.
After you open the Cloud Shell editor, configure the project by entering the command gcloud config set project [PROJECT_ID]. You will see the project ID in yellow once it has been configured successfully. Now it’s time to paste the CLI test command here.
Press enter and wait for the HTTP response text.
It ran successfully. Let’s go to BigQuery and check out the newly loaded dataset.
We have managed to ingest data from Google Sheets to BigQuery. However, we may not want to trigger the function manually every time someone updates the Google Sheet. (If you are OK with triggering the Cloud Function manually, you can skip the next part.)
4. Set a Google Cloud Scheduler to trigger cloud function regularly
We can use Cloud Scheduler to send an HTTP request on a fixed schedule, so that the Cloud Function is triggered regularly.
Open Cloud Scheduler and click on “Create job”. Enter a name for the job and, most importantly, key in the cron expression that defines your schedule. I selected * * * * * to run it every minute.
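A cron expression has five space-separated fields: minute, hour, day of month, month, and day of week. A rough sanity check (deliberately not a full cron parser) can catch typos before you paste the expression into the scheduler:

```python
def looks_like_cron(expr: str) -> bool:
    """Very rough check: five non-empty, space-separated fields."""
    fields = expr.split()
    return len(fields) == 5 and all(fields)


print(looks_like_cron("* * * * *"))    # True  -> every minute
print(looks_like_cron("0 9 * * 1-5"))  # True  -> 09:00 on weekdays
print(looks_like_cron("* * * *"))      # False -> only four fields
```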
Click on “Continue” and then select HTTP as the target type.
Paste the Cloud Function URL (the one you noted when creating the function). Select POST as the HTTP method. Pay attention to the authentication section: select Add OIDC token from the dropdown menu, then select the right service account.
After we are done with the Cloud Scheduler configuration, we can wait until the Cloud Function has been triggered and then check the status of the last execution.
If you don’t need to run this job on a schedule, please read my article: No-code data ingestion from Google Sheets to BigQuery.
Acknowledgments: thank you Putyah for pointing out that authentication is not required for cloud function to write to BigQuery. 😊