Serverless data ingestion from Google Sheets to BigQuery using Google Cloud Functions
Summary
This article offers a simpler alternative to a typical scalable data pipeline that extracts data from Google Sheets and loads it into BigQuery. No Docker image. No EC2. No ECS. No artifact registry. No workflow orchestration. Just Python code, a Cloud Function, and Cloud Scheduler.
Target Audience
- Data engineers who do not want to spend too much time setting up a data pipeline.
- Teams whose data owners prefer to maintain data sources only in spreadsheets.
- Google Cloud enthusiasts who want hands-on experience with Cloud Functions.
Outline
- Grant GCP service account access to Google Sheets (and download service account key)
- Create a Google Cloud Function to read data from Google Sheets and load dataframe to BigQuery
- Deploy Cloud Function and test it using Cloud Shell
- Set a Google Cloud Scheduler to trigger cloud function regularly
As data engineers, we have to deal with many different types of data sources, and spreadsheets are among the most common.
1. Grant GCP service account access to Google Sheets (and download the service account key)
If your Google Sheet is public, you can copy the URL and skip part 1. Otherwise, open your spreadsheet and click on the “Share” button.
Instead of sharing with a real person, we will share the sheet with a Google service account that has access to Cloud Storage, Cloud Functions, and Cloud Scheduler. To get the service account email, go to the service accounts page in the GCP console.
To save time, I will use the default compute service account. Copy the service account email from this page.
If you haven’t downloaded the key file for the chosen service account yet, don’t close the service accounts page. Click on the three dots below “Actions” and select “Manage keys”.
Next, click on “Add key” and then “Create new key”.
Last, in the pop-up window, select JSON as the key type. The key file will be downloaded automatically after you click on “Create”.
Please store this key file carefully.
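Before uploading the key anywhere, it can be worth checking that the file is what you expect. Here is a minimal sketch; the field names checked are the standard ones found in every GCP service-account key file, but the helper itself is illustrative and not part of any Google library:

```python
import json

# Fields present in every GCP service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email", "token_uri"}


def validate_key_info(key: dict) -> dict:
    """Check that a parsed key file contains the standard fields."""
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError("not a service-account key")
    return key


# Typical usage after downloading the key:
# with open("my_service_account_credentials.json") as f:
#     key = validate_key_info(json.load(f))
```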
2. Create a Google Cloud Function to read data from Google Sheets and load a dataframe to BigQuery
Here comes the most exciting part of the article. Let’s open the Cloud Functions page and click on “Create function”.
Leave everything as default. Enter a function name and change the region if necessary. Take note of the URL generated after you enter the function name; you will need it later.
Next, scroll down to the “Runtime, build, connections and security settings” section. In my case, the default memory and CPU allocation is enough; if your data is large, adjust them accordingly. Also, if you are not using the default service account, choose your service account here.
Click “Next” to continue. You will see the default Cloud Function configuration in the picture below. Change the runtime to Python.
On the left panel, you will see the default source code main.py and the dependency file requirements.txt. To keep things simple, let’s upload the service account key file here for authentication. (Alternatively, you can store the key file in Secret Manager and retrieve it with the client library.) I will key in my_service_account_credentials.json as the service account key file name, which will later be used as a parameter in my functions.
In part 1 above, we downloaded a JSON key. Paste its content here.
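As an aside, if you prefer the Secret Manager route mentioned above instead of pasting the key into the source, the access pattern looks roughly like the sketch below. The secret name sa-key and project my-project are hypothetical, and the commented-out client calls assume the google-cloud-secret-manager package; only the resource-name helper is runnable as-is:

```python
def secret_version_path(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name that the Secret Manager API expects."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


# Inside the Cloud Function you would then do something like:
# import json
# from google.cloud import secretmanager
# client = secretmanager.SecretManagerServiceClient()
# response = client.access_secret_version(
#     request={"name": secret_version_path("my-project", "sa-key")}
# )
# credentials_json = json.loads(response.payload.data.decode("utf-8"))
```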
Next, paste the dependencies into the existing requirements.txt.
# ===========================================
# requirements.txt
# ===========================================
functions-framework==3.*
pandas
gspread
pandas_gbq
google-auth
Last, paste the code below to overwrite the existing main.py.
# ===========================================
# main.py
# ===========================================
import functions_framework
# ----------- insert code here --------------
import json

import gspread
import pandas as pd
import pandas_gbq
from google.oauth2 import service_account

KEY_PATH = "./my_service_account_credentials.json"
BIGQUERY_TABLE_ID = "dev.my_csv"  # change to your "<dataset>.<table>" id
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]
URL = "<paste-your-google-sheets-url-here>"


def gsheets2df(key_path: str, gsheets_url: str) -> pd.DataFrame:
    """Extract data from the first worksheet of a Google Sheet."""
    with open(key_path) as file:
        credentials_json = json.load(file)
    credentials = service_account.Credentials.from_service_account_info(
        credentials_json
    )
    creds_with_scope = credentials.with_scopes(SCOPES)
    gspread_client = gspread.authorize(creds_with_scope)
    spreadsheet = gspread_client.open_by_url(gsheets_url)
    worksheet = spreadsheet.get_worksheet(0)  # first worksheet
    records_data = worksheet.get_all_records()
    records_df = pd.DataFrame.from_dict(records_data)
    return records_df


def df2bq(
    df: pd.DataFrame,
    bigquery_table_id: str,
    # key_path: str,
    project_id_gc: str = "<paste-gcp-project-id-here>",
) -> None:
    """Save the dataframe to BigQuery."""
    pandas_gbq.to_gbq(
        dataframe=df,
        destination_table=bigquery_table_id,
        project_id=project_id_gc,
        progress_bar=False,
        if_exists="append",  # append rows if the table already exists
        chunksize=None,
        api_method="load_csv",
        location="US",  # change to your BigQuery region
        # credentials=service_account.Credentials.from_service_account_file(key_path),
    )
    print("Dataframe saved.")
# ----------- end ------------


@functions_framework.http
def hello_http(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <https://flask.palletsprojects.com/en/1.1.x/api/#incoming-request-data>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <https://flask.palletsprojects.com/en/1.1.x/api/#flask.make_response>.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    # ----------- insert code here -------------
    df = gsheets2df(key_path=KEY_PATH, gsheets_url=URL)
    df2bq(
        df=df,
        bigquery_table_id=BIGQUERY_TABLE_ID,
        # key_path=KEY_PATH,
    )
    # ------------- end -------------------

    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(name)
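Since the function touches live Sheets and BigQuery, it helps to exercise the request-parsing branch of hello_http locally first. Below is a sketch with a stand-in request object; FakeRequest and greeting are hypothetical names for illustration and mimic, but are not part of, the Functions Framework:

```python
class FakeRequest:
    """Minimal stand-in for flask.Request (illustrative only)."""

    def __init__(self, json_body=None, args=None):
        self._json = json_body
        self.args = args or {}

    def get_json(self, silent=True):
        return self._json


def greeting(request) -> str:
    """Mirror of the name-parsing logic at the end of hello_http."""
    request_json = request.get_json(silent=True)
    request_args = request.args
    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(name)


print(greeting(FakeRequest(json_body={"name": "BigQuery"})))  # Hello BigQuery!
print(greeting(FakeRequest()))                                # Hello World!
```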
Please remember to fill in your BigQuery dataset and table IDs (together, in the form "<dataset>.<table>") and your GCP project ID (a string). If you do not know your project ID, you can find it in the project selector at the top of the GCP console.
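Because the table ID packs the dataset and table names into a single string, a tiny helper can catch a malformed value before the load runs. This check is purely illustrative; pandas_gbq does not require it:

```python
def split_table_id(table_id: str) -> tuple:
    """Split a "<dataset>.<table>" id into its two parts; raise if malformed."""
    parts = table_id.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<dataset>.<table>', got {table_id!r}")
    return parts[0], parts[1]


print(split_table_id("dev.my_csv"))  # ('dev', 'my_csv')
```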
3. Deploy Cloud Function and test it using Cloud Shell
We are ready to deploy the function now. Just click on “deploy” at the bottom left corner.
After it’s deployed, you will see the page below. Click on “testing” and scroll down to look for the CLI command.
Copy the CLI test command then click on “run in cloud shell” above the command.
After you open the Cloud Shell editor, configure the project by entering the command gcloud config set project [PROJECT_ID]. You will see the project ID in yellow once it has been configured successfully. Now it’s time to paste the CLI test command here.
Press enter and wait for the HTTP response text.
It ran successfully. Let’s go to BigQuery and check out the newly loaded dataset.
We have managed to ingest data from Google Sheets to BigQuery. However, we may not want to trigger the function manually every time someone updates the Google Sheet. (If you are OK with triggering the Cloud Function manually, you can skip the next part.)
4. Set a Google Cloud Scheduler to trigger cloud function regularly
We can use Cloud Scheduler to send an HTTP request on a fixed schedule, so that the Cloud Function is triggered regularly.
Open Cloud Scheduler and click on “Create job”. Enter a name for the job and, most importantly, key in the cron expression that defines your schedule. I selected * * * * * to run it every minute.
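A cron expression has five space-separated fields: minute, hour, day of month, month, and day of week. A rough sanity check (deliberately not a full cron parser) can catch typos before you paste the expression into the scheduler:

```python
def looks_like_cron(expr: str) -> bool:
    """Very rough check: five non-empty, space-separated fields."""
    fields = expr.split()
    return len(fields) == 5 and all(fields)


print(looks_like_cron("* * * * *"))    # True  -> every minute
print(looks_like_cron("0 9 * * 1-5"))  # True  -> 09:00 on weekdays
print(looks_like_cron("* * * *"))      # False -> only four fields
```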
Click on “Continue” and then select HTTP as the target type.
Paste the Cloud Function URL (the one you noted when creating the function). Select POST as the HTTP method. Pay attention to the authentication section: select Add OIDC token from the dropdown menu, then select the right service account.
After we are done with the Cloud Scheduler configuration, we can wait until the Cloud Function has been triggered and then check the status of the last execution.
If you don’t need to run this job on a schedule, please read my article: No-code data ingestion from Google Sheets to BigQuery.
Acknowledgments: thank you Putyah for pointing out that authentication is not required for cloud function to write to BigQuery. 😊