Event-Driven Cloud Function Triggered Multiple Times & How to Address It

Jenn Wang
5 min read · May 3, 2024

In GCP (Google Cloud Platform), Cloud Functions can be event-driven: a function is invoked automatically in response to an event that occurs in your cloud environment. In a recent project, I noticed one of my cloud functions was triggered multiple times by events from a Google Cloud Storage (GCS) bucket. In this post, I will walk through why this cloud function got invoked more than once and how to address it.

A. The Issue at Hand:

An event-driven cloud function (e.g. one triggered by a file load to a GCS bucket) gets triggered multiple times when it is expected to be invoked only once.

B. Example Set-Up:

(B.0) Overall Flow:

  • Step 1: A Data Fusion pipeline loads to the “my_test_gcs_bucket_dev” GCS bucket
  • Step 2: Any new files loaded to this “my_test_gcs_bucket_dev” GCS bucket will trigger the “Upload_to_BQ_Cloud_Function” cloud function

(B.1) Step 1 — More Details

In my case, I have a Data Fusion pipeline that loads to a GCS bucket called “my_test_gcs_bucket_dev” at the end. Whenever this Data Fusion pipeline runs, it will load a file to this GCS bucket.

Note: I will spare you the details of this Data Fusion pipeline since it’s not the focus of this post. Any other tool that loads to a GCS bucket (other than Data Fusion, e.g. Dataflow, or an orchestrated Python script) can run into the same multi-invocation issue with your cloud function.

Figure 1: Data Fusion Pipeline that Loads to GCS

(B.2) Step 2 — More Details

Upon any file creation in the “my_test_gcs_bucket_dev” GCS bucket, a cloud function called “Upload_to_BQ_Cloud_Function” will be triggered.

Note: Again, I will skip the details about this cloud function. But at a high level, this “Upload_to_BQ_Cloud_Function” function downloads files from an on-prem system and uploads the files to BigQuery, a data warehouse on Google Cloud.

Figure 2: Config of the Cloud Function - GCS as Event Trigger
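For a concrete (though simplified) picture of what this function does after it is triggered, below is a minimal sketch of just the upload-to-BigQuery step, assuming the file to load is a CSV already sitting in GCS. The load_csv_to_bq function name and the project/dataset/table names in the usage comment are hypothetical placeholders, not the actual ones from my project (which also downloads files from an on-prem system first).

# Minimal sketch: load a CSV file from GCS into a BigQuery table.
# The function name and the project/dataset/table names are hypothetical placeholders.
from google.cloud import bigquery

def load_csv_to_bq(gcs_uri: str, table_id: str) -> None:
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,          # skip the header row
        autodetect=True,              # infer the schema from the file
        write_disposition="WRITE_APPEND",
    )
    # Kick off the load job and wait for it to finish.
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()
    print(f"Loaded {gcs_uri} into {table_id}")

# Example usage (hypothetical names):
# load_csv_to_bq("gs://my_test_gcs_bucket_dev/2024-04-11-19-20/part-00000",
#                "my_project.my_dataset.my_table")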

C. Issue Observed & Root Cause:

Although only one file seems to be written to the GCS bucket (see Figure 1, “In 1” in the last plug-in), the cloud function gets triggered multiple times.

To troubleshoot, I added print statements to the cloud function to log more info about the event itself. Note that the print("File: {}".format(event["name"])) line will be particularly helpful in the next step.

def download_file_upload_to_bq(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    print(f"Event ID: {context.event_id}")
    print(f"Event type: {context.event_type}")
    print("Bucket: {}".format(event["bucket"]))
    print("File: {}".format(event["name"]))
    print("Created: {}".format(event["timeCreated"]))
    print("Updated: {}".format(event["updated"]))

After getting the logs of the event, we can see that this cloud function is triggered many times at around 15:27, and each invocation is caused by a different directory or file in the GCS bucket (see below, highlighted in red).

E.g., the “2024-04-11-19-20/_temporary/0/_temporary/” directory in the GCS bucket triggered the cloud function at 15:27:34, and other files triggered this cloud function again almost simultaneously.

Figure 3: Cloud Function Logs

However, if we go to the GCS bucket now to check its contents, we can’t even find the “2024-04-11-19-20/_temporary” directory. This is because the _temporary folder is typically used by the data processing framework for temporary storage during the execution of tasks. Its contents are transient: they are created during the write and cleaned up once the job finishes, so they are no longer visible afterwards. The only visible files in this GCS bucket under the “2024-04-11-19-20” directory are below:

Figure 4: GCS Bucket Content
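If you want to verify this yourself, you can list what currently exists under that directory with the Cloud Storage client library. Below is a minimal sketch using the bucket and directory names from this example; by the time you run it, the transient _temporary objects have typically already been cleaned up by the framework.

# Minimal sketch: list the objects currently under a given prefix in the bucket.
# Transient _temporary objects usually won't show up here anymore, because the
# data processing framework deletes them once the write completes.
from google.cloud import storage

def list_objects(bucket_name: str, prefix: str) -> None:
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        print(blob.name, blob.time_created)

# Example usage with the bucket and directory from this post:
# list_objects("my_test_gcs_bucket_dev", "2024-04-11-19-20/")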

Thus, we have identified the root cause of the multiple-invocation issue: the transient files/folders written to the GCS bucket. In other words, although the Data Fusion pipeline shows that only one file was written to the GCS bucket, the write process created multiple transient (and later invisible) directories and files in the bucket, and each of those objects triggered the cloud function, invoking it more than once.

D. Solution:

Since we don’t want to run the cloud function’s logic for every transient file written to the GCS bucket, and we only want it to run once upon a successful write, we can leverage the “_SUCCESS” file that is visible in the final GCS bucket. This “_SUCCESS” file indicates the successful completion of the data processing job, so in our cloud function code we can wrap the main logic in an IF statement (see line 23 below). This way, the rest of the code only runs if the triggering object (in this case, the file created in GCS) is named “_SUCCESS”, which happens exactly once per pipeline run.

Figure 5: Cloud Function Source Code
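Since Figure 5 is a screenshot, here is a minimal sketch of the same idea in code: check the name of the triggering object before doing any real work. The run_main_logic helper is a hypothetical stand-in for the actual download/upload logic.

# Minimal sketch of the guard: only proceed when the triggering object is the
# _SUCCESS marker that the pipeline writes after a successful load.
def run_main_logic(event, context):
    """Hypothetical placeholder for the actual download-from-on-prem and upload-to-BigQuery logic."""
    pass

def download_file_upload_to_bq(event, context):
    """Triggered by a change to a Cloud Storage bucket."""
    object_name = event["name"]            # e.g. "2024-04-11-19-20/_SUCCESS"
    print(f"Event ID: {context.event_id}, object: {object_name}")

    # Ignore transient objects such as ".../_temporary/..." and part files;
    # only the _SUCCESS marker should kick off the real work.
    if object_name.split("/")[-1] == "_SUCCESS":
        run_main_logic(event, context)
    else:
        print(f"Skipping non-_SUCCESS object: {object_name}")

Note that the function still gets invoked for every object created in the bucket; the guard simply turns the extra invocations into cheap no-ops.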

E. Summary:

  • This event-driven (specifically, triggered by any file creation in the GCS bucket) cloud function was triggered multiple times due to transient files written to the GCS bucket.
  • To resolve this issue, we can wrap the code logic inside an IF statement and only run it when the event payload’s file name matches the desired file in the GCS bucket (here, “_SUCCESS”).
  • Printing out the event payload and metadata can be very helpful when troubleshooting cloud functions. By logging the event’s payload (e.g. the name of the newly created GCS object that triggered the function), we can both identify the root cause and come up with a solution to the multi-invocation issue.
