Google Workflows: A Potential Replacement for Simple ETL?

Matheus Tramontini · Published in Nagoya Foundation · 5 min read · Jun 20, 2023

Workflow orchestration is essential in big data and microservices environments as it aids in automating, coordinating, and managing complex and interdependent tasks. In this post, we will discuss the key features of Apache Airflow and Google Workflows, providing an example of a simple parallel workflow and explaining why Google Workflows became a cost-saving solution for me.

What is Airflow, and why is it so popular?

Apache Airflow is an open-source platform initially developed by Airbnb and later donated to the Apache Software Foundation. It lets you manage, schedule, and monitor complex ETL (Extract, Transform, Load) workflows, batch data processing, and other automations.

It is based on Directed Acyclic Graphs (DAGs): each node in the DAG is a task that defines an action to be executed, and the DAG definition script determines the order of and dependencies between those tasks.
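To make that concrete, here is a minimal sketch of what a DAG definition script can look like. It is my own illustrative example rather than anything from a real pipeline; the task names, schedule, and dates are arbitrary.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting...")


def transform():
    print("transforming...")


# The DAG object holds the schedule; each PythonOperator is one node in the graph.
with DAG(
    dag_id="simple_etl",               # arbitrary example name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator defines the edge: transform only runs after extract succeeds.
    extract_task >> transform_task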

Two things that have made Airflow so popular are its web interface, which makes it easy to manage pipelines, plugins, connections, and more, and its wealth of ready-made connectors.

And what about Google Workflows?

Similar to Airflow, Google Workflows is a service for workflow orchestration. However, it is a fully managed service offered on the Google Cloud Platform (GCP). While Airflow uses a Python script to generate DAGs, Workflows uses a YAML file to define the logic of the flow.

One of the major features of Workflows is its native integration with other GCP services, which makes it easy to build and manage pipelines that use internal tools such as BigQuery, Cloud Run, Cloud Functions, and more. Additionally, it offers multiple ways to trigger these pipelines, not just a cron schedule: Pub/Sub events, HTTP calls, or Cloud Scheduler itself can all start a workflow.
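As an illustration of the HTTP route, the sketch below starts an execution through the Workflow Executions REST API from Python. It is only a hedged example: the project, location, and workflow names are placeholders, and in practice you could point Cloud Scheduler or a Pub/Sub-triggered function at the same endpoint.

import json

import google.auth
import google.auth.transport.requests
import requests

PROJECT = "my-project"        # placeholder
LOCATION = "us-central1"      # placeholder
WORKFLOW = "my-workflow"      # placeholder

# Use the default service account credentials to obtain an access token.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

url = (
    "https://workflowexecutions.googleapis.com/v1/"
    f"projects/{PROJECT}/locations/{LOCATION}/workflows/{WORKFLOW}/executions"
)

# The execution argument must be a JSON-encoded string.
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json={"argument": json.dumps({"run_date": "2023-06-20"})},
)
response.raise_for_status()
print(response.json()["name"])  # resource name of the new execution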

Airflow vs. Google Workflows

Alright, now that we have an idea of what both services are, what are their major differences?

  1. The most obvious difference lies in who maintains each service. Airflow is backed by a vibrant open-source community, while Workflows is fully managed by Google (everyone has their own opinion on which model is preferable).
  2. Generally speaking, Airflow requires installation, configuration, and infrastructure management. On GCP, Cloud Composer exists to ease this process and reduce infrastructure-related headaches. Workflows, on the other hand, is fully managed: Google takes care of infrastructure and scalability, letting you focus solely on the pipeline logic.
  3. Airflow is designed with a direct focus on ETL, batch data processing, and automation, whereas Workflows is geared more toward automating and coordinating distributed services and microservices.
  4. Thanks to its open-source nature, Airflow offers more flexibility for creating custom plugins and extensions (see the sketch of a custom operator right after this list), whereas Workflows is largely limited to REST APIs and GCP's internal services.
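As a tiny illustration of point 4, this is the kind of extension Airflow lets you write. It is an illustrative sketch, not from the article; the class and field names are made up.

from airflow.models.baseoperator import BaseOperator


class PrintPayloadOperator(BaseOperator):
    """Toy custom operator that just logs a payload passed at DAG-definition time."""

    def __init__(self, payload: dict, **kwargs):
        super().__init__(**kwargs)
        self.payload = payload

    def execute(self, context):
        # Runs on the worker when the task is scheduled.
        self.log.info("Payload: %s", self.payload)
        return self.payload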

Now, the question arises: If Airflow is specifically focused on data, why would one use Workflows for ETL or data pipelines?

The answer to this question is the famous “it depends.” Each will have its specific use case and considerations regarding data size, budget, and other factors. I believe the greatest advantage of Workflows for data ETL lies in scenarios with a small amount of data where cost control is crucial. When using Composer, you will incur expenses for Cloud SQL (Postgres), Google Kubernetes Engine (for creating scheduler and worker pods), and a Composer service fee. In Workflows, billing is based on steps and the service you utilize (in my case, Cloud Functions). The first 5,000 internal steps are free, with an additional $0.01 per 1,000 steps. If you’re curious, you can delve deeper into Workflows pricing here.
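To put rough numbers on that, here is a back-of-the-envelope calculation using the step prices quoted above. The workload figures (runs per day, steps per run) are invented for illustration, and the result excludes whatever Cloud Functions itself costs.

# Cost sketch using the prices quoted above: first 5,000 internal steps
# per month free, then $0.01 per 1,000 steps. Workload numbers are made up.
RUNS_PER_DAY = 48             # hypothetical: one run every 30 minutes
STEPS_PER_RUN = 8             # hypothetical: steps executed per run
FREE_STEPS = 5_000
PRICE_PER_1000 = 0.01         # USD, internal steps

monthly_steps = RUNS_PER_DAY * STEPS_PER_RUN * 30
billable_steps = max(0, monthly_steps - FREE_STEPS)
cost = billable_steps / 1_000 * PRICE_PER_1000
print(f"{monthly_steps} steps/month -> ~${cost:.2f} for Workflows itself")
# 11,520 steps/month -> ~$0.07, before Cloud Functions and network costs.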

Hands-on: Building a Google Workflows pipeline with Cloud Functions

Since the focus is on demonstrating how to build the workflow, I won't create anything elaborate in Cloud Functions. Each function will simply print that it ran (functions 1, 2, and 3). The idea is to create a parallel flow in which functions 1 and 2 run simultaneously and, if both return the expected result (in this case, "Success"), function 3 is executed. The code itself matters little here: if your Cloud Function works, it will run just as smoothly from Workflows. Below is the skeleton code used in all three Cloud Functions, with only the number changing.

def main(_):
    print('Function 1 is running!')
    return 'Success'

And here is the YAML that defines the workflow itself:
main:
  params: [args]
  steps:
    - runningParallel:
        parallel:
          branches:
            - getFunction1:
                steps:
                  - callFunction1:
                      call: http.post
                      args:
                        url: https://cloudzone-projectid.cloudfunctions.net/function1
                        auth:
                          type: OIDC
                      result: resp
                  - checkFunction1Result:
                      call: assert_response
                      args:
                        expected_response: "Success"
                        got_response: ${resp.body}
            - getFunction2:
                steps:
                  - callFunction2:
                      call: http.post
                      args:
                        url: https://cloudzone-projectid.cloudfunctions.net/function2
                        auth:
                          type: OIDC
                      result: resp
                  - checkFunction2Result:
                      call: assert_response
                      args:
                        expected_response: "Success"
                        got_response: ${resp.body}
    - callFunction3:
        call: http.post
        args:
          url: https://cloudzone-projectid.cloudfunctions.net/function3
          auth:
            type: OIDC
          body:
            key: value
        result: resp
    - checkFunction3Result:
        call: assert_response
        args:
          expected_response: "Success"
          got_response: ${resp.body}

assert_response:
  params: [expected_response, got_response]
  steps:
    - compare:
        switch:
          - condition: ${expected_response == got_response}
            next: end
    - fail:
        raise: ${"Expected response is " + expected_response + ". Got " + got_response + " instead."}

The YAML above creates a flow where functions 1 and 2 run in parallel and function 3 only runs if both of them succeed.

To summarize the YAML:

  1. First, we define the parameters the flow can receive and declare the steps it will run under the "steps" key.
  2. We create a parallel flow with the "runningParallel" step and then define the branches it will have, in this case "getFunction1" and "getFunction2".
  3. Next, we define the steps inside each branch. Here there are two: calling our Cloud Function through a POST request and checking the result. If the result matches the expected value, the flow continues.
  4. After that, we leave the parallel block and call function 3. In this step I included an example of passing a body to the function (a sketch of how the function could read that body follows this list).
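For completeness, here is one way the receiving function could read that body, assuming the functions-framework entry-point style; the key name is just the placeholder from the YAML above.

import functions_framework


@functions_framework.http
def main(request):
    # Workflows sends the "body" mapping as the JSON payload of the POST.
    payload = request.get_json(silent=True) or {}
    print(f"Function 3 is running! Payload: {payload}")
    return 'Success'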

Note: In this case, I set the authentication type to OIDC. This may vary depending on your scenario, but here I prefer OIDC because it uses my service account, which is more secure from an administration standpoint.

Conclusion

I would have liked to showcase something more robust to demonstrate the power of Workflows, but unfortunately I couldn't obtain GCP credits for the demonstration. Nevertheless, I believe you now have a sense of how Workflows can be used to manage service flows. It's important to note that every case is unique: in my simple scenario, Cloud Functions were sufficient and I couldn't afford a high cost, so I didn't need to pay for a full Airflow setup.

Finally, I will provide three links. One is to explore more templates provided by Google, another is for the syntax reference, and the last one is to download the Workflows cheat sheet.
