Walmart Global Tech Blog

We’re powering the next great retail disruption. Learn more about us — https://www.linkedin.com/company/walmartglobaltech/

Google Serverless Spark, Part 1:
An Overview and Guide

4 min read · Apr 29, 2025


Photo Credit: Auben Networks via LinkedIn

Introduction

Google Cloud Platform (GCP) consistently pioneers cutting-edge solutions for data processing, one of the most notable being Google’s Serverless Spark. As an autoscaling serverless data processing service, it fully utilizes the power of Apache Spark, offering a seamless and efficient approach to managing heavy-duty data workloads.

The buzz around serverless computing has been growing in the tech industry. It’s a model where cloud service providers manage the servers, allowing developers to concentrate on coding. This blog post delves into the key characteristics and advantages of utilizing Google Serverless Spark.

  • Google Cloud Platform (GCP) Serverless Spark is a fully managed, serverless data processing service provided by Google. It offers hassle-free processing of big data workloads by leveraging the power of Apache Spark.
  • The serverless environment lets developers run Spark jobs without having to provision and manage infrastructure.
  • It automatically scales resources based on the workload, ensuring efficient resource utilization and Spark job execution.
  • It also eliminates manual tuning and management of Spark clusters, allowing developers to focus on data processing and analysis.

Differences Between Dataproc Cluster and Serverless Spark

Before getting into what to use when, let’s understand the key differences between an ephemeral Dataproc cluster (i.e., Cloud Dataproc on Google Compute Engine (DPGCE)) and Dataproc Serverless Spark (DPaaS, that is, data platform as a service).


— Server Management: In a Dataproc Cluster running on compute engines, users need to manage the provisioning, configuration, and scaling of servers. In Serverless Spark, all aspects are fully managed.

— Lead Time: The lead time for a Dataproc cluster is typically between 3 and 5 minutes. Serverless Spark takes approximately 90 seconds to start up, making it slightly faster.

— Flexibility: A Dataproc cluster offers flexibility in customizing cluster configurations, such as machine type and number of instances. Serverless Spark requires less configuration and management, but offers less control over the infrastructure.

— Cost: With Dataproc Cluster, costs are incurred based on cluster size and usage. On the other hand, Serverless Spark incurs costs based on Spark job execution time and resources consumed. Notably, there’s no cost when no jobs are running.

Use Case: When to Use Serverless vs Ephemeral Cluster

You made it this far, so here’s a quick reference guide for deciding when to use DPaaS (Serverless).

When fewer than four jobs need to run concurrently or sequentially, serverless is the better choice. The math is simple, as described below:


Across repeated runs, a serverless batch job takes approximately 90 seconds to start, so the cumulative lead time for four serverless jobs roughly equals the lead time for an ephemeral cluster. Every serverless job beyond that adds another 90 seconds, which can introduce inefficiencies if you’re running many jobs.
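That break-even math can be sketched in a few lines of bash. The startup figures below are the approximate values quoted above, not guarantees; actual latencies vary by region and workload.

```shell
# Approximate startup latencies discussed above (in seconds).
SERVERLESS_STARTUP=90   # per-batch startup for Dataproc Serverless
CLUSTER_STARTUP=300     # one-time ephemeral cluster provisioning (~3-5 min)

NUM_JOBS=4

# Serverless pays the startup cost once per job; an ephemeral cluster
# pays it once, up front, for every job that shares the cluster.
serverless_lead=$(( SERVERLESS_STARTUP * NUM_JOBS ))
cluster_lead=$(( CLUSTER_STARTUP ))

echo "serverless lead: ${serverless_lead}s vs cluster lead: ${cluster_lead}s"
```

At four jobs the two totals are in the same ballpark; each job beyond that tilts the balance toward the ephemeral cluster.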

An additional consideration when choosing DPaaS: it is suitable when a job doesn’t require a custom machine type, more than 2,000 executors, or cores per executor other than 4, 8, or 16. This makes it useful not only for smaller tasks but also for the majority of more complex tasks that balance cost-effectiveness and performance requirements.
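One way to sanity-check those constraints before choosing DPaaS is a small helper like the one below. The thresholds are the ones stated above; the function itself is just an illustrative sketch, not an official check.

```shell
# Illustrative helper: returns 0 if a job's shape fits the Dataproc
# Serverless limits described above, non-zero otherwise.
fits_serverless() {
  local executors=$1 cores=$2
  # More than 2000 executors is out of range for serverless.
  [ "$executors" -le 2000 ] || return 1
  # Cores per executor must be 4, 8, or 16.
  case "$cores" in
    4|8|16) return 0 ;;
    *) return 1 ;;
  esac
}

fits_serverless 100 8 && echo "use serverless" || echo "use a Dataproc cluster"
```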

How to Execute Jobs Using Google Serverless Spark

Google Serverless Spark can be invoked using a relatively simple batch command. Below is an example gcloud CLI submission, comparing a serverless Spark batch submission to a job submitted to a Dataproc cluster:

  1. Serverless: gcloud CLI batch submission
gcloud dataproc batches submit spark \
--batch=my-batch-001 \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

2. Dataproc: gcloud CLI job submission that will run on a Dataproc cluster ‘my-cluster’ that was spun up prior to running the job

gcloud dataproc jobs submit spark \ 
--cluster=my-cluster \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

Comparing the two submissions, the Dataproc job submission requires details about a cluster that must be spun up prior to running the jobs. The serverless batch command, on the other hand, only requires a unique batch name.

You may additionally choose to pass Spark properties, as illustrated below; tuning them is the subject of the follow-up guide to optimizing performance.

gcloud dataproc batches submit spark \
--batch=my-batch-001 \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
--properties=spark.dynamicAllocation.enabled=true,\
spark.dataproc.scaling.version=2,\
spark.dynamicAllocation.initialExecutors=2,\
spark.dynamicAllocation.minExecutors=2,\
spark.dynamicAllocation.maxExecutors=20,\
spark.executor.memoryOverhead=4g,\
spark.executor.cores=4,\
spark.executor.memory=16g,\
spark.driver.cores=4,\
spark.driver.memory=16g,\
spark.dataproc.executor.compute.tier=standard,\
spark.dataproc.driver.compute.tier=standard,\
spark.dataproc.executor.disk.size=250g,\
spark.dataproc.driver.disk.size=250g \
-- 1000
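To get a feel for what a configuration like the one above can scale to, you can multiply the allocation bounds out. The numbers below are taken from the example properties; the arithmetic is just a back-of-the-envelope sketch, ignoring the driver and any platform-side limits.

```shell
# Bounds taken from the --properties example above.
MAX_EXECUTORS=20
EXECUTOR_CORES=4
EXECUTOR_MEMORY_GB=16
EXECUTOR_OVERHEAD_GB=4

# Peak parallelism: task slots available when fully scaled out.
max_cores=$(( MAX_EXECUTORS * EXECUTOR_CORES ))

# Peak executor memory footprint (heap + overhead) across the batch.
max_mem_gb=$(( MAX_EXECUTORS * (EXECUTOR_MEMORY_GB + EXECUTOR_OVERHEAD_GB) ))

echo "up to ${max_cores} cores and ${max_mem_gb}g of executor memory"
```

Because dynamic allocation is enabled, the batch only reaches this footprint when the workload demands it; it idles back down toward minExecutors otherwise.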

Key Benefits of Serverless Spark

1. Scalability: Serverless Spark can handle increased loads, ensuring high-performance computation with large datasets.

2. Cost-Effectiveness: Pay only for what you use, with no need to manage and pay for idle servers.

3. Simplicity: Serverless Spark allows you to focus on writing and running applications, not on infrastructure.

4. Performance: The serverless architecture ensures fast execution and better performance.

Conclusion

For projects with four jobs or fewer, or where infrastructure control is less important, Serverless Spark is a useful tool for managing server costs effectively while allowing developers to focus on coding. My next post provides an in-depth guide to fine-tuning job performance in Serverless Spark.


Written by Shobhit Sabharwal

Data Engineer, Merchandising at Walmart Global Tech