Google Serverless Spark, Part 1:
An Overview and Guide
Introduction
Google Cloud Platform (GCP) consistently pioneers cutting-edge solutions for data processing, one of the most notable being Google’s Serverless Spark. As an autoscaling, serverless data processing service, it fully utilizes the power of Apache Spark, offering a seamless and efficient approach to managing heavy-duty data workloads.
The buzz around serverless computing has been growing in the tech industry. It’s a model where cloud service providers manage the servers, allowing developers to concentrate on coding. This blog post delves into the key characteristics and advantages of utilizing Google Serverless Spark.
- Google Cloud Platform (GCP) Serverless Spark is a fully managed and serverless data processing service provided by Google. It offers hassle-free processing of big data workloads by leveraging the power of Apache Spark.
- A serverless environment enables developers to run Spark jobs without the need to manage or provision infrastructure.
- It takes care of automatically scaling the resources based on the workload, ensuring efficient utilization of resources and execution of Spark jobs.
- It also eliminates the need for manual tuning and management of Spark clusters, allowing developers to focus on data processing and analysis; a minimal submission example follows this list.
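To make this concrete, here is a minimal sketch of a serverless submission: a single gcloud command runs a PySpark script with no cluster to create or tear down. The bucket, script, and project names are hypothetical placeholders.
gcloud dataproc batches submit pyspark gs://my-bucket/my_job.py \
--project=my-project \
--region=us-central1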
Differences Between Dataproc Cluster and Serverless Spark
Before getting into what to use when, let’s understand the key differences between an ephemeral Dataproc cluster (i.e., Cloud Dataproc on Google Compute Engine (DPGCE)) and Dataproc Serverless Spark (DPaaS, that is, data platform as a service).
- Server Management: In a Dataproc cluster running on Compute Engine, users must handle the provisioning, configuration, and scaling of servers themselves (see the sketch after this list). In Serverless Spark, all of these aspects are fully managed.
- Lead Time: The lead time for a Dataproc cluster is typically 3 to 5 minutes. Serverless Spark takes approximately 90 seconds to start up, making it faster for a single job.
- Flexibility: A Dataproc cluster offers flexibility in customizing cluster configurations, such as machine type and number of instances. Serverless Spark requires less configuration and management, but offers less control over the infrastructure.
- Cost: With a Dataproc cluster, costs are incurred based on cluster size and usage. Serverless Spark, on the other hand, incurs costs based on Spark job execution time and resources consumed. Notably, there’s no cost when no jobs are running.
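As a point of contrast, here is a sketch of the server management that Serverless Spark eliminates: with DPGCE, you create (and later delete) the cluster yourself, choosing machine types and worker counts. The cluster name and machine types below are illustrative assumptions, not recommendations.
gcloud dataproc clusters create my-cluster \
--project=my-project \
--region=us-central1 \
--master-machine-type=n2-standard-4 \
--worker-machine-type=n2-standard-4 \
--num-workers=2
gcloud dataproc clusters delete my-cluster \
--project=my-project \
--region=us-central1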
Use Case: When to Use Serverless vs Ephemeral Cluster
You made it this far, so here’s a quick reference guide to help you decide when to use Dataproc Serverless.
When fewer than four jobs need to run concurrently or sequentially, Serverless Spark is the better choice. The math is simple:
Each serverless batch job takes approximately 90 seconds to start, so four serverless jobs accumulate roughly 4 × 90 seconds = 6 minutes of startup time, comparable to the one-time 3-to-5-minute lead time of an ephemeral cluster that can then run all the jobs. Every job beyond that adds another 90 seconds on serverless, a lead time that can introduce inefficiencies if you’re running many jobs.
An additional consideration when choosing DPaaS: it is suitable when a job doesn’t require a custom machine type, an unusually large number of executors (more than 2,000), or a core count per executor other than 4, 8, or 16. This makes it useful not only for smaller tasks but also for a majority of more complex tasks that need to balance cost-effectiveness and performance requirements.
How to Execute Jobs Using Google Serverless Spark
Google Serverless Spark can be invoked with a relatively simple batch command. Below is an example of gcloud CLI job submission, comparing a serverless Spark batch submission with a job submitted to a Dataproc cluster:
1. Serverless: gcloud CLI batch submission
gcloud dataproc batches submit spark \
--batch=my-batch-001 \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
2. Dataproc: gcloud CLI job submission that runs on a Dataproc cluster ‘my-cluster’ that was spun up prior to running the job
gcloud dataproc jobs submit spark \
--cluster=my-cluster \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
If we compare the two submissions, the Dataproc job submission requires details about a cluster that must be spun up prior to running the job. The serverless batch command, on the other hand, only requires a unique batch name.
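Once a batch is submitted, you can track it without touching any infrastructure. Here is a minimal sketch using the batch name from the example above; list and describe are standard gcloud dataproc batches subcommands.
gcloud dataproc batches list --project=my-project --region=us-central1
gcloud dataproc batches describe my-batch-001 --project=my-project --region=us-central1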
You may additionally choose to set Spark properties, as illustrated below. This example enables dynamic allocation, keeps the executor count between 2 and 20, and sizes the driver and executors explicitly; my next post, a guide to optimizing performance and tuning, covers these properties in depth.
gcloud dataproc batches submit spark \
--batch=my-batch-001 \
--project=my-project \
--region=us-central1 \
--class=org.apache.spark.examples.SparkPi \
--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
--properties=spark.dynamicAllocation.enabled=true,\
spark.dataproc.scaling.version=2,\
spark.dynamicAllocation.initialExecutors=2,\
spark.dynamicAllocation.minExecutors=2,\
spark.dynamicAllocation.maxExecutors=20,\
spark.executor.memoryOverhead=4g,\
spark.executor.cores=4,\
spark.executor.memory=16g,\
spark.driver.cores=4,\
spark.driver.memory=16g,\
spark.dataproc.executor.compute.tier=standard,\
spark.dataproc.driver.compute.tier=standard,\
spark.dataproc.executor.disk.size=250g,\
spark.dataproc.driver.disk.size=250g \
-- 1000
Key Benefits of Serverless Spark
1. Scalability: Serverless Spark can handle increased loads, ensuring high-performance computation with large datasets.
2. Cost-Effectiveness: Pay only for what you use, with no need to manage and pay for idle servers.
3. Simplicity: Serverless Spark allows you to focus on writing and running applications, not on infrastructure.
4. Performance: Autoscaling matches resources to the workload, keeping execution fast without manual cluster tuning.
Conclusion
For projects with four or fewer jobs, or where infrastructure control is less important, Serverless Spark is a useful tool for managing server costs effectively while allowing developers to focus on coding. My next post provides an in-depth guide to fine-tuning your job performance in Serverless Spark.
References
- https://cloud.google.com/dataproc-serverless/docs/overview
- https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling
- https://cloud.google.com/dataproc-serverless/docs/concepts/properties
- https://cloud.google.com/blog/products/data-analytics/tune-spark-properties-to-optimize-dataproc-serverless-jobs

