Creating Serverless Spark Jobs with Google Cloud
Write scalable Spark applications and data pipelines without any manual infrastructure provisioning or tuning.
When it comes to data processing on Google Cloud, one size does not fit all. Engineers have a lot of flexibility: they can run their Spark applications serverless, on Kubernetes clusters, or on compute clusters.
I have always been a fan of operational simplicity. When Dataproc Serverless for Spark became generally available, I knew it was time to do some testing and verify the benefits. After all, configuring and maintaining clusters is an expensive and time-consuming task.
In this post, I will use a Google Cloud service named Dataproc to run a Spark workload without configuring any infrastructure. Specifically, I will use a feature called Dataproc Serverless for Spark to run a PySpark script. Additionally, we will build a custom container to explore how extra dependencies can be added.
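As a quick preview of what this looks like in practice, a PySpark batch can be submitted to Dataproc Serverless with a single `gcloud` command. This is a sketch only: the script name, bucket, region, and image path below are placeholders, not values from an actual project.

```shell
# Submit a PySpark script as a serverless batch workload.
# my_script.py, my-bucket, and the region are placeholder values.
gcloud dataproc batches submit pyspark my_script.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket

# With a custom container image (covered later in this post),
# the image is passed via --container-image:
gcloud dataproc batches submit pyspark my_script.py \
    --region=us-central1 \
    --container-image=us-central1-docker.pkg.dev/my-project/my-repo/my-image:1.0
```

Note that there is no cluster name anywhere in the command: Dataproc provisions and tears down the execution environment for you.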
Managed Apache Spark with Dataproc
As per the Google Cloud documentation, Dataproc is a managed Apache Spark and Apache Hadoop service that provides open-source data tools for batch processing, querying, streaming, and machine learning.