Creating Serverless Spark Jobs with Google Cloud

Ramon Marrero
Published in Geek Culture
6 min read · Oct 31, 2022

Write scalable Spark applications and data pipelines without any manual infrastructure provisioning or tuning.

Business illustrations by Storyset

When it comes to data processing on Google Cloud, one size does not fit all. Engineers today have plenty of flexibility: they can run their Spark applications serverless, on Kubernetes clusters, or on Compute Engine clusters.

I have always been a fan of operational simplicity. When Dataproc Serverless for Spark became generally available, I knew it was time to do some testing and verify the benefits. After all, configuring and maintaining clusters is an expensive and time-consuming task.

In this post, I will use a Google Cloud service named Dataproc to run a Spark workload without configuring any infrastructure. Specifically, I will use a feature called Dataproc Serverless for Spark to run a PySpark script. Additionally, we will create a custom container to explore the possibility of adding extra dependencies.
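As a rough sketch of what this looks like in practice, the `gcloud dataproc batches submit pyspark` command submits a PySpark script as a serverless batch workload. The script path, bucket, region, and batch name below are placeholders, not values from this post; the `--container-image` flag (used later for custom dependencies) is optional and shown commented out.

```shell
# Submit a PySpark script as a Dataproc Serverless batch workload.
# Replace the placeholder project, region, bucket, and script values
# with your own before running.
gcloud dataproc batches submit pyspark gs://my-bucket/my_spark_job.py \
    --project=my-project-id \
    --region=us-central1 \
    --batch=my-first-serverless-batch \
    --deps-bucket=gs://my-bucket \
    # --container-image=gcr.io/my-project-id/my-spark-image:latest  # optional custom container
```

No cluster is created or sized by hand; Dataproc Serverless provisions and autoscales the Spark infrastructure for the lifetime of the batch.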

Managed Apache Spark with Dataproc

According to the Google Cloud documentation, Dataproc is a managed Apache Spark and Apache Hadoop service that provides open-source data tools for batch processing, querying, streaming, and machine learning.
