Creating Serverless Spark Jobs with Google Cloud
Write scalable Spark applications and data pipelines without any manual infrastructure provisioning or tuning.
When it comes to data processing on Google Cloud, one size does not fit all. Engineers have a lot of flexibility: they can run their Spark applications serverless, on Kubernetes clusters, or on compute clusters.
I have always been a fan of operational simplicity. When Dataproc Serverless for Spark became generally available, I knew it was time to do some testing and verify the benefits. After all, configuring and maintaining clusters is an expensive and time-consuming task.
In this post, I will use a Google Cloud service named Dataproc to run a Spark workload without configuring any infrastructure. Specifically, I will use a feature called Dataproc Serverless for Spark to run a PySpark script. Additionally, we will build a custom container to explore how extra dependencies can be added.
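As a quick preview of what this looks like in practice, a PySpark batch can be submitted to Dataproc Serverless with a single `gcloud` command. This is a sketch only: the script name, bucket, region, and image path below are placeholders, not values from an actual project.

```shell
# Submit a PySpark script as a serverless batch workload.
# my_script.py, my-bucket, and the region are placeholder values.
gcloud dataproc batches submit pyspark my_script.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket

# With a custom container image (covered later in this post),
# the image is passed via --container-image:
gcloud dataproc batches submit pyspark my_script.py \
    --region=us-central1 \
    --container-image=us-central1-docker.pkg.dev/my-project/my-repo/my-image:1.0
```

Note that there is no cluster name anywhere in the command: Dataproc provisions and tears down the execution environment for you.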
Managed Apache Spark with Dataproc
As per the Google Cloud documentation, Dataproc is a managed Apache Spark and Apache Hadoop service that provides open-source data tools for batch processing, querying, streaming, and machine learning.