Dataproc Serverless for Spark on GCP

Ash Broadley
Appsbroker CTS Google Cloud Tech Blog
5 min read · Apr 5, 2022

There’s a new kid on the block, and they’re here to chew bubble gum and change the way you use the cloud for Big Data & Analytics. And they’re all out of bubble gum.

Firstly, a writer's note: I'm an avid fan of containerisation and somewhat "serverless" technologies in general (see post history). The day Serverless Spark was made GA in Dataproc was the day that legacy Big Data finally got its break and caught up with the modern tech stack. I've been pretty excited about Spark's support for running on Kubernetes since it was first introduced, as Big Data + containers is where two of my favourite worlds collide.

What is Dataproc?

To get a better understanding of the power behind Serverless Spark, we need to understand the capabilities of Dataproc more generally.

In a Google Cloud, not so long ago…

Dataproc is Google Cloud's answer to Hadoop in the cloud, and it enables organisations to move their analytics workloads into the cloud with relatively minimal effort. It works in much the same way as other GCP services: it separates storage from compute. This is immensely powerful in the analytics space, as it allows users to persist data while creating compute on demand in a pay-as-you-go fashion (did someone say BigQuery?). That means you don't have to continuously run and maintain humongous clusters, significantly reducing the operational burden.

Google Cloud introduced a couple of different ways in which you could orchestrate your clusters and run jobs, such as Workflow Templates and the Dataproc Operators for Cloud Composer (GCP’s managed Airflow) — or a combination of the two. These capabilities enabled users to start up and tear down clusters as part of their data pipelines.
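
For a flavour of the workflow-template route, here's a minimal sketch using gcloud. The template name, bucket, region and PySpark script below are hypothetical placeholders, so treat this as an illustration rather than a copy-paste recipe.

# Create a workflow template with an ephemeral, managed cluster
gcloud dataproc workflow-templates create my-etl-template --region=europe-west2

# Attach a managed cluster that is created on instantiation and torn down afterwards
gcloud dataproc workflow-templates set-managed-cluster my-etl-template \
  --region=europe-west2 \
  --cluster-name=ephemeral-etl-cluster \
  --num-workers=2 \
  --worker-machine-type=n1-standard-4

# Add a PySpark step to the template (the script path is a placeholder)
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/jobs/etl.py \
  --region=europe-west2 \
  --workflow-template=my-etl-template \
  --step-id=run-etl

# Run the whole workflow: cluster up, job runs, cluster torn down
gcloud dataproc workflow-templates instantiate my-etl-template --region=europe-west2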

Great! We don’t need to create clusters up-front or persist them! Plus, we can automate things!

However, there is one major drawback — and this, my friends, is the need to actually configure a cluster. When I say configure, I really mean right-size it. You have to understand the workload you're running and how scaling, whether vertically or horizontally, is going to affect it. This configuration needs to be hardcoded somewhere, either in your workflow template or in your Composer DAG. It can take a lot of trial and error to ensure you've got the right number of workers and executors, and the right amount of CPU and RAM, to run the workloads you want within the desired timeframe.
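
To make that concrete, here's roughly what the hardcoded sizing looks like when creating a cluster directly with gcloud. The names and numbers are purely illustrative, and in reality they end up baked into your workflow template or Composer DAG.

# Every one of these sizing decisions is on you; getting them wrong
# usually means re-running and re-tuning the pipeline
gcloud dataproc clusters create my-right-sized-cluster \
  --region=europe-west2 \
  --master-machine-type=n1-standard-4 \
  --num-workers=4 \
  --worker-machine-type=n1-highmem-8 \
  --worker-boot-disk-size=500GB \
  --properties=spark:spark.executor.cores=4,spark:spark.executor.memory=12g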

On a positive note though, you can also scope a cluster per workload, so if a task doesn't need all that power, you can create a smaller, more appropriately sized cluster for it. Even so — wouldn't it be great if there was a way this could be done for you, maybe even automagically…? 😱

[a shadowy figure bursts through the door and yells… ]

Serverless Spark to the rescue!

Serverless Spark became Generally Available in early 2022 and removed what was (in my view) Dataproc's single biggest pain point — defining your cluster size up front.

Serverless Spark is, well, exactly that — serverless. But what does that mean?

Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers
Source: Wikipedia

Compare that definition with the earlier pain of defining your compute requirements up front: in practice, it means that when you create a Spark job, you simply tell Dataproc to run it, and the service takes care of the rest.

Why does this matter?

This matters because it lets your Data Engineers really focus on the Data Engineering. There's no need to "performance test" your pipelines and continuously tweak your cluster to ensure it's as beefy as your workloads require.

There’s no need for an operations team (sorry Ops-led folk!) to manage the infrastructure that your workloads are running on. Everything from upgrades to monitoring is done for you.

You can simply run a gcloud dataproc batches submit ... command and off you go, instead of having to create a cluster to use for development. A development cluster per engineer can be costly, and a shared cluster can cause issues around resource contention.
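
As a rough illustration (the bucket, script, region and job arguments below are placeholders), submitting a serverless batch looks something like this, with no cluster definition in sight:

# Submit a PySpark job as a serverless batch; Dataproc provisions and
# scales the underlying infrastructure for you
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
  --region=europe-west2 \
  --deps-bucket=gs://my-bucket \
  -- --input=gs://my-bucket/raw/ --output=gs://my-bucket/curated/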

If you’d like to give Serverless Spark a try, my colleague Keven wrote an article on how you can get started.

Hang on. What about Dataflow?

Dataflow is Google Cloud's native serverless analytics processing platform for running Apache Beam pipelines. Dataflow has been serverless from the start: you write a job and submit it, leaving the orchestration and scaling of infrastructure to Google.
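
For comparison, launching a Beam pipeline on Dataflow is a similarly hands-off affair. A minimal sketch, assuming a hypothetical Python pipeline file, project and bucket:

# Run an Apache Beam pipeline on the Dataflow runner; Google provisions
# and autoscales the workers
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=europe-west2 \
  --temp_location=gs://my-bucket/temp/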

The launch of Serverless Spark closes the gap between Dataproc & Dataflow from an operations perspective, making the choice between the two even more difficult.

When should I choose Dataproc?

If your organisation is already bought into the Hadoop ecosystem, then it makes a lot of sense to migrate your workloads into Google Cloud using Dataproc. This lets your organisation move much more quickly than rewriting a bunch of its pipelines, so you gain value from the cloud fairly early on.

When should I choose Dataflow?

If you’re starting your analytics journey with GCP as the first step, then Dataflow is probably something you should consider, if you can’t do what you need to directly in BigQuery. This allows your organisation to adopt cloud native technologies from the get-go.

But what about vendor lock-in, I hear you cry!

Regardless of which option you choose, you're locked into something. There is always going to be some form of investment in a particular technology, so why fret? If you do decide to adopt Dataflow and then move away from GCP in the future, migrating your warehouse in its entirety is so much effort that moving your processing workloads elsewhere is most likely a small part of it. Besides, the Apache Beam code you write for Dataflow isn't tied to GCP: Beam pipelines can run on other runners, such as Spark and Flink, which can themselves run on Kubernetes.

Ash Broadley — Cloud Data Architect

If you want to transform your organisation’s data warehouse, reduce your TCO and gain value from your data much more quickly, then feel free to get in touch!

About CTS:

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
