
How to run a Java 11 Spark Job on Google Cloud Dataproc

D Baah · Unearth · Apr 9, 2020

Managing Java & Spark dependencies can be tough. We recently migrated one of our open source projects to Java 11 — a large feat that came with some roadblocks and headaches. Luckily, installing OpenJDK 11 was the easy part. In this tutorial, we’ll show you how to set up your Google Cloud Platform Dataproc Spark jobs to run software compiled in Java 11.

Introducing GCP Dataproc

Dataproc is a fairly new addition to the Google Cloud Platform (GCP). It offers everything you need to execute and manage large Apache Spark and Hadoop data processing jobs in the cloud. This includes services like on-demand ephemeral clusters, autoscaling, rapid cluster creation, out-of-the-box GCP service integration (Cloud SQL, Cloud Storage, etc.), and much more.

In this article, we’ll show you how to do the following:

  1. Create a Custom Image
  2. Create a Cluster from your custom image
  3. Run a Java 11 Spark Job on your cluster

Before we begin, let’s first review some GCP terminology:

  • Cluster — A combination of master and worker machines used to distribute data processing.
  • Image — A predefined virtual machine configuration used as a template to launch new instances.
  • Instance — A virtual machine hosted on Google’s infrastructure.
  • Cluster Properties — Configuration properties for the open source tools (Hive, Spark, YARN, Presto, etc.) installed on each cluster instance.

1. Creating a Custom Image

Although GCP Dataproc offers managed public images for both Debian and Ubuntu operating systems, none of them come pre-configured with Java 11. We’ll use the GCP Compute Engine custom image feature to spin up an image with OpenJDK 11 installed.

Before we get started, make sure you have the Google Cloud SDK installed and configured on your machine. For installation instructions, please visit: https://cloud.google.com/sdk/docs/quickstarts.
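
If you’re starting from a fresh install, the basic setup is just authenticating and pointing the SDK at your project. This is a minimal sketch; the project ID below is a placeholder, and the zone matches the one used later in this tutorial.

# Authenticate and set a default project (my-gcp-project is a placeholder)
gcloud auth login
gcloud config set project my-gcp-project
# Optional: set a default compute zone so later commands don't prompt for one
gcloud config set compute/zone us-west1-a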

First, let’s clone this very useful custom-images repo. Here, we’ll execute a Python script that automates the custom image generation workflow.

git clone https://github.com/GoogleCloudDataproc/custom-images

Create the following shell script and name it java-11.sh:

#!/usr/bin/env bash

echo "Installing custom packages..."
sudo apt-get -y update
sudo apt install -y openjdk-11-jdk
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export PATH=$PATH:$JAVA_HOME/bin
echo $JAVA_HOME
java -version
echo "Successfully installed java 11 packages."

Execute the generate_custom_image.py script with the following arguments:

python3 generate_custom_image.py \
--image-name "custom-ubuntu-java11" \
--dataproc-version "1.4.24-ubuntu18" \
--customization-script java-11.sh \
--zone "us-west1-a" \
--gcs-bucket "gs://my-gcp-storage-bucket" \
--shutdown-instance-timer-sec 500

Coffee break ☕️ — this will take a few minutes. Once complete, quickly verify that your custom image exists.

gcloud compute images list --filter="name=('custom-ubuntu-java11')"

Awesome! We now have a custom image we can use to create a Dataproc cluster.

2. Creating a Cluster

Let’s create a cluster. GCP does an awesome job of providing the developer community with an abundance of tools for interacting with its various APIs. Dataproc has client libraries available in various programming languages (Node.js, Python, Java, Go, and more), as well as the gcloud CLI. Since we haven’t fully automated this workflow, we’ll use the GCP console UI here (an equivalent gcloud command is sketched at the end of this section).

Go to the Create Cluster page in your GCP console.

After configuring resources for your master and cluster nodes, scroll towards the bottom of the page and select “Advanced options”.

  1. Set your cluster image to “custom-ubuntu-java11”. When picking your image, select the custom image tab.

2. Add the following cluster properties. Again, these are custom properties used to override the default Spark config.

Cluster property example

spark-env:JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
spark:spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

This sets the default JAVA_HOME environment variable for the Spark driver and executors.
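
For reference, here is roughly what the same cluster configuration looks like with the gcloud CLI. The cluster name, region, machine types, and worker count below are placeholders; adjust them to fit your workload, and point --image at the custom image created earlier.

gcloud dataproc clusters create java11-cluster \
  --region us-west1 \
  --zone us-west1-a \
  --image custom-ubuntu-java11 \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2 \
  --properties "spark-env:JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64,spark:spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"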

At this point, we’re good to go — select “Create” and watch your cluster provision.
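
Once the cluster is up, you can optionally SSH into the master node to confirm the Java version Spark will see. Dataproc names the master instance <cluster-name>-m; the cluster name and zone below assume the sketch above.

gcloud compute ssh java11-cluster-m --zone us-west1-a --command "java -version"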

3. Submitting a Spark Job

Now that we have everything set up, this last part should be quick. In the Dataproc console, open the Job Submission page and select the cluster we created above.

First, make sure you’ve set the Job Type to Spark. Next, add your main class, program arguments, and the path to your jar file (in this example, I uploaded a shaded jar to a Cloud Storage bucket).
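
If you’d rather submit from the command line, the gcloud equivalent looks roughly like this; the main class, jar path, and program arguments are placeholders for your own.

gcloud dataproc jobs submit spark \
  --cluster java11-cluster \
  --region us-west1 \
  --class com.example.MySparkApp \
  --jars gs://my-gcp-storage-bucket/my-spark-app-shaded.jar \
  -- arg1 arg2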

You now have a Spark job running Java 11 software! If you run into a ClassNotFoundException or NoSuchMethodError, there may be a dependency conflict between your jar and the software installed on the cluster; this can usually be resolved by building a shaded jar.
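
If you do hit one of those errors, a quick sanity check is to list your jar’s contents and confirm the class in question was actually packaged; the jar and class names below are placeholders.

# Copy the jar down from Cloud Storage and search its contents for the missing class
gsutil cp gs://my-gcp-storage-bucket/my-spark-app-shaded.jar .
jar tf my-spark-app-shaded.jar | grep MyMissingClass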

Now that we have the process down for creating clusters with the right software, we’re all set up for automation with workflow orchestration tools like Apache Airflow, Metaflow, Luigi, and many more. Enjoy!

Leave a comment below if you have additional questions!
