Run your Spark and Hadoop jobs as a Service with Dataproc Workflow Templates

David Verdejo
Published in bluekiri · 4 min read · Sep 2, 2018

Let’s start with a brief introduction to Dataproc. Dataproc is a fully managed cloud service for running Apache Spark and Apache Hadoop clusters on Google Cloud Platform. It lets us create ephemeral clusters per workload instead of provisioning and maintaining a single long-lived cluster for all our jobs.

This way, clusters are easier to maintain (you only install what each job needs) and also cheaper (you only pay for the time the cluster is running). The main requirement is to move data out of the cluster’s HDFS into Google Cloud Storage (GCS), so that the cluster can be deleted once the job finishes (and so that preemptible workers can be used safely).

To run Dataproc jobs this way, you have to follow these steps (a sketch of the equivalent gcloud commands follows the list):

  1. Create the Dataproc cluster.
  2. Submit the job once the cluster is provisioned.
  3. Monitor the job until it’s finished.
  4. Finally, delete the cluster.
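For reference, the manual flow looks roughly like this with the gcloud CLI (the cluster name, bucket and sizing below are just placeholders for this sketch, not values from the demo):

# 1. Create the cluster (name and sizing are illustrative)
gcloud dataproc clusters create demo-cluster \
--zone=europe-west1-b \
--num-workers=2

# 2. Submit the job; the command streams the driver output until the job finishes
gcloud dataproc jobs submit pyspark gs://demo-bucket/job.py \
--cluster=demo-cluster

# 3. Delete the cluster so we stop paying for it
gcloud dataproc clusters delete demo-cluster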

The good news is that Google has announced Workflow Templates, whose main goal is to provide a mechanism for managing and executing these jobs as a single workflow. Note that it is a beta release, so you might want to avoid using this feature in production environments:

This feature might be changed in backward-incompatible ways and is not subject to any SLA or deprecation policy. This feature is not intended for real-time usage in critical applications.

In the next few paragraphs, we’re going to walk you through a quick demo (source code on GitHub). We are going to create a workflow template, add one or more jobs to the template, and finally instantiate the template. Only when the template is instantiated is the cluster created, the jobs executed, and then the cluster deleted.

The first step is to create a PySpark job (quijote_sorted.py) that sorts a text file.

import pyspark

# Create the Spark context provided by the Dataproc runtime
sc = pyspark.SparkContext()

# Read the input file straight from GCS (one word per line)
# and print the words in sorted order
rdd = sc.textFile('gs://bk-dataproc-template/quijote.txt')
print(sorted(rdd.collect()))

Our text file contains the first words from “Don Quixote” (in Spanish)

En
un
lugar
de
la
Mancha
de
cuyo
nombre
no
quiero
acordarme

The next step is to upload both files (the script and the text file) to GCS.
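For example, with gsutil (assuming the bucket gs://bk-dataproc-template used in the job already exists):

gsutil cp quijote_sorted.py gs://bk-dataproc-template/
gsutil cp quijote.txt gs://bk-dataproc-template/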

Then, we set up and run the workflow template. This is a four-step process:

1) Create the workflow template

gcloud beta dataproc workflow-templates create <TEMPLATE_ID>

gcloud beta dataproc workflow-templates create quijote_dtp_template
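As a quick sanity check (not a required step), the template should now show up in the list and describe subcommands:

gcloud beta dataproc workflow-templates list
gcloud beta dataproc workflow-templates describe quijote_dtp_template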

2) Attach a managed cluster to the template (we could also point the template at an existing cluster, as shown after the command, but we’ll make things more interesting and create a new one)

gcloud beta dataproc workflow-templates set-managed-cluster <TEMPLATE_ID> \
--zone=<ZONE> \
--cluster-name=<CLUSTER_NAME> \
...cluster args...

gcloud beta dataproc workflow-templates set-managed-cluster quijote_dtp_template \
--zone=europe-west1-b \
--master-machine-type n1-standard-1 \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--cluster-name=bk-cluster-1
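If we wanted to reuse an existing cluster instead, the template can select it by label with the set-cluster-selector subcommand; the label below is only illustrative:

gcloud beta dataproc workflow-templates set-cluster-selector quijote_dtp_template \
--cluster-labels=env=demo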

3) Add jobs to the workflow

gcloud beta dataproc workflow-templates add-job <JOB_TYPE> \
--step-id <STEP_NAME> \
--workflow-template <TEMPLATE_ID> \
...job args...
gcloud beta dataproc workflow-templates add-job pyspark gs://bk-dataproc-template/quijote_sorted.py --step-id=quijo --workflow-template=quijote_dtp_template
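A template can hold more than one job. If we added a second step, it could be chained after the first one with --start-after; the second script and step name here are hypothetical:

gcloud beta dataproc workflow-templates add-job pyspark gs://bk-dataproc-template/another_step.py \
--step-id=another-step \
--start-after=quijo \
--workflow-template=quijote_dtp_template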

4) Finally, run the workflow (instantiate). Take into account that you are going to be charged for the resources (see pricing)

gcloud beta dataproc workflow-templates instantiate <TEMPLATE_ID>

gcloud beta dataproc workflow-templates instantiate quijote_dtp_template
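The command waits and reports progress as the workflow runs. If you prefer not to block, it can be launched with --async and the returned operation followed with the operations subcommands (the operation ID below is a placeholder):

gcloud beta dataproc workflow-templates instantiate quijote_dtp_template --async
gcloud dataproc operations list
gcloud dataproc operations describe <OPERATION_ID>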

Here are some screenshots to show the process. We start our workflow and the cluster provisioning begins:

When the cluster is up and running, the job is submitted:

The job finishes and the cluster is deleted (we stop paying for these resources):

Finally, we can see the job status and the generated logs.
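The same information is available from the CLI; the job ID below is a placeholder you can copy from the console or from the list output:

gcloud dataproc jobs list
gcloud dataproc jobs describe <JOB_ID>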

To wrap this up, with Workflow Templates we can predictably create the infrastructure to run our Spark and Hadoop jobs on demand and at low cost.

“I love it when a plan comes together”… with The Bluekiri team
