Persisting Application History from Ephemeral Clusters on Google Cloud Dataproc

Jacob Ferriero
Apr 2

So you want to use ephemeral Dataproc Clusters, but don’t want to lose your valuable YARN and Spark history once your cluster is torn down.

Enter the single-node history server cluster. It will serve as your persistent history server for MapReduce and Spark job history, as well as a window into aggregated YARN logs.


Background

Hadoop engineers who have been around for a while are probably used to the YARN and Spark UIs for application history. Some may even dig through the YARN container logs for a deeper layer of debugging. These UIs and container logs provide valuable information for diagnosing what went wrong during a given job. In the cloud era (no pun intended) of cluster ephemerality, you'd like a persistent way of accessing these familiar interfaces. Dataproc is the Google Cloud Platform service for managing Hadoop clusters, making it simple to spin up and tear down resources at will. This post outlines a pattern for running a single-node, long-running Dataproc cluster that acts as an Application History Server for a shorter-lived cluster (or several such clusters). It follows the best practices for creating long-running Dataproc environments described in Google's blog post on Cloud Dataproc job history (quoted later in this article).

What We’re Building

  • A GCS bucket to house cluster templates, initialization actions, aggregated YARN logs, Spark events and MapReduce done-dir and intermediate-done-dir.
  • A VPC network for your clusters. (Note: ssh access to private IP addressed machines could be provided by a VPN paired with this VPC, but this is out of scope for this example.)
  • A single-node Dataproc cluster for viewing the Spark and MapReduce job history UIs. This will have a public IP to allow simple SSH access via OS Login.
  • A Dataproc cluster configured to persist its YARN logs in GCS. This will not have public IP addresses.

Let’s Get Building

First things first, we need to enable the required APIs in the project where you wish to spin up this example.

gcloud config set project <your-project-id>
gcloud services enable \
compute.googleapis.com \
dataproc.googleapis.com

Next, pull down the Google Cloud repository and navigate to the source for this example.

git clone https://github.com/GoogleCloudPlatform/professional-services.git
cd professional-services/examples/dataproc-persistent-history-server

Using the Right Tool: gcloud vs Terraform

Capturing and version controlling the configuration of your cloud environment has become best practice. However, engineers are faced with different ways of doing this. Terraform provides a lot of value when you are spinning up several permanent components of your infrastructure with various dependencies; in other words, Terraform is great when you want something to authoritatively capture the state of your environment. The methods provided by the Google Cloud SDK (i.e. the gcloud CLI) are really convenient for spinning up a single resource during prototyping, and for managing ephemeral resources. When it comes to Dataproc clusters, if you plan to regularly tear down your cluster (as would be the case for a job-scoped cluster), it is best to capture the configuration as a cluster template yaml (this can be exported from an existing cluster or written by hand). It doesn't make sense to reapply a Terraform plan at the beginning and end of each job to manage cluster spin-up and tear-down, as this could be prone to breaking.
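For example, if you have already hand-tuned a cluster, you can capture its configuration as a reusable yaml template and recreate a congruent cluster from it later. This is a minimal sketch: the cluster names and output path are placeholders, and on older SDK versions these commands live under gcloud beta.

# Export an existing cluster's configuration to a version-controllable yaml template
gcloud dataproc clusters export my-tuned-cluster \
--region=us-central1 \
--destination=cluster_templates/my-tuned-cluster.yaml

# Later, create a new cluster from that template
gcloud dataproc clusters import my-ephemeral-cluster \
--source=cluster_templates/my-tuned-cluster.yaml \
--region=us-central1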

The code for this example provides both Terraform configuration and cluster template yaml for the history server as well as for a cluster configured to persist its logs on GCS. The thought behind this is:

  • If you are treating your history server as a rather permanent part of your environment, it should be in Terraform.
  • If you are treating the history server as something you spin up and tear down only when you need it, then it makes more sense to manage it as a cluster template yaml with gcloud.

Similarly, the Terraform configuration and the cluster template yamls will spin up congruent clusters. Choose whichever configuration tool makes more sense for your use of that cluster.

If you are just trying to run this example, it's recommended you read the Google Cloud SDK Environment Setup section without running the gcloud blocks. Just skip ahead to the Terraform Environment Set Up section, read through the files in the terraform directory, set the appropriate values in terraform.tfvars, and apply the plan for this example. Note: you'll need to have Terraform installed. This will greatly simplify tearing down all the resources for this example when you're done.
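If you go the Terraform route, terraform.tfvars is just a set of assignments for the variables declared in the example's Terraform files (typically a variables.tf). The variable names below are hypothetical placeholders, so check the example's own variable declarations for the real ones:

# terraform.tfvars (hypothetical variable names; consult the example's variables.tf)
project        = "your-gcp-project-id"
region         = "us-central1"
zone           = "us-central1-f"
history_bucket = "my_history_bucket"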

Terraform Environment Set Up

From the root of the example directory (examples/dataproc-persistent-history-server), run the following commands.

cd terraform
terraform init
terraform plan
terraform apply

You can skip the gcloud blocks in the next section, but I'd recommend skimming it for the explanations of why certain decisions were made when configuring the environment for this example.

Google Cloud SDK Environment Setup

This walk-through will spin up an environment using the gcloud CLI while explaining each choice.

Storage and Service Account Setup

Create a Google Cloud Storage Bucket in which you’ll store your logs. It’s a clean practice to keep logs in a separate bucket from your data to make the intention of the bucket clear and to simplify privileges.

(Note: You must stage an empty object in this GCS bucket so that the spark-events directory exists before running a job; otherwise the job will fail.)

gsutil mb -c regional -l us-central1 gs://my_history_bucket
touch .keep
gsutil cp .keep gs://my_history_bucket/spark-events/.keep
rm .keep

Create a service account for your history server.

gcloud iam service-accounts create history-server-account

Give this service account permissions to run a Dataproc cluster:

gcloud projects add-iam-policy-binding project_id \
--member=serviceAccount:history-server-account@project_id.iam.gserviceaccount.com \
--role=roles/dataproc.worker

Grant read access to the history bucket:

gsutil iam ch \
serviceAccount:history-server-account@project_id.iam.gserviceaccount.com:objectViewer \
gs://my_history_bucket

Setting up your Network

You want to take extra care in setting up the firewall rules to avoid the risk of exposing the YARN web interface (port 8088) to the public internet, which would allow bad actors to submit jobs on your cluster. The recommended approach for engineers connecting to a cluster is to use a VPN that is peered with your VPC network. This way, engineers can ssh into clusters by their private IP addresses using gcloud compute ssh --internal-ip <target-instance>. To automate Linux account creation you can enable OS Login. To keep this example simple, you will assign the history server a public IP address and a firewall rule to allow any ingress traffic on port 22.
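If you do want OS Login, enabling it is a single metadata flag. The sketch below applies it project-wide (it can also be set per instance):

gcloud compute project-info add-metadata \
--metadata enable-oslogin=TRUE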

Note: The firewall rule will expose an entry point to your Dataproc cluster from the public internet on port 22.

First things first, get your networking prerequisites in order. The following command creates a network with regional BGP routing and specifies that you will create your own subnetworks (custom subnet mode).

gcloud compute networks create example-net \
--bgp-routing-mode regional \
--subnet-mode custom

Add a sub-network for Dataproc. The following command creates a subnet for Dataproc in the us-central1 region with a CIDR range of 10.128.0.0/16, giving the machines on this network internal IPs from 10.128.0.0 to 10.128.255.255. You also enable Private Google Access, so that machines without public IP addresses on this network can still reach GCP services.

gcloud compute networks subnets create \
example-hadoop-subnet-central \
--range 10.128.0.0/16 \
--network example-net \
--region us-central1 \
--enable-private-ip-google-access

Next, create a firewall rule to allow ssh for the hadoop-history-ui-access tag. For the source range you're following suit with the default network and allowing ssh from all IPs. You should consider restricting this CIDR range to only the ranges used by the data engineering teams that need to ssh to the cluster.

gcloud compute firewall-rules create \
example-hadoop-subnet-central-hadoop-history-ui-ssh \
--allow tcp:22 \
--direction INGRESS \
--network example-net \
--target-tags hadoop-history-ui-access,hadoop-admin-ui-access \
--source-ranges 0.0.0.0/0

Finally, Hadoop uses a lot of ports for communication between nodes, so create this firewall rule as well to allow all TCP, UDP, and ICMP traffic between nodes carrying any of the specified tags.

gcloud compute firewall-rules create \
example-hadoop-subnet-central \
--network example-net \
--allow tcp,udp,icmp \
--source-tags hadoop-history-ui-access,hadoop-admin-ui-access \
--target-tags hadoop-history-ui-access,hadoop-admin-ui-access

Spinning up a History Server

Think of this history server as a window to the logs you’ve persisted on GCS. All of the state is captured on GCS so you can choose to spin this history server up and down at will or keep it running for convenience.

However, keep in mind:

… the MapReduce job history server only reads history from Cloud Storage when it first starts up. From that point forward, the only new job history you will see in the UI is for the jobs that were run on that cluster; in other words, these are jobs whose history was moved by that cluster’s job history server. By contrast, the Spark job history server is completely stateless, so the previous caveats do not apply. — Mikayla Konst & Chris Crosbie

Next, create a single-node Dataproc cluster on the sub-network created above. This will serve as your long-running Application History Server. You can use the cluster_templates/history-server.yaml file to create it.
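To give a sense of what that template configures (a sketch only; these property keys are the standard Spark and MapReduce history settings, and the exact contents of the repo's history-server.yaml may differ), it points the history UIs at the GCS locations your job clusters write to:

# Illustrative fragment of a Dataproc cluster template
config:
  softwareConfig:
    properties:
      spark:spark.history.fs.logDirectory: gs://my_history_bucket/spark-events
      mapred:mapreduce.jobhistory.done-dir: gs://my_history_bucket/done-dir
      mapred:mapreduce.jobhistory.intermediate-done-dir: gs://my_history_bucket/intermediate-done-dir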

First, replace the placeholders in the yaml files by running the following sed commands.

cd workflow_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *

cd ../cluster_templates
sed -i 's/PROJECT/your-gcp-project-id/g' *
sed -i 's/HISTORY_BUCKET/your-history-bucket/g' *
sed -i 's/HISTORY_SERVER/your-history-server/g' *
sed -i 's/REGION/us-central1/g' *
sed -i 's/ZONE/us-central1-f/g' *
sed -i 's/SUBNET/your-subnet-id/g' *

Now you’re ready to spin up the history server.

cd ..
gcloud beta dataproc clusters import \
history-server \
--source=cluster_templates/history-server.yaml \
--region=us-central1

Creating a Cluster and Submitting Jobs

Finally, spin up an ephemeral cluster that's configured to write its logs to Google Cloud Storage, so they can live long after the cluster is torn down. It's best practice to capture information about the clusters you are creating in a yaml file for consistent cluster creation.

Note: The cluster yamls refer to an initialization action on GCS that disables the local history servers on the ephemeral clusters. It can be found in init_actions/disable_history_servers.sh. You must stage this script in your history bucket.

gsutil cp init_actions/disable_history_servers.sh \
gs://my_history_bucket/init_actions/
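For context, the kind of cluster configuration that redirects logs and history to GCS looks roughly like this (a sketch only; the property keys are the standard YARN, MapReduce, and Spark settings, so see the cluster yamls in the repo for the real values and paths):

# Illustrative fragment of an ephemeral cluster's template
config:
  initializationActions:
  - executableFile: gs://my_history_bucket/init_actions/disable_history_servers.sh
  softwareConfig:
    properties:
      yarn:yarn.log-aggregation-enable: 'true'
      yarn:yarn.nodemanager.remote-app-log-dir: gs://my_history_bucket/yarn-logs
      mapred:mapreduce.jobhistory.done-dir: gs://my_history_bucket/done-dir
      mapred:mapreduce.jobhistory.intermediate-done-dir: gs://my_history_bucket/intermediate-done-dir
      spark:spark.eventLog.dir: gs://my_history_bucket/spark-events
      spark:spark.history.fs.logDirectory: gs://my_history_bucket/spark-events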

You can manage cluster spin-up, submission of a Spark job, submission of a Hadoop job, and finally cluster tear-down with a single Dataproc workflow template. This is the most straightforward way to run this example. The job dependencies and cluster definition can be found in spark_mr_workflow_template.yaml.

gcloud dataproc workflow-templates instantiate-from-file \
--region=us-central1 \
--file=workflow_templates/spark_mr_workflow_template.yaml
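For reference, a workflow template that runs jobs on a managed (ephemeral) cluster has roughly this shape. This is a hedged skeleton rather than the repo's actual spark_mr_workflow_template.yaml: the cluster name and the SparkPi and teragen jobs below are illustrative placeholders.

placement:
  managedCluster:
    clusterName: ephemeral-cluster
    config:
      gceClusterConfig:
        subnetworkUri: your-subnet-id
      # plus the softwareConfig and initializationActions sketched earlier
jobs:
- stepId: spark-job
  sparkJob:
    mainClass: org.apache.spark.examples.SparkPi
    jarFileUris:
    - file:///usr/lib/spark/examples/jars/spark-examples.jar
- stepId: hadoop-job
  prerequisiteStepIds:
  - spark-job
  hadoopJob:
    mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
    args:
    - teragen
    - '1000'
    - gs://my_history_bucket/teragen-output

When instantiated, the workflow service creates the managed cluster, runs the jobs in the order implied by prerequisiteStepIds, and tears the cluster down once the last job finishes.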

Analyze Your Spark Application History Long After Your Cluster Is Gone

Now that the cluster where you ran the jobs has been deleted, use an SSH tunnel to your history server to display the Spark history UI in your Chrome browser. The following commands are for a Linux OS; for other operating systems you can follow the instructions for connecting to cluster web interfaces in the Dataproc docs.

gcloud compute ssh history-server-m \
--project=your-project-id \
--zone=us-central1-f \
-- \
-D 1080 -N

Connect Chrome through the proxy and open the Spark history UI, which the history server's master node serves on port 18080 by default.

/usr/bin/google-chrome \
--proxy-server="socks5://localhost:1080" \
--user-data-dir="/tmp/history-server-m" \
http://history-server-m:18080

This should launch the Chrome browser and you should be able to see the Spark History UI.

Take a look at the MapReduce job history UI as well; it is also your window into the aggregated YARN logs (served on port 19888 of the master node by default).

/usr/bin/google-chrome \
--proxy-server="socks5://localhost:1080" \
--user-data-dir="/tmp/history-server-m" \
http://history-server-m:19888

Closing Thoughts

While this is just an example, hopefully it paints a picture of how you can keep the logs that are important for diagnosing misbehaving jobs and still confidently destroy clusters when jobs complete. Depending on your team structure, you could build on top of this:

  • If the engineers who author your pipelines do not also make infrastructure decisions: Create an API on top of workflow templates that abstracts these cluster configuration details away from your ETL engineers and lets them worry only about writing their pipelines.
  • If your ETL pipeline engineers want control over the infrastructure their pipelines run on: Implement a practice where, each time an ETL engineer defines a job, the job definition includes a workflow template that captures the job and the appropriate cluster configuration for it to run on. The configuration in this example can serve as boilerplate for the logging pattern discussed in this post.

Clean Up

Don’t forget to tear down the resources unless you want to pay for them :).

Listen to the “Threat Level Midnight” hero, and clean up the artifacts from this example.

If you spun up with Terraform, just run terraform destroy. If not, delete your infrastructure in the Cloud Console or with gcloud in the following order (a consolidated sketch of the gcloud commands follows the list):

  1. Dataproc clusters: gcloud dataproc clusters delete <cluster-id>
  2. GCS bucket: gsutil rb <bucket-name>; service account: gcloud iam service-accounts delete <service-account-id>; VPC subnet: gcloud compute networks subnets delete <subnet-name>
  3. VPC network: gcloud compute networks delete <network-name>
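A rough sketch of those commands using the names from this walk-through (adjust the region, project id, and resource names to match what you actually created; --quiet skips the confirmation prompts):

gcloud dataproc clusters delete history-server --region=us-central1 --quiet
# gsutil rm -r removes every object in the bucket and then the bucket itself
gsutil -m rm -r gs://my_history_bucket
gcloud iam service-accounts delete \
history-server-account@<your-project-id>.iam.gserviceaccount.com --quiet
gcloud compute firewall-rules delete \
example-hadoop-subnet-central-hadoop-history-ui-ssh \
example-hadoop-subnet-central --quiet
gcloud compute networks subnets delete \
example-hadoop-subnet-central --region=us-central1 --quiet
gcloud compute networks delete example-net --quiet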


Thanks to Daniel De Leo.
