Persisting Application History from Ephemeral Clusters on Google Cloud Dataproc

Jacob Ferriero
Apr 2, 2019

So you want to use ephemeral Dataproc Clusters, but don’t want to lose your valuable YARN and Spark history once your cluster is torn down.

Enter the single-node Dataproc cluster. It will serve as your persistent history server for MapReduce and Spark job history, as well as a window into aggregated YARN logs.


Background

Hadoop engineers who have been around for a while are probably used to the YARN and Spark UIs for application history. Some may even dig through the YARN container logs for a deeper layer of debugging. These UIs and container logs provide valuable information for diagnosing what went wrong during a certain job. In the cloud era (no pun intended) of cluster ephemerality, you’d like a persistent way of accessing these familiar interfaces. Dataproc is the Google Cloud Platform service for managing Hadoop clusters, making it simple to spin up and tear down resources at will. This post outlines a pattern for running a single-node, long-lived Dataproc cluster that acts as an application history server for a shorter-lived cluster (or several such clusters). It follows the best practices for long-running Dataproc environments described in Google’s blog post “10 tips for building long-running clusters using Cloud Dataproc”.

What We’re Building

  • A GCS bucket to house cluster templates, initialization actions, aggregated YARN logs, Spark events, and MapReduce job history.
  • A VPC network for your clusters. (Note: ssh access to private IP addressed machines could be provided by a VPN paired with this VPC, but this is out of scope for this example.)
  • A single-node Dataproc cluster for viewing the Spark and MapReduce job history UIs. This will have a public IP to allow simple SSH access via OS Login.
  • A Dataproc cluster configured to persist its YARN logs in GCS. This will not have public IP addresses.

Let’s Get Building

First things first, enable a few services in the project where you wish to spin up this example.
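The exact list depends on your project, but a minimal sketch of the commands (assuming the Dataproc, Compute Engine, and Cloud Storage APIs are what’s needed) looks like this:

    # Work in the project you want to use for this example (placeholder ID).
    gcloud config set project your-project-id

    # Enable the services this example relies on.
    gcloud services enable \
        dataproc.googleapis.com \
        compute.googleapis.com \
        storage-component.googleapis.com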

Next, pull down the Google Cloud professional services repository and navigate to the source for this example.
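Something along these lines, assuming the example still lives under examples/dataproc-persistent-history-server in that repository:

    git clone https://github.com/GoogleCloudPlatform/professional-services.git
    cd professional-services/examples/dataproc-persistent-history-server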

Using the Right Tool: gcloud vs Terraform

Capturing and version controlling the configuration of your cloud environment has become best practice. However, engineers are faced with different ways of doing this. Terraform provides a lot of value when you are spinning up several permanent components of your infrastructure with various dependencies; in other words, Terraform is great when you want something to authoritatively capture the state of your environment. The methods provided by the Google Cloud SDK (i.e., the gcloud CLI) are really convenient for spinning up a single resource during prototyping and for managing ephemeral resources. When it comes to Dataproc clusters, if you plan to regularly tear down your cluster (as would be the case for a job-scoped cluster), it is best to capture the configuration as a cluster template YAML (this can be exported from an existing cluster or written by hand). It doesn’t make sense to reapply a Terraform plan at the beginning and end of every job to manage cluster spin-up and tear-down, as this would be prone to breaking.
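For instance, exporting a running cluster’s configuration to a template YAML is a one-liner (the cluster name, region, and output file below are placeholders):

    # Export an existing cluster's configuration so congruent clusters
    # can be re-created from it later.
    gcloud dataproc clusters export my-existing-cluster \
        --region=us-central1 \
        --destination=cluster_template.yaml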

The code for this example provides both Terraform and cluster template YAML for the history server and for a cluster configured to persist its logs on GCS. The thought behind this is:

  • If you are treating your history server as a rather permanent part of your environment, it should be defined in Terraform.
  • If you are treating the history server as something you spin up and tear down only when you’re using it, it makes more sense to create it with gcloud from a cluster template YAML.

Similarly, the Terraform configuration and the cluster template YAML will spin up congruent clusters. Choose whichever cluster configuration tool makes more sense for your use of that cluster.

If you are just trying to run this example, it’s recommended you read the Environment Setup section below without running the command blocks: just check out the professional-services repo, read through the Terraform files in the example directory, set the appropriate values in the variables file, and apply the plan. Note that you’ll need Terraform installed. Going this route will greatly simplify tearing down all the resources for this example when you’re done.

Terraform Environment Setup

From the dataproc-persistent-history-server directory, run the following commands.
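This is the standard Terraform workflow; fill in the variables file with your project, region, and bucket values first:

    # Download providers and initialize the working directory.
    terraform init

    # Review the resources that will be created, then create them.
    terraform plan
    terraform apply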

You can skip the command blocks in the next section, but I’d recommend skimming it for an explanation of why certain decisions were made when configuring the environment for this example.

Environment Setup

This walk-through will spin up an environment using the gcloud CLI while explaining each choice.

Storage and Service Account Setup

Create a Google Cloud Storage Bucket in which you’ll store your logs. It’s a clean practice to keep logs in a separate bucket from your data to make the intention of the bucket clear and to simplify privileges.

(Note: You must stage an empty object in this GCS bucket so that the path exists before running a job, otherwise the job will fail.)
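A sketch with a placeholder bucket name; the directory names are assumptions, so match them to whatever paths your cluster templates reference:

    HISTORY_BUCKET=your-project-id-history-bucket

    # A dedicated bucket for YARN logs, Spark events, and MapReduce job history.
    gsutil mb -l us-central1 gs://${HISTORY_BUCKET}

    # Stage empty objects so the log paths exist before any job runs.
    touch .keep
    gsutil cp .keep gs://${HISTORY_BUCKET}/yarn-logs/.keep
    gsutil cp .keep gs://${HISTORY_BUCKET}/spark-events/.keep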

Create a service account for your history server.
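For example (the account name is a placeholder):

    gcloud iam service-accounts create history-server \
        --display-name="Dataproc history server"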

Give this service account permissions to run a Dataproc cluster:
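The roles/dataproc.worker role covers what the cluster’s VMs need in order to run as Dataproc nodes; the account name is the placeholder from above:

    PROJECT=your-project-id

    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member="serviceAccount:history-server@${PROJECT}.iam.gserviceaccount.com" \
        --role="roles/dataproc.worker"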

Grant read access to the history bucket:
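The history server only needs to read the persisted logs and events, so object viewer on the bucket is enough:

    PROJECT=your-project-id
    HISTORY_BUCKET=your-project-id-history-bucket

    gsutil iam ch \
        "serviceAccount:history-server@${PROJECT}.iam.gserviceaccount.com:roles/storage.objectViewer" \
        gs://${HISTORY_BUCKET}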

Setting up your Network

You want to take extra care in setting up the firewall rules to avoid the common security pitfall of exposing the YARN web interface on port 8088 to the public internet, which would allow bad actors to submit jobs on your cluster. The recommended approach is for engineers to connect to clusters via a VPN that is peered with your VPC network. This way, engineers can SSH into clusters by referencing their private IP addresses. To automate Linux account creation you can enable OS Login. To keep this example simple, you will assign the history server a public IP address and a firewall rule to allow ingress traffic on port 22.
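If you do go the OS Login route, it can be enabled project-wide with a single metadata key:

    # Enable OS Login for all VMs in the project.
    gcloud compute project-info add-metadata \
        --metadata enable-oslogin=TRUE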

Note: The firewall rule will expose an entry point to your Dataproc cluster from the public internet on port 22.

First things first, get your networking prerequisites in order. The following command creates a VPC network in custom subnet mode, meaning you will create your own subnetworks.
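A sketch, with a placeholder network name:

    NETWORK=dataproc-network

    # Custom subnet mode: no subnetworks are created automatically.
    gcloud compute networks create ${NETWORK} \
        --subnet-mode=custom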

Add a sub-network for Dataproc. The following command creates a subnet for Dataproc in the us-central1 region with a CIDR range that defines the internal IP addresses for machines on this network. We also enable Private Google Access, so that machines without public IP addresses on this network can reach GCP services.
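Something like the following; the CIDR range here is an assumption, so pick one that fits your address plan:

    NETWORK=dataproc-network
    SUBNET=dataproc-subnet

    gcloud compute networks subnets create ${SUBNET} \
        --network=${NETWORK} \
        --region=us-central1 \
        --range=10.0.0.0/24 \
        --enable-private-ip-google-access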

Next, create a firewall rule to allow SSH to instances carrying the history server’s network tag. For the source range you’re following suit with the default network and allowing SSH from all IPs. You should consider restricting this CIDR range to only those of the data engineering teams that need to SSH to the cluster.
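A sketch; the rule and tag names are placeholders and should match whatever tags your history server template applies:

    NETWORK=dataproc-network

    # Allow SSH from anywhere to instances carrying the tag.
    # Consider narrowing --source-ranges to your engineers' CIDR blocks.
    gcloud compute firewall-rules create allow-ssh \
        --network=${NETWORK} \
        --allow=tcp:22 \
        --target-tags=hadoop-history-ui-access \
        --source-ranges=0.0.0.0/0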

Finally, Hadoop uses a lot of ports for communication between nodes, so create a firewall rule to allow all TCP, UDP, and ICMP traffic between nodes carrying any of the specified tags.
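Again with placeholder names; the source and target tags should match the tags on your cluster templates:

    NETWORK=dataproc-network

    # Allow all traffic between cluster nodes that carry the cluster tag.
    gcloud compute firewall-rules create allow-internal-dataproc \
        --network=${NETWORK} \
        --allow=tcp,udp,icmp \
        --source-tags=dataproc-cluster \
        --target-tags=dataproc-cluster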

Spinning up a History Server

Think of this history server as a window to the logs you’ve persisted on GCS. All of the state is captured on GCS so you can choose to spin this history server up and down at will or keep it running for convenience.

However, keep in mind:

… the MapReduce job history server only reads history from Cloud Storage when it first starts up. From that point forward, the only new job history you will see in the UI is for the jobs that were run on that cluster; in other words, these are jobs whose history was moved by that cluster’s job history server. By contrast, the Spark job history server is completely stateless, so the previous caveats do not apply. — Mikayla Konst & Chris Crosbie: 10 tips for building long-running clusters using Cloud Dataproc

To configure our clusters to use GCS for log aggregation and point them at the history server, we need properties like this:
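In the example these properties live in the cluster template YAMLs; as a sketch, the same settings expressed as a gcloud --properties string look like this (the bucket name is a placeholder, and "history-server-m" assumes the history server cluster is named "history-server"):

    HISTORY_BUCKET=your-project-id-history-bucket

    # Aggregate YARN logs, MapReduce job history, and Spark events to GCS,
    # and point the log server URL at the history server's master VM.
    PROPERTIES="yarn:yarn.log-aggregation-enable=true"
    PROPERTIES+=",yarn:yarn.nodemanager.remote-app-log-dir=gs://${HISTORY_BUCKET}/yarn-logs"
    PROPERTIES+=",yarn:yarn.log.server.url=http://history-server-m:19888/jobhistory/logs"
    PROPERTIES+=",mapred:mapreduce.jobhistory.done-dir=gs://${HISTORY_BUCKET}/done-dir"
    PROPERTIES+=",mapred:mapreduce.jobhistory.intermediate-done-dir=gs://${HISTORY_BUCKET}/intermediate-done-dir"
    PROPERTIES+=",spark:spark.eventLog.dir=gs://${HISTORY_BUCKET}/spark-events"
    PROPERTIES+=",spark:spark.history.fs.logDirectory=gs://${HISTORY_BUCKET}/spark-events"

    # Pass via --properties on gcloud dataproc clusters create, or set the same
    # keys under softwareConfig.properties in a cluster template YAML.
    echo "${PROPERTIES}"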

Next, create a single-node Dataproc cluster on the sub-network created above. This will serve as your long-running application history server. You can use the history server cluster template YAML from the example to create it.

First, replace the placeholders in the YAML files by running the following commands.
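The placeholder tokens below are assumptions; check the YAML files in the example for the exact strings to substitute:

    PROJECT=your-project-id
    HISTORY_BUCKET=your-project-id-history-bucket

    # Swap placeholder tokens in the cluster templates for your own values.
    sed -i "s/YOUR_PROJECT/${PROJECT}/g; s/YOUR_HISTORY_BUCKET/${HISTORY_BUCKET}/g" *.yaml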

Now you’re ready to spin up the history server.
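Importing the template creates the cluster; the file name is an assumption, so point it at the history server YAML from the example:

    gcloud dataproc clusters import history-server \
        --source=history_server.yaml \
        --region=us-central1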

Creating a Cluster and Submitting Jobs

Finally, spin up an ephemeral cluster that’s configured to write logs to Google Cloud Storage, so they can live long after the cluster is torn down. It’s best practice to capture information about the clusters you create in a YAML file for consistent cluster creation.

Note: The cluster YAMLs refer to an initialization action on GCS that disables the local history servers on the ephemeral clusters. The script ships with the example; you must stage it in your history bucket.
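Staging is just a copy to the bucket; the script path below is an assumption, so use the initialization action shipped with the example:

    HISTORY_BUCKET=your-project-id-history-bucket

    gsutil cp init_actions/disable_history_servers.sh \
        gs://${HISTORY_BUCKET}/init_actions/disable_history_servers.sh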

You can manage cluster spin-up, submission of a Spark job, submission of a Hadoop job, and finally cluster tear-down with a workflow template. This is the most straightforward way to run this example. The job dependencies and cluster definition can be found in the workflow template YAML in the example.
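Instantiating the workflow from its file handles the whole lifecycle; the file name is an assumption, so point it at the workflow template in the example:

    gcloud dataproc workflow-templates instantiate-from-file \
        --file=workflow_template.yaml \
        --region=us-central1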

Analyze Your Spark Application History Long After Your Cluster Is Gone

Now that the cluster where you ran the job has been deleted, use an SSH tunnel to your history server in order to display the Spark history in your Chrome browser. The following commands are for Linux; you can follow the instructions for your OS in the Dataproc docs.
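A sketch of the SOCKS-proxy approach from the Dataproc docs, assuming the history server cluster is named "history-server" (so its master VM is "history-server-m"):

    ZONE=us-central1-a   # zone where the history server runs

    # Open a SOCKS proxy on localhost:1080 through the history server's master VM.
    gcloud compute ssh history-server-m \
        --zone=${ZONE} \
        -- -D 1080 -N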

Connect Chrome through the proxy.
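On Linux that looks roughly like this (the binary path and temporary profile directory are assumptions); the Spark history UI listens on port 18080:

    /usr/bin/google-chrome \
        --proxy-server="socks5://localhost:1080" \
        --user-data-dir=/tmp/history-server-m \
        http://history-server-m:18080

    # The MapReduce job history / aggregated YARN logs UI is on port 19888
    # of the same host, reachable through the same proxy.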

This should launch the Chrome browser and you should be able to see the Spark History UI.

Take a look at the YARN History Server.

[Update 6/22/2019 - Using Dataproc Component Gateway]

Added some additional configuration to the cluster templates to enable Dataproc Component Gateway, a feature (alpha at the time of writing) for accessing the cluster’s UIs from links in the GCP console without needing to create an SSH tunnel. Note that this will not work with VPC Service Controls (VPC-SC).
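If you’d rather create the history server with gcloud, the equivalent (beta at the time of writing) flag looks like this; the cluster name and region are placeholders:

    gcloud beta dataproc clusters create history-server \
        --region=us-central1 \
        --single-node \
        --enable-component-gateway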

Closing Thoughts

While this is just an example, hopefully it paints a picture of how you can maintain the logs that are important for diagnosing misbehaving jobs and still confidently destroy clusters when jobs complete. Depending on your team structure, you could build on top of this in a couple of ways:

  • If the engineers that author your pipelines do not also make infrastructure decisions: Create an API on top of workflow templates that abstracts these cluster configuration details away from your ETL engineers and allow them to just worry about writing their pipelines.
  • If your ETL pipeline engineers want control over the infrastructure their pipelines run on: Implement a practice where, each time an ETL engineer defines a job, part of that job definition includes a workflow template that captures the job and the appropriate cluster configuration for it to run on. The configuration in this example can serve as boilerplate for the logging pattern discussed in this post.

Clean Up

Don’t forget to tear down the resources unless you want to pay for them :).

Listen to the “Threat Level Midnight” hero, and clean up the artifacts from this example.

If you spun up with Terraform, just run terraform destroy. If not, delete your infrastructure in the Cloud Console or with the gcloud CLI in the following order (a sketch with placeholder names follows the list):

  1. Dataproc clusters
  2. GCS bucket, service account, firewall rules, VPC subnet
  3. VPC network
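A sketch of the gcloud/gsutil teardown, using the placeholder names from earlier:

    PROJECT=your-project-id
    HISTORY_BUCKET=your-project-id-history-bucket
    REGION=us-central1

    # 1. Dataproc clusters
    gcloud dataproc clusters delete history-server --region=${REGION} --quiet

    # 2. Bucket, service account, firewall rules, and subnet
    gsutil rm -r gs://${HISTORY_BUCKET}
    gcloud iam service-accounts delete \
        "history-server@${PROJECT}.iam.gserviceaccount.com" --quiet
    gcloud compute firewall-rules delete allow-ssh allow-internal-dataproc --quiet
    gcloud compute networks subnets delete dataproc-subnet --region=${REGION} --quiet

    # 3. VPC network
    gcloud compute networks delete dataproc-network --quiet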


Thanks to Daniel De Leo
