Persistent Spark History Server with Transient Dataproc clusters

Sushil Kumar
6 min read · May 26, 2021


These days, more and more data engineering teams are opting to run their Spark workloads on transient Dataproc clusters. Using transient clusters is a resource optimization technique you can apply in your architecture, and their cost proposition is compelling: you only pay for the time the cluster is actually in use.

However, one problem that arises from using transient clusters is that you lose access to the most valuable tool in a Spark developer’s arsenal: the Spark UI. On a transient cluster, the Spark UI is only available while the job is running, and after that you lose access to it.

In setups where Spark clusters are long-lived and always running, we also run a dedicated Spark History Server, which shows the Spark UI long after the job is gone.

The way the History Server works is that Spark writes event logs while an application is running. These logs power the Spark UI while the application is in the running state, and after the application finishes, the same logs can power the History Server UI.

In this article we are going to discuss how to set up a persistent Spark History Server that collects event logs from multiple Spark applications running on multiple transient clusters and serves the Spark UI after those applications finish.

So let’s get started.

Table of Contents

  1. Setup Infrastructure
  2. Enable Event Logs on Dataproc Clusters
  3. Demo
  4. Conclusion

1. Setup Infrastructure

GCS Bucket

We’ll first set up a bucket which we’ll use to store the History Server startup script, the Spark driver files, and the event logs.

PROJECT=<GCP-PROJECT-ID>
gsutil mb gs://$PROJECT

Spark History Server Docker Image

Next, we’ll create a Docker image to run our History Server. Any Docker image with a Spark distribution in it will do; however, since we’ll be storing our event logs in a Google Cloud Storage (GCS) bucket, the image also needs the Hadoop GCS connector so it can read the event data.

We are using Spark 2.4.0 as our base because the structure for storing event logs changed in Spark 3.0, and a Spark 3.0 History Server was not able to read event logs from previous versions (at least that’s what I experienced; please comment if that’s not the case).
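A minimal sketch of such a Dockerfile (Spark 2.4.0 plus the GCS connector) could look like the following; the base image, download URLs, and entrypoint wiring here are assumptions, so adapt them to your own setup.

# A minimal sketch of the History Server image; the base image, download
# URLs and entrypoint wiring are assumptions, not the exact original image.
FROM openjdk:8-jre-slim

ENV SPARK_HOME=/opt/spark

# Install the Spark 2.4.0 distribution.
RUN apt-get update && apt-get install -y curl && \
    curl -fsSL https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz | tar -xz -C /opt && \
    mv /opt/spark-2.4.0-bin-hadoop2.7 $SPARK_HOME

# Add the Hadoop GCS connector so the History Server can read gs:// paths.
RUN curl -fsSL -o $SPARK_HOME/jars/gcs-connector-hadoop2-latest.jar \
    https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar

# Small entrypoint: the first container argument is the GCS event log directory.
RUN echo '#!/bin/bash' > /entrypoint.sh && \
    echo 'export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=$1"' >> /entrypoint.sh && \
    echo 'exec /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer' >> /entrypoint.sh && \
    chmod +x /entrypoint.sh

EXPOSE 18080
ENTRYPOINT ["/entrypoint.sh"]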

Now build the docker image.

docker build -t spark-history-server:2.4.0 .

You can choose to build your own image and host it in your own registry, or you can use the one I pushed to my Docker Hub repository.

docker pull kaysush/spark-history-server:2.4.0

Spark History Server Compute VM

We’ll use a GCP Compute Engine VM to run the persistent History Server, with the following startup script to start the service.

Startup Script

Upload the following script to the bucket we created in the previous step.
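A minimal sketch of this start-server.sh, assuming the image and entrypoint from the Dockerfile sketch above, could look like this:

#!/bin/bash
# Sketch of start-server.sh; assumes the image built above.
# Replace <BUCKET-NAME> with your own bucket.

# Run the History Server, mapping its default port 18080 to port 8080 on
# the VM. The single container argument is the GCS directory where the
# Dataproc clusters will write their Spark event logs.
docker run -d \
  --restart=always \
  --name spark-history-server \
  -p 8080:18080 \
  kaysush/spark-history-server:2.4.0 \
  gs://<BUCKET-NAME>/spark-events/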

Notice the argument we pass to the Docker container: the path on our GCS bucket where the Spark event logs will be written.

We’ll configure our Dataproc clusters to write event logs to this location, where they can be picked up by our History Server.

Compute Instance

Next, let’s create the VM.

gcloud beta compute \
--project=<PROJECT-ID> instances create spark-history-server \
--zone=us-central1-a \
--machine-type=e2-medium \
--subnet=default \
--network-tier=PREMIUM \
--metadata=startup-script-url=gs://<BUCKET-NAME>/start-server.sh \
--maintenance-policy=MIGRATE \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--tags=spark-history-server \
--image=cos-77-12371-1109-0 \
--image-project=cos-cloud \
--boot-disk-size=10GB \
--boot-disk-type=pd-balanced \
--boot-disk-device-name=spark-history-server \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--reservation-affinity=any

Firewall Rule

We also have to create a firewall rule to allow access to the History Server UI from outside.

gcloud compute firewall-rules create allow-spark-history-server \
--network default \
--priority 1000 \
--direction ingress \
--action allow \
--target-tags spark-history-server \
--source-ranges 0.0.0.0/0 \
--rules tcp:8080 \
--enable-logging

Once all of these components are up and running, you should be able to access the History Server UI at http://<EXTERNAL-IP>:8080.

Currently it won’t have any applications to show. So in the next section, we’ll configure our clusters to write event logs to the bucket, where the History Server can pick them up.

2. Enable Event Logs on Dataproc Clusters

To enable event logs on Dataproc clusters, we need to set the following properties, either at the cluster level or per job. I would suggest adding them at the cluster level so that event logging stays transparent to the applications.

spark.eventLog.enabled=true
spark.eventLog.dir=gs://<BUCKET-NAME>/spark-events

Adding these two properties enables event logging, and our Spark applications will write their event logs to the GCS bucket.
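If you prefer setting them per job instead, the same properties can also be passed at submission time, for example:

gcloud dataproc jobs submit pyspark <DRIVER-FILE> \
  --cluster <CLUSTER-NAME> \
  --region <REGION> \
  --properties spark.eventLog.enabled=true,spark.eventLog.dir=gs://<BUCKET-NAME>/spark-events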

We also set two more properties to allow rolling event log files, in case the logs get big for long-running applications.

spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=128m

Let’s create a Dataproc cluster with these properties.

gcloud beta dataproc clusters create demo-cluster \
--region us-central1 \
--zone us-central1-b \
--master-machine-type n1-standard-2 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 500 \
--image-version 1.3-ubuntu18 \
--properties spark:spark.eventLog.enabled=true,spark:spark.eventLog.dir=gs://<BUCKET-NAME>/spark-events/,spark:spark.eventLog.rolling.enabled=true,spark:spark.eventLog.rolling.maxFileSize=128m \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project <PROJECT-ID>

We are using Spark 2.3 in our Dataproc cluster (image version 1.3) to confirm whether our 2.4 History Server can pick up its logs.

3. Demo

Once the cluster is created, we’ll run a couple of sample PySpark jobs to generate some event logs.

Upload the following driver file to the bucket so it can be used in our Spark jobs.
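The driver doesn’t need to do anything fancy. A minimal placeholder (here called sample_job.py, an assumed name) that just runs a small aggregation to produce a few stages of events could look like this:

# sample_job.py - a small PySpark job used only to generate some event logs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("history-server-demo").getOrCreate()

# A tiny aggregation so the job produces a few stages and tasks.
df = spark.range(0, 1000000)
counts = df.selectExpr("id % 10 AS bucket").groupBy("bucket").count()
counts.show()

spark.stop()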

Next, run this job a couple of times on our cluster.
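Assuming the driver was uploaded as sample_job.py, each run is a single job submission:

gcloud dataproc jobs submit pyspark gs://<BUCKET-NAME>/sample_job.py \
  --cluster demo-cluster \
  --region us-central1 \
  --project <PROJECT-ID>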

Once the applications finish, go to the bucket and you’ll see event log files under gs://<BUCKET-NAME>/spark-events/, exactly where we configured our cluster to write them.
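You can also list them from the terminal:

gsutil ls gs://<BUCKET-NAME>/spark-events/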

Spark Event Log Files

Check the logs of our History Server and you’ll see that it has picked up these files. The History Server keeps polling the event log directory for new applications; the Spark property that governs this interval is spark.history.fs.update.interval (10s by default).
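With the startup-script sketch above, where the container is named spark-history-server, you can tail those logs on the VM with:

docker logs -f spark-history-server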

History Server Logs

Open the History Server UI and you’ll see your applications.

History Server UI

4. Conclusion

Having a persistent Spark History Server running alongside transient Dataproc clusters is of great value: Spark developers can look at application stats long after the applications finish.

One thing to note, however, is that on a production server you should also set up the cleanup properties to make sure the History Server is not loading applications from the beginning of time.
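For example, the following History Server properties (which you could pass via SPARK_HISTORY_OPTS in the image sketched above) enable periodic cleanup of old event logs; the interval and retention values here are just illustrative:

spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d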

That’s it. That’s how we can set up a persistent History Server for our transient Dataproc clusters.

If you find anything wrong in the post or have questions in general, feel free to comment.

Till then Happy Coding! :)

Also, if you wish to learn how to use Dataproc to run/migrate your Spark Workloads, you can follow my free course “Apache Spark on Dataproc | Google Cloud Series” on YouTube.
