Apache Spark UI Monitoring with Docker

Step-by-Step Guide for Troubleshooting Apache Spark's Performance

Abhishek Saitwal
Globant
4 min read · May 6, 2024


AI-Generated Image of Spark and Docker

This article is for data engineers and scientists who use Apache Spark and want to optimize the execution efficiency of their applications within the AWS Glue environment. We will provide a step-by-step guide to setting up the Spark UI for your Glue job. By leveraging it, you can gain real-time insight into job progress and examine the logs to identify and address errors and performance bottlenecks.

Why Docker?

In a previous article, we discovered that the conventional approach of setting up the Spark History Server through CloudFormation is not cost-effective, as it requires booting up EC2 instances on AWS.

A more budget-friendly alternative eliminates the need for EC2 instances entirely and reduces the cost to zero while being just as effective: running the History Server on your local Windows machine.

Docker is a tool that lets you run software in containers, which are small, isolated virtual environments. With it, you can set up and run the Spark History Server on your local Windows machine instead of on an AWS EC2 instance. That means no instance costs to worry about: you simply start the container whenever you want to check the Spark UI and stop it when you are done. This way, you save money and keep more control over your Spark environment.

The following five sections will show all the needed steps; let's move on!

Step 1: Install Docker on Local System

To set up Docker on your Windows machine, refer to the instructions in the following link. This comprehensive guide walks you through each step, ensuring a seamless installation process tailored to your system.
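
Once the installation completes, a quick check from a terminal confirms that the Docker engine is running (hello-world is Docker's standard test image):

$ docker --version
$ docker run hello-world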

Step 2: Enable Spark Logging

To view the Spark UI for a particular Glue job, navigate to the corresponding Glue console. Once there, ensure the Spark UI checkbox is selected. Then, in the Spark UI logs path field, enter the S3 path where the logs referenced by the Spark UI should be stored. For additional guidance, refer to the accompanying image. Lastly, save the job to retain these configurations. The same settings can also be supplied as job parameters, as shown after the image.

Glue Spark UI Setup
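
If you prefer to script this step, Glue exposes the same settings as the special job parameters --enable-spark-ui and --spark-event-logs-path. A minimal sketch using the AWS CLI from a POSIX shell, assuming a hypothetical job name my-glue-job and bucket path (replace both with your own):

$ aws glue start-job-run --job-name my-glue-job --arguments '{"--enable-spark-ui":"true","--spark-event-logs-path":"s3://my-log-bucket/spark-ui-logs/"}'

Passing the parameters per run this way leaves the saved job definition unchanged.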

Step 3: Setting Up IAM Access

Create an IAM user from the Users tab of the IAM console and attach permissions for S3, since the designated S3 bucket will serve as the repository for the Spark logs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<<Bucket Name>>",
        "arn:aws:s3:::<<Bucket Name>>/*"
      ]
    }
  ]
}

Note that s3:ListBucket applies to the bucket itself, while s3:GetObject applies to the objects inside it, which is why both the bucket ARN and the /* object ARN are listed.

Attach this policy, then issue an access key and secret access key for the designated user. Scoping the policy to a single bucket ensures the user's access is restricted solely to the designated S3 bucket housing the logs.
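
If you script your setup, the console steps above map onto a few IAM CLI calls. A sketch, assuming a hypothetical user name spark-ui-reader and the policy above saved locally as spark-ui-policy.json:

$ aws iam create-user --user-name spark-ui-reader
$ aws iam put-user-policy --user-name spark-ui-reader --policy-name SparkUILogsRead --policy-document file://spark-ui-policy.json
$ aws iam create-access-key --user-name spark-ui-reader

The last command prints the access key ID and secret access key exactly once, so record them for Step 5.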

Step 4: Downloading Required Files

Access the provided link to retrieve the Dockerfile and pom.xml files. Refer to the highlighted sections in the accompanying image to identify the specific files for download.

Dockerfile and pom.xml files to download
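
At the time of writing, these files are published in AWS's aws-glue-samples repository; the exact folder may vary by Glue version, so treat this path as an assumption and verify it against the link above:

$ git clone https://github.com/aws-samples/aws-glue-samples.git
$ cd aws-glue-samples/utilities/Spark_UI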

Step 5: Setting up Docker

Execute the following command in Windows PowerShell to build the Spark image.

$ docker build -t glue/sparkui:latest .
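
The build must be run from the folder containing the Dockerfile downloaded in Step 4. Once it finishes, a quick check confirms the image exists:

$ docker images glue/sparkui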

The build command creates the SparkUI image, which will also be visible in Docker Desktop's Images tab. Next, start the History Server container. In PowerShell, define the S3 log path and the credentials from Step 3 as variables, then pass them to docker run:

$LOG_DIR = "s3a://path_to_eventlog/"
$AWS_ACCESS_KEY_ID = "AKIAxxxxxxxxxxxx"
$AWS_SECRET_ACCESS_KEY = "yyyyyyyyyyyyyyy"
docker run -itd -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=$LOG_DIR -Dspark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
SparkUI Container Running

Clicking the 18080:18080 port mapping in Docker Desktop (or browsing to http://localhost:18080) will open the Spark UI page. From there, you can analyze the jobs, monitor their processing, and identify potential bottlenecks.
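
If the page does not load, the container's logs usually point to the cause, most often a wrong S3 path or bad credentials. Both checks below use standard Docker commands; take the container ID from the first column of docker ps:

$ docker ps
$ docker logs <container-id>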

Conclusion

The instructions in this post help you set up the Spark History Server interface for an AWS Glue job. The Spark UI is a useful tool that lets you monitor the status of your Glue jobs and check the logs for problems or mistakes. The entire procedure is simple to follow and is very helpful for troubleshooting and closely observing a job's progress. These instructions should provide the insight you need to improve your AWS Glue jobs efficiently.
