Spark UI Monitoring with CloudFormation

Troubleshooting Made Simple: Setting Up Spark History Server for AWS Glue

Published in Globant, Aug 1, 2023

This article walks data engineers and data scientists working with Spark through setting up the Spark UI for an AWS Glue job. In a few simple steps you can get the Spark UI up and running, then use it to monitor the progress of your Glue job, check the logs for errors or issues, and find ways to improve the performance of your application.

Let’s first review what Spark, Spark UI, AWS Glue, and CloudFormation are.

Spark

Apache Spark is an open-source distributed computing system designed for big data processing and analytics. It provides a unified analytics engine that supports a wide range of data processing tasks, including batch processing, real-time streaming, machine learning, and graph processing. Spark offers high-speed, in-memory data processing capabilities, allowing for faster and more efficient data analysis and computation compared to traditional batch processing frameworks.

Spark UI

Spark UI, also known as the Spark Web UI, is a web-based user interface provided by Apache Spark. It offers detailed insights and real-time monitoring of Spark applications and clusters. Spark UI displays critical information such as job progress, task execution details, resource utilization, and application performance metrics. It enables developers and data engineers to monitor and optimize their Spark applications, identify bottlenecks, and troubleshoot issues for efficient data processing.

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS). It simplifies the process of preparing and loading data from various sources for analytics, machine learning, and other data processing tasks. Glue allows you to discover, catalog, transform, and move data between different data stores or data lakes, making it easier to analyze and derive insights from your data.

CloudFormation

AWS CloudFormation is a service offered by Amazon Web Services that enables you to provision and manage AWS resources in a declarative and automated manner. It allows you to define and deploy infrastructure as code (IaC) using templates written in YAML or JSON format. CloudFormation simplifies the process of setting up and managing AWS resources, such as virtual machines, storage, networking components, and other services, by automating the provisioning and configuration steps. This approach ensures consistency, repeatability, and scalability of infrastructure deployments, making it easier to manage and maintain complex cloud environments.

Before diving into the step-by-step guide for creating the Spark UI using CloudFormation, it is worth noting why the effort pays off: setting up the Spark UI for a Glue job is straightforward, and it helps with debugging the job and monitoring its performance. With this understanding in place, let’s proceed.

Step 1: Enable Spark Logging

Open the Glue job for which you want to see the Spark UI.
Check the box labeled Spark UI and, in the Spark UI logs path field, enter the S3 path where the logs should be stored.
Save the job.
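
The console checkbox corresponds to two Glue job parameters, `--enable-spark-ui` and `--spark-event-logs-path`. As a minimal sketch, assuming you prefer to manage job settings from Python, the helper below builds those two arguments. The function name and the bucket name are hypothetical placeholders, not part of the original walkthrough.

```python
def spark_ui_job_arguments(event_logs_path: str) -> dict:
    """Build the Glue job arguments that turn on Spark UI event logging."""
    if not event_logs_path.startswith("s3://"):
        raise ValueError("event_logs_path must be an S3 URI, e.g. s3://bucket/prefix/")
    return {
        # Same effect as ticking the "Spark UI" box in the console.
        "--enable-spark-ui": "true",
        # Same value as the "Spark UI logs path" field.
        "--spark-event-logs-path": event_logs_path,
    }


# Placeholder bucket and prefix; replace with your own log location.
args = spark_ui_job_arguments("s3://my-example-bucket/spark-logs/")
print(args)
```

The resulting dictionary can be merged into the job's default arguments when you create or update the job programmatically.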

Step 2: Region and Glue Version Selection

Open the AWS documentation page linked below to reach the Spark History Server page, where you must select the Glue version you are using and the respective region.

Launching the Spark history server

You can launch the Spark history server using a AWS CloudFormation template that hosts the server on an EC2 instance…

docs.aws.amazon.com

Step 3: Stack Creation

Upon clicking that link, a window will open where you will need to fill in some details. Under Prepare template, the default selected option is Template is ready.
In the Amazon S3 URL field, a default value will already be populated (for security reasons, the value has been deleted in this description).
Click Next.

Now that the basic details have been entered, there are a few more details that need to be filled in.

The Stack Name can be anything.
IpAddressRange field, enter 0.0.0.0/0 to allow access from any machine; a narrower IP range can be provided instead to restrict access.
HistoryServerPort field has a default value of 18080.
EventLogDir field, enter the S3 path where the Spark event logs are stored (the path set in Step 1).
SparkPackageLocation field has a default value that will be automatically populated, so you can leave it as is.

Figure: CloudFormation Specify Stack Details page, part 1
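
To make the IpAddressRange choice concrete, here is a small sketch using Python's standard `ipaddress` module. It compares 0.0.0.0/0 with a single-host range; the addresses are documentation placeholders, not values from the original setup.

```python
import ipaddress

# 0.0.0.0/0 matches every IPv4 address, so anyone can reach the server.
open_to_all = ipaddress.ip_network("0.0.0.0/0")

# A /32 range matches exactly one address (placeholder value).
single_host = ipaddress.ip_network("203.0.113.25/32")

# A hypothetical client that is NOT the allowed host.
client = ipaddress.ip_address("198.51.100.7")

print(client in open_to_all)   # membership test against the open range
print(client in single_host)   # membership test against the single-host range
print(open_to_all.num_addresses, single_host.num_addresses)
```

A narrower range is usually the safer choice when the history server only needs to be reachable from a known office or VPN address.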

KeyStorePath field is optional and can be left empty.
KeyStorePassword field, enter a password that you will remember.
InstanceType field, select the desired instance type; a micro or small instance will usually be sufficient.
VpcId field, select the VPC ID where all the Glue jobs are running.
SubnetId field, select the subnet where all the Glue jobs are running.
Click Next.
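
The fields above can also be collected as a CloudFormation parameter list. The following is a sketch under stated assumptions: the parameter keys are the field names from this walkthrough, the helper names are hypothetical, and all values in the example are placeholders. The `s3a://` rewrite reflects my understanding that the AWS documentation for this template asks for that scheme in EventLogDir; verify it against the launch page before relying on it.

```python
def to_s3a(path: str) -> str:
    """Rewrite an s3:// URI to the s3a:// scheme; leave other paths unchanged."""
    if path.startswith("s3://"):
        return "s3a://" + path[len("s3://"):]
    return path


def history_server_parameters(event_log_dir, keystore_password, instance_type,
                              vpc_id, subnet_id, ip_address_range="0.0.0.0/0",
                              history_server_port=18080, keystore_path=""):
    """Return stack parameters in the list-of-dicts shape CloudFormation expects."""
    values = {
        "IpAddressRange": ip_address_range,
        "HistoryServerPort": str(history_server_port),
        "EventLogDir": to_s3a(event_log_dir),
        "KeyStorePath": keystore_path,
        "KeyStorePassword": keystore_password,
        "InstanceType": instance_type,
        "VpcId": vpc_id,
        "SubnetId": subnet_id,
    }
    # SparkPackageLocation is omitted on purpose so the template default applies.
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


# All values below are placeholders; the instance type must be one the template allows.
params = history_server_parameters(
    event_log_dir="s3://my-example-bucket/spark-logs/",
    keystore_password="change-me",
    instance_type="t3.micro",
    vpc_id="vpc-0123456789abcdef0",
    subnet_id="subnet-0123456789abcdef0",
)
for p in params:
    print(p["ParameterKey"], "=", p["ParameterValue"])
```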

Figure: CloudFormation Specify Stack Details page, part 2

Step 4: Stack Configuration

In the Configure Stack section:

The IAM Role field is optional and can be left empty.
Click Next.

Figure: CloudFormation Configure Stack Options page
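
If you script the stack instead of using the console, the optional IAM role maps to the optional `RoleARN` argument of CloudFormation's `create_stack` call. The sketch below only assembles the keyword arguments. The helper name and the stack name are hypothetical, the template URL is a placeholder because the real value is not shown in this article, and the `CAPABILITY_IAM` entry assumes the template creates IAM resources.

```python
def create_stack_kwargs(stack_name, template_url, parameters, role_arn=None):
    """Assemble keyword arguments for a CloudFormation create_stack call."""
    kwargs = {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": parameters,
        # Assumption: the template creates IAM resources and needs this acknowledgement.
        "Capabilities": ["CAPABILITY_IAM"],
    }
    # The role is optional, so only include it when one is supplied.
    if role_arn:
        kwargs["RoleARN"] = role_arn
    return kwargs


# Placeholder values; use the template URL pre-filled by the launch link.
kwargs = create_stack_kwargs(
    stack_name="spark-history-server",
    template_url="https://example.com/placeholder-template.yaml",
    parameters=[{"ParameterKey": "HistoryServerPort", "ParameterValue": "18080"}],
)
print(sorted(kwargs))

# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   boto3.client("cloudformation").create_stack(**kwargs)
```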

After you click Next, review the settings and create the stack. The CloudFormation service will then launch an EC2 instance. This may take some time.
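
Because stack creation takes a while, a script has to wait for it to finish. The sketch below shows one way to poll, assuming a caller-supplied `get_status` function that returns the stack's current status string. The function names are hypothetical, and the demo uses canned statuses rather than a real AWS call.

```python
import time


def wait_for_stack(get_status, poll_seconds=15.0, timeout_seconds=1800.0,
                   sleep=time.sleep):
    """Poll until the stack reaches CREATE_COMPLETE; raise on failure or timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "CREATE_COMPLETE":
            return status
        if status != "CREATE_IN_PROGRESS":
            # Any other status (for example a rollback) means creation failed.
            raise RuntimeError(f"stack creation did not succeed: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError("stack creation is still in progress")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned statuses and a no-op sleep so it finishes instantly.
fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
result = wait_for_stack(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

In practice boto3's built-in `stack_create_complete` waiter does the same job; the explicit loop is shown only to make the waiting logic visible.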

Once the instance is available, select the stack you created, go to Outputs, and find SparkUiPublicUrl.
Copy that link and open it in any browser to view the Spark UI.
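
The same output can be read programmatically. As a sketch, the helper below pulls SparkUiPublicUrl out of a response shaped like the one CloudFormation's `describe_stacks` call returns. The helper name is hypothetical, and the sample response and URL are made-up placeholders, not real output.

```python
def find_stack_output(describe_stacks_response, output_key="SparkUiPublicUrl"):
    """Return the value of one stack output, or None if it is absent."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


# Hand-written sample in the describe_stacks response shape (placeholder values).
sample_response = {
    "Stacks": [
        {
            "StackName": "spark-history-server",
            "Outputs": [
                {"OutputKey": "SparkUiPublicUrl",
                 "OutputValue": "https://example.com:18080"},
            ],
        }
    ]
}

url = find_stack_output(sample_response)
print(url)
```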

Summary

This article serves as a helpful, easy-to-follow guide on how to create the Spark UI for an AWS Glue job. The Spark UI is a valuable tool that enables you to track the progress of your Glue job and examine logs for any errors or issues. The entire process is straightforward and proves highly beneficial for debugging and closely monitoring the job’s performance. By following these step-by-step instructions, you can gain valuable insights into optimizing your AWS Glue job effectively.