Spark UI Monitoring with CloudFormation
Troubleshooting Made Simple: Setting Up Spark History Server for AWS Glue
This article is for data engineers and data scientists who work with Spark and want to improve the performance of their applications. It walks through setting up the Spark UI for an AWS Glue job in a few simple steps. Once it is running, you can use the Spark UI to monitor the progress of your Glue job and check the logs for errors or issues.
Let’s first review what Spark, Spark UI, AWS Glue, and CloudFormation are.
Spark
Apache Spark is an open-source distributed computing system designed for big data processing and analytics. It provides a unified analytics engine that supports a wide range of data processing tasks, including batch processing, real-time streaming, machine learning, and graph processing. Spark offers high-speed, in-memory data processing capabilities, allowing for faster and more efficient data analysis and computation compared to traditional batch processing frameworks.
Spark UI
Spark UI, also known as the Spark Web UI, is a web-based user interface provided by Apache Spark. It offers detailed insights and real-time monitoring of Spark applications and clusters. Spark UI displays critical information such as job progress, task execution details, resource utilization, and application performance metrics. It enables developers and data engineers to monitor and optimize their Spark applications, identify bottlenecks, and troubleshoot issues for efficient data processing.
AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS). It simplifies the process of preparing and loading data from various sources for analytics, machine learning, and other data processing tasks. Glue allows you to discover, catalog, transform, and move data between different data stores or data lakes, making it easier to analyze and derive insights from your data.
CloudFormation
AWS CloudFormation is a service offered by Amazon Web Services that enables you to provision and manage AWS resources in a declarative and automated manner. It allows you to define and deploy infrastructure as code (IaC) using templates written in YAML or JSON format. CloudFormation simplifies the process of setting up and managing AWS resources, such as virtual machines, storage, networking components, and other services, by automating the provisioning and configuration steps. This approach ensures consistency, repeatability, and scalability of infrastructure deployments, making it easier to manage and maintain complex cloud environments.
Before diving into the step-by-step guide for creating the Spark UI using CloudFormation, it’s worth restating why the Spark UI is useful: setting it up for a Glue job is straightforward, and it helps with debugging and monitoring the job’s performance. With this understanding in place, let’s proceed.
Step 1: Enable Spark Logging
- Open the Glue job for which you want to see the Spark UI.
- Check the box labeled Spark UI and enter the S3 path where the logs should be stored in the Spark UI logs path field.
- Save the job.
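The same two settings can also be supplied as job parameters instead of through the console checkbox. The following is a minimal sketch, assuming the Glue special parameters `--enable-spark-ui` and `--spark-event-logs-path`; the helper name `spark_ui_job_arguments` and the bucket path are hypothetical placeholders, not part of any AWS SDK.

```python
def spark_ui_job_arguments(event_logs_path: str) -> dict:
    """Build the Glue job arguments that turn on Spark event logging.

    event_logs_path is the S3 location entered in the "Spark UI logs path"
    field; the value used below is a made-up placeholder.
    """
    if not event_logs_path.startswith("s3://"):
        raise ValueError("expected an s3:// path for the Spark event logs")
    # Normalize to a trailing slash so the value reads as a folder prefix.
    if not event_logs_path.endswith("/"):
        event_logs_path += "/"
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_path,
    }


if __name__ == "__main__":
    arguments = spark_ui_job_arguments("s3://my-example-bucket/spark-event-logs")
    for name, value in arguments.items():
        print(name, value)
```

A dictionary like this could be merged into the job's default arguments, for example when updating the job through the AWS SDK, but check the parameter names against the Glue documentation for your Glue version before relying on them.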
Step 2: Region and Glue Version Selection
Click on this link to be redirected to the Spark History Server page, where you must select the Glue version you are using and the respective region.
Step 3: Stack Creation
- Upon clicking that link, a window will open where you will need to fill in some details. Under Prepare template, the default selected option is Template is ready.
- In the Amazon S3 URL field, a default value will already be populated (for security reasons, the value has been deleted in this description).
- Click Next.
Now that the basic details have been entered, there are a few more details that need to be filled in.
- The Stack Name can be anything.
- In the IpAddressRange field, enter 0.0.0.0/0 to allow any machine to access the link. You can also provide a narrower IP range to restrict access.
- The HistoryServerPort field has a default value of 18080.
- In the EventLogDir field, enter the path where the Spark history logs are stored.
- The SparkPackageLocation field has a default value that will be automatically populated, so you can leave it as is.
- The KeyStorePath field is optional and can be left empty.
- In the KeyStorePassword field, enter a password that you will remember.
- In the InstanceType field, select the desired instance type; a micro or small instance will usually be sufficient.
- In the VpcId field, select the VPC ID where all the Glue jobs are running.
- In the SubnetId field, select the subnet where all the Glue jobs are running.
- Click Next.
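If you prefer to script the stack instead of filling in the form, the fields above map onto CloudFormation stack parameters. The sketch below only prepares the parameter list in the key/value shape that CloudFormation APIs accept, and checks whether an IP range is open to everyone. The helper names are hypothetical, every value is a placeholder, and the `s3a://` scheme shown for EventLogDir is an assumption to verify against the template's own parameter description.

```python
import ipaddress


def build_stack_parameters(values: dict) -> list:
    """Convert {name: value} into CloudFormation's list-of-dicts parameter shape.

    Entries set to None are dropped so the template's default applies,
    as with SparkPackageLocation in this walkthrough.
    """
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
        if value is not None
    ]


def is_open_to_everyone(cidr: str) -> bool:
    """Return True when the CIDR range covers every address (prefix length 0)."""
    return ipaddress.ip_network(cidr).prefixlen == 0


if __name__ == "__main__":
    # All values are illustrative placeholders, not real resources.
    values = {
        "IpAddressRange": "203.0.113.0/24",  # narrower than 0.0.0.0/0
        "HistoryServerPort": 18080,
        "EventLogDir": "s3a://my-example-bucket/spark-event-logs/",  # scheme is an assumption
        "SparkPackageLocation": None,  # keep the template default
        "KeyStorePath": "",  # optional, left empty
        "KeyStorePassword": "change-me",
        "InstanceType": "t3.small",  # hypothetical; use a type the template allows
        "VpcId": "vpc-0123456789abcdef0",
        "SubnetId": "subnet-0123456789abcdef0",
    }
    if is_open_to_everyone(values["IpAddressRange"]):
        print("Warning: the history server will be reachable from any address")
    for parameter in build_stack_parameters(values):
        print(parameter)
```

Checking the range before launching is a small safeguard: 0.0.0.0/0 is convenient for a quick test, but a range limited to your own network keeps the Spark UI from being publicly reachable.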
Step 4: Stack Configuration
In the Configure Stack section:
- The IAM Role field is optional, so you can leave it empty.
- Click Next.
After you click Next, the CloudFormation service will launch an EC2 instance. This may take some time.
- Once the instance is available, select the stack created, then go to Outputs and click on the SparkUiPublicUrl.
- Copy that link and open it in any browser to view the Spark UI.
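The Outputs lookup can also be done programmatically. Here is a minimal sketch, assuming a response shaped like the one CloudFormation's describe-stacks call returns (a list of stacks, each with a list of `OutputKey`/`OutputValue` pairs). The helper `find_stack_output` is hypothetical, and the sample response, including its URL, is made up for illustration.

```python
from typing import Optional


def find_stack_output(response: dict, output_key: str) -> Optional[str]:
    """Return one output value from a describe-stacks-style response, or None."""
    for stack in response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


if __name__ == "__main__":
    # Fabricated sample data; a real response would come from the AWS API.
    sample_response = {
        "Stacks": [
            {
                "StackName": "spark-history-server",
                "Outputs": [
                    {
                        "OutputKey": "SparkUiPublicUrl",
                        "OutputValue": "https://ec2-198-51-100-10.compute-1.amazonaws.com:18080",
                    }
                ],
            }
        ]
    }
    print(find_stack_output(sample_response, "SparkUiPublicUrl"))
```

With boto3, the response would come from `describe_stacks(StackName=...)` on a CloudFormation client, and the `stack_create_complete` waiter can block until the stack has finished launching. Returning None instead of raising makes it easy to poll while the outputs are not yet available.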
Summary
This article is a step-by-step guide to setting up the Spark UI for an AWS Glue job. The Spark UI lets you track the progress of your Glue job and examine logs for errors or issues. The process is straightforward, and the information the Spark UI exposes is useful for debugging, monitoring the job’s performance, and optimizing your AWS Glue job.