AWS EMR Queue Configuration With Capacity Scheduler

Nicolas Rg
Published in Globant
Sep 29, 2023

To start off this article, I want to share an experience configuring a cluster in AWS EMR (Elastic MapReduce). When different teams execute jobs on the same cluster, a single job can, often unintentionally, consume 100% of the cluster's capacity for several days of the month. Queues allow us to segment resources and share capacity according to what each team requires.

Amazon EMR

We’ll begin by introducing the purpose of AWS EMR, a managed service that enables the processing of large amounts of distributed data in clusters. AWS EMR supports big data frameworks such as Apache Spark, Apache Hadoop, and Presto, among others, as well as applications like Apache Hive and Apache Pig. Some of the use cases for this service are machine learning, data mining, and large-scale data processing in general. AWS EMR abstracts infrastructure management, freeing you from the complexities of cluster configuration and maintenance.

Some important architectural components are:

Storage Options

  • Hadoop Distributed File System (HDFS): It is the primary data storage. It allows you to store large datasets across multiple nodes and provides fault tolerance by replicating data blocks across the cluster.
  • EMR File System (EMRFS): A data access layer developed by Amazon that lets EMR clusters read and write data stored directly in Amazon S3 (see the short example after this list).
  • Local File System: Storage on the individual nodes in the EMR cluster.
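
As a small illustration of EMRFS in practice (the bucket name here is a placeholder), the standard Hadoop filesystem shell works directly against S3 paths on an EMR cluster:

# List objects in S3 through EMRFS with the standard Hadoop FS shell
hadoop fs -ls s3://my-example-bucket/data/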

Cluster Resource Manager

  • By default, it uses YARN (Yet Another Resource Negotiator) and allows for the centralized management of cluster resources for various data processing frameworks.
  • Some frameworks use different resource managers.

Node Types

  • Master Node: Manages the cluster and coordinates the distribution of data and tasks among other nodes for processing.
  • Core Node: Executes tasks and stores data in the Hadoop Distributed File System (HDFS) in the cluster.
  • Task Node: Only executes tasks and doesn’t store data in HDFS. Task nodes are optional.
High level — AWS EMR core components | source: Own
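
Once a cluster is running, you can check which of these nodes YARN is actually managing (the core and task nodes run NodeManagers; the master node does not). A quick check, assuming an SSH session on the master node:

# List the NodeManagers (core/task nodes) registered with YARN, in any state
yarn node -list -all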

To see why this matters, consider a company with an AWS EMR cluster available 24/7, running data processing jobs that last from hours to even days, used by different teams. Because there was no strategy to segment the cluster's capacity, some jobs consumed 100% of the cluster's resources, delaying other projects and teams trying to execute their own jobs.

This is where the YARN component, previously mentioned in the EMR architecture, plays a crucial role. YARN is responsible for managing the cluster’s resources and consists of two important components: the ResourceManager, which administers cluster resources, and the NodeManager, which administers resources on each node. Then, we make use of the CapacityScheduler, which allows us to manage the allocation and scheduling of cluster resources.
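
You can also observe the scheduler directly through the ResourceManager's REST API. A minimal sketch, assuming you run it from the master node and that the ResourceManager listens on its default port (8088):

# Return the scheduler type and per-queue capacity/usage as JSON
curl -s http://localhost:8088/ws/v1/cluster/scheduler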

Hadoop — Yarn | source: own

One of the strategies I evaluated that allowed us to plan and coordinate cluster resources is the use of Queues for AWS EMR. These queues define a policy to configure how cluster resources are segmented and what capacity is assigned, among other features that we will mention later. For our example, we will define some queues and discuss the relevant properties that will be assigned to them, allowing us to practically demonstrate their use.

In this example, each queue has the following properties. For the complete list of options, refer to the official Apache Hadoop CapacityScheduler documentation:

yarn.scheduler.capacity.<queue-path>.queues

  • Defines the child queues under a given parent as a comma-separated list; at the top level, the parent is root.

yarn.scheduler.capacity.<queue-path>.capacity

  • Queue capacity as a percentage (%) expressed as a float (e.g., 12.5). This value acts as the queue's guaranteed minimum capacity.
  • The sum of capacities for all queues at each level of the hierarchy must equal 100.
  • However, applications in the queue may consume more resources than the queue's capacity if free resources are available, providing elasticity.

yarn.scheduler.capacity.<queue-path>.maximum-capacity

  • Maximum queue capacity as a percentage (%) expressed as a float. This caps how far a queue's elasticity can stretch beyond its guaranteed capacity.

The next diagram shows an example of three queues, each assigned a capacity according to technical and business requirements, and how they consume the cluster's capacity:

EMR Cluster capacity | source: Own

YARN queue properties represent an extensive field with numerous options for configuring elements like queue prioritization, assigning users to each queue, specifying the maximum number of concurrently executing applications in each queue, and more. There are many options available for customizing based on your use case.
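
As a hedged sketch of what this looks like (the property names come from the Hadoop CapacityScheduler documentation; the values and the group name are illustrative), such entries go into the same capacity-scheduler Properties map used below:

"yarn.scheduler.capacity.root.sla-high.maximum-applications": "50",
"yarn.scheduler.capacity.root.sla-high.user-limit-factor": "2",
"yarn.scheduler.capacity.root.sla-high.minimum-user-limit-percent": "25",
"yarn.scheduler.capacity.root.sla-high.acl_submit_applications": "data-eng-group"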

To apply these changes, we will modify an actual running cluster to include the queues and their respective properties using the following JSON file:

{
  "ClusterId": "j-XXXXXXXXXXXX",
  "InstanceGroups": [{
    "InstanceGroupId": "ig-XXXXXXXXX",
    "Configurations": [{
      "Classification": "capacity-scheduler",
      "Properties": {
        "yarn.scheduler.capacity.root.queues": "sla-low,sla-medium,sla-high",
        "yarn.scheduler.capacity.root.sla-low.capacity": "10",
        "yarn.scheduler.capacity.root.sla-low.maximum-capacity": "20",
        "yarn.scheduler.capacity.root.sla-medium.capacity": "30",
        "yarn.scheduler.capacity.root.sla-medium.maximum-capacity": "40",
        "yarn.scheduler.capacity.root.sla-high.capacity": "60",
        "yarn.scheduler.capacity.root.sla-high.maximum-capacity": "80"
      },
      "Configurations": []
    }]
  }]
}

The following command will modify the cluster configuration using the previous JSON file:

aws emr modify-instance-groups --cluster-id j-XXXXXXXXX --cli-input-json file:///path_json_file/queue_config.json

Note the elasticity here: even if a queue's capacity is set to, for example, 60%, it can consume up to its maximum-capacity of 80% when other queues leave resources unused.

Now, when you describe the cluster through the CLI with the following command, you can see that the configuration changes have been applied:

aws emr describe-cluster --cluster-id j-XXXXXXXXX
EMR cluster properties | source: screenshot
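
If you only need the queue configuration rather than the full cluster description, you can filter with the CLI's generic JMESPath --query option (this assumes the configuration was applied at the instance-group level, as in this example):

# List only the configurations currently applied to each instance group
aws emr list-instance-groups --cluster-id j-XXXXXXXXX \
  --query 'InstanceGroups[].Configurations'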

By performing this configuration and assigning specific capacity and complementary properties to each queue, we can ensure that each one runs with a certain percentage of capacity, allowing for greater cluster utilization based on job priority or team-specific queues, depending on the use case.
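
Jobs then opt into a queue at submission time. As an example with Spark (the --queue option is standard for spark-submit on YARN; the class and JAR names here are placeholders):

# Submit a Spark application to the sla-high queue
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue sla-high \
  --class com.example.MyJob \
  my-job.jar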

When a job is executed, you can view its details in the Application History section of the YARN Timeline Server console:

YARN Timeline Server | source: screenshot

Another interesting view is the one below, where you can see the queue's capacity progress bar:

YARN Timeline Server — Job | source: screenshot

Alternatively, through the command line, the queue status allows us to see the current capacity versus the maximum capacity:

  • yarn queue -status sla-medium
Yarn status | source: screenshot
  • mapred queue -list
Mapred queue list | source: screenshot
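
For reference, yarn queue -status prints output of roughly this shape (the values below are illustrative, not taken from a real cluster):

Queue Name : sla-medium
State : RUNNING
Capacity : 30.0%
Current Capacity : 10.0%
Maximum Capacity : 40.0%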

Tips when working with queues

  • Queues are one strategy to isolate workloads so they don't interfere with each other.
  • Queue prioritization matters because the ResourceManager takes it into account when allocating resources.
  • Monitor queue execution to confirm each queue performs as expected.
  • Properties such as user limits ("minimum-user-limit-percent", "user-limit-factor") or "maximum-applications" can keep a single user or application from monopolizing a queue and help ensure fair resource sharing (a note on applying such changes follows below).
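
One operational note on that last tip: on EMR, the reconfiguration flow shown earlier applies queue changes for you, but on a self-managed Hadoop cluster you would edit capacity-scheduler.xml and then reload the queue definitions with a standard admin command, without restarting YARN:

# Reload CapacityScheduler queue definitions on a running ResourceManager
yarn rmadmin -refreshQueues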

Summary

Setting up queues for AWS EMR using the Capacity Scheduler of the YARN component is a procedure that includes creating a queue configuration in a JSON file and adjusting EMR instance groups using the aws emr modify-instance-groups command. It’s important to distribute resources among jobs and queues to ensure optimal performance and efficiency for EMR, and the Capacity Scheduler provides an effective solution for achieving this objective.

Keep in mind that by default, AWS EMR uses Hadoop YARN as a robust resource manager that can be customized for each specific need. Every use case has different cluster requirements, and the Capacity Scheduler is responsible for allocating resources in the cluster, in this case sharing them efficiently. The scheduler plays an important role; its primary purpose is to ensure that each queue receives its share of the cluster's resources, and it allows administrators to configure queues with resource allocations, priorities, and limits. This configuration is very useful in multi-tenant scenarios, where different teams or users share the same AWS EMR cluster and need resources allocated fairly on a day-to-day basis.
