AWS EMR Schedulers Explained

Oleksandr Makarevych
Machine Learning Reply DACH
7 min read · Feb 9, 2023

AWS (Amazon Web Services) is a popular choice for cloud applications due to its scalability, reliability, and security. AWS offers a wide range of services, such as Elastic Compute Cloud (EC2), Simple Storage Service (S3), and Elastic Block Store (EBS), which allow developers to easily build, deploy, and scale their applications. It also provides security features such as virtual private clouds and identity and access management to ensure that data and applications are protected, and its global network of data centers delivers low latency, fast data transfer, and high availability of services. Finally, AWS's pay-as-you-go pricing model means customers pay only for the resources they use, making it a cost-effective platform for cloud applications.

Amazon Elastic MapReduce (EMR) is a fully managed big data processing service in AWS that has come a long way in recent years. It now supports a wide range of big data processing frameworks, including Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and Apache HBase, among others. This allows users to easily run and process big data workloads on a fully managed cluster without the need for complex infrastructure setup and maintenance.
One of the major advancements in EMR is its integration with other managed services, such as Amazon S3, Amazon DynamoDB, and Amazon Kinesis, which lets users store and process large amounts of data in a cost-effective and scalable manner. EMR also provides built-in support for machine learning and data visualization tools such as Amazon SageMaker and Amazon QuickSight, making it easier for data scientists and analysts to gain insights from big data.
EMR also provides an easy-to-use web interface, the EMR console, which lets users set up, configure, and monitor their clusters without command-line or programming skills, so even non-technical users can run big data workloads on EMR. The same operations are also available programmatically through the AWS CLI and SDKs, as sketched below.
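To make this concrete, here is a minimal sketch of launching an EMR cluster programmatically with boto3, the AWS SDK for Python. The region, instance types, release label, and the `my-emr-logs` bucket are illustrative placeholders, and the snippet assumes the default EMR IAM roles already exist in the account.

```python
# Minimal sketch: launching an EMR cluster with boto3 (assumptions noted above).
import boto3

emr = boto3.client("emr", region_name="eu-central-1")  # placeholder region

response = emr.run_job_flow(
    Name="scheduler-demo",
    ReleaseLabel="emr-6.9.0",                      # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,       # keep cluster up between steps
    },
    LogUri="s3://my-emr-logs/",                    # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",             # default EMR instance profile
    ServiceRole="EMR_DefaultRole",                 # default EMR service role
)
print(response["JobFlowId"])                       # the new cluster's ID
```

The `Configurations` argument of `run_job_flow` (omitted here) is where the scheduler settings shown in the following sections would be passed.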

Another important aspect of EMR is its flexible scheduling options. Schedulers play a critical role in managing the allocation of resources in EMR: they help ensure that resources are used efficiently and that jobs are completed in a timely manner, thus providing a stable and reliable environment for running big data workloads. In this article, we will explore the different schedulers available in EMR and their use cases, with a focus on the Hadoop FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.

The Hadoop FIFO Scheduler is the simplest of the three and was Hadoop's original default (recent EMR releases ship with the Capacity Scheduler enabled by default). It is based on the concept of first in, first out: jobs are executed in the order they are received, with the first job in the queue being executed first. This scheduler is simple and easy to use, but it does not provide advanced features such as fine-grained resource allocation or job prioritization. It is best suited for simple, single-user clusters with a small number of jobs, where there is no need for fine-grained control over resource allocation. For example, if you have a small number of batch jobs running on the cluster and you want to ensure that they are executed in the order they are received, you can use the FIFO scheduler.
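As a sketch, the FIFO scheduler can be selected on EMR by overriding the YARN scheduler class through the `yarn-site` configuration classification; the class name is the standard Hadoop one, and the object below would be passed as the `Configurations` argument of `run_job_flow` shown earlier.

```python
# Sketch: switch YARN to the FIFO scheduler via an EMR configuration object.
fifo_configuration = [
    {
        "Classification": "yarn-site",
        "Properties": {
            # Standard Hadoop YARN property selecting the scheduler implementation.
            "yarn.resourcemanager.scheduler.class":
                "org.apache.hadoop.yarn.server.resourcemanager"
                ".scheduler.fifo.FifoScheduler",
        },
    }
]
```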

The Capacity Scheduler, on the other hand, is designed for use in multi-tenant clusters, where multiple users and applications are running on the same cluster. It organizes the cluster into queues, each of which is guaranteed a configurable share of the cluster's resources, so resource requirements are specified at the queue level rather than at the application level. This allows for better resource utilization and isolation between different users and applications. The Capacity Scheduler is particularly useful when multiple teams or departments share the same cluster and you want each team to have guaranteed resources, with no team affected by the others' workloads. For example, if you have a large number of users running Spark jobs on the same cluster, you can use the Capacity Scheduler to ensure that each user gets the resources they need to run their jobs without affecting the performance of other users.
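A minimal sketch of two guaranteed-capacity queues follows; the queue names and the 60/40 split are illustrative, and the properties are standard Capacity Scheduler settings exposed through EMR's `capacity-scheduler` classification. Note that the capacities of sibling queues must add up to 100 percent.

```python
# Sketch: two Capacity Scheduler queues with guaranteed and maximum capacities.
capacity_configuration = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            # Replace the default queue layout with two team queues.
            "yarn.scheduler.capacity.root.queues": "teamA,teamB",
            # Guaranteed share of the cluster per queue (percent, sums to 100).
            "yarn.scheduler.capacity.root.teamA.capacity": "60",
            "yarn.scheduler.capacity.root.teamB.capacity": "40",
            # Elastic upper bound each queue may reach when the cluster is idle.
            "yarn.scheduler.capacity.root.teamA.maximum-capacity": "80",
            "yarn.scheduler.capacity.root.teamB.maximum-capacity": "60",
        },
    }
]
```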

The Fair Scheduler is another popular choice in EMR. It is designed for clusters where many jobs, typically belonging to a single team or department, compete for the same resources. Rather than allocating resources by arrival order or by static reservations, it continuously rebalances them so that, over time, every running job receives an equal share (or a share proportional to its configured weight). The Fair Scheduler also allows configuring minimum and maximum resources for each pool and assigning jobs to pools accordingly. It is particularly useful when a single team or department uses the cluster and you want every job to get a fair share of resources, without any one job taking more than it needs. For example, if you have a large number of Hive queries running on the same cluster, you can use the Fair Scheduler to ensure that each query gets a fair share of resources instead of letting one long-running query starve the others.
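As a sketch, enabling the Fair Scheduler on EMR again means overriding the YARN scheduler class; the pool definitions live in a separate allocation file in the standard Fair Scheduler XML format. The pool names, resource figures, weights, and file path below are all illustrative.

```python
# Sketch: switch YARN to the Fair Scheduler and point it at an allocation file.
fair_configuration = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.scheduler.class":
                "org.apache.hadoop.yarn.server.resourcemanager"
                ".scheduler.fair.FairScheduler",
            # Illustrative path; the file must exist on the master node.
            "yarn.scheduler.fair.allocation.file":
                "/etc/hadoop/conf/fair-scheduler.xml",
        },
    }
]

# Contents of the allocation file referenced above: per-pool minimum and
# maximum resources plus a relative weight (standard Fair Scheduler format).
FAIR_SCHEDULER_XML = """\
<allocations>
  <queue name="etl">
    <minResources>4096 mb,2 vcores</minResources>
    <maxResources>16384 mb,8 vcores</maxResources>
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <minResources>2048 mb,1 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
"""
```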

All these schedulers can be used with the different Hadoop applications, such as Spark, Hive, and Pig, but keep in mind that YARN runs exactly one cluster-level scheduler at a time, so the Capacity and Fair Schedulers cannot be combined on the same cluster. What you can combine is a cluster-level scheduler with an application-level one: for example, when running Spark jobs, you can let the Capacity Scheduler divide cluster resources between users' queues while enabling Spark's own fair scheduling mode, so that jobs within one application share executors fairly. Similarly, when running Hive queries, you can rely on simple FIFO ordering on a small single-user cluster, or submit the queries to Capacity Scheduler queues so that each user gets the resources they need.
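A short sketch of that Spark-level combination follows; the queue name `teamA` and pool name `reports` are illustrative.

```python
# Sketch: cluster-level queueing (YARN) combined with in-application fair
# scheduling (Spark). Queue and pool names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scheduler-demo")
    # Submit the whole application to a YARN queue (Capacity or Fair Scheduler).
    .config("spark.yarn.queue", "teamA")
    # Share executors fairly between concurrent jobs inside this application.
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)

# Jobs issued after this call run in the named in-application pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reports")
spark.range(1_000_000).count()
```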

It’s also worth mentioning that all these schedulers can be fine-tuned and customized to better fit the needs of the application. In the case of the Capacity Scheduler, you can configure the minimum and maximum resources for each queue and set priorities for different queues, so that the most important jobs get the resources they need first. With the Fair Scheduler, you can configure the minimum and maximum resources for each pool and assign jobs to pools accordingly; you can also configure the weight of each pool or user to the same end.
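As an illustration, a few common Capacity Scheduler tuning knobs are shown below; all values are placeholders, and the queue name assumes the `teamA` queue defined earlier.

```python
# Sketch: fine-tuning a Capacity Scheduler queue (illustrative values).
tuning_configuration = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            # Allow one user to exceed the queue's configured capacity
            # by up to 2x when spare resources are available.
            "yarn.scheduler.capacity.root.teamA.user-limit-factor": "2",
            # Guarantee each active user at least 25% of the queue.
            "yarn.scheduler.capacity.root.teamA.minimum-user-limit-percent": "25",
            # Order applications within the queue fairly instead of FIFO.
            "yarn.scheduler.capacity.root.teamA.ordering-policy": "fair",
        },
    }
]
```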

Another important aspect to consider is the security of the cluster when using different schedulers. The FIFO scheduler does not provide any security-related features. The Capacity Scheduler supports access control at the queue level: you can restrict which users and groups may submit applications to, or administer, each queue. The Fair Scheduler similarly lets you set up pools for different users or teams and control who can submit jobs to them.
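A sketch of queue-level ACLs for the Capacity Scheduler follows; the user and group names are placeholders, and the ACL values use the standard "users, a space, then groups" convention.

```python
# Sketch: queue-level access control for the Capacity Scheduler.
# ACL format: "user1,user2 group1,group2" (users, a space, then groups).
acl_configuration = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            # Only members of the (placeholder) analytics group may submit.
            "yarn.scheduler.capacity.root.teamA.acl_submit_applications": " analytics",
            # Only the (placeholder) hadoop-admin user may administer the queue.
            "yarn.scheduler.capacity.root.teamA.acl_administer_queue": "hadoop-admin",
        },
    }
]
```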

In addition, it’s important to take the scalability of the cluster into account when choosing a scheduler. Because the FIFO scheduler processes jobs strictly in order, a single long-running job can block everything behind it, so it is a poor fit when the workload needs to scale. The Capacity Scheduler is designed for multi-tenant clusters and lets each queue declare its resource requirements up front, which helps the cluster scale across teams. The Fair Scheduler continuously rebalances resources between jobs, which helps a single team's growing workload scale without manual intervention.

Finally, monitoring and troubleshooting are also important aspects to consider when using different schedulers. The FIFO scheduler offers little beyond the basic job queue. The Capacity Scheduler and the Fair Scheduler both expose richer information through the YARN ResourceManager web UI and REST API, including the current status of each queue, the resources in use, and the history of submitted applications.
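For instance, the scheduler state can be inspected programmatically through the ResourceManager's REST API; a minimal sketch follows, where `master-node` is a placeholder for your cluster's primary node hostname and 8088 is the default ResourceManager web port.

```python
# Sketch: polling the YARN ResourceManager scheduler endpoint for queue state.
import requests

rm = "http://master-node:8088"  # placeholder hostname, default RM port

# Returns the scheduler type plus per-queue capacity and usage information.
info = requests.get(f"{rm}/ws/v1/cluster/scheduler").json()
print(info["scheduler"]["schedulerInfo"]["type"])
```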

In conclusion, when using Amazon Elastic MapReduce (EMR), it is important to consider the type of scheduler being used to manage the allocation of resources for the tasks and applications running on the cluster. Different schedulers have their own set of features and capabilities, and the choice of a scheduler will depend on the specific requirements of the applications running on the cluster. The Hadoop FIFO Scheduler is best suited for applications that require a straightforward execution order such as batch jobs, data processing jobs, or simple ETL jobs. The Capacity Scheduler is best suited for applications that have many users running on the same cluster, such as Spark jobs, Hive queries, and other big data processing jobs. The Fair Scheduler is best suited for applications that have a single user or team running on the cluster, such as Hive queries, Pig scripts, and other big data processing jobs. Additionally, it’s important to consider the fine-tuning, security, scalability, monitoring, and troubleshooting aspects when choosing the appropriate scheduler for an application. By understanding the capabilities and use cases of each scheduler, you can make an informed decision on which one to use for your big data workloads in EMR, and ensure optimal performance and resource allocation for your cluster.
