Mesos: A cluster scheduler for diverse frameworks

Ameya · Coinmonks · Oct 2, 2018

Introduction

In companies and research groups working with large compute loads or datasets, many applications run on different application frameworks such as MapReduce, Spark, Dryad, and Pregel. No single framework is a panacea. In the presence of such diverse workloads, it is important that the different frameworks can share the resources of the cluster efficiently. One traditional approach has been to reserve capacity by statically partitioning the cluster or by allocating a certain number of VMs to each framework. Such allocation is not optimal and leads to either overloaded or underutilized cluster resources.

To utilize cluster resources efficiently, Mesos offers a fine-grained resource scheduler that monitors current cluster resource usage and offers resources to these frameworks as they become available. Mesos is highly scalable: it can handle tens of thousands of nodes and tasks in a cluster, along with their resources. It is also fault tolerant, since critical applications depend on it. Mesos is designed to support multiple different frameworks such as MapReduce, MPI, and Spark.

Design Approaches

One approach to such a cluster scheduler would be a centralized scheduler that gathers the resource requirements of all the frameworks and computes an optimal policy across them. While this seems promising, it is a very hard problem to solve, which would hurt the scalability of such a scheduler. Another issue is that new frameworks are evolving all the time, and it is not clear that an optimal scheduler can be built that takes their future needs into account.

Instead, Mesos uses a decentralized scheduling approach built on a technique called the “resource offer”. The Mesos scheduler decides how many resources of each type it can hand out to a given framework based on some high-level organizational policy. The framework can accept or reject the resources offered. If the framework accepts the offer, it can schedule its own tasks on those resources. While this may not produce the most optimal schedule, it scales well and performs near-optimally while accounting for concerns like data locality. It also allows framework builders to create specialized frameworks and schedulers that perform well for their workloads.
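To make the accept/reject decision concrete, here is a minimal sketch in plain Python (not the real Mesos API; the `Offer`, `Task`, and `handle_offer` names are mine) of a framework scheduler greedily taking the pending tasks that fit inside an offer:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    node: str
    cpus: float
    mem_gb: float

@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float

def handle_offer(offer: Offer, pending: list[Task]) -> list[Task]:
    """Greedily accept the pending tasks that fit inside the offer."""
    accepted = []
    cpus, mem = offer.cpus, offer.mem_gb
    for task in pending:
        if task.cpus <= cpus and task.mem_gb <= mem:
            accepted.append(task)
            cpus -= task.cpus
            mem -= task.mem_gb
    return accepted  # returning an empty list amounts to rejecting the offer
```

Reporting the accepted tasks back to the master is what "taking the offer" means; a real framework would also attach its own placement logic here.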

The classic workload Mesos was originally built for consisted of many small tasks and some long tasks; the median job duration is approximately 30 seconds. Traditionally, such clusters at Facebook and Yahoo were used for loading log data and running interactive queries or map-reduce jobs on it.

High-level architecture

Figure: the Mesos master makes resource offers, and framework schedulers use them to deploy tasks onto cluster nodes, where the framework executors and Mesos slaves run.

Mesos consists of slave processes that run on each node in the cluster and monitor local resource usage. A Mesos master process is connected to these slaves. Each framework has its own scheduler and executor; executors run on the same nodes as the slaves. The framework scheduler decides which tasks should run on which nodes and hands that assignment to the master. The master then uses the slaves and executors to run the jobs on the nodes as assigned by the framework scheduler. Which framework receives a resource offer first depends on organizational policy, such as whether resources should be assigned fairly or by some priority-based mechanism. These different policies can be plugged into the Mesos master via an allocation module.

Figure: resource offer sequence from slave → allocation module → framework → master → slave.

In the diagram above, mesos-slave1 notices that node 1 has 4 CPUs and 4 GB of memory available and reports this to the master. The master (based on the organizational policy) offers all of these resources to framework1. Framework1 currently has only two tasks it wants scheduled: task1 needs 2 CPUs and 1 GB of memory, while task2 needs 1 CPU and 2 GB of memory. The framework informs the master, which assigns these resources to slave1; slave1 reserves them and informs the framework executor, which launches the two tasks framework1 is interested in. There is still 1 CPU and 1 GB left, which the master can offer to framework2, along with any resources that become available from slave2.
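Replaying the numbers from this example with the earlier `handle_offer` sketch (same illustrative types, not the real API):

```python
offer = Offer(node="node1", cpus=4, mem_gb=4)
pending = [Task("task1", cpus=2, mem_gb=1), Task("task2", cpus=1, mem_gb=2)]

launched = handle_offer(offer, pending)
assert [t.name for t in launched] == ["task1", "task2"]

leftover_cpus = offer.cpus - sum(t.cpus for t in launched)     # 1 CPU left
leftover_mem = offer.mem_gb - sum(t.mem_gb for t in launched)  # 1 GB left
# The master can now offer this leftover (1 CPU, 1 GB) to framework2.
```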

The question that remains is how frameworks can exploit data locality, since the master is unaware of what each framework needs for optimal placement. This is achieved by having the framework scheduler reject a resource offer if accepting it would be far from optimal or would hurt locality drastically. Frameworks also use delay scheduling, waiting some time for preferred nodes to become available, which achieves near-optimal locality.
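A minimal sketch of delay scheduling, assuming the framework knows which nodes hold its input data (the class name and the 5-second default are illustrative):

```python
import time

class DelayScheduler:
    """Hold out for preferred (data-local) nodes for up to max_wait
    seconds, then take whatever node is offered."""

    def __init__(self, preferred_nodes: set[str], max_wait: float = 5.0):
        self.preferred = preferred_nodes
        self.max_wait = max_wait
        self.waiting_since = time.monotonic()

    def should_accept(self, node: str) -> bool:
        if node in self.preferred:
            return True  # data-local offer: accept immediately
        waited = time.monotonic() - self.waiting_since
        return waited >= self.max_wait  # patience exhausted: accept anyway
```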

Resource offers need to be scalable. This is achieved by implementing filters at the master: frameworks can specify filters such as “only give me nodes from a certain list” or “ignore nodes that don’t have x”. In addition, to help the master scale, it incentivizes frameworks to respond quickly to its offers, either by rescinding offers that take too long to be confirmed or by counting unacknowledged resource offers toward the framework’s usage.
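A sketch of how such filters could look as simple predicates the master evaluates before sending an offer, reusing the `Offer` type from the earlier sketch (this API shape is hypothetical; real Mesos filters are part of its protocol):

```python
def make_filters(allowed_nodes: set[str], min_cpus: float):
    """Build the two example filters from the text as predicates."""
    def node_filter(offer: Offer) -> bool:
        return offer.node in allowed_nodes  # "only nodes from this list"

    def size_filter(offer: Offer) -> bool:
        return offer.cpus >= min_cpus       # "ignore nodes that don't have x"

    return [node_filter, size_filter]

def master_should_offer(offer: Offer, filters) -> bool:
    # The master skips offers that any registered filter rejects.
    return all(f(offer) for f in filters)
```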

Allocation module and Revocations

As mentioned earlier, the allocation module is a pluggable entity in the master process, so organizations can tailor it to their needs. The two default choices are fair sharing and strict priorities.
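As a sketch of what a fair-sharing allocation module might compute: Mesos' fair-share policy is based on dominant resource fairness, under which the master offers next to the framework whose dominant share is smallest (the capacities, usages, and names below are illustrative):

```python
TOTAL = {"cpus": 100.0, "mem_gb": 400.0}  # hypothetical cluster capacity

def dominant_share(usage: dict[str, float]) -> float:
    """A framework's dominant share is its largest fractional use of any resource."""
    return max(usage[r] / TOTAL[r] for r in TOTAL)

def next_framework_to_offer(usages: dict[str, dict[str, float]]) -> str:
    """Offer next to the framework whose dominant share is smallest."""
    return min(usages, key=lambda fw: dominant_share(usages[fw]))

usages = {
    "fw1": {"cpus": 30.0, "mem_gb": 20.0},  # dominant share 30% (CPU)
    "fw2": {"cpus": 5.0, "mem_gb": 100.0},  # dominant share 25% (memory)
}
assert next_framework_to_offer(usages) == "fw2"
```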

With the help of delay scheduling, most short tasks don’t have to wait long; they generally get resources after a short wait. It is possible, though, for long-running tasks or buggy code to block the short tasks. In such cases, the master can kill long-running processes after giving them a grace period. The grace period helps long-running (non-buggy) processes by giving them time to set up for recovery, so they can resume when they come back up again.

Generally, short tasks are not too perturbed by revocations. In addition, each framework can ask for a guaranteed allocation: as long as the framework is under this allocation, none of its tasks will be terminated; if the framework is above the limit, any of its tasks can be killed. This simplistic scheme can be improved by assigning priorities to a framework’s tasks and killing lower-priority tasks first until usage falls below the allocation limit.
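A sketch of this revocation rule with priorities (all names are illustrative, not the Mesos implementation):

```python
from dataclasses import dataclass

@dataclass
class RunningTask:
    name: str
    cpus: float
    priority: int  # lower number = lower priority

def tasks_to_revoke(tasks: list[RunningTask], used_cpus: float,
                    guaranteed_cpus: float) -> list[RunningTask]:
    """Pick victims, lowest priority first, until usage is back under
    the framework's guaranteed allocation."""
    victims = []
    if used_cpus <= guaranteed_cpus:
        return victims  # under the guarantee: nothing may be killed
    for task in sorted(tasks, key=lambda t: t.priority):
        victims.append(task)  # the master kills each victim after a grace period
        used_cpus -= task.cpus
        if used_cpus <= guaranteed_cpus:
            break
    return victims
```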

Mesos master fault tolerance

Mesos slaves are already distributed and need local recovery mechanisms when they die. Frameworks are also notified of failures on slaves and executors.

The main central point of failure is the master. The master uses ZooKeeper for fault tolerance: a hot standby (along with other standby masters) is registered with ZooKeeper and takes over as soon as ZooKeeper notices the master failure. In addition, the master is designed to operate with soft state, i.e. it can reconstruct its state from periodic messages sent by the slaves and the framework schedulers.
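A sketch of the soft-state idea: the newly elected master starts empty and rebuilds its view purely from re-registration messages, so no durable master state is required (the class and message names are illustrative, not the Mesos protocol):

```python
class NewMaster:
    """State is rebuilt entirely from what slaves and schedulers report."""

    def __init__(self):
        self.slaves = {}      # slave_id -> available resources
        self.frameworks = {}  # framework_id -> set of running task ids

    def on_slave_reregister(self, slave_id, resources, running_tasks):
        self.slaves[slave_id] = resources
        for framework_id, task_id in running_tasks:
            self.frameworks.setdefault(framework_id, set()).add(task_id)

    def on_scheduler_reregister(self, framework_id):
        self.frameworks.setdefault(framework_id, set())
```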

Framework schedulers can also die, so frameworks can register failover schedulers with the master, letting operations continue seamlessly. But frameworks need to figure out for themselves how to share state among those schedulers.

Defining and measuring Mesos’ effectiveness

The authors discuss what kinds of workloads Mesos might have to schedule and how to measure the effectiveness of scheduling in those cases. The workload types the authors considered were: workloads that can scale up or down elastically; workloads of different durations (short, medium, long); and rigid workloads that must wait for a certain number of resources to become available before running.

In the presence of such workloads, Mesos measures the following metrics:

  1. Framework ramp-up time: the time it takes for a framework to achieve its allocation.
  2. Job completion time: how long jobs take to complete.
  3. Cluster utilization: how occupied versus idle the cluster is.
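As a small sketch of how the first and third metrics could be computed from task records (ignoring task completions during ramp-up for simplicity; all names are illustrative):

```python
def ramp_up_time(start_events, submit_time, target_cpus):
    """start_events: (start_time, cpus) tuples for one framework.
    Returns time from submission until running CPUs first reach
    the framework's target allocation."""
    running = 0.0
    for t, cpus in sorted(start_events):
        running += cpus
        if running >= target_cpus:
            return t - submit_time
    return None  # never reached its allocation

def cluster_utilization(busy_cpu_seconds, total_cpus, window_seconds):
    """Fraction of available CPU time that was actually used."""
    return busy_cpu_seconds / (total_cpus * window_seconds)
```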

The paper summarizes work from another paper in a table explaining the effect of scheduling on these metrics. While this works reasonably well for homogeneous tasks, there is the question of long tasks versus short tasks: you don’t want long tasks starving short ones. This can be handled by placing a cap on the cluster share that long tasks may use, e.g. no more than 50% of resources will be allocated to long tasks. The master can also use different revocation timeouts for long and short tasks, ensuring that short tasks stay incentivized.
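A sketch of such a cap (the 50% figure comes from the example above; the function is illustrative):

```python
LONG_TASK_CPU_CAP = 0.5  # at most half the cluster's CPUs for long tasks

def may_offer_to_long_tasks(long_cpus_in_use: float, offer_cpus: float,
                            total_cpus: float) -> bool:
    """Extend an offer to a long task only if it keeps long tasks under the cap."""
    return (long_cpus_in_use + offer_cpus) / total_cpus <= LONG_TASK_CPU_CAP
```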

Incentives for frameworks

As mentioned earlier, Mesos gives frameworks some control over their scheduling needs. That makes it important to build Mesos so that frameworks are incentivized toward high utilization. For the tasks a framework wants to launch, here are some incentive structures built into Mesos:

  1. Short tasks: these are incentivized because short tasks find resources quickly and have a lower recovery cost if they are killed.
  2. No reliance on guaranteed allocation: frameworks can start using resources as soon as any of them become available, as opposed to waiting for ALL of them.
  3. Elastic resource usage: frameworks can scale down when the cluster is under pressure without having the whole job killed, and scale up by grabbing resources when they are available.
  4. Accepting only usable resources: a framework should accept only resources it can actually use (e.g. it should decline GPUs it has no use for); otherwise it holds resources that count toward its cluster usage but sit idle, which hurts its own future allocations.

When is distributed scheduling sub-optimal?

As we have seen, Mesos is not a centralized scheduler. When tasks are heterogeneous, Mesos may not be able to bin-pack as effectively as a centralized scheduler that is aware of all the different tasks, which can lead to fragmented use of the cluster. There is also some chance of starvation for large tasks, because as soon as small amounts of resources free up, small tasks grab those slots. This can be prevented by having minimum offer sizes: the master ensures that a resource set of a certain size is available before offering it to frameworks. As for the complexity of framework schedulers, frameworks need to know their own needs anyway, whether they express them to a centralized scheduler or act on the resource offers of a decentralized one; in fact, many schedulers cannot predict task times and need to make online adjustments either way.
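A sketch of a minimum-offer-size rule (the thresholds are illustrative):

```python
MIN_OFFER = {"cpus": 4.0, "mem_gb": 8.0}  # hypothetical thresholds

def ready_to_offer(free: dict[str, float]) -> bool:
    """Withhold a node's resources until they reach the minimum offer
    size, so large tasks are not starved by a stream of tiny offers."""
    return all(free[r] >= MIN_OFFER[r] for r in MIN_OFFER)
```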

Conclusion

Mesos is useful for fine-grained scheduling in clusters, letting different types of frameworks access cluster resources efficiently. Frameworks can scale up and down depending on cluster utilization and use resource offers from the master to schedule their tasks near-optimally. I thought the distributed scheduling approach of Mesos was pretty novel.
