Practical Capacity Scheduling with YARN

Or, getting the bugs out of your YARN.

If you have a non-homogenous YARN cluster (like us), you’ll eventually want to start pinning different kinds of work to different kinds of nodes. If you use Hive in particular, you’ll want to reserve capacity for interactive jobs too. This is ‘easy’ with YARN’s CapacityScheduler in the sense that it’s super fiddly and filled with bugs. This post is about how we got it working in the face of a fun bug in how YARN actually calculates the capacity used by various queues.

Yo, wait: what’s a non-homogenous YARN cluster?

It’s a YARN cluster that isn’t homogenous. The classic way of deploying Hadoop is that all your work and storage are performed on nodes which host both YARN Nodemanagers and HDFS DataNodes. This achieves the classic ‘move the computation to the data’ advantage of Hadoop as a system. In a world of completely free computers, a non-homogenous cluster would be a super deep optimization for very special usage patterns.

In a world where computers cost money, homogenous clusters have some disadvantages. If your computational needs grow faster than your data storage needs, a completely homogenous cluster results in a lot of spare (read: wasted) storage capacity and you lose computing capacity because you still need to give the DataNode a processor and RAM. In fact, in a scenario with a very large cluster, you would start to see other weird performance characteristics. If data is very sparse throughout your HDFS cluster, you could actually end up with data further, on average, from the computation than if you concentrated in a subset of the cluster. (This is theoretical — I’ve never had the luxury of being able to buy a thousand unnecessary boxes).

Non-homogenous clusters are particularly common with AWS, because of the pay-as-you go nature of AWS, and the various pricing schemes in play. Using dedicated YARN-only nodes allows us to use spot workers to provision the bulk of the computing capacity we need, and scale it independently of storage. I’ve also heard of places where they also rely on HDFS replication and differential max bids to allow their HDFS cluster to be provisioned on spot instances. I can see how it would work, but it would keep me up at night.

So let’s see your cluster, then.

Our cluster: 10 DataNodes, 23 NodeManagers, of which 10 are colocated with the DataNodes.

What goes where?

We have two main workloads: Hive and Spark. The Hive workload splits down into two further kinds — batch processing and interactive queries.

The key thing to know about Hive is that it very much assumes a homogenous cluster, uses HDFS for intermediate storage, and as a result very much benefits from being run on machines with DataNodes.

The thing about Spark is that it moves data over the network between executors (if on different machines), and can pass data between tasks within a process, or between processes on a single host. Accordingly, Spark only gets a speedup at the start and end of the job (or rather, the start and end points of the job DAG), and even then only if it’s reading to and from HDFS rather than something else.

By pinning our Hive workload to the DataNodes, and keeping Spark workload off them, we’ve managed to reduce network IO in our cluster to 20% of what it was before.

How does it work?

YARN allows nodes to be labelled. If YARN is configured to use the CapacityScheduler, traffic can then be directed to queues; queues are then permitted to use capacity in various node labels. In practice, there are also some fun bugs which affect exactly how the configuration has to be drawn.

Configuring YARN to use the CapacityScheduler

In our setup, we have only two labels: the Data label (for NodeManagers co-located with DataNodes) and the default, or empty, label. The reason we don’t have a dedicated label for the YARN-only nodes is that, well, we did, but then we found that YARN wouldn’t allocate any containers for application masters. That behavior is due to this bug that causesYARN to only calculate the amount of capacity to allocate for application masters against the default label (fixed in the unreleased version 2.8.0). If you don’t have any capacity in the default label, you will get no application masters, which is bad. Separately, if you don’t allocate any capacity to the default queue, traffic which isn’t directed to a queue just goes nowhere. Maybe you want that, maybe you don’t.

The bug has some other consequences, annotated in the gist below. User limits are also calculated against the default queue, so we had to get a bit clever to configure the actual desired limit. The default queue is given 98% capacity, and the other two queues with 1% each. Then, the user limit for each of the Interactive and Noninteractive queues are set to 35 (35 times 1%) which works out at about 25% of actual capacity.

To actually configure node labels, you’ll need the YARN rmadmin command. This involves first creating the labels (floating free in empty space, as it were) and then applying those labels to specific nodes.

Once node labels are applied, you can configure queues to control where workload is actually executed. We have 80% of the Data label allocated for the Noninteractive queue, and 20% for the Interactive queue. In addition, we have the Interactive queue configured such that it can burst out and use up to 100% of the capacity of the Data label if it is not being used by the Noninteractive queue.

As to how this is specifically implemented, see the gist appended below. Comments call out the magic, such as it is.

Interested in solving problems like this? Work with Marcin on Handy’s Platform Team, or Dave (who did most of the real work figuring out YARN) on the Dev Ops Team.

Capacity Scheduler Configuration