Autonomous IronWorker: Phase 1

Reed Allman
Iron.io Technical Blog
Feb 6, 2017

Like any good band of programmers, one of our goals internally is to automate every part of our product so that we can play Overwatch while money falls from the sky. Since IronWorker's inception, though, we've been manually matching infrastructure to demand. In practice that meant alerts for high queue times in whatever alerting system we were using at the time, and whichever poor soul looked at their phone first had to log on to our internal deployer tool and launch some servers (and then make sure they actually launched.. :stare:). Yes, it was bad, and we feel bad. But this seemed like something easy enough to replace with computers…

Enter autoscale: step 1 of submission to the robot overlords.

We had some interesting requirements for scaling our job-running infrastructure, internally coined ‘runners’ (the processes that actually run user tasks). We'll go over the requirements, challenges, and design of the system, and then some perrty charts to show the results.

The most obvious requirement is determining current load and turning it into a number we can equate to N runners. On the surface this seems easy, but there are two factors to consider: the number of busy runners, and the number of queued tasks (which may be zero). This is a little different from traditional job systems that schedule on resource capacity (CPU/RAM/etc.), where jobs specify their resource requirements and are placed based on ‘fit’; we use raw task counts mostly because internally we limit and sell the product based on concurrency, as opposed to resources.
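To make that concrete, here's a tiny Go-flavored sketch of the signal we're talking about; the type and field names are purely illustrative, not our actual code:

// A hypothetical sketch of the concurrency-based signal described above:
// one sample per cluster is just two counts. A fit-based scheduler would
// instead need per-job resource requests (CPU, RAM, and so on).
type Sample struct {
	Busy   int // runners currently executing a task
	Queued int // tasks waiting for a runner, may be 0
}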

Once we have our samples of running and queued, we have a raw number to base scaling off of. However, making decisions from any one sample of data is always dicey, since it could result in us scaling up and down frequently and in turn costing us a lot of $. So we decided to use an exponentially weighted moving average (ewma for short) to smooth our most recent samples and avoid spinning servers up and down wildly. Netflix has a great wiki page about the technique, with illustrative pictures, here: https://github.com/Netflix/atlas/wiki/DES. Much like the illustrations on that page, we effectively draw a channel around the current number of runners, and if the ewma dips below the bottom of the channel, we kill enough runners to put us back inside it. This means we're effectively using percent of capacity to scale up or down, which is adaptive. To throw queued into the mix we just add log(queued) to the capacity in use; previous attempts to give queued a larger weighting were expensive (in the monetary sense), and this handles the n=0 case well enough.
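To make the smoothing concrete, here's a minimal Go sketch of the scale-down side under some stated assumptions: the function names are made up, the smoothing factor is left as a parameter, and we read "add log(queued) to capacity" as folding log(queued) into the in-use side of the load (log1p so that queued = 0 contributes nothing). The lo and hi channel bounds correspond to the logMin and logMax columns in the table below.

package autoscale

import "math"

// ewma folds one new sample into the running average; alpha controls how
// heavily recent samples are weighted (the value used in production isn't
// stated in this post).
func ewma(prev, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

// loadSample is the raw number we smooth: busy runners plus log(queued).
func loadSample(busy, queued int) float64 {
	return float64(busy) + math.Log1p(float64(queued))
}

// runnersToKill is the scale-down decision: if the smoothed load dips
// below the bottom of the channel drawn around current capacity, kill
// enough runners that the load sits back at the top of the channel.
// lo and hi are fractions of capacity (logMin/logMax in the table below).
func runnersToKill(avg float64, capacity int, lo, hi float64) int {
	if avg >= lo*float64(capacity) {
		return 0 // inside (or above) the channel, nothing to kill
	}
	newCapacity := int(math.Ceil(avg / hi)) // capacity where avg sits at the top of the channel
	return capacity - newCapacity
}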

Scaling up is a little more complex. The ewma is bounded by capacity, so adding one server's worth of runners would typically be enough to get back to equilibrium if we did it the same way as scaling down, but in many cases that just means hitting capacity again immediately and adding one server at a time, over and over. With that strategy it takes a long time to get through all the queued tasks, so we had to try to be a little clever. What we did was draw a separate log curve, based on the number of current runners, that gives us a reasonable number of new servers when we're at capacity. Where with scaling down we can just pick the point at the top of the channel, with scaling up we want to pick a point far below the bottom of the channel when we're at capacity. This is probably best shown with a table, where L, H & T should all be interpreted as percentages:

// n      L     H     T
// 1 0.451 0.727 0.100
// 10 0.583 0.792 0.100
// 100 0.715 0.858 0.377
// 500 0.808 0.903 0.591
// 1000 0.847 0.923 0.683
// 5000 0.940 0.969 0.897
// 10000 0.980 0.988 0.989
// 100000 0.998 0.999 0.997
//
// legend: n=current runners L=logMin H=logMax T=logTarget

Some sample data: if n=10 and our ewma is at 100% of capacity, we take 10/0.1 to get the number of runners we want, which is 100. If there are 10 runners per machine, that means adding 9 machines to reach our target. The reason for the log scale is so that at 10000 runners we don't add 90000 runners; instead it's 10000/0.989 ≈ 10111, or 111 new runners. We admittedly don't have any one cluster over 2000 runners, so we're unsure about that particular case. In any event, there's an obvious hole here: with our log target we may end up launching way too many machines whenever we need to scale up. To dampen this, we take the percentage we are over the log max in relation to n and multiply the target output by that. For example, with n=10, avg=9, H=0.79 and a target of 100: 9/10 = 0.9, 1 - 0.9 = 0.1, 1 - 0.79 ≈ 0.2, 0.1/0.2 = 0.5, and 0.5 * 100 = 50 new runners. This handles cases where we're not that far over capacity, so we don't add too many servers. To see if we got all these crazy curves and numbers even remotely right, we (we being Nick) ended up building a discrete event simulator, since launching and killing instances takes forever! But that's content for another blog post.
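Here's a minimal Go sketch of the scale-up side to tie the table and the worked example together. It takes the H and T fractions as inputs (how they're derived from n is elided above), reads "percentage we are over the log max" as the fraction of the gap between logMax and 100% of capacity that the smoothed load has covered, and applies the dampening to the runners being added rather than to the total target. The shorthand arithmetic above is ambiguous on those last two points, so treat this as one plausible reading that lands near the ~50 new runners in the example.

package autoscale

import "math"

// runnersToAdd is the scale-up decision. n is the current number of
// runners, avg is the smoothed load, and hi/target are the logMax (H)
// and logTarget (T) fractions for n from the table above.
func runnersToAdd(n int, avg, hi, target float64) int {
	c := float64(n)
	if avg <= hi*c {
		return 0 // still inside the channel, nothing to add
	}
	// If we were pinned at 100% capacity we'd want n/T runners in total,
	// e.g. 10/0.1 = 100 for n=10, or 10000/0.989 ≈ 10111 for n=10000.
	want := c / target
	// Dampen by how far past logMax the load is, relative to the gap
	// between logMax and 100% of capacity: near 0 just past logMax,
	// exactly 1 when pinned at capacity.
	frac := math.Min((avg/c-hi)/(1-hi), 1)
	// With n=10, avg=9, H=0.792, T=0.100 this gives roughly
	// (100-10)*0.52 ≈ 47 new runners, the same ballpark as the
	// back-of-the-envelope 50 above.
	return int(math.Ceil((want - c) * frac))
}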

The scaling thresholds took some fiddling to get right, but overall the concept is pretty simple. The other big, obvious requirement is that we need to safely terminate instances only when they have finished processing all of their jobs. This is one of the reasons it would be challenging to use AWS autoscaling groups: it might be possible, but we want load-based scaling, we don't want to kill instances with jobs running on them, and we want to support multiple clouds anyway. Safe termination means we need to track which instances we have up, which we are killing, which we can terminate (or already have), and which have jobs on them. This ended up being a pretty simple state machine for the most part, but it also means the runners have to report how many tasks they are currently running. When we need to scale down, we mark one or more instances as dying, stop giving them tasks, and wait for all their jobs to complete. This allows for one nice trick: if we need to scale back up we can just turn a dying instance back on, saving the cost of adding a new instance. To make that possible we leave instances up for a few minutes even when their runners are drained (since we're already billed for an hour on AWS). We also added a hook so that we can terminate instances at will, which was infinitely useful when we were manually managing instances and still proves useful sometimes. This description elides some of the nitty gritty, like what happens when a runner gets partitioned from our scaling service (we handle it!) or when a job runs for 4 hours while its instance is being killed, but this post covers the basics.
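Here's a heavily simplified Go sketch of that instance lifecycle. The states and the dying-to-alive revival come from the description above; the type and method names are hypothetical, and all the distributed nitty gritty (partitions, long jobs, billing windows) is left out.

package autoscale

import "errors"

// InstanceState is a simplified version of the lifecycle described above.
type InstanceState int

const (
	Alive      InstanceState = iota // receiving tasks
	Dying                           // no new tasks, draining running jobs
	Terminated                      // runners drained, instance killed at the provider
)

// Instance tracks what the scaler needs to know about a machine: its
// state and how many tasks its runners report as currently running.
type Instance struct {
	State        InstanceState
	RunningTasks int // reported by the runners themselves
}

// MarkDying stops routing tasks to the instance and lets jobs drain.
func (i *Instance) MarkDying() { i.State = Dying }

// Revive flips a dying instance back to alive instead of launching a new
// one, the trick mentioned above that saves launching a fresh instance.
func (i *Instance) Revive() error {
	if i.State != Dying {
		return errors.New("only dying instances can be revived")
	}
	i.State = Alive
	return nil
}

// TryTerminate only succeeds once the runners report zero running tasks,
// which is the safe-termination requirement.
func (i *Instance) TryTerminate() bool {
	if i.State == Dying && i.RunningTasks == 0 {
		i.State = Terminated
		return true
	}
	return false
}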

The final requirement is that this needs to manage infrastructure for our customers as well as ourselves. To take care of that, all of the credentials, thresholds, and parameters can be specified through a simple CRUD cluster API on our worker service. As long as the runners have outbound network access, they can pull down jobs and our scaling service can handle provisioning and deprovisioning them. It's also possible to run statically sized clusters, where the scaling service simply makes sure a certain number of runners is always maintained. We now have multiple customers using Hybrid this way (scaling infrastructure in their own AWS accounts), and we ourselves now run almost every one of our clusters on this autoscaling infrastructure.
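As a rough illustration, the per-cluster record behind that CRUD API might look something like the sketch below; the field names are hypothetical, not our actual schema. Setting the minimum and maximum to the same value gives the statically sized behavior mentioned above.

package autoscale

// Cluster is a hypothetical sketch of the per-cluster record managed
// through the CRUD API described above (illustrative fields only).
type Cluster struct {
	Name           string
	CloudProvider  string // e.g. "aws"; multiple clouds are supported
	Credentials    string // customer cloud credentials used to launch and kill instances
	RunnersPerHost int

	// Scaling knobs.
	MinRunners        int // set MinRunners == MaxRunners for a statically sized cluster
	MaxRunners        int
	DrainGraceMinutes int // how long to keep dying instances around for cheap revival
}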

At last, the money shot:

This chart should be pretty easy to interpret: the blue line is all runners, alive and being killed; the green line is alive runners only; ‘busy’ is the runners actually working; and you can see the ewma dampening the working spikes and drops a little (it needs to be a little slower, we're working on it!). Previously we ran between 1200 and 1500 runners statically in the ‘default’ cluster, until either an alert told us we needed more runners or somebody checked this chart and saw that we could kill 100 or so. Now we rarely pay attention to runner counts, and customers don't have to deal with increased queue times either, since the system responds relatively quickly to load. At night when traffic is low we often end up running only 500 runners, saving about 50% of the cost; in practice the total amortized savings have been about 40% for this specific cluster. In conjunction with some other changes, autoscaling has let us reduce our infrastructure bill drastically over the last few months. It's apparent from this chart that there are still many optimizations to be made for autoscaling alone, but we have come a long way from the days of the people scaler.

With this new knowledge in hand, if you want to adjust your cluster, or want to set up an IronWorker cluster on our infrastructure or yours, please reach out to support@iron.io so we can get you set up with some sweet, sweet scaling. We will also be offering this functionality for IronFunctions; please reach out if you are interested!
