Puma Bonus: Speedy Autoscaling

Guy Callaghan · Published in FiNC Tech Blog · Mar 26, 2019 · 4 min read

In my previous post about switching to Puma I promised some bonus information. The switch to Puma led us to an unexpected benefit for our autoscaling.

Our SRE team covered our autoscaling setup in this interesting article. I won’t repeat all the details, but to summarise: we use NewRelic’s App Instance Busy metric. Our app publishes this data to NewRelic, then an AWS Lambda function polls the NewRelic API to fetch the value and publishes it to CloudWatch. Finally, a CloudWatch alarm triggers if the metric breaches a threshold, and that alarm triggers the scale-out of more containers.

This setup works well, but there is a fair amount of latency. That latency forces us to set our scale-out thresholds low so we can try to preempt any high traffic. Trying to preempt is not ideal: it means we sometimes scale out when there is no need, and at peak traffic we end up scaled out far further than necessary. Finally, the high latency means we cannot react to short spikes in load, so we have to run our baseline slightly over-provisioned to absorb them.

The switch to Puma created a problem with our scaling process and ultimately led to an even better solution.

Puma Stats

When load testing with Puma we noticed NewRelic was no longer correctly reporting the App Instance Busy metric.

The numbers were much higher than we would expect from the number of concurrent clients I was running in my load tests¹. We needed to fix the problem or find a better source of data.

I scoured NewRelic’s forum for information without luck, so my first instinct was to use Puma’s control/status server and poll it for data. However, I noticed that an API for this data, Puma.stats, was added in Puma 3.12.0. The stats give a detailed view of the current state of a Puma cluster:

{
  "workers": 2,
  "phase": 0,
  "booted_workers": 2,
  "old_workers": 0,
  "worker_status": [
    {
      "pid": 1936,
      "index": 0,
      "phase": 0,
      "booted": true,
      "last_checkin": "2018-11-27T05:58:27Z",
      "last_status": {
        "backlog": 0,
        "running": 5,
        "pool_capacity": 5,
        "max_threads": 5
      }
    },
    {
      "pid": 1940,
      "index": 1,
      "phase": 0,
      "booted": true,
      "last_checkin": "2018-11-27T05:58:27Z",
      "last_status": {
        "backlog": 0,
        "running": 5,
        "pool_capacity": 5,
        "max_threads": 5
      }
    }
  ]
}

The data is fairly intuitive to understand. We are interested in the worker_status array: each element corresponds to one Puma process in the cluster. We specifically care about pool_capacity and max_threads. The pool capacity is the number of threads a process has free for handling requests; max threads tells us the maximum number of threads the process can use. Some simple maths lets us calculate the utilisation:

(sum_max_threads - sum_pool_capacity) / sum_max_threads
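With the sample output above, both workers report max_threads of 5 and pool_capacity of 5, so sum_max_threads = 10, sum_pool_capacity = 10, and utilisation = (10 - 10) / 10 = 0: the cluster is completely idle. If each worker instead had only 2 threads free, utilisation would be (10 - 4) / 10 = 0.6.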

The code was fairly simple, and I wrapped it up in a gem.
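A minimal sketch of the idea, assuming the PumaCapacityToCloudwatch name used in the config below; the polling interval, metric namespace, and metric name here are illustrative assumptions rather than the gem’s actual API:

require "json"
require "aws-sdk-cloudwatch"

module PumaCapacityToCloudwatch
  POLL_INTERVAL = 10 # seconds (illustrative)

  def self.run
    Thread.new do
      client = Aws::CloudWatch::Client.new
      loop do
        stats   = JSON.parse(Puma.stats) # Puma >= 3.12.0 exposes cluster stats as JSON
        workers = stats.fetch("worker_status", [])
        max_threads   = workers.sum { |w| w.dig("last_status", "max_threads").to_i }
        pool_capacity = workers.sum { |w| w.dig("last_status", "pool_capacity").to_i }

        if max_threads > 0
          # Utilisation as a percentage of busy threads across the cluster
          utilisation = (max_threads - pool_capacity) * 100.0 / max_threads
          client.put_metric_data(
            namespace: "Puma", # illustrative namespace
            metric_data: [{
              metric_name: "Utilisation",
              value: utilisation,
              unit: "Percent",
              storage_resolution: 1 # store as a high-resolution (1-second) metric
            }]
          )
        end
        sleep POLL_INTERVAL
      end
    end
  end
end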

Then, in the before_fork block of config/puma.rb, I called the new gem:

before_fork do
  # Runs once in the Puma master process before workers are forked,
  # so the polling thread can see stats for the whole cluster
  PumaCapacityToCloudwatch.run if defined?(PumaCapacityToCloudwatch)
end

Highly Reactive

This lets us simplify our autoscaling metrics and greatly reduce the latency. Instead of the data flowing Rails App > NewRelic > Lambda Function > CloudWatch Metric > CloudWatch Alarm, I create a thread when the Rails application boots. This thread polls the Puma.stats API and uses the aws-sdk-cloudwatch gem to publish the utilisation data directly to CloudWatch. I also discovered we had been using standard-resolution CloudWatch metrics and alarms, which have a one-minute resolution.

CloudWatch has introduced high-resolution² metrics with 1-second granularity and alarms that evaluate as often as every 10 seconds.
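For illustration, a high-resolution alarm can be created with the aws-sdk-cloudwatch gem by setting a 10-second period; the alarm name, namespace, threshold, and actions below are assumptions, not our production configuration:

require "aws-sdk-cloudwatch"

Aws::CloudWatch::Client.new.put_metric_alarm(
  alarm_name: "puma-utilisation-high",        # illustrative name
  namespace: "Puma",                           # must match the published metric
  metric_name: "Utilisation",
  statistic: "Average",
  period: 10,                                  # 10-second period = high-resolution alarm
  evaluation_periods: 1,
  threshold: 70,                               # illustrative threshold (%)
  comparison_operator: "GreaterThanThreshold",
  alarm_actions: []                            # e.g. a scale-out policy ARN
)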

The combination of eliminating the latency of publishing to NewRelic and polling the NewRelic API via a Lambda function, plus switching to high-resolution metrics and alarms, made our autoscaling far more reactive. Instead of taking minutes to detect traffic spikes, the autoscaling can trigger within 10 seconds or less.

This big reduction in latency lets us set our scale-out thresholds higher and greatly reduce unnecessary scale-outs and over-provisioning. I’ve yet to experiment fully with this on our production systems, but I predict it will deliver considerable savings. I’ll report the outcome in a future post.

Additional notes

Costs of using CloudWatch’s high-resolution metrics

You should be aware there are some costs to consider³. The first is the switch to high-resolution CloudWatch alarms, although this cost is minimal:

  • Standard Resolution Alarm (60 sec) = $0.10/alarm/month
  • High Resolution Alarm (10 sec) = $0.30/alarm/month

However, pushing metric data to CloudWatch is where you need to be conscious of cost. Amazon charges $0.01 per 1,000 PutMetricData requests, and a container pushing data every second makes roughly 2.6 million requests a month, which works out at about $26/container/month. At peak times we have hundreds of containers running, so costs would add up quickly.

I avoided the cost issue by pushing data to CloudWatch only once a minute while traffic and server utilisation are low. Only when utilisation spikes do I immediately send utilisation data to CloudWatch. This keeps costs low while maintaining the highly reactive autoscaling.
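A rough sketch of that throttling decision; the spike threshold and quiet-period interval are illustrative assumptions:

# Publish immediately on a spike, otherwise at most once a minute
SPIKE_THRESHOLD = 70.0 # percent utilisation (illustrative)
QUIET_INTERVAL  = 60   # seconds between publishes when utilisation is low

def publish_now?(utilisation, last_published_at)
  utilisation >= SPIKE_THRESHOLD || (Time.now - last_published_at) >= QUIET_INTERVAL
end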

Ideas for future tasks

  • Highly reactive autoscaling with containers also needs containers that are fast to spin up and start serving requests. Investigate the use of Alpine Linux for smaller/faster containers.
  • Use Shopify’s bootsnap gem to speed up boot times of our application [DONE]
  • Use the bumbler gem to find any slow-to-load gems, then update, fix, or replace them.

Additional Links

  1. I contacted NewRelic about the inaccurate data and was informed this is now fixed in newer versions of their `newrelic_rpm` gem https://github.com/newrelic/rpm/blob/master/CHANGELOG.md#v540
  2. Amazon introduced high resolution metrics in July 2017 https://aws.amazon.com/blogs/aws/new-high-resolution-custom-metrics-and-alarms-for-amazon-cloudwatch/
  3. Amazon CloudWatch pricing https://aws.amazon.com/cloudwatch/pricing/
