How I Tuned Puma for Ruby on Rails on Kubernetes

Tuning Puma for Kubernetes

William
Feb 14, 2024

Puma is a popular web server for Ruby on Rails; tuning its configuration well makes your services run more efficiently.

Puma gem: https://rubygems.org/gems/puma

Ref: https://puma.io/

Why refactor?

The core reason is that when we migrated from Docker to a Kubernetes architecture, the Puma configuration needed some updates to fit the new environment. So I re-researched many articles and source repos to help me rewrite it.

Before Tuning the Puma Configuration, You Should Check

  1. Do not experiment on your most essential services first. Use a supplementary service instead.
    For example, do not start with the member OAuth service. In my case, I used the Mail Center for testing (most of its requests are internal calls, and failed requests will be retried).
  2. You must have at least one monitoring service to check the service's online status. I use Datadog to observe the service and Kubernetes status before and after each change.
  3. You must understand why and how each configuration setting works, and tune step by step.

Refactor Puma Configuration

Our initial Puma Configuration

Here is our first configuration.

#!/usr/bin/env puma

environment ENV.fetch("RAILS_ENV") { "development" }

if ENV['RAILS_ENV'].nil? || ENV['RAILS_ENV'] == 'development'
  threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
  threads threads_count, threads_count

  port ENV.fetch("PORT") { 3000 }

  plugin :tmp_restart
else
  directory './'
  rackup "./config.ru"

  pidfile "./tmp/pids/puma.pid"
  state_path "./tmp/pids/puma.state"

  threads 0, 16

  port ENV.fetch("PORT") { 3000 }

  workers 2

  prune_bundler

  on_restart do
    puts 'Refreshing Gemfile'
    ENV["BUNDLE_GEMFILE"] = "./Gemfile"
  end
end

Refactor Key 1: Decrease the number of threads.

I surveyed several references, and you should check them first:

  1. [Documentation] How to best tune Puma for Kubernetes
  2. Configuring Puma, Unicorn and Passenger for Maximum Efficiency
  3. Workers Per Pod, and Other Config Issues

According to the last issue discussion, using too many threads does not help the service accept more requests, and the threads will contend for the Global VM Lock.

IME more than 5 threads is usually too much for a webapp (which is why 5 is now the default for Puma on MRI). Over that amount, the GVL is locked >80% of the time but with extra free threads, Puma processes will continue to pick up new requests, which they cannot immediately service, leading to increased latency for no throughput gain.
Reference:
https://github.com/puma/puma/issues/2645#issuecomment-1019241622

Global VM Lock:
The Global VM Lock (GVL), also known in some contexts as the Global Interpreter Lock (GIL), is a synchronization mechanism used in programming language interpreters. It is primarily employed in multithreaded programs to ensure that only one thread can execute bytecode at any given moment, thereby preventing data races that can occur when multiple threads execute concurrently.
Introduction by ChatGPT.
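To see the GVL in practice, here is a minimal sketch (mine, not from the referenced issue) showing that CPU-bound work does not get faster with more threads on MRI:

require "benchmark"

# Pure CPU-bound work: no IO, so threads cannot release the GVL.
def cpu_work
  500_000.times { |i| Math.sqrt(i) }
end

sequential = Benchmark.realtime { 4.times { cpu_work } }
threaded   = Benchmark.realtime do
  Array.new(4) { Thread.new { cpu_work } }.each(&:join)
end

# On MRI, both timings come out roughly the same: only one thread executes
# Ruby bytecode at a time, so extra threads add latency, not throughput.
puts format("sequential: %.2fs / 4 threads: %.2fs", sequential, threaded)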

In our configuration, I had set too many threads for each worker, so that was my first refactor target.

Based on the Puma documentation, we know that the final total capacity is Kubernetes pods * Puma workers * threads per worker.

Our capacity was 64 before: 2 (pods) * 2 (workers) * 16 (max threads). I feared a sharp decrease would cause a critical issue, so I sequentially decreased threads to 8 and finally to 5 (bringing the total capacity to 20).
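To make the arithmetic explicit, here is a trivial sketch of the formula above:

# capacity = pods * workers * max threads per worker
old_capacity = 2 * 2 * 16 # => 64
new_capacity = 2 * 2 * 5  # => 20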

# puma.rb
# Let Puma auto-tune the threads per worker in the 0 to 5 range.
# Decrease from 16 to 5 because each thread will have more CPU / memory resources.
threads 0, 5

Note:
If you want an equation for the child process count:
TOTAL_RAM / (RAM_PER_PROCESS * 1.2)
Ref: https://www.speedshop.co/2017/10/12/appserver.html
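A minimal sketch of that equation in Ruby (the names and default values are illustrative assumptions, not from the article):

# RAM values in MB; the 1.2 factor leaves ~20% headroom per process for
# memory growth and copy-on-write drift.
total_ram_mb       = Integer(ENV.fetch("TOTAL_RAM_MB", "4096"))      # e.g. the pod memory limit
ram_per_process_mb = Integer(ENV.fetch("RAM_PER_PROCESS_MB", "1024"))

worker_count = (total_ram_mb / (ram_per_process_mb * 1.2)).floor
# 4096 / (1024 * 1.2) ≈ 3.3 → 3 workers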

Refactor Key 2: Workers should reliably detect how many CPU cores can be used.

In the initial setting, we set workers to two because, in our old Docker infrastructure, each service had two vCPUs available.

If multithreaded, allocate 1 CPU per worker. If single threaded, allocate 0.75 cpus per worker. Most web applications spend about 25% of their time in I/O — but when you’re running multi-threaded, your Puma process will have higher CPU usage and should be able to fully saturate a CPU core.
Reference: https://github.com/puma/puma/blob/master/docs/kubernetes.md#workers-per-pod-and-other-config-issues

On my first try, I referred to the newer Rails version's Puma configuration to detect the CPU count.

# Specifies that the worker count should equal the number of processors in production.
if ENV["RAILS_ENV"] == "production"
  worker_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
  workers worker_count if worker_count > 1
end

It is a good idea to use the Ruby method to detect the CPU processor count. You can decide whether to base it on the "physical processor count" or "processors seen by the OS and used for process scheduling".

# Module: Concurrent

require "concurrent-ruby"

# Number of physical processor cores on the current system.
# For performance reasons the calculated value will be memoized on the first call.
Concurrent.physical_processor_count

# Number of processors seen by the OS and used for process scheduling.
# For performance reasons the calculated value will be memoized on the first call.
Concurrent.processor_count

However, it does not work in Kubernetes.

If your Kubernetes cluster node has 4 CPUs and you limit the service to 1000m (one full CPU), the Concurrent methods will still return 4: they see host-level resources, not pod-level limits.

For this reason, I needed Puma to read the Kubernetes limit directly. Here is what I tried:

# Check the K8s CPU limit from the cgroup config files.
## cpu.cfs_quota_us specifies the maximum CPU time (in microseconds) that the group can use during each window.
## cpu.cfs_period_us is the length of the time window (in microseconds) for CPU access.

## A quota of -1 means there is no limitation on CPU usage.

require "concurrent-ruby"

quota_file = '/sys/fs/cgroup/cpu/cpu.cfs_quota_us'
period_file = '/sys/fs/cgroup/cpu/cpu.cfs_period_us'
quota = File.read(quota_file).strip.to_i
period = File.read(period_file).strip.to_i

if quota != -1
  processors_count = (quota.to_f / period.to_f).ceil
else
  processors_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
end

If CPU resources are not limited, the service can use up to the node's full CPU, so I fall back to the physical processor count. If there is a limit, calculate the worker count from it!
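One caveat (my assumption, not covered in the original setup): the paths above are cgroup v1. Nodes running cgroup v2 expose the limit in a single /sys/fs/cgroup/cpu.max file instead, so a more defensive sketch might look like this:

# Sketch: read the CPU limit under both cgroup v1 and v2.
# cgroup v2 stores "<quota> <period>" in one file; "max" means unlimited.
def cgroup_cpu_limit
  if File.exist?("/sys/fs/cgroup/cpu.max")                      # cgroup v2
    quota, period = File.read("/sys/fs/cgroup/cpu.max").split
    return nil if quota == "max"
    (quota.to_f / period.to_f).ceil
  elsif File.exist?("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")      # cgroup v1
    quota  = File.read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").to_i
    period = File.read("/sys/fs/cgroup/cpu/cpu.cfs_period_us").to_i
    return nil if quota == -1
    (quota.to_f / period.to_f).ceil
  end
end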

Refactor Key 3: Always set timeouts for services.

I read Sidekiq author Mike Perham's article long ago, but the memory is still fresh. It taught me that every network request should have a timeout.

For setting the Puma timeouts, I referred to the GitHub repo the-ultimate-guide-to-ruby-timeouts, which is very useful.

Reference: https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts?tab=readme-ov-file#puma
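Concretely, these are the timeout settings that ended up in my final configuration below (both are cluster-mode settings):

# worker_timeout: how long the master waits between worker heartbeats
# before it considers the worker hung and restarts it.
worker_timeout 15

# worker_shutdown_timeout: grace period for in-flight requests when a
# worker is asked to shut down, before it is forcefully killed.
worker_shutdown_timeout 8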

Refactor Key 4: Set up error handling.

Puma ships with default error handling. We also use Sentry to capture errors, so reporting to Sentry when a low-level error happens sounds good!

# Reference: https://github.com/puma/puma?tab=readme-ov-file#error-handling

lowlevel_error_handler do |e|
  Sentry.capture_exception(e)
  [500, {}, ["An error has occurred"]]
end

The Final Puma Configuration

# This configuration file will be evaluated by Puma. The top-level methods that
# are invoked here are part of Puma's configuration DSL. For more information
# about methods provided by the DSL, see https://puma.io/puma/Puma/DSL.html.

# Specifies the `rails_env` that Puma will run in.
rails_env = ENV.fetch("RAILS_ENV", "development")

# Puma starts a configurable number of processes (workers) and each process
# serves each request in a thread from an internal thread pool.
#
# The ideal number of threads per worker depends both on how much time the
# application spends waiting for IO operations and on how much you wish
# to prioritize throughput over latency.
#
# As a rule of thumb, increasing the number of threads will increase how much
# traffic a given process can handle (throughput), but due to CRuby's
# Global VM Lock (GVL) it has diminishing returns and will degrade the
# response time (latency) of the application.
#
# The default is set to 3 threads as it's deemed a decent compromise between
# throughput and latency for the average Rails application.
#
# Any libraries that use a connection pool or another resource pool should
# be configured to provide at least as many connections as the number of
# threads. This includes Active Record's `pool` parameter in `database.yml`.
default_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 3 }
threads default_threads_count, default_threads_count

# Specifies that the worker count should equal the number of processors in
# production and preparing.
if rails_env == "production" || rails_env == "preparing"
  # If you are running more than 1 thread per process, the workers count
  # should be equal to the number of processors (CPU cores) in production
  # and preparing.
  #
  # It defaults to 1 because it's impossible to reliably detect how many
  # CPU cores are available. Make sure to set the `WEB_CONCURRENCY` environment
  # variable to match the number of processors.
  require "concurrent-ruby"

  # Check the K8s CPU limit from the cgroup config files.
  ## cpu.cfs_quota_us specifies the maximum CPU time (in microseconds) that the group can use during each window.
  ## cpu.cfs_period_us is the length of the time window (in microseconds) for CPU access.
  quota_file = '/sys/fs/cgroup/cpu/cpu.cfs_quota_us'
  period_file = '/sys/fs/cgroup/cpu/cpu.cfs_period_us'
  quota = File.read(quota_file).strip.to_i
  period = File.read(period_file).strip.to_i

  if quota != -1
    processors_count = (quota.to_f / period.to_f).ceil
  else
    processors_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
  end

  # Set workers and threads.
  if processors_count > 1
    workers processors_count
    threads 0, 5

    on_worker_boot do
      ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
    end
  else
    preload_app!
  end

  worker_timeout 15
  worker_shutdown_timeout 8
end

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
port ENV.fetch("PORT") { 3000 }

# Specifies the `environment` that Puma will run in.
environment rails_env

# Allow Puma to be restarted by the `bin/rails restart` command.
plugin :tmp_restart
pidfile ENV["PIDFILE"] if ENV["PIDFILE"]

if rails_env == "development"
  # Specifies a very generous `worker_timeout` so that the worker
  # isn't killed by Puma when suspended by a debugger.
  worker_timeout 3600
end

# If Puma encounters an error outside of the context of your application,
# it will respond with a 500 and a simple textual error message.
lowlevel_error_handler do |e|
  Sentry.capture_exception(e)
  [500, {}, ["An error has occurred"]]
end

There are some minor changes I didn't mention above, but they are worth noting.

  1. Use the on_worker_boot method.
    Always make sure ActiveRecord's database connection (add Redis too if needed) is established when each worker boots.
  2. preload_app! is on by default if your app uses more than one worker; a common companion pattern is shown after this list.
    In Rails, the new Puma template also calls preload_app when the worker count equals 1.
    https://github.com/rails/rails/blob/main/railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt
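As a side note, a common pattern when the app is preloaded (a general Puma/ActiveRecord convention, not something specific to my setup) is to close connections in the master before forking and re-establish them in each worker:

preload_app!

# Close the master's connections before forking workers...
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

# ...and re-establish a connection in each worker after it boots.
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end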

Monitoring Everything

Remember that all changes you make should always be monitored.

There were no issues when we deployed the new configuration to production under close monitoring. (Of course, we have a preparing environment and ran some stress tests to check that the service worked, but its request volume is incomparable with the production environment.)

Even though we decreased the request capacity from 64 to 20, the services worked well. Although CPU and memory usage did not improve significantly, I still successfully decreased the thread count to a more appropriate level to avoid GVL contention.

Monitor CPU usage by Pod.
Monitor memory usage by Pod.

Conclusion

  1. If multithreaded, allocate 1 CPU per Puma worker.
    Ref: https://github.com/puma/puma/issues/2645#issuecomment-867629826
  2. Most Puma apps will use about 512 MB ~ 1 GB of memory per worker, and about 1 GB for the master process.
    Ref: https://github.com/puma/puma/issues/2645#issuecomment-867629826
  3. Most Puma apps will use about 300 MB ~ 500 MB of memory per child thread.
    It depends on your web service's type and functions; in my experiments with Ruby 3 and Rails 6 applications it landed in this range, somewhat higher than the roughly 200 MB ~ 400 MB per process cited in the reference article.
    Ref: https://www.speedshop.co/2017/10/12/appserver.html
  4. Setting each Puma worker to 3 ~ 5 threads is the most suitable for general purposes.
    Ref 1: https://github.com/rails/rails/issues/50450
    Ref 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
  5. Let your configuration dynamically adjust based on the resources it is given.
  6. Always remember to set timeouts and an error-handling mechanism for your web services.
    Ref 1: https://www.mikeperham.com/2015/05/08/timeout-rubys-most-dangerous-api/
    Ref 2: https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts
  7. Experiment on less critical services first, and do more testing before deploying to the production environment.
  8. After deploying to the production environment, use monitoring services to help you track and record every change.

References

  1. https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
  2. https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts?tab=readme-ov-file#puma
  3. https://github.com/puma/puma/blob/master/docs/kubernetes.md#workers-per-pod-and-other-config-issues
  4. https://github.com/puma/puma?tab=readme-ov-file#clustered-mode
  5. https://github.com/puma/puma?tab=readme-ov-file#error-handling
  6. https://github.com/puma/puma/issues/2645
  7. https://github.com/rails/rails/blob/main/activestorage/test/dummy/config/puma.rb
  8. https://github.com/rails/rails/blob/main/railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt
  9. https://github.com/rails/rails/issues/50450
  10. https://puma.io/puma/Puma/DSL.html
  11. https://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent.html#physical_processor_count-class_method
  12. https://www.speedshop.co/2017/10/12/appserver.html
  13. https://www.mikeperham.com/2015/05/08/timeout-rubys-most-dangerous-api/
