How I Tuned Puma for Ruby on Rails on Kubernetes

Tuning Puma for Kubernetes

William
Feb 14, 2024

Puma is a popular web server for Ruby on Rails; tuning its configuration well makes your services run more efficiently.

Puma gem: https://rubygems.org/gems/puma

Ref: https://puma.io/

Why refactor?

The core reason is that when we migrated from Docker to a Kubernetes architecture, the Puma configuration needed some updates to fit the new environment. So I re-researched many articles and source repos to help me rewrite it.

Before Tuning the Puma Configuration, You Should Check

  1. Do not experiment on your most essential services first. Use a supplementary service instead.
    For example, do not start with the member OAuth service. In my case, I used the Mail Center for testing (most of its requests are internal calls, and failed requests will be retried).
  2. You must have at least one monitoring service to check the service's online status. I use Datadog to observe the service and Kubernetes status before and after each change.
  3. You must understand why and how each configuration setting works, and tune step by step.

Refactor Puma Configuration

Our initial Puma Configuration

Here is our first configuration.

#!/usr/bin/env puma

environment ENV.fetch("RAILS_ENV") { "development" }

if ENV['RAILS_ENV'].nil? || ENV['RAILS_ENV'] == 'development'
  threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
  threads threads_count, threads_count

  port ENV.fetch("PORT") { 3000 }

  plugin :tmp_restart
else
  directory './'
  rackup "./config.ru"

  pidfile "./tmp/pids/puma.pid"
  state_path "./tmp/pids/puma.state"

  threads 0, 16

  port ENV.fetch("PORT") { 3000 }

  workers 2

  prune_bundler

  on_restart do
    puts 'Refreshing Gemfile'
    ENV["BUNDLE_GEMFILE"] = "./Gemfile"
  end
end

Refactor Key 1: Decrease the number of threads.

I surveyed several references, and you should check them first:

  1. [Documentation] How to best tune Puma for Kubernetes
  2. Configuring Puma, Unicorn and Passenger for Maximum Efficiency
  3. Workers Per Pod, and Other Config Issues

According to the last issue discussion, using too many threads does not help the service accept more requests, and the threads will contend for the Global VM Lock.

IME more than 5 threads is usually too much for a webapp (which is why 5 is now the default for Puma on MRI). Over that amount, the GVL is locked >80% of the time but with extra free threads, Puma processes will continue to pick up new requests, which they cannot immediately service, leading to increased latency for no throughput gain.
Reference:
https://github.com/puma/puma/issues/2645#issuecomment-1019241622

Global VM Lock:
The Global VM Lock (GVL), also known in some contexts as the Global Interpreter Lock (GIL), is a synchronization mechanism used in programming language interpreters. It is primarily employed in multithreaded programs to ensure that only one thread can execute bytecode at any given moment, thereby preventing data races that can occur when multiple threads execute concurrently.
Introduction by ChatGPT.
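To see the GVL in practice, here is a minimal sketch (mine, not from the referenced issue) showing that CPU-bound work does not get faster with more threads on MRI:

require "benchmark"

# Pure CPU-bound work: no IO, so threads cannot release the GVL.
def cpu_work
  500_000.times { |i| Math.sqrt(i) }
end

sequential = Benchmark.realtime { 4.times { cpu_work } }
threaded   = Benchmark.realtime do
  Array.new(4) { Thread.new { cpu_work } }.each(&:join)
end

# On MRI, both timings come out roughly the same: only one thread executes
# Ruby bytecode at a time, so extra threads add latency, not throughput.
puts format("sequential: %.2fs / 4 threads: %.2fs", sequential, threaded)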

In our configuration, I had set too many threads for each worker, so that was my first refactor target.

Based on the Puma documentation, we know that the final total capacity is Kubernetes pods * Puma workers * threads per worker.

Our capacity was 64 before: 2 (pods) * 2 (workers) * 16 (max threads). I feared a sharp decrease would cause a critical issue, so I sequentially decreased threads to 8 and finally to 5 (bringing the total capacity to 20).
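To make the arithmetic explicit, here is a trivial sketch of the formula above:

# capacity = pods * workers * max threads per worker
old_capacity = 2 * 2 * 16 # => 64
new_capacity = 2 * 2 * 5  # => 20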

# puma.rb
# Let Puma auto-tune the threads per worker in the 0 to 5 range.
# Decrease from 16 to 5 because each thread will have more CPU / memory resources.
threads 0, 5

Note:
If you want an equation for the child process count:
TOTAL_RAM / (RAM_PER_PROCESS * 1.2)
Ref: https://www.speedshop.co/2017/10/12/appserver.html
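A minimal sketch of that equation in Ruby (the names and default values are illustrative assumptions, not from the article):

# RAM values in MB; the 1.2 factor leaves ~20% headroom per process for
# memory growth and copy-on-write drift.
total_ram_mb       = Integer(ENV.fetch("TOTAL_RAM_MB", "4096"))      # e.g. the pod memory limit
ram_per_process_mb = Integer(ENV.fetch("RAM_PER_PROCESS_MB", "1024"))

worker_count = (total_ram_mb / (ram_per_process_mb * 1.2)).floor
# 4096 / (1024 * 1.2) ≈ 3.3 → 3 workers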

Refactor Key 2: Workers should reliably detect how many CPU cores can be used.

In the initial setting, we set workers to two because, in our old Docker infrastructure, each service had two vCPUs available.

If multithreaded, allocate 1 CPU per worker. If single threaded, allocate 0.75 cpus per worker. Most web applications spend about 25% of their time in I/O — but when you’re running multi-threaded, your Puma process will have higher CPU usage and should be able to fully saturate a CPU core.
Reference: https://github.com/puma/puma/blob/master/docs/kubernetes.md#workers-per-pod-and-other-config-issues

On my first try, I referred to the newer Rails version's Puma configuration to detect the CPU count.

# Specifies that the worker count should equal the number of processors in production.
if ENV["RAILS_ENV"] == "production"
  worker_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
  workers worker_count if worker_count > 1
end

It is a good idea to use the Ruby method to detect the CPU processor count. You can decide whether to base it on the "physical processor count" or "processors seen by the OS and used for process scheduling".

# Module: Concurrent

require "concurrent-ruby"

# Number of physical processor cores on the current system.
# For performance reasons the calculated value will be memoized on the first call.
Concurrent.physical_processor_count

# Number of processors seen by the OS and used for process scheduling.
# For performance reasons the calculated value will be memoized on the first call.
Concurrent.processor_count

However, it does not work in Kubernetes.

If your Kubernetes cluster node has 4 CPUs and you limit the service to 1000m (one full CPU), the Concurrent methods will still return 4: they see host-level resources, not pod-level limits.

For this reason, I needed Puma to read the Kubernetes limit directly. Here is what I tried:

# Check the K8s CPU limit from the cgroup config files.
## cpu.cfs_quota_us specifies the maximum CPU time (in microseconds) that the group can use during each window.
## cpu.cfs_period_us is the length of the time window (in microseconds) for CPU access.

## A quota of -1 means there is no limitation on CPU usage.

require "concurrent-ruby"

quota_file = '/sys/fs/cgroup/cpu/cpu.cfs_quota_us'
period_file = '/sys/fs/cgroup/cpu/cpu.cfs_period_us'
quota = File.read(quota_file).strip.to_i
period = File.read(period_file).strip.to_i

if quota != -1
  processors_count = (quota.to_f / period.to_f).ceil
else
  processors_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
end

If CPU resources are not limited, the service can use up to the node's full CPU, so I fall back to the physical processor count. If there is a limit, calculate the worker count from it!
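One caveat (my assumption, not covered in the original setup): the paths above are cgroup v1. Nodes running cgroup v2 expose the limit in a single /sys/fs/cgroup/cpu.max file instead, so a more defensive sketch might look like this:

# Sketch: read the CPU limit under both cgroup v1 and v2.
# cgroup v2 stores "<quota> <period>" in one file; "max" means unlimited.
def cgroup_cpu_limit
  if File.exist?("/sys/fs/cgroup/cpu.max")                      # cgroup v2
    quota, period = File.read("/sys/fs/cgroup/cpu.max").split
    return nil if quota == "max"
    (quota.to_f / period.to_f).ceil
  elsif File.exist?("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")      # cgroup v1
    quota  = File.read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").to_i
    period = File.read("/sys/fs/cgroup/cpu/cpu.cfs_period_us").to_i
    return nil if quota == -1
    (quota.to_f / period.to_f).ceil
  end
end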

Refactor Key 3: Always set timeouts for services.

I read Sidekiq author Mike Perham's article long ago, but the memory is still fresh. It taught me that every network request should have a timeout.

For setting the Puma timeouts, I referred to the GitHub repo the-ultimate-guide-to-ruby-timeouts, which is very useful.

Reference: https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts?tab=readme-ov-file#puma
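Concretely, these are the timeout settings that ended up in my final configuration below (both are cluster-mode settings):

# worker_timeout: how long the master waits between worker heartbeats
# before it considers the worker hung and restarts it.
worker_timeout 15

# worker_shutdown_timeout: grace period for in-flight requests when a
# worker is asked to shut down, before it is forcefully killed.
worker_shutdown_timeout 8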

Refactor Key 4: Set up error handling.

Puma ships with default error handling. We also use Sentry to capture errors, so reporting to Sentry when a low-level error happens sounds good!

# Reference: https://github.com/puma/puma?tab=readme-ov-file#error-handling

lowlevel_error_handler do |e|
  Sentry.capture_exception(e)
  [500, {}, ["An error has occurred"]]
end

The Final Puma Configuration

# This configuration file will be evaluated by Puma. The top-level methods that
# are invoked here are part of Puma's configuration DSL. For more information
# about methods provided by the DSL, see https://puma.io/puma/Puma/DSL.html.

# Specifies the `rails_env` that Puma will run in.
rails_env = ENV.fetch("RAILS_ENV", "development")

# Puma starts a configurable number of processes (workers) and each process
# serves each request in a thread from an internal thread pool.
#
# The ideal number of threads per worker depends both on how much time the
# application spends waiting for IO operations and on how much you wish
# to prioritize throughput over latency.
#
# As a rule of thumb, increasing the number of threads will increase how much
# traffic a given process can handle (throughput), but due to CRuby's
# Global VM Lock (GVL) it has diminishing returns and will degrade the
# response time (latency) of the application.
#
# The default is set to 3 threads as it's deemed a decent compromise between
# throughput and latency for the average Rails application.
#
# Any libraries that use a connection pool or another resource pool should
# be configured to provide at least as many connections as the number of
# threads. This includes Active Record's `pool` parameter in `database.yml`.
default_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 3 }
threads default_threads_count, default_threads_count

# Specifies that the worker count should equal the number of processors in
# production and preparing.
if rails_env == "production" || rails_env == "preparing"
  # If you are running more than 1 thread per process, the workers count
  # should be equal to the number of processors (CPU cores) in production
  # and preparing.
  #
  # It defaults to 1 because it's impossible to reliably detect how many
  # CPU cores are available. Make sure to set the `WEB_CONCURRENCY` environment
  # variable to match the number of processors.
  require "concurrent-ruby"

  # Check the K8s CPU limit from the cgroup config files.
  ## cpu.cfs_quota_us specifies the maximum CPU time (in microseconds) that the group can use during each window.
  ## cpu.cfs_period_us is the length of the time window (in microseconds) for CPU access.
  quota_file = '/sys/fs/cgroup/cpu/cpu.cfs_quota_us'
  period_file = '/sys/fs/cgroup/cpu/cpu.cfs_period_us'
  quota = File.read(quota_file).strip.to_i
  period = File.read(period_file).strip.to_i

  if quota != -1
    processors_count = (quota.to_f / period.to_f).ceil
  else
    processors_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.physical_processor_count })
  end

  # Set workers and threads.
  if processors_count > 1
    workers processors_count
    threads 0, 5

    on_worker_boot do
      ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
    end
  else
    preload_app!
  end

  worker_timeout 15
  worker_shutdown_timeout 8
end

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
port ENV.fetch("PORT") { 3000 }

# Specifies the `environment` that Puma will run in.
environment rails_env

# Allow Puma to be restarted by the `bin/rails restart` command.
plugin :tmp_restart
pidfile ENV["PIDFILE"] if ENV["PIDFILE"]

if rails_env == "development"
  # Specifies a very generous `worker_timeout` so that the worker
  # isn't killed by Puma when suspended by a debugger.
  worker_timeout 3600
end

# If Puma encounters an error outside of the context of your application,
# it will respond with a 500 and a simple textual error message.
lowlevel_error_handler do |e|
  Sentry.capture_exception(e)
  [500, {}, ["An error has occurred"]]
end

There are some minor changes I didn't mention above, but they are worth noting.

  1. Use the on_worker_boot method.
    Always make sure ActiveRecord's database connection (add Redis too if needed) is established when each worker boots.
  2. preload_app! is on by default if your app uses more than one worker; a common companion pattern is shown after this list.
    In Rails, the new Puma template also calls preload_app when the worker count equals 1.
    https://github.com/rails/rails/blob/main/railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt
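As a side note, a common pattern when the app is preloaded (a general Puma/ActiveRecord convention, not something specific to my setup) is to close connections in the master before forking and re-establish them in each worker:

preload_app!

# Close the master's connections before forking workers...
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

# ...and re-establish a connection in each worker after it boots.
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end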

Monitoring Everything

Remember that all changes you make should always be monitored.

There were no issues when we deployed the new configuration to production under close monitoring. (Of course, we have a preparing environment and ran some stress tests to check that the service worked, but its request volume is incomparable with the production environment.)

Even though we decreased the request capacity from 64 to 20, the services worked well. Although CPU and memory usage did not improve significantly, I still successfully decreased the thread count to a more appropriate level to avoid GVL contention.

Monitor CPU usage by Pod.
Monitor memory usage by Pod.

Conclusion

  1. If multithreaded, allocate 1 CPU per Puma worker.
    Ref: https://github.com/puma/puma/issues/2645#issuecomment-867629826
  2. Most Puma apps will use about 512 MB ~ 1 GB of memory per worker, and about 1 GB for the master process.
    Ref: https://github.com/puma/puma/issues/2645#issuecomment-867629826
  3. Most Puma apps will use about 300 MB ~ 500 MB of memory per child thread.
    It depends on your web service's type and functions; in my experiments with Ruby 3 and Rails 6 applications it landed in this range, somewhat higher than the roughly 200 MB ~ 400 MB per process cited in the reference article.
    Ref: https://www.speedshop.co/2017/10/12/appserver.html
  4. Setting each Puma worker to 3 ~ 5 threads is the most suitable for general purposes.
    Ref 1: https://github.com/rails/rails/issues/50450
    Ref 2: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
  5. Let your configuration dynamically adjust based on the resources it is given.
  6. Always remember to set timeouts and an error-handling mechanism for your web services.
    Ref 1: https://www.mikeperham.com/2015/05/08/timeout-rubys-most-dangerous-api/
    Ref 2: https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts
  7. Experiment on less critical services first, and do more testing before deploying to the production environment.
  8. After deploying to the production environment, use monitoring services to help you track and record every change.

References

  1. https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
  2. https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts?tab=readme-ov-file#puma
  3. https://github.com/puma/puma/blob/master/docs/kubernetes.md#workers-per-pod-and-other-config-issues
  4. https://github.com/puma/puma?tab=readme-ov-file#clustered-mode
  5. https://github.com/puma/puma?tab=readme-ov-file#error-handling
  6. https://github.com/puma/puma/issues/2645
  7. https://github.com/rails/rails/blob/main/activestorage/test/dummy/config/puma.rb
  8. https://github.com/rails/rails/blob/main/railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt
  9. https://github.com/rails/rails/issues/50450
  10. https://puma.io/puma/Puma/DSL.html
  11. https://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent.html#physical_processor_count-class_method
  12. https://www.speedshop.co/2017/10/12/appserver.html
  13. https://www.mikeperham.com/2015/05/08/timeout-rubys-most-dangerous-api/
