Optimizing Ruby memory usage

Shreyansh Gupta · Published in Nerd For Tech · May 21, 2024 · 5 min read

These suggestions are tailored to CRuby.


1. Dial back the instance counts — discover the true steady state memory usage per instance.

The idea behind this is simple. When you start a Rails app, memory usage grows over time as different URLs are hit and different modules and classes are loaded into memory. However, it is expected to level off after some time, once all or most of the long-lived objects have been instantiated.

Don’t just presume you’ve got a memory leak by looking at a memory graph covering a short duration. Give the application some time. Also make sure you don’t have overly restrictive limits (e.g. worker killers) that kill off processes as their memory usage increases.

Tip: For most applications, aim for about 300 MB per instance at steady state. This also applies to Sidekiq worker processes.
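
To see where your app levels off, graph memory over time in your APM, or log it yourself. A minimal sketch, assuming Linux (it reads the kernel’s /proc status file; the logging interval is illustrative):

```ruby
# Log resident set size (RSS) and live heap slots once a minute so you can
# see where memory plateaus. Linux-only.
def rss_mb
  File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i / 1024.0
end

Thread.new do
  loop do
    puts "RSS: #{rss_mb.round(1)} MB | live slots: #{GC.stat(:heap_live_slots)}"
    sleep 60
  end
end
```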

2. Stop allocating too many objects at once

Even once memory slots are no longer in use, the process’s memory usage doesn’t simply go down. There are several reasons for this.

  1. Thresholds: GC doesn’t run on a timer; it runs when a threshold is hit, i.e. when more memory is needed but the Ruby VM has run out of free slots. In that case, the GC first tries to clear out existing unused slots before allocating new memory.
  2. Heap fragmentation: the object space is divided into pages, and each page into slots. Each page is 16 KB and each slot is 40 bytes, which works out to roughly 408 slots per page. Ruby historically could not move objects between slots or across pages (C extensions hold raw pointers, and moving objects would break them; Ruby 2.7 introduced GC.compact to mitigate this, but objects are not moved automatically by default). This means that if even 1 slot out of the ~408 in a page is occupied, that page cannot be released back to the OS. Check GC::INTERNAL_CONSTANTS for these limits (see the sketch after this list); they are implementation-dependent.
  3. Malloc and free are suggestions, not commands: when free is called, the allocator may decide to hold onto the memory rather than release it, on the assumption that it will be requested again. Or the OS might not reclaim the memory immediately.
  4. Allocators don’t cope well when memory is in use at both ends of the heap with free pages in the middle. In such a case, the free pages might not be released back to the OS.
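
You can inspect the geometry described above, and how many pages and slots your process currently holds, straight from Ruby. A small sketch (constant names and values vary across Ruby versions and builds):

```ruby
# Page/slot limits for this Ruby build:
pp GC::INTERNAL_CONSTANTS
# e.g. {:RVALUE_SIZE=>40, :HEAP_PAGE_SIZE=>16384, ...}

# A page with even one live object cannot be returned to the OS:
stat = GC.stat
puts "allocated pages: #{stat[:heap_allocated_pages]}"
puts "live slots:      #{stat[:heap_live_slots]}"
puts "free slots:      #{stat[:heap_free_slots]}"
```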

All this can cause slow long-term “leaks” even when the app is at steady state. But that is not really a memory leak; it is heap fragmentation.

So, instead of focusing on reducing overall memory usage (which, as we saw, doesn’t really go down), focus on reducing the memory required by individual actions. This can be done by allocating fewer objects, or by fixing N+1 queries.

The following steps show how to find the places where a lot of objects are being allocated.

2 (a). Use APM

e.g. Scout, Skylight, New Relic.

2 (b). Use memory_profiler or oink

Both are free Ruby gems.

oink shows which actions are blowing up the heap; memory_profiler shows how much memory each block of code allocates and retains.

Tip: Look for bad actions using oink and dig deeper with memory_profiler.
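
A minimal memory_profiler sketch; the block contents are an illustrative stand-in for whatever code path you suspect:

```ruby
require 'memory_profiler'

report = MemoryProfiler.report do
  # Wrap the suspicious code path, e.g. a controller action or an export.
  10_000.times { "row-%05d" % rand(100_000) }
end

# Breaks allocated and retained memory down by gem, file, and location.
report.pretty_print(to_file: 'memory_report.txt')
```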

Or, build your own solutions using GC.stat or ObjectSpace.count_objects.
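
For a dependency-free starting point, a crude before/after diff might look like this (the measured work is illustrative):

```ruby
GC.start
before = ObjectSpace.count_objects.dup

rows = Array.new(10_000) { |i| { id: i, name: "user-#{i}" } } # suspicious work

after = ObjectSpace.count_objects
after.each do |type, count|
  delta = count - before.fetch(type, 0)
  puts "#{type}: +#{delta}" if delta.positive?
end
```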

2 (c). If all else fails, move to rake tasks

Throwaway VMs are better than bloated VMs. The idea is that if an action is going to trash our VM, it’s better to move that action into a dedicated VM and then throw that VM away.

e.g. we can move big export tasks into a rake task or a Sidekiq worker. Bloated Sidekiq jobs can be routed to a separate queue, and that queue can be run on a separate dyno on Heroku, so that is the only VM that gets blown up.
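
A sketch of that isolation with Sidekiq; the job class, queue name, and Export call are illustrative:

```ruby
class BigExportJob
  include Sidekiq::Job            # Sidekiq::Worker on older Sidekiq versions
  sidekiq_options queue: :bloated # served only by a dedicated process

  def perform(account_id)
    Export.generate(account_id)   # hypothetical memory-hungry work
  end
end
```

Start that queue on its own dyno with `bundle exec sidekiq -q bloated`, so only that process bloats and it can be restarted cheaply.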

3. Gemfile audit with derailed

The derailed_benchmarks gem goes through each gem in your Gemfile and reports how much memory requiring that gem takes; run it with `bundle exec derailed bundle:mem`. Require only the dependencies you actually need.

You can also add `require: false` for asset-related gems. e.g. if you’re precompiling your assets, then you don’t need sass at runtime in production.
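
In the Gemfile that might look like this (the gem choices are illustrative):

```ruby
# Only needed at asset-precompile time, so don't require it at boot:
gem 'sass-rails', require: false

# Only needed for local profiling:
gem 'derailed_benchmarks', group: :development
```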

4. Jemalloc

In Ruby we have a choice of memory allocator. Normally the program uses the glibc allocator, but we can also use jemalloc (originally written by Jason Evans, with much of its later development done at Facebook).

The jemalloc project describes its design as “emphasizing fragmentation avoidance and scalable concurrency support.”

This can be done either by preloading it with the LD_PRELOAD environment variable or by compiling Ruby with the --with-jemalloc flag.
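
Either way, it’s worth verifying jemalloc actually made it into the process. A rough Linux-only check:

```ruby
# True if jemalloc is mapped into the running process; covers both the
# LD_PRELOAD approach and a --with-jemalloc build.
loaded = File.readlines('/proc/self/maps').any? { |line| line.include?('jemalloc') }
puts loaded ? 'jemalloc loaded' : 'default glibc malloc'
```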

5. Use copy-on-write with preloading

Use Puma, Unicorn, or Passenger with preloading in production.

Copy-on-write increases memory sharing between processes. Check this link to learn more about copy-on-write: https://www.geeksforgeeks.org/copy-on-write/

The reality is that some copy-on-write is better than no copy-on-write.

Preloading loads and initializes your application code, then forks workers from that point. Code loaded before the fork lives in shared memory; after the fork, each worker also has its own private memory. The workers aren’t aware of this sharing; it happens at the OS level. Reading shared memory works fine, but when a worker writes to it, the affected data is first copied into the worker’s private memory. Hence it is called copy-on-write. We want to avoid writing to shared memory.
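
With Puma, preloading looks like this (the worker and thread counts are illustrative):

```ruby
# config/puma.rb
workers 2      # forked worker processes
threads 5, 5
preload_app!   # boot the app once, then fork: boot-time memory starts shared

before_fork do
  # Don't let workers inherit live connections; they open their own on boot.
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end
```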

I suppose the idea here is to use preloading to minimize memory usage by loading common modules into shared memory, with copy-on-write kicking in only when a worker absolutely needs to write to that data.

Memory is difficult to measure, so make sure you are measuring the right value. When we ask how much memory will be freed by killing a worker, the number that matters is the worker’s private memory.

If copy-on-write with preloading doesn’t seem to be yielding any benefit, dig into your memory measurement tool a little more and understand what it actually reports.

6. Use a threaded web server

Puma and Passenger Enterprise support this.

Threads in the same process share memory, so they let us serve more requests from the same memory instead of allocating more.

Most people aren’t writing code so complicated that it couldn’t be thread-safe, so you might want to try it for your application.

You can try running your test suite with minitest/hell to find threading bugs.
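
minitest/hell ships with the minitest gem; requiring it forces every test to run in parallel, which tends to shake out thread-safety bugs:

```ruby
# test/test_helper.rb
require 'minitest/hell' # runs all tests in parallel to surface races
```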

7. Keep Ruby and gems up to date

Newer Ruby releases regularly ship GC and allocator improvements (generational GC in 2.1, incremental GC in 2.2, compaction in 2.7), and gem updates often include memory fixes.

8. Tune malloc

This one is for people running a threaded web server in production.

The default allocator, glibc malloc, creates per-thread arenas to reduce contention between threads reading and writing memory. It creates a new arena every time it detects contention, and by default it can create up to 8 times as many arenas as there are cores (on 64-bit systems). That can end up being a lot of memory. Reducing this number reduces memory usage, but it can also reduce performance by increasing contention.

The knob is the MALLOC_ARENA_MAX environment variable; MALLOC_ARENA_MAX=2 is a common production setting.

You can also tune the allocator further via mallopt(3). Check it out yourself.

9. Tune GC

Not recommended unless you are ready to deep dive into the C source code for GC.

But GC tuning can fix issues like too many free slots, slow startup, and GC running too often or not often enough.
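
Before reaching for tuning variables such as RUBY_GC_HEAP_GROWTH_FACTOR, confirm the symptom from GC.stat first. The keys below are real; what counts as a healthy value depends on your app:

```ruby
s = GC.stat
puts "minor GCs: #{s[:minor_gc_count]}, major GCs: #{s[:major_gc_count]}"
puts "available slots: #{s[:heap_available_slots]}"
puts "free slots:      #{s[:heap_free_slots]}" # persistently huge => over-grown heap
```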

Remember that any optimization work done before establishing that you have a problem is premature. Focus on benchmarking, measurement, and profiling. Prove that you have a problem; only then move on to optimization.

These notes have been prepared from Nate Berkopec’s talk on Ruby memory optimization.


Software Engineer. I like to write about the new things I learn and find interesting. You can also find me here: https://shreyanshgupta.hashnode.dev/