Debugging memory leaks in Ruby on Rails applications

Venu Karna
5 min read · Sep 8, 2019


Debugging memory leaks in Ruby on Rails (RoR) applications can be really tricky. It gets even trickier when the issue cannot be replicated in a testing environment. We recently encountered one such case, and this article describes the steps we followed to identify the root cause.

One fine day, our RoR application servers hosted on AWS started showing frequent CPU and memory spikes. Initially we tried to debug this using New Relic, Sentry, Apache logs, AWS logs, and so on. But since quite a few APIs started reporting high response times during the spikes, it was difficult to pinpoint whether it was a code issue. Then we looked into the Apache and Passenger configurations to see if there were any config issues, and also tried reproducing the problem on staging, but everything seemed normal.

Apache-Passenger Configuration:

PassengerMaxPoolSize 15

PassengerPoolIdleTime 60

PassengerMaxPreloaderIdleTime 0

PassengerStatThrottleRate 300

PassengerMaxRequests 200000

PassengerMinInstances 8

These parameters had already been tuned according to our server and application specs.

Debugging Steps

When the last spike happened, we detached the affected instance from the autoscaling group and started debugging.

  1. Run: passenger-status

The output will look something like this:

Version : 5.3.1

Date : 2019-03-29 06:35:53 +0000

Instance: vpWkAQc2 (Apache/2.2.34 (Unix) DAV/2 Phusion_Passenger/5.3.1)

----------- General information -----------

Max pool size : 15

App groups : 1

Processes : 8

Requests in top-level queue :

----------- Application groups -----------

/ebs_store/xxx (production):

App root: /ebs_store/xxx

Requests in queue: 0

* PID: 13730 Sessions: 0 Processed: 3090 Uptime: 17h 38m 57s

CPU: 0% Memory : 479M Last used: 0s ago

* PID: 13765 Sessions: 0 Processed: 18 Uptime: 17h 38m 57s

CPU: 0% Memory : 178M Last used: 47m 40s

* PID: 13804 Sessions: 0 Processed: 9 Uptime: 17h 38m 57s

CPU: 0% Memory : 199M Last used: 17h 29m

* PID: 13839 Sessions: 0 Processed: 1 Uptime: 17h 38m 56s

CPU: 0% Memory : 179M Last used: 17h 29m

* PID: 13867 Sessions: 0 Processed: 0 Uptime: 17h 38m 56s

CPU: 0% Memory : 85M Last used: 17h 38m

* PID: 13893 Sessions: 0 Processed: 0 Uptime: 17h 38m 56s

CPU: 0% Memory : 84M Last used: 17h 38m

* PID: 13919 Sessions: 0 Processed: 0 Uptime: 17h 38m 55s

CPU: 0% Memory : 84M Last used: 17h 38m

* PID: 13945 Sessions: 0 Processed: 0 Uptime: 17h 38m 55s

CPU: 0% Memory : 84M Last used: 17h 38m

Check the CPU and memory used by each of the Passenger processes. Tune the Passenger parameters so that the total memory consumed by all the processes does not exceed the memory available on the server.
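As a quick sanity check, a small Ruby script like the one below can total up the per-process memory reported by passenger-status. This is only an illustrative sketch that assumes the plain-text output format shown above (lines like "Memory : 479M"); adjust the regular expression if your Passenger version formats it differently.

#!/usr/bin/env ruby
# Illustrative sketch: sum the memory reported by `passenger-status`.
# Assumes the plain-text output format shown above, e.g. "Memory : 479M".
output = `passenger-status 2>&1`
abort "passenger-status produced no output (try running it with rvmsudo)" if output.empty?
sizes_mb = output.scan(/Memory\s*:\s*(\d+)M/).flatten.map(&:to_i)
total_mb = sizes_mb.inject(0, :+)
puts "Passenger processes: #{sizes_mb.size}, total memory: #{total_mb}M"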

If there is any process taking 100% CPU, we need to see what request each of the Passenger processes is handling at that point in time.

For that, run: rvmsudo passenger-status --show=requests

The output will look something like this:

"1-12439" : {
   "connected_at" : {
      "local" : "Mon Mar 25 12:25:08 2019",
      "relative" : "18h 13m 38s ago",
   },
   "connection_state" : "ACTIVE",
   "current_request" : {
      "app_response_http_state" : "PARSING_HEADERS",
      "host" : "www.xxx.com",
      "last_data_receive_time" : {
         "local" : "Mon Mar 25 12:25:08 2019",
         "relative" : "18h 13m 38s ago",
      },
      "method" : "GET",
      "path" : "/api/v1/abc/xxx?d=ca550bcf-a2d9-4b33-ab08-dad77ada9e26&no_of_days=0&c=94163703-0b71-41e2-bfb3-efda8a730d7f&start_date=2019-04-01T00:00:00Z",
      "refcount" : 1,
      "request_body_already_read" : 0,
      "request_body_fully_read" : true,
      "request_body_type" : "NO_BODY",
      "response_begun" : false,
      "session" : {
         "gupid" : "18b1430-TPer5IOkoi",
         "pid" : 25166
      },

From the above output we can see that the process with "pid" : 25166 is stuck while handling a request (shown in the "path" field). In this case it is "/api/v1/abc/xxx?d=ca550bcf-a2d9-4b33-ab08-dad77ada9e26&no_of_days=0&c=94163703-0b71-41e2-bfb3-efda8a730d7f&start_date=2019-04-01T00:00:00Z".

The "relative" field shows how long this request has been processing.

The above output shows that the process is stuck while serving the "xxx" endpoint, which indicates a code issue in that endpoint. But an endpoint may internally call many functions, so it is still difficult to pinpoint the exact piece of code causing the problem. We need a tool that can tell us what is going on 'inside' a program/process while it executes, or what another program was doing at the moment it crashed.

For this we will use GDB.

Debugging Using GDB

First install gdb if not already installed:

sudo yum install gdb

Then run the following command to attach GDB to the process:

gdb -p <pid>

Getting a Ruby backtrace out of gdb involves interacting with the Ruby interpreter inside the running process we attached to, and for that the interpreter needs to be in a semi-working state.

Most mechanisms for dumping Ruby backtraces will output them to stdout or stderr. First we need to redirect the output streams to an accessible location. To do this, we rely on POSIX conventions. stdout and stderr are file descriptors 1 and 2 on POSIX-compliant systems. We want to close these file descriptors, and reopen them attached to files that we can actually read from. Closing is easy:

(gdb) call (void) close(1)

(gdb) call (void) close(2)

The call command tells gdb to invoke close from within the target process. Next, we need to re-associate file descriptors 1 and 2 with a file we can actually read from. We could use a file on the filesystem, but it’s even more convenient to see the output directly in gdb. To do that, we want to figure out the device file corresponding to our current TTY and open it twice. Since file descriptors are assigned sequentially, descriptors 1 and 2 will end up being re-associated with our TTY device. Here’s how this looks:

(gdb) shell tty

/dev/pts/0

(gdb) call (int) open("/dev/pts/0", 2, 0)

$1 = 1

(gdb) call (int) open("/dev/pts/0", 2, 0)

$2 = 2

Now, any output generated by the target process will be echoed directly to our console so we can see it in our gdb session.

Finally, we need to get the Ruby interpreter in the target process to spit out a backtrace. For this we can use gdb to call the rb_backtrace() function from the Ruby interpreter directly:

(gdb) call (void)rb_backtrace()

Subset of the output in our case looks something like this:

from /ebs_store/xxx/vendor/cache/ruby/2.3.0/gems/activesupport-4.2.0/lib/active_support/core_ext/range/each.rb:7:in `each'

from /ebs_store/xxx/app/models/d.rb:1938:in `block (2 levels) in yyy'

from /ebs_store/xxx/app/models/s.rb:1279:in `yyy'

from /ebs_store/xxx/app/models/s.rb:1279:in `<'

This clearly shows that the problem was in the yyy function. It even prints the lines where the process is stuck, which narrows the culprit code down to a few tens of lines.

In our case it was a while loop whose loop variable was not incrementing because of a configuration issue, which caused an infinite loop. Once the cause was identified, the fix was pretty straightforward.
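To make that concrete, here is a hypothetical Ruby sketch of the kind of bug we found; the names and values are made up for illustration, but the pattern is the same: the loop step came from configuration, and a misconfigured step of 0 meant the loop condition never changed, so the request spun at 100% CPU forever.

require 'date'
# Hypothetical reconstruction of the bug: the loop advances by a configured
# step, so a step of 0 means `current` never moves and the loop never ends.
step_days = 0 # came from config; should have been >= 1
current = Date.new(2019, 4, 1)
end_date = Date.new(2019, 4, 30)
while current < end_date
  # ... per-day work happened here ...
  current += step_days # adding 0 never advances `current`, so this spins forever
end

The fix was essentially to validate the configured step (and fall back to a sane default) before entering the loop, so a bad config value could no longer send the process into an infinite loop.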
