We Solved Our Rails Memory Leaks With jemalloc

Daniel Desmeules
Published in motive-eng
Feb 14, 2022

In our Ruby on Rails system, we found ourselves having to restart our Puma processes preemptively on a regular basis to prevent them from running out of memory at peak hours. After integrating jemalloc, we no longer need this preventive measure; our servers’ memory usage remains stable.

If you follow our KeepTruckin blog posts, you may recall that we have a Ruby on Rails monolith that handles all of our processing from thousands of ELD devices. The ELD devices transfer data from the road to our servers; this data is used to power a number of customer features, such as tracking vehicle locations, sending out dispatches, and tracking drivers’ safety at every driving event.

Our Puma application servers frequently ran out of memory. This was a major risk to our business because these web servers process almost all of our APIs. We also have backburner servers, which we use as background job processors, and scheduler servers, which also run background jobs. When we noticed memory problems, the biggest impact was on the application servers. Our most heavily used backburners were also restarting, but less frequently than our Puma servers.

Our Quick Fix Still Left Things Broken

We made a quick fix: a Jenkins job that restarted the Puma service whenever memory exceeded a certain threshold (a rough sketch of that kind of check follows the list below). This unsatisfactory short-term solution had many operational and performance implications:

  • It placed us at risk should any Jenkins job fail to complete the restart. The machines in question could become completely inaccessible due to their high memory usage, which left us vulnerable to a major incident.
  • It didn’t eliminate latency altogether. Between the moment memory usage started to climb and the moment the Puma process was finally restarted, requests could still see higher latency.
  • If any new memory leaks arose, they would be obscured by our restarting “bandaid.”
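
For illustration only, the check behind that Jenkins job amounted to something like the sketch below. The threshold, the service name, and the use of systemctl are assumptions for the example, not our exact setup.

    #!/usr/bin/env bash
    # Sketch of a memory-threshold restart check (illustrative values only).
    THRESHOLD_KB=$((12 * 1024 * 1024))   # restart above roughly 12 GB resident memory

    # Sum the resident set size (RSS) of all Puma processes, in kilobytes.
    puma_rss_kb=$(ps -C puma -o rss= | awk '{sum += $1} END {print sum + 0}')

    if [ "$puma_rss_kb" -gt "$THRESHOLD_KB" ]; then
      echo "Puma RSS ${puma_rss_kb} kB exceeds threshold, restarting"
      sudo systemctl restart puma
    fi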

By this point we were preemptively restarting our Puma servers multiple times a day to keep them from running out of memory, far more often than when we first implemented the Jenkins fix. The load on our services had grown substantially as KeepTruckin grew, and it became urgent to find a solution that wouldn’t require services to be restarted.

Why Were Our Processes Growing?

There were a few possible causes:

  • Were there actual memory leaks in the code?
  • Was local caching increasing process size? We have a hash-based caching layer, and we suspected that not all entries were being released properly.
  • Was memory becoming fragmented? This was a likely possibility; fragmentation is a known issue when Ruby runs on Linux with the standard malloc. The usual recommendation for this pain point is to switch to jemalloc, an open-source replacement memory allocator that emphasizes fragmentation avoidance and scalable concurrency support. People were seeing successful results, as in this example and this one.

Instead of chasing unknown culprits, we went straight to the common source of many of our challenges: known issues with Ruby. We began with jemalloc, thinking, “If this solves the problem, we won’t need to investigate for leaks and local caching.”

We Experimented With jemalloc

Because memory fragmentation was already a known issue in Ruby systems like ours, we began to experiment with jemalloc. You can enable jemalloc in two ways:

  • Install the jemalloc package on the host and set the environment variable LD_PRELOAD=path_to_shared_library. This is the easiest way to test jemalloc, but it relies on the Ruby binary being dynamically linked against the system’s shared libraries so the preloaded allocator can take over. We added wrappers to start our processes with this variable. This option also allowed us to control the rollout using a configuration entry, making it easier and safer to enable in production outside of a normal deployment. (A sketch of both options follows this list.)
  • Build a custom Ruby integrated with jemalloc. This second option would have required changing our Amazon Machine Images (AMIs), and therefore was not our preferred approach. The rest of the integration would have been simpler this way, though, because the wrappers would have been unnecessary.
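
To make the two options concrete, here is roughly what each looks like. The package names, library path, and configure invocation below are typical for a Debian/Ubuntu host and are assumptions for illustration, not our exact commands.

    # Option 1: preload jemalloc at process start (no Ruby rebuild required).
    # The shared-library path varies by distribution; this one is common on Ubuntu.
    sudo apt-get install -y libjemalloc2
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 bundle exec puma -C config/puma.rb

    # Option 2: build a Ruby that links against jemalloc directly.
    # This needs the jemalloc development headers and, in our case, new AMIs.
    sudo apt-get install -y libjemalloc-dev
    ./configure --with-jemalloc && make && sudo make install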

We selected the first approach and started with Puma, because our biggest problem lay in that process. We modified the Puma start script to add the LD_PRELOAD environment variable when the library was present and the feature was enabled in the config. We then verified that jemalloc was actually loaded by checking the memory mappings of the Puma process through the /proc filesystem.
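
A simplified version of that wrapper logic might look like the sketch below; the library path, the flag file standing in for our config entry, and the Puma invocation are placeholders rather than our actual scripts.

    #!/usr/bin/env bash
    # Simplified Puma start wrapper: preload jemalloc only when the library
    # exists and the rollout flag is enabled (the flag file is a stand-in
    # for our real configuration entry).
    JEMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

    if [ -f "$JEMALLOC_LIB" ] && [ -f /etc/app/jemalloc_enabled ]; then
      export LD_PRELOAD="$JEMALLOC_LIB"
    fi

    exec bundle exec puma -C config/puma.rb

Once the process is up, its memory mappings show whether the preload took effect:

    # The jemalloc shared object appears in the process maps when the preload worked.
    grep jemalloc /proc/"$(pgrep -o -f puma)"/maps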

Our Results

The results were impressive, much better than we expected. The Puma processes ran for more than four days without a single restart. Memory consistently stayed around 1.2–1.3 GB; it would sometimes climb higher, but it would also come back down. This is a positive side effect of jemalloc returning memory to the system when it’s no longer needed (something that standard malloc doesn’t do even without fragmentation).

We concluded that our issue was indeed memory fragmentation, as opposed to any actual memory leak in the code. The clearest way to describe our results is that we have never had to restart our Puma servers since enabling jemalloc. (We decided to keep the Jenkins job just in case some new memory leak arises later.)

Two Pictures Are Worth 1,000 Words

The graph below illustrates a Puma instance without jemalloc. Notice the rapid increase in memory usage.

Now, this graph (below) shows our results after enabling jemalloc. The process depicted in this graph had been running for two days.

jemalloc Everywhere

Recall that memory fragmentation impacted our application servers the most. It is understandable that it had a smaller impact on our backburner and scheduler servers: their jobs run in a single thread, processing one job at a time, and memory is allocated and fully released after every job, so fragmentation has much less opportunity to build up. We still enabled jemalloc on our backburner and scheduler processes as well.

Come join us!

Check out our latest KeepTruckin opportunities on our Careers page and visit our Before You Apply page to learn more about our rad engineering team.
