Chasing Lock Contention in Our Ruby On Rails Servers

Daniel Desmeules
Published in motive-eng
Feb 28, 2022

At KeepTruckin we use a StatsD client to collect RoR system metrics and send them to Datadog. We were using statsd-instrument, a StatsD client library for Ruby, to communicate with the Datadog StatsD agent. While using this client in production, we were surprised to discover serious lock contention leading to an unacceptably large number of blocked threads. In this post, we describe how we detected and resolved this lock contention.

We Use a StatsD Client for RoR

KeepTruckin relies on a Ruby on Rails (RoR) monolith to power a variety of the features used by our customers. Our RoR Puma servers process data from hundreds of thousands of ELD devices that enable us to track vehicle and asset locations, monitor driver safety, and issue dispatches, to name a few examples.

We use statsd-instrument as our client to collect server throughput metrics and send them to Datadog.

The StatsD protocol is widely used across the industry, credited with being simple, fast, and lightweight.

The Discovery: Synchronous Work, No Buffering

Because of its solid reputation, we had to blink a few times before understanding that StatsD was causing the latency. The irony was that the metrics used to study performance had themselves become a bottleneck. We set out to remove that bottleneck and improve scalability. (Note that this blog describes lock contention problems we faced in early 2021. Since then, updates to the statsd-instrument library have resolved these issues. Back in early 2021, however, statsd-instrument did not offer asynchronous processing or buffering. These features were added in version 3.2.0 and we were using version 2.9.2.)

StatsD in Ruby

In investigating how statsd-instrument worked, we discovered that all work is done synchronously in the calling thread, and that the client doesn't support buffering multiple events. This added overhead to every request and occasionally caused long pauses when too many metrics were being created concurrently. In some cases, we even observed that using StatsD within a database transaction caused the "idle in transaction" problem, where a connection holds a transaction open with no queries running because the thread that started it is busy doing other work.
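To make that last hazard concrete, here is a small self-contained sketch. `FakeConnection` and `send_metric_synchronously` are stand-ins invented for illustration, not our production code: the point is simply that any blocking work done while a transaction is open extends the "idle in transaction" window.

```ruby
# Sketch of the hazard: a DB transaction stays open while the calling
# thread waits on a synchronous metric send.
class FakeConnection
  attr_reader :idle

  # Stand-in for a checked-out database connection running a transaction.
  def transaction
    opened_at = Time.now
    yield
  ensure
    @idle = Time.now - opened_at # how long the transaction was held open
  end
end

def send_metric_synchronously
  sleep 0.05 # stand-in for a blocking StatsD socket write
end

conn = FakeConnection.new
conn.transaction do
  # ... queries would run here ...
  send_metric_synchronously # the transaction cannot commit until this returns
end
puts format("transaction held open for %.2fs", conn.idle)
```

With an asynchronous client, the send returns immediately and the transaction window shrinks back to the time the queries actually need.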

Lock Contention

Ruby on Rails keeps us on our toes with various challenges, most of which are widely known and discussed in the RoR community, and already have a number of solution options in the form of libraries and extensions. Lock contention, however, turned out to be a rarely discussed issue.

The client we were using to talk to StatsD had no buffering and was not asynchronous. This meant that every time statsd-instrument sent a metric, it sent it synchronously using the calling thread, making us wait for the metric to be sent. In this scenario, the server’s own processing is affected by the simple act of sending metrics.

The statsd-instrument client has a lock inside, so even if there had been multiple threads trying to send metrics, only one thread at a time could take the lock and send the metric. We had 16 threads, and the more metrics we tried to send, the more contention there was for that lock.
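The effect of that single lock can be shown with a toy simulation in plain stdlib Ruby. `SyncClient` is illustrative, not the statsd-instrument source: it just models a send path serialized by one `Mutex`, which forces concurrent threads to queue.

```ruby
require "benchmark"

# Toy model of a client whose send path is guarded by a single Mutex,
# as our 2.9.x client's was.
class SyncClient
  def initialize
    @lock = Mutex.new
  end

  def increment(_metric)
    @lock.synchronize { sleep 0.001 } # stand-in for the synchronous socket write
  end
end

client = SyncClient.new
elapsed = Benchmark.realtime do
  threads = 16.times.map do
    Thread.new { 10.times { client.increment("requests.count") } }
  end
  threads.each(&:join)
end

# Because the lock serializes every send, the 160 sends cannot overlap:
# wall time approaches the sum of all the individual waits.
puts format("elapsed: %.3fs", elapsed)
```

The more threads and the more metrics, the longer each thread spends parked on that one lock rather than serving requests.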

The way we run Puma at KeepTruckin will make our point about lock contention clearer.

How We Run Puma

We use MRI as our Ruby interpreter; its global interpreter lock allows only one thread per process to execute Ruby code at a time. Although every machine has multiple CPUs, a single Puma worker process can therefore only use one CPU.

We use 10 worker processes with 16 worker threads on every Puma machine (we chose that combination to maximize the usage of our 16 CPUs while preserving good latency). There was only one StatsD client per worker, so we had up to 16 threads all trying to send metrics using one synchronous client instance. This is what led to the contention in our servers.
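The shape of that setup, in Puma's own configuration DSL, looks roughly like this (a sketch using the worker and thread counts from this post, not our exact production file):

```ruby
# config/puma.rb (illustrative)
# 10 forked worker processes to use the machine's CPUs despite MRI's
# global interpreter lock; 16 threads per worker for I/O-bound concurrency.
workers 10
threads 16, 16

# Load the app before forking so workers share memory via copy-on-write.
preload_app!
```

Each of those 10 workers held one StatsD client shared by its 16 threads, which is exactly the contention scenario described above.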

The Solution: Change Client

We decided to switch to a different StatsD client: Datadog's own dogstatsd-client. For starters, the dogstatsd-client buffers messages together, so there is less traffic overall. Additionally, it is asynchronous: messages are handed off to a separate thread so the calling threads are not blocked. Its benefits included:

  • Asynchronous operations that prevent blocking the calling threads
  • Buffering, which allows multiple metrics to be sent in one message
  • Support for Unix Domain Socket, which is more efficient and reliable than UDP
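Wiring the new client up is a few lines in an initializer. This is a hedged sketch: the constructor options shown (`socket_path`, `namespace`) exist in the dogstatsd-ruby gem, but check the README for the exact options supported by your version, and the socket path below is an assumption based on the Datadog agent's default.

```ruby
# config/initializers/statsd.rb (illustrative)
require "datadog/statsd"

# Talk to the local Datadog agent over a Unix Domain Socket instead of UDP.
STATSD = Datadog::Statsd.new(
  socket_path: "/var/run/datadog/dsd.socket",
  namespace: "keeptruckin"
)
```

From the caller's point of view nothing changes: `STATSD.increment` and friends return immediately, and the client flushes buffered metrics in the background.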

We created our own StatsD wrapper, which exposes the same API as the statsd-instrument library we used before. The actual implementation, that is, the choice between statsd-instrument and the dogstatsd-client, can be changed using a configuration variable. We also added a custom RuboCop rule to prevent the statsd-instrument API from being used directly in future metrics, as it would not function properly.
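The wrapper idea can be sketched as follows. The module and backend classes here are hypothetical stand-ins (in our app the backends wrapped the real statsd-instrument and dogstatsd clients); the point is that one module exposes a stable API and a configuration flag selects the implementation behind it.

```ruby
module Metrics
  # Stub standing in for the synchronous statsd-instrument client.
  class InstrumentBackend
    def increment(metric, tags: [])
      puts "sync #{metric} #{tags.join(',')}"
    end
  end

  # Stub standing in for the buffered, asynchronous dogstatsd-client.
  class DogstatsdBackend
    def increment(metric, tags: [])
      puts "async #{metric} #{tags.join(',')}"
    end
  end

  # A configuration variable (here an env var) picks the implementation.
  def self.backend
    @backend ||=
      ENV["STATSD_BACKEND"] == "dogstatsd" ? DogstatsdBackend.new : InstrumentBackend.new
  end

  # Same call shape as before, so call sites don't change.
  def self.increment(metric, tags: [])
    backend.increment(metric, tags: tags)
  end
end

Metrics.increment("jobs.enqueued", tags: ["queue:default"])
```

Because every call site goes through `Metrics`, swapping clients later is a one-line configuration change rather than a codebase-wide migration.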

Improved Performance

The results have been quite positive: in some cases, latencies went from about 2.2 ms to 1.0 ms. The graph below was captured after production had been running the dogstatsd-client for a few hours (the client switch was performed around 13:00). It shows the difference in performance on our cache metrics. These metrics measure the time it takes to get values from our local and remote caches; in other words, memory vs. Redis.

Notice that the latency on the local cache calls went from about 1.5 ms to 0.75 ms. This is probably the best-case scenario, but it does demonstrate the effect the contention was having. The effect is also detectable on the remote cache calls, but it's less visible because the overall latency is higher.

Looking Back, Going Forward

The old library we were using in early 2021 now supports both asynchronous work and buffering. However, it was not compatible with our older Ruby version, and so, at the time, we had no choice but to change clients. This turned out to be a happy constraint, because we like the new dogstatsd-client; it supports all the extensions we use and is developed by the same people as the Datadog agent, so we can assume full compatibility.

Caching (above) is only one high-frequency StatsD usage example. Other code paths that emit metrics heavily should see similar benefits.

Come Join Us!

Check out our latest KeepTruckin opportunities on our Careers page and visit our Before You Apply page to learn more about our rad engineering team.
