How ActionCable broke Puma
Four little-known facts that are often ignored:
1. POSIX systems (Linux, BSD, macOS) use "file descriptors" for sockets.
2. These file descriptors are numerical values that usually translate to an index in a kernel array linked to the process.
3. POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) instead. This effectively limits each Puma process to 1023 open file descriptors, including sockets, database connections, open files, etc. That's usually okay, as long as the number of concurrent connections isn't too high or the server is never idle.
4. Using ActionCable drives the number of concurrent connections upwards, potentially breaking the select system call used by Puma. Avoiding this requires an application server that uses poll, such as the iodine application server.
The Tipping Point
It's interesting to note that the designers of ActionCable knew that select isn't a good choice for persistent connections. After all, persistent connections stay alive longer, making it much easier to hit the 1023 limit.
What the team seems not to have considered is this:
- ActionCable holds on to 1023 concurrent connections that are happily doing nothing.
- Puma calls select and breaks the underlying system.
Why Didn't Anyone Seem To Notice?
There are two reasons why this isn't often detected:
Reason number 1:
ActionCable's performance isn't amazing. A quick review of ActionCable's performance (on any server) shows that ActionCable shouldn't be used with more than 1,000 concurrent connections anyway.
In fact, by 3,000 clients, ActionCable uses close to 1GB of memory, requiring us to scale our application horizontally and hiding the issue.
Note that the benchmarks I link to (doubt them if you want) show that ActionCable+Puma's max-rtt (maximum round-trip time) for 2,000 connections is 3,878ms, which is higher than iodine's max-rtt of 3,579ms for 20,000 connections(!).
But wait, don't these benchmarks prove that the select system call isn't broken…?
No, which brings us to reason number 2:
During these benchmarks, as well as other tests we often run, Ruby never calls the select system call, because the connections are always busy.
The Linux kernel doesn't impose a limit on select (libc does)… which means that on Linux we end up writing overflowing bits onto libc structures, corruption that will come back to haunt us further down the road.
On BSD and macOS we wouldn't be so "lucky", since the kernel itself imposes limitations on select. On these systems we are likely to experience failures much sooner.
This is pretty similar to gambling, which isn't great for production systems.
What To Do?
ActionCable could be safely used with any application server that supports hijack and uses poll or kqueue rather than select. This includes iodine and (I believe) Passenger. It would, however, still inherit ActionCable's poor performance.
However, if performance and stability matter to you, switch to either AnyCable or iodine's native WebSocket and pub/sub support.
Author: Boaz Segev (a.k.a. Bo) is the author of the iodine gem and the facil.io C framework.