How ActionCable broke Puma
Four little-known facts that are often ignored:
1. POSIX systems (Linux, BSD, macOS) use “file descriptors” for sockets.
These file descriptors are numerical values that usually translate to an index into a per-process kernel array.
2. The select system call breaks when the file descriptor value is over 1023 (on some systems 2047). On Linux, this is a libc restriction rather than a kernel restriction:
POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) instead.
3. The Puma application server for Ruby uses select under the hood (I’m waiting for this to change).
This effectively limits each Puma process to 1023 open file descriptors, including sockets, database connections, etc.
This is usually okay, assuming the number of concurrent connections isn’t too high or the server isn’t idle.
4. Using ActionCable drives the number of concurrent connections upwards, potentially breaking the select system call used by Puma.
Avoiding this requires an application server that uses poll, such as the iodine application server.
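The second fact is easier to picture with a small model. The sketch below is not glibc's actual C code. It assumes a 64-bit Linux layout in which fd_set is a fixed array of 64-bit words with one bit per descriptor, and the helper names (fd_set_slot, fits_in_fd_set?, checked_fd_set) are made up for illustration. It shows where the bit for a given descriptor would land, and that descriptor 1024 already points one word past the end of the array.

```ruby
# Simplified model of a fixed-size fd_set (assumption: 64-bit Linux with
# glibc, where the set is an array of 64-bit words, one bit per descriptor).
FD_SETSIZE    = 1024
BITS_PER_WORD = 64
WORDS         = FD_SETSIZE / BITS_PER_WORD # 16 words in this model

# Returns [word_index, bit_index]: where setting the bit for +fd+ would land.
def fd_set_slot(fd)
  [fd / BITS_PER_WORD, fd % BITS_PER_WORD]
end

# True when the bit for +fd+ lies inside the fixed-size array.
def fits_in_fd_set?(fd)
  fd >= 0 && fd_set_slot(fd).first < WORDS
end

# Hypothetical checked variant: raises instead of writing past the end.
# The real FD_SET macro performs no such check in this model's scenario.
def checked_fd_set(words, fd)
  unless fits_in_fd_set?(fd)
    raise RangeError, "fd #{fd} does not fit in an fd_set of #{FD_SETSIZE} bits"
  end
  word, bit = fd_set_slot(fd)
  words[word] |= (1 << bit)
  words
end
```

In this model, fd_set_slot(1023) is the last bit of the last word, while fd_set_slot(1024) points at word 16, which does not exist in a 16-word array.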
The Tipping Point
It’s interesting to note that the designers of ActionCable knew that select isn’t a good choice for persistent connections. After all, persistent connections will stay alive for a longer time, so it’s much easier to hit the 1023 limit.
This is why ActionCable uses the nio4r gem, which offers support for polling mechanisms other than select.
What, it seems, the team didn’t consider is this:
- ActionCable holds on to 1023 concurrent connections that are happily doing nothing.
- Puma calls select and breaks the underlying system.
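To get a rough idea of how close a running process is to this tipping point, one can look at the highest descriptor number it currently holds. This is a minimal sketch, assuming a Linux system where /proc/self/fd lists the process's open descriptors and assuming the glibc FD_SETSIZE of 1024. The helper names are hypothetical and are not part of Puma or ActionCable.

```ruby
FD_SETSIZE = 1024 # glibc's value, as quoted from the select(2) manual

# Highest descriptor number currently open in this process, or nil when the
# descriptor directory is unavailable ("/proc/self/fd" is Linux-specific).
def highest_open_fd(fd_dir = "/proc/self/fd")
  return nil unless Dir.exist?(fd_dir)
  Dir.children(fd_dir).filter_map { |name| Integer(name, exception: false) }.max
end

# How many descriptor numbers above +max_fd+ select could still handle.
# Zero means the last usable number (1023) is taken; a negative value means
# a descriptor is already outside select's range.
def select_headroom(max_fd, limit = FD_SETSIZE)
  (limit - 1) - max_fd
end
```

Calling select_headroom(highest_open_fd) from inside a worker (when highest_open_fd is not nil) gives a number that shrinks as idle WebSocket connections pile up.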
Why Didn’t Anyone Seem To Notice?
There are two reasons why this isn’t often detected:
Reason number 1:
ActionCable performance isn’t amazing… a quick review of ActionCable’s performance (on any server) shows that ActionCable shouldn’t be used with more than 1,000 concurrent connections anyway.
In fact, at 3,000 clients, ActionCable can be expected to use close to 1GB of memory, which forces us to scale our application horizontally and hides the issue.
Note that the benchmarks I link to (doubt them if you want) show that the ActionCable+Puma max-rtt (maximum round-trip time) for 2,000 connections is 3878ms, which is higher than iodine’s max-rtt of 3579ms for 20,000 connections(!).
But wait, don’t these benchmarks prove that the select system call isn’t broken…?
No, which brings us to reason number 2:
During these benchmarks, as well as other tests we often run, Ruby never calls the select system call because the connections are always busy.
However, when the connections aren’t busy (no benchmark / test is running), the select call is performed and the whole thing starts to fall apart.
The Linux kernel doesn’t impose a limit on select (libc does)… which means that on Linux we are writing bits past the end of libc structures, which will come back to haunt us further down the road.
On BSD and macOS, we wouldn’t be so “lucky”, since the kernel itself imposes limitations on select. On these systems we are likely to experience failures much sooner.
This is pretty similar to gambling, which isn’t great for production systems.
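Whether a given process can reach this territory at all depends on its descriptor limit. The sketch below compares the soft RLIMIT_NOFILE limit with FD_SETSIZE, again assuming the glibc value of 1024. The helper name limit_exceeds_fd_set? is made up for illustration.

```ruby
FD_SETSIZE = 1024 # assumption: glibc's value

# True when the soft descriptor limit allows descriptor numbers that a
# fixed-size fd_set cannot represent (a limit of 1024 allows 0..1023 only).
def limit_exceeds_fd_set?(soft_limit, fd_setsize = FD_SETSIZE)
  soft_limit > fd_setsize
end

# Process.getrlimit returns [soft_limit, hard_limit] for the resource.
soft, _hard = Process.getrlimit(Process::RLIMIT_NOFILE)
if limit_exceeds_fd_set?(soft)
  warn "RLIMIT_NOFILE soft limit (#{soft}) allows descriptors beyond select's range"
end
```

A limit above 1024 doesn’t break anything by itself. It only means the process is allowed to receive descriptor numbers that select cannot handle.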
What To Do?
For those of you who thought that iodine or AnyCable were just about performance: bad news, they are stability requirements.
ActionCable could be safely used with any application server that supports hijack and uses a mechanism such as poll or kqueue rather than select. This includes iodine and (I believe) Passenger. Keep in mind that it also means living with ActionCable’s poor performance.
However, if performance and stability are meaningful for you, switch to either AnyCable or iodine’s native WebSocket and pub/sub.
Author: Boaz Segev (a.k.a. Bo) is the author of the iodine gem and the facil.io C framework.