Cluster fucks when scaling Socket.IO

Or why all Socket.IO store-based modules are fundamentally broken.

Arnout Kazemier
4 min read · Mar 7, 2014

When Socket.IO 0.7 got released it was a complete rewrite of the old Socket.IO code and came with a new concept called stores. It was said to be the magical scale sauce that would allow you to scale to multiple processes and servers just by changing the store option. Heck, you could even scale without needing a sticky load balancer, because those are so hard to set up, right? Socket.IO shipped with a MemoryStore by default, and work was done on a RedisStore that would make scaling across multiple processes possible using the built-in Pub/Sub channels of Redis.
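
For reference, wiring up the RedisStore looked roughly like this in the 0.9 era (the exact require path and option names may differ slightly between versions):

```js
var sio = require('socket.io')
  , RedisStore = require('socket.io/lib/stores/redis')
  , redis = require('redis');

var io = sio.listen(8080);

// Swap the default MemoryStore for the RedisStore so multiple processes
// can "share" connection state through Redis Pub/Sub.
io.set('store', new RedisStore({
    redisPub: redis.createClient()
  , redisSub: redis.createClient()
  , redisClient: redis.createClient()
}));
```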

The RedisStore was never formally announced, but people started using it anyway as it was being recommended on Stack Overflow and other message boards. The community picked up the store concept and started building bindings to other message queues and databases such as Memcached, MongoDB, ØMQ, Azure Service Bus and many more, all of which worked through the Socket.IO store interface.

The store concept of Socket.IO is built on the idea of syncing all the connection data between every connected node within your cluster. It doesn’t matter which kind of store you are using, as this syncing is built into Socket.IO’s store interface itself, not into the individual stores. If you didn’t already throw up in your mouth a bit when you read the word syncing, I’ll explain the issues in closer depth.

So in order to understand why syncing is bad for Socket.IO stores, we first need to know what is synced. We can find this information in the initStore function of the Socket.IO manager. It boils down to (a rough sketch of the resulting payload follows the list below):

- Handshake data; this includes ALL request headers, query strings, IP address information, URLs and possibly custom data that you’ve added during authorization.
- IDs of all connections that are open, connected and even closed.
- Room names and each ID that has joined the room.
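
Roughly, every Socket.IO process in the cluster ends up holding a structure like the one below for every connection anywhere in the cluster. The field names here are illustrative, not Socket.IO’s literal internals:

```js
// Illustrative sketch of the state every node keeps for *every*
// connection in the cluster (field names are approximate).
var syncedState = {
  handshaken: {
    'fNuPg9dkx2k3S1kAAAAC': {
      headers: { /* ALL request headers, cookies included */ },
      address: { address: '203.0.113.7', port: 52013 },
      url: '/socket.io/1/',
      query: { t: '1394150400000' /* plus whatever the client appended */ }
      // anything you attached during authorization ends up in here too
    }
  },
  connected: { 'fNuPg9dkx2k3S1kAAAAC': true },
  open: { 'fNuPg9dkx2k3S1kAAAAC': true },
  closed: {},
  rooms: { '/chat': ['fNuPg9dkx2k3S1kAAAAC'] }
};
```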

All this data will be synced through pub/sub to every connected Socket.IO server. So if you have 2 Node processes and they both use Socket.IO stores, they will both hold all the data of all connections in their own process memory. Not in Redis, as you might assume. It might not be an issue if you have 500 connected users, but once you approach 5,000+ connections this adds up quickly.
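
A quick back-of-envelope calculation makes the duplication visible (the per-handshake size is an assumption, but a couple of kilobytes of headers and cookies is not unusual for a browser):

```js
// Back-of-envelope: assume ~2 KB of headers + query data per handshake.
var perHandshakeBytes = 2 * 1024;
var connections = 5000;
var processes = 4;

// Every process holds a copy of every connection's handshake data.
var duplicated = perHandshakeBytes * connections * processes;
console.log((duplicated / 1024 / 1024).toFixed(1) + ' MB'); // ~39.1 MB
```

And that is just the resting state; it says nothing yet about the serialization work needed to keep it in sync.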

If you’ve read this carefully and start to understand the concept, you’ll see it can actually be used as an attack vector against Socket.IO servers which are connected using the Socket.IO stores. The attacker only needs a script that initiates handshake requests against your server with the longest query string allowed and as many custom HTTP headers as possible. You could easily blow up a server in a matter of seconds: all this data is serialized using JSON.stringify, which blocks the event loop during serialization, and when a process receives such a large packet it needs to be parsed again by JSON.parse and is eventually stored in the V8 heap of your Node process. This will eventually cause it to explode out of memory due to the V8 heap limitation of 1.7 GB, and your whole cluster will be FUBAR.
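
To get a feel for the serialization cost alone, here is a small standalone sketch. It doesn’t touch Socket.IO at all; it just builds a handshake-sized object of the kind an attacker could produce and measures how long the event loop stays blocked while stringifying and parsing a burst of them:

```js
// Build a bloated fake handshake: lots of junk headers + an oversized query.
var headers = {};
for (var i = 0; i < 100; i++) {
  headers['x-junk-' + i] = new Array(8 * 1024).join('a'); // ~8 KB per header
}
var handshake = {
  headers: headers,
  query: new Array(16 * 1024).join('b'),
  address: { address: '203.0.113.7', port: 1337 }
};

// Every node serializes the handshakes it publishes and parses the ones it
// receives; simulate a burst of such handshakes hitting one process.
var start = Date.now();
for (var n = 0; n < 1000; n++) {
  JSON.parse(JSON.stringify(handshake));
}
console.log('event loop blocked for ' + (Date.now() - start) + ' ms');
```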

But storing data isn’t the only problem. Usually when you want to scale out, you want to be able to add new servers without taking your whole cluster down. This is impossible with Socket.IO. When you add a new server you need to take your whole cluster down, because the synced data is not stored in a central location. It’s only stored in each process’s heap, and a process that joins the cluster doesn’t ask the other Node processes for the existing data. It’s a completely blank and empty server, so it cannot accept existing connections and will close them with an “unknown handshake” error. The only way to combat this is to take your whole cluster down, add the new Node processes/servers, bring them all online at the same time and start accepting connections again.
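
Conceptually, the rejection on a freshly started process boils down to something like this. This is not Socket.IO’s literal source, just the gist, and the names are illustrative:

```js
// `handshaken` holds only the handshakes this process has seen via pub/sub.
// A freshly started process boots with it empty and never asks its
// siblings for the backlog.
var handshaken = {};

function onClientConnect(id, socket) {
  if (!handshaken[id]) {
    // Every existing client that happens to be routed here is turned away.
    socket.end('unknown handshake');
    return;
  }
  // ...otherwise the connection would be accepted and wired up here.
}

// A reconnecting client with an ID the fresh process has never seen:
onClientConnect('fNuPg9dkx2k3S1kAAAAC', { end: console.log });
```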

The following issue might or might not be specific to the Redis store, but it’s worth mentioning anyway. For every connection that is established, the store subscribes to 5 different channels. So if you have 10,000 concurrent connections you also have 50,000 subscriptions on your Redis server, which causes Redis to use an order of magnitude more CPU than if it communicated over a single channel, or a couple of shared channels, for every connection.
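
As a rough sketch (the channel names below are made up; the multiplication is the point), the per-connection subscription pattern looks roughly like this, compared with a single shared channel:

```js
var redis = require('redis')
  , sub = redis.createClient();

// Per-connection pattern: something like 5 Redis subscriptions per socket.
function subscribeForSocket(socketId) {
  ['message', 'disconnect', 'join', 'leave', 'dispatch'].forEach(function (name) {
    sub.subscribe(name + ':' + socketId);
  });
}
// 10,000 sockets => 50,000 subscriptions Redis has to match on every publish.

// Versus one shared channel per process, fanned out to local sockets:
sub.subscribe('socket.io#broadcast');
```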

So again, let me make this really clear: these issues are not limited to the RedisStore that ships with Socket.IO. This is a problem for EVERY store that uses Socket.IO’s store interface. Scaling your servers using a sticky load balancer with a custom store/broadcast solution should start to sound really tempting by now.
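
A minimal sketch of that alternative, assuming a sticky load balancer so a client always reaches the same process: each process keeps only its own connections in memory (Socket.IO just runs its default MemoryStore) and a single shared Redis channel is used purely for broadcasting.

```js
var sio = require('socket.io')
  , redis = require('redis')
  , pub = redis.createClient()
  , sub = redis.createClient();

var io = sio.listen(8080); // behind a sticky load balancer

// Broadcasting means publishing once to one shared channel...
function broadcast(event, data) {
  pub.publish('broadcast', JSON.stringify({ event: event, data: data }));
}

// ...and every process fans the message out to its *own* sockets only.
sub.subscribe('broadcast');
sub.on('message', function (channel, message) {
  var msg = JSON.parse(message);
  io.sockets.emit(msg.event, msg.data);
});
```

No handshake data ever leaves the process it belongs to, and adding a server is just a matter of pointing the load balancer at it.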


Arnout Kazemier

Founder of Observe.it, Lead Software Engineer at Nodejitsu and passionate open source developer.