Load-balancing Websockets on EC2

How we set up a robust and scalable stack on top of socket.io


For the last few months, we have been working on a completely new Storify Editor which enables real-time collaboration.

Storify real-time collaboration

Stack

The editor is a standalone server built with Node.js and socket.io in order to have two-way, real-time communication between the client and the server. The goal is to push changes to a collaborator’s browser when another collaborator makes a change on the same story, a behavior similar to Google Docs.

In order to have a robust and scalable solution, this feature has to live on many load-balanced instances.

Elastic Load Balancer

Elastic Load Balancing

Amazon’s Elastic Load Balancer (ELB) is the best way to serve traffic across availability zones in an AWS region.

Once we put ELB in front of the Storify Editor boxes, we noticed that it was round-robining between hosts, which broke our sessions.

What is the problem?

As explained by the socket.io authors: “the requests associated with a particular session id must connect to the process that originated them.” This is due to the fact that socket.io uses different transports, and XHR Polling or JSONP-Polling rely on firing several requests during the lifetime of a “socket.”

Let’s take the example of emitting an event to all the users. For users on a bi-directional transport like websockets, socket.io can write to them directly. But users on long-polling might not have an open request we can respond to, or they could be between two requests. In that situation, socket.io has to buffer the messages. The user claims them on their next request, but that request has to reach the same process.
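The buffering described above can be sketched like this (illustrative, not socket.io’s actual internals):

```javascript
// Illustrative sketch (not socket.io's real internals): messages for a
// long-polling client are queued in this process's memory until the
// client's next poll arrives. This is exactly why the client must hit
// the same process every time -- another process has no access to
// this buffer.
const buffers = new Map(); // socketId -> queued messages

function send(socketId, message) {
  if (!buffers.has(socketId)) buffers.set(socketId, []);
  buffers.get(socketId).push(message);
}

// Called when the client's next polling request comes in: flush and
// return everything queued since the last poll.
function onPoll(socketId) {
  const queued = buffers.get(socketId) || [];
  buffers.set(socketId, []);
  return queued;
}

send('abc', 'edit-1');
send('abc', 'edit-2');
console.log(onPoll('abc')); // [ 'edit-1', 'edit-2' ]
console.log(onPoll('abc')); // []
```

If the second poll landed on a different process, its (empty) `buffers` map would silently drop the queued edits.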

ELB has a “Sticky Session” routing policy, but websockets are not supported.

Sticky Sessions

We decided to try HAProxy, since it has very full-featured routing options. So, we stuck HAProxy between the ELB and the Storify Editor instances.

HAProxy’s option “balance source” selects which server to use based on a hash of the user’s IP address. That ensures that a user will connect to the same server.

That sounds easy. But HAProxy didn’t see the client’s source IP: it only saw connections coming from the ELB…

Proxy Protocol

With HTTP based routing, ELB provides the client’s IP address via the X-Forwarded-For header. But TCP based routing has no such feature.

Support for Proxy Protocol, added last year to ELB, allows the backend server to identify the client’s connection information when using ELB’s TCP load balancing.

Full network diagram

As websockets run over TCP, this feature makes it possible for our backend to know who the user is.
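For illustration, here is what a Proxy Protocol v1 header looks like and how a backend could parse it. This is a sketch only; HAProxy’s `accept-proxy` handles this for us:

```javascript
// Sketch of parsing a Proxy Protocol v1 header (HAProxy's accept-proxy
// does this for us; shown only to illustrate what travels on the
// wire). The header is a single human-readable line that the ELB
// prepends to the TCP stream:
//   PROXY TCP4 <client-ip> <proxy-ip> <client-port> <proxy-port>\r\n
function parseProxyV1(line) {
  const parts = line.trim().split(' ');
  if (parts[0] !== 'PROXY') throw new Error('not a Proxy Protocol header');
  const [, protocol, clientIp, proxyIp, clientPort, proxyPort] = parts;
  return {
    protocol,
    clientIp,
    proxyIp,
    clientPort: Number(clientPort),
    proxyPort: Number(proxyPort),
  };
}

const header = 'PROXY TCP4 203.0.113.7 10.0.0.5 56324 3030\r\n';
console.log(parseProxyV1(header).clientIp); // 203.0.113.7
```

Everything after that first line is the client’s original TCP payload, untouched.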

Let me show you how to enable the Proxy Protocol at the ELB level.

1. Create the policy for `elb-storifyeditor`:

aws elb create-load-balancer-policy --load-balancer-name elb-storifyeditor --policy-name EnableProxyProtocol --policy-type-name ProxyProtocolPolicyType --policy-attributes AttributeName=ProxyProtocol,AttributeValue=True

2. Enable the policy for `elb-storifyeditor` for port 3030, on which the backend server listens.

aws elb set-load-balancer-policies-for-backend-server --load-balancer-name elb-storifyeditor --instance-port 3030 --policy-names EnableProxyProtocol

HAProxy configuration

Now that the Proxy Protocol is enabled on the ELB, let’s tell HAProxy to use it:

frontend storify_editor_frontend
bind *:3030 accept-proxy name storify_editor_frontend
maxconn 1000
default_backend storify_editor_backend

Support for Proxy Protocol v2 in accept-proxy was added recently, in HAProxy 1.5.0.

For the backend configuration, we just specify:

  • the check endpoint and timeout
  • the balance algorithm
  • and the list of backend servers
backend storify_editor_backend
timeout check 5000
option httpchk GET /status?all=1
balance source
server storifyeditor1.prod.livefyre.com storifyeditor1.prod.livefyre.com:3030 maxconn 1000 weight 10 cookie websrv1 check inter 10000 rise 1 fall 3
server storifyeditor2.prod.livefyre.com storifyeditor2.prod.livefyre.com:3030 maxconn 1000 weight 10 cookie websrv2 check inter 10000 rise 1 fall 3

Here is the complete HAProxy configuration.

Conclusion

This whole stack gives us a robust way to scale our new feature, which uses socket.io for the communications.

We think that AWS Elastic Load Balancer is the best tool for load-balancing instances in EC2, so we didn’t want to get rid of it. We wouldn’t have used HAProxy if ELB had a TCP option for sticky sessions.

Proxy Protocol came to the rescue, giving HAProxy enough information about the client to route them to the same backend server.

Update

08/06/2016

“It turns out the polling transport breaks on some corporate networks, because they load-balance traffic on the network across multiple external IP addresses.” — @devongovett

The solution was to switch from using Source IP affinity to an application layer persistence solution with a session cookie.

backend storify_editor_backend
timeout check 5000
option httpchk GET /status?all=1
balance roundrobin
cookie storify_editor_host insert indirect nocache
server s1 storifyeditor1.prod.livefyre.com:3030 maxconn 1000 weight 10 cookie s1 check inter 10000 rise 1 fall 3
server s2 storifyeditor2.prod.livefyre.com:3030 maxconn 1000 weight 10 cookie s2 check inter 10000 rise 1 fall 3

On the first request, HAProxy sets a cookie with the host used, and then it uses that cookie to determine the host for subsequent requests.
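The cookie-based persistence above can be sketched like this (illustrative; HAProxy implements this natively with `cookie … insert indirect`):

```javascript
// Illustrative sketch of cookie-based persistence (HAProxy does this
// natively): route by the affinity cookie when present, otherwise pick
// a server round-robin and set the cookie in the response.
const servers = ['s1', 's2'];
let next = 0;

function route(cookies) {
  const pinned = cookies.storify_editor_host;
  if (pinned && servers.includes(pinned)) {
    // Cookie present: sticky, no Set-Cookie needed.
    return { server: pinned, setCookie: null };
  }
  // First request: round-robin, and pin the choice in a cookie.
  const server = servers[next++ % servers.length];
  return { server, setCookie: `storify_editor_host=${server}` };
}

const first = route({});                                   // round-robin + Set-Cookie
const later = route({ storify_editor_host: first.server }); // sticky
console.log(later.server === first.server); // true
```

Unlike source hashing, this persists per browser session rather than per IP, so clients whose requests exit through multiple corporate IPs still stick to one backend.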


If you have questions, comments, or feedback, we’d love to hear them. You can contact me on Twitter @philmod. The other half of this “we” is Andrew, who can be found @andrewguy9.