Building real-time messaging server in Go

This story is a write-up after my talk at goway.io about Centrifugo. Some of things here have already been mentioned in my previous posts. The original slides available here.

The video version is hosted on YouTube, though it can be a bit hard to perceive as that was my first talk in English. Anyway, maybe you are an enduring person:

My name is Alexander Emelin, I am a software engineer from Moscow. At moment I work in Avito which is one of the biggest classified in the world. At work we are helping people to sell and buy things thus giving those things the second life and simply let people save their money.

But today I won’t talk about things we do at work — I’ll talk about open-source project I started about 6 years ago. The project is called Centrifugo. This is a real-time messaging server. It was originally written in Python (in Tornado framework) but then migrated to Go language. Just about week ago I released version 2 of Centrifugo — so this is actually first public announce of new server. All my points in this talk relate to this new version.

So what is a real-time messaging? When I say real-time message I mean message you deliver to your application clients as reaction to some event happened in your application. The important property of such message is that it’s delivered to client almost immediately — in milliseconds after being sent by backend. This kind of messages is very useful in practice for multiplayer games, chat rooms, dynamic counters, live charts and so on.

The instant nature of real-time message require it to be pushed from server to client via persistent connection. Polling strategies when client periodically asks server for updates are not very good choice here.

Motivation

If you decided to add real-time events to your application you actually have lots of options. When choosing real-time solution it’s important to take into account many factors: your backend and frontend languages. Are you starting project from scratch or already have production application. Are you ready to pay money for real-time solution? And of course the nature of task you need to solve.

There are many real-time messaging solutions in the wild. This slide contains some of them.

If you are making your backend in asynchronous concurrent language like Erlang, NodeJS or Go then you pretty fine even without any framework — though you will need to solve some specific problems on front end and backend sides. Like scalability, proper connection management an so on.

Phil Leggetter from pusher.com created a wonderful source describing modern real-time technologies. Check it out — tons of great servers, libraries and services. Also a good overview exists in deepstreamhub blog. If you are starting your search for real-time solution — start from there.

My personal background is Python and to be more concrete Django framework. As you know Django is a classic worker-thread framework where each worker runs in its own process and OS thread and blocks for a time of request execution. Now with persistent connections like Websocket you are quickly run out of available workers so your main application stops accepting new requests. This is what Centrifugo was originally created for — deal with persistent connections thus giving backend a possibility to serve general short lived requests and publish new messages to clients using Centrifugo API when needed.

Django is not alone — there are lots of similar model frameworks in Python and other languages. As Centrifugo works as separate service it’s possible to just integrate with it without introducing many changes in application code. Actually nothing stops one from using Centrifugo together with application written in NodeJS or Go language as it it has some useful features I describe very soon.

Design overview

In a nutshell Centrifugo is a server that allows to handle persistent connections from application clients and provides API to publish real-time messages to connected clients who interested in that message. Clients indicate their interest to specific messages subscribing to channels (or in other words topics). So Centrifugo is actually just a PUB/SUB server.

Now we can look at simplified scheme of Centrifugo integration with application backend:

As you can see there are 3 parts involved into this scheme — your application backend, Centrifugo and your application users. Users connect to Centrifugo over Websocket or SockJS with JSON web token for authentication, subscribe to channels and listen to messages. As soon as your backend has a new event it publishes it to Centrifugo API and message travels towards connected clients. In Centrifugo you can use simple HTTP requests with JSON in body or GRPC to call API methods.

Let’s imagine you are making real-time comments platform. As soon as your user creates new comment you first send it to backend using convenient way — for example AJAX request in browser, on you backend side you validate comment, save into database if needed and then publish into Centrifugo API to channel related to this comment and this comment will be broadcasted to all active subscribers by Centrifugo.

Actually this means that real-time messages flow one way in this design — from server to clients. In most case you don’t need bidirectional real-time communication. Most of modern applications are read-oriented — users mostly read content, write requests is a much more rare thing — so for many real-time applications we can easily make write using non-persistent connection.

Let’s look at high-level features of Centrifugo:

  • Language-agnostic
  • Decoupled from application backend
  • JWT authentication with expiration support
  • Simple integration with existing application
  • Scale horizontally with Redis
  • Solid performance
  • Message recovery after short disconnects
  • Information about active subscribers in channel
  • Cross platform (Linux, MacOS, Windows)
  • MIT license

Centrifugo is used in many projects, mostly in web applications. It’s pretty popular in Python and PHP communities. Some companies that use Centrifugo are Mail.Ru, Badoo, sports.ru, Spot.im. Spot.im has the largest installation of Centrifugo I’ve heard about — it’s 300k clients online with 3 million messages per minute.

Real-time transports

Let’s speak about real-time transports used in Centrifugo a bit.

Websocket is the most obvious choice these days. It has a big advantage — it works everywhere — in web browsers and inside mobile applications. It’s low overhead protocol on top of TCP that works over the same ports as HTTP. Connection starts with Upgrade mechanism over general HTTP then switches to TCP session. I won’t talk a lot about Websocket protocol details — sure are familiar with it as it’s widely spread these days.

We remember those days where using Websocket in web applications was almost impossible due to poor browser support.

But these days situation is way better.

The current browser support of https://caniuse.com/#search=websocket is about 93%. Some clients still use browsers without websocket support, several browser extensions can block websocket traffic to specific domains so in some scenarios we still need fallback to HTTP transports if we want all our users to successfully connect.

In order to solve this Centrifugo uses SockJS as Websocket polyfill. This means that when Websocket connection cant be established one of the alternative transports will be used.

Only 2 of these transports based on persistent connection — xhr-streaming and eventsourse. But as they based on HTTP they can’t be bidirectional so SockJS emulates bidirectional communication using separate short-lived HTTP requests from client to server.

Having fallback HTTP transports could give one interesting advantage in web browser environment. As we know we live in a time when HTTP/2 grows in popularity. If HTTP/2 used persistent HTTP connections will be multiplexed in one real TCP session automatically by HTTP/2 implementation. While when opening new tabs of Websocket application you establish new TCP connection to the server. This can be solved with synchronization over LocalStorage or with SharedWorker but HTTP/2 just gives multiplexing out of the box for HTTP based transport connections.

There is a specification in progress that will allow to run websocket over stream of an HTTP/2 connection.

This was a short overview Centrifugo motivation and general concepts. Now let’s be more specific about implementation details — how Centrifugo built inside?

Internals

As mentioned above there are 2 of them:

When I was working on version 2 I also experimented with GRPC bidirectional streaming as client server transport. But after some measurements I found that GRPC bidirectional streaming has no advantages compared to Websocket. For example if we land 10k clients to one Centrifugo node then in websocket case memory consumption on server will be around 500 mb while in GRPC case it will be 4 times bigger — about 2gb of RAM. And in my syntetic tests Websocket was just up to 3 times more performant in regard to CPU usage.

Centrifugo has its own protocol which is described in protobuf schema — it’s very similar to JSON RPC in its high level. Also there are some similarities to MQTT in terms of business logic. There are two available serialization formats: JSON and Protobuf.

I already mentioned that Centrifugo can scale to many nodes. To allow features work across many nodes we have an Engine entity. This is actually interface with pretty big amount of methods. Engine allows:

  • publish messages to Centrifugo nodes where channel subscriber connected. So every client can connect to each node and receive messages possibly published to another node API
  • provide a way to save publications into cache with configured size and life time. This is used for missed message recovery process when client lost its connection for a short time and then reconnects.
  • manage presence information — i.e. information about active subscribers in channels.

There are 2 engine implementations at moment — in Memory and Redis engine.

Memory engine allows to run one Centrifugo node as there is no mechanism to connect nodes somehow at moment.

Redis Engine allows to scale Centrifugo to many nodes that will be connected via Redis PUB/SUB and also has built-in sharding support and high availability over Redis Sentinel. All communication with Redis done using Redigo library (thanks again Gary Burd, you are literally my hero).

Message delivery model

In simple case the delivery model is at most once. There are no guarantees that message will be delivered. Obviously as Centrifugo based on PUB/SUB model and Redis PUB/SUB mechanism involved — it’s impossible to imagine another guarantee. But this is actually enough for most applications because Centrifugo has a role of message transport — all application state lives in main application database so can be recovered when needed.

But Centrifugo provides an interesting message recovery mechanism. Centrifugo keeps a configurable buffer of messages in cache and can automatically recover client state after short disconnect sending it missed messages after client resubscribes. When this mechanism on every publication in channel has incremental sequence number, clients can remember last sequence number they received and thus use it to recover state on reconnect. In this case Cenrtifugo send missed publications to client synchronizing this process with PUB/SUB so messages come to client in correct order. If Centrifugo not sure all publications were restores it says client about this using special flag in response.

Optimizations

There are several optimizations used throughout Centrifugo source code. Let’s look at some of them.

To work with Protocol buffers we use gogoprotobuf library. It uses code generation and produces very optimized code to marshal and unmarshal protobufs. This code is up to 3–6 times more performant than standard Protobuf library which is based on reflection and allocates more.

Another optimization is automatically merging different messages to one client into one frame. This allows to reduce amount of write system calls under load.

If we look at flame graph of real production Centrifugo v1 instance we can see how wide are flames related to write syscalls.

Protocol design helps to merge messages together. In JSON format case several messages can be combined together using JSON streaming format where each individual protocol messages delimited by new line symbol. And in Protobuf case we use length-delimited format where each individual message prefixed with varint message length. So we can simply write different messages into temporary buffer and then write them to connection in one system call. Of course we can also reuse those temporary buffers using sync.Pool and we do.

The next optimization is using Redis pipelining to achieve best performance when working with Redis. Pipilining allows to send several commands to Redis in single request. We build Redis pipelines by collecting individual requests coming from different goroutines using smart batching technique. Let’s look at this pattern on example.

See also on Go Playground.

Imagine we have a source channel from which we get items to process. We don’t want to process items individually but in batch. For this we wait for first item coming from channel, then try to collect as many items from channel buffer as we want without blocking and timeouts involved. And then process slice of items we collected at once. For example build Redis pipeline from them and send to Redis in one connection write call.

When it’s not possible to use pipelining because we have an operation that consists of several dependant steps we use Lua scripts to do the hard work. Lua scripts executed in just one round trip to Redis and moreover they are executed atomically.

And we try to combine messages on client side — in our clients — to reduce read syscalls amount.

Centrifuge library

As you see from my talk Centrifugo is a standalone server with its own mechanics built in Go. The question I was asked several times is it possible to reuse Centrifugo functionality from within Go code? The answer was:

– well, yeah, you can — but for your own risk as Centrifugo was never designed for this.

While working on Centrifugo v2 one of my goals were to segregate separate library that will be the core of Centrifugo v2 but still can be reused by Go community.

The result of my work is Centrifuge library. You still need to understand that Centrifuge as library is very specific in terms that it has mechanics inherited from Centrifugo server.

Centrifuge library provides several things beyond Centrifugo server functionality:

  • Native authentication with middleware
  • Tight integration with business logic
  • Bidirectional messaging
  • Built-in RPC calls
  • More freedom in channel permission management

Let’s looks how library feels. We can’t look at everything as the time is very limited, but here is the full program that uses Centrifuge library and does nothing useful.

Let’s zoom in the most interesting part:

This is how you can set event handlers to process new connection event, client disconnect, client subscription request and attempt to publish.

Client libraries

At moment we have several clients for Centrifuge and Centrifugo:

Important thing that clients can work both with Centrifugo server and custom server built on top of Centrifuge library. Let’s look at some patterns that is possible to do using Javascript client.

The first one is PUB/SUB:

The second is bidirectional messaging:

Or RPC calls:

That’s all for today. Here are some links related to my talk: