Four years in Centrifuge

Alexander Emelin
15 min read · Jun 7, 2016


It’s been almost 4 years since I started working on the Centrifuge/Centrifugo project. That is quite a long time and quite a long development story, so I think it’s time to summarize what has happened to the project during this period and where we are now. If you are interested in push technologies, WebSockets, real-time messaging, PUB/SUB systems, or you are just a fan of the open-source world, this article may be interesting to read.

As a short description: here I am talking about a server (see it on GitHub) that instantly delivers JSON messages to your application clients (over the WebSocket or SockJS protocols) right after your application backend learns about some event. So-called real-time messages. Chats, notifications, counters, real-time comments and charts, collaborative editing, games: this is what it is built for. Most of this writing is in the context of web applications, but I’ll mention mobile and desktop apps below too.

As a note, I am from Moscow, Russia, and English is not my native language, so corrections are welcome.

So… Everything started with Centrifuge.

Centrifuge

Centrifuge was quite a distinctive project from the very beginning, for several reasons.

First of all, it was not a real-time message broker at the beginning. I started working on it planning to create something similar to Sentry (by the way, note that the name Centrifuge sounds like `Sentry` + `fuge`), but with some differences: the idea was to have a server that listens for different events happening somewhere on the internet, aggregates them and displays them. I wanted to monitor new package releases on PyPI (the Python Package Index) and maybe something else in the future. At that time, apart from its name, Centrifuge got a web interface with SockJS support. And it was written in Tornado.

Meanwhile, at my job (Mail.Ru) we actively used real-time messages in our intranet portal: for comments, charts, calendar events, counters and so on. For those purposes SSE (Server-Sent Events) with a fallback to long-polling was a pretty good choice, so we used the cyclone-sse daemon built on top of the Cyclone web server (a Tornado fork running on top of Twisted), and it did the job pretty well. In Centrifuge I saw the potential to grow into a cyclone-sse replacement with more protocols/transports supported, scalability, built-in authentication, an admin web interface, useful real-time features and so on. So from that moment Centrifuge got a development course. And one day Centrifuge completely replaced our SSE/long-polling daemon.

Centrifugo

Last year Centrifuge migrated from Tornado to the Go language and changed its name from Centrifuge to Centrifugo. From this moment on I will say Centrifugo when I mean the real-time server. I’ll write a bit more about the migration to Go below. The latest version at the time of this writing is v1.4.5.

Choosing real-time solution

When choosing a real-time solution it’s important to take your ecosystem into account. Which language do you use on the backend? Are you starting a project from scratch, or do you have a working application? Are you ready to pay for a real-time solution?

I can only tell my story from the position of a Python programmer. But even if you write code in PHP, Ruby or something else, I think we have a lot in common in the end.

Let’s talk a little about the state of the real-time web in Python. There are many articles, discussions and presentations in this area. One of the most discussed topics is Django and real-time. What if your web application is built with Django and you want to deliver some events to your users in real time? You have several options for this task. Let’s review them.

You can rewrite your site entirely using an asynchronous framework like Tornado, or another programming language with concurrency support (Node.js, Go, Erlang etc.). Well, we all understand that in most situations this is not a real option: costly and in many cases not desirable.

So you should adapt your existing Django application.

You can go with a Gevent-based solution. One magic line of code patches the Python standard library so Django can work with lots of persistent connections without worrying about running out of workers. Greenlet magic. There are some libraries that have proved to be a nice choice for solving this problem. On the other hand, this is a serious shift in your project’s philosophy.

One of the options is nginx-push-stream-module; I’ve heard several success stories from people using it. But for me it’s not flexible enough and a bit opaque, mostly because I am not a C programmer.

Another way is coming. It’s called django-channels and is being actively developed at the moment by Andrew Godwin, to be added to a Django release (1.11) very soon. It’s a complete shift in how Django is positioned: the new ASGI interface turns Django into a bunch of workers listening for events from a message broker like Redis. All clients communicate with separate interface servers, which send events to your Django workers via that message broker. This is a new, exciting and promising option for new projects, though it’s not ready for production yet and not fully explored.

Andrew Godwin’s talk at PyCon 2016

The next way is using a standalone server or service. This can be a cloud service like pusher.com or pubnub.com. Clients connect to the cloud service, and that service deals with all the connections from your application users. The service has an API to publish new events to interested clients. A very good solution if you are ready to pay and can afford to have your data travel through a third party.

Or it can be a self-hosted solution: a server that you install on your own machines. It can be written in any asynchronous language or framework, i.e. something like Centrifugo. This is not as flexible: for very complicated real-time apps you may need a tighter integration between the real-time provider and the application backend, for example for dynamic real-time multiplayer games. But it’s OK for most use cases.

Here I wrote about real-time solution choices from my Python/Django ecosystem. There are other techniques too, like BOSH for example, but I can’t write much about them because I have no experience with them.

Most of these approaches can be applied even if your application is written in a language other than Python. Phil Leggetter has created a wonderful resource describing modern real-time technologies. Check it out: tons of great servers/libraries/services. If you are starting your search for a real-time solution, start from there.

Also, apart from the ecosystem, there are some kinds of applications where you need not just a stream of events coming from server to client but more complicated data synchronization with your database. In that case you may need something like Firebase or the recently released Horizon based on RethinkDB. So take your application’s needs into account too!

By the way, if you are a Node.js person, you have tons of options: Meteor, Horizon, Derby, Faye, Primus! You are like Vicente del Bosque before Euro 2016: so many wonderful players to choose from, such an enjoyable headache :)

Spanish football players skipping Euro 2016 in France

Client side

Let’s put the backend side away for a while. When adding real-time you should also decide which technology to use on the client side. You need persistent connections from the browser to your server (by the way, that’s why Django can’t do this itself at the moment: you run out of workers as soon as several clients connect to your application).

Most modern applications are read-oriented: users mostly consume content rather than generate it. We can easily make requests for creating new content or events: a simple HTTP POST request, or maybe an RPC request to the backend, will do the work. So the main problem with real-time messages is how to deliver messages from the server to clients waiting for updates. Which transports do we have to solve this?

The most obvious choice is WebSockets, but it’s better to use WebSocket polyfill libraries like Socket.IO or SockJS to provide fallbacks for users with old browsers (IE less than 10, mobile browsers).

Centrifugo uses SockJS. This means that when no WebSocket support is available, one of the alternative transports will be used:

  • xhr-streaming (xdr-streaming in IE)
  • eventsource (SSE)
  • htmlfile
  • xhr-polling (xdr-polling in IE)
  • jsonp-polling

Some of these transports are so old and exotic that it’s hard to find a device that still needs them. Some of them use iframe objects to solve cross-domain connection problems. Hopefully one day the web, and Centrifugo in particular, will be able to get rid of all these transports and use pure WebSockets only.

The interesting thing here is that we were all waiting for the time when WebSockets would be supported by the vast majority of devices. And it seems we are almost there! Only Opera Mini still lacks full WebSocket support. The funny thing is that we now have HTTP/2 growing in popularity: a binary protocol multiplexing all connections to the same domain over one real TCP session. Meanwhile, each new tab of a WebSocket application establishes a new TCP connection to the server. This is a bit disappointing. The HTTP/2 protocol does not have an Upgrade mechanism, so it’s not possible to run WebSockets on top of HTTP/2 in any way right now; there is no such specification. But as a workaround (if opening a new connection per tab is a problem), in some browsers we can use a SharedWorker to multiplex our tab sessions over one real WebSocket connection.

If you want, you can use Centrifugo over pure WebSockets without any fallback to HTTP-based transports, as the Centrifugo server provides an endpoint for raw WebSocket connections. If your clients all have modern browsers and you use WebSockets over TLS, that’s a pretty good choice. The JavaScript client, centrifuge-js, hides communication with Centrifugo behind a simple API.

The WebSocket endpoint is also useful when you want to connect to Centrifugo from a non-browser environment. It’s not a big deal to find a good WebSocket client for any existing programming language. This made it possible to create Centrifugo clients for Android and iOS devices. We also have a Go client now. All that means Centrifugo can be used from browser, mobile and desktop applications.
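Since the raw WebSocket endpoint just exchanges JSON frames, a non-browser client is mostly a matter of serializing the right commands. Below is a minimal Python sketch of building such frames; the `method`/`params` field names follow my reading of the v1 client protocol and should be treated as assumptions, and an actual client would send the frames using a WebSocket library such as websocket-client:

```python
import json


def connect_frame(user, timestamp, token, info=""):
    # First frame a client sends after opening the raw WebSocket
    # connection: it carries the connection parameters (user ID,
    # timestamp, optional info and HMAC token) described below.
    return json.dumps({
        "method": "connect",
        "params": {"user": user, "timestamp": timestamp,
                   "info": info, "token": token},
    })


def subscribe_frame(channel):
    # After a successful connect the client subscribes to channels
    # it wants to receive messages from.
    return json.dumps({"method": "subscribe", "params": {"channel": channel}})


print(connect_frame("42", "1465208781", "<token>"))
print(subscribe_frame("news"))
```

Everything after the handshake is just more JSON of the same shape, which is why clients for new platforms are relatively cheap to write.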

Before continuing with this post I highly recommend reading some great articles on this topic:

What did the Go language give to the project?

As I promised above, let’s talk about Go a bit. I can’t cover all the positive aspects here, because then we would go too deep for this article. But here are some of the most exciting notes about the server’s migration to Go.

First of all: performance. A statically typed modern language compiled into machine code can’t be slow if properly designed, right? :) Go was designed by great engineers: Robert Griesemer, Rob Pike, Ken Thompson and other experienced developers. It’s much faster than CPython and has concurrency support built in, so there is no need to use an asynchronous framework like Tornado to develop an application. Depending on the task, Centrifugo is up to 50x faster than Centrifuge. For example, in Tornado it took about 5 ms to perform a publish API operation. Now it’s about 100 microseconds!

The next thing I want to mention is multi-core support. While in Python we have to start several server instances or play with multiprocessing, in Go the runtime scheduler distributes goroutines among all available CPU cores.

Go compiles a program into a single statically-linked binary executable. Say no to dependencies. Moreover, you can cross-compile your application for almost every popular modern platform. So I develop on Mac OS X and can build Centrifugo directly on my Mac for Linux, Windows etc.

Also, when I started the migration to Go I was new to this language, so I learned a lot: a lot of things that are hidden from a Python developer. Now I am much more careful about performance, allocations and protocol design.

By the way, there was a small discussion on Hacker News on this topic.

How does Centrifugo fit into an application?

So how does Centrifugo actually work? Let’s look at the simplified scheme below:

Here you can see 3 entities: your application backend, your clients and Centrifugo.

As soon as clients open the application you should give them connection parameters. Clients then use those parameters to connect to Centrifugo over the WebSocket or SockJS protocol. Let’s look at the parameters in detail.

First of all, the user ID. This is just a string with your user’s ID, so Centrifugo knows the application user ID for each connection. Note that it’s possible to use an empty string as the user ID when your application does not have users or you allow everyone to connect to your stream (this is called anonymous access; read more in the documentation).

Next is the current UNIX timestamp in seconds, as a string. For example, “1465208781”.

One more parameter is an optional info string: JSON-encoded additional information about the connection. For example, you can include the user’s name in it.

And finally: we can’t just trust a client claiming to be the user with a certain ID. So the last parameter is an HMAC SHA-256 token generated from the secret key (which is known only to Centrifugo and your backend), the user ID, the timestamp and the info string.
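For illustration, here is a minimal Python sketch of computing such a token on the backend. The field order (user ID, then timestamp, then info) follows the description above, but verify the exact signing scheme against the Centrifugo documentation for your version before relying on it:

```python
import hmac
from hashlib import sha256


def generate_token(secret, user, timestamp, info=""):
    # HMAC SHA-256 over the user ID, timestamp and info string,
    # keyed with the secret shared between backend and Centrifugo.
    mac = hmac.new(secret.encode(), digestmod=sha256)
    mac.update(user.encode())
    mac.update(timestamp.encode())
    mac.update(info.encode())
    return mac.hexdigest()


token = generate_token("secret", "42", "1465208781")
print(token)  # 64 hex characters
```

The backend renders these parameters into the page (or returns them from an endpoint), the client passes them when connecting, and Centrifugo recomputes the HMAC and rejects the connection if it does not match.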

So the clients provided valid parameters and successfully connected to Centrifugo. As soon as you have an event on your backend you can publish it to Centrifugo over its API (fan-in) and that message will be delivered to all interested clients (fan-out). Note the key word: interested. We cannot send every message to all clients. Here the channel concept appears: as soon as a client connects to Centrifugo it subscribes to channels from which it wants to receive new messages. A channel is just a string like “news” or “post_764_updates”. Nothing new here; this is how almost all PUB/SUB systems work: clients subscribe to topics and the PUB/SUB broker controls fan-out this way.

Here is an example of how to use Centrifugo from the browser with our JavaScript client:

We use the required connection parameters to connect, subscribe to the channel “news” and handle messages coming from this channel in the subscription callback.

So as you can see, Centrifugo is language-agnostic: it does not matter which language your application backend uses. All communication between the backend and Centrifugo goes through the API.

We have HTTP API clients for Python, Ruby, PHP, Go and Node.js now. But the HTTP API is so simple that if you are not satisfied with an existing library, or use another language that is not supported yet, it’s really simple to write a new API client in no more than 100 lines of code.
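To give an idea of how small such a client can be, here is a hedged Python sketch that builds a signed publish request. The command format and the sign-the-encoded-body HMAC scheme follow my reading of the v1 HTTP API; the endpoint path and form field names are assumptions to double-check against the documentation for your version:

```python
import hmac
import json
from hashlib import sha256


def build_publish_request(secret, channel, data):
    # One API command asking Centrifugo to publish `data` into `channel`.
    commands = json.dumps({
        "method": "publish",
        "params": {"channel": channel, "data": data},
    })
    # Sign the encoded commands with the shared secret so Centrifugo can
    # verify the request really comes from the application backend.
    sign = hmac.new(secret.encode(), commands.encode(), sha256).hexdigest()
    return commands, sign


# These two values would then be POSTed (e.g. as `data` and `sign` form
# fields) to the Centrifugo API endpoint, e.g. http://localhost:8000/api/
# (assumed address).
body, sign = build_publish_request("secret", "news", {"input": "hello"})
print(body, sign)
```

The rest of a full client is just HTTP plumbing and error handling, which is why the existing libraries are all so thin.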

So integration with an existing application is very simple: you don’t need to rewrite your code, change your language or framework, or change your project’s philosophy.

I should also mention that the message delivery model in Centrifugo is “at most once”, which means messages can theoretically be lost along the way. But actually this is how most modern real-time solutions work; in practice it’s OK for most applications, so there is no real need to sacrifice speed and simplicity for the sake of reliable message delivery, which is really hard to implement properly.

Features in short

As I showed before, there are lots of projects solving the real-time problem. Most of them only provide the PUB/SUB feature, i.e. a way to deliver messages to clients. Centrifugo has more to offer out of the box for building real-time apps. Here is a very short description of the main features Centrifugo has.

Presence information. This is information about the current connections in a specific channel. The most obvious example is a chat room: we want to see who is online at the moment.

Next, join/leave events. Again, in the chat room example these are notifications when someone joins or leaves the room.

Another thing is that Centrifugo allows keeping a message history (cache) for channels, with a configurable retention time and size. This is not very useful on its own, but it allows clients to recover missed messages (during short network disconnections, for example).

Another common problem is client load balancing. What if our application has 100k clients online? Of course, in theory we can use a single machine to handle all those connections (and even more). But this is bad for two main reasons:

  • service availability: what if your Centrifugo machine breaks?
  • more clients mean more CPU usage and higher message latencies, so why not relax your instance by adding another one?

So to scale to several machines and load balance clients between them, Centrifugo can be started with the Redis engine. Centrifugo nodes are then connected over the Redis PUB/SUB mechanism, and all temporary information such as presence and message history is kept in Redis instead of process memory. Centrifugo works with Redis Sentinel to prevent Redis from being a single point of failure.

Horizontal scalability is then limited by the throughput of a single Redis instance. But Redis is extremely fast: it can handle more than 100k PUB/SUB messages per second, so this should be enough for most applications on the internet. My personal applications don’t even need more than 100 messages per second. But if you are using Centrifugo and approaching the 100% CPU limit on your Redis instance, open an issue on GitHub and we will try to work on Redis sharding to balance the load among several Redis instances.

Thanks to the Go standard library, Centrifugo has HTTP/2 support. This can be useful with HTTP-based SockJS transports, because all connections to the same domain in different tabs are multiplexed into one TCP connection, and you get rid of the per-domain persistent connection limit from the HTTP/1.1 specification.

Centrifugo also has a built-in administrative web interface. Here is a screenshot of it:

And finally, prebuilt DEB/RPM packages and a Docker container help with the deployment process.

Ah! And it’s MIT licensed: use it for free, contribute, fork, modify.

How we use Centrifugo

Remember I said that we use Centrifugo in the Mail.Ru intranet? Besides that, it is used by several other Mail.Ru projects (I know about 5 different ones) and also in several projects around the world. The highest load I know about is 6k new messages per second on a site with 50k users online.

One interesting thing I personally built using Centrifugo is a cooperative web game. People come to a special room, everyone with their own device: laptop, smartphone, tablet. Then those people divide into teams, and after that the actual game starts. Players get questions on their device screens in real time and answer them. On a big screen on the wall real-time game statistics are shown; they change instantly after every relevant action. The game leader controls the game from their own device. By the way, the game leader sees who is online at the moment (thanks to presence information) and helps players who are offline and cannot enter the game for some reason (can’t find a link to the game, for example). And thanks to SockJS, there was nobody whose device had no support for real-time updates.

The sad thing about open source is that it’s sometimes hard to say who exactly uses your project and how. So if you are reading this and already use Centrifugo, or will use it in the future, please find a couple of minutes to write to me about your use case (see my email in my GitHub account). Every such message gives a lot of motivation to continue developing the project!

Another hard thing is that being a general-purpose, language-agnostic server means we need to support the many API and client libraries I talked about. I could not do this alone without the open-source community’s help. Big thanks to everyone who has contributed to the project:

That’s all for the moment. I intentionally skipped technical details here because the main goal was to talk about Centrifugo in general: how it was born, which technologies it uses and how it fits applications in 2016.

Here are some links to explore further if you are interested:

Thanks!
