How Centrifugo solves real problems of real-time messaging applications
Recently I realized that while I personally understand why Centrifugo is useful and believe it's awesome, I can't quickly explain its benefits and the reasons why someone may want to use it in a production application. Serving persistent connections (WebSocket, EventSource etc.) with Go or NodeJS is quite simple these days — a real breath of fresh air, especially after trying to do the same in Python, PHP, Ruby or similar languages without built-in concurrency.
For a very simple use case any hand-crafted solution, or one of the many available open-source ones, will work just fine. But as soon as your target is a business-critical production application, you need to start thinking about things like scalability, proper connection management, message recovery upon reconnect, observability and more. Many of the arising problems are not trivial to implement; others just take a reasonable amount of time to build. And most of them are effectively solved by the Centrifugo server. In this post we will go through some real-life problems of real-time messaging app development and describe why you may want to consider a server like Centrifugo for your project.
Centrifugo works with every backend
The first selling point here is that Centrifugo works in conjunction with any backend. This is especially valuable when the backend is written in a language or framework without built-in concurrency.
In one of my previous posts here I described the initial motivation behind Centrifugo — it was originally built to work in conjunction with a Django backend. Django is based on a worker/thread model, and serving many concurrent persistent connections with it will quickly exhaust the available workers. Centrifugo is a separate service that easily handles thousands of persistent connections and provides an API to communicate with connected users.
At Mail.Ru, where I first integrated Centrifugo with a Django application, we developed a lot of interesting things without stepping onto a slippery slope like Gevent or Django Channels. We built chats, live counters, real-time games, a St. Valentine's Day dashboard and many other cool things — all without a huge refactoring or moving from Django to an asynchronous stack.
If it can work in conjunction with Django, it can work with any backend technology out there. You can continue using your favourite framework — be it Django, Yii, Laravel or Rails — and still add real-time events to your application, without hacking your backend or changing the project philosophy. The integration process is simple and straightforward as soon as you understand basic networking and security concepts.
The fact that Centrifugo works as a separate service and provides an API to communicate with connections means it can be a universal building block you take from one project to another — even if you start your new project in a new language.
One more concern that arises here is user authentication. Usually you authenticate a WebSocket or HTTP connection using some sort of session mechanism on your backend. Centrifugo doesn't know about your application's authentication strategy, so it provides a way to authenticate connections with JWT. The JWT must be created on the application backend and passed to the client. Centrifugo also supports JWT expiration and automatic refresh hooks.
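For illustration, here is a minimal sketch of issuing a connection token on a Go backend with the popular jwt-go package. The "sub" claim carries the Centrifugo user ID and the token is signed with HMAC SHA-256, following the Centrifugo docs; the helper name and secret value here are my own placeholders:

```go
package main

import (
	"fmt"
	"time"

	jwt "github.com/dgrijalva/jwt-go"
)

// connectionToken creates a Centrifugo connection JWT for the given user ID.
// The secret must match the token HMAC secret in Centrifugo configuration.
func connectionToken(userID, secret string, ttl time.Duration) (string, error) {
	claims := jwt.MapClaims{
		"sub": userID,                     // Centrifugo user ID
		"exp": time.Now().Add(ttl).Unix(), // token expiration, optional
	}
	token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims)
	return token.SignedString([]byte(secret))
}

func main() {
	t, err := connectionToken("42", "my-centrifugo-secret", time.Hour)
	if err != nil {
		panic(err)
	}
	fmt.Println(t) // pass this token to the client on page render or over an endpoint
}
```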
There is also an ongoing pull request to proxy authentication to the application backend over HTTP on every connection attempt. In most cases this means you have to run Centrifugo on the same domain as your main application so that the cookie is passed to Centrifugo on connect (at least in a browser). While proxying authentication to the backend can fit some use cases, JWT still has its advantages — especially in a massive reconnect scenario. With JWT, connections can reconnect to Centrifugo without any involvement of your main app as long as the token has not expired. Otherwise you have to build an application capable of serving a huge amount of requests reconnecting almost at the same time. For many deployments in the wild this can be a real disaster.
Scalability
Another important thing is scalability. As your application grows, more and more users will establish persistent connections to your real-time endpoint. A modern server machine can handle thousands of open connections, but the power of one process is limited — you will eventually run out of available CPU or memory. So at some point you may have to scale user connections over several machines. Another reason to scale connections over several machines is high availability (when one server goes out of order, others keep working).
There are many real-time messaging solutions on GitHub, and many paid online services. But only a few of them provide scalability out of the box — most work only within one process. I don't want to say that Centrifugo is the only server that scales; there are alternatives like Socket.IO, SocketCluster, Pushpin and tons of others. My point is that the possibility to scale is one of the main things you should think about when searching for a real-time solution or building one from scratch. You can't really predict how fast your app will run out of available resources on a single machine — software scalability is not a premature optimization, and in most cases having a scalable solution out of the box simply gives you more room for improving application functionality.
Many online services are capable of scaling too. But look at the pricing — most of those solutions are rather expensive. In the case of pusher.com you pay $500 a month but only get 10k connections max and a strictly limited amount of monthly messages to worry about. This is ridiculous. Of course Centrifugo is self-hosted and you must spend your own server capacity to run it, but I suppose the cost is not comparable in many cases.
Centrifugo scales well with Redis PUB/SUB, supports application-side consistent sharding over Redis instances out of the box and integrates with Redis Sentinel for high availability. We served up to 500k connections with Centrifugo using 10 Centrifugo node pods in Kubernetes and only one Redis instance, which consumed just 60% of a single processor core!
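To give an idea of what this takes, a minimal Redis engine configuration may look like the sketch below (option names as I remember them from the v2 docs — check the documentation for your version; secret values are placeholders). Every Centrifugo node started with this config and pointed at the same Redis becomes part of one cluster:

```json
{
  "token_hmac_secret_key": "<secret>",
  "api_key": "<api key>",
  "engine": "redis",
  "redis_host": "127.0.0.1",
  "redis_port": 6379
}
```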
There is also an ongoing pull request that adds the possibility to scale PUB/SUB with the Nats server as a broker.
Well-structured Protobuf protocol
Centrifugo is not a young project. Development started about 7 years ago, and at the end of 2018 the server went through a massive refactoring with the v2 release. Since Centrifugo v2 the client-server protocol is defined as a Protobuf schema.
And while Centrifugo can still transfer JSON data over the wire, it now also supports binary WebSocket connections with the Protobuf message format. This means the data travelling between client and server can be serialized very compactly and efficiently.
The protocol is designed to only send required data over the wire, omitting empty and unnecessary fields. It can also be extended in a backwards-compatible way to add new features. We will talk about some of the protocol features Centrifugo provides shortly.
WebSocket polyfill for browsers
While WebSocket mostly works these days, there are cases when users can't establish a WebSocket connection (even with TLS). This can happen due to lack of browser support, corporate proxies or browser extensions. In some kinds of apps this is acceptable, but what if you have a requirement that every single user must be able to connect to your app from everywhere? In that case you really need a fallback option.
Centrifugo provides a fallback option via SockJS — a very mature and popular polyfill library that offers several HTTP-based transports such as EventSource, XHR-streaming, XHR-polling etc. With raw WebSocket and SockJS nearly every user will be able to connect and receive real-time updates.
We had a story where the WebSocket connection to our server was blocked by a popular ad-blocking browser extension. And none of our users suffered from this — they transparently switched to XHR-streaming. Another story is about a real-time corporate quiz game I developed, where each player (a company employee) came with their own device — smartphone, tablet, notebook — and the game just worked everywhere. This was a big help to the game organizers — no need to spend time solving connection problems.
Useful primitives to build real-time apps
Real-time applications have their own specifics. To abstract away communication with individual connections, Centrifugo provides a channel mechanism. Each connection can subscribe to one or more channels, and communication between backend and client happens by publishing data to a channel. So effectively this is just PUB/SUB mechanics. On the backend side Centrifugo provides HTTP and GRPC APIs to publish data into channels.
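As a sketch of how this looks from the backend side, here is publishing into a channel over the HTTP API in Go (the endpoint, the "apikey" authorization scheme and the command format follow the v2 docs; the address and API key are placeholders for your setup):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// A publish command for channel "chat". Other server API methods
	// (presence, history, disconnect, ...) use the same request shape.
	body := []byte(`{"method": "publish", "params": {"channel": "chat", "data": {"text": "hello"}}}`)

	req, err := http.NewRequest("POST", "http://localhost:8000/api", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// The API key comes from Centrifugo configuration.
	req.Header.Set("Authorization", "apikey <YOUR_API_KEY>")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```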
There are several kinds of channels with different security levels, and various options can be defined in the configuration — for example, so-called private channels, where every subscription attempt must be additionally signed.
Within a channel a developer gets some useful features: history cache, presence information (information about active channel subscribers), and join/leave notifications when a client subscribes to a channel (or unsubscribes from it). This is very important for building a game lobby, for example.
One more problem that must be solved in some scenarios is how to restore messages missed by a client during a reconnect (temporary internet connection loss, real-time backend restart). This is especially useful in the restart case, to prevent overwhelming your main database with thousands of requests from clients that want to restore state. Centrifugo solves this with its recovery feature. Every channel can optionally keep a stream of Publications, and a client can recover missed messages upon reconnect by providing the sequence number of the last Publication it saw in the channel — effectively restoring its state as if the connection had never been lost. Centrifugo also solves the problem of passing the initial sequence number to the client on first subscription to a channel (this is tricky even in the EventSource case, where a last-event-id mechanism is built into the protocol but the initial event-id must somehow be passed to the client).
The best thing is that Centrifugo uses the highly efficient PUB/SUB mechanism of a broker (Redis at the moment) and combines it with keeping the stream of Publications in a history storage (also Redis at the moment). At the moment of message recovery Centrifugo synchronizes PUB/SUB with the extraction of missed Publications from the storage, so the client receives Publications in the correct order. The algorithm behind this is described in one of my previous posts. So instead of at-most-once delivery guarantees you get an at-least-once guarantee with the history recovery feature on.
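To make the idea concrete, here is a hypothetical illustration (not the actual client library code) of the bookkeeping behind recovery:

```go
package recovery

// subscriptionState sketches the essence of recovery: the client remembers
// the sequence number of the last Publication it saw in a channel and sends
// it back on resubscribe, so the server can return everything published
// since that position, in order.
type subscriptionState struct {
	lastSeq uint32 // sequence number of the last seen Publication
}

// onPublication is called for every Publication received from the channel.
func (s *subscriptionState) onPublication(seq uint32) {
	s.lastSeq = seq
}

// recoverParams returns what a resubscribe request would carry: a flag
// asking for recovery plus the position to recover from.
func (s *subscriptionState) recoverParams() (rec bool, since uint32) {
	return true, s.lastSeq
}
```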
There is an ongoing pull request to proxy RPC calls from a client to the application backend over HTTP. This way a developer will have an option to utilize an open persistent connection to send any client-initiated command to the backend and react to it there — reducing the amount of information travelling between client and server compared to plain HTTP, where each request carries all the headers and must be authenticated.
Proper persistent connection management
Connections require proper management. In the request-reply case everything is quite simple — each request includes all the necessary headers, and you authenticate it every time it arrives from a client. With persistent connections you only authenticate the connection once on establishment, and then the connection can stay open forever. If a user has been deactivated in your app, you need to close their connection. Centrifugo provides hooks to invalidate such sessions so your data stays protected.
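For example, here is a sketch of closing all connections of a deactivated user with the server API disconnect method (the request shape mirrors the publish example above; the address and API key are placeholders):

```go
package main

import (
	"bytes"
	"net/http"
)

// disconnectUser closes all Centrifugo connections of the given user ID via
// the server API "disconnect" method.
func disconnectUser(userID string) error {
	body := []byte(`{"method": "disconnect", "params": {"user": "` + userID + `"}}`)
	req, err := http.NewRequest("POST", "http://localhost:8000/api", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "apikey <YOUR_API_KEY>")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Call this from the place in your app where a user gets deactivated.
	if err := disconnectUser("42"); err != nil {
		panic(err)
	}
}
```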
Modern real-time apps can serve thousands, even hundreds of thousands, of online connections at a time. In case of a real-time backend restart, tons of user connections will start reconnecting. Again — in the request-reply world clients won't send any additional requests to your app while it restarts, as requests are only initiated when a user performs an action on the application page. With persistent connections, every connection must be re-established after being interrupted. Centrifugo is fast enough to handle a high rate of reconnections and channel re-subscriptions, and the client libraries help by using exponential backoff on reconnect.
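For reference, a reconnect backoff with jitter could look like this sketch (the parameters are illustrative, not the exact ones used by Centrifugo clients):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// reconnectDelay sketches exponential backoff with full jitter, similar in
// spirit to what real-time client libraries do on reconnect.
func reconnectDelay(attempt int) time.Duration {
	const (
		base = 100 * time.Millisecond
		max  = 20 * time.Second
	)
	d := base << uint(attempt) // double the delay on every failed attempt
	if d > max || d <= 0 {
		d = max
	}
	// Full jitter spreads reconnects of many clients over time, smoothing
	// the load spike after a real-time server restart.
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait up to %v\n", attempt, reconnectDelay(attempt))
	}
}
```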
Performance
Centrifugo is pretty fast — not only because it's built with the Go programming language, but also due to several internal optimizations.
According to a benchmark I did, Centrifugo can broadcast up to 700k messages per second with the Protobuf protocol and up to 270k messages per second with JSON. The numbers are as artificial as in any benchmark and depend heavily on many factors, but I think the order of magnitude is pretty good. I believe that in terms of performance it should fit most applications out there.
Centrifugo has highly optimized communication with Redis — over a limited number of connections, using pipelining and smart batching techniques to reduce the number of network round trips.
For example, messages towards a client can be automatically merged together to reduce the number of write syscalls. The protocol is designed so that merging different messages going to one client connection can be done efficiently: using the JSON-streaming format in the case of JSON encoding, and length-delimited frames in the case of Protobuf encoding.
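Here is an illustrative sketch (not Centrifugo's actual code) of the merging idea:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// mergeJSONFrames joins several queued JSON messages with newlines
// (JSON streaming), so all of them go out in a single write.
func mergeJSONFrames(frames [][]byte) []byte {
	return bytes.Join(frames, []byte("\n"))
}

// mergeProtobufFrames prefixes each Protobuf message with its
// varint-encoded length, producing one length-delimited buffer.
func mergeProtobufFrames(frames [][]byte) []byte {
	var buf bytes.Buffer
	var header [binary.MaxVarintLen64]byte
	for _, f := range frames {
		n := binary.PutUvarint(header[:], uint64(len(f)))
		buf.Write(header[:n])
		buf.Write(f)
	}
	return buf.Bytes()
}

func main() {
	frames := [][]byte{[]byte(`{"a":1}`), []byte(`{"b":2}`)}
	fmt.Printf("%s\n", mergeJSONFrames(frames))
	fmt.Printf("% x\n", mergeProtobufFrames(frames))
}
```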
You can find more on these optimizations in one of my recent posts on the topic — https://medium.com/@fzambia/building-real-time-messaging-server-in-go-5661c0a45248
Ready to deploy
To simplify the deployment process there is a Docker image, plus RPM and DEB packages for popular Linux distributions. And since Centrifugo is written in Go, a developer can also just download a prebuilt binary for the target operating system and run it however they want. No additional runtime needs to be installed — it's a single binary that statically links everything necessary.
Centrifugo exposes metrics in Prometheus format and can optionally export them to Graphite. Metrics can also be fetched over the API or simply watched in the admin web interface (which is embedded into the server binary too). This means a developer can monitor the state of running Centrifugo nodes.
Client libraries for popular application environments
So Centrifugo provides some useful primitives on top of raw connections and has its own protocol encapsulating them. This means that a special client library is required to connect to Centrifugo. In my opinion this is both an advantage and the biggest disadvantage of Centrifugo. The advantage is that all the features described above are encapsulated by the client libraries — you get them for free on the client side. The disadvantage is that client libraries are quite difficult to implement and maintain. At the moment we cover the most popular application environments with the following official clients:
- centrifuge-js for browser, NodeJS and React Native
- centrifuge-swift for iOS apps
- centrifuge-java for Android and general Java
- centrifuge-dart for the Dart and Flutter ecosystem
- centrifuge-go for Go
- centrifuge-mobile, based on the gomobile project, for iOS and Android development
Built on top of a library for Go
And one more interesting thing: Centrifugo is built on top of the Centrifuge library for Go. This means Go developers can tweak the real-time messaging server further — provide custom authentication, channel authorization rules, custom PUB/SUB brokers (instead of Redis), custom presence and history storages. And while the Centrifugo server is mostly designed to stream messages in one direction — from server to client — the library allows exchanging messages in a fully bidirectional way, with full control over each client connection.
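To give a feel for it, a compressed sketch of a server built on the library might look like this (handler names follow the library's README examples, but exact signatures have changed between library versions — treat it as a sketch, not a reference):

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/centrifugal/centrifuge"
)

func main() {
	node, err := centrifuge.New(centrifuge.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// Custom authentication: accept every connection and assign a user ID.
	node.OnConnecting(func(ctx context.Context, e centrifuge.ConnectEvent) (centrifuge.ConnectReply, error) {
		return centrifuge.ConnectReply{
			Credentials: &centrifuge.Credentials{UserID: "42"},
		}, nil
	})

	// Custom channel authorization rules can live in the subscribe handler.
	node.OnConnect(func(client *centrifuge.Client) {
		client.OnSubscribe(func(e centrifuge.SubscribeEvent, cb centrifuge.SubscribeCallback) {
			cb(centrifuge.SubscribeReply{}, nil)
		})
	})

	if err := node.Run(); err != nil {
		log.Fatal(err)
	}

	http.Handle("/connection/websocket", centrifuge.NewWebsocketHandler(node, centrifuge.WebsocketConfig{}))
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```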
All the client libraries mentioned above can work with a server built on the Centrifuge library, as the underlying protocol is basically the same.
Conclusion
As you can see, there are many things that must be taken into account when building a real-time messaging application. My goal in this post was to explain why Centrifugo exists and to emphasize that it has been developed with production usage in mind, solving some real-life problems out of the box.
You get everything above for free, as Centrifugo is open-source, MIT-licensed software. Of course, free cheese can only be found in a mousetrap (as we say in Russia) — you still need to understand the risks and properly consider what fits your use case best.