Livestream Messaging Systems
As the Livestream Platform evolved, we have added and replaced messaging systems in different parts of the product. Since the Livestream Platform was the product of a three-week hackathon, it began with a very naive, but still very solid solution in Redis. At that time, Redis was primarily used for storing Livestream’s object metadata, backed by MySQL, but we went ahead and used it for messaging as well, initiating backend tasks like sending email, SMS, and other notifications.
The Redis workflow for notifications was very simple. Events in the backend layer were propagated with a predefined JSON schema into a Redis queue. The Task Manager listened to this queue and created worker tasks by fetching the metadata from the Redis server and pushing jobs into individual worker queues.
Once we were out of hackathon mode, we started facing a few difficulties, like acknowledging messages at different stages downstream. We needed a solution which at least provided the feature of acknowledging to clients upon the completion of a task. We looked to Gearman as our next destination; it was implemented, and worked like a charm.
How Gearman works —
“A Gearman powered application consists of three parts: a client, a worker, and a job server. The client is responsible for creating a job to be run and sending it to a job server. The job server will find a suitable worker that can run the job and forwards the job on. The worker performs the work requested by the client and sends a response to the client through the job server. Gearman provides client and worker APIs that your applications call to talk with the Gearman job server (also known as gearmand) so you don’t need to deal with networking or mapping of jobs. Internally, the gearman client and worker APIs communicate with the job server using TCP sockets. ”
Since Gearman supports multiple language interfaces it is possible to run a heterogeneous application where clients submit jobs in one language and execute in another. This was useful since we have workers written in Node.js and Scala, and they were able to communicate without much hassle. Architecture-wise, not much had changed. All workers were wrapped around a small class called GearmanWorker. Clients were wrapping the same job in GearmanJob and submitting to the central Gearman job server. This quickly fixed our problem with missing acknowledgements.
Since we were constantly introducing new features to the Platform, and the traffic was increasing, so was the load on our messaging queues. Scalability wasn’t the issue, but there was a lack of tools to visualize the load on those queues. We needed a system which could be monitored 24/7, and allow alerts to be set up. At this point, we thought of switching to RabbitMQ, a much more mature and heavy-weight solution. But before switching, we also considered other options like Kafka. Based on our research and considering the maturity of the products, however, RabbitMQ was found to be the more stable solution. It took some time before we could switch to RabbitMQ completely, but once it was done, we had visibility into the queues, the message acknowledgements, and the routing mechanisms.
With RabbitMQ, it’s simple to route the same message to multiple workers. For example, when an account is created, sending an email and tracking that in Mixpanel are two different tasks performed by two different workers. With the Redis approach, Task Manager would require updates to create multiple tasks and push them into different worker queues, but this is a simple configuration change in RabbitMQ.
RabbitMQ also guarantees at-least-once delivery via an ACK mechanism, and this was needed for most of our critical notifications, like account activation and password reset emails. It also provides a way to monitor failed jobs by configuring
dead_letter exchanges, and alerts on these queues allowed us to detect problems and re-process them if necessary. Introducing new workers which use existing notification messages is easier since changes aren’t necessary in either Task Manager or the backend layer. We can create queues and bind to
worker_exchange for the required messages directly in RabbitMQ.
Again, architecture didn’t change here as well, and RabbitMQ gave us the flexibility to bypass Task Manager by using exchange bindings for some of the workers.
One of our other products, Analytics, is also using Apache Kafka in a parallel installation to RabbitMQ. Since Analytics was planned to tackle high traffic, we needed a high throughput system. We had set a target of 2mm concurrent viewers, which equates roughly to ~60k req/sec (assuming traffic from clients is distributed uniformly). When we were looking to use RabbitMQ for Analytics, we found some scalability challenges. We needed a solution which works like a producer sink, where our applications can dump data for processing later on.
We ran performance tests on RabbitMQ and Kafka, and the results are as follows (Note: these are rough numbers, and there’s no claim that you will get the same results).
RabbitMQ (Single Node, 100 Producer Threads, 5 Consumers, 32 Cores, 16GB RAM, Message Size: 100 bytes)
- Production rate and consumption rates are dependent
- Max rate of publish <25k msg/sec
Kafka (Single Node, 100 Producer Threads, 5 Consumers, 32 Cores, 16GB RAM, Message Size: 100 bytes)
We also found some supporting blog posts:
- Kafka: ~ 700k msg/sec (with producer and consumer both running) (with 3 cheap machines) — read here
- RabbitMQ: ~ 1 mm / sec (with producer and consumer both running) (with 30 Google Compute engines) — read here
Ultimately, we decided to use a Kafka cluster for the Analytics system. To date, it’s performed well, and we haven’t looked back.
To conclude, we currently run a mix of Redis, RabbitMQ, and Kafka as the primary messaging options to cater to our platform needs. Kafka is used where incoming traffic arrives at a very high rate, where latency and delivery guarantees are flexible. RabbitMQ caters to the need to deliver critical notifications like email, and Redis is mostly used for non-critical scheduling tasks.