BigData & Web3. RabbitMQ High Availability and High Load. Part 3

Dmytro Nasyrov
Pharos Production
Aug 2, 2024 · 3 min read

Part 1 of the article is here

Part 2 of the article is here

High Availability

Fault tolerance in RabbitMQ is provided primarily by its clustering mechanisms. Launching a single RabbitMQ instance already means running a cluster of one node; clustering in RabbitMQ is, as they say, available by design. The classic high-availability mechanism in a RabbitMQ cluster is queue replication (mirroring).

For each queue, a master node is selected, which is responsible for handling that particular queue. Unless configured otherwise, the queue lives and operates only on that node. For classic replication, you have to define additional policies for queues. As a result, the queue runs on its master node while its state is replicated to mirror nodes (replicas). The number of replicas is set through the policy, individually for each queue.
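As a rough sketch, such a policy can be applied through the rabbitmqctl CLI or the management HTTP API. Below is a minimal, illustrative example that sets a mirroring policy over the HTTP API from Python; the host, credentials, policy name, queue pattern, and replica count are all assumptions to adapt to your cluster:

```python
import requests

# Create/update a policy via the RabbitMQ management API:
# PUT /api/policies/{vhost}/{policy-name}
# Assumes the management plugin is enabled on localhost:15672 and the
# default guest/guest credentials work; all values here are illustrative.
policy = {
    "pattern": "^ha\\.",             # apply to queues whose names start with "ha."
    "apply-to": "queues",
    "definition": {
        "ha-mode": "exactly",        # keep the queue on a fixed number of nodes
        "ha-params": 2,              # master plus one mirror
        "ha-sync-mode": "automatic"  # new mirrors synchronise automatically
    },
}

resp = requests.put(
    "http://localhost:15672/api/policies/%2F/ha-two",  # %2F is the default "/" vhost
    json=policy,
    auth=("guest", "guest"),
)
resp.raise_for_status()
```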

On the application side, we don’t need to know which node is the master. You can work with any node; requests are automatically proxied to the master node.

It is clear that if the master of such a cluster fails, there is a risk that some messages will not have had time to replicate and will be missing on the replicas. For that case there is a separate queue type with its own limitations (including performance): the quorum queue. Quorum queues are designed to give the strongest possible guarantee that every message in the cluster is preserved. Their specifics and limitations are described in the official documentation; from experience, their throughput is roughly half that of classic queues, and their resource consumption is even higher.
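Unlike mirroring, the quorum behaviour is chosen per queue at declaration time via the x-queue-type argument. A minimal sketch with the Python pika client (the node address and queue name are assumptions):

```python
import pika

# Connect to any cluster node; the quorum queue's replicas (a Raft group)
# will be spread across the cluster nodes.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-node-1"))
channel = connection.channel()

# Quorum queues must be durable and are selected with the x-queue-type argument.
channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)

connection.close()
```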

Cluster operation requires network latency between nodes of no more than roughly 30 ms, which makes it impractical to stretch a cluster across remote DCs. To duplicate messages between data centers, the Shovel and Federation mechanisms we already know are used instead.
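For example, a dynamic Shovel that duplicates a queue into a remote DC can be created as a runtime parameter. The sketch below does this through the management HTTP API; the URIs, queue names, and shovel name are placeholders, and the rabbitmq_shovel plugin is assumed to be enabled:

```python
import requests

# Create a dynamic shovel via the management API:
# PUT /api/parameters/shovel/{vhost}/{shovel-name}
# All hosts, credentials, and queue names below are placeholders.
shovel = {
    "value": {
        "src-uri": "amqp://user:pass@dc1-rabbit",
        "src-queue": "orders",
        "dest-uri": "amqp://user:pass@dc2-rabbit",
        "dest-queue": "orders",
        "ack-mode": "on-confirm",  # ack the source message only after the destination confirms
    }
}

resp = requests.put(
    "http://localhost:15672/api/parameters/shovel/%2F/orders-to-dc2",
    json=shovel,
    auth=("guest", "guest"),
)
resp.raise_for_status()
```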

Well, where would we be without HAProxy: here the balancer helps not only to spread the load across the cluster but also to ensure fault tolerance. We put it in front of the cluster so that we don’t have to build complex connection logic into applications. We simply connect to HAProxy, and it connects us to a node that is currently alive.
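From the application’s point of view, that means a single AMQP endpoint. A minimal pika sketch (the balancer hostname is an assumption; 5672 is the standard AMQP port that HAProxy forwards to a live node):

```python
import pika

# The application only knows the balancer address; HAProxy forwards the TCP
# connection to whichever RabbitMQ node is currently healthy.
params = pika.ConnectionParameters(
    host="rabbitmq-haproxy",  # balancer address (assumption)
    port=5672,
    heartbeat=30,             # detect dead connections quickly
)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.basic_publish(exchange="", routing_key="orders", body=b"hello")
connection.close()
```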

Network diagram

A simplified cluster of three nodes can be depicted as follows

With publisher

If the first node fails, the publisher can no longer operate (without reworking its connector)

Therefore, we install a balancer (we connect the consumer in the same way through the balancer)

You can say Hi to us at Pharos Production — a software development company

https://pharosproduction.com

Follow our product Ludo — the reputational system of the Web3 world

https://ludo.com


Dmytro Nasyrov
Pharos Production

We build high-load software. Pharos Production founder and CTO.