Clustering RabbitMQ on ECS using EC2 autoscaling groups
The new multi-target-group feature of ECS Services allowed us to greatly simplify our RabbitMQ cluster on ECS
Since we introduced chat and notification services in THRON, we have needed a RabbitMQ cluster in our infrastructure to keep the queues of those events separated by customer and type.
We implemented RabbitMQ in our AWS environment with custom EC2 Linux instances configured with Ansible. This was before we started using Docker, and that type of design was our standard way of approaching AWS instances.
What we wanted to improve
The cluster was structured as follows:
- two RabbitMQ nodes on EC2 instances in the same VPC that could communicate with each other (to keep them synchronized);
- a Classic ELB with both nodes registered, to distribute application traffic evenly across the cluster;
Over the years this implementation has proven to be quite successful, except for a few weaknesses:
- it could not automatically handle partitioning between nodes, caused by various problems (OutOfMemory, EC2 issues, spikes in requests), so we had to intervene manually to restore the cluster;
- it was neither practical nor quick to upgrade the system to newer versions of RabbitMQ, which was evolving rapidly at the time, so we ended up frozen on obsolete versions for a long time;
So over time, we considered evolving the architecture of that cluster.
Is there a better way to cluster RabbitMQ?
Initially, we scouted for a RabbitMQ-as-a-service product worth evaluating, to eliminate the burden of maintaining that infrastructure. Apart from CloudAMQP, we didn’t find much else that was right for us, and unfortunately we had to discard it too because of a poor cost-benefit ratio.
We had built most of our architecture without relying on outsourcing, so we started looking for a self-designed solution, which is also more fun and teaches you more, isn’t it?
RabbitMQ clustering requirements
With the advent of containers, their dedicated AWS services, and the convenience of implementing all kinds of solutions with them, we naturally decided to overhaul the architecture using ECS.
The solution seems trivial: create an ECS cluster with 2 different services, one for each RabbitMQ node. We had already tried this path in the past but, unfortunately, had not been able to cleanly and efficiently meet some requirements that RabbitMQ has:
- expose 2 TCP ports (5672 for AMQP and 15672 for HTTP-API and WebAdmin);
- let the nodes communicate with each other;
- data persistence on storage, for fast recovery in case of node failure or update;
The challenge of exposing 2 ports in ECS
Each RabbitMQ node must expose the AMQP and HTTP-API ports to the application clients that use it. To reach ECS Services, you would normally associate them with a Target Group which is populated by the Service, thus making the targets (EC2 instance and port) known to the load balancer.
In ECS, the most conceptually correct port assignment is the dynamic one in bridge mode, which requires going through a Target Group. The main problem was that you couldn’t associate more than one Target Group with a Service, so you couldn’t expose both ports you needed.
Since last summer (July 2019), however, this has become possible: https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-ecs-services-now-support-multiple-load-balancer-target-groups/
However, at the time of writing, it is not yet possible to associate multiple target groups with a single ECS service by using the AWS Console: you can only perform association through APIs (and up-to-date clients) and CloudFormation.
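As a sketch, a CloudFormation definition of one of the node Services could register both ports with two target groups (resource names and references below are placeholders, not our actual template):

```yaml
# Hypothetical ECS Service for one RabbitMQ node, registering the
# same container with two target groups (one per exposed port).
RabbitNode1Service:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref EcsCluster
    TaskDefinition: !Ref RabbitNode1TaskDef
    DesiredCount: 1
    LoadBalancers:
      - ContainerName: rabbitmq
        ContainerPort: 5672        # AMQP
        TargetGroupArn: !Ref AmqpTargetGroup
      - ContainerName: rabbitmq
        ContainerPort: 15672       # HTTP API / WebAdmin
        TargetGroupArn: !Ref HttpTargetGroup
```

The `LoadBalancers` list accepting more than one entry is exactly what the July 2019 update enabled.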
How to make nodes talk to each other
The static formation of a RabbitMQ cluster requires prior knowledge of the hostnames of the nodes. If the environment is dynamic and does not allow that, you can use cluster-formation plugins.
In our old infrastructure, with statically addressed EC2 instances and their associated DNS records, the nodes’ names were known a priori and everything went smoothly.
In ECS, however, Tasks generally live inside the Docker daemon of the underlying EC2 instances, so their reachability depends on the Task networking mode you choose.
As a potentially convenient first option, we tried RabbitMQ’s built-in AWS Peer Discovery plugin, which forms the cluster by autonomously locating nodes via EC2 tags or membership in an EC2 AutoScalingGroup.
With Tasks in “bridge” mode, this plugin attempted to use the internal hostnames of the containers, which are themselves unreachable. In “host” mode, instead, it tried to use the private hostnames of the EC2 instances. This solved the reachability problem, but when a node died or when ECS instances were updated, the AutoScalingGroup created new ones with different private hostnames, so the RabbitMQ cluster nodes were no longer known.
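For reference, this is roughly what the AWS Peer Discovery attempt looked like in `rabbitmq.conf` (the region is a placeholder; our actual tuning is omitted):

```ini
# rabbitmq.conf fragment: AWS peer discovery backend, locating
# peers by membership in the node's AutoScalingGroup
cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = eu-west-1
cluster_formation.aws.use_autoscaling_group = true
```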
CloudMap service discovery to the rescue
So we thought we’d change our approach and use the Service Discovery offered by CloudMap. This allowed us to have dedicated DNS records, managed in Namespaces in Route53, for each node in each ECS Service.
This service, when used in “bridge” or “host” mode, generates and maintains SRV-type DNS records that point to the ECS target (host and port). Sounds nice, doesn’t it? Too bad RabbitMQ does not support SRV records for cluster nodes: at the time of writing, it only supports A/AAAA records. To get records like that, we had only one thing to do: switch the Tasks to “awsvpc” mode, which assigns a dedicated EC2 network interface to each task, and therefore a distinct IP address to each of them. In this way, the ports are always statically mapped to the network interfaces and the records can be simple type-A records.
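With stable A records per node, cluster formation can fall back to the simple static configuration. A sketch, assuming a hypothetical `rabbit.internal` CloudMap namespace and node names (not our actual values):

```ini
# rabbitmq.conf fragment: static cluster formation over the
# CloudMap-managed A records ("rabbit.internal" and the node
# names are placeholders)
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbit-node-1.rabbit.internal
cluster_formation.classic_config.nodes.2 = rabbit@rabbit-node-2.rabbit.internal
```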
It works, let’s proceed with tests!
During the various iterations in which we deleted and recreated RabbitMQ cluster nodes, we noticed some slowness in the updating of CloudMap-managed DNS records; sometimes it took 1 or 2 minutes. This led to a “partitioning” problem at first start: when the records did not yet exist, the nodes ended up starting in standalone mode instead of forming the cluster. To solve this, we added a static mapping resolving each node’s hostname to itself, adding an entry in the /etc/hosts file of each node that forces DNS resolution to the localhost (127.0.0.1) address.
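A minimal sketch of that fix as an entrypoint step (the `pin_hostname` helper name is hypothetical, and the hosts-file path is a parameter here only so the snippet can be exercised outside a container):

```shell
#!/bin/sh
# Before starting rabbitmq-server, pin the node's own hostname to
# 127.0.0.1 so it can resolve itself even while the CloudMap A
# record is still propagating.
pin_hostname() {
  hosts_file="$1"
  node_hostname="$2"
  # Append the mapping only if it is not already present.
  grep -q "127.0.0.1 ${node_hostname}" "$hosts_file" 2>/dev/null ||
    echo "127.0.0.1 ${node_hostname}" >> "$hosts_file"
}

# In the real entrypoint this would be something like:
#   pin_hostname /etc/hosts "$(hostname)"
#   exec docker-entrypoint.sh rabbitmq-server
# Demo invocation against a scratch file:
pin_hostname /tmp/hosts.demo "rabbit-node-1.demo.internal"
```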
How to manage cluster state storage
RabbitMQ accepts volatile or persistent messages and we want to minimize the risk of losing messages when the cluster is under memory pressure. For this reason, we need to design a robust method of keeping the data on disk safe in the event of an application crash or to correctly manage node updates.
That’s why we decided to manage the persistence of cluster nodes on independent, pre-generated EBS disks. Essentially, when an ECS Task of a RabbitMQ node starts, the Docker storage plugin “rexray/ebs”, which we pre-installed in the EC2 image of our ECS hosts, lets us delegate to Docker the attachment of the node’s EBS disk, mapping it onto the container path used for data persistence.
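As a sketch, the wiring in the task definition would look roughly like this (volume and resource names are placeholders):

```yaml
# Hypothetical task definition fragment: Docker's rexray/ebs plugin
# attaches the pre-created EBS volume and mounts it into RabbitMQ's
# data directory.
RabbitNode1TaskDef:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Volumes:
      - Name: rabbit-node-1-data
        DockerVolumeConfiguration:
          Driver: rexray/ebs
          Scope: shared            # volume outlives any single task
          Autoprovision: false     # the EBS disk is pre-generated
    ContainerDefinitions:
      - Name: rabbitmq
        Image: rabbitmq:3-management
        MountPoints:
          - SourceVolume: rabbit-node-1-data
            ContainerPath: /var/lib/rabbitmq
```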
New architecture’s advantages
This new architecture is far more elegant and also provides several improvements compared to the previous one:
- Improved maintainability: We can now use a Docker image based directly on the official one, maintained by the RabbitMQ team. The only things we add to those standard images are cluster configurations and system limits, to make them adequate for our production environment. This also allowed other departments to upgrade RabbitMQ independently, without the need to rely on the DevOps team. An update of RabbitMQ has gone from requiring 1 man-day (including image preparation and all related activities) to 1 hour.
- Improved robustness: when a node dies for whatever reason, it is immediately recreated and immediately visible within the cluster, while also maintaining the state of the data in storage. This reduced our extraordinary maintenance work to fix cluster partitions, which in 2018 happened about ten times.
- Improved monitoring: the newer RabbitMQ versions come with the rabbitmq_prometheus plugin that exposes application metrics in Prometheus format. It was very easy to integrate monitoring and alarms to our Prometheus/Grafana monitoring system. If you’re interested in how we use Prometheus, read our previous article as well.
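As a sketch of that integration, assuming hypothetical node names, the scrape configuration is as simple as (15692 is the plugin’s default metrics port):

```yaml
# prometheus.yml fragment scraping the rabbitmq_prometheus endpoint
# (target hostnames are placeholders)
scrape_configs:
  - job_name: rabbitmq
    static_configs:
      - targets:
          - rabbit-node-1.internal:15692
          - rabbit-node-2.internal:15692
```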
By not relying on an external supplier, we also avoid the supplier qualification tasks, and it’s easier to perform ISO27001 assessments for risk management.
We are very satisfied with the new architecture since it provides an elegant solution to our needs, but we have identified some further improvements that we would like to implement.
The main improvement regards how we reach the RabbitMQ cluster from the applications. The internal DNS record is unique and is currently associated with a Network Load Balancer that targets the cluster nodes. We have kept this design to lower the regression risk when moving from the old infrastructure. In the future, we could easily manage this record directly in CloudMap Service Discovery, thus simplifying the architecture and eliminating the fixed costs of the load balancer.
Sometimes, minor updates from the cloud provider enable you to pursue very powerful refactorings that lower both risks and management costs.
Have you created RabbitMQ clusters? If so, how did you do it? Was the described approach useful to you? Let us know.