
RabbitMQ EC2 Auto Clustering
I started using RabbitMQ about six months ago and have been determined to get nodes to cluster automatically in an environment where any one of them may die at any time. Searching online, I could not find any tutorials or information on fully automating the process. This week I finally had enough time to get the process 100% automated using Ansible, and I thought I would share my experience. Some of this is general RabbitMQ information, while some is specific to Ansible or EC2.
You will first need to make the .erlang.cookie file identical on every node that will join the cluster. This sounds simple, but just opening /var/lib/rabbitmq/.erlang.cookie and changing it will leave you with an unresponsive RabbitMQ server. To change it properly, stop the rabbitmq-server service first, then edit .erlang.cookie, and only then start the service again. The clustering documentation does a nice job of telling you the cookie needs to be identical, but fails to mention this tiny piece of information, which isn't obvious if you haven't used Erlang before.
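The stop, edit, start order described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not what the playbook does: the function name set_erlang_cookie and its run parameter are hypothetical, it assumes the script runs as root on a system where the service command manages rabbitmq-server, and it assumes the cookie file already exists and is owned by the rabbitmq user.

```python
import subprocess
from pathlib import Path

# Cookie location mentioned above.
COOKIE_PATH = Path("/var/lib/rabbitmq/.erlang.cookie")


def set_erlang_cookie(cookie, path=COOKIE_PATH, run=subprocess.check_call):
    """Stop RabbitMQ, replace the Erlang cookie, then start RabbitMQ again.

    `run` is injectable so the function can be exercised without touching
    a real service (hypothetical helper, assumed to run as root).
    """
    # Stop the broker first; changing the cookie under a running node is
    # what leaves it unresponsive.
    run(["service", "rabbitmq-server", "stop"])

    # The cookie file is normally read-only, so make it writable first.
    if path.exists():
        path.chmod(0o600)
    path.write_text(cookie)

    # Erlang expects the cookie to be accessible by its owner only.
    path.chmod(0o400)

    run(["service", "rabbitmq-server", "start"])
```

Calling set_erlang_cookie("SOMESHAREDSECRET") with the same value on every node should leave them all with an identical cookie. The value itself is whatever secret you choose to distribute.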
The second thing that is a big help is the rabbitmq-env.conf file. It lets you set environment variables for the user that rabbitmq-server runs as. You can see all of the variables available for configuration here. The only one I am concerned about for the purpose of auto clustering is RABBITMQ_USE_LONGNAME.
RABBITMQ_USE_LONGNAME allows you to use an FQDN instead of the short hostname, which is ip-*-*-*-* by default in EC2. The short name does not resolve, however: trying to telnet to ip-*-*-*-* on any port will fail. ip-*-*-*-*.ec2.internal, on the other hand, resolves properly. Setting RABBITMQ_USE_LONGNAME to true is not enough on its own: the node will still fail to start until you also change the hostname to ip-*-*-*-*.ec2.internal. If you are on a Debian-based system, make sure you use hostnamectl to change the hostname, because a change made with the hostname command is rolled back on reboot.
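To make the naming concrete, here is a small Python sketch that derives the long hostname from a private IP address and builds the hostnamectl command to apply it. The helper names are hypothetical, and the ec2.internal suffix is taken from the text above; if your instances resolve under a different internal domain, pass it in as the domain argument. The rabbit@ prefix is RabbitMQ's default node name prefix.

```python
def ec2_internal_fqdn(private_ip, domain="ec2.internal"):
    """Turn a private IPv4 address into the ip-*-*-*-* style FQDN."""
    octets = private_ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {private_ip!r}")
    return "ip-" + "-".join(octets) + "." + domain


def hostname_command(fqdn):
    """Command that sets the hostname so it survives a reboot."""
    return ["hostnamectl", "set-hostname", fqdn]


def node_name(fqdn):
    """Node name RabbitMQ uses once long names are enabled."""
    return "rabbit@" + fqdn


fqdn = ec2_internal_fqdn("10.0.1.23")  # example address, not a real host
print(fqdn)
print(" ".join(hostname_command(fqdn)))
print(node_name(fqdn))
```

The functions only build strings and argument lists, so nothing is changed on the machine until you pass the command to something like subprocess.check_call.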
Now that your server is running with an FQDN, you no longer need to modify your /etc/hosts file, which is the practice in every online tutorial I could find. The only thing left to do is find all the other servers that are running RabbitMQ.
To accomplish this I wrote a new Ansible module that lets me get all instances in EC2 and filter them by tag key/value. You can take a look at the module here; it is currently pending inclusion in the ansible-modules-extras repo. Once this information is returned, you can loop through the list and try to cluster with each node. The list will include your current node, which causes an error in Ansible, but it is safe to ignore.
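The loop over discovered nodes could look roughly like the Python sketch below. It is not the Ansible role's implementation: join_first_available is a hypothetical helper, the host list stands in for whatever the EC2 tag lookup returns, and unlike the approach described above it skips the current node instead of ignoring the resulting error. It also stops at the first peer that accepts the join. The rabbitmqctl subcommands used (stop_app, join_cluster, start_app) are the standard ones for joining a cluster.

```python
import subprocess


def join_first_available(local_host, discovered_hosts, run=subprocess.call):
    """Try to join each discovered peer until one join succeeds.

    `run` must return a process exit code (0 on success); it is injectable
    so the logic can be checked without a running broker.
    Returns the host that was joined, or None if no join succeeded.
    """
    # Skip ourselves so we never attempt to cluster a node with itself.
    peers = [host for host in discovered_hosts if host != local_host]
    if not peers:
        return None

    # The RabbitMQ application must be stopped before join_cluster.
    run(["rabbitmqctl", "stop_app"])

    joined = None
    for peer in peers:
        if run(["rabbitmqctl", "join_cluster", "rabbit@" + peer]) == 0:
            joined = peer
            break

    # Start the application again whether or not a join succeeded.
    run(["rabbitmqctl", "start_app"])
    return joined
```

Joining a single member is enough to become part of that member's cluster, which is why the sketch stops after the first success.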
If you want to see the full Ansible playbook and how to set this up in your EC2 account, take a look at https://github.com/nowait-tools/ansible-rabbitmq. It is also available on Ansible Galaxy: https://galaxy.ansible.com/list#/roles/4450. Hopefully this will get other people in the Ansible role community thinking about better automating server clustering.