Self-recovering consul cluster

Levent Yalcin · Published in LevOps · Jun 1, 2017 · 4 min read

Hey all enthusiastic DevOps followers. I’m back with another post.

I think it’s a bit late to write this post, but I still think it’s worth sharing my perspective on deploying cluster-of-things in AWS. You might be happy with consul’s EC2 tag feature, but reading through could still be helpful. As you know, the devil is in the details. I hope you enjoy it.

To me, firstly, the main advantage of using any cloud service is resilience and high availability, and cloud services focus on making that easy and painless. So whenever I need to deploy a cluster in AWS, I try to deploy it in an Auto Scaling Group (ASG).

Secondly, service discovery is not just another service that you deploy and register your services with to know where they are. There are other things you can call service discovery as well, such as DNS, CDP, etc. Basically, if there is a service/protocol that tells you exactly where the bits you need are, it’s service discovery.

The combination of cloud metadata and awscli is a form of service discovery too, because cloud metadata always knows where the resources are, and you can query it with awscli. You can even check whether the services are healthy or unhealthy.
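As a quick illustration, something like the following is enough to turn instance metadata and awscli into a discovery mechanism. This is a generic sketch rather than the exact commands from my setup, and the Role=consul tag is just an assumption:

    #!/bin/bash
    # Sketch: use instance metadata + awscli as a discovery mechanism.
    # The Role=consul tag is illustrative; use whatever tag/filter fits your setup.

    # Ask the local metadata service which region we are running in.
    REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')

    # Ask AWS where the other running nodes are and get their private DNS names.
    aws ec2 describe-instances \
      --region "$REGION" \
      --filters "Name=tag:Role,Values=consul" "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[].PrivateDnsName' \
      --output text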

Automation Challenges

How do you discover your service discovery? Traditionally this has been a challenge for distributed systems. The technique often involves spinning up a cluster in one operation and then performing a second operation once the IP addresses are known to join the nodes together. This two-step approach not only makes automation challenging, but also raises questions about the behavior of the system when losing a node. Autoscaling could bring another node online, but an operator would still need to manually join the node to the cluster.

https://www.hashicorp.com/blog/consul-auto-join-with-cloud-metadata/

In a past role, we needed to do some R&D on service discovery. I tried a few options but decided to go with consul. Consul is easy to understand and use, comes with a K/V store, its Raft implementation works very well, etc. In short, yet another amazing HashiCorp product.

After a couple of tries, once I understood how consul works, I was back to the beginning: it’s not a good idea to create standalone instances and treat them as a cluster. The cluster should be able to recover itself if any instance fails, its IP addresses or DNS names should not be hardcoded into other services that register with it, and I shouldn’t need to be woken up by PagerDuty at 3am to fix a problem. I immediately changed my mind and decided to deploy the nodes in an ASG. That way, I could have a highly available cluster.

The important thing here is having this mindset; the rest is not rocket science at all. I only needed to set up an ASG that spins up instances and introduces the members of the cluster to each other. That’s all.

I chose Amazon Linux as the operating system. It’s always up to date and comes with upstart. I’m not a big fan of upstart, but at the time I found it easier to handle service dependencies with it.

I wasn’t sure whether different versions of consul could work together in the same cluster, so I ended up creating an RPM package. If one of the instances dies, the new one won’t come up with a newer version of consul. Installing consul with yum through YUM’s S3 IAM plugin is also more reliable than curl-ing consul on every new instance.
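With the package in a private, S3-backed yum repository, every new instance gets exactly the same build; the version number below is only illustrative:

    # Sketch: install a pinned consul build from the S3-backed yum repo
    # instead of curl-ing whatever the latest release happens to be.
    yum install -y consul-0.8.3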

The RPM package includes:

  • Consul binary
  • upstart definition of consul daemon
  • upstart definition of consul-join

And the rest is handled by the user data script (a sketch follows the list below):

  • installing awscli and the YUM S3 IAM plugin
  • populating the consul daemon’s configuration at /etc/consul.d/server/server.json
  • finding the values that don’t change, INSTANCE_ID, INSTANCE_ASG_NAME and CLUSTER_SIZE (the size of the ASG), and writing them to /etc/sysconfig/consul-server
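A minimal sketch of that user data script could look like the following; the AWS queries, paths and JSON keys are my assumptions here, not the exact production script:

    #!/bin/bash
    # Sketch of the user data steps above.

    # 1. awscli ships with Amazon Linux; install the S3 IAM yum plugin from your
    #    own repo/artifact here, then install consul from the S3-backed repo.
    yum install -y consul

    # 2. Collect the values that never change for this instance.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')
    INSTANCE_ASG_NAME=$(aws autoscaling describe-auto-scaling-instances \
      --region "$REGION" --instance-ids "$INSTANCE_ID" \
      --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)
    CLUSTER_SIZE=$(aws autoscaling describe-auto-scaling-groups \
      --region "$REGION" --auto-scaling-group-names "$INSTANCE_ASG_NAME" \
      --query 'AutoScalingGroups[0].DesiredCapacity' --output text)

    # 3. Write them to the sysconfig defaults that the upstart jobs read.
    cat > /etc/sysconfig/consul-server <<EOF
    INSTANCE_ID=$INSTANCE_ID
    INSTANCE_ASG_NAME=$INSTANCE_ASG_NAME
    CLUSTER_SIZE=$CLUSTER_SIZE
    EOF

    # 4. Populate the consul server configuration.
    mkdir -p /etc/consul.d/server
    cat > /etc/consul.d/server/server.json <<EOF
    {
      "server": true,
      "data_dir": "/var/lib/consul"
    }
    EOF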

What happens after I deploy the ASG:

  • ASG spins up new instance(s)
  • Runs the steps I’ve just mentioned above
  • Starts consul-server service
  • Starts consul-join service as a dependency

The consul-server daemon is a simple upstart definition; it only reads the cluster size from the sysconfig defaults and runs the consul daemon.
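For reference, a minimal sketch of such an upstart job (not the exact definition from the repo) could look like this:

    # /etc/init/consul-server.conf: a sketch, not the job shipped in the RPM
    description "consul server agent"

    start on runlevel [345]
    stop on runlevel [!345]

    respawn

    script
      # CLUSTER_SIZE is written to the sysconfig defaults by the user data script.
      . /etc/sysconfig/consul-server
      exec /usr/local/bin/consul agent \
        -bootstrap-expect "$CLUSTER_SIZE" \
        -config-dir /etc/consul.d/server
    end script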

consul-join finds the ASG’s healthy nodes, grabs the PrivateDnsName of each instance, and runs the consul join command.
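Again as a rough sketch (the real script lives in the repo), the join step boils down to something like this:

    #!/bin/bash
    # Sketch of the consul-join step; variable names match /etc/sysconfig/consul-server.
    . /etc/sysconfig/consul-server
    REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')

    # Healthy instance IDs of our own ASG...
    INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
      --region "$REGION" --auto-scaling-group-names "$INSTANCE_ASG_NAME" \
      --query 'AutoScalingGroups[0].Instances[?HealthStatus==`Healthy`].InstanceId' \
      --output text)

    # ...resolved to their private DNS names...
    PRIVATE_DNS_NAMES=$(aws ec2 describe-instances \
      --region "$REGION" --instance-ids $INSTANCE_IDS \
      --query 'Reservations[].Instances[].PrivateDnsName' --output text)

    # ...and handed to consul, which joins the first member it can reach.
    consul join $PRIVATE_DNS_NAMES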

Whether you deploy a new cluster, scale it up, or the ASG spins up a replacement instance, the nodes find each other and join to the first listed instance. And voilà! You have an HA consul cluster.

I think checking the code could make things clearer. I’ve put all the code on GitHub and you can find it here. By the way, the project was under development and passed all its tests successfully, but it never went to production, so the code is not at its best.

Conclusion

  • If you mean to deploy a cluster in AWS, deploy it in an ASG. Avoid creating standalone instances for clusters as much as possible.
  • Do not only think about the present; consider the entire lifecycle and its highs and lows.
  • Do not hesitate to use cloud metadata and the CLI tool to discover the resources you have.
  • Use awscli more and more; it’s a great tool. It’s part of AWS-101.
