Terraforming AWS — Setting Up a Highly Available Consul Cluster in Less Than 5 Minutes

Damjan Znidarsic
6 min read · Dec 27, 2015


Terraform has been around for a while now, but good luck finding examples of use that are more than extremely trivial. I'll walk you through Terraform and show you an example Consul cluster on AWS, built with Terraform and Ansible, to showcase how these tools are used in real deployments.

What is Consul?

Service discovery and configuration. Going into detail is out of scope for this post, but if you have more than a few servers you would probably benefit from it. Check it out. I'm using it as an example because it's a new addition to my infrastructure and it's what I've actually been working on lately.

Plan

We want to set up a ‘one button’ Consul cluster deployment.

  • We need at least 3 servers to survive the failure of 1. Quorum requires at least (n/2)+1 members. We can also modify the consul_servers_count variable to deploy 5 servers instead. As easy as that.
  • We must ensure instances are deployed in separate availability zones to guard against AZ failures. By default AZs are selected by AWS, and more often than not you end up with the whole cluster in a single AZ. We promised HA to our bossman. Not good.
  • Using Ansible and the ansible-consul role we can install Consul without hassle. No shell scripts, please.
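In Terraform terms, this plan starts with a couple of variables (a sketch; consul_servers_count is the variable mentioned above, the region default is my own assumption):

```hcl
variable "aws_region" {
  default = "us-east-1"  # placeholder default
}

# 3 survives one server failure; bump to 5 to survive two
variable "consul_servers_count" {
  default = "3"
}
```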

What is Terraform?

It's a tool for describing your complete infrastructure as code. With it, you can launch your infrastructure and collaborate on it with others. It supports a number of providers: AWS, DigitalOcean, Azure and OpenStack, to name just a few.
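A minimal example, to make it concrete (a sketch; the region and AMI ID are placeholders, not values from the repo):

```hcl
provider "aws" {
  region = "us-east-1"  # placeholder region
}

resource "aws_instance" "example" {
  # hard-coded AMI ID; the sections below make this dynamic
  ami           = "ami-xxxxxxxx"  # placeholder
  instance_type = "t2.micro"
}
```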

This example certainly works if you run terraform apply, but what if you want to pick the Ubuntu AMI dynamically depending on the region, launch multiple servers, and run Ansible to provision them all? Ugh.

This is where it gets a bit tedious and Terraform documentation, although very good, is unclear about some things. Hopefully this fills some gaps for people.

In hindsight it all looks pretty self-explanatory, but I did spend a couple of days figuring everything out. My production setup is more elaborate, but a minimal setup is available in this GitHub repo should you want to use it as a base for your own Consul cluster.

Terraform — The MAGIC

  • The tf_aws_ubuntu_ami module dynamically loads the AMI ID. If we change aws_region, it will find the new AMI ID to use.
  • A dynamic count means we'll spin up 3 servers, or change one variable to add 2 more.
  • With the tf_aws_availability_zones module we cycle through AZs using element(), which ensures every server is launched in a separate AZ. Notice the additional split() — TF doesn't support arrays in variables for some reason.
  • With the remote-exec provisioner we dump some variables to /tmp, to be picked up by Ansible. This way they are defined only once, in variables.tf.
  • With the local-exec provisioner we run Ansible against the host via self.public_ip. It runs the playbook with the IP of the instance currently being created.
  • Because we reference the first server's IP with aws_instance.consul_server.0.private_ip, Terraform knows to create the server at index zero first and make the others wait until it is created and provisioned. This is quite genius. Much magic. Very sweet.
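Put together, the pattern above looks roughly like this (a sketch; the module names come from the bullets, but their output attribute names and the playbook filename are assumptions):

```hcl
resource "aws_instance" "consul_server" {
  count         = "${var.consul_servers_count}"
  ami           = "${module.ami.ami_id}"  # from tf_aws_ubuntu_ami
  instance_type = "t2.micro"

  # element() cycles through the AZ list so each server lands in its own AZ;
  # the extra split() is needed because variables can't hold arrays
  availability_zone = "${element(split(",", module.az.zones), count.index)}"

  # dump variables to /tmp for Ansible, so they live only in variables.tf
  provisioner "remote-exec" {
    inline = [
      "echo ${var.consul_datacenter} > /tmp/consul_datacenter",
    ]
  }

  # run the playbook against the instance currently being created
  provisioner "local-exec" {
    command = "ansible-playbook -i '${self.public_ip},' consul.yml"
  }
}
```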

Terraform - The GOOD

It's a really good promise and, for the most part, well executed: define your infrastructure in an easy-to-read syntax, run terraform plan to see what it intends to do, then run terraform apply to do it!

[Screenshot: Terraform Consul cluster setup end result]

Independent resources are created in parallel. You can spin up 20 servers in no time at all.

It uses the magic of dependency graphs to figure out what it needs to do and in what order. This means that if you assign a security group to an instance, it will create the security group first. You don't need to think about the order of things; it does it all for you.
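For example (a sketch; 8301 is Consul's Serf LAN port, the AMI ID and CIDR block are placeholders):

```hcl
resource "aws_security_group" "consul" {
  name = "consul"

  ingress {
    from_port   = 8301
    to_port     = 8301
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]  # placeholder VPC range
  }
}

resource "aws_instance" "consul_server" {
  ami           = "ami-xxxxxxxx"  # placeholder
  instance_type = "t2.micro"

  # referencing the group's name creates an implicit dependency,
  # so Terraform creates the security group before the instance
  security_groups = ["${aws_security_group.consul.name}"]
}
```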

Consul requires 3 servers for a production setup. But there's a catch: when bootstrapping the cluster, you need to reference an existing server for Consul to connect to and establish quorum. Terraform helps if we reference the first server in a provisioner. It will spin up and provision the first server to get its IP, and then proceed to set up all the others, in parallel, using that IP. Very nice.
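The trick is a provisioner inside the consul_server resource itself (a sketch; the /tmp filename is an assumption):

```hcl
provisioner "remote-exec" {
  inline = [
    # referencing index 0 makes Terraform fully create and provision
    # server 0 first, then launch the rest in parallel using its IP
    "echo ${aws_instance.consul_server.0.private_ip} > /tmp/consul_join_ip",
  ]
}
```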

It has partial state, which essentially means it will try to pick up from where it left off in case of errors.

Integration with HashiCorp's Atlas means you can push your Terraform files to them; they will run the plan and, if it succeeds, you can apply it via their GUI. It gives you the ability to collaborate on infrastructure with others and, more importantly, a full change history.

Additionally, adding Consul to Atlas means you don't need to run the Consul UI yourself.

Outputs are a handy way to print some data to the screen after a successful apply. See above.
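For example, printing every server's public IP after apply (a sketch; the output name is my own):

```hcl
output "consul_server_ips" {
  # splat syntax collects the attribute from every instance of the resource
  value = "${join(",", aws_instance.consul_server.*.public_ip)}"
}
```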

You can use terraform destroy to vaporize your environment in seconds. Very handy for testing.

Terraform — The BAD

Variables and variable interpolation are absolutely horrid. Stupid things like no arrays in variables, interpolation working in some places but not in others, and absolutely no way to debug variables and interpolations. I spent most of my time battling this. Very frustrating. HashiCorp would benefit from some developer-UX love in this area. As a user I don't care how complex things are under a simple interpolation. It should just work.

Terraform will load all *.tf files in a directory. This doesn't work for sub-directories, though — which brings us to modules.

To group things into folders you need to use modules, and it's a really weird thing. A module has its own scope, so you need to re-define a lot of variables and specify their values in the module definition.
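A module call ends up looking like this (a sketch): every variable the module needs must be declared again inside it and passed explicitly here.

```hcl
module "consul_cluster" {
  source = "./consul"

  # each of these must also be re-declared as a variable inside ./consul
  aws_region           = "${var.aws_region}"
  consul_servers_count = "${var.consul_servers_count}"
}
```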

Blah.

I didn’t find a better solution but if anyone knows it, please comment.

Provisioners are only run on resource creation, which makes development hard, since you're likely to have errors in the provisioner at the start. I would love an option to re-run them. Reminds me a bit of this issue with Packer that still infuriates me every time I use it.

~~Do not mess with .tfstate files. It will break Terraform. Which makes me wonder how this works if I commit it to Git and have a conflict in these files down the line. Hmm.~~

Terraform 0.7 has much improved state management!

Terraform vs Ansible vs CloudFormation

Ansible

It's possible to provision infrastructure with Ansible, but it has ad-hoc state, no parallelism, and limited support for dry runs.

CloudFormation

It does support parallelism, and that's about the only thing it does right. It's AWS-only; TF and Ansible both support providers outside of AWS. No dry runs whatsoever: change and pray it works. Also, JSON?! Yuck.

Overall

Once you figure out all the gotchas, Terraform really is a superb addition to your infrastructure efforts. Gone are the days of furiously clicking through the AWS Console to set up servers and then trying to figure out how to get them into an Ansible inventory to configure them.

Although it's not ideal, because it lacks support for some AWS services, it wins hands down compared to the competition. The issue count on GitHub is a bit unsettling, but that's to be expected given the number of providers they support.

Check out the demo GitHub repo; it has everything you need to bring your Consul cluster up in <5min.
