Out with OpsWorks, In with Terraform

Salle J Ingle
locusinnovations.com
6 min readJan 31, 2018

I’m looking for a suitable replacement for a client’s current CI/CD pipeline, which includes CloudFormation, OpsWorks (coupled with both Chef v11 and v12), CodeDeploy, and Atlassian Bamboo.


Specifically I am looking to replace OpsWorks with a more streamlined and fully automated solution. Some of the issues I have with OpsWorks include the following:

  • The proclaimed auto-healing is unreliable at best and ineffective at worst. It basically stops an unhealthy instance and starts a new one, running the configure lifecycle event to re-associate with any EIPs or Load Balancers. The issue I often see is OpsWorks losing track of the instance state (did it finish shutting down, did it come back up, is it stuck in the ‘configure’ lifecycle state?) and then just giving up, resulting in a failed state.
  • Obviously the auto-healing ‘feature’ is not ideal for EC2 instances with instance store-backed root volumes. You can, of course, specify a custom AMI for these instances to boot from as needed, but you still have to wait for the instance store to initialize unless you use a solution like DRBD to replicate your ephemeral data to an EBS volume for added durability.
  • There is no true inherent auto-scaling for OpsWorks; you either add more complexity to traditional auto-scaling by adding Lambda and SNS to the workflow as described here, or you pre-provision the number of instances you will need on warm stand-by and configure time- or load-based scaling. The latter solution implies that you are able to predict your capacity and scaling needs. It also inevitably leads to configuration drift, as you are forced to keep these warm stand-by instances consistent with updates and config changes.
  • OpsWorks is based on Chef and therefore tightly coupled to a client/server, pull-based config management framework that is very particular about which agent version is installed, since updated versions are continually released. Once you update your agent (and I suggest that you do NOT configure your stack to ‘Use latest version’), other dependencies in your recipes will undoubtedly break and require additional troubleshooting and tracking down.
  • Chef itself tends to have multiple layers of dependencies. As mentioned above, the OpsWorks agent version tends to be rather tightly coupled with dependency versions defined in OpsWorks cookbooks, custom Chef recipes, cookbooks, or Berkshelf community cookbooks. Tabula rasa can also be used to abstract away the dependency on OpsWorks cookbooks. All of this adds to the complexity of your config management solution and hurts its maintainability, sustainability, and supportability.
  • The unreliable auto-healing, combined with the tight coupling that leads to chasing down Chef rabbit holes, demands more hands-on involvement in this semi-automated solution, which makes it prone to human error and inconsistency.
  • Lastly (I’m sure I could keep going here, but I think you get the general idea), Chef is based on Ruby, which tends to be a higher-priced skillset that is not as widely used as, say, Python. This client is a small organization that should not one day be burdened with finding and paying someone they hope is skilled enough to manage this nebulous environment.

After a bit of research, I have decided to do a Proof of Concept using HashiCorp’s Packer and Terraform, and Red Hat’s Ansible.

Packer will come into play for AMI transformation and management. My client is a small enough organization that maintaining a combination of a few hybrid and fully-baked AMIs makes sense for them.
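As a rough sketch of what this looks like, a minimal Packer template can pair the amazon-ebs builder with the Ansible provisioner so the AMI is baked from a playbook. (The region, AMI filter, and playbook path below are placeholder assumptions, not my client’s actual values.)

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami_filter": {
      "filters": { "name": "amzn-ami-hvm-*-x86_64-gp2" },
      "owners": ["amazon"],
      "most_recent": true
    },
    "instance_type": "t2.micro",
    "ssh_username": "ec2-user",
    "ami_name": "app-base-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "ansible",
    "playbook_file": "playbooks/base.yml"
  }]
}
```

Running `packer build` against a template like this spins up a temporary instance, applies the playbook over SSH, snapshots the result into a timestamped AMI, and tears everything down.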


Ansible is a configuration management tool (as opposed to orchestration management) used to deploy software packages and configuration onto your servers. It will be maintained in source control and used in conjunction with Packer to aid in AMI transformation and management.

Ansible is based on a feature-rich set of Python modules to choose from and offers an agentless architecture. It is a much lighter-weight installation compared to the layers of OpsWorks, Chef and Ruby inter-dependencies.

You do not need to know Python to use Ansible, however; it has a very easy-on-the-eyes, human-readable format for building playbooks. It executes in a declarative manner: you specify your desired final state, and Ansible decides whether each task needs to be run or not.
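A hypothetical playbook illustrates that declarative style (the group, package, and service names are placeholders): each task describes a desired state, and Ansible only acts when the host doesn’t already match it.

```yaml
# site.yml — a minimal, idempotent playbook sketch
- hosts: webservers
  become: yes
  tasks:
    - name: Ensure nginx is installed
      yum:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled at boot
      service:
        name: nginx
        state: started
        enabled: yes
```

Run it twice and the second pass reports no changes, which is exactly the idempotency that makes re-running configuration safe.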


Terraform is orchestration management software used for standing up and tearing down your infrastructure resources: servers, load balancers, auto-scaling groups, security groups, VPC resources, etc. Although Terraform and CloudFormation run a close race, I chose Terraform over CloudFormation for a couple of reasons.


Terraform is provider agnostic; it is not tied to a single cloud provider the way CloudFormation only supports AWS infrastructure and resources. You could support an external DNS provider with Terraform, for example, as opposed to being tied to Amazon’s Route53. A short list of providers supported by Terraform includes Bitbucket, Datadog, DigitalOcean, Fastly, GitHub, GitLab, Google Cloud, Heroku, OpenStack, Mailgun, Microsoft Azure, StatusCake, and New Relic. My client happens to use both StatusCake and New Relic, so there is potential bonus functionality there.
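To give a flavor of that cross-provider reach, here is a hypothetical sketch that provisions an AWS instance and a StatusCake uptime check from the same configuration (all values and variable names are placeholders):

```hcl
provider "aws" {
  region = "us-east-1"
}

provider "statuscake" {
  username = "${var.statuscake_username}"
  apikey   = "${var.statuscake_apikey}"
}

resource "aws_instance" "web" {
  ami           = "${var.web_ami_id}"
  instance_type = "t2.micro"
}

# The monitoring check references the instance's IP directly,
# so Terraform orders the creation automatically
resource "statuscake_test" "web" {
  website_name = "web-server"
  website_url  = "http://${aws_instance.web.public_ip}"
  test_type    = "HTTP"
  check_rate   = 300
}
```

One `terraform apply` stands up both the compute and the monitoring, and the dependency graph keeps them in the right order.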

Terraform is easier on the eyes, in my opinion, than CloudFormation, which is written in JSON or YAML. Both tools keep track of the state of existing resources and operate in a declarative manner when deciding whether to add, update, or tear down from your defined stack. Terraform’s detailed and readable summary of the changes that will be applied, produced by the terraform plan command, is more in-depth than the basic overview CloudFormation provides with a change set.

There is no way to import existing resources into a CloudFormation stack. With Terraform, you can import existing resources, and you also have the ability to query attributes from them.
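Both capabilities can be sketched as follows (the resource names, tag value, instance ID, and CIDR are hypothetical placeholders): a manually created instance is adopted into state with `terraform import`, and a data source queries an existing VPC so new resources can reference its attributes.

```hcl
# Adopt an existing, manually created instance into Terraform state:
#   terraform import aws_instance.web i-0123456789abcdef0
resource "aws_instance" "web" {
  ami           = "${var.web_ami_id}"
  instance_type = "t2.micro"
}

# Query an existing VPC by tag and reference its attributes:
data "aws_vpc" "existing" {
  tags {
    Name = "legacy-vpc"
  }
}

resource "aws_subnet" "app" {
  vpc_id     = "${data.aws_vpc.existing.id}"
  cidr_block = "10.0.1.0/24"
}
```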

Terraform makes it super simple to modularize your infrastructure and re-use resources using modules in a sub-directory structure or to import a GitHub repo, for example.
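For instance, a root configuration can call a local module alongside one pulled from GitHub (the paths, repo, and inputs below are hypothetical):

```hcl
# Re-use a locally defined module from a sub-directory
module "web_cluster" {
  source        = "./modules/web_cluster"
  instance_type = "t2.micro"
  min_size      = 2
}

# Pull a shared module straight from a GitHub repository
module "network" {
  source = "github.com/example-org/terraform-network-module"
}
```

After adding or changing a module source, `terraform init` (or `terraform get` on older versions) fetches it before the next plan.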

Remote state is supported to enable delegation and cross-team usage of the tool to avoid conflicts.
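A common setup is the S3 backend with DynamoDB-based state locking; a sketch with placeholder bucket and table names:

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-state"   # placeholder bucket name
    key    = "prod/terraform.tfstate"
    region = "us-east-1"

    # Optional: a DynamoDB table provides state locking so two
    # team members cannot run conflicting applies at once
    dynamodb_table = "terraform-locks"
  }
}
```

Everyone on the team then reads and writes the same state file, rather than each keeping a divergent local copy.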

Although Terraform does not inherently support rolling updates to AWS auto-scaling groups, this can easily be achieved, as Rob Morgan explains nicely here. The key components he outlines are as follows:

  • Both the auto scaling group and launch configuration have create_before_destroy set.
  • The launch configuration omits the name attribute which allows Terraform to auto-generate it, preventing collisions.
  • The ASG interpolates the LC name into its name so any changes force a replacement of the ASG.
  • We set the wait_for_elb_capacity attribute of the auto scaling group, so Terraform does not prematurely terminate the current auto scaling group.
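Those four components can be sketched together like this (AMI variable, sizes, ELB name, and availability zone are placeholder assumptions):

```hcl
resource "aws_launch_configuration" "web" {
  # "name" is deliberately omitted so Terraform auto-generates a
  # unique one, preventing collisions during replacement
  image_id      = "${var.web_ami_id}"   # a new AMI ID triggers a new LC
  instance_type = "t2.micro"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "web" {
  # Interpolating the LC name into the ASG name forces a
  # replacement of the ASG whenever the LC changes
  name                 = "web-${aws_launch_configuration.web.name}"
  launch_configuration = "${aws_launch_configuration.web.name}"
  availability_zones   = ["us-east-1a"]
  min_size             = 2
  max_size             = 4
  load_balancers       = ["${aws_elb.web.name}"]

  # Wait until this many instances pass ELB health checks before
  # Terraform considers the new ASG created and destroys the old one
  wait_for_elb_capacity = 2

  lifecycle {
    create_before_destroy = true
  }
}
```

The net effect is a blue/green-style swap: the new ASG comes up behind the load balancer, proves healthy, and only then is the old one torn down.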

I’m excited to start building out this solution and to see my client’s CI/CD processes transform with reliable automation, predictability, and lower risk! Stay tuned and I’ll be posting more about my journey to replace this OpsWorks pipeline and learning the ins and outs of combining Terraform+Ansible.


AWS Solutions Architect trying to keep up with the singularity while striving to maintain a work-life balance. https://locusinnovations.com