Managing your (OpenStack) infrastructure with HashiCorp Terraform

This article is part of the “How we continuously deliver a hybrid software solution in a big corporation” series, which is currently being written. As it is a complex topic, I have decided to split it into multiple shorter blog posts on particular sub-topics. This is one of them.

Our CI/CD process consists of multiple subsequent zero-touch steps that deliver an integrated, multi-vendor software service from nothing to user-facing services at the end. Provisioning is the very first step: using only cloud credentials, it creates a set of pingable and sshable IPs with very restrictive firewall rules, on which we then install and configure the software using other tools.

For creating the virtual infrastructure (basically, VM instances) in our cloud environment, we decided to go with HashiCorp Terraform. At the beginning, our other options were Ansible (because our company uses it extensively), OpenStack (OS) Heat and the OS CLI. To be complete, we should also have considered Juju, but we were simply not aware of it at the time.

We did not go with Heat because we did not like the syntax, and we hated the fact that you have to use different commands for stack creation and for stack updates.

We did not go with the OpenStack CLI because it simply does not make sense for this. When you want to create an instance, you do your nova create or openstack server create, but if the instance already exists, you either have to deal with the tool's non-zero exit codes or implement the “logic” around it yourself. The OS CLI also probably changes more often than anything else (OS folks who know what I am talking about are laughing now).

At the very beginning (if we omit playing with OS Horizon, the web interface), we provisioned instances using Ansible's core OpenStack modules, but we very soon hit the wall and realized it was not the way to go. Ansible (by design) does not store the current state of anything: whenever you run the provisioning playbook against existing infrastructure, it wastes several valuable minutes (you want to run your CI/CD pipelines as often as possible, without any unnecessary delays) just checking whether the instances are already running or not.

With Terraform, we gained the superpower of managing our (existing) very-many-resource infrastructure in less than one minute, while keeping creation time in the same order of magnitude as with Ansible. Every run is (or should be) idempotent, meaning that if you run your provisioning 1024 times with the same input, against any real state, you should get the same result all 1024 times. Unfortunately, this is not always the case, as we will show later.

Terraform enables you to write the definition of your infrastructure in HCL, HashiCorp's own JSON-compatible configuration language.

You can find exhaustive HCL examples on the net, but we will also show some basic code snippets so that someone who has never touched TF can get a glimpse of it. A basic example that creates a security group is:

resource "openstack_networking_secgroup_v2" "common" {
  name        = "sec-common"
  description = "Rules common to all IAM servers"
}

You can also have variables in your syntax:

resource "openstack_networking_secgroup_v2" "common" {
  name        = "${var.tf_prefix}-sec-common"
  description = "Rules common to all IAM servers"
}

and use interpolations (an addition to the basic JSON-like syntax you already know) that allow you to refer to the attribute values of other resources:

resource "openstack_networking_network_v2" "net-lb" {
name = "${var.lb_network}"
admin_state_up = "true"
}
resource “openstack_networking_port_v2” “portal_vip” {
name = “portal.${var.domain}”
network_id = “${openstack_networking_network_v2.net-lb.id}”
admin_state_up = “true”
security_group_ids = [“${openstack_networking_secgroup_v2.common.id}”]
}
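
For completeness, variables such as tf_prefix or domain used above are declared separately, roughly like this (the description texts and the default value are just placeholders):

variable "tf_prefix" {
  description = "Prefix distinguishing environments (dev, stage, prod)"
  default     = "dev"
}

variable "domain" {
  description = "DNS domain used in instance and port names"
}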

As soon as your infrastructure code is written (or, to be rigorous, your infra is declared), you need to make it live (create it in the cloud). Terraform talks to your cloud APIs for you (and is absolutely not limited to OS only) and makes sure the infrastructure is always up to date with what you have coded. Before doing that, you are supposed to create a plan: Terraform compares your assumed current state, stored in a text file, with reality, which it fetches from the cloud via API calls. Once you are happy with the plan, you call apply, and if any changes are scheduled, they are actually performed.
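
In practice, the whole lifecycle boils down to a few commands (a minimal sketch; the plan file name is arbitrary):

terraform init              # initialize providers and the state backend
terraform plan -out=tfplan  # compute the diff between declared infra and reality
terraform apply tfplan      # perform exactly the changes recorded in the plan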

Terraform lifecycle

But there are “dragons”. If your cloud environment is not fully automated (our case) or you don’t have permissions to perform certain API calls (our case as well), you might be doomed and Terraform might prepare some surprise(s) for you.

Whenever you write your CI/CD pipelines for infrastructure management, bear in mind that any of your resources might (and, I bet, will) be destroyed at some point. Expect TF to end up in buggy states as well (its latest version is 0.9.something), for reasons you might anticipate and for reasons you might not. In a production environment, you should always check the plan before it gets executed (applied, in TF terms), or make sure that if something gets destroyed unexpectedly, it does not hurt you (or, in the end, your users). For CI/CD pipelines, this is a showstopper: you have to pause the pipeline, check the plan and approve it by clicking some magic button that starts it again. We did not find a way to programmatically decide whether a plan is or is not “fine”.
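
One crude heuristic (just a sketch, not something we present as a real solution) is to parse the destroy count from Terraform's “Plan: X to add, Y to change, Z to destroy.” summary and pause the pipeline when it is non-zero. It only tells you that something would be destroyed, not whether that is acceptable:

# crude sketch: block the pipeline whenever the plan would destroy anything
terraform plan -out=tfplan | tee plan.txt
destroys=$(grep -oE '[0-9]+ to destroy' plan.txt | grep -oE '[0-9]+' || echo 0)
if [ "${destroys:-0}" -gt 0 ]; then
  echo "Plan wants to destroy ${destroys} resource(s), manual review required."
  exit 1   # pause here; a human reviews plan.txt and re-triggers the apply
fi
terraform apply tfplan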

There is also a problem with state files. By default, Terraform stores your current state in a file in the directory where you run it. This makes it very impractical for use in teams, because sharing the state file with your teammates is a pain I do not need to explain in detail.

Fortunately, recent TF versions support storing the state file in various backend services (like S3, or in our case, OpenStack Swift), but locking has been added only recently, and only for some backends (not Swift): if two TF processes run in parallel, you can quickly get into big trouble when both try to write to the same file. As we terraform multiple environments (dev, stage, prod) with the same scripts, we have to configure the storage backend (replacing REPLACE-THIS with the Swift container name) just before running TF plan/apply:

terraform {
  backend "swift" {
    path = "REPLACE-THIS"
  }
}

and

sed -i 's/REPLACE-THIS/tf-state-'$ENV'/' remotestate.tf

In some unclear and hardly expected places, TF simply does not support variable interpolation, and we have experienced TF creating a Swift container named literally

${var.container_name}

(yes, including the dollar sign, the curly brackets, and all the text). In recent versions they explicitly say it is not supported and you have to work around it yourself (for example with a wrapper script, as we do with sed above).

In very recent versions of TF, backends are changing again. This will happen to you as well: when something is not implemented in TF yet, you cannot wait for it, so you pick some workaround, which may then reside in your code long after the feature is implemented (in our corporate words, we call this an “interim solution” :))

When writing the infrastructure definition, we started by creating instance resources only, leaving the handling of related resources (like ports) to the TF magic. But we quickly ran into trouble: during the dev phase, when we re-provisioned our whole infrastructure several times a day, the IP addresses of instances changed every time (IPs are assigned automatically by the cloud at port creation time) and, as we do not have automated DNSaaS yet, we had to update DNS records for dozens of instances manually. For this reason, we decoupled instance creation into the creation of every single resource possible: instance, port(s), network(s), subnet(s), security group(s) and SG rule(s) (see illustration). When an instance is destroyed for a good reason, the port is kept (if it has not changed, of course), and when the instance is recreated, the port is assigned back to it. That effectively preserves the port's previous attributes (including the IP address) and probably saves some API calls as well.
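
A rough sketch of such a decoupled instance, reusing the portal_vip port defined earlier (the image, flavor and key pair variables are hypothetical):

resource "openstack_compute_instance_v2" "portal" {
  name        = "portal.${var.domain}"
  image_name  = "${var.image_name}"    # hypothetical variables, pick your own
  flavor_name = "${var.flavor_name}"
  key_pair    = "${var.key_pair}"

  # attach the separately managed port instead of letting the cloud create one;
  # if the instance is destroyed and recreated, the port (and its IP) survives
  network {
    port = "${openstack_networking_port_v2.portal_vip.id}"
  }
}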

We encourage everyone to do the same with their resources from the beginning: you will definitely run into a situation where you need to modify some attribute of some resource, and if that resource is not declared in your infra code, you will have a hard time dealing with it.

OpenStack resources and their dependencies

TF enables you to import existing parts of your infrastructure into its state file, but you have to do this one by one and by hand. TF also does not always support importing the resource you might be using, for example openstack_compute_instance_v2 (yes, the basic building block you need). If possible, always create your infrastructure from scratch using TF and do not mess with import; otherwise, you may start hating the tool sooner than loving it.
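
For resources that do support it, the import is a one-off CLI call per resource, along these lines (the resource address comes from your code, the UUID from the cloud; whether a given resource type supports import depends on the provider version):

terraform import openstack_networking_secgroup_v2.common <security-group-uuid>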

TF is a clever piece of software and does the dependency handling for you, but with limited warranty — not all the time. Here you can meet dragons again.

When a resource is modified (you change your infrastructure definition) and TF wants to reach the desired state, it either modifies the existing resource in place (if possible) or destroys the existing one and creates a new one with the different attributes. Flavor (the OS name for pre-defined instance templates) is a good example of an attribute that can hardly be modified on the fly (I have never seen a public cloud that allows you to do this), while instance name is a good example of an attribute that is simple to modify without deleting the instance.

But the catch is in the term possible, mentioned above. The logic behind “if possible” resides on both sides, TF and the remote API. While it makes no sense for some feature (like instance renaming) to be implemented in TF but not in the remote API, it makes perfect sense that a feature implemented in the remote API might not (yet) be implemented in TF. When this is the case, TF falls back to deletion and recreation as its best effort, but when other resources depend on the recreated resource, they WILL BE recreated as well. Imagine an SG you named incorrectly and a hundred rules that belong to it: when you recreate the SG, you flood your API with the deletion of all the SG rules and the SG itself, and then the recreation of the SG and all its rules.
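
To picture the dependency chain: every rule references its parent SG by id, so renaming the SG (which forces its recreation) drags each of these rules through a delete-and-recreate cycle as well. A sketch of one such rule (the ports and the CIDR variable are placeholders):

resource "openstack_networking_secgroup_rule_v2" "common_ssh" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 22
  port_range_max    = 22
  remote_ip_prefix  = "${var.mgmt_cidr}"   # hypothetical variable
  security_group_id = "${openstack_networking_secgroup_v2.common.id}"
}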

The fun does not end here: before deleting an SG, you need to unassign it from the relevant port(s). This is currently not implemented (correctly), so TF fails to do so and returns an error.

Conclusion

Our conclusion on TF is that it is our favorite, very powerful tool for infrastructure management, but one absolutely has to know all the steps it is doing for her and never simply trust that it will behave as she expects unless she has tried it herself. When using it, change your mindset to “everything that can be destroyed will be destroyed, and I have to deal with it”.

This is fine if you understand what happens under the hood of OpenStack, but it is definitely not good for OpenStack beginners.

The good news is that if you run into any trouble, you should never hesitate to file an issue on GitHub; the TF folks react and fix problems very quickly.