Why use Infrastructure-as-Code and how to be successful at it

Published in cloudnativeinfra

7 min read, Apr 27, 2019

Infrastructure-as-Code (IaC) is the idea that you can represent and manage your infrastructure in a source file, the same way you manage your application source code.

You describe the compute, storage and network requirements in a text file kept in your version control system, and the IaC tool takes care of spinning up the infrastructure according to your code.

When you need to make changes to the infrastructure, you follow a workflow similar to application code changes and the IaC tool makes the changes happen.

Why Infrastructure-as-Code?

The most obvious benefit is that infrastructure change management becomes easier, as you will always have a versioned representation of your infrastructure in your SCM.

Using IaC allows you to build environments rapidly without any human intervention.

It can be part of a delivery pipeline, or it can be set up to fire in response to specific business events, so that infrastructure changes happen automatically.

Another key benefit is consistency of build. If you need to manage several environments such as Dev, QA, Staging and Prod, spinning them up from the same code base ensures they all behave the same way. If you are a System Integrator or Independent Software Vendor with multiple customers, you should consider building out your customers’ environments this way to ensure consistency and manageability.

If you have ‘snowflake’ or ‘pet’ servers that were built manually, admins are generally scared of rebuilding them, because confidence in getting them back up and running in the same condition is low. By adopting IaC, you can turn these ‘pets’ into ‘cattle’: you no longer worry about individual configurations or state, and you can boldly run the IaC tooling to rebuild a server, knowing full well that it will be in a good working state by the end of the rebuild.

Should I just use my cloud provider’s CLI or API to create my infrastructure, is that it?

No, you need a dedicated IaC tool to reap the benefits. Technically, you can use the CLI or API, but a script built on them is not idempotent unless you write the checks yourself.

Imagine you write a procedural script using the CLI or API to create a compute instance. A few lines later, the script aborts. When you rerun the script, it creates another compute instance (because the statements are executed again). Or, if the command was to destroy an instance, the rerun may fail with an error because the instance was already deleted in the first run and can no longer be found.

To get out of this mess you end up writing boilerplate code to check whether the resource already exists before you attempt any operation on it, which then ends up being messy procedural code.
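To make the problem concrete, here is a minimal sketch in Python. `FakeCloud`, `naive_script` and `guarded_script` are hypothetical names invented for illustration; the class is an in-memory stand-in for a cloud account, not a real provider SDK.

```python
# Hypothetical in-memory stand-in for a cloud account; not a real provider SDK.
class FakeCloud:
    def __init__(self):
        self.instances = []  # one entry per created instance

    def create_instance(self, name):
        self.instances.append(name)

    def find_instances(self, name):
        return [n for n in self.instances if n == name]


def naive_script(cloud):
    # Creates unconditionally, so every rerun adds another instance.
    cloud.create_instance("web-1")


def guarded_script(cloud):
    # The boilerplate check: only create when nothing with this name exists.
    if not cloud.find_instances("web-1"):
        cloud.create_instance("web-1")


naive = FakeCloud()
naive_script(naive)
naive_script(naive)  # rerun after an abort further down the script
print(len(naive.instances))  # 2: a duplicate instance

guarded = FakeCloud()
guarded_script(guarded)
guarded_script(guarded)  # the rerun finds the instance and does nothing
print(len(guarded.instances))  # 1
```

The guard works, but you would need one like it around every create, update and delete call in the script.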

This is where a proper IaC tool is useful, as it supports idempotent operations. This is how IaC tools operate:

You describe the desired infrastructure resources in a file (for example, a virtual network with three public subnets, and a compute instance on one of them with a block volume attached to it). You describe what you need, never how to create it; the IaC tool figures out how to create it.
The tool looks at what you have described in your code, logs in to your cloud account and checks whether those resources are present.
If the resources are not present, they are created.
If the resources are already present with the same attributes, no action is taken (as what you expect is already present).
If matching resources are found with differences, the IaC tool assumes you want them changed and makes the change happen.
The tool does not throw errors, fail or create unintended duplicate resources in any of these cases, because these operations are idempotent.

To summarise, you describe the desired state of your infrastructure resources; the IaC tool identifies the delta and applies the changes to make reality match your desired state, without you having to write any procedural code.
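The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular tool is implemented: `plan` and `apply` are hypothetical functions, the "cloud" is a plain dictionary, and resources that were removed from the description are ignored for brevity.

```python
def plan(desired, actual):
    """Return the actions needed to turn `actual` into `desired`.

    Both arguments map a resource name to a dict of attributes.
    """
    actions = []
    for name, attrs in desired.items():
        if name not in actual:
            actions.append(("create", name, attrs))
        elif actual[name] != attrs:
            actions.append(("update", name, attrs))
        # Resources that already match need no action.
    return actions


def apply(desired, actual):
    """Apply the plan to `actual` in place and return the actions taken."""
    actions = plan(desired, actual)
    for _verb, name, attrs in actions:
        actual[name] = dict(attrs)
    return actions


desired = {
    "vcn": {"cidr": "10.0.0.0/16"},
    "web-1": {"shape": "small", "subnet": "public-1"},
}
actual = {"web-1": {"shape": "tiny", "subnet": "public-1"}}

first = apply(desired, actual)
second = apply(desired, actual)
print(first)   # one create (vcn) and one update (web-1)
print(second)  # [] because reality already matches the desired state
```

Running `apply` a second time produces an empty plan, which is what idempotent means in practice: rerunning is always safe.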

The language used to describe the desired state is usually a domain-specific language or a structured data format tied to the tool (for example, HCL for Terraform or YAML for Ansible).

When invoked, the IaC tool inspects your code to see if your desired state matches the reality and employs the reconciler pattern to apply the change.

Caveats and gotchas

Successful adoption of IaC also requires a plan for handling certain situations that arise from change management in general.

The typical changes you need to deploy belong to one of these categories:

Routine application changes in the form of deployments, ideally delivered through a CD pipeline. If a deployment results in issues, you can roll back to the previous version of the code. But many infrastructure changes cannot be ‘rolled back’ in the traditional sense. If your new version of the code created an object storage bucket and you revert to a version without that bucket, your IaC tool may delete the bucket along with any data in it. Make sure you take backups at critical checkpoints (in addition to the routine backups) to safeguard against the side effects of such changes. Some IaC tools, like Terraform and Ansible, support dry runs that simulate and report the changes that are going to be performed. Adding a manual approval step for scenarios where resources get deleted can be a good practice.
Reactive changes in response to production events (for example, additional instances needed to cope with unforeseen load). Imagine this: you have a VM pool or Kubernetes cluster that has been declared with a desired state of 10 instances. A production issue happens and an Ops/SRE person (or an autoscaling process) decides to scale it up to 20. Your next routine deployment should not overwrite this back to 10. One approach to handling this is to follow GitOps, which can detect configuration drift and raise a pull request that merges the change back into your main branch. Terraform, for example, can be brutal if you make changes in the cloud’s console for an emergency fix (as the statefile no longer matches reality). Either use the tool to deploy the fix, or update the statefile after the fact so that it matches reality.
Proactive patches, security and version upgrades. Take an example where you have Ubuntu 16.04 on a Tesla P100 GPU instance running Nvidia Docker (which in turn needs a specific version of Docker). Before you push out an upgrade to any of these, you need to make sure the resulting combination is compatible. Though this does not have much to do with your infrastructure code, that is where you may run into errors. An upgrade may also fail silently, so you may need test cases to ensure upgrades did not break anything.
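The scale-up scenario in the second item can be illustrated with a small Python sketch. The function `changes_to_apply` and the `managed_elsewhere` parameter are hypothetical names made up for this example. The idea is that a deployment skips attributes owned by an autoscaler or an on-call engineer; Terraform's `ignore_changes` lifecycle setting plays a broadly similar role.

```python
def changes_to_apply(desired, actual, managed_elsewhere=()):
    """Return the attributes a deployment would overwrite.

    Attributes listed in `managed_elsewhere` are owned by someone else
    (an autoscaler or an on-call engineer) and are left alone.
    """
    return {
        key: value
        for key, value in desired.items()
        if key not in managed_elsewhere and actual.get(key) != value
    }


desired = {"image": "app:v2", "instance_count": 10}
actual = {"image": "app:v1", "instance_count": 20}  # scaled up during an incident

overwrite_all = changes_to_apply(desired, actual)
keep_scaling = changes_to_apply(desired, actual, managed_elsewhere={"instance_count"})

print(overwrite_all)  # includes instance_count: the deployment would scale back to 10
print(keep_scaling)   # only the image change: the emergency scale-up survives
```

Whether you ignore such attributes or merge the drift back into your code is a design choice; either way, it should be decided before the incident, not during it.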

Resources carrying stateful data like databases, block volumes and object storage deserve special attention when you write your IaC.

When some attribute of one of these resources is changed, the IaC tool may decide to destroy the resource and spin up a new one that matches your specified criteria, deleting all existing data with it. This depends entirely on the combination of the tool you are using and how your cloud provider has implemented their plugin for the tool.
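A rough Python sketch of that decision follows. The rule tables `IMMUTABLE_ATTRS` and `STATEFUL_TYPES`, the resource type names and the attribute names are all assumptions made up for illustration; real tools take this information from the provider plugin.

```python
# Hypothetical rules; real tools take these from the cloud provider's plugin.
IMMUTABLE_ATTRS = {"block_volume": {"availability_domain"}, "database": {"engine"}}
STATEFUL_TYPES = {"block_volume", "database", "object_storage_bucket"}


def classify_change(resource_type, old, new):
    """Return 'no-op', 'update' or 'replace' for a single resource."""
    changed = {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
    if not changed:
        return "no-op"
    if changed & IMMUTABLE_ATTRS.get(resource_type, set()):
        return "replace"  # destroy and recreate: existing data is lost
    return "update"


def needs_manual_approval(resource_type, action):
    """Flag replacements of data-carrying resources for a human to review."""
    return action == "replace" and resource_type in STATEFUL_TYPES


old = {"size_gb": 100, "availability_domain": "AD-1"}
resize = classify_change("block_volume", old, {**old, "size_gb": 200})
move = classify_change("block_volume", old, {**old, "availability_domain": "AD-2"})
print(resize, needs_manual_approval("block_volume", resize))  # update False
print(move, needs_manual_approval("block_volume", move))      # replace True
```

In a pipeline, a check like `needs_manual_approval` could run against the dry-run output and pause the deployment until someone confirms that a backup exists.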

Service discovery can be a challenge in the world of IaC with some legacy workloads that rely on physical aspects of infrastructure like IP addresses.

The use of a service mesh within a Kubernetes cluster can help services locate each other in a world where IP addresses change constantly due to pods being replaced, but that may not be feasible in the VM world with some legacy workloads that rely on the IP addresses of the VMs. You will need to remember to update DNS or use floating IP addresses to give these applications a stable endpoint when your IaC tool rebuilds these environments.

Some IaC tools like Terraform rely on statefiles to track the current state (as opposed to peeking into the cloud account to see what is present). It is important to keep this file in a commonly accessible place, as you will not be able to make any changes without the statefile.
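To see why a lost statefile hurts, consider this simplified Python sketch of state-based tracking. It does not reflect Terraform's actual file format or behaviour; `load_state`, `save_state`, `resources_to_create` and the resource IDs are hypothetical.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the recorded resources, or an empty record if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    with open(path, "w") as f:
        json.dump(state, f)


def resources_to_create(desired, state):
    """A state-based tool only knows about what its statefile records."""
    return sorted(name for name in desired if name not in state)


desired = {"vcn": {}, "web-1": {}}
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.json")
    save_state(path, {"vcn": {"id": "hypothetical-id-1"},
                      "web-1": {"id": "hypothetical-id-2"}})
    with_state = resources_to_create(desired, load_state(path))

    os.remove(path)  # simulate a lost statefile
    without_state = resources_to_create(desired, load_state(path))

print(with_state)     # []
print(without_state)  # ['vcn', 'web-1']: the tool would try to create everything again
```

Without its record, the tool in this sketch no longer recognises the resources it created earlier, which is why the statefile belongs in shared, backed-up storage rather than on one person's laptop.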

These challenges highlight the need for prior planning of how application changes impact your IaC and testing of your IaC code for handling possible scenarios before you apply them in your production environment.

There are changes resulting from IaC adoption that you will need to get used to, but the rewards of adopting IaC make it worthwhile.

You can always take it one step at a time. Start by using IaC tools for new installations; you can then gradually expand their use to ongoing maintenance.

If you want to know more about the capabilities of specific IaC tools, watch this space for a new post soon that discusses the most commonly used IaC tools.

Recommended Reading:

Infrastructure as Code

Virtualization, cloud, containers, server automation, and software-defined networking are meant to simplify IT…

Cloud Native Infrastructure

Cloud native infrastructure is more than servers, network, and storage in the cloud-it is as much about operational…

Written by Ramnath Nayak