Terraform for Network Engineers: Should you be implementing Infrastructure as Code?

Dan Kirkwood
HashiCorp Solutions Engineering Blog
Jun 5, 2020


Photo by Ricardo Gomez Angel on Unsplash

In my role as a Solutions Engineer at HashiCorp, every day I get to talk with our users and customers about how our products fit into their workflows. One of the most common questions I hear is whether Terraform can manage a particular area of their infrastructure.

This blog answers that question for one area in particular: the network. It assumes basic knowledge of networking and automation tooling. (If you are looking for a more general explanation of Terraform and Infrastructure as Code, take a look at this article from my colleague Sean Carolan.)

A day in the life of a network ticket

To understand Infrastructure as Code (IaC) and the difference it makes to network operations, let us consider the following request:

I need a subnet!

A simple enough request on the surface. The first thing that would probably happen is an exchange over email, a ticket thread or a discussion on your internal IM platform, covering questions such as:

  • What kind of workloads will reside on this subnet?
  • How many hosts will it hold?
  • What kind of traffic will ingress and egress from the subnet?
  • Does it need internet access?

And so on.

Once this information is known, one or more operators would need to interact with several systems to fulfil the request, for example:

  1. Check an IP Address Management (IPAM) tool for available address space
  2. Consult existing route tables to make sure your subnet meets summarisation constraints
  3. Find the router or route table that will own the route, and configure the subnet
  4. Configure the L3 gateway interface
  5. Configure the VLAN in your L2 domain via VLAN definition and trunk links
  6. Configure the VM port group or L2 interfaces for the access layer
  7. Configure the firewall to ensure access to/from the subnet to other resources
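To make steps 1 and 2 concrete, here is a minimal Python sketch of the lookup an operator would otherwise do by hand. It is illustrative only. The site block, existing allocations and summary route are hypothetical values. A real IPAM tool would be queried through its API rather than a hard-coded list. The sketch uses only the standard library `ipaddress` module.

```python
from __future__ import annotations

import ipaddress


def next_free_subnet(supernet: str, allocated: list[str], prefixlen: int) -> str | None:
    """Step 1: return the first /prefixlen block in `supernet` that overlaps
    none of the `allocated` prefixes, or None if no space is left."""
    parent = ipaddress.ip_network(supernet)
    used = [ipaddress.ip_network(cidr) for cidr in allocated]
    # Walk candidate blocks in address order and take the first free one.
    for candidate in parent.subnets(new_prefix=prefixlen):
        if not any(candidate.overlaps(block) for block in used):
            return str(candidate)
    return None


def fits_summary(subnet: str, summary: str) -> bool:
    """Step 2: check that the new subnet falls inside the summary route for its
    site, so that it does not break existing route summarisation."""
    return ipaddress.ip_network(subnet).subnet_of(ipaddress.ip_network(summary))


# Hypothetical data standing in for an IPAM export and a site summary route.
site_block = "10.20.0.0/22"
in_use = ["10.20.0.0/24", "10.20.1.0/26"]

new_subnet = next_free_subnet(site_block, in_use, 26)
print(new_subnet, fits_summary(new_subnet, "10.20.0.0/16"))  # prints: 10.20.1.64/26 True
```

Even this small piece carries two separate data sources and a rule that lives in someone's head. The remaining steps then have to reuse its result correctly.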

At this point I am going to pause and ask you to think about this process at your organisation. Do the steps above become a ticket? Maybe multiple tickets? How many people have the knowledge to complete this task end to end? How often do they get something wrong along the way? How often does someone catch the mistake before an application with connectivity problems is deployed? How consistent is this process across your on-premises and cloud environments? Is it consistent across networking overlays and underlays?

I hope it is obvious that some kind of automation would benefit this repetitive task. As a network engineer you have many automation tools at your disposal, whether vendor-provided or home-grown. To explain why Terraform is a good fit, I have broken down some common challenges in fulfilling this request with those tools. For each one, I have included examples of how Terraform helps to overcome it.

The challenges Terraform will help you overcome in network automation

Complexity

The first challenge is that a single logical request involves many different vendor systems, each requiring knowledge of its own CLI, GUI or API.

Terraform offers a single, easy-to-read, machine-parseable language called HashiCorp Configuration Language (HCL), which is consistent no matter which technology you are interacting with. Vendors and the Terraform community also maintain providers, which are the API interaction point with your infrastructure. The Terraform code for a subnet can be a single file that describes everything needed to satisfy the request.

The outcome is that operators do not need to learn each API separately and can spend their time provisioning infrastructure instead. Team members have a single view of the subnet request that is easy to read and understand. In the example below, we use HCL to create a subnet across three cloud environments. Note that HCL does not try to abstract away the service model of each cloud: Terraform does not force you down the path of the lowest common denominator when interacting with cloud resources.

Creating a route in HCL across GCP, AWS and Azure

Safe data handling

The second challenge with setting up your own automation of network changes is that the manual retrieval and translation of attributes, such as the subnet from our IPAM tool, is time consuming and error prone.

Terraform includes simple syntax for passing data between resources in the same request. The subnet allocated by the IPAM tool can be referred to as ipam.subnet on the L3 interface, on the firewall, in security groups and so on. With this in place, less time is spent on manual copy-paste actions.

This in turn lowers the risk of operator error, one of the main causes of network outages. In the example below, we use Terraform to claim an IP address from Infoblox, and then pass that IP address to a configuration block for a vSphere virtual machine.

Passing a reserved IP address from Infoblox to a vSphere VM
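In Terraform this is done with references in HCL. The Python sketch below mimics only the idea, and is not how Terraform or any provider works internally. One allocated value, standing in for the subnet returned by the IPAM tool, is the single source for every downstream setting. The field names (`l3_interface`, `firewall_rule`, `security_group`) are hypothetical and do not match any provider's schema. Treating the first usable host as the gateway is also an assumption.

```python
import ipaddress


def derive_configs(allocated_subnet: str) -> dict:
    """Derive every downstream setting from one allocated subnet, mirroring how
    a single reference (ipam.subnet in the text) is reused across resources."""
    net = ipaddress.ip_network(allocated_subnet)
    # Assumption: the first usable host address is the L3 gateway.
    gateway = next(iter(net.hosts()))
    return {
        # Hypothetical field names; no real provider schema is implied.
        "l3_interface": {"address": f"{gateway}/{net.prefixlen}"},
        "firewall_rule": {"destination": str(net), "action": "allow"},
        "security_group": {"cidr_blocks": [str(net)]},
    }


configs = derive_configs("10.20.1.64/26")
print(configs["l3_interface"]["address"])  # prints: 10.20.1.65/26
```

If the allocation changes, every derived value changes with it. A hand-copied value cannot drift between the firewall and the interface.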

Order of operations

Network operators need to follow a specific order of operations to complete this request, and that order is often specialist or tribal knowledge. If the process is documented, time and effort are required to keep the documentation up to date.

Resource graph for Cisco ACI, showing Terraform computed dependencies and opportunities for parallel operations.

Terraform is a declarative tool: you describe the end state, and Terraform works out the dependencies in the request automatically, without the operator needing to know the correct order of operations. As Terraform computes the dependencies between resources, it also knows which resources can be created in parallel.
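The dependency reasoning can be illustrated with a toy model. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to group a hypothetical set of subnet-request resources into "waves". Every resource in a wave has all of its dependencies satisfied by earlier waves, so the resources in a wave could be created in parallel.

This is a simplification, not Terraform's implementation. Terraform does not wait for a whole wave to finish. It starts each resource as soon as its own dependencies are complete, subject to its parallelism limit.

```python
from graphlib import TopologicalSorter


def creation_waves(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group resources into waves: everything in a wave has all of its
    dependencies satisfied by earlier waves, so it could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical resources for the subnet request; each maps to what it depends on.
subnet_request = {
    "ipam_subnet": [],
    "vlan": [],
    "l3_gateway": ["ipam_subnet", "vlan"],
    "port_group": ["vlan"],
    "firewall_rule": ["ipam_subnet"],
}

for number, wave in enumerate(creation_waves(subnet_request), start=1):
    print(number, wave)
```

Here the IPAM allocation and the VLAN come first and can run side by side. The gateway, port group and firewall rule follow together. Nobody had to write that order down.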

The great outcome for the business here is that more operators are empowered to complete this request, and less time is spent creating and maintaining documentation. At scale, Terraform creates independent resources in parallel out of the box (by default up to 10 concurrent operations, adjustable with the -parallelism flag).

Lifecycle

Many operations teams focus on creating resources to meet business needs, but the resource lifecycle is often ignored. This leads to stale infrastructure that can exhaust resources such as IP pools, router CPU or firewall rule space.

Terraform keeps track of this subnet after its creation. If the code is changed (let's say the requestor soon needs a /25 instead of a /26), Terraform works out which elements of infrastructure need to be updated to make the change. The whole subnet can also be removed from every piece of infrastructure where it was provisioned with a single ‘terraform destroy’ command.
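The core of this lifecycle tracking is a comparison between what was provisioned (the state) and what the code now declares. The Python sketch below is a toy model of that comparison, not Terraform's plan algorithm. The resource addresses and attributes are hypothetical. Real Terraform also distinguishes in-place updates from destroy-and-recreate replacements, which this sketch ignores.

```python
def plan(current: dict[str, dict], desired: dict[str, dict]) -> dict[str, list[str]]:
    """Compare tracked state with the desired configuration and list what must
    be created, updated or deleted (a toy model of a plan)."""
    create = sorted(desired.keys() - current.keys())
    delete = sorted(current.keys() - desired.keys())
    update = sorted(k for k in desired.keys() & current.keys() if desired[k] != current[k])
    return {"create": create, "update": update, "delete": delete}


# Hypothetical state after the original /26 was provisioned.
current_state = {
    "subnet.app": {"cidr": "10.20.1.64/26"},
    "firewall_rule.app": {"destination": "10.20.1.64/26"},
    "acl.legacy": {"destination": "10.99.0.0/24"},
}
# Edited code: the subnet grows to a /25 and the legacy ACL is no longer declared.
desired_config = {
    "subnet.app": {"cidr": "10.20.1.0/25"},
    "firewall_rule.app": {"destination": "10.20.1.0/25"},
}

print(plan(current_state, desired_config))
# Destroying is the same comparison against an empty configuration.
print(plan(current_state, {}))
```

Because the state is tracked, the stale ACL shows up as a deletion instead of lingering on the device.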

With a tool that includes lifecycle management by default, more resources remain available to the teams that need them. Reducing stale infrastructure also makes the platform easier to understand and maintain.

Knowing which ACLs can be removed from a device due to configuration changes can be challenging

Collaboration

Infrastructure teams struggle to work collaboratively when using CLI or GUI systems. Ad hoc scripts and API interactions can lead to code sprawl and a lack of visibility into who is making changes. Differing levels of coding and networking knowledge also mean that not everyone in the team is empowered to make changes when needed.

Terraform can integrate directly with version control systems or CI/CD pipelines to implement best practices around code hygiene. Terraform code can also be broken into modules. These are published by teams with the domain knowledge (for instance, the network team) for consumption by other teams who just need the outcome (for example, application teams).

The outcome is that infrastructure is more predictable, repeatable and testable. The business gets a centralised audit trail of changes to an environment, and teams can consume a module without knowing the intricacies of network management.

A module for creating a network in Oracle Cloud. The module takes care of creating a DNS label, route table and gateway for the end user.

Increased risk

In some teams that begin to move fast with automation, we see that removing the human element from infrastructure provisioning introduces new risks: something can go wrong at large scale, or the automated system can be abused.

Terraform Cloud and Terraform Enterprise include Sentinel, a policy as code framework that can be customised around governance and compliance controls. Types of policies include:

  • Guardrails for best practice — “Subnets should be under /24 unless approved by the network team”
  • Security policy — “Access to workloads in the DataCenter should only be given on ports 22 and 443 and only from trusted entities”
  • Operations governance — “No changes to the network after 5pm on Friday and before 8:30am on Monday”
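Sentinel policies are written in Sentinel's own language, which this post does not show. As an illustration only, the Python sketch below models the logic of the guardrail and operations governance examples. It is not Sentinel code. It assumes "under /24" means no subnet larger than a /24 (a prefix length of at least 24) without approval. It also assumes the freeze window is evaluated in the local time of the given timestamp. The function names are hypothetical.

```python
from __future__ import annotations

import ipaddress
from datetime import datetime, time


def subnet_size_ok(cidr: str, approved: bool = False) -> bool:
    # Assumed reading of the guardrail: no subnet larger than a /24
    # (prefix length of at least 24) unless the network team approved it.
    return approved or ipaddress.ip_network(cidr).prefixlen >= 24


def in_change_freeze(when: datetime) -> bool:
    # Freeze window: Friday 17:00 until Monday 08:30, in the local time of `when`.
    day, t = when.weekday(), when.time()  # weekday(): Monday == 0 ... Sunday == 6
    if day == 4:
        return t >= time(17, 0)
    if day in (5, 6):
        return True
    if day == 0:
        return t < time(8, 30)
    return False


def check_request(subnets: list[str], when: datetime) -> list[str]:
    """Return the policy violations for a request; an empty list means it may proceed."""
    violations = [f"{cidr} is larger than /24" for cidr in subnets if not subnet_size_ok(cidr)]
    if in_change_freeze(when):
        violations.append("change requested during the weekend freeze")
    return violations


# 5 June 2020 was a Friday, so 18:00 falls inside the freeze window.
print(check_request(["10.20.0.0/23", "10.20.4.0/26"], datetime(2020, 6, 5, 18, 0)))
```

The point is that the rules are evaluated before anything is applied, and the same input always yields the same decision.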

Sentinel helps to lower risk around automated changes. By defining policies in code, they are auditable and enforceable before any change is made. Terraform also enables integration into existing change management processes for elevation and change approval.

In the screenshot below, infrastructure provisioned by Terraform is first checked against best practice as defined by the Center for Internet Security (CIS). Only resources in this Google Cloud environment that follow best practice will be provisioned. (The CIS guidelines are available in our Terraform Foundational Policy library.)

Terraform policy checks

Infrastructure as Code with Terraform sounds great… why am I not using it already?

We have seen huge adoption of Terraform within teams who have gone all-in on public cloud. Terraform makes sense in these environments because they are usually very dynamic, and also because the public cloud vendors expose their full functionality via a rich API. (Take a look at the network provisioning capabilities on AWS, Azure and GCP.)

The recommended model of managing infrastructure with Terraform is via API, and for the most part Terraform integrations (called ‘providers’) focus on system orchestrators rather than deploying configuration box-by-box. Many network vendors are now offering APIs as first-class integration points as well. Some examples include:

While Terraform cannot manage every network today, we have a huge community of vendors and developers constantly releasing new providers. Keep an eye on the HashiCorp blog for announcements in this space.

How can I get started?

You can start managing your network with Terraform today. To work on your local machine, visit our downloads page. You can also run Terraform on our SaaS platform, Terraform Cloud, which has a free tier. We also have plenty of well-documented getting started guides on our Learn site.

I hope this blog has shown the possible use cases for Infrastructure as Code in network management, and where Terraform is particularly useful in the safe and reliable automation of infrastructure provisioning. Thanks for reading and see you next time!
