About Infrastructure on AWS, Automated with Terraform, Ansible and GitLab CI

Robin Flume
13 min read · Apr 7, 2019


In my first story I want to share our workflow for fully automating the creation and management of our infrastructure on AWS.

When I started working with Terraform some time ago, I came across this tutorial on how to automate Terraform with GitLab, published by Tim Berry here on medium.com. It was a great starting point to automate Terraform with GitLab CI, but some additional work was still required for us, especially the creation of a custom Docker image that includes Ansible. This led me to publish my own article on the topic.

The Toolset

The tools we use to automatically create and provision AWS resources are:

  • Terraform
  • Ansible (together with the Terraform Ansible provisioner)
  • GitLab CI

I won’t explain too much about the tools themselves but rather provide a detailed example of how to combine them. Some experience with using each of them individually would therefore be an advantage.

The Project Layout

When executing one of the terraform commands, Terraform checks for all files with a .tf extension in the current working directory. This enables very flexible project structures. Ansible, on the other hand, requires a somewhat stricter directory layout. So let’s start with a rough overview of our GitLab project.

├── ansible-provisioning
│   └── roles
│       └── my-global-role
│           └── ...
├── global
│   ├── files
│   │   └── user_data.sh
│   └── {main|outputs|terraform|vars}.tf
├── environments
│   ├── dev
│   │   └── {main|outputs|terraform|vars}.tf
│   ├── stage
│   │   └── {main|outputs|terraform|vars}.tf
│   └── prod
│       └── {main|outputs|terraform|vars}.tf
├── modules
│   └── my_module
│       ├── ansible
│       │   └── playbook
│       │       └── ...
│       └── {main|outputs|terraform|vars}.tf
├── .gitlab-ci.yml
└── README.md

Terraform files defining our environment-specific resources, such as the VPC, IGW, subnets, etc., are stored in the directories environments/[dev, stage, prod]. In the global directory we define components that apply to all environments, e.g. the default ssh_key_name for the ubuntu user or the user_data.sh script. The latter serves as a cloud-init script for basic instance provisioning. Its content is “exported” via a Terraform output value. These outputs can then be imported and used in the different environments via a terraform_remote_state data source.
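
As a minimal sketch (the aws_key_pair resource name is a hypothetical placeholder; only the output names user_data and ssh_pubkey are actually referenced later in this article), the outputs of the global project might look like this:

# global/outputs.tf
output "user_data" {
  value = "${file("${path.module}/files/user_data.sh")}"
}

output "ssh_pubkey" {
  # hypothetical key pair resource holding the default ssh key
  value = "${aws_key_pair.terraform.key_name}"
}

An environment can then reference these values as data.terraform_remote_state.global.user_data and data.terraform_remote_state.global.ssh_pubkey, as done in the instance resource further below.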

We use the terraform.tf file to define the Terraform backend, the AWS provider and (optionally) import the mentioned terraform_remote_states as follows:

# Terraform Backend
terraform {
  backend "s3" {
    bucket         = "[...]"
    dynamodb_table = "[...]"
    key            = "[...]"
    region         = "[...]"
  }
}

# AWS Provider
provider "aws" {
  region = "[...]"
}

# Import State "global" From Remote S3 Bucket
data "terraform_remote_state" "global" {
  backend = "s3"

  config {
    region = "[...]"
    bucket = "[...]"
    key    = "[...]"
  }
}

Note that we do not set any AWS credentials directly, neither in the backend nor in the AWS provider section. Instead, they are set as environment variables (and will be in the GitLab CI pipeline later on as well). We only set the AWS region to work in here.

The modules folder contains our Terraform modules, which group resources that are required repeatedly. A module can then easily be used in an environment by embedding it there. Check out the Terraform Docs for more information on how to use modules.
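
For illustration only (this article does not define any module inputs, so the comment below is just a placeholder), embedding a module in an environment looks roughly like this:

# e.g. in environments/dev/main.tf
module "my_module" {
  source = "../../modules/my_module"

  # module inputs as declared in modules/my_module/vars.tf
  # ...
}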

ansible-provisioning is used as the storage directory for our Ansible roles that are module-independent. That way, we have them defined only once and can easily maintain them in that single location. In addition, every module has an individual ansible/playbook/roles directory to store its module-specific roles.

Finally, the .gitlab-ci.yml file defines our GitLab CI pipeline. More on that later…

Create and Provision EC2 Instances

Now, let’s create the Terraform definition of an EC2 instance in a module. We can then create specific Ansible roles for the provisioning of exactly that module’s instance type. For simplicity, I only focus on the key elements required to explain this whole automation workflow rather than the full resource definition.

resource "aws_instance" "default" {
...
user_data = "${data.terraform_remote_state.global.user_data}"
...
associate_public_ip_address = true
key_name = "${data.terraform_remote_state.global.ssh_pubkey}"
# ignore user_data updates, as this will require a new resource!
lifecycle {
ignore_changes = [
"user_data",
]
}
}

Within user_data.sh we install Python (required to execute Ansible tasks on the remote host) and create our default user accounts, including a terraform user which the Ansible provisioner will use to connect to the EC2 instance.

#!/bin/bash

timedatectl set-timezone UTC
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
apt-get install -yqq python python3
...  # setup default admin user accounts

# setup user 'terraform'
useradd -m -p "[PASSWORD_HASH]" -s /bin/bash -G sudo terraform
mkdir -p /home/terraform/.ssh
chown terraform:terraform /home/terraform/.ssh
chmod 0700 /home/terraform/.ssh
echo "[PUBLIC_KEY]" > /home/terraform/.ssh/authorized_keys
chmod 0644 /home/terraform/.ssh/authorized_keys
deluser --remove-home ubuntu

As you can see, we already set a password for the terraform user (in the form of a password hash) and write the user’s SSH public key to its authorized_keys file. The user is also added to the sudo group in order to run Ansible tasks as root. Last but not least, we delete the default ubuntu user, which is no longer needed now that our own user accounts are set up.

Now that the prerequisites are defined, we can create the null_resource that runs the Ansible provisioner. The secret files referenced in it are explained in the Secrets Handling section below. Additional information on the Ansible provisioner configuration can be found in its GitHub repository.

resource "null_resource" "default_provisioner" {
triggers {
default_instance_id = "${aws_instance.default.id}"
}

connection {
host = "${aws_instance.default.public_ip}"
type = "ssh"
user = "terraform" # as created in 'user_data'
private_key = "${file("/root/.ssh/id_rsa_terraform")}"
}
# wait for the instance to become available
provisioner "remote-exec" {
inline = [
"echo 'ready'"
]
}
# ansible provisioner
provisioner "ansible" {
plays {
playbook = {
file_path = "${path.module}/ansible/playbook/main.yml"
roles_path = [
"${path.module}/../../../../../ansible-provisioning/roles",
]

}
hosts = ["${aws_instance.default.public_ip}"]
become = true
become_method = "sudo"
become_user = "root"

extra_vars = {
...
ansible_become_pass = "${file("/etc/ansible/become_pass")}"
}

vault_password_file = "/etc/ansible/vault_password_file"
}
}

If you create and assign an AWS Elastic IP Address to the instance, make sure to replace "${aws_instance.default.public_ip}" with "${aws_eip.default.public_ip}", otherwise the provisioner won’t be able to connect to the instance. Note that there are both a host and a hosts parameter. host is used for the remote-exec provisioner, which is required in order to wait for the instance to become available, while hosts defines the hosts that the Ansible provisioner will connect to. More on that below…
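
As a sketch of that variant (the aws_eip resource itself is not defined anywhere in this article, so its name and arguments are assumptions):

resource "aws_eip" "default" {
  instance = "${aws_instance.default.id}"
  vpc      = true
}

The host parameter in the connection block and the hosts list in the plays block would then both use "${aws_eip.default.public_ip}".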

The provisioner will trigger when the ID of the default instance changes, as defined in the triggers {} map of the null_resource. It then runs the Ansible playbook located at ${path.module}/ansible/playbook/main.yml, which is our module-specific playbook.

Ansible Files within the Terraform Module

Let’s have a look at the module’s directory structure to get a better idea of how it all is tied together:

├── modules
│   └── my_module
│       ├── ansible
│       │   └── playbook
│       │       ├── roles
│       │       │   └── my-module-specific-role
│       │       │       └── tasks
│       │       │           └── main.yml
│       │       ├── group_vars
│       │       │   └── all
│       │       │       └── vault   # stores encrypted Ansible secrets
│       │       └── main.yml
│       └── {main|outputs|terraform|vars}.tf

The structure of the ansible subdirectory follows (one of) the regular Ansible layout(s). In the group_vars/all directory we can place a vault to store Ansible variables securely. We use an individual vault for each module.

Now, let’s check out the module’s Ansible playbook located at my_module/ansible/playbook/main.yml.

---
- hosts: all  # executed on 'all' hosts set in the tf provisioner
  roles:
    - my-global-role
    - my-module-specific-role
    # - ...
  become: yes

We define all as hosts, as we cannot use a classic Ansible inventory here. However, we can set a single host IP or multiple host IPs directly in the provisioner resource, as in hosts = ["${aws_instance.default.public_ip}"]. We could of course also specify more instance IPs here, e.g. join the IPs of multiple EC2 instances created with the Terraform count parameter in the instance resource definition, as sketched below.
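
A minimal sketch of the count variant (the count value is arbitrary and the splat expression follows the 0.11-style syntax used throughout this article):

resource "aws_instance" "default" {
  count = 3
  ...
}

# in the 'plays' block of the ansible provisioner:
hosts = ["${aws_instance.default.*.public_ip}"]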

As you can see, the playbook includes both a global role (“my-global-role”) from the ansible-provisioning directory (as shown in the project overview at the beginning of the article) and a module-specific role, which is stored at modules/my_module/ansible/playbook/roles/my-module-specific-role. In the global role we can, for example, install Docker. That is a task that is required often, but still not on every instance, and is therefore not included in the user_data.sh script. In the my-module-specific-role role we could then spawn certain Docker containers to make the instance a worker for whatever the Terraform module is intended for.

Secrets Handling

So, what about the SSH private key, vault password, etc. that are configured in the provisioner resource? This brings us to why I wrote this article in the first place.

As we intend to automate Terraform with a GitLab CI pipeline, I had to create a Terraform Docker image with the required tools. It is available on Docker Hub and GitHub. The reasons a custom image was needed were:

  • The official hashicorp/terraform:full image is quite large at ~580 MB and, in addition, lacks the Ansible provisioner (another ~30 MB).
  • The official hashicorp/terraform:light image is indeed only ~40 MB in size, but unfortunately lacks both the AWS provider and (of course) the Ansible provisioner.

Thus, with the custom image I was able to bundle all the required tools together and still reduce the image size from a total of ~610 MB to ~230 MB.

As secrets should never be stored in a Docker image itself, we set the secret files through the GitLab CI pipeline configuration. To still be able to apply Terraform locally without pushing changes to GitLab first, we can also set the secrets (and files) locally. The following are required:

  • /etc/ansible/vault_password_file: Contains the secret to decrypt the vault.
  • /etc/ansible/become_pass: Contains the password of the terraform user.
  • /root/.ssh/id_rsa_terraform: Contains the SSH private key of the terraform user.
  • The AWS credentials AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set as environment variables.

Setting them locally as well is required because those static paths are hard-coded in the Terraform resources! The Ansible provisioner, for instance, would fail if the /root/.ssh/id_rsa_terraform or /etc/ansible/become_pass files were not available locally.

Of course, for local usage you need to install the Ansible provisioner as well. You can find the installation instructions here.

The AWS region is set directly in the provider configuration, as mentioned earlier. If you do not want to use AWS credentials as environment variables locally, e.g. because they would collide with another access key already configured, you can pass the credentials to the backend initialization directly, instead of running plain terraform init as the CI pipeline does:

terraform init -backend-config="access_key=[...]" -backend-config="secret_key=[...]"

To ensure we don’t store any secrets in the GitLab repository, our .gitignore contains the following:

# Terraform local state (including secrets in backend configuration)
.terraform
terraform.tfstate.*.backup
# Ansible retry-files
*.retry

Configuring the GitLab CI Pipeline

That brings us to the configuration of the GitLab CI pipeline… After creating the project, we first need to set the required secrets in our project settings:

The Project’s CI/CD Secrets

You can optionally protect these variables for specific branches or alike. Please consult the GitLab CI Docs for further information.

The pipeline itself is configured in the .gitlab-ci.yml file. To keep it short at first, the code below shows only the basic pipeline for the Terraform dev environment. If you already know how to configure GitLab CI pipelines, you can jump to the end of the page for the full configuration!

image:
  name: rflume/terraform-aws-ansible:latest

stages:
  # dev environment stages
  - validate dev
  - plan dev
  - apply dev

variables:
  AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY

# Create files w/ the required secrets
before_script:
  - echo "$ID_RSA_TERRAFORM" > /root/.ssh/id_rsa_terraform
  - chmod 600 /root/.ssh/id_rsa_terraform
  - echo "$ANSIBLE_VAULT_PASS" > /etc/ansible/vault_password_file
  - echo "$ANSIBLE_BECOME_PASS" > /etc/ansible/become_pass

# Apply Terraform on dev environment
validate:dev:
  stage: validate dev
  script:
    - cd environments/dev
    - terraform init
    - terraform validate

plan:dev:
  stage: plan dev
  script:
    - cd environments/dev
    - terraform init
    - terraform plan -out "planfile_dev"
  artifacts:
    paths:
      - environments/dev/planfile_dev

apply:dev:
  stage: apply dev
  script:
    - cd environments/dev
    - terraform init
    - terraform apply -input=false "planfile_dev"
  dependencies:
    - plan:dev

First, we define the Docker image to run the pipeline with. We set it to my custom image for the reasons named above.

Next, the stages to be executed by the GitLab runner are only declared at first and then configured in detail afterwards.

Then, we set the environment variables and create the files holding the required secrets from our GitLab CI project variables. We also set the correct file permissions for the SSH key.

Finally, we do the actual stage configurations. These are similar for every Terraform environment and/or the Terraform “global” project files. We always perform three steps:

  • terraform validate
  • terraform plan, and
  • terraform apply

The terraform init command is required to provide the AWS credentials to the Terraform backend. By default, Terraform checks the environment variables for credentials. As we have them set in the variables: section of the .gitlab-ci.yml file, Terraform will find them.

The terraform plan command creates a plan file. It is passed as an artifact to the next stage in order to make sure that only the changes shown in the output of the planning stage are actually applied by terraform apply in the next stage.

At this point, we already have a fully automated pipeline!

Pipeline Improvements

However, we can improve the pipeline to optimize execution times and add some security mechanisms, so let’s have a look at the following configuration snippet:

  allow_failure: false
  only:
    changes:
      - environments/dev/**/*
      - modules/**/*

If we appended this to the stages, we would add the following features:

  • If a single stage fails, the whole pipeline will fail too. This is achieved by allow_failure: false.
  • GitLab CI allows one to limit the execution of stages to changes in specific files or directories (/**/* notation). In the example above, the stages would only be run if files in the environments/dev directory changed in a commit, or if any of our Terraform modules were added, updated, deleted, etc.

This improves the runtime of the pipeline, because not every stage is executed in every pipeline run. To add a security layer, we updated our apply stages with this:

  only:
    refs:
      - master
    changes:
      - environments/dev/**/*
      - modules/**/*
  when: manual

The refs: master restriction ensures that changes are only applied when they are merged into the master branch. Combined with enforcing a certain number of required merge request approvals, we ensure that changes are reviewed before being applied. A nice side effect is that upcoming changes to the infrastructure can be seen directly in the logs of the planning stage(s), so a review of the Terraform code itself is not required.

The Final Pipeline

Finally, we ended up with a different combination of all these pipeline features for the different stages, which brings us to our final pipeline configuration (the staging stages are omitted, as they are identical to dev):

image:
  name: rflume/terraform-aws-ansible:latest

stages:
  # 'global' stages
  - validate global
  - plan global
  - apply global
  # Dev env stages
  - validate dev
  - plan dev
  - apply dev
  [... STAGING ...]
  # Prod env stages
  - validate prod
  - plan prod
  - apply prod

variables:
  AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY

# create files w/ required secrets (so that they're not stored in the docker image!)
before_script:
  - echo "$ID_RSA_TERRAFORM" > /root/.ssh/id_rsa_terraform
  - chmod 600 /root/.ssh/id_rsa_terraform
  - echo "$ANSIBLE_VAULT_PASS" > /etc/ansible/vault_password_file
  - echo "$ANSIBLE_BECOME_PASS" > /etc/ansible/become_pass

# Global
# ------
validate:global:
  stage: validate global
  script:
    - cd global
    - terraform init
    - terraform validate
  only:
    changes:
      - global/**/*
      # no modules are included in 'global', so we do not need '- modules/**/*' here

plan:global:
  stage: plan global
  script:
    - cd global
    - terraform init
    - terraform plan -out "planfile_global"
  artifacts:
    paths:
      - global/planfile_global
  only:
    changes:
      - global/**/*

apply:global:
  stage: apply global
  script:
    - cd global
    - terraform init
    - terraform apply -input=false "planfile_global"
  dependencies:
    - plan:global
  when: manual
  allow_failure: false
  only:
    changes:
      - global/**/*

# DEV ENV
# -------
validate:dev:
  stage: validate dev
  script:
    - cd environments/dev
    - terraform init
    - terraform validate
  only:
    changes:
      - environments/dev/**/*
      - modules/**/*

plan:dev:
  stage: plan dev
  script:
    - cd environments/dev
    - terraform init
    - terraform plan -out "planfile_dev"
  artifacts:
    paths:
      - environments/dev/planfile_dev
  only:
    changes:
      - environments/dev/**/*
      - modules/**/*

apply:dev:
  stage: apply dev
  script:
    - cd environments/dev
    - terraform init
    - terraform apply -input=false "planfile_dev"
  dependencies:
    - plan:dev
  allow_failure: false
  only:
    refs:
      - master
    changes:
      - environments/dev/**/*
      - modules/**/*

[... STAGING ...]

# PROD ENV
# --------
validate:prod:
  stage: validate prod
  script:
    - cd environments/prod
    - terraform init
    - terraform validate
  only:
    changes:
      - environments/prod/**/*
      - modules/**/*

plan:prod:
  stage: plan prod
  script:
    - cd environments/prod
    - terraform init
    - terraform plan -out "planfile_prod"
    - echo "CHANGES WON'T BE APPLIED UNLESS MERGED INTO 'MASTER'!"
  artifacts:
    paths:
      - environments/prod/planfile_prod
  only:
    changes:
      - environments/prod/**/*
      - modules/**/*

apply:prod:
  stage: apply prod
  script:
    - cd environments/prod
    - terraform init
    - terraform apply -input=false "planfile_prod"
  dependencies:
    - plan:prod
  when: manual
  allow_failure: false
  only:
    refs:
      - master
    changes:
      - environments/prod/**/*
      - modules/**/*

Security Concerns

By creating the secrets in the Docker container only from within the pipeline, protecting them in the project settings, and using Terraform’s file() function to read them only while executing Terraform instead of passing them around as variables, the secrets are not revealed in the output of the GitLab CI pipeline logs.

However, the Ansible become_pass (the terraform user’s password on the remote hosts) is read by Terraform and then passed to the provisioner. It is therefore revealed in plaintext in the pipeline logs, and you should keep that in mind when using the provisioner.

I have created an issue regarding this concern that you can follow to be notified about updates.

Pitfalls With Automating Terraform

A topic to discuss when automating Terraform is how to handle the GitLab CI pipelines and their jobs. To be precise, there are a couple of things to keep in mind:

  1. Do not stop pipelines manually if possible. Terraform tracks the defined resources in a state file. When a pipeline is stopped, Terraform cannot shut down gracefully and a locked state is the result. You can force-unlock it, but this is a manual step you generally want to avoid.
  2. Check the changes from the planning stage! Manual job triggers are a handy feature of GitLab CI, and when it comes to Terraform automation we do not want to miss it. From time to time, unwanted (or at least unexpected) changes were scheduled to be applied, and we were happy about the chance to double-check why they were scheduled. Whether some tags were changed via the AWS CLI or an EBS volume was manually increased in size, someone always applies changes outside of Terraform. We occasionally ended up adding a parameter to the ignore_changes map of a resource in order to prevent subsequent modification or even recreation (see the sketch after this list).
  3. Do not retry a deployment job. Instead, a new pipeline should be started on the respective branch. The reason is that the job is likely to fail again with the error message "Failed to load backend: This plan was created against an older state than is current. Please create a new plan file against the latest state and try again." This error occurs when a plan file is reused after its changes were already partially applied.
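
As an illustration of the second point (the tags attribute is just an example of something changed outside of Terraform; it is not taken from our actual modules), the lifecycle block of the instance resource shown earlier could be extended like this:

resource "aws_instance" "default" {
  ...

  lifecycle {
    ignore_changes = [
      "user_data",
      "tags",  # e.g. tags that were edited manually via the AWS CLI
    ]
  }
}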

However, one thing I have learned in the past is that even the best automation processes are not perfect…

Summary

We have created a custom Docker image that is smaller than the official “full”-tagged Terraform image, but extends the “light” image with the Terraform AWS provider and radekg’s Ansible provisioner, enabling us to automate our Terraform workflow with GitLab CI pipelines.

With Terraform we defined an AWS EC2 instance that is not only created automatically, but also provisioned automatically with Ansible.

I hope I was able to provide some useful information to help you improve your Terraform workflow by including the Ansible provisioner and creating automatic pipelines in GitLab CI!
