Disposable development and QA environments with AWS ALB Host-based Routing

nir0s
Strigo Engineering
Jul 3, 2017

In this post, I will describe Strigo’s approach to disposable development/QA environments. The goal here isn’t to provide a complete tutorial (everyone’s systems are different) but rather to showcase a concept.

TL;DR: AWS ALB (Application Load Balancer) is the second generation of AWS’s ELB, and one of its cool features is that you can configure rules to route requests based on certain conditions. On top of Path-Based Routing — previously the only way to route when using an ALB — you can now use Host-Based Routing.

Host-based Routing means that the Host header (e.g. dev.strigo.io) is used to route requests to a specific target group of instances. That means we can generate rules (up to a limit) directing traffic to different groups of instances. We decided to use this, together with a simple CloudFormation template, to generate disposable environments.
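To make the idea concrete, here’s a minimal aws-cli sketch of such a rule (not our actual setup; the listener ARN, target group ARN and priority are placeholders):

# Route anything arriving with Host: make-strigo-great-again.dev.strigo.io
# to a branch-specific target group. ARNs and the priority are placeholders.
BRANCH="make-strigo-great-again"

aws elbv2 create-rule \
  --listener-arn "$LISTENER_ARN" \
  --priority 42 \
  --conditions Field=host-header,Values="${BRANCH}.dev.strigo.io" \
  --actions Type=forward,TargetGroupArn="$TARGET_GROUP_ARN"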

The Challenge

Creating such environments can be a daunting task. You need to set up an entire stack with some (if not all) of your application’s components installed and configured, deploy the relevant code, and make it resemble your production environment as much as possible. Then, you need to provide an endpoint for accessing that environment.

Strigo is a Meteor.js app with a MongoDB backend. We host our code on GitHub, use Travis-CI to build our artifacts and Ansible to configure servers and [re]deploy the artifacts. AWS is our IaaS of choice.

For the sake of maintaining a fast development and deployment cycle, we use mutable infrastructure. Basically, we redeploy on the same machines. We have several services aside from our main Meteor app, each deployed on its own. Nothing new so far.

Now let’s say I create a branch called “make-strigo-great-again”. To test my changes and have someone review them, I need an application endpoint running the new code.

We don’t like wasting time on configuring stuff manually, so we decided on the following flow:

FTW!

The Setup

Let’s see, at length, how this shenanigan is set up:

.travis.yml

...
deploy:
  - ...
  - provider: s3
    on:
      all_branches: true
      condition: "! $TRAVIS_BRANCH =~ ^(master|release)$"
    region: $AWS_DEFAULT_REGION
    bucket: strigo-deploy-dev
    upload_dir: $TRAVIS_BRANCH/strigo-app
    access_key_id: $AWS_ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY
    skip_cleanup: true
    local_dir: build/artifact

after_deploy:
  - dev/deploy-dev-stack.sh "$TRAVIS_BRANCH" "$INSTANCE_SSH_KEY" "$STRIGO_APP_VERSION" "$TRAVIS_COMMIT_MESSAGE" || travis_terminate 1

notifications:
  slack: ...

I trimmed the head of our Travis config to skip the Meteor build steps. Suffice it to say that before the after_deploy step commences, we have our artifact ready in an S3 bucket.

The after_deploy step then creates the environment.

Prerequisites

To be able to create the disposable environment we’ve set up several one-time prerequisites:

  • A load balancer with the relevant listeners, into which we’ll inject the rules. We have an always-existing ALB which, by default, forwards requests to our master environment.
ALB Listeners defaulting to dev (our “master” dev branch)
  • An always-existing master Mongo instance we can access to dump the current state, which we then restore on the provisioned instance.
  • A set of environment variables set up in Travis as part of the repo settings: AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and INSTANCE_SSH_KEY. INSTANCE_SSH_KEY is a ready-made AWS key pair that Ansible will use to connect to the provisioned instance (via an SSH bastion, by the way, since we create the instances in a private subnet of the VPC).
  • A deploy key set up in Travis and GitHub so that Travis can clone the repo containing our Ansible playbooks. You may keep your application in the same repo as your Ansible playbooks, in which case this isn’t necessary.
  • A DNS wildcard CNAME record pointing *.dev.strigo.io at the dev ALB we created (a rough sketch of the record change follows this list).
  • Lastly, a dedicated S3 bucket for the dev artifacts must obviously exist.
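For the wildcard record, something along these lines would do; this is a hedged sketch using aws-cli and Route 53, and the hosted zone ID and ALB DNS name are placeholders:

# One-time wildcard record pointing *.dev.strigo.io at the dev ALB.
# HOSTED_ZONE_ID and the ALB DNS name below are placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "*.dev.strigo.io",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "dev-alb-123456789.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }'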

Provisioning the Environment

Our provisioning script is written in bash. Yes, bash. I know, I should rewrite it in Python. Leave me alone. (or comment if you see anything that’s not on par with Google’s Shell Style Guide.)

Since it is pretty long, I created a Gist for it. Basically, we use aws-cli and a CloudFormation template to deploy the stack and retrieve the deployed instance’s IP. The IP is then propagated to a dev-specific hosts file for Ansible to use. The template not only creates the instance, as previously mentioned, but also the relevant Listener Rules, which is where the magic happens. We create the rule based on the branch name (passed to the deploy script by Travis), so the rules end up looking something like this:

ALB Listener Rules
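Stripped down, the stack-creation part of that script does roughly the following (simplified; the stack, parameter and output names here are illustrative, not the real ones):

# Create (or update) the branch's stack from the CloudFormation template.
BRANCH="$1"
STACK_NAME="dev-${BRANCH}"

aws cloudformation deploy \
  --stack-name "$STACK_NAME" \
  --template-file dev/dev-stack.yaml \
  --parameter-overrides "BranchName=${BRANCH}" "KeyName=${INSTANCE_SSH_KEY}"

# Pull the provisioned instance's private IP out of the stack outputs...
INSTANCE_IP=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query "Stacks[0].Outputs[?OutputKey=='InstancePrivateIp'].OutputValue" \
  --output text)

# ...and write the dev-specific Ansible hosts file for the next step.
cat > "hosts-${BRANCH}" <<EOF
[strigo-instance]
${INSTANCE_IP}
EOF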

Since we also create the instance, we can create a Target Group and put the instance in it, then assign the branch-name-based rule to the newly created Target Group and BAM, juice! We can now go to make-strigo-great-again.dev.strigo.io and… get a 503. What an achievement. Thankfully, once Ansible (slowly) finishes the deployment process, we’ll get to the promised land.
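In our case the Target Group and its members are created by the template as well, but the equivalent aws-cli calls would look roughly like this (the VPC ID, instance ID and port are placeholders):

# Branch-specific target group, with the freshly created instance in it.
TARGET_GROUP_ARN=$(aws elbv2 create-target-group \
  --name "dev-${BRANCH}" \
  --protocol HTTP --port 80 \
  --vpc-id "$VPC_ID" \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

aws elbv2 register-targets \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --targets Id="$INSTANCE_ID"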

Note: The annoying thing is that when creating a new listener rule you have to give it a specific priority; one won’t be assigned automatically. So the script extracts the current highest priority and adds 1 for the new rule.
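The priority extraction boils down to something along these lines (the listener ARN is, again, a placeholder):

# Find the highest numeric priority on the listener ("default" is filtered
# out, since the default rule's priority isn't a number) and add one.
NEXT_PRIORITY=$(aws elbv2 describe-rules \
  --listener-arn "$LISTENER_ARN" \
  --query 'Rules[].Priority' --output text \
  | tr '\t' '\n' | grep -v default | sort -n | tail -1)
NEXT_PRIORITY=$((NEXT_PRIORITY + 1))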

Targeting our Instance

By default, we use a dynamic inventory generated from a terraform.tfstate file. But since we want to deploy everything onto the same instance, we pass an env-specific hosts file when provisioning the environment and pass strigo-instance as the host to deploy on for all playbooks.

We do that by using a placeholder for the instance and by using a variable to set the hosts for all playbooks.

The provisioning script will simply run ansible-playbook ... -e 'hosts=strigo-instance' -i hosts_file_with_cloudformation_host ...

Other than installing the required components and configuring the servers, we also replicate our development database. This allows us to create a truly disposable environment, as any change to collections made on the specific branch is contained within that environment. Once a development branch is merged, the database schema changes take place in the master development database as well.

The process of creating the branch-specific database takes place only once, when the environment is created (TL;DR: we check whether a lock file exists; if it doesn’t, we dump and then write a lock file. The same goes for the restore.)

- stat: path=/tmp/mongodump.lock
  register: mongodump_lock
  tags:
    - deploy

- name: mongodump {{ mongo_source_endpoints }}/strigo to {{ mongo_tmp_dump_path }}
  command: "mongodump --host {{ mongo_source_endpoints }} --db strigo --out {{ mongo_tmp_dump_path }}"
  when: mongodump_lock.stat.exists == False
  tags:
    - deploy

- name: Write mongodump lock file
  become: true
  file: path=/tmp/mongodump.lock state=touch owner=root group=root
  when: mongodump_lock.stat.exists == False
  tags:
    - deploy

- stat: path=/tmp/mongorestore.lock
  register: mongorestore_lock
  tags:
    - deploy

- name: mongorestore {{ mongo_tmp_dump_path }}
  command: mongorestore --db strigo {{ mongo_tmp_dump_path }}/strigo
  when: mongorestore_lock.stat.exists == False
  tags:
    - deploy

- name: Write mongorestore lock file
  become: true
  file: path=/tmp/mongorestore.lock state=touch owner=root group=root
  when: mongorestore_lock.stat.exists == False
  tags:
    - deploy

We made certain optimizations by caching directories and putting tags on certain areas of our playbooks so that we don’t redo things that already happened when the environment was created. Additionally, some playbooks run common roles that can be skipped when other playbooks run (remember, we’re running everything on a single machine).
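As an illustration, a redeploy run ends up looking something like this; the playbook name and the “common” tag are made up for the example, while the deploy tag is the one used in the tasks above:

# Rerun only the deploy-tagged tasks and skip roles that already ran when
# the environment was first created ("common" is a made-up tag here).
ansible-playbook strigo-app.yml \
  -i "hosts-${BRANCH}" \
  -e 'hosts=strigo-instance' \
  --tags deploy \
  --skip-tags common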

The full workflow

Note that the last step (the dotted line) means we will continue to deploy, and deploy, and deploy until we’re done.

Environment creation is idempotent. If a stack for the branch already exists, Travis will skip the creation phase and simply run the Ansible playbooks to redeploy the code. If no stack exists at all (i.e. no special commit was pushed), Travis will do nothing but build the project. This allows developers to keep pushing code to the same environment without worrying about it until they want to destroy it.
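A hedged sketch of that check, assuming the same placeholder stack name as before:

# If the branch's stack already exists, skip creation and go straight to
# the Ansible redeploy; otherwise create it first.
if aws cloudformation describe-stacks --stack-name "$STACK_NAME" > /dev/null 2>&1; then
  echo "Stack ${STACK_NAME} already exists; skipping creation and redeploying."
else
  aws cloudformation deploy \
    --stack-name "$STACK_NAME" \
    --template-file dev/dev-stack.yaml \
    --parameter-overrides "BranchName=${BRANCH}"
fi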

Strigo is now great!

While this process has certainly made life easier for us, we need to mind the potential consequences: these environments need to be kept in check. Without properly monitoring the process, we might end up running 20 (or 200) redundant instances if we don’t make sure to destroy them when we’re done.

EDIT: Since writing this, we’ve replaced Travis with CircleCI to speed up the creation and deployment process, thanks to its fine-grained, layered caching. We’ve also enabled Ansible pipelining. Together, these improved build times by more than 50%.

EDIT: Destroying the environment requires developers to push a special commit. Since this is prone to error and forgetfulness, we’re working on a Lambda-backed GitHub hook to destroy an environment post-merge.

And like that… poof… he’s gone.
