Orchestrating AWS with Ansible

Managing the entire stack

At Sailthru, using a single configuration management system across the board gives us the power of portability and reusability. By managing our cloud infrastructure with Ansible, we can reuse the processes, plays and roles that we use to manage our on-premise environments.

Modeling the environments

Before roles and playbooks, we need data.

Because of the dynamic nature of AWS and the need to manage multiple environments across multiple accounts (profiles), the data representing our environments needs to be dynamic and hierarchical. Out of the box, Ansible is not geared to support such a model; even with the EC2 dynamic inventory, we are still limited to static host and group vars for storing data.

Extending Ansible

Fortunately, Ansible is extremely flexible and can easily be extended with modules or plugins.

Our first step was to build a drop-in replacement for group and host vars that would allow us to represent the AWS environments as a hierarchical key/value store, similar to Puppet's Hiera.

The requirements:

  • Namespaced hierarchies
  • Top-down merging of hashes and lists
  • Pluggable backends
  • Multiple ways to access data: as a module, an action plugin or a lookup
  • Processing of the data through Ansible's Templar to embed lookups, variables and Jinja functions
  • Support for vaulted YAML or JSON data files

Welcome Echelon

Using Echelon we can layer our environment data:

 aws/{{ profile }}/{{ env }}/ec2-{{ region }}
 aws/{{ profile }}/{{ env }}/default

The data:

ami_id: ami-123456
security_groups: ssh-from-wan
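
To illustrate the top-down merging from the requirements list, here is a hypothetical pair of layers and their resolved result (key names beyond those shown above are made up for the example):

```yaml
# aws/account1/dev/default -- base layer (hypothetical values)
ami_id: ami-000000
security_groups: ssh-from-wan
tags:
  env: dev

# aws/account1/dev/ec2-us-east-1 -- more specific layer, merged on top
ami_id: ami-123456
tags:
  region: us-east-1

# Resolved result:
#   ami_id: ami-123456              <- overridden by the region layer
#   security_groups: ssh-from-wan   <- inherited from the default layer
#   tags: { env: dev, region: us-east-1 }   <- hashes merged, not replaced
```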

Echelon will resolve the hierarchy via the backend plugin(s) and return the key/values, which can then be accessed in playbooks or roles as

ansible-playbook site.yml --extra-vars "region=us-east-1 profile=account1 env=dev"

- debug:
    msg: "{{ aws }}"
-> { 'ami_id': 'ami-123456', 'security_groups': 'ssh-from-wan' }

or as a lookup

- debug:
    msg: "{{ lookup('echelon', 'aws.ami_id') }}"
-> ami-123456

Because Echelon has pluggable backends, we can make our data even more portable by having Echelon perform lookups against an API or a database such as Redis.
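
Swapping backends can then be a matter of configuration rather than code. A sketch of what such a configuration could look like (the option names here are illustrative, not Echelon's actual schema):

```yaml
# Hypothetical Echelon backend configuration -- key names are illustrative.
backends:
  - name: yaml_files                      # default file-based backend
    data_dir: /etc/ansible/echelon
  - name: redis                           # resolve the same hierarchy from Redis
    host: redis.internal.example.com
    port: 6379
```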

Local artifact repositories

Don’t rely on 3rd parties

It is extremely important to be in control, and that means storing all your system and application packages in private repositories local to your environment. Whether it's a Docker image or a CentOS RPM, if your system needs it to run, it should always live in a repository that you own, or at least behind a proxy you control; if a 3rd-party-hosted package suddenly becomes unavailable, you are at their mercy. As the former Netflix C.T.O. and mentor Adrian Cockcroft taught me: "You just have to build for it".
The peace of mind you get from spending the extra effort to integrate this into your infrastructure will be well worth it, not to mention the added benefit of faster artifact transfers.
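
In Ansible terms, pointing hosts at a private mirror can be as simple as managing the repo definition with the core yum_repository module (the repo name and URLs below are placeholders):

```yaml
# Point yum at a private, locally hosted mirror instead of public mirrors.
- name: Configure private CentOS mirror
  yum_repository:
    name: internal-base
    description: Internal CentOS base mirror
    baseurl: https://repo.internal.example.com/centos/$releasever/os/$basearch/
    gpgcheck: yes
    gpgkey: https://repo.internal.example.com/keys/RPM-GPG-KEY-internal
```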

The AMI build environment

Build everything ahead

Rolling custom AMIs is crucial to scaling an environment quickly. Pre-loading alleviates most on-the-fly configuration management, and instances can start performing their intended tasks right away. When working in the cloud, time is literally money.

Like all our environments, the AMI build environment is just that: another environment, represented by data that is fed to roles from a playbook.

The build process is handled entirely by Ansible and kicked off from Jenkins.
We start by firing up a reference instance using the core ec2 module and check for its availability using our ec2_instance_status_checks.py plugin. Once it is up, Ansible can SSH in and configure it to our liking.

Ansible strips any traces of cloud-init (we don't need it), installs all prerequisite components for our stacks and adds a simple rc.local script responsible for fetching the SSH key and boot script from user-data at boot.

After a successful play, Ansible creates the new AMI, registers it and removes any old AMIs (we keep the last five) using the core ec2_ami module.
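
The flow above can be sketched as a playbook; this is a simplified outline under assumed names (the role and instance parameters are hypothetical), not our actual build play:

```yaml
# Simplified sketch of the AMI build flow: launch, configure, snapshot, register.
- hosts: localhost
  connection: local
  tasks:
    - name: Launch a reference instance
      ec2:
        image: "{{ aws.base_ami_id }}"
        instance_type: m3.medium
        wait: yes
      register: ref

    - name: Add the instance to an in-memory group for configuration
      add_host:
        name: "{{ ref.instances[0].public_ip }}"
        groups: ami_build

- hosts: ami_build
  roles:
    - ami_prep   # strip cloud-init, pre-bake stack prerequisites, drop in rc.local

- hosts: localhost
  connection: local
  tasks:
    - name: Create and register the new AMI
      ec2_ami:
        instance_id: "{{ ref.instances[0].id }}"
        name: "base-{{ ansible_date_time.date }}"
        wait: yes
```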

Scaling fast and reliably

All instances are launched in autoscale groups

Using the core ec2_asg module, we launch even single instances inside an ASG. This gives us continuous protection against an instance going down.

When launching, a user-data script is generated and passed to the autoscale group; it is run by every instance that comes up. The script writes local facts that set the instance's persona, then downloads a playbook payload from S3 and launches Ansible against it. Ansible goes ahead and configures the instance based on what was set in the local facts.
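
A user-data script of this shape can be passed straight to the ASG's launch configuration; the bucket, paths and fact contents below are placeholders, not our production values:

```yaml
# Hypothetical launch configuration whose user-data writes local facts,
# pulls the playbook payload from S3 and runs Ansible locally.
- name: Create launch configuration with bootstrap user-data
  ec2_lc:
    name: "web-lc-{{ build_number }}"
    image_id: "{{ aws.ami_id }}"
    instance_type: m3.large
    user_data: |
      #!/bin/bash
      mkdir -p /etc/ansible/facts.d
      cat > /etc/ansible/facts.d/persona.fact <<EOF
      { "persona": "web", "env": "dev" }
      EOF
      aws s3 cp s3://example-bucket/payloads/web.tar.gz /tmp/payload.tar.gz
      mkdir -p /opt/payload && tar -xzf /tmp/payload.tar.gz -C /opt/payload
      ansible-playbook -c local -i localhost, /opt/payload/site.yml
```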

Because configurations are pre-packaged with the applications and most prerequisites are pre-baked during the AMI build process, this step mainly focuses on simply managing services.

By eliminating almost all launch-time configuration management, we ensure that instances scale up fast and reliably.

Everything is disposable and reproducible

One way in to the void

Once services are up, there is no need to keep managing them; doing so is a waste of time and resources, as instances are constantly scaling up and down. If we need to modify an ASG or application, we simply blue-green deploy a new ASG by updating the data in Echelon with the old and new ASG values and having Jenkins kick off Ansible.

Ansible then makes the environment look like its representation in Echelon. This data-to-environment (one-way) model allows us to quickly re-configure and re-provision any part of the stack.
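
For example, a blue-green switch might be driven by nothing more than an edit to the Echelon data; the key names here are illustrative, not our actual schema:

```yaml
# Illustrative Echelon data for a blue-green deploy: Ansible brings up the
# "new" ASG, shifts traffic once it is healthy, then drains the "old" one.
web_asg:
  old:
    name: web-asg-v41
    desired_capacity: 0      # drained after the new group passes health checks
  new:
    name: web-asg-v42
    launch_config: web-lc-v42
    desired_capacity: 6
```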
