AWS Account Vending Machines

Meeting teams where they are

Josh Armitage
5 min read · Sep 22, 2023

What is an Account Vending Machine?

The fundamental building block of an AWS Landing Zone is the account vending machine (AVM), or account factory.

A great AVM has two key qualities.

First, a delightful developer experience, where its consumers are able to rapidly request new accounts and other changes.

Second, a rapidly extendable governance framework, where controls can be easily enforced across the estate.

Some examples of governance controls include:

  • The creation of default account guardrails
  • Ensuring appropriate levels of network connectivity (e.g. sandpit or developer accounts have limited networking capabilities)
  • Enforcing that certain account tags are set

What goes into an AVM?

In its simplest form, an AVM is made of three layers.

  1. The engine which drives the interaction with AWS, e.g. Terraform or AWS Control Tower account factory
  2. The public interface for consumers, e.g. a YAML file in GitHub or ServiceNow tickets
  3. An integration layer between the interface and the engine, e.g. Python or manual changes

Having built AWS organisations that have surpassed half a thousand accounts, I’ve found the Terraform, YAML and Python approach significantly more scalable and powerful than the ServiceNow or pure Terraform based variants which pervade many enterprises today.
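
To make the three layers concrete, here is a minimal sketch of how a request could flow from the interface to the engine. It is an illustration under assumptions, not a reference implementation: the field names (name, ou, owner), the render_tfvars function and the accounts variable are all hypothetical, and the parsed YAML is shown as a plain Python dict to keep the example free of dependencies.

```python
import json

# Hypothetical parsed form of the YAML interface (layer 2).
# In practice this would come from a YAML parser; the field names are assumptions.
REQUEST = {
    "accounts": [
        {"name": "payments-dev", "ou": "workloads", "owner": "payments-team"},
        {"name": "payments-prod", "ou": "workloads", "owner": "payments-team"},
    ]
}


def render_tfvars(request: dict) -> str:
    """Integration layer (layer 3): turn the request into Terraform input variables."""
    accounts = {
        account["name"]: {"ou": account["ou"], "owner": account["owner"]}
        for account in request["accounts"]
    }
    # Sorted keys keep the generated file stable between runs, so diffs stay small.
    return json.dumps({"accounts": accounts}, indent=2, sort_keys=True)


if __name__ == "__main__":
    # The engine (layer 1) could pick this up as, say, accounts.auto.tfvars.json.
    print(render_tfvars(REQUEST))
```

Consumers only ever touch the request; the shape of the Terraform variables stays an internal detail of the platform team.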

As a benchmark, a well-oiled AVM will generally have account creation lead times of under a day: from the moment an account is requested, it is created and made available within 8 business hours.

The Issue with a ServiceNow Centric Design

When looking at the toolchain that a developer generally works within, ServiceNow isn’t on the list. The experience for developers operating in a post-DevOps world feels slow and alien. It’s a lot of manually filling in forms, and often the ticket sits waiting for approvals before dropping onto a queue for someone to action.

The long cycle time incurred by manual processing gets even worse when there are errors, or when changes are needed after the initial request.

Anecdotally, where ServiceNow is part of an AVM, you are looking at a lead time measured in weeks, not hours. This explosion in lead time drives sub-optimal architectural decisions, where accounts become unnecessarily multi-tenanted. That has significant corrosive second-order effects on operations and security, on top of the wait time imposed on product teams.

As the route to live for ServiceNow catalog items is owned by another team, it also adds inter-team coordination overhead for the AWS platform team. Updating catalog items is generally an arduous process with little to no automation, and it drains the platform team's resources every time it is needed.

Being able to meet developers where they naturally are, e.g. GitHub, GitLab or other version control systems, allows for lower-friction, more automation-friendly options.

The Issue with a Pure Terraform Approach

While Terraform is, for good reason, the dominant force in infrastructure as code, for a mature AVM it is a limiting force when enforcing certain controls, and it provides a lacklustre development experience for consumers. By exposing Terraform directly, you are coupling consumers to your implementation details, and while modules can provide some level of abstraction, it becomes hard to easily express intent.

Additionally, while the number of functions within Terraform is ever increasing, allowing us to express more complex logic, the readability of those functions is getting worse. Much like with regex, it only makes sense when you write it, never when you read it.

Having a custom DSL as your public interface allows you to provide a much better developer experience through a more consumer-friendly and robust abstraction. Updating a YAML file is something that non-technical users can potentially achieve, whereas Terraform has a much steeper learning curve. By processing the YAML file via Python you can enforce guardrails in a more maintainable and robust way, and provide contextual errors that enable consumers to fix their requests autonomously.
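
As an illustration of contextual errors, the sketch below validates a single account block and returns messages that say where the problem is and how to fix it. The allowed environments, required tags and naming pattern are assumptions made up for this example; a real AVM would define its own.

```python
import re

ALLOWED_ENVIRONMENTS = {"sandpit", "dev", "prod"}  # assumed set of environments
REQUIRED_TAGS = {"cost-centre", "owner"}           # assumed mandatory tags
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,30}$")  # assumed naming convention


def validate_account(index: int, account: dict) -> list[str]:
    """Return readable errors for one account block; an empty list means valid."""
    name = account.get("name", "")
    where = f"accounts[{index}] ({name or 'unnamed'})"
    errors = []

    if not NAME_PATTERN.match(name):
        errors.append(
            f"{where}: name must start with a lowercase letter and contain only "
            "lowercase letters, digits and hyphens (3 to 31 characters)."
        )

    environment = account.get("environment")
    if environment not in ALLOWED_ENVIRONMENTS:
        errors.append(
            f"{where}: environment {environment!r} is not supported; "
            f"choose one of {sorted(ALLOWED_ENVIRONMENTS)}."
        )

    missing = REQUIRED_TAGS - set(account.get("tags", {}))
    if missing:
        errors.append(f"{where}: add the missing tags {sorted(missing)}.")

    return errors
```

Run as a pull request check, the returned messages can be posted back to the requester, who can then correct the YAML without needing the platform team.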

As always, abstraction layers also allow you to maintain optionality in your architecture. Should you wish to swap from Terraform to Pulumi, for example, you can make the transition without having to bring all your consumers along for the journey.
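
One way to keep that optionality is to have the integration layer depend on a small engine interface rather than on a specific tool. The sketch below is hypothetical throughout: the Engine protocol, both engine classes and the file names are illustrative, and the Pulumi variant simply writes a JSON file that an assumed Pulumi program would load.

```python
import json
from typing import Protocol


class Engine(Protocol):
    """What the integration layer needs from any engine (hypothetical interface)."""

    def render(self, accounts: list[dict]) -> dict[str, str]:
        """Return a mapping of file name to file content for the engine to consume."""
        ...


class TerraformEngine:
    """Emits a tfvars JSON file for a Terraform root module (variable name assumed)."""

    def render(self, accounts: list[dict]) -> dict[str, str]:
        return {"accounts.auto.tfvars.json": json.dumps({"accounts": accounts})}


class PulumiEngine:
    """Emits a JSON file that a hypothetical Pulumi program would read at startup."""

    def render(self, accounts: list[dict]) -> dict[str, str]:
        return {"pulumi-accounts.json": json.dumps(accounts)}


def vend(accounts: list[dict], engine: Engine) -> dict[str, str]:
    """The integration layer depends only on the Engine protocol, not on a tool."""
    return engine.render(accounts)
```

Swapping engines then becomes a change to one argument of vend, while the YAML that consumers edit stays exactly the same.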

One-Way Decisions

When it comes to vending accounts in AWS, many choices are one-way decisions that get progressively more expensive to change over time. For example, account email addresses are static once initially set, updating network connectivity brings serious operational risk, and closing down accounts is either a lengthy manual process or rate limited.

Due to this, there is great value to be gained from being able to enforce certain guardrails consistently from the beginning, before corrective action becomes too expensive.

A Simple Example

Below you can see a short example of the YAML file we can present as the public interface of an AVM. To request a new account, you append a new block to the accounts list.

We’ve also brought Organisational Unit (OU) creation into the AVM. This allows us to validate that people are selecting known OUs, and to create accounts and OUs in parallel rather than introduce double handling or implicit dependencies.
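
Because OUs and accounts live in the same file, checking that an account points at a known OU is a small lookup. The sketch below assumes a parsed config with organisational_units and accounts keys (hypothetical names) and uses the standard library's difflib to suggest a likely fix for typos.

```python
from difflib import get_close_matches


def check_ous(config: dict) -> list[str]:
    """Report accounts whose OU is not declared in the same file."""
    known = set(config.get("organisational_units", []))
    errors = []
    for account in config.get("accounts", []):
        ou = account.get("ou", "")
        if ou in known:
            continue
        # Offer the closest declared OU, if any, so the requester can self-correct.
        hint = get_close_matches(ou, sorted(known), n=1)
        suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
        errors.append(
            f"Account '{account.get('name')}' uses unknown OU '{ou}'.{suggestion}"
        )
    return errors
```

An OU added in the same pull request as the account that uses it passes this check, which is what lets both be requested together.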

You can also see we’re managing user access here, where people are assigned roles directly on accounts. This is powerful because we can now constrain access across the whole organisation. For example, you can ensure that no single user has privileged access on more than a limited number of accounts. This is something that is incredibly difficult to achieve using pure Terraform.
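
A cross-account rule like this is straightforward once the whole organisation is in one data structure. The following is a minimal sketch, assuming each account block carries an access list of user and role pairs; the role names and the limit of two accounts are placeholders rather than recommendations.

```python
from collections import Counter

PRIVILEGED_ROLES = {"admin"}   # assumed names of privileged roles
MAX_PRIVILEGED_ACCOUNTS = 2    # assumed limit per user


def over_privileged_users(accounts: list[dict]) -> dict[str, int]:
    """Return users with privileged access on more accounts than the limit allows."""
    counts = Counter()
    for account in accounts:
        # A set, so a user listed twice on one account is only counted once.
        privileged = {
            grant["user"]
            for grant in account.get("access", [])
            if grant["role"] in PRIVILEGED_ROLES
        }
        counts.update(privileged)
    return {
        user: total
        for user, total in counts.items()
        if total > MAX_PRIVILEGED_ACCOUNTS
    }
```

A non-empty result can fail the pull request, naming the users involved and how many accounts they would hold privileged access on.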

Notice also how we’ve abstracted certain elements of account creation here. The necessary step of providing an account email is managed by the Python integration layer and passed into our Terraform engine.
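
As a sketch of what that abstraction might look like, the integration layer can derive a unique email per account before handing the data to the engine. The plus-addressing convention, the aws-accounts mailbox and the example.com domain below are assumptions for illustration, and plus addressing only works if your mail provider supports it.

```python
def account_email(
    account_name: str,
    mailbox: str = "aws-accounts",   # assumed shared mailbox
    domain: str = "example.com",     # placeholder domain
) -> str:
    """Derive a unique email per account using plus addressing (assumed convention)."""
    return f"{mailbox}+{account_name}@{domain}"


def with_emails(accounts: list[dict]) -> list[dict]:
    """Return copies of the account blocks with an email added; the input is untouched."""
    return [
        {**account, "email": account_email(account["name"])}
        for account in accounts
    ]
```

Since account names are already unique within the file, the derived addresses are unique too, and consumers never have to think about them.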

Enabling Consumers

With this in place, our engineers have been empowered to autonomously request accounts. Following is the README snippet which explains how to add a new account.

Governance is maintained through GitHub CodeOwners. While everyone is empowered to raise pull requests, only the team accountable for the AWS organisation as a whole is empowered to approve them.

To ensure we maintain the service level required, we use the open source Flight Controller solution to track our request-to-vend lead times. Lead time is the primary feedback loop for services with any sort of manual governance step.
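
To show the kind of measurement involved, here is a small sketch that computes a median lead time from pairs of request and vend timestamps. It is not how Flight Controller works internally; the function and the event shape are hypothetical, and it measures wall-clock time rather than business hours.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(events: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time between an account being requested and being vended.

    Each event is a (requested_at, vended_at) pair, e.g. pull request opened
    and account available. Both timestamps are assumed to share a timezone.
    """
    seconds = [
        (vended_at - requested_at).total_seconds()
        for requested_at, vended_at in events
    ]
    return timedelta(seconds=median(seconds))
```

The median is used here rather than the mean so that a single request stuck in review does not hide how the typical request is doing.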

Conclusion

In this post we’ve looked at the two most common methods of building an AWS account vending machine, and a third style which looks to address the scale challenges that are often faced.

ServiceNow centric approaches have long lead times due to an over-reliance on manual data entry and execution, nudging consumers into architectural compromise via multi-tenanting accounts.

Pure Terraform or CloudFormation approaches provide a worse experience for consumers, which puts a greater burden on the platform team. With the YAML and Python approach, the translation layer allows you to enforce guardrails in a maintainable and understandable way, which we’ll explore more in the next post.

Next Steps

In the next post we’ll look at how to build an AVM using the YAML, Python and Terraform trio described.
