Streamlining AWS Infrastructure: Part 2 — Automation with Customizations

Marco Piovesan
Published in SSENSE-TECH
5 min read · Sep 8, 2023

Welcome to Part 2 of the SSENSE journey towards streamlining our AWS Infrastructure. In Part 1, we explored the motivations behind our choice to migrate to AWS Control Tower. Now we’re diving into some of the automation that made this mass migration as painless as possible.

Our original AWS accounts grew organically, mainly out of need, and still bore the scars of failed or incomplete restructuring attempts from previous generations of DevOps. We wanted to lean heavily on security, standardization, and automation in our baseline to ensure every new AWS account was 100% ready to be deployed to.

This is where the AWS solution comes into play: Customizations for AWS Control Tower (CfCT). Not only does it enforce best practices during account creation via AWS’s own pre-defined landing zone, it also offers a way to apply organization-wide service control policies while automating resource creation at the same time. Anyone who has created a brand-new AWS account without leaning on automation to set it up knows how slow it can be to put everything in its place and get the account ready for use.

Overview of CfCT Workflow

In the diagram above, two workflows make up CfCT: one based on AWS account-level events (the AWS Control Tower lifecycle event workflow), the other on webhooks from a version control system of your choosing (the AWS CodePipeline workflow).

The Lifecycle Event Workflow: high-level, non-API events that can trigger automation, delivered via Amazon EventBridge and Amazon SQS to AWS Lambda functions. It is driven by one of nine different account lifecycle events. (We’re not covering it in this article, but you can read up on it here and here.)

The Code Pipeline Workflow: triggered from the version control system of your choice, this workflow deploys CloudFormation StackSets via AWS Step Functions to specific AWS accounts, targeted either as an explicit list of accounts or as Organizational Units (OUs).

All of this is configured in CfCT’s manifest.yaml file. For a full breakdown of the manifest file, have a look at the AWS documentation.
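To make this concrete, here is a minimal, hypothetical manifest.yaml sketch; the OU name, file path, and parameter values are illustrative, not our actual configuration:

```yaml
---
region: us-east-1        # CfCT home region
version: 2021-03-15

resources:
  # Resources are processed in the order they are listed.
  - name: account-baseline
    resource_file: templates/account-baseline.template.yaml
    parameters:
      - parameter_key: VpcCidr
        parameter_value: 10.20.0.0/16
    deploy_method: stack_set
    deployment_targets:
      organizational_units:
        - Production
    regions:
      - us-east-1
```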

Legacy Account Concerns

In what we refer to as our “Legacy” AWS accounts, there were a lot of inconsistencies and issues we wanted to address when we first set out:

  • Extra VPCs, both used and unused
  • Improper VPC IP CIDR ranges
  • Environmental crossover
  • Manually created/modified resources not done in Infrastructure as Code (IaC), nor tracked in Version Control (VC)
  • Lack of standardized networking
  • Non-standard IAM access method(s)

Our Designated AWS Account Baseline

This is what we wanted to accomplish with CfCT:

  • Minimize the human factor in creating a new AWS account
  • Have a single source of truth for all AWS account configuration
  • Centrally share resources where possible and applicable
  • Define and standardize how AWS account-specific values are fetched for DevOps and Developer use
  • Network relevant accounts together
  • Create required Terraform backend resources

The Human Factor

The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.[1]

— Tom Cargill, Bell Labs

It’s important to understand that total automation, while possible, would be fairly complex. Instead, we opted for a minimal configuration: a few account-specific tags, a CIDR range, some git push origin master --force, and we’re good to go.

Source of Truth

Keying into the human factor mentioned above, all the configuration we provide lives in a single CfCT template, which runs first. Every other template then references the values it stores in AWS Systems Manager Parameter Store (and Secrets Manager where applicable).
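As an illustration, the first-run template can publish shared values to Parameter Store for everything downstream to consume. This is a hypothetical CloudFormation snippet; the parameter path and naming convention are ours to invent:

```yaml
Parameters:
  VpcCidr:
    Type: String
    Description: CIDR range supplied per account via the CfCT manifest

Resources:
  # Publish the value once; every later template, Terraform module,
  # or Serverless deployment reads it from here.
  VpcCidrParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /org/network/vpc-cidr   # illustrative naming convention
      Type: String
      Value: !Ref VpcCidr
```

A later template can then resolve the value at deploy time with a dynamic reference such as '{{resolve:ssm:/org/network/vpc-cidr}}'.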

Central Sharing

We defined a specific bucket, in a single account, to hold all our compiled code for custom Lambda functions. This bucket grants read-only access to the entire organization, so any CfCT template that requires the Lambda code can source it from there. Beyond that, anything that can be shared via AWS Resource Access Manager (RAM) should be shared. This severely cuts down on duplicate resources.
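Here is a sketch of how such a bucket can be shared, using the aws:PrincipalOrgID condition key to grant read-only access to the whole organization; the bucket name and organization ID are placeholders:

```yaml
Resources:
  LambdaArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'org-lambda-artifacts-${AWS::AccountId}-${AWS::Region}'

  LambdaArtifactBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref LambdaArtifactBucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: OrgWideReadOnly
            Effect: Allow
            Principal: '*'
            Action: 's3:GetObject'
            Resource: !Sub '${LambdaArtifactBucket.Arn}/*'
            Condition:
              StringEquals:
                aws:PrincipalOrgID: 'o-exampleorgid'  # placeholder org ID
```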

Volatile Account Values

Automating at the account level is fine as a one-off time-saving mechanism, but resources that get reused or referenced elsewhere should be exposed in a standard way. We leaned heavily on AWS Systems Manager Parameter Store for storing (non-sensitive) information that our Terraform modules and Serverless Framework/SAM deployments reference. This significantly reduces potential misconfigurations, as values never need to be hard-coded in plain text, only fetched.
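For example, a Terraform module can fetch such a value at plan time instead of hard-coding it; a minimal sketch reusing the illustrative parameter path from above:

```hcl
# Fetch the account's VPC CIDR from Parameter Store instead of
# hard-coding it (the parameter path is illustrative).
data "aws_ssm_parameter" "vpc_cidr" {
  name = "/org/network/vpc-cidr"
}

# Use the fetched value wherever the CIDR is needed, e.g. an ingress rule.
resource "aws_security_group_rule" "internal_https" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [data.aws_ssm_parameter.vpc_cidr.value]
  security_group_id = aws_security_group.internal.id # assumed to exist elsewhere
}
```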

Automated Networking (and Segregation)

This was a must: all accounts in a similar environment are networked together automatically by way of a Transit Gateway (TGW). This required a Lambda function for the cross-account actions, and while there is an option to auto-accept Transit Gateway attachment requests, we opted not to use it. Each environment has its own dedicated TGW route table that is used for the attachment. No exceptions.
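A minimal sketch of what such a Lambda function might look like, assuming it runs in the network account, is triggered by an EventBridge event for new attachments, and receives the environment’s route table ID through an environment variable; all names and the event shape are hypothetical:

```python
import os

import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    """Accept a pending TGW attachment and bind it to the environment's
    dedicated route table. A real implementation would also wait for the
    attachment to reach the 'available' state before associating."""
    # Hypothetical EventBridge payload carrying the attachment ARN.
    attachment_id = event["detail"]["transitGatewayAttachmentArn"].split("/")[-1]

    # Accept explicitly instead of relying on auto-accept.
    ec2.accept_transit_gateway_vpc_attachment(
        TransitGatewayAttachmentId=attachment_id
    )

    # One dedicated route table per environment. No exceptions.
    ec2.associate_transit_gateway_route_table(
        TransitGatewayRouteTableId=os.environ["TGW_ROUTE_TABLE_ID"],
        TransitGatewayAttachmentId=attachment_id,
    )
```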

CloudFormationing Your Terraform

As DevOps, our standard for creating resources via IaC is Terraform. The future migration to our new AWS organization structure provided us the opportunity to further standardize and automate its configuration. We defined a per-account backend file, which lives in our Terraform repo and holds the backend state configuration (S3 bucket, DynamoDB locking table, etc.).
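Such a backend file is plain key/value HCL consumed at init time; a hypothetical example with illustrative names:

```hcl
# backends/production.s3.tfbackend (illustrative path and names)
bucket         = "tf-state-prod-123456789012"  # per-account state bucket
key            = "infrastructure/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "tf-state-lock"               # DynamoDB locking table
encrypt        = true
```

It is then selected per account with terraform init -backend-config=backends/production.s3.tfbackend.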

Caveats to Using CfCT

The deployed pipeline provides a way to automate the baseline, but we’re not out of the woods yet. As with everything in the cloud, we need to keep some things in mind.

Regionality

CfCT is regional, and starts by deploying to your home region. While this in and of itself is not a problem, once you begin deploying to a second region, global resources with hyper-specific names will cause issues: IAM resources, S3 buckets (whose names are globally unique), CloudFront distributions, and so on. Be sure to take this into account during planning.
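One simple mitigation is to bake the region (and account ID where relevant) into any name that must be unique. A hypothetical CloudFormation snippet:

```yaml
Resources:
  AutomationRole:
    Type: AWS::IAM::Role
    Properties:
      # IAM is global: a fixed RoleName collides the moment the same
      # template is deployed to a second region.
      RoleName: !Sub 'cfct-automation-${AWS::Region}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: 'sts:AssumeRole'
```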

Unsupported CloudFormation Actions

Since CfCT essentially deploys resources via CloudFormation, all of CloudFormation’s limitations are also at play here: it is region-locked, has no cross-account support, and any actions not supported directly need to be handled via custom Lambda functions (which have caveats of their own).

Ordering

The ordering of resources in the manifest is crucial, especially when referencing resources created by another template, or performing actions that require two steps (e.g., sharing Route 53 private zones, attaching/peering VPCs, or peering TGWs together). Dependent steps should either be separated into different templates or handled sequentially with custom Lambda functions.
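For instance, a template that consumes a shared resource should be listed after the template that creates it, since resources are processed in the order they appear in the manifest; names and the account ID here are illustrative:

```yaml
resources:
  # Deployed first: creates and shares the Transit Gateway.
  - name: network-hub
    resource_file: templates/network-hub.template.yaml
    deploy_method: stack_set
    deployment_targets:
      accounts:
        - "123456789012"   # placeholder network account ID
  # Deployed second: attaches each spoke VPC to the shared TGW.
  - name: network-spoke
    resource_file: templates/network-spoke.template.yaml
    deploy_method: stack_set
    deployment_targets:
      organizational_units:
        - Production
```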

SCP Sanity

Be mindful of what you allow and disallow. An overly restrictive, organization-wide service control policy can easily shoot you in the foot: the CfCT pipeline hits a disallowed permission and fails. Be sure to include any conditions necessary to keep your pipeline functional.
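As an illustration, a deny-style SCP can carve out an exemption for the role the pipeline deploys with, so the guardrail cannot break CfCT itself; the denied action and role ARN pattern are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyVpcDeletionExceptAutomation",
      "Effect": "Deny",
      "Action": "ec2:DeleteVpc",
      "Resource": "*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/AWSControlTowerExecution"
        }
      }
    }
  ]
}
```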

Conclusion

We’ve covered the pain points from our previous adventures in AWS. We’ve outlined our wants from adopting Customizations for Control Tower, and the gotchas that can cause an entire implementation to pivot if not kept in mind. Keep an eye out for Part 3, where we’ll cover planning the migration proper, cross-functional cooperation, and deciding on organizational structure.

Editorial reviews by Catherine Heim & Mario Bittencourt

Want to work with us? Click here to see all open positions at SSENSE!
