Dive deep on our AWS landing zone: Architecture, Decisions made, Lessons learnt — Part 1

Nicolas Malaval
14 min read · Oct 3, 2023


As a former Professional Services Consultant and Solutions Architect at AWS, I’ve been involved in many projects and discussions on landing zones, particularly with customers looking to scale and structure their cloud adoption (see the stages of adoption).

In my current role at Biogen Digital Health, I’ve been able to put a lot of theory into practice while building our enterprise-grade landing zone end to end. This took us about half a year, and I thought it might be useful to share our journey and outcomes with the community.

Welcome to this series of three long stories in which I describe the architecture in detail, and explain the decisions we made and the lessons we learned.

Context

Before diving deep into the technical details, a brief introduction to our organization and objectives might help to understand some of the choices we’ve made.

Teams involved

Biogen Digital Health (BDH) is a global unit of Biogen that builds and operates digital health solutions running on AWS. Its size makes BDH more comparable to a start-up than a large company.

BDH has its own “IT” and application teams, and its own AWS organization. The teams at BDH involved with AWS on a daily basis are:

  • The “application teams”, each in charge of an application, including the deployment and management of the AWS resources related to it. There is usually one team per application.
  • The “cloud team” in charge of the AWS foundations of BDH. You could think of it as a CCoE (cloud center of excellence) with only a couple of individuals. Its very small size requires us to empower the application teams and to minimize operations. I lead this cloud team.

Key principles

Here are some of the principles that guided us in setting up our landing zone on AWS:

  1. Minimize the dependency of application teams on the cloud team: Application teams should be able to manage AWS resources with as little friction as possible. I’ve seen cloud management platforms in the past where application teams couldn’t create IAM roles without going through IT. That is a typical example of what we want to avoid.
  2. Maintain security and compliance with minimum human intervention: The above principle must not lead to anarchy! We intend to provide as much autonomy as possible to application teams, but still comply with AWS best practices and some industry standards to keep our applications secure. I’ll describe further the detective and preventive guardrails that we’ve implemented.
  3. Keep application teams accountable for their security and costs: Although the cloud team strives to maintain compliance, the application teams still have a key role to play and they are expected to resolve security issues and to control AWS costs. This is facilitated by a common and short line of management between the cloud team and application teams. However, the cloud team supports the application teams in this task by configuring essential resources and providing actionable reports and tools.
  4. Keep it simple but be ready to scale: Our team is small, and we don’t need to be prepared to create dozens of accounts a day. But the architecture we’ve chosen today shouldn’t prevent us from scaling up tomorrow. For example, the cloud team could delegate the development of security controls without handing over all the keys of the landing zone.
  5. Use as many native AWS services as possible: Some third-party solutions could have saved us time, but we initially preferred to use AWS services to: 1/ make the most of our AWS Enterprise Support plan, 2/ avoid qualification and contracting efforts (even if the AWS Marketplace could have helped) and 3/ satisfy our curiosity for AWS technology :-)
  6. Empower application teams: We want application teams to make the most of the landing zone. As one example, we’ve created a “BDH AWS User Guide” that answers the main questions application teams may have on how to use AWS in the context of BDH.

Overall approach

Why not AWS Control Tower

We decided not to use AWS Control Tower and instead to build our landing zone from scratch. Here are three main reasons:

  1. The AWS Control Tower baseline covers just a few services with little customization possible: IAM Identity Center, CloudTrail, Config, some Config rules, some SCPs in AWS Organizations, and optionally a standalone VPC. I believe that a landing zone should go far beyond these services.
  2. If we deploy services that are not part of the baseline today, like GuardDuty or Security Hub, and they are added to the baseline in the future, it is unclear whether we could update AWS Control Tower to the latest version without the resources deployed outside of Control Tower interfering.
  3. AWS Control Tower supports account customization (AFC or AFT) but the architecture is complex and there are limitations, such as not being able to use GitLab as a code repository for Terraform templates and Python scripts.

So, because we had to build a process to deploy the services not currently supported by Control Tower across multiple accounts and regions, we decided to use that same process for everything.

Tools

Organization and accounts

Account structure

Our AWS organization consists of the following accounts:

Illustration of the accounts in our AWS organization

Management account (a.k.a. Master account): This account is used as little as possible because preventive guardrails (Organizations SCPs) don’t apply to it, which implies a greater risk of corrupting security resources. We use it to configure consolidated billing, AWS Organizations and IAM Identity Center. Note that IAM Identity Center administration can now be delegated to another account.

Core accounts: These accounts are dedicated to the landing zone and are managed by the cloud team. It is always difficult to define how many accounts are needed, and there are many opinions on the subject. We finally ended up with:

  • Security account: 1/ Used as a “gateway” (we’ll see more about it later) to deploy resources used for security and compliance to all accounts. 2/ Consolidates the findings of AWS security services (GuardDuty, Security Hub…) from all accounts.
  • Logs & Keys account: 1/ Used as a “gateway” to deploy KMS keys to all accounts. 2/ Consolidates the security logs from all accounts.
  • Infra account: 1/ Contains transversal network resources (Transit Gateway, centralized egress VPC…). 2/ Used as a “gateway” to deploy VPC resources to all accounts.
  • Sandbox account: Used to test new infrastructure-as-code templates and Organization SCPs before they are deployed everywhere.

At BDH, our main motivation for splitting core accounts was to be able to separate important roles if we want to delegate the management of certain resources to sub-groups. For example, we can provide access to the Infra account only, such that “Infra administrators” can manage network resources but can’t tamper with security resources and logs.

Application accounts: We separate accounts per application team and per environment (at least production and non-production). This allows us to:

  • Reduce the blast radius: What an application team does in a non-production account should not impact the production account and other application teams.
  • Avoid complex IAM policies: It is difficult, if not impossible, to strictly segregate teams within the same account. More and more AWS API actions support tag-based permissions (see the ABAC column in AWS services that work with IAM) but it is still insufficient. In addition, application teams sharing the same account would have to use permission boundaries so that they cannot elevate their permissions when creating IAM roles.
  • Simplify billing: One application per account makes it possible to calculate the cloud costs of each application without having to enforce proper tagging.
  • If the same application team manages two applications, we still split them across separate accounts: it will be easier to transfer an application to another team or external partner if needed than to move its resources to another account at a later date.

Account root and alternate contacts

AWS account root users are managed by the cloud team. AWS requires the email address of AWS account root users to be unique. Therefore, we use aliases (email+alias@domain.com) instead of creating one email address per account. The associated mailbox can only be accessed by the cloud team.

A virtual MFA is enabled manually in all accounts for the account root user. The MFA seed is stored in our password management solution. The password of account root users is reset every time we need to use it, and never stored. So far, the only reason we’ve needed to access the account root user was to rename some accounts.

Because AWS may send security and operational notifications by email, we configure alternate contacts in all accounts, using one email distribution list per application team that application team members can read.
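
To illustrate, here is a minimal boto3 sketch of setting alternate contacts with the Account Management API, assuming trusted access for that service is enabled in the organization; the account ID, names and email address below are placeholders, not our actual values:

```python
import boto3

# Assumed: run from the management account (or a delegated administrator
# for the Account Management API), with placeholder values.
account_client = boto3.client("account")

for contact_type in ("SECURITY", "OPERATIONS", "BILLING"):
    account_client.put_alternate_contact(
        AccountId="111122223333",                   # member account (placeholder)
        AlternateContactType=contact_type,
        Name="App Team A",                          # placeholder
        Title="Application Team",
        EmailAddress="aws-app-team-a@example.com",  # team distribution list (placeholder)
        PhoneNumber="+1-555-0100",
    )
```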

Organizational units

Our organizational units are aligned with the need for different SCPs (Service Control Policies): core, production, non-production and sandbox. We currently don’t use other types of organizational policies. We’ll come back to SCPs later.
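
For illustration only (our actual SCPs are covered later in this series), here is a hedged boto3 sketch of creating an OU and attaching an SCP to it; the OU name, policy name and policy content are placeholders:

```python
import json
import boto3

org = boto3.client("organizations")

# Placeholder policy: deny member accounts from leaving the organization.
scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Deny", "Action": "organizations:LeaveOrganization", "Resource": "*"}
    ],
}

# Create an OU under the organization root (name is a placeholder).
root_id = org.list_roots()["Roots"][0]["Id"]
ou = org.create_organizational_unit(ParentId=root_id, Name="non-production")

# Create the SCP and attach it to the OU.
policy = org.create_policy(
    Name="deny-leave-organization",  # placeholder name
    Description="Example guardrail",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId=ou["OrganizationalUnit"]["Id"],
)
```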

IAM for AWS

We use the following capabilities to provide access to AWS Management Console and APIs:

Illustration of the IAM capabilities for access to AWS

IAM Identity Center (formerly AWS SSO): We use AWS SSO for human access, with user email addresses as usernames.

The primary benefits of AWS SSO are: 1/ it is fully managed and has a built-in identity store, 2/ it can generate temporary credentials for use with the AWS CLI and SDKs which removes the need for long-term access keys, 3/ it can be used to authenticate custom SAML applications. We used the latter benefit to restrict access to “internal” services exposed on the Internet: using Application Load Balancer authentication and Cognito as a SAML-to-OpenID converter, only AWS SSO users are allowed to access these services.

The main disadvantage is that the built-in directory of AWS SSO must be managed centrally and cannot be partially delegated (resource-based and tag-based policies are not supported, it is all or nothing). Therefore, only trusted individuals should have access to it, because it can be used to grant any permission on any account. Note that AWS SSO now supports delegated administration, but still doesn’t allow you to restrict the accounts and permissions a user can grant outside of the management account.

IAM roles: Teams are encouraged to use IAM roles to grant permissions to AWS services and for cross-account access (no access keys in code…).

The cloud team also creates IAM roles in all accounts that are used from “gateway” accounts to manage landing zone resources. For example, each account has a role protected-LogsKeysAdmin that can be assumed from the account Logs & Keys, by human users logged in via AWS SSO or by automated systems, to manage KMS keys in each account. We’ll see later how to ensure that only this role can manage KMS keys.
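
As a sketch of the pattern, here is how a script running in the Logs & Keys account could assume that role in a member account (the account ID below is a placeholder) and manage its KMS keys:

```python
import boto3

MEMBER_ACCOUNT_ID = "111122223333"  # placeholder
ROLE_NAME = "protected-LogsKeysAdmin"

# Assume the cross-account role from the Logs & Keys account.
sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn=f"arn:aws:iam::{MEMBER_ACCOUNT_ID}:role/{ROLE_NAME}",
    RoleSessionName="logs-keys-admin",
)["Credentials"]

# Use the temporary credentials to manage KMS keys in the member account.
kms = boto3.client(
    "kms",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print([key["KeyId"] for key in kms.list_keys()["Keys"]])
```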

IAM users: We don’t use IAM users because MFA cannot be enforced on programmatic access and long-term access keys can lead to security breaches. Therefore, application teams cannot create IAM users.

However, if an external system has no other choice but to use an IAM user, the cloud team can create an IAM user as an exception and the application teams can manage its access keys and permissions. Access keys are automatically disabled if they are older than 6 months, and the actions with wildcard (e.g. * or ec2:*) are automatically removed to enforce the least-privilege principle (see how in the “Detective guardrails” section of this series).
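
Our actual mechanism is described in the “Detective guardrails” section, but as a minimal sketch of the idea, here is how access keys older than roughly 6 months could be deactivated with boto3:

```python
from datetime import datetime, timedelta, timezone

import boto3

MAX_AGE = timedelta(days=180)  # roughly 6 months
iam = boto3.client("iam")
now = datetime.now(timezone.utc)

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        for key in iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]:
            if key["Status"] == "Active" and now - key["CreateDate"] > MAX_AGE:
                # Deactivate (rather than delete) keys older than the threshold.
                iam.update_access_key(
                    UserName=user["UserName"],
                    AccessKeyId=key["AccessKeyId"],
                    Status="Inactive",
                )
```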

Security services and logging

Here are the security services that we enabled and how they were configured. We use AWS Orga Deployer to deploy these resources from the Security account by assuming the IAM role protected-SecurityAdmin in all accounts.

CloudTrail

We created two multi-region trails in each account to record management events in an S3 bucket in the Logs & Keys account: one trail records “read” events, the other records “write” events. Having two trails allows two different prefixes in S3 and therefore two different lifecycle policies, so that “write” events can be retained longer than “read” events.
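
Here is a hedged boto3 sketch of this setup; the bucket name and prefixes are placeholders, and the bucket policy in the Logs & Keys account must already allow CloudTrail to deliver logs:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")
LOG_BUCKET = "bdh-cloudtrail-logs"  # placeholder: bucket in the Logs & Keys account

for name, prefix, read_write_type in (
    ("management-read", "cloudtrail/read", "ReadOnly"),
    ("management-write", "cloudtrail/write", "WriteOnly"),
):
    cloudtrail.create_trail(
        Name=name,
        S3BucketName=LOG_BUCKET,
        S3KeyPrefix=prefix,  # distinct prefixes allow distinct lifecycle policies
        IsMultiRegionTrail=True,
    )
    cloudtrail.put_event_selectors(
        TrailName=name,
        EventSelectors=[{
            "ReadWriteType": read_write_type,
            "IncludeManagementEvents": True,
            "DataResources": [],  # management events only
        }],
    )
    cloudtrail.start_logging(Name=name)
```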

We haven’t used an organization trail because, at the time we set up the landing zone, CloudTrail didn’t support delegated administrators and we wanted all organizational security resources to be in the Security account.

These logs in S3 are of no use on a day-to-day basis. However, they could be useful in the future if we want to ingest the history into a third-party solution.

CloudTrail Lake

We wanted to give application teams an easy way to search CloudTrail logs, more advanced than the native Event history, particularly for debugging purposes or to understand the origin of unexpected costs.

It wasn’t easy to provide cross-account access to the S3 bucket in the Logs & Keys account and a SQL interface with Athena, and we could have faced limitations on the size of the bucket policy as the number of AWS accounts increases.

Therefore, we created a multi-region CloudTrail Lake event data store in each account, that application teams can query. Since then, AWS has released Amazon Security Lake, which I believe is now a preferred alternative.
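
For example, an application team could use a CloudTrail Lake query like the one below to find recent delete operations; the event data store ID and the date are placeholders:

```python
import time

import boto3

cloudtrail = boto3.client("cloudtrail")
EVENT_DATA_STORE_ID = "EXAMPLE-1234-5678"  # placeholder

# Who performed "Delete*" API calls since a given date (placeholder)?
query_id = cloudtrail.start_query(
    QueryStatement=f"""
        SELECT eventTime, userIdentity.arn, eventSource, eventName
        FROM {EVENT_DATA_STORE_ID}
        WHERE eventName LIKE 'Delete%'
          AND eventTime > '2023-09-01 00:00:00'
        ORDER BY eventTime DESC
    """
)["QueryId"]

# Poll until the query finishes, then print the results.
while cloudtrail.describe_query(QueryId=query_id)["QueryStatus"] in ("QUEUED", "RUNNING"):
    time.sleep(2)
for row in cloudtrail.get_query_results(QueryId=query_id)["QueryResultRows"]:
    print(row)
```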

Whatever the service used, it’s important to promote this log search capability to application teams, to avoid paying for a capability that is barely used.

Config

We enabled Config in all accounts and all regions. We configured it to record all supported resource types and to deliver the configuration history and snapshots to an S3 bucket in the Logs & Keys account. I will talk about Config rules later in the “Detective guardrails” section.

We created one Config aggregator in the Security account that aggregates resources and rules for all accounts and regions. We also created one Config aggregator in each account, in one region, that consolidates resources and rules for that account, making it easier for application teams to find resources.
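
Here is a sketch of both kinds of aggregators with boto3, assuming an IAM role authorized for organization-wide aggregation already exists; the names, region, role ARN and account IDs are placeholders:

```python
import boto3

# 1) In the Security account: one aggregator for all accounts and regions
#    of the organization (requires a role authorized for Config organization
#    aggregation; the role ARN below is a placeholder).
security_config = boto3.client("config", region_name="eu-west-1")
security_config.put_configuration_aggregator(
    ConfigurationAggregatorName="organization-aggregator",
    OrganizationAggregationSource={
        "RoleArn": "arn:aws:iam::444455556666:role/config-org-aggregator",  # placeholder
        "AllAwsRegions": True,
    },
)

# 2) In each member account: one aggregator consolidating that account's
#    own resources and rules across all regions.
member_config = boto3.client("config", region_name="eu-west-1")
member_config.put_configuration_aggregator(
    ConfigurationAggregatorName="account-aggregator",
    AccountAggregationSources=[{
        "AccountIds": ["111122223333"],  # placeholder: the account's own ID
        "AllAwsRegions": True,
    }],
)
```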

Since we set up the landing zone, AWS has added support for many resource types. While it is a good thing to track as many resources as possible, beware of costs that may increase as the price depends on configuration item changes. Note that Config now supports excluding certain resource types.

GuardDuty

We enabled GuardDuty in all accounts and all regions. We designated the Security account as a delegated administrator to aggregate findings and configure suppression rules centrally (but still on a per-region basis).

GuardDuty integrates with AWS Organizations. Member accounts are not allowed to resolve findings or edit suppression rules. As a result, application teams can only view their findings, not manage them. We also prevent application teams from generating sample findings to avoid “noise”.

Malware protection is enabled everywhere. However, to keep GuardDuty costs reasonable, we’ve enabled S3 and Kubernetes protection in production accounts only. Natively, it is not possible to define which protections to enable per OU. Therefore, we use AWS Orga Deployer to enable or disable protections on a per-account and per-region basis. We have not yet evaluated the latest protections (RDS, Lambda…).
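
As an illustration of this per-account approach (using the current GuardDuty Features API, which may differ from what we used at the time), here is a boto3 sketch run from the delegated administrator in one region; the account IDs are placeholders:

```python
import boto3

# Run in the Security account (GuardDuty delegated administrator), per region.
guardduty = boto3.client("guardduty", region_name="eu-west-1")
detector_id = guardduty.list_detectors()["DetectorIds"][0]

PRODUCTION_ACCOUNTS = ["111122223333"]      # placeholder
NON_PRODUCTION_ACCOUNTS = ["444455556666"]  # placeholder

# Enable S3 and Kubernetes (EKS audit log) protection for production
# accounts only; other features are managed separately.
guardduty.update_member_detectors(
    DetectorId=detector_id,
    AccountIds=PRODUCTION_ACCOUNTS,
    Features=[
        {"Name": "S3_DATA_EVENTS", "Status": "ENABLED"},
        {"Name": "EKS_AUDIT_LOGS", "Status": "ENABLED"},
    ],
)
guardduty.update_member_detectors(
    DetectorId=detector_id,
    AccountIds=NON_PRODUCTION_ACCOUNTS,
    Features=[
        {"Name": "S3_DATA_EVENTS", "Status": "DISABLED"},
        {"Name": "EKS_AUDIT_LOGS", "Status": "DISABLED"},
    ],
)
```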

Detective

We initially enabled Detective in all accounts and regions, to troubleshoot GuardDuty findings. We finally disabled it because it was too expensive for the few times we used it.

IAM Access Analyzer

We created one organization analyzer per region in the Security account. Using an organization analyzer allows us to define the organization as the zone of trust, and avoids findings for cross-account IAM roles within the organization.

We have noticed that IAM Access Analyzer still generates a lot of false-positive findings. Here are a few examples of resources wrongly considered as “external”: IAM roles managed by AWS SSO, federated roles for EKS, S3 buckets granting access to CloudFront Origin Access Identities, etc. It would be great if AWS improved its algorithm... Meanwhile, we maintain archive rules in IAM Access Analyzer in all applicable regions.
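
Here is a sketch of the organization analyzer and one archive rule with boto3; the rule below, which archives findings for roles managed by IAM Identity Center, is only an illustrative example and the names are placeholders:

```python
import boto3

# Run in the Security account, once per region.
analyzer = boto3.client("accessanalyzer", region_name="eu-west-1")

analyzer.create_analyzer(
    analyzerName="organization-analyzer",  # placeholder name
    type="ORGANIZATION",
)

# Illustrative archive rule: auto-archive findings whose resource is an
# IAM role managed by IAM Identity Center (reserved path aws-reserved/sso.amazonaws.com).
analyzer.create_archive_rule(
    analyzerName="organization-analyzer",
    ruleName="archive-sso-roles",
    filter={
        "resourceType": {"eq": ["AWS::IAM::Role"]},
        "resource": {"contains": ["aws-reserved/sso.amazonaws.com"]},
    },
)
```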

Patch Manager

We use Patch Manager to track missing security patches in EC2 instances managed by application teams. Application teams are still responsible for patching.

In each account and each region, we created custom patch baselines to specify which patches to check for each operating system, and we configured an association in State Manager to scan all EC2 instances every night. We also set up an association to update the SSM agent every week.
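
As a sketch with boto3, using the AWS-provided documents AWS-RunPatchBaseline and AWS-UpdateSSMAgent (the schedules and association names are examples, not necessarily our actual values):

```python
import boto3

ssm = boto3.client("ssm", region_name="eu-west-1")

# Nightly scan of all managed EC2 instances against the patch baseline.
ssm.create_association(
    AssociationName="nightly-patch-scan",
    Name="AWS-RunPatchBaseline",
    Parameters={"Operation": ["Scan"]},
    Targets=[{"Key": "InstanceIds", "Values": ["*"]}],
    ScheduleExpression="cron(0 2 ? * * *)",   # every night at 02:00 UTC
)

# Weekly update of the SSM agent on all managed instances.
ssm.create_association(
    AssociationName="weekly-ssm-agent-update",
    Name="AWS-UpdateSSMAgent",
    Targets=[{"Key": "InstanceIds", "Values": ["*"]}],
    ScheduleExpression="cron(0 3 ? * SUN *)",  # every Sunday at 03:00 UTC
)
```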

We have observed that Patch Manager frequently reports missing patches as “critical” or “important”. Even if it’s not easy, this level of criticality should be harmonized with the criticality of findings generated by other services such as GuardDuty, to help application teams prioritize remediation.

Security Hub

We use Security Hub to 1/ implement controls from market-proven security standards, 2/ aggregate findings from all security services, and 3/ provide teams with a single pane of glass for security findings.

We enabled Security Hub in all accounts and all regions. We enabled cross-region aggregation in Ireland, and designated the Security account as the delegated administrator.

We enabled the standards CIS AWS Foundations Benchmark and AWS Foundational Security Best Practices. Our objective is that no findings should appear if no action is expected from application teams. Therefore, we disabled the controls that:

  1. Duplicate other controls (e.g. “Ensure VPC flow logging is enabled in all VPCs” is covered by both standards);
  2. Are irrelevant in our context (e.g. “Hardware MFA should be enabled for the root user” or “CloudTrail trails should be integrated with Amazon CloudWatch Logs”);
  3. For which we’ve not yet defined precise, actionable remediation instructions for application teams (e.g. “API Gateway REST API stages should have AWS X-Ray tracing enabled”: we don’t force application teams to use X-Ray);
  4. Can never be non-compliant by design, or are covered by a detective guardrail with automatic remediation (see details in the next story). For example, the cloud team enables S3 Block Public Access at the account level, so the control “[S3.2] S3 buckets should prohibit public read access” can never be non-compliant. Likewise, “[S3.14] S3 buckets should use versioning” should never be non-compliant because we’ve implemented a detective guardrail that enables versioning on all buckets. While having controls that can never be non-compliant is not a concern in itself, disabling them reduces the cost of controls in Security Hub.

New controls are disabled by default until they have been reviewed by the cloud team, which decides whether they should be enabled or not.
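
To illustrate, here is how a control could be disabled with boto3 in the delegated administrator account; the call is per account and per region, so in practice it would be repeated by the deployment tooling, and the control and reason below are examples:

```python
import boto3

securityhub = boto3.client("securityhub", region_name="eu-west-1")

# Find the subscription of the AWS Foundational Security Best Practices standard.
subscriptions = securityhub.get_enabled_standards()["StandardsSubscriptions"]
fsbp = next(
    s for s in subscriptions
    if "aws-foundational-security-best-practices" in s["StandardsArn"]
)

# List all controls of the standard (the API is paginated).
controls, kwargs = [], {"StandardsSubscriptionArn": fsbp["StandardsSubscriptionArn"]}
while True:
    page = securityhub.describe_standards_controls(**kwargs)
    controls.extend(page["Controls"])
    if "NextToken" not in page:
        break
    kwargs["NextToken"] = page["NextToken"]

# Disable a control that is made redundant by an automatic remediation.
for control in controls:
    if control["ControlId"] == "S3.14":
        securityhub.update_standards_control(
            StandardsControlArn=control["StandardsControlArn"],
            ControlStatus="DISABLED",
            DisabledReason="Versioning is enforced by a detective guardrail",
        )
```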

We also disabled or customized certain product integrations (I will explain later how) because they are:

  • Either too “noisy”. For example, the integration of Patch Manager in Security Hub generates one finding per EC2 instance and per scan. Therefore, there may be a lot of outdated findings if patches have been applied since then, which could confuse the application teams.
  • Or not adapted to our context. For example, the integration of IAM Access Analyzer creates findings in the Security account only, because the organization analyzers live in this account. However, if we want application teams to view the IAM Access Analyzer findings for their resources in Security Hub, we need to duplicate the findings from the Security account to member accounts.

Finally, Security Hub integrates with AWS Organizations, but member accounts still have a lot of permissions. For example, they can disable controls or security standards. We’ll see later how to prevent application teams from modifying the configuration of Security Hub. We also prevent application teams from changing the status of their findings.

Illustration of AWS security services enabled

Other logs

As a general rule, and whenever possible, security logs that should only be accessible by the cloud team are stored in the Logs & Keys account. However, security logs that should be accessible by application teams — notably for troubleshooting — are stored in the application account using a service that allows queries.

For example, VPC flow logs for all VPCs are stored as Parquet files in an S3 bucket in the Logs & Keys account. In addition, they are stored in CloudWatch Logs in each account with a shorter retention period, so that application teams can query them.
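
For instance, an application team could query these flow logs with CloudWatch Logs Insights; the log group name below is a placeholder:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")
LOG_GROUP = "/vpc/flow-logs"  # placeholder

# Top 10 talkers (by bytes) over the last hour, from VPC flow logs.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "stats sum(bytes) as totalBytes by srcAddr, dstAddr "
        "| sort totalBytes desc | limit 10"
    ),
)["queryId"]

# Poll until the query completes, then print each result row.
while (result := logs.get_query_results(queryId=query_id))["status"] in ("Scheduled", "Running"):
    time.sleep(2)
for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```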

Application teams are required to enable access logs for S3 buckets, load balancers and CloudFront distributions in a bucket managed by the cloud team (one bucket per account and region). I’ll explain how we enforce this requirement.

Next

In the next story of this series, I’ll describe the measures implemented to maintain or monitor compliance, and inform application teams of their security posture.


Nicolas Malaval

Ex-AWS Professional Services Consultant then Solutions Architect. Now Technology Lead Architect at Biogen Digital Health. Opinions are my own.