Dive deep on our AWS landing zone: Architecture, Decisions made, Lessons learnt — Part 3

Nicolas Malaval
10 min read · Oct 3, 2023


In the first and second stories of this series, I introduced the context in which we built our AWS landing zone, some of the security foundations, and how we track and report on security and compliance.

In this last story, I’ll describe other aspects of the landing zone: encryption at rest, network, backup, access management to management ports and production accounts, and FinOps.

Encryption at rest (KMS)

Obviously, we want to encrypt all data at rest, and KMS is a convenient and secure enough service for this. Nevertheless, several questions arise: what type of keys to use, how many keys, who manages them, how to enforce encryption, etc.?

Keys architecture

Here are some of the decisions we made on KMS keys:

AWS managed or customer managed keys: We deny the use of AWS managed keys, unless customer managed keys are not supported: ACM, CodeCommit, Cloud9… (see AWS service integration and look at “[1] Supports only AWS managed keys”). AWS managed keys don’t allow encrypted resources to be shared across accounts, so it is easier not to use them than to have to re-encrypt everything later if necessary.

However, we allow the use of AWS managed keys for AWS Lambda and AWS Backup, to remove the need to create a customer managed key in all regions for the detective guardrails and backup vaults deployed by the cloud team. Below is an example of an SCP to enforce this:

{
  "Effect": "Deny",
  "NotAction": [
    "kms:Decrypt",
    "kms:Describe*",
    "kms:Get*",
    "kms:List*",
    "kms:Tag*"
  ],
  "Resource": "arn:aws:kms:*:*:key/*",
  "Condition": {
    "Null": {
      "kms:ResourceAliases": "false",
      "kms:ViaService": "false"
    },
    "StringNotLike": {
      "kms:ViaService": [
        "acm.*.amazonaws.com",
        "codecommit.*.amazonaws.com",
        "cloud9.*.amazonaws.com",
        "dax.*.amazonaws.com",
        "lightsail.*.amazonaws.com",
        "lambda.*.amazonaws.com",
        "backup.*.amazonaws.com"
      ]
    },
    "ForAnyValue:StringLike": {
      "kms:ResourceAliases": "alias/aws/*"
    },
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-LogsKeysAdmin"
    }
  }
}

Centralized or decentralized KMS keys: The cloud team creates one KMS key named primary in each account and in each region in use — i.e. where application teams are expected to deploy resources — by assuming the IAM role protected-LogsKeysAdmin from the Logs & Keys account. Application teams cannot create KMS keys (kms:CreateKey is denied using an SCP) and must use this single key to encrypt everything, unless there is a good reason to use multiple keys (none so far).

We decided to create keys in member accounts, rather than in a central account, because it’s more user-friendly for application teams to choose the key by name in the AWS Console, rather than having to enter the key ARN when it’s in another account. Having one key per account also makes it easier to “detach” one account from the organization if needed. However, to close an account, all data to be repatriated to another account must be re-encrypted, since the original encryption key will be deleted.

Only the cloud team, via the IAM role protected-LogsKeysAdmin, can delete the primary keys or modify their key policy; application teams cannot. This reduces the risk of losing encrypted data because the key has been deleted, or of exposing encrypted data externally, as application teams have to ask the cloud team to grant other accounts permission to use the key for decryption. Below is an example of an SCP to enforce this, given that primary keys have a tag Protected:

{
  "Effect": "Deny",
  "NotAction": [
    "kms:CreateGrant*",
    "kms:Decrypt",
    "kms:Describe*",
    "kms:Encrypt",
    "kms:Generate*",
    "kms:Get*",
    "kms:List*",
    "kms:ReEncrypt*",
    "kms:RetireGrant",
    "kms:RevokeGrant",
    "kms:Sign",
    "kms:Verify"
  ],
  "Resource": "arn:aws:kms:*:*:key/*",
  "Condition": {
    "Null": {
      "aws:ResourceTag/Protected": "false"
    },
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-LogsKeysAdmin"
    }
  }
}

Key rotation: AWS documentation says it is a good practice to rotate KMS keys. Customer managed keys can only be rotated every year, AWS managed keys every 3 years. However, you pay for each version of a customer managed key, so the KMS bill will increase every year. It would be great if AWS allowed defining the rotation frequency…
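The cost impact of yearly rotation is simple arithmetic. Below is an illustrative sketch, assuming $1/month per key version (the standard KMS price for a customer managed key at the time of writing); the figures are estimates, not an AWS quote:

```python
# Rough estimate of how yearly rotation inflates the KMS bill for one
# customer managed key, assuming $1/month per key version (illustrative
# figure based on KMS pricing at the time of writing).
def monthly_kms_cost(years_since_creation: int, price_per_version: float = 1.0) -> float:
    """Monthly cost of a single key after N yearly rotations."""
    versions = 1 + years_since_creation  # original key material + one new version per year
    return versions * price_per_version

# After 4 years, one key already costs 5x its initial monthly price.
print([monthly_kms_cost(y) for y in range(5)])  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

With hundreds of accounts and several regions each holding a primary key, this linear growth compounds quickly across the organization.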

Enforcing encryption at rest

All supported resources should be encrypted at rest using KMS. Whenever possible, we prevent teams from creating unencrypted resources. However, this is only supported for too few services:

  • S3: We implemented a detective guardrail that enables bucket default encryption, at least with SSE-S3, to encrypt all objects in S3. This detective guardrail is no longer needed, as AWS has automatically encrypted all new objects since early 2023.
  • EBS volumes and snapshots: The cloud team configured EBS default encryption in all accounts and all regions. The default key is the primary key where it exists; elsewhere it is the AWS managed key aws/ebs, which application teams cannot use, effectively preventing them from creating EBS volumes and snapshots in those regions. We set up a preventive guardrail to deny modifications of the EBS default encryption settings:
{
  "Effect": "Deny",
  "Action": [
    "ec2:EnableEbsEncryption*",
    "ec2:DisableEbsEncryption*",
    "ec2:ModifyEbsDefaultKms*",
    "ec2:ResetEbsDefaultKms*"
  ],
  "Resource": "*",
  "Condition": {
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-LogsKeysAdmin"
    }
  }
}
  • RDS, EFS and ElastiCache: AWS supports IAM condition keys that can be used to deny the creation of unencrypted resources. As far as I know, these are the only services that support such condition keys. Here is the preventive guardrail we set up using these condition keys:
{
  "Effect": "Deny",
  "Action": [
    "rds:CreateDBCluster",
    "rds:RestoreDBClusterFromS3"
  ],
  "Resource": "*",
  "Condition": {
    "Null": {
      "rds:StorageEncrypted": "false"
    },
    "Bool": {
      "rds:StorageEncrypted": "false"
    }
  }
},
{
  "Effect": "Deny",
  "Action": [
    "rds:CreateDBInstance",
    "rds:RestoreDBInstanceFromS3"
  ],
  "Resource": "*",
  "Condition": {
    "Null": {
      "rds:StorageEncrypted": "false",
      "rds:DatabaseEngine": "false"
    },
    "Bool": {
      "rds:StorageEncrypted": "false"
    },
    "StringLike": {
      "rds:DatabaseEngine": [
        "custom-oracle*",
        "custom-sqlserver*",
        "mariadb*",
        "mysql*",
        "oracle*",
        "postgres*",
        "sqlserver*"
      ]
    }
  }
},
{
  "Effect": "Deny",
  "Action": "elasticfilesystem:CreateFileSystem",
  "Resource": "*",
  "Condition": {
    "Null": {
      "elasticfilesystem:Encrypted": "false"
    },
    "Bool": {
      "elasticfilesystem:Encrypted": "false"
    }
  }
},
{
  "Effect": "Deny",
  "Action": "elasticache:CreateReplicationGroup",
  "Resource": "*",
  "Condition": {
    "Null": {
      "elasticache:AtRestEncryptionEnabled": "false"
    },
    "Bool": {
      "elasticache:AtRestEncryptionEnabled": "false"
    }
  }
}

For some other services, we implemented standard controls in Security Hub or detective guardrails without automatic remediation to notify application teams when they create unencrypted resources.
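The guardrails above all rely on the same Null + Bool pairing: Null makes the deny apply only when the request actually carries the encryption condition key, and Bool matches when it is set to false. Here is a minimal Python simulation of that evaluation logic (my own sketch, not the real IAM policy engine):

```python
# Simplified simulation of the Null + Bool pattern used in the SCPs above.
# IAM evaluates each condition operator independently, and a Deny statement
# only matches when ALL of them are satisfied. (Illustrative sketch only.)
def statement_matches(request_keys: dict) -> bool:
    encrypted = request_keys.get("rds:StorageEncrypted")  # None if key absent
    null_ok = encrypted is not None              # "Null": {"rds:StorageEncrypted": "false"}
    bool_ok = str(encrypted).lower() == "false"  # "Bool": {"rds:StorageEncrypted": "false"}
    return null_ok and bool_ok

print(statement_matches({"rds:StorageEncrypted": "false"}))  # True  -> request denied
print(statement_matches({"rds:StorageEncrypted": "true"}))   # False -> request allowed
print(statement_matches({}))                                 # False -> key absent, no deny
```

Without the Null check, API calls that don't populate the condition key at all could be denied or allowed inconsistently; pairing the two operators makes the intent explicit.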

Network

The cloud team created one VPC named primary in all accounts and in each region in use, by assuming the IAM role protected-InfraAdmin from the Infra account. Application teams can’t create VPCs (ec2:CreateVpc is denied using an SCP), which gives better control over CIDR ranges and avoids overlaps, and they can’t modify certain attributes of the primary VPC. Below is an example of an SCP to enforce this, given that the VPC resources created by the cloud team have a tag Protected = infra:

{
  "Effect": "Deny",
  "Action": [
    "ec2:AssociateDhcpOptions",
    "ec2:AssociateSubnet*",
    "ec2:AssociateVpc*",
    "ec2:AttachInternetGateway",
    "ec2:AttachVpnGateway",
    "ec2:CreateNetworkAclEntry",
    "ec2:CreateTags",
    "ec2:CreateTransitGatewayVpcAttachment",
    "ec2:DeleteDhcpOptions",
    "ec2:DeleteFlowLogs",
    "ec2:DeleteInternetGateway",
    "ec2:DeleteNatGateway",
    "ec2:DeleteNetworkAcl*",
    "<OTHERS...>"
  ],
  "Resource": "*",
  "Condition": {
    "Null": {
      "aws:ResourceTag/Protected": "false"
    },
    "StringEquals": {
      "aws:ResourceTag/Protected": "infra"
    },
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-InfraAdmin"
    }
  }
}
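One reason cited above for centralizing VPC creation is CIDR control. A quick way the allocating team can check a candidate range against ranges already in use, with Python's standard ipaddress module (the CIDR ranges below are hypothetical, not our actual allocation plan):

```python
import ipaddress

# Hypothetical CIDR ranges already allocated to primary VPCs.
allocated = ["10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/20"]

def overlaps_existing(candidate: str, existing: list[str]) -> bool:
    """Return True if the candidate CIDR overlaps any allocated range."""
    cand = ipaddress.ip_network(candidate)
    return any(cand.overlaps(ipaddress.ip_network(net)) for net in existing)

print(overlaps_existing("10.1.128.0/24", allocated))  # True: inside 10.1.0.0/16
print(overlaps_existing("10.3.0.0/16", allocated))    # False: free to allocate
```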

By default, each primary VPC comes with two public subnets and two private subnets. Public subnets allow inbound HTTP and HTTPS, as well as other ports that application teams may request with proper justification. Route 53 Resolver DNS Firewall is implemented in each VPC and uses AWS managed rule groups to block the resolution of malicious domains.

Currently, application teams can create additional subnets and associate them to the existing private route table and network ACL. In the future, we would like to allow them to create their own route tables and network ACLs, but this requires implementing detective guardrails to check and remediate them if needed. For example, a subnet shouldn’t have a public route table and a network ACL that allows unlimited inbound traffic.

We removed the default VPC in all accounts and all regions, and implemented a preventive guardrail to prevent anyone from recreating it. This avoids creating VPC resources in regions that application teams shouldn’t be using.

The cloud team manages one transit gateway per region that has all the primary VPCs attached. The transit gateway has multiple route tables to allow connectivity between certain VPCs, and provides centralized egress to the Internet for all private subnets through a centralized AWS Network Firewall that uses AWS-managed rule groups.

Traffic to S3 and ECR exits via VPC endpoints to avoid Network Firewall data processing costs: one gateway endpoint in each primary VPC for S3, and one centralized interface endpoint for ECR (cheaper than one endpoint per VPC).

Illustration of our network architecture

Backup

At BDH, application teams are responsible for backing up their resources with the appropriate frequency and retention. However, the cloud team has implemented measures to avoid losing data in the event of incorrect backup configuration, human error or malicious action.

S3: We implemented a detective guardrail that enables bucket versioning for all S3 buckets, and configures lifecycle policies to delete non-current object versions after a given period. If someone deletes an object, a deletion marker is added and the object still exists as a non-current version.

To prevent data loss in production accounts, the detective guardrail ensures that non-current versions are retained for at least 7 days, and an SCP prevents unprivileged users from deleting specific object versions (s3:DeleteObjectVersion). As a result, we are able to restore objects in production accounts up to 7 days after deletion, which is short but better than nothing.

The only constraint is that users have to wait up to 7 days to delete a production bucket that contains files. Another, more expensive, solution would have been to use AWS Backup to back up S3 buckets.
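As a sketch, the lifecycle configuration such a guardrail could apply looks like the dict below, in the shape expected by boto3's s3.put_bucket_lifecycle_configuration. The rule ID is a hypothetical name of my choosing; the 7-day retention matches the production setting described above:

```python
# Sketch of a lifecycle configuration that expires non-current object
# versions after N days, in the format accepted by boto3's
# s3.put_bucket_lifecycle_configuration(...). Rule ID is hypothetical.
def noncurrent_expiry_rule(days: int) -> dict:
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {},  # empty filter: apply to every object in the bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }

config = noncurrent_expiry_rule(7)
print(config["Rules"][0]["NoncurrentVersionExpiration"])  # {'NoncurrentDays': 7}
```

Combined with versioning, this keeps deleted or overwritten objects recoverable for the retention window while capping storage costs.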

AWS Backup: For all other services supported by AWS Backup, the cloud team created one backup vault in each account and region, and a backup plan to back up all resources. Recovery points are retained for 21 days in production accounts and 7 days in non-production. A preventive guardrail prevents unprivileged users from modifying the backup vaults and plans, and from deleting the recovery points.

Here is an example of an SCP that protects recovery points, given that they have a tag Protected = backup (the role protected-AWSBackupServiceRole used by AWS Backup is created by the cloud team):

{
  "Effect": "Deny",
  "Action": [
    "backup:Delete*",
    "backup:Disassociate*",
    "backup:Tag*",
    "backup:Untag*",
    "backup:Update*"
  ],
  "Resource": "*",
  "Condition": {
    "Null": {
      "aws:ResourceTag/Protected": "false"
    },
    "StringEquals": {
      "aws:ResourceTag/Protected": "backup"
    },
    "ArnNotLike": {
      "aws:PrincipalARN": [
        "arn:aws:iam::*:role/protected-AWSBackupServiceRole",
        "arn:aws:iam::*:role/protected-SecurityAdmin"
      ]
    }
  }
}

Access to management ports

Our main requirements in this area are as follows, given that we don’t connect VPCs to our corporate network:

  1. Users must first authenticate via AWS SSO before they can access the management ports of EC2 instances (SSH, RDP…) or other VPC resources (RDS, EFS…). This removes the need to delete or modify SSH keys and other credentials when someone leaves the organization. Just delete the user in AWS SSO.
  2. Management ports shouldn’t be exposed to the Internet in order to limit the risk of intrusion and use of vulnerabilities.

Our strategy is to use Session Manager, a capability of AWS Systems Manager that establishes a secure “tunnel” between the SSM agent (installed by default on AMIs provided by AWS) and a user authenticated on AWS with sufficient permissions. Session Manager can be used either via the AWS Console directly, or via the AWS CLI using the Session Manager plugin. Session Manager also allows exposing remote ports locally (see port forwarding to remote hosts).
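For example, the AWS-StartPortForwardingSessionToRemoteHost document lets an authenticated user reach an RDS instance by tunnelling through an EC2 instance running the SSM agent. This is a CLI invocation fragment: the instance ID and database hostname below are placeholders, and the command requires AWS credentials and the Session Manager plugin to actually run:

```shell
# Forward local port 3306 to a MySQL database, tunnelling through an EC2
# instance that runs the SSM agent (IDs and hostname are placeholders).
aws ssm start-session \
  --target i-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.example.internal"],"portNumber":["3306"],"localPortNumber":["3306"]}'
```

Once the session is open, a local MySQL client can connect to localhost:3306 as if it were on the VPC network.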

Only VPC resources in public subnets can be reached from the Internet. The public network ACL authorizes inbound HTTP and HTTPS, but also ephemeral ports (1024–65535). As a result, application teams could still expose MySQL (3306) or development ports (8080) to the Internet. To avoid this, we implemented a detective guardrail that automatically removes security group rules authorizing inbound traffic to certain high-risk ports.

Finally, AWS recently announced EC2 Instance Connect Endpoints, an interesting alternative to Session Manager: it does not rely on the SSM agent and allows connecting to other VPC resources, like RDS instances, without an intermediate EC2 instance. However, in our context, we still need the SSM agent for Patch Manager, and Session Manager enables session activity recording — except, of course, when the SSM agent acts as a “proxy” and can’t decode the SSH traffic.

Access to production accounts

We want to record user activities in production systems, including AWS accounts, and to reduce the likelihood of data exfiltration to uncontrolled devices. Relying solely on CloudTrail logs is neither sufficient nor usable.

Based on the solution described in my story How to record system operator activities on AWS using Amazon AppStream 2.0 and Session Manager, we built a web portal called the “AWS Bastion” that enables human users authenticated via AWS SSO to create sessions in virtual desktops.

The virtual desktop screen is recorded and we block the use of the clipboard from virtual desktops to local workstations. Users can access production accounts (via the AWS Console or AWS CLI) and connect to EC2 instances with Session Manager, only from the virtual desktops (we use the aws:SourceIp condition key to deny actions outside of the virtual desktops).
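As a hedged sketch, a deny statement built on aws:SourceIp could look like the fragment below. The CIDR range is a placeholder for the virtual desktops’ egress IPs (203.0.113.0/24 is a documentation range), and our actual policy also exempts service roles and pipeline principals:

```json
{
  "Effect": "Deny",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "NotIpAddress": {
      "aws:SourceIp": "203.0.113.0/24"
    },
    "Bool": {
      "aws:ViaAWSService": "false"
    }
  }
}
```

The aws:ViaAWSService condition prevents the deny from breaking calls that AWS services make on a user’s behalf, which would otherwise fail the source IP check.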

Note that CI/CD pipelines don’t need to use the “AWS Bastion” as the related activities are recorded by the pipeline solution itself.

Screenshot of our “AWS Bastion” solution

FinOps

In the same way as for security findings, the cloud team helps the application teams to control their AWS costs by providing actionable reports and tools.

We created a cost monitor in AWS Cost Anomaly Detection in all accounts so that application teams receive an email notification when there is an abnormal cost variation in one of their accounts. Application teams can then remediate the issue, or mark the anomaly as “expected” or “false positive” in AWS Cost Anomaly Detection.

We developed a Lambda function that automatically queries the AWS Cost Explorer API and sends a weekly report by email to each application team, helping to quickly understand the most important cost items, the biggest variations, and the opportunities to save costs with reserved instances and right-sizing. There are many third-party solutions for cloud cost management, but we didn’t evaluate them and chose to develop our own tool.
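A minimal sketch of the aggregation step behind such a report: summing a Cost Explorer get_cost_and_usage response (grouped by SERVICE) into per-service totals. The response shape mirrors the real API; the sample figures are invented for illustration, and our actual Lambda does considerably more (variations, recommendations):

```python
# Sum a Cost Explorer get_cost_and_usage response (GroupBy SERVICE) into
# per-service totals, sorted by descending cost. Sample data is invented.
def totals_by_service(response: dict) -> dict:
    totals: dict[str, float] = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "120.5"}}},
            {"Keys": ["Amazon S3"], "Metrics": {"UnblendedCost": {"Amount": "14.2"}}},
        ]},
        {"Groups": [
            {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "130.0"}}},
        ]},
    ]
}
print(totals_by_service(sample))  # {'Amazon EC2': 250.5, 'Amazon S3': 14.2}
```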

Extract of the weekly FinOps report… the full report is much longer

Conclusion

I hope this series has been useful and will give you tips and tricks for building or improving your landing zone. Feel free to respond to any of the three stories to comment or share your own experience.


Nicolas Malaval

Ex-AWS Professional Services Consultant then Solutions Architect. Now Technology Lead Architect at Biogen Digital Health. Opinions are my own.