Dive deep on our AWS landing zone: Architecture, Decisions made, Lessons learnt — Part 2

Oct 3, 2023

In the first story of this series, I introduced the context in which we built our AWS landing zone, along with the architecture and decisions made for the account structure, access management for AWS, and the security services.

In this second story, I’ll describe the measures implemented to maintain or monitor compliance, and inform application teams of their security posture.

Preventive guardrails

In the previous story, I described a set of services and resources that we implemented for security and compliance. But what if application teams change their configuration or disable them? For example, someone with AdministratorAccess to an application account could turn off Config recording.

To avoid this, we implemented Service Control Policies (SCPs) in AWS Organizations, which we call preventive guardrails, as in Control Tower. More generally, we use SCPs wherever possible to block any action that could lead to non-compliance before it occurs.

SCPs to protect security resources

Security services (CloudTrail, Config, GuardDuty…) are configured in all accounts after assuming the IAM role protected-SecurityAdmin from the Security account.

This role can only be assumed by AWS SSO users or other systems that have access to the Security account and are authorized to make an sts:AssumeRole call on arn:aws:iam::*:role/protected-SecurityAdmin.

The following SCP is an example of how to prevent anyone except this IAM role from calling sensitive AWS actions: closing the AWS account, enabling or disabling an AWS region, stopping Config recording, disabling GuardDuty, leaving the AWS organization, changing the account-level S3 Block Public Access settings, etc.:

{
  "Effect": "Deny",
  "Action": [
    "account:CloseAccount",
    "account:DisableRegion",
    "account:EnableRegion",
    "account:PutChallengeQuestions",
    "account:PutContactInformation",
    "billing:UpdateIAMAccessPreference",
    "config:DeleteConfigurationRecorder",
    "config:DeleteDeliveryChannel",
    "config:DeleteRetentionConfiguration",
    "config:PutConfigurationRecorder",
    "config:PutDeliveryChannel",
    "config:PutRetentionConfiguration",
    "config:StopConfigurationRecorder",
    "guardduty:CreateDetector",
    "guardduty:CreateMember*",
    "guardduty:CreateSampleFindings",
    "guardduty:DeleteMember*",
    "guardduty:DeleteDetector",
    "guardduty:Disassociate*",
    "guardduty:InviteMember*",
    "guardduty:Stop*",
    "guardduty:UpdateDetector",
    "guardduty:UpdateMember*",
    "guardduty:UpdateOrganization*",
    "<OTHERS...>"
  ],
  "Resource": "*",
  "Condition": {
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
    }
  }
}
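
The statement above is only one element of an SCP document. As a minimal sketch (the helper name build_scp is hypothetical and not taken from our code base), the following Python wraps statements in the policy envelope and checks the result against the 5,120-character size limit that AWS Organizations documents for SCPs:

```python
import json

SCP_MAX_CHARS = 5120  # documented maximum size of an SCP document


def build_scp(statements):
    """Wrap SCP statements in a policy document and return compact JSON."""
    document = {"Version": "2012-10-17", "Statement": list(statements)}
    # Compact separators: whitespace counts toward the size limit.
    body = json.dumps(document, separators=(",", ":"))
    if len(body) > SCP_MAX_CHARS:
        raise ValueError(f"SCP is {len(body)} characters, limit is {SCP_MAX_CHARS}")
    return body


# Shortened version of the statement above (only two of the actions).
protect_security_services = {
    "Effect": "Deny",
    "Action": ["config:StopConfigurationRecorder", "guardduty:DeleteDetector"],
    "Resource": "*",
    "Condition": {
        "ArnNotLike": {
            "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
        }
    },
}

policy_json = build_scp([protect_security_services])
print(len(policy_json), "characters")
```

The resulting string is what would be passed as the Content parameter of the Organizations CreatePolicy API with the type SERVICE_CONTROL_POLICY. Because of the size limit, grouping related statements and relying on wildcards, as in the guardduty:* entries above, tends to matter in practice.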

Whenever possible, we tag the resources that are created by this IAM role (with a tag Protected = security) to differentiate the protected resources from other resources.

The following SCP is an example of how to prevent anyone except the IAM role from performing any action on these resources, apart from certain “read-only” actions. For example, unprivileged users can’t delete roles if they have a tag Protected = security but they can delete other roles.

{
  "Effect": "Deny",
  "NotAction": [
    "cloudtrail:CancelQuery",
    "cloudtrail:Describe*",
    "cloudtrail:Get*",
    "cloudtrail:List*",
    "cloudtrail:StartQuery",
    "config:BatchGet*",
    "config:Describe*",
    "config:Get*",
    "config:List*",
    "config:Select*",
    "iam:Get*",
    "iam:List*",
    "iam:PassRole",
    "<OTHERS HERE...>"
  ],
  "Resource": "*",
  "Condition": {
    "Null": {
      "aws:ResourceTag/Protected": "false"
    },
    "StringEquals": {
      "aws:ResourceTag/Protected": "security"
    },
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
    }
  }
}
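
To make the combination of NotAction and the three conditions easier to reason about, here is a deliberately simplified Python model of this single statement. It is not the IAM evaluation engine: it ignores other policies, treats action names as case-sensitive, and uses only a subset of the NotAction list. The function name is_denied and the account ID are illustrative assumptions.

```python
from fnmatch import fnmatchcase

ADMIN_ROLE_PATTERN = "arn:aws:iam::*:role/protected-SecurityAdmin"
# Subset of the NotAction list above; the real list is longer.
READ_ONLY_ACTIONS = ["iam:Get*", "iam:List*", "iam:PassRole", "config:Describe*"]


def is_denied(action, principal_arn, resource_tags):
    """Return True when this simplified model of the statement denies the call."""
    # NotAction: the listed actions are never matched by this statement.
    if any(fnmatchcase(action, pattern) for pattern in READ_ONLY_ACTIONS):
        return False
    # Null = false plus StringEquals: the tag must exist and equal "security".
    if resource_tags.get("Protected") != "security":
        return False
    # ArnNotLike: the statement applies to every principal except the admin role.
    return not fnmatchcase(principal_arn, ADMIN_ROLE_PATTERN)


developer = "arn:aws:iam::111122223333:role/Developer"
admin = "arn:aws:iam::111122223333:role/protected-SecurityAdmin"

print(is_denied("iam:DeleteRole", developer, {"Protected": "security"}))
print(is_denied("iam:DeleteRole", developer, {}))
print(is_denied("iam:GetRole", developer, {"Protected": "security"}))
print(is_denied("iam:DeleteRole", admin, {"Protected": "security"}))
```

Only the first call is denied in this model: the action is not read-only, the resource carries the protected tag, and the caller is not the security role.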

When tag-based permissions are not supported, we use resource name prefixes. For example, CloudTrail doesn’t support tag-based permissions for trails, so we prefix trail names with protected-. The following SCP is an example of how to prevent anyone except the IAM role from modifying the protected trails:

{
  "Effect": "Deny",
  "NotAction": [
    "cloudtrail:Get*",
    "cloudtrail:List*"
  ],
  "Resource": "arn:aws:cloudtrail:*:*:trail/protected-*",
  "Condition": {
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
    }
  }
}

The cloud team creates S3 buckets in each account and region in use to store S3, CloudFront and load balancer logs. The following SCP is an example of how to prevent anyone except the IAM role protected-SecurityAdmin from modifying these buckets or deleting objects:

{
  "Effect": "Deny",
  "NotAction": [
    "s3:PutObject",
    "s3:PutObjectAcl",
    "s3:PutBucketAcl",
    "s3:Get*",
    "s3:Describe*",
    "s3:List*"
  ],
  "Resource": "arn:aws:s3:::bdh-access-logs-*",
  "Condition": {
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
    }
  }
}

Here is another example of an SCP that prevents unprivileged users from creating and managing IAM users, except for their access keys:

{
  "Effect": "Deny",
  "NotAction": [
    "iam:CreateAccessKey",
    "iam:DeleteAccessKey",
    "iam:Generate*",
    "iam:Get*",
    "iam:List*",
    "iam:Simulate*",
    "iam:UpdateAccessKey"
  ],
  "Resource": [
    "arn:aws:iam::*:group/*",
    "arn:aws:iam::*:mfa/*",
    "arn:aws:iam::*:sms-mfa/*",
    "arn:aws:iam::*:user/*"
  ],
  "Condition": {
    "ArnNotLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:role/protected-SecurityAdmin"
    }
  }
}

Finally, it is also important to prevent unprivileged users from modifying the IAM role protected-SecurityAdmin, as well as other roles whose name starts with protected-. To do this, we provision the role protected-SecurityAdmin immediately after account creation, we create the other roles via this first role, and we assign the tag Protected = security to all these roles so that unprivileged users can’t modify them.

Other SCPs

Here is an example of an SCP that prevents the account root user from performing actions other than changing account settings and configuring the account root user’s MFA:

{
  "Effect": "Deny",
  "NotAction": [
    "account:*",
    "iam:EnableMFADevice",
    "iam:DeactivateMFADevice",
    "iam:CreateVirtualMFADevice",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "iam:UpdateAccountEmailAddress",
    "iam:UpdateAccountName"
  ],
  "Resource": "*",
  "Condition": {
    "ArnLike": {
      "aws:PrincipalARN": "arn:aws:iam::*:root"
    }
  }
}

Here is another example of an SCP that prevents anyone from modifying SAML providers created by AWS SSO (see example policy). AWS SSO prevents users from modifying the IAM roles it creates, but not the SAML providers…

{
  "Effect": "Deny",
  "NotAction": [
    "iam:Get*",
    "iam:List*"
  ],
  "Resource": "arn:aws:iam::*:saml-provider/AWSSSO_*"
}

Many other SCPs exist in our AWS organization. Some of these will be described later, when we discuss KMS encryption and VPC network. As for the others, this story is not intended to list them all, but I hope you understand the concept.

Detective guardrails

Often, it is not possible to block actions that could result in non-compliance before they occur, because of AWS IAM limitations. For example, you can’t prevent users from creating S3 buckets with access logs disabled or not delivered to a specific bucket.

In those cases, we implemented Config rules, which we call detective guardrails, again as in Control Tower. Whenever possible, we enabled automatic remediation to fix non-compliant resources without human intervention. Otherwise, the rules simply report non-compliant resources.

Implementation approach

Most of our Config rules are custom rules with “business logic” in Lambda functions. For automatic remediation, we create Automation documents that use Python scripts directly in the runbooks (no need to manage compute resources). In retrospect, I think that CloudFormation Guard is worth a look because writing custom rules is cumbersome and verbose.
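
To give an idea of what such a custom rule looks like, here is a minimal sketch of a Lambda function for a change-triggered rule that checks whether an EC2 instance has an IAM role attached. It is not our actual code: the function evaluate_instance is hypothetical, and the configuration key iamInstanceProfile is an assumption about the shape of the recorded configuration item that should be verified against a real item.

```python
import json


def evaluate_instance(configuration_item):
    """Return (compliance_type, annotation) for an EC2 instance item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE", "Not an EC2 instance"
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE", "Instance was deleted"
    # Assumption: the instance profile is recorded under this key.
    configuration = configuration_item.get("configuration") or {}
    if configuration.get("iamInstanceProfile"):
        return "COMPLIANT", "An IAM role is attached"
    return "NON_COMPLIANT", "No IAM role is attached"


def lambda_handler(event, context):
    # Imported here so the logic above can be tested without the AWS SDK.
    import boto3

    # Config passes the changed resource as a JSON string in invokingEvent.
    item = json.loads(event["invokingEvent"])["configurationItem"]
    compliance_type, annotation = evaluate_instance(item)
    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": compliance_type,
                "Annotation": annotation,
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )
    return compliance_type
```

Keeping the decision in a pure function separate from the handler makes the “business logic” testable locally, which helps with the verbosity mentioned above.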

Config integrates with AWS Organizations to deploy Config rules across multiple accounts. However, we used AWS Orga Deployer to create one Lambda function, one Automation document and one Config rule for each detective guardrail in each account and region. This makes it easier to deploy new or updated rules progressively and use different parameters in different accounts.

The Lambda functions and Automation documents send messages to an SQS queue in the Security account in case of errors or successful remediations. This allows the cloud team to monitor all detective guardrails from a single place. Whenever possible, we also add a tag BDHAutoRemediation to remediated resources to inform application teams.

Illustration of preventive guardrails implementation

Examples of detective guardrails

Here are some examples of detective guardrails with automatic remediation that we implemented:

Detect EC2 instances without IAM role attached. Remediation: Attach the IAM role protected-EC2RoleForSSM that the cloud team creates in all accounts. This role has sufficient permissions to allow the SSM agent to communicate with AWS Systems Manager, which is needed for Patch Manager.
Detect EC2 roles without sufficient permissions to allow the SSM agent to communicate with AWS Systems Manager. Remediation: Attach the necessary IAM policies.
Detect S3 buckets with versioning disabled. Remediation: Enable bucket versioning and create lifecycle policies to delete non-current versions after 1 year for production accounts and 3 months for non-production accounts, and to expire delete markers and incomplete uploads.
Detect load balancers and CloudFront distributions using insecure SSL/TLS protocols. Remediation: Update non-compliant resources to use at least TLS 1.2.
Detect S3 buckets, load balancers and CloudFront distributions with access logs disabled. Remediation: Enable access logs with the bucket managed by the cloud team as the destination bucket.
Detect IAM users with access keys that are older than 6 months. Remediation: Mark these IAM users as non-compliant, and disable the access keys if they are still active 3 months later.
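
As an illustration of the S3 versioning remediation in the list above, the lifecycle rules could be built as follows. This is a sketch under assumptions: 365 and 90 days stand in for “1 year” and “3 months”, the 7-day delay for incomplete uploads is a value I picked for the example, and the function name is hypothetical.

```python
def lifecycle_configuration(is_production, abort_uploads_after_days=7):
    """Build a lifecycle configuration for a versioned bucket."""
    # Assumption: 1 year ~ 365 days, 3 months ~ 90 days.
    noncurrent_days = 365 if is_production else 90
    return {
        "Rules": [
            {
                "ID": "cleanup-versioned-bucket",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: whole bucket
                # Delete non-current versions after the retention period.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Remove delete markers that no longer hide any version.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
                # Clean up multipart uploads that were never completed.
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_uploads_after_days
                },
            }
        ]
    }


print(lifecycle_configuration(is_production=False))
```

The returned dictionary has the shape expected by the LifecycleConfiguration parameter of the S3 PutBucketLifecycleConfiguration API. That call replaces the existing configuration of the bucket, so a real remediation would probably need to merge these rules with any that the application team already defined.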

Here are some examples of detective guardrails without automatic remediation:

Monitor SES reputation metrics (bounce and complaint rate) and mark the rule as non-compliant if it exceeds a given threshold.
Detect EC2 instances that are not integrated in Systems Manager (SSM agent likely not installed). We don’t use the standard control “[SSM.1] Amazon EC2 instances should be managed by AWS Systems Manager” because it reports stopped instances as non-compliant, which doesn’t make sense because the agent cannot be running if the instance is stopped…

Security monitoring and reporting

Now that security services are enabled (GuardDuty, Access Analyzer…) and detective guardrails are in place, some of which have no automatic remediation, we aggregate findings in Security Hub so that all curated security findings are available in one place, and we send periodic email reports to the cloud team and the application teams with the list of findings requiring action.

Aggregating curated findings in Security Hub

Here are the possible types of findings that we aggregate in Security Hub and how we aggregate them:

Security controls in Security Hub: As explained in the previous story, we enabled two security standards in all accounts and regions, and we left only certain applicable and actionable controls enabled.

GuardDuty: We enabled the native integration of GuardDuty in Security Hub to add GuardDuty findings in Security Hub. However, as indicated in the AWS documentation, archiving a finding in GuardDuty — in case of false positives for example — will not update the finding in Security Hub, and vice-versa. As a workaround, we have developed a Lambda function that periodically synchronizes the status of findings between GuardDuty and Security Hub.
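
A rough sketch of one direction of this synchronization (GuardDuty to Security Hub) is shown below. It only builds the arguments for the Security Hub BatchUpdateFindings API from findings returned by the GuardDuty GetFindings API. Several points are assumptions rather than a description of our function: that the Security Hub finding Id equals the GuardDuty finding ARN, that the partition is aws, and that an archived finding should map to the workflow status SUPPRESSED.

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def suppress_requests(guardduty_findings, batch_size=100):
    """Build BatchUpdateFindings arguments for archived GuardDuty findings."""
    identifiers = [
        {
            # Assumption: Security Hub reuses the GuardDuty finding ARN as Id.
            "Id": finding["Arn"],
            "ProductArn": (
                f"arn:aws:securityhub:{finding['Region']}::product/aws/guardduty"
            ),
        }
        for finding in guardduty_findings
        if finding.get("Service", {}).get("Archived")
    ]
    # BatchUpdateFindings accepts a limited number of findings per call.
    return [
        {"FindingIdentifiers": batch, "Workflow": {"Status": "SUPPRESSED"}}
        for batch in chunks(identifiers, batch_size)
    ]


sample = [
    {"Arn": "arn:aws:guardduty:eu-west-1:111122223333:detector/d1/finding/f1",
     "Region": "eu-west-1", "Service": {"Archived": True}},
    {"Arn": "arn:aws:guardduty:eu-west-1:111122223333:detector/d1/finding/f2",
     "Region": "eu-west-1", "Service": {"Archived": False}},
]
print(suppress_requests(sample))
```

Each element of the returned list could then be passed as keyword arguments to the batch_update_findings method of a boto3 Security Hub client.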

Access Analyzer: We enabled the native integration of Access Analyzer in Security Hub. However, because we set up organizational analyzers in the Security account, the findings appear in Security Hub in the Security account only. We have developed a Lambda function that duplicates these findings in the accounts owning the non-compliant resources, so that they appear in the Security Hub console of the application accounts.

Patch Manager: As explained in the previous story, the native integration of Patch Manager in Security Hub generates one finding per EC2 instance and per scan, which may be confusing. Therefore, we have developed a Lambda function that periodically creates or updates one finding in Security Hub per EC2 instance, which makes it easier for application teams to identify instances with missing security patches.

Detective guardrails: We create findings in Security Hub for non-compliant resources that require human action (one finding per guardrail and per non-compliant resource).

The title, description, severity and remediation action of the finding in Security Hub are retrieved from tags assigned to the Config rules (SecurityHub_Title, SecurityHub_Description…) and from the annotation of individual resource evaluations. For example, for the detective guardrail that detects IAM users with access keys that must be rotated, the severity depends on the number of days remaining before the access keys are automatically disabled.
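
The sketch below shows how such a finding could be assembled in the AWS Security Finding Format before being sent with the BatchImportFindings API. The two tag names come from the text above; everything else is an assumption made for the example: the severity thresholds (7 and 30 days), the Id scheme, the finding type, and the function names.

```python
from datetime import datetime, timezone


def severity_for(days_remaining):
    """Map days left before automatic action to a severity label."""
    # Hypothetical thresholds, chosen only for this example.
    if days_remaining <= 7:
        return "HIGH"
    if days_remaining <= 30:
        return "MEDIUM"
    return "LOW"


def build_finding(rule_name, rule_tags, resource, account_id, region,
                  annotation, days_remaining, now=None):
    """Build one finding per guardrail and per non-compliant resource."""
    moment = now or datetime.now(timezone.utc)
    timestamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        # Deterministic Id: importing again updates the same finding.
        "Id": f"{rule_name}/{resource['Id']}",
        "ProductArn": (
            f"arn:aws:securityhub:{region}:{account_id}:"
            f"product/{account_id}/default"
        ),
        "GeneratorId": rule_name,
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks"],
        "CreatedAt": timestamp,
        "UpdatedAt": timestamp,
        "Severity": {"Label": severity_for(days_remaining)},
        "Title": rule_tags["SecurityHub_Title"],
        "Description": f"{rule_tags['SecurityHub_Description']} {annotation}",
        "Resources": [resource],
    }


finding = build_finding(
    rule_name="iam-access-key-rotation",  # hypothetical rule name
    rule_tags={
        "SecurityHub_Title": "IAM access key must be rotated",
        "SecurityHub_Description": "Access keys older than 6 months.",
    },
    resource={"Type": "AwsIamUser",
              "Id": "arn:aws:iam::111122223333:user/example"},
    account_id="111122223333",
    region="eu-west-1",
    annotation="The key will be disabled in 20 days.",
    days_remaining=20,
)
print(finding["Severity"], finding["Title"])
```

Deriving the Id from the rule and the resource is one way to obtain exactly one finding per guardrail and per non-compliant resource, since importing a finding with an existing Id updates it instead of creating a duplicate.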

AWS added Config integration in Security Hub after we set up the landing zone. However, this integration is not sufficiently customizable and we would have developed this Lambda function anyway.

AWS Health: We enabled the native integration between AWS Health and Security Hub to automatically add security-related findings from AWS Health. However, certain findings are informational only and don’t require human action, such as AWS operational issues or ACM certificates successfully renewed. Therefore, we have developed a Lambda function that periodically resolves the findings that don’t require human action.

Sending security reports

Now that Security Hub contains curated findings, each team can navigate to the Security Hub console to find the findings that affect them. However, we can’t rely solely on the goodwill of the teams to periodically check for new findings to process.

That is why we have developed a Lambda function that periodically sends email reports listing new and previous findings (we change the finding status from NEW to NOTIFIED in Security Hub after it has been reported for the first time):

The cloud team receives a daily report. They are responsible for resolving false positives and engaging with application teams when immediate response is needed.
Each application team receives a weekly report with the findings that concern them. They can’t update or resolve the findings in Security Hub, but they can either take action to remediate non-compliant resources or inform the cloud team in the event of false positives.
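
The NEW to NOTIFIED transition described above can be sketched as follows. The filter and the request use the shapes of the Security Hub GetFindings and BatchUpdateFindings APIs; the function names and the sample data are hypothetical, and the email rendering is left out.

```python
# Filter for GetFindings: active findings that still require attention.
REPORT_FILTERS = {
    "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    "WorkflowStatus": [
        {"Value": "NEW", "Comparison": "EQUALS"},
        {"Value": "NOTIFIED", "Comparison": "EQUALS"},
    ],
}


def split_report(findings):
    """Separate findings reported for the first time from earlier ones."""
    new, previous = [], []
    for finding in findings:
        status = finding.get("Workflow", {}).get("Status", "NEW")
        (new if status == "NEW" else previous).append(finding)
    return new, previous


def mark_notified_request(new_findings):
    """Build BatchUpdateFindings arguments that move findings to NOTIFIED."""
    return {
        "FindingIdentifiers": [
            {"Id": f["Id"], "ProductArn": f["ProductArn"]} for f in new_findings
        ],
        "Workflow": {"Status": "NOTIFIED"},
    }


findings = [
    {"Id": "a", "ProductArn": "arn:example:1", "Workflow": {"Status": "NEW"}},
    {"Id": "b", "ProductArn": "arn:example:1", "Workflow": {"Status": "NOTIFIED"}},
]
new, previous = split_report(findings)
print(len(new), "new,", len(previous), "previously reported")
print(mark_notified_request(new))
```

The update should only be sent once the email has actually been delivered, so that a failed run reports the same findings as new again. BatchUpdateFindings accepts a limited number of findings per call, so a real implementation would also send the identifiers in batches.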

Example of the table in a weekly report that summarizes security findings

Ideas for improvement

First, there is clearly room for improvement in Security Hub:

The integration of certain security services in Security Hub is not mature or usable enough. Why doesn’t archiving findings in GuardDuty propagate to Security Hub? Why does Patch Manager create so many findings in Security Hub?…
Some basic features would be useful: being able to add a note in the Console when resolving a finding, being able to snooze a finding…
A better integration between Security Hub and Config would also be helpful to retrieve configuration details of a non-compliant resource when browsing findings of all member accounts from the Security account.
We need a way to trigger a notification or an action when users update findings. For example, if we want to delegate the ability to resolve false positives to application teams, the cloud team should be informed so they can double-check. Security Hub recently introduced finding history but that is not sufficient.

Second, our approach largely relies on the cloud team: they resolve false positives, they engage with application teams when urgent action is needed, and they remind application teams when remediation actions are still due. To improve this situation, we would need to integrate Security Hub with a task and workflow solution… or wait for AWS to add these capabilities in Security Hub.

Third, all this required a lot of Python code, more than I could have imagined. Evaluating third-party solutions rather than building everything yourself seems to me to be an essential step, as long as you choose the tool according to your needs, and you don’t pick a tool and then adapt your needs based on its capabilities.

In the next and last story of this series, I’ll describe other aspects of the landing zone: encryption at rest, network, backup, access management to management ports and production accounts, and FinOps.