Using Cloud Custodian in your AWS Organization

May 11, 2022

Introduction

Let’s say you have an AWS organization with several accounts, and you need to enforce security standards across all AWS services: every non-compliant account or resource must either be remediated or reported to the responsible team. For example:

Delete unencrypted EBS volumes, or notify the account owner or team by email.
Block public access on all S3 buckets.
Flag S3 buckets without encryption.
Detect root logins.
Terminate EC2 instances that use unapproved AMIs.

And so on. The use cases are practically endless, and together they give the Security Team centralized security management across the organization.

You can certainly build security policies with Lambda, Config Rules, CloudWatch Events, etc. But how will they be maintained? One pipeline for each policy? One CloudFormation StackSet per security policy, deploying its resources to the whole organization or to specific Organizational Units?
So why use Cloud Custodian? Because it is easier. Once Custodian is configured, you no longer need to wire a CloudWatch rule to a Lambda function for every policy, resource, or feature whose compliance you want to check, nor work out how to deploy each new security measure. You just write the policy in YAML and Cloud Custodian handles the rest.

First of all, what is Cloud Custodian?

“Cloud Custodian is a tool that unifies the dozens of tools and scripts most organizations use for managing their public cloud accounts into one open-source tool. It uses a stateless rules engine for policy definition and enforcement, with metrics, structured outputs and detailed reporting for clouds infrastructure. It integrates tightly with serverless runtimes to provide real time remediation/response with low operational overhead.

Organizations can use Custodian to manage their cloud environments by ensuring compliance to security policies, tag policies, garbage collection of unused resources, and cost management from a single tool.

Cloud Custodian can be bound to serverless event streams across multiple cloud providers that maps to security, operations, and governance use cases. Custodian adheres to a compliance as code principle, so you can validate, dry-run, and review changes to your policies.” from Cloud Custodian Page.

Cloud Custodian is one of many tools that organizations use to stay compliant with their own rules and baselines.

Okay, but how does it work? Cloud Custodian applies security and compliance policies across the accounts in the organization. Its policies are expressed in YAML and include the following:

The type of resource to run the policy against
Filters to narrow down the set of resources
Actions to take on the filtered set of resources
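To make those three parts concrete, here is a minimal sketch in Python that mirrors the shape of a policy file as a plain dictionary, using the unencrypted EBS example from the introduction. The policy name and the sample volumes are made up for illustration, and the `matches` and `select` helpers are only a toy stand-in to show how a filter narrows a resource set; they are not Cloud Custodian's actual engine.

```python
import json

# Same shape as a YAML policy file: resource type, filters, actions.
policy_file = {
    "policies": [
        {
            "name": "ebs-unencrypted-delete",   # hypothetical policy name
            "resource": "aws.ebs",              # type of resource to run against
            "filters": [{"Encrypted": False}],  # narrow down the resource set
            "actions": ["delete"],              # what to do with the matches
        }
    ]
}


def matches(resource, filters):
    """Toy filter: every key/value pair in every filter must equal the resource's value."""
    return all(
        resource.get(key) == value
        for flt in filters
        for key, value in flt.items()
    )


def select(resources, policy):
    """Return the resources that the policy's filters select."""
    return [r for r in resources if matches(r, policy["filters"])]


# Made-up sample data standing in for volumes discovered in an account.
volumes = [
    {"VolumeId": "vol-aaa", "Encrypted": True},
    {"VolumeId": "vol-bbb", "Encrypted": False},
]

policy = policy_file["policies"][0]
flagged = select(volumes, policy)

print(json.dumps(policy_file, indent=2))
print([v["VolumeId"] for v in flagged])
```

In a real repository you would write the same structure directly in YAML and let Custodian do the filtering and run the actions.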

But who defines these policies? Who writes them? That depends on how your organization works. The security team could write the policies following the resource baselines; the teams working within the organization could write the policies they need for their own applications and projects; core squads such as Foundation, Security, Analytics, IaaS, and PaaS could develop the policies for the resources they own; and so on. The size, maturity, and culture of your organization are key factors when implementing Cloud Custodian.

An introduction, best practices, monitoring guidance, and some example policies can be found on the Cloud Custodian website; feel free to check it out.

AWS Architecture example

This is a basic architecture, meant only to give you some context and make the rest easier to follow.
It uses AWS services throughout, plus GitLab/GitHub for the code (whichever your organization uses):

AWS CodePipeline
AWS CodeCommit
AWS CodeBuild
AWS CloudWatch (here I prefer Events, because the Cloud Custodian Lambda is triggered as soon as the target resource is created or updated, so any security breach is fixed almost immediately)
AWS Lambda
The AWS services that will be scanned
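The event-driven option mentioned in the list is expressed in a policy through a `mode` block. Below is a minimal sketch, again as a Python dictionary that mirrors the YAML, of a policy that runs as a Lambda function when CloudTrail records a `CreateVolume` call. The policy name and the role name `custodian-lambda-role` are assumptions for illustration; use whatever role your organization provisions for Custodian.

```python
import copy


def with_cloudtrail_mode(policy, role_arn, source, event, ids):
    """Return a copy of `policy` that runs as a Lambda triggered by one CloudTrail event."""
    event_policy = copy.deepcopy(policy)
    event_policy["mode"] = {
        "type": "cloudtrail",
        "role": role_arn,
        # Explicit event form: API source, event name, and where to find the resource id.
        "events": [{"source": source, "event": event, "ids": ids}],
    }
    return event_policy


base_policy = {
    "name": "ebs-unencrypted-on-create",  # hypothetical policy name
    "resource": "aws.ebs",
    "filters": [{"Encrypted": False}],
    "actions": ["delete"],
}

event_policy = with_cloudtrail_mode(
    base_policy,
    role_arn="arn:aws:iam::{account_id}:role/custodian-lambda-role",  # hypothetical role
    source="ec2.amazonaws.com",
    event="CreateVolume",
    ids="responseElements.volumeId",
)

print(event_policy["mode"])
```

Without a `mode` block the same policy simply runs once, wherever the `custodian` command is invoked, which is what a scheduled or pipeline-driven scan relies on.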

Pipelines Structure

This is the layout of the pipelines. There are two: a development pipeline and a master pipeline. The development pipeline deploys nothing; it only executes a dry run. The CodeBuild log shows each policy running against every account of the Organizational Unit (if your organization uses OUs for departments and so on), along with a count of how many resources are non-compliant with that policy in that specific account.
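As a rough idea of what the CodeBuild step could run, here is a small sketch that assembles a multi-account command line. It assumes the pipeline uses `c7n-org`, Cloud Custodian's multi-account runner, which this article does not prescribe; the file names and the `ou:security` tag are hypothetical.

```python
def build_c7n_org_command(accounts_file, policy_file, output_dir, dryrun=True, tags=()):
    """Assemble a c7n-org invocation as an argument list suitable for subprocess.run."""
    cmd = [
        "c7n-org", "run",
        "-c", accounts_file,  # accounts configuration file
        "-u", policy_file,    # policy file to apply
        "-s", output_dir,     # directory for per-account output
    ]
    for tag in tags:
        cmd += ["-t", tag]    # only run against accounts carrying this tag
    if dryrun:
        cmd.append("--dryrun")  # filter resources but take no actions
    return cmd


# Development pipeline: dry run only, limited to one (hypothetical) OU tag.
dev_cmd = build_c7n_org_command(
    "accounts.yml", "policies.yml", "output", dryrun=True, tags=["ou:security"]
)
print(" ".join(dev_cmd))
```

The development pipeline would keep `dryrun=True`, while the master pipeline would add a later stage that calls the same helper with `dryrun=False`.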

Source Stage
A pipeline can have any number of sources. For the Cloud Custodian pipeline I have worked with only one: the repository holding the policies, Python code, configuration files, and so on. You can add a second source for the pipeline infrastructure if your pipeline was deployed using Service Catalog. In my case we deployed and maintained the pipeline with Terraform, so an infrastructure source in the pipeline was not needed.

Following best practices, the pipeline was not triggered by CodeCommit directly. It was triggered by a CloudWatch rule that fires whenever a change occurs in the CodeCommit repository, which is synced with the GitLab repository where the users and security team commit their changes.

The only difference between the development pipeline and the master pipeline is the branch that the source action watches. The development pipeline watches the develop/development branch, and the master pipeline watches the master/main branch of the repository.
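The rule that starts each pipeline can be sketched as an event pattern that matches branch updates in the CodeCommit repository. The helper below builds one pattern per branch; the repository ARN (account id, region, and repository name) is a placeholder, not a value from our setup.

```python
import json


def codecommit_branch_pattern(repository_arn, branch):
    """Event pattern matching commits pushed to one branch of a CodeCommit repository."""
    return {
        "source": ["aws.codecommit"],
        "detail-type": ["CodeCommit Repository State Change"],
        "resources": [repository_arn],
        "detail": {
            "event": ["referenceCreated", "referenceUpdated"],
            "referenceType": ["branch"],
            "referenceName": [branch],
        },
    }


# Placeholder ARN for illustration only.
REPO_ARN = "arn:aws:codecommit:us-east-1:111111111111:custodian-policies"

# One rule per pipeline: the only difference is the branch name.
patterns = {
    "development": codecommit_branch_pattern(REPO_ARN, "develop"),
    "master": codecommit_branch_pattern(REPO_ARN, "main"),
}

print(json.dumps(patterns["development"], indent=2))
```

Each pattern would be attached to its own rule, with the corresponding pipeline as the rule's target.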

Approval Stage
We set up a manual approval. In the development pipeline it is optional; in the master pipeline it is mandatory.
Here you can also plug in the governance steps your organization uses, such as an integration with Jira and/or ServiceNow.

DryRun Stage
Simulates the deployment command but doesn’t deploy anything. It validates the structure of the policies and scans the accounts using all the policies in the repository. In the CodeBuild execution logs we can see that policy X, running on account Y in region Z, found 0 to N resources out of compliance.
All these logs are stored in an S3 bucket in the management account, so analytics or data visualization can be built on top of them.
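As a starting point for that kind of analytics, here is a small sketch that counts non-compliant resources from a local copy of the run output. It assumes an output layout of `<output>/<account>/<region>/<policy>/resources.json`, where each file holds a JSON list of matched resources; check the layout your own runs produce before relying on it. The account, region, and policy names in the demo are made up.

```python
import json
import tempfile
from pathlib import Path


def count_noncompliant(output_dir):
    """Map (account, region, policy) to the number of resources in resources.json."""
    counts = {}
    for path in sorted(Path(output_dir).glob("*/*/*/resources.json")):
        # The three directories above the file are assumed to be account/region/policy.
        account, region, policy = path.parent.parts[-3:]
        with path.open() as fh:
            counts[(account, region, policy)] = len(json.load(fh))
    return counts


# Demo with a throwaway directory and made-up names.
with tempfile.TemporaryDirectory() as tmp:
    policy_dir = Path(tmp) / "dev-account" / "us-east-1" / "ebs-unencrypted"
    policy_dir.mkdir(parents=True)
    (policy_dir / "resources.json").write_text(
        json.dumps([{"VolumeId": "vol-1"}, {"VolumeId": "vol-2"}])
    )
    demo_counts = count_noncompliant(tmp)

print(demo_counts)
```

The resulting dictionary is easy to turn into a CSV or to push into whatever visualization tool your organization already uses.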

Deploy Stage
Deploys all the policies for the Organizational Unit that the action targets, or uses whatever deployments your organization has configured in the actions across all accounts.
Always check the logs for issues or failures, even if the action itself completed successfully.
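One way to avoid relying on someone reading the logs by hand is to scan them at the end of the build and fail the action when something looks wrong. The sketch below is deliberately generic: it flags lines containing an `ERROR` or `CRITICAL` level or a Python traceback header. The sample lines are invented placeholders, not real Cloud Custodian output, so adjust the pattern to what your logs actually contain.

```python
import re

# Generic failure markers; tune this to the log format your build really produces.
FAILURE_PATTERN = re.compile(
    r"\b(ERROR|CRITICAL)\b|Traceback \(most recent call last\)"
)


def find_failures(log_text):
    """Return the log lines that look like failures."""
    return [line for line in log_text.splitlines() if FAILURE_PATTERN.search(line)]


# Invented placeholder lines, only to exercise the function.
sample_log = "\n".join(
    [
        "INFO step finished for account-a",
        "ERROR access denied for account-b",
        "INFO step finished for account-c",
    ]
)

failures = find_failures(sample_log)
print(failures)
```

In a build script you could end with `sys.exit(1)` when `failures` is not empty, so the action is marked as failed instead of quietly succeeding.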

Customizable

All of the architecture and services used by the CI/CD can, and should, be adapted to your organization’s standards. Nothing above has to be followed by the book; develop according to your organization’s standards and what you feel most comfortable with.

Possible Problems

Timeout: if your organization has too many accounts, the dry run or the deploy action can fail because of the CodeBuild timeout. Keep in mind that, at the time of writing, the limit is 8 hours and AWS cannot increase this quota.
Execution time: Cloud Custodian deploys each policy individually in each account, so deployment time grows with the number of policies and the number of accounts and can become long.
AWS TAM alerted: because Cloud Custodian deploys each policy individually in each account, it can generate a very large number of Lambda update API calls (such as UpdateFunctionCode and UpdateFunctionConfiguration), and your organization's TAM (Technical Account Manager) may receive warnings about it. It is a good idea to keep your TAM informed of the scheduled deployment windows for the Cloud Custodian pipeline.
CloudWatch rule quota in the project accounts: the standard quota is 300, so when writing policies, always try to consolidate them as much as you can.
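A quick back-of-the-envelope calculation helps to spot the timeout and quota problems before they happen. The sketch below uses the 8-hour limit and the 300-rule quota mentioned above; every other number (policies, accounts, regions, seconds per deployment, parallel workers) is a made-up input, and it assumes each event-driven policy consumes one rule per account and region.

```python
CODEBUILD_LIMIT_MINUTES = 8 * 60  # 8-hour limit mentioned above
RULE_QUOTA = 300                  # standard rule quota mentioned above


def estimate_deploy_minutes(policies, accounts, regions, seconds_per_deploy, workers=1):
    """Rough total time when every policy is deployed individually per account and region."""
    total_deploys = policies * accounts * regions
    return total_deploys * seconds_per_deploy / workers / 60


def rules_remaining(event_policies, quota=RULE_QUOTA):
    """Rules left in one account and region, assuming one rule per event-driven policy."""
    return quota - event_policies


# Hypothetical organization: 40 policies, 150 accounts, 2 regions, 5 s per deployment.
sequential = estimate_deploy_minutes(40, 150, 2, 5)
parallel = estimate_deploy_minutes(40, 150, 2, 5, workers=4)

print(sequential, sequential > CODEBUILD_LIMIT_MINUTES)
print(parallel, parallel > CODEBUILD_LIMIT_MINUTES)
print(rules_remaining(280))
```

With these invented numbers a sequential run needs 1000 minutes, well past the 480-minute limit, while four parallel workers bring it down to 250 minutes. Splitting the deployment per Organizational Unit has a similar effect.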

Closure

Read about Cloud Custodian on their webpage, get some hands-on practice, and validate with your team and organization whether it is applicable.
Cloud Custodian GitHub for AWS here.
I hope this article is helpful. Feel free to comment with your feedback, use cases, and problems so we can help each other!