Cloud Custodian — Things I Wish I Knew At The Beginning

Wilmer Mendez
Globant
Jul 25, 2022 · 5 min read

Introduction

Cloud Custodian is an open-source tool that started as a project at an American financial company and went on to become a CNCF project maintained by the community. You could loosely describe Cloud Custodian as the Terraform of cloud governance and compliance, in the sense that it lets you interact with the three main cloud service providers using similar syntax and policies.

You can find quite a few documents and articles explaining how to get started with Cloud Custodian. In particular, I recommend an article by Jean-Brice GACHOT titled Cloud Custodian — Overview and deployment of cloud governance, which was really helpful for understanding how to run Custodian on an AWS multi-account architecture. The official documentation is also very descriptive and covers all Custodian features, with example policies to guide you through everything you can do with the tool. That documentation is a good starting point for anyone who wants to implement Custodian; this article focuses instead on tips about the quirks we found while implementing it. While some of what I present here can be found in the official docs, I wanted to condense those quirks, and the solutions we found along the way, in one place.

The following sections will be reviewed in this article:

  • Finding resources quickly
  • Testing with cache-period set to 0
  • Do not forget to test using a dry run first
  • JMESPath queries and functions
  • Conclusion

Finding resources quickly

Once you have properly set up c7n-org for multi-account Custodian execution, including roles and regions, it becomes a handy tool for getting a quick inventory or resource list across multiple accounts or regions. It can also help when investigating orphaned or legacy resources. For example, this basic policy provides lots of interesting info:

Simple lambda policy to retrieve all resources
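
A minimal sketch of what such a policy can look like. The policy name `lambda-inventory` is hypothetical; the idea is that a policy with no filters and no actions simply lists every resource of the given type:

```yaml
policies:
  # Hypothetical policy name; any unique name works.
  - name: lambda-inventory
    # No filters and no actions: every Lambda function found in the
    # target account and region is written to resources.json.
    resource: aws.lambda
```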

Output:

resources.json from output folder
CSV comma separated list from c7n-org report command

Here I was able to retrieve information on AWS Lambda functions, which let me quickly work out which resource an unknown, orphaned IAM role was attached to. You could argue that the same result can be achieved through the AWS CLI or one of the AWS SDKs, and you would be right. But when you have a proper multi-account, multi-region Cloud Custodian setup, you do not need to juggle credentials and regions; you only need to write a couple of lines in a policy to get all the info you need in no time. Another advantage is that you can play around with the different values and filters in Custodian policies using a simple YAML syntax.
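
For reference, the multi-account setup mentioned above lives in a c7n-org accounts file. The following is only a sketch: the account names, account IDs, role ARNs, and regions are all placeholders, not values from a real environment.

```yaml
accounts:
  # Placeholder account; replace the ID and role with your own.
  - name: dev
    account_id: "111111111111"
    # Role that c7n-org assumes in this account (hypothetical name).
    role: arn:aws:iam::111111111111:role/CustodianRole
    regions:
      - us-east-1
      - us-west-2
  - name: sandbox
    account_id: "222222222222"
    role: arn:aws:iam::222222222222:role/CustodianRole
    regions:
      - us-east-1
```

Assuming this is saved as `accounts.yml` and the policy as `lambda-inventory.yml` (both file names are illustrative), something like `c7n-org run -c accounts.yml -s output -u lambda-inventory.yml` runs the policy against every account and region listed, and `c7n-org report` with the same arguments produces the CSV summary.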

Testing with cache-period set to 0

Imagine that you are starting to test Cloud Custodian policies, say to check that your resources have the proper tags set. You start by running the policy against resources that you already know lack the proper tags, and the policy correctly lists them. You fix one or a few of those resources and re-run the policy. They should drop out of the report, right? No: you still get the same results as at the beginning of the test. I was scratching my head trying to figure out what was going on, and it turns out that Cloud Custodian has a default cache period of 15 minutes. It keeps a local cache of resource metadata and shows you results from that cache rather than the live resource state, which makes reports faster than fetching the metadata from the cloud on every run. But when you are testing the effect of policies and of the fixes made to resources, you want the actual live results, so when testing policies make sure to set the cache period to 0:

custodian run -s output-folder my-first-policy.yml --cache-period=0
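
The kind of tag policy described above could look like the following sketch. The tag key `Owner` and the policy name are assumptions chosen for illustration, not taken from the original tests:

```yaml
policies:
  # Hypothetical name for a simple tag-compliance check.
  - name: ec2-missing-owner-tag
    resource: aws.ec2
    filters:
      # Matches instances that have no Owner tag at all.
      - "tag:Owner": absent
```

With the cache period set to 0, an instance that you tag between two runs should drop out of resources.json on the next run instead of lingering until the cache expires.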

Do not forget to test using a dry run first!

Let’s look at an example: a policy which terminates EC2 instances using invalid AMIs.

Example policy to terminate EC2 instances using invalid AMIs
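
A sketch of such a policy, assuming the invalid AMIs are known in advance and kept as an explicit list. The AMI IDs below are placeholders, not real images:

```yaml
policies:
  - name: ec2-invalid-ami
    resource: aws.ec2
    filters:
      # Only consider instances that are currently running.
      - "State.Name": running
      # Match instances launched from any AMI in this deny list.
      - type: value
        key: ImageId
        op: in
        value:
          - ami-0123456789abcdef0  # placeholder, not a real AMI
          - ami-0fedcba9876543210  # placeholder, not a real AMI
    actions:
      # Destructive: terminates every matching instance.
      - terminate
```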

This policy has a very aggressive action, which terminates non-compliant instances. There are a couple of pieces of advice I can give you here. First, test it on an isolated dev or sandbox account. Second, ALWAYS use a dry run when first checking which resources will be affected by a policy action.

custodian run -s output-folder ec2-invalid-ami.yml --dryrun

The dry run flag runs the policy without actually executing the action, giving you the list of resources that would be affected by it. An additional safeguard is to use alternative credentials for local testing that do not allow unintended disruptive actions. I cannot stress enough how important it is to control how policies are tested and executed; I have heard horror stories of companies destroying production resources because of wrongly executed policies.

JMESPath queries and functions

When you start to dig deeper into the kind of resource filtering you can do with Cloud Custodian, you realize it is a very handy way to get resource information. Cloud Custodian uses JMESPath queries to filter resource information from the JSON output.

ECS service policy with JMESPath query

In this policy, we use a filter key to find when an AWS ECS service deployment was last updated. However, the field deployments[].updatedAt cannot be compared using the age value type as it is; you need to convert the date into a string that the age value type can parse. In this situation, we can use the JMESPath to_string function to get the proper data format and then filter on the age value.
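
A sketch of the kind of policy described above. The exact expression used in the original policy is not shown here, so the `[0]` index, the 30-day threshold, and the policy name are assumptions; the intent is to match services whose first listed deployment was last updated more than 30 days ago:

```yaml
policies:
  # Hypothetical policy name.
  - name: ecs-service-stale-deployment
    resource: aws.ecs-service
    filters:
      - type: value
        # to_string() converts the timestamp into a string so the
        # age value type can parse it. Index [0] is an assumption.
        key: "to_string(deployments[0].updatedAt)"
        value_type: age
        op: greater-than
        # The age value type measures age in days.
        value: 30
```

Running it with a dry run and the cache period set to 0 is a cheap way to confirm that the expression selects the services you expect.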

Conclusion

Cloud Custodian is a handy tool for enforcing governance and compliance across multiple accounts and services. Getting started with it can come with a few hurdles that you will need to tackle along the way. If you face particular requirements or use cases, you can also post your question on the Cloud Custodian Gitter Community, where you may find experts who can help you with your issues.
