Cloud Custodian — Overview and deployment of cloud governance

Jean-Brice GACHOT
Published in ManoMano Tech team
Apr 14, 2021


History

When we migrated to a pure Cloud-based infrastructure, our ability to ship new features quickly was greatly improved.

DevOps practitioners have access to a very large toolbox. From AWS's famous compute service, EC2, to fully managed data services like Kinesis or Aurora, the set of infrastructure resources feels almost unlimited. This has proved to be a reliable way to enable quality business innovations.

But cloud costs can grow faster than your business. This is where FinOps comes in. FinOps aims to ensure we get the most value out of every dollar spent in the cloud. We would like to do so without reducing the velocity or the quality of newly shipped features.

What is cloud governance?

Cloud governance is a set of rules that you define, apply and verify in your cloud environment to achieve several objectives.

One objective of cloud governance could be, for example, to prevent your account from becoming messy. This first objective could be achieved by ensuring your tagging policy is respected across every resource.

Except the satisfaction migrating :

Photo by Nick Nice and Sigmund on Unsplash

This tagging strategy could also allow your FinOps team to make an accurate cost allocation of cloud consumption based on these tags (team, project etc…). It could then help the business make the right decisions over time.

To give you another example, a cloud governance rule could state that “no compute instance should be under 20% of average utilization”; any instance below that threshold should be downscaled.

Cloud governance can take several forms.

With that in mind, we at ManoMano decided to use Cloud Custodian to implement our cloud governance.

Introduction

What is Cloud Custodian?

Cloud Custodian is a rules engine for managing public cloud accounts and resources. It allows users to define policies to enable a well managed cloud infrastructure, that’s both secure and cost optimized. It consolidates many of the adhoc scripts organizations have into a lightweight and flexible tool, with unified metrics and reporting.

Cloud Custodian uses compliance rules, called policies, to compare the desired and actual state of your cloud resources.
Beyond compliance, a large set of plugins provides complementary features: email sending with c7n-mailer, running across several AWS accounts and roles with c7n-org, etc.

“c7n” is the common abbreviation for Cloud Custodian.

A policy in Custodian is a YAML file following this basic structure:

policies:
  - name: your-policy
    resource: a-cloud-resource
    description: My first policy
    filters:
      - (some filter that will select a subset of resources)
      - (possibly more than one)
    actions:
      - (an action to trigger on this subset)
      - (and another one)
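
To make this structure more concrete, here is a minimal sketch of a real policy file. It assumes a tagging policy that requires a "team" tag on every EC2 instance; the policy name and the tag key are illustrative assumptions, not our actual configuration. With no actions, Custodian only reports the matching resources.

```yaml
policies:
  # Hypothetical policy name and tag key, for illustration only
  - name: ec2-missing-team-tag
    resource: aws.ec2
    description: Report EC2 instances that have no "team" tag.
    filters:
      # Matches instances where the "team" tag does not exist
      - "tag:team": absent
```

Starting with a report-only policy like this one is a low-risk way to check what a filter matches before adding any action.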

Resources:

Custodian is able to target several cloud providers (AWS, GCP, Azure) and each provider has its own section in the documentation. Here you will find the documentation for AWS resources.

Filters:

“Filters” are the way to target a specific subset of resources in Custodian. They can be based on resource characteristics, labels, etc. Some filter examples can be found here.
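
As an illustration, the “under 20% of average utilization” rule mentioned earlier could be sketched with a CloudWatch metrics filter. The policy name and the 7-day window are assumptions chosen for the example; the threshold comes from the rule itself.

```yaml
policies:
  # Hypothetical policy name; the 7-day window is an assumption
  - name: ec2-low-cpu-utilization
    resource: aws.ec2
    description: Report running EC2 instances averaging under 20% CPU.
    filters:
      # Only look at instances that are currently running
      - "State.Name": running
      # Average CPUUtilization over the last 7 days, as one daily datapoint
      - type: metrics
        name: CPUUtilization
        days: 7
        period: 86400
        statistics: Average
        value: 20
        op: less-than
```

Filters listed at the top level are combined with a logical AND, so an instance must be running and under the threshold to be selected.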

Actions:

“Actions” are what you actually do with the resources that matched the filters (see the example below to better understand the type of actions we can use). An action can be as simple as sending a report to the owner, stating that the resource does not comply with the cloud governance rule.

Both actions and filters can combine as many rules as you need to express your requirements precisely.
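
The sketch below shows one way to combine several filters and several actions in a single policy. The tag keys ("team", "project", "governance") and the choice of stopping the instance are assumptions made for the example, not a recommendation.

```yaml
policies:
  # Hypothetical policy: names, tag keys and actions are illustrative
  - name: ec2-missing-required-tags
    resource: aws.ec2
    filters:
      - "State.Name": running
      # Nested "or" block: match if either required tag is missing
      - or:
          - "tag:team": absent
          - "tag:project": absent
    actions:
      # Actions run in order on every matched instance
      - type: tag
        key: governance
        value: missing-required-tags
      - stop
```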

Example

The official documentation provides a list of very good examples that can help you get started. I suggest that you start there if you are looking for inspiration.

We will now describe a common automated workflow. This should give you an overview of Cloud Custodian's features:

First of all, a policy in Cloud Custodian will often be split into several steps. Each step represents a different point on a timeline.

In the first step, as we saw earlier, we will usually target a set of resources that do not comply with a rule.

For example, let’s take RDS instances that are not used (no connection for the last seven days). On an AWS development account, if our rule matches an instance, we can probably assume that this instance is a leftover from some kind of test.

Since we don’t want to leave unused resources on our account, we decide that it should be removed. But in case this instance is actually needed, we only flag it for deletion and inform the owner with an email or a Slack notification. At this step, the owner can contact the FinOps team if they do not agree with the decision.

Since we launch Cloud Custodian several times a day, we need to make sure we don’t repeat step 1 forever. That’s why we should exclude resources that have already been flagged for deletion.

This is the end of step 1.
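
Step 1 could be sketched as follows. This is an assumption-laden example rather than our production policy: the tag name c7n_rds_unused, the policy name, the email subject and the queue URL are all hypothetical, and the notify action assumes c7n-mailer is set up to read from that SQS queue.

```yaml
policies:
  # Hypothetical step 1: flag unused RDS instances and warn the owner
  - name: rds-unused-step-1-mark
    resource: aws.rds
    description: Flag RDS instances with no connection for 7 days.
    filters:
      # Exclude instances already flagged by a previous run
      - "tag:c7n_rds_unused": absent
      # Average number of connections over the last 7 days is zero
      - type: metrics
        name: DatabaseConnections
        days: 7
        value: 0
        op: equal
    actions:
      # Tag the instance with a deletion date 7 days from now
      - type: mark-for-op
        tag: c7n_rds_unused
        op: delete
        days: 7
      # Push a message to the queue consumed by c7n-mailer
      - type: notify
        template: default.html
        subject: "Unused RDS instance flagged for deletion"
        to:
          - resource-owner
        transport:
          type: sqs
          # Placeholder queue URL
          queue: https://sqs.eu-west-1.amazonaws.com/123456789012/cloud-custodian-mailer
```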

In the second step, we could target the RDS instances that have been flagged for deletion for 5 days, verify that no new connection has been made to the database, and send a last warning to the owner.
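
A possible sketch of this second step, assuming step 1 scheduled the deletion 7 days ahead with mark-for-op under a hypothetical c7n_rds_unused tag. The skew option of the marked-for-op filter matches resources whose scheduled date is at most that many days away, so a skew of 2 selects instances flagged at least 5 days ago. The second tag, used to avoid sending the warning on every run, and the queue URL are also assumptions.

```yaml
policies:
  # Hypothetical step 2: last warning before deletion
  - name: rds-unused-step-2-last-warning
    resource: aws.rds
    filters:
      # Deletion date is 2 days away or less
      - type: marked-for-op
        tag: c7n_rds_unused
        op: delete
        skew: 2
      # Do not warn twice
      - "tag:c7n_rds_last_warning": absent
      # Still no connection since the instance was flagged
      - type: metrics
        name: DatabaseConnections
        days: 5
        value: 0
        op: equal
    actions:
      - type: tag
        key: c7n_rds_last_warning
        value: sent
      - type: notify
        template: default.html
        subject: "Last warning: unused RDS instance will be deleted"
        to:
          - resource-owner
        transport:
          type: sqs
          # Placeholder queue URL
          queue: https://sqs.eu-west-1.amazonaws.com/123456789012/cloud-custodian-mailer
```

If connections did resume, a companion policy would be needed to remove the deletion tag; that part is left out of this sketch.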

The third and last step could be, after 7 days, to finally remove the unused database.
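
The final step could look like the sketch below, again assuming the instances were marked under a hypothetical c7n_rds_unused tag. Without skew, the marked-for-op filter only matches once the scheduled date has been reached. Keeping a final snapshot is a cautious choice made for the example, not something the workflow requires.

```yaml
policies:
  # Hypothetical step 3: delete once the scheduled date is reached
  - name: rds-unused-step-3-delete
    resource: aws.rds
    filters:
      - type: marked-for-op
        tag: c7n_rds_unused
        op: delete
      # Safety net: re-check that the database is still unused
      - type: metrics
        name: DatabaseConnections
        days: 7
        value: 0
        op: equal
    actions:
      - type: delete
        # Keep a final snapshot so the data can be restored if needed
        skip-snapshot: false
```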

I hope you are now starting to see the kind of patterns you can implement with Cloud Custodian.

Cloud Custodian is open source and backed by the CNCF as a sandbox project.

The second part of this article will focus on how we implemented Cloud Custodian at ManoMano.

Implementation

The first part of this article showed many AWS-related examples, but the same examples could also be applied to other cloud providers like GCP or Azure. The next part, however, will focus on the specifics of the ManoMano “stack” (mainly GitLab and AWS).

AWS account organization

At ManoMano we decided to split our different environments across several AWS accounts. In order to avoid maintaining access to every account individually, we use a specific account for IAM management. This account holds all the IAM users and is then used as a gateway to “assume” a role with specific rights in the other accounts.

We will not dive into this configuration, but the rest of this article is easier to follow if you are already familiar with this kind of setup. If you want to know more about it, you can check this AWS tutorial. The c7n-org plugin allows you to run your policies across several accounts in a single run.

nb: if you only have one AWS account, all the c7n-org commands in the rest of this article can be replaced with the plain Cloud Custodian (c7n) equivalent, the custodian command.

IAM policy

For each account, we have to create an IAM policy that allows Cloud Custodian to read or write the resources it runs against. I suggest that you open up this policy progressively, as your Custodian policies require it, rather than giving full admin rights to the role: a malformed Custodian policy running with admin rights could destroy resources in your account. To further protect our critical deployments, we also decided to split our policies into two main categories, production and non-production. This allows us to release policies in a kind of “beta” mode on our development environments before allowing them to run against our mission-critical environment.
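
As an illustration of such a restricted role, here is a sketch written as a CloudFormation template. Everything in it is an assumption: we do not claim this is how our roles are provisioned, the parameter and resource names are hypothetical, and the list of permissions is only a plausible starting point for policies that tag and delete unused RDS instances. The exact permissions a given Custodian policy needs should be confirmed by running it.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Hypothetical restricted role for Cloud Custodian (sketch only)
Parameters:
  IamAccountId:
    Type: String
    Description: ID of the account that holds the IAM users
Resources:
  CustodianRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: dev-role-custodian
      # Only principals from the IAM management account may assume the role
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${IamAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: custodian-rds-unused
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              # Read access needed to list instances and evaluate filters
              - Sid: ReadOnly
                Effect: Allow
                Action:
                  - rds:DescribeDBInstances
                  - rds:ListTagsForResource
                  - cloudwatch:GetMetricStatistics
                Resource: "*"
              # Write access added only once a policy actually needs it
              - Sid: TagAndDelete
                Effect: Allow
                Action:
                  - rds:AddTagsToResource
                  - rds:RemoveTagsFromResource
                  - rds:DeleteDBInstance
                Resource: "*"
```

Keeping the write statement separate makes it easy to grant it on non-production accounts first and on production later.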

Version control repository organization

We store all our configuration files and policies in a dedicated GitLab repository, which is organized as follows:

.
├── .gitlab-ci.yml
├── accounts.yml
├── mailer.yml
└── policies
    ├── nonprod
    │   ├── daily
    │   │   ├── daily-nonprod-policy.yml
    │   │   └── another-daily-nonprod-policy.yml
    │   ├── monthly
    │   └── weekly
    │       └── weekly-nonprod-policy.yml
    └── prod
        ├── daily
        ├── monthly
        └── weekly
            └── weekly-prod-policy.yml

Let’s focus on the policies directory first. As you can see, we created a hierarchy of folders that distinguishes between policies that should run against “production” and “non-production” accounts, and that also lets us run policies at different frequencies.

In the next chapter we will focus on our GitLab CI jobs that automatically run policies based on this hierarchy, but first let’s check the content of the two configuration files (accounts.yml and mailer.yml).

Custodian configuration files

  • accounts.yml is the list of the AWS accounts we described earlier, with the IAM role that c7n-org should assume on each of them:
accounts:
  - account_id: '333333333333'
    name: development-account
    regions:
      - eu-west-1
    role: arn:aws:iam::333333333333:role/dev-role-custodian
    vars:
      account: development
    tags:
      - type:nonprod
  - account_id: '444444444444'
    name: staging-account
    regions:
      - eu-west-1
    role: arn:aws:iam::444444444444:role/sta-role-custodian
    vars:
      account: staging
    tags:
      - type:nonprod
  - account_id: '666666666666'
    name: production-account
    regions:
      - eu-west-1
    role: arn:aws:iam::666666666666:role/prd-role-custodian
    vars:
      account: production
    tags:
      - type:prod
  • mailer.yml is the configuration used by c7n-mailer to retrieve all the messages that Cloud Custodian policies pushed to the SQS queue and to send them to each recipient defined in the “notify” action of the policy (c7n-mailer can also assume a role to access the queue):
queue_url: https://sqs.eu-west-1.amazonaws.com/123456789012/cloud-custodian-mailer
region: eu-west-1
role: arn:aws:iam::123456789012:role/iam-role-custodian
from_address: CloudCustodian@Company.com

CI job example

We defined three scheduled jobs in GitLab, one for each frequency used in the repository layout (daily, weekly and monthly).

In each schedule, we define a specific variable (FREQUENCY_TYPE) that is passed at execution time and allows the script to determine which subset of policies needs to be run.

Below you will find the content of our .gitlab-ci.yml file, which defines two jobs: one that runs the policies against our accounts, and another that runs the c7n-mailer command to send the reports.

What you need to understand here is that we use some shell and c7n tricks to fit our specific needs. This is not part of the Cloud Custodian documentation and should not be considered “best practice”.

Use this section “as is” if you want to match our repository structure, or adapt it to your needs:

---
stages:
  - custodian
  - reports

custodian policies:
  stage: custodian
  image:
    name: cloudcustodian/c7n-org
    # We override the entrypoint of the Docker image so that we can run the script below
    entrypoint: [""]
  # Run the job only when it is scheduled by GitLab
  only:
    - schedules
  script: |
    # Find policies with the specified frequency.
    # For daily jobs it should return policies/nonprod/daily/daily-nonprod-policy.yml
    # and policies/nonprod/daily/another-daily-nonprod-policy.yml
    for POLICY_PATH in $(find policies/* -mindepth 2 -maxdepth 2 -type f -name '*.yml' | grep "/$FREQUENCY_TYPE/")
    do
      # Extract the account type ("nonprod" or "prod") from the path
      ACCOUNT=$(echo "$POLICY_PATH" | cut -f2 -d/)
      # -t selects the accounts to target from accounts.yml by tag - here it will be type:nonprod
      c7n-org run -s output \
        -c accounts.yml \
        -u "$POLICY_PATH" \
        -t "type:$ACCOUNT" \
        --region eu-west-1
    done
  # Store the output in a GitLab artifact just in case
  artifacts:
    name: custodian-reports
    expire_in: 1 week
    paths:
      - output

custodian mailer:
  stage: reports
  image:
    name: cloudcustodian/mailer
    # Same here, we override the entrypoint
    entrypoint: [""]
  only:
    - schedules
  script: |
    # Run the command in order to send the reports
    c7n-mailer --run \
      -c mailer.yml

Conclusion

In your journey to the cloud, as your account begins to grow, you can easily get lost in the number of resources it holds, and you may have a hard time spotting the ones that are insecure, unused or badly tagged. Cloud Custodian can help you find those resources and apply the best possible strategy to correct these mistakes in a safe way. It’s a great product 👍

Cloud Custodian

Pros
  • Open source (backed by the CNCF)
  • YAML definitions are easy to read
  • Allows complex workflows
  • Plugins expand the possibilities

Cons
  • Could benefit from a UI showing, for example, the resources pending deletion
  • Mainly targets cloud providers at the moment. It would be great if it could also target other SaaS/software (like Datadog, for example with compliance over tags in metrics, etc.)
