Lessons learned while using CDK

Jaewoo Ahn
Mar 17, 2022

To be honest, I hadn’t used CloudFormation, SAM, or CDK much. It is somewhat embarrassing to admit as an engineer who works at AWS, but I had mostly worked on service application code rather than infrastructure. Recently, I had a chance to work on infrastructure and played with CDK.

CloudFormation is a declarative format for defining AWS infrastructure resources. To add more programmability, people used to tame it with a template engine such as Jinja. Although that gives some flexibility, the template becomes really hard to read and debug as it grows complex and hierarchical for the sake of reusability and modularity. If you have ever wasted time debugging a gigantic template decorated with a template engine, you know what I mean.

People have explored more human-friendly ways to work than interacting with CloudFormation directly. Several frameworks attempt to offer a better developer experience, but they still use CloudFormation (or AWS APIs) underneath to provision resources in an AWS environment.

CDK is another wrapper for this purpose, and it is my favorite so far. The essence of CDK is a set of layered constructs on top of CloudFormation. Preferably you deal with the higher-level constructs, but you can still access the low-level CFN resources.

Generally, the higher-level constructs are well abstracted. They are intended to reduce the verbosity of the lower-level ones, and their defaults are well chosen. However, I often found myself digging into a construct’s implementation to understand what is happening inside it. In the end, you have to understand how CloudFormation works to use CDK properly. Let’s look at some examples from what I learned.

Maximum number of resources in a stack

What if you suddenly encounter this message while building your stack?

Number of resources: 475 is approaching allowed maximum of 500

Apparently this limit comes from CloudFormation, which allows at most 500 resources in a stack. Honestly, I can’t imagine how people lived with it when the limit was 200 until a few years ago.

When you deal with CloudFormation resources directly, you are more likely to be aware of how many resources you have. With CDK constructs it is more opaque: you have no idea how many CFN resources a single higher-level construct generates until CDK actually builds and synthesizes a CloudFormation template.

FWIW, CDK gives you an early warning before you actually hit the limit, so you’d better tackle this as soon as possible.
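To see where the resources come from before the warning appears, you can count them in the synthesized template yourself. The following is a minimal sketch that works on an already-parsed template object. The names `CfnTemplate`, `resourceBudget`, and `countByType`, and the 80% warning threshold, are my own assumptions for illustration and are not part of CDK.

```typescript
// Minimal shape of a synthesized CloudFormation template (only what is needed here).
interface CfnTemplate {
  Resources?: Record<string, { Type: string }>;
}

// CloudFormation's per-stack resource limit mentioned above.
const MAX_RESOURCES_PER_STACK = 500;

// Hypothetical helper: report how close a template is to the limit.
// warnRatio is an assumed threshold, not the one CDK uses internally.
function resourceBudget(template: CfnTemplate, warnRatio: number = 0.8) {
  const count = Object.keys(template.Resources ?? {}).length;
  return {
    count,
    remaining: MAX_RESOURCES_PER_STACK - count,
    nearLimit: count >= MAX_RESOURCES_PER_STACK * warnRatio,
  };
}

// Hypothetical helper: count resources per CloudFormation type,
// to find which kind of resource dominates the stack.
function countByType(template: CfnTemplate): Record<string, number> {
  const resources = template.Resources ?? {};
  const counts: Record<string, number> = {};
  for (const logicalId of Object.keys(resources)) {
    const type = resources[logicalId].Type;
    counts[type] = (counts[type] ?? 0) + 1;
  }
  return counts;
}
```

You could feed it the JSON that `cdk synth` writes into `cdk.out`. A breakdown by type often reveals that roles, policies, and permissions generated by higher-level constructs make up a large share of the count.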

Refactoring a stack

To reduce the number of resources in a stack, you need to split the stack by moving some resources into a different stack. How can you do that? If you have no idea how the underlying CloudFormation works, you may just create another CDK stack and move the resources there. Copy and paste, easy peasy. What happens if you do that? In fact, you delete all the existing resources and then re-create them. That is not a big deal if you’re setting up new infrastructure, but what if it is a production service that has already been deployed and is serving traffic?

Historically, CloudFormation hasn’t solved the stack refactoring problem very well. It requires a series of changes:

  1. Add a Retain deletion policy to all the resources that you want to move.
  2. Delete those resources from the stack. Since Retain is set, this detaches them from the stack without actually deleting them.
  3. Import the existing resources into the target stack (be aware that not all resource types support import).
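Step 1 can be sketched at the template level as follows. This is only an illustration under my own assumptions: it operates on a parsed template object rather than on CDK constructs, and `RetainableTemplate` and `withRetainPolicy` are hypothetical names. `DeletionPolicy: Retain` itself is the CloudFormation resource attribute that step 1 refers to.

```typescript
// Minimal template shape for this sketch (hypothetical types).
interface RetainableResource {
  Type: string;
  DeletionPolicy?: string;
  Properties?: Record<string, unknown>;
}

interface RetainableTemplate {
  Resources: Record<string, RetainableResource>;
}

// Return a copy of the template in which the given logical IDs carry
// DeletionPolicy: Retain. The input template is not modified.
function withRetainPolicy(
  template: RetainableTemplate,
  logicalIds: string[],
): RetainableTemplate {
  // Fail early on typos: every requested ID must exist in the template.
  for (const id of logicalIds) {
    if (!(id in template.Resources)) {
      throw new Error("Unknown logical ID: " + id);
    }
  }
  const resources: Record<string, RetainableResource> = {};
  for (const id of Object.keys(template.Resources)) {
    const resource = template.Resources[id];
    resources[id] =
      logicalIds.indexOf(id) >= 0
        ? { ...resource, DeletionPolicy: "Retain" }
        : resource;
  }
  return { Resources: resources };
}
```

The policy has to be deployed to the stack first. Only after that is it safe to remove the resources from the template in step 2.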

The problem is that CDK does not offer a great solution for this procedure either. Some CDK constructs allow you to set a deletion policy, but not all of them. To retain all the resources, you have to reach into all the underlying CFN resources and set the policy there. Even then, CDK does not support resource import yet. This means you have to import the resources into a CloudFormation stack first, then import the CloudFormation template into CDK, rather than refactoring your CDK stack directly.

I tried the above approach, but finally gave up. I couldn’t find a clear and reliable way to do it without disrupting our production service. Instead, I created a new stack first, then updated all references to use the resources in the new stack, then deleted the existing resources from the old stack.

For example, imagine the following setup: a DynamoDB table’s stream triggers a Lambda consumer function, and I have to move this function out of the Lambda stack into a new stack.

If you just switch the stream to a different Lambda function, CloudFormation will create a new Lambda function and a new event source mapping to it, then delete the previous event source mapping and the old Lambda function. When I tested this, I found that some events were missed and never invoked the consumer. I expected them to land in the DLQ, from which we could redrive them, but there was no message there.

To avoid any potential event loss, I created a new stack with a second event source mapping. With this in place, each event triggers both the function in the old stack and the one in the new stack.

Our function is designed to handle such duplicated invocations, so it wasn’t a big deal. After confirming there was no event loss, I deleted the function from the old stack along with its event source mapping, which leaves only the function in the new stack consuming the stream.
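Running two consumers in parallel is only safe if processing the same event twice is harmless. One common way to get there is to deduplicate by event ID. The sketch below is not our actual implementation: `StreamRecord` and `createIdempotentConsumer` are hypothetical names, and the in-memory store is an assumption made to keep the example self-contained. A real consumer would need a durable store shared by both functions.

```typescript
// Simplified stand-in for a stream record. DynamoDB stream records carry
// an eventID; the body field here is purely illustrative.
interface StreamRecord {
  eventID: string;
  body: string;
}

// Wrap a handler so that each eventID is processed at most once.
// ASSUMPTION: the "seen" store lives in memory for illustration only.
function createIdempotentConsumer(handle: (record: StreamRecord) => void) {
  const seen: Record<string, true> = {};
  return function consume(record: StreamRecord): boolean {
    if (Object.prototype.hasOwnProperty.call(seen, record.eventID)) {
      return false; // duplicate delivery, skip it
    }
    handle(record);
    // Mark as processed only after the handler succeeds, so that a
    // failed attempt can still be retried.
    seen[record.eventID] = true;
    return true;
  };
}
```

The return value tells the caller whether the record was processed or skipped as a duplicate, which is handy for logging while both mappings are active.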

Clearly, I should have designed and split the stacks from the beginning, before running into this situation. Although refactoring is doable, it is time-consuming and risky until CDK/CloudFormation offers a better and safer way to do it.

Roles per account quota

Later, I encountered this error during a deployment.

Cannot exceed quota for RolesPerAccount: 1000 (Service: AmazonIdentityManagement; Status Code: 409; Error Code: LimitExceeded; Request ID: D39F040A-80E0-45A4-9219-B442C44F3DDC; Proxy: null)

Initially, I naively thought I could delete some unused roles. I found hundreds of roles that had been unused for more than 120 days, so I wrote a script to detach and delete the role policies and the roles **confidently**.

Then our stack deployments began to fail. Although those roles were unused (and I thought they were dangling), they were still referenced from our stacks. Simply put, I had created a silly drift by deleting them manually. Thank goodness it was a non-prod environment, and I managed to recover it.
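In hindsight, the script should have checked whether each candidate role was owned by a stack before touching it. A minimal sketch of that check follows. The input shape `StackResourceSummary` and the function `partitionRoles` are hypothetical; the sketch assumes you have already collected the resource lists of your stacks from CloudFormation, and it relies on the physical ID of an `AWS::IAM::Role` resource being the role name.

```typescript
// Hypothetical, simplified view of one resource in a deployed stack.
interface StackResourceSummary {
  stackName: string;
  resourceType: string;
  physicalResourceId: string;
}

// Split candidate role names into those owned by a stack (do not delete
// them by hand) and those that no known stack manages.
function partitionRoles(
  candidateRoleNames: string[],
  stackResources: StackResourceSummary[],
) {
  // Map role name -> owning stack, for IAM roles only.
  const owners: Record<string, string> = {};
  for (const resource of stackResources) {
    if (resource.resourceType === "AWS::IAM::Role") {
      owners[resource.physicalResourceId] = resource.stackName;
    }
  }
  const managed: { roleName: string; stackName: string }[] = [];
  const unmanaged: string[] = [];
  for (const roleName of candidateRoleNames) {
    if (Object.prototype.hasOwnProperty.call(owners, roleName)) {
      managed.push({ roleName, stackName: owners[roleName] });
    } else {
      unmanaged.push(roleName);
    }
  }
  return { managed, unmanaged };
}
```

Roles in `managed` should be removed through their stack, not by a cleanup script. Roles in `unmanaged` still deserve a second look, since the check only knows about the stacks you gave it.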

Why were so many roles created? If you don’t provide a role, most CDK constructs create a default role with the minimum privileges for the specific resource. Although this is a good practice, it creates one role per resource, which eventually runs into IAM’s roles-per-account quota of 1000 by default. The quota can be increased, but it has a hard limit of 5000.

Since it is an account-level quota, hitting it can block changes to any stack. To solve the problem, you need to consolidate those roles into shared ones and provide your own role instead of letting the CDK construct create a default one. Choose wisely: the more roles you consolidate, the coarser the privileges become from a security perspective.

Leverage cdk diff

It is common for a CloudFormation deployment to work when you deploy a stack for the first time, but to fail or lead to unexpected behavior when it updates an existing stack.

CDK offers the cdk diff command, which compares the specified stack with the deployed stack or with a local CloudFormation template. It shows IAM statement/policy changes and resource changes, along with whether removed resources would be deleted or retained. This is extremely useful for a stack refactoring or when you’re deploying a sensitive change. It is more straightforward than diffing CloudFormation templates by hand.
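To make concrete what kind of information such a diff gives you, here is a deliberately simplified sketch that compares two parsed templates by logical ID. It is not how cdk diff is implemented; `diffTemplates` and its types are hypothetical names for illustration, and the change detection is a naive JSON comparison.

```typescript
// Minimal template shape for this sketch (hypothetical types).
interface DiffResource {
  Type: string;
  DeletionPolicy?: string;
  Properties?: Record<string, unknown>;
}

interface DiffTemplate {
  Resources: Record<string, DiffResource>;
}

interface TemplateDiff {
  added: string[];
  changed: string[];
  removedAndDeleted: string[];  // gone from the template and destroyed
  removedButRetained: string[]; // gone from the template, kept in the account
}

// Compare the deployed template with the proposed one, by logical ID.
function diffTemplates(deployed: DiffTemplate, proposed: DiffTemplate): TemplateDiff {
  const diff: TemplateDiff = {
    added: [],
    changed: [],
    removedAndDeleted: [],
    removedButRetained: [],
  };
  for (const id of Object.keys(deployed.Resources)) {
    const before = deployed.Resources[id];
    if (!(id in proposed.Resources)) {
      // The deletion policy on the deployed resource decides its fate.
      const retained = before.DeletionPolicy === "Retain";
      (retained ? diff.removedButRetained : diff.removedAndDeleted).push(id);
    } else if (JSON.stringify(before) !== JSON.stringify(proposed.Resources[id])) {
      // Naive comparison: sensitive to property order, fine for a sketch.
      diff.changed.push(id);
    }
  }
  for (const id of Object.keys(proposed.Resources)) {
    if (!(id in deployed.Resources)) {
      diff.added.push(id);
    }
  }
  return diff;
}
```

Anything that lands in `removedAndDeleted` is what you want to stare at before deploying, especially when a "moved" resource shows up as one deletion plus one addition.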

You must be especially careful when you rely on a default role that a CDK construct creates internally. A change can silently wipe out a permission that your production stack depends on.

Whenever you’re replacing or deleting something in CDK, use the cdk diff command and think through whether any issue can occur when the existing resource is deleted. Sometimes you have to set an explicit dependency among the resources being created.

Test a change in non-prod

Needless to say, you must verify your CDK change in a non-prod environment before deploying it to prod. If you had to fix something manually in non-prod during or after the deployment (which means you introduced stack drift!), you’d better roll it back and fix your CDK code until it deploys successfully on its own. Once the non-prod deployment completes without any manual fix, you can go on to deploy to prod. It is recommended to minimize the differences between non-prod and prod; otherwise, you increase the chance that part of your change goes untested.

Remember, Infrastructure as Code also means that a bug in your code can break your infrastructure at run time. That is more critical than a run-time error in an application.
