My experience on Infra as Code with AWS CDK + Tips & Tricks

Jun 13, 2022

AWS CDK is a game changer for IaC on AWS. Its constructs offer a decent level of abstraction that saves DevOps engineers time compared to writing CloudFormation templates or Terraform HCL. Code written with the CDK is succinct, TypeScript and most other supported languages provide static type checking, and IDEs like VS Code offer auto-complete and inline documentation.

I started developing with the CDK a few weeks ago. The IaC for my IoT pet project first started as AWS CLI shell scripts (github repo). That approach gives great flexibility in creating resources, but it was tedious to maintain (adding new resources, updating existing ones, and deleting them). This is where the CDK comes in handy, as it quickly updates existing resources or destroys them; see AWS CDK (github repo).

Challenges

Finding the proper construct(s)

Discovering constructs is not always intuitive. You might start using one that is old or deprecated before you stumble on code samples using a newer construct. An example for me was CloudFrontWebDistribution vs Distribution. Fortunately, I found “Migrating from the original CloudFrontWebDistribution to the newer Distribution construct” to guide me through the process.

Other times you will realize constructs are missing from the CDK. AWS likely focuses first on the CLI, then the SDK, and finally the CDK, which means you might not find what you are looking for when you create resources with the latest framework(s). For example, aws iot create-keys-and-certificate does not exist in the CDK, so in this situation your options are:

Make calls with the CLI or SDK to generate the output, then upload the values where you need them (meaning the DevOps engineer running the stack will have all certificates/secrets on their computer).
Use AwsCustomResource with AwsSdkCall. Finding the proper field to retrieve, or getting the payload out of a JSON output, can be troublesome. The debugger does not help since types and methods are defined as strings. Invocations might hang then time out if you make mistakes, or throw an exception if the output is missing or too long. For example, accessing ‘Contents.0.Key’ when listing files from an empty S3 bucket will throw an error as the entry is missing, but will succeed when at least one file is present. More pain points are covered in this blog post.
Create your own AwsFunction/NodejsFunction. Functions are difficult to develop. Changing the code won’t get the function re-deployed if it was previously deployed successfully. If you want to make a change, I recommend commenting out the related block of code, deploying, uncommenting the code, making the change, then deploying again. This is tedious!
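To see why a path like ‘Contents.0.Key’ fails on an empty bucket, here is a minimal sketch in plain TypeScript that resolves a dot-separated path against a response object. It only illustrates the lookup and is not the CDK’s actual implementation; resolvePath, describeLookup and both sample responses are hypothetical.

```typescript
// Hypothetical helper: walks a dot-separated path such as "Contents.0.Key"
// through a JSON-like value and throws as soon as a segment is missing.
function resolvePath(response: unknown, path: string): unknown {
  let current: unknown = response;
  for (const segment of path.split(".")) {
    if (current === null || typeof current !== "object" || !(segment in current)) {
      throw new Error(`Path segment "${segment}" of "${path}" is missing`);
    }
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}

// Sample responses shaped like an S3 object listing (assumed for illustration).
// An empty bucket has no "Contents" entry at all.
const listingWithOneFile = { Contents: [{ Key: "index.html" }] };
const emptyListing = { KeyCount: 0 };

function describeLookup(response: unknown): string {
  try {
    return "found " + String(resolvePath(response, "Contents.0.Key"));
  } catch (e) {
    return "failed: " + (e as Error).message;
  }
}

console.log(describeLookup(listingWithOneFile));
console.log(describeLookup(emptyListing));
```

The same path works or fails depending only on the data returned at deploy time, which is why this kind of mistake is hard to catch while writing the stack.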

Note: Beware of search results on Google, as most links point to CDK v1 with the package scope “@aws-cdk”. Such pages show the notice “You are not viewing the latest version. Click here to view the v2 documentation.” in yellow at the top. For CDK v2 the package is “aws-cdk-lib”. Most of the time you can convert an import from v1 to v2 by replacing “@aws-cdk” with “aws-cdk-lib” (for example, “@aws-cdk/aws-s3” becomes “aws-cdk-lib/aws-s3”).
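The v1 to v2 package rename can be sketched as a small mapping over import specifiers. This is a hedged illustration rather than an official migration tool: toV2Import is a hypothetical helper, and it assumes stable modules only, leaving experimental “-alpha” packages (which keep the “@aws-cdk” scope in v2) untouched.

```typescript
const V1_SCOPE = "@aws-cdk/";

// Hypothetical helper: rewrites a CDK v1 module specifier to its v2 equivalent.
function toV2Import(v1Specifier: string): string {
  if (v1Specifier === "@aws-cdk/core") {
    return "aws-cdk-lib"; // core classes are exported from the package root in v2
  }
  const isV1Scoped = v1Specifier.indexOf(V1_SCOPE) === 0;
  const isAlpha = /-alpha$/.test(v1Specifier);
  if (!isV1Scoped || isAlpha) {
    return v1Specifier; // nothing to rewrite
  }
  return "aws-cdk-lib/" + v1Specifier.slice(V1_SCOPE.length);
}

console.log(toV2Import("@aws-cdk/aws-s3"));
console.log(toV2Import("@aws-cdk/core"));
console.log(toV2Import("@aws-cdk/aws-apigatewayv2-alpha"));
```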

Naming convention and structure

There is no clear guidance on how to define stacks and name them. In my case I prefixed them with the project name and put them all at the same level after the app creation. The first stack sets up the core infrastructure, then there is one stack per microservice (x5), and finally one stack finalizes the infrastructure configuration based on outputs from the previous stacks.
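As a rough sketch of that layout, the stack names could be generated from the project name and the list of microservices. Everything here is a placeholder: buildStackNames, the “iot” prefix, the service names and the “core”/“finalize” labels are assumptions for illustration, not the names used in the actual project.

```typescript
// Hypothetical helper: one core stack, one stack per microservice, one final stack,
// all prefixed with the project name.
function buildStackNames(project: string, microservices: string[]): string[] {
  const prefixed = (label: string): string => project + "-" + label;
  return [prefixed("core")]
    .concat(microservices.map((service) => prefixed(service)))
    .concat([prefixed("finalize")]);
}

const exampleServices = ["service1", "service2", "service3", "service4", "service5"];
const exampleStackNames = buildStackNames("iot", exampleServices);
console.log(exampleStackNames.join("\n"));
```

With five microservices this yields seven stack names, all sharing the same prefix so they sort together in the CloudFormation console.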

Roles and Policies

Roles are mostly handled by the CDK. Policies are created inline in IAM roles. Often the default policies will be enough to run the stack, but sometimes a failure will indicate you have to create a PolicyStatement then call addToRolePolicy, addToPolicy, addStatements, or addToResourcePolicy to apply it. As there are multiple ways to grant new statements, it is often not obvious which one to use. You will end up searching for code samples to copy/paste.

In my project I needed 13 roles:

1x CodeBuild per microservice (x5)
1x CodePipeline per microservice (x5)
3x for IoT: 2x for logging and 1x for a Rule

CDK created 32 roles for doing the same work:

1x CodeBuild per microservice (x5)
1x CodePipeline per microservice + 3x roles each to be assumed (4x5)
5x for IoT: 1x for logging + 1x for a Rule, 1x to describe IoT endpoint, and 2x for one NodejsFunction
1x S3AutoDeleteObjects for deleting objects when destroying an S3 bucket; otherwise destroying the stack fails if files are present
1x LogRetention for deleting logs from CloudWatch after a specified retention period
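The two totals can be checked with a quick tally. The counts are copied from the two lists above; only the variable names are mine.

```typescript
const microserviceCount = 5;

// Roles I needed when creating everything by hand.
const handWrittenRoles =
  microserviceCount + // CodeBuild, one per microservice
  microserviceCount + // CodePipeline, one per microservice
  3;                  // IoT: 2x logging + 1x Rule

// Roles the CDK created for the same work.
const cdkRoles =
  microserviceCount +     // CodeBuild, one per microservice
  4 * microserviceCount + // CodePipeline: 1 role + 3 roles to be assumed, per microservice
  5 +                     // IoT
  1 +                     // S3AutoDeleteObjects
  1;                      // LogRetention

console.log(handWrittenRoles, cdkRoles); // 13 32
console.log((cdkRoles / handWrittenRoles).toFixed(1)); // 2.5
```

That is roughly two and a half times as many roles to review.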

Destroying a stack

The operation might not delete all of the stack’s resources. Some, like S3 buckets & ECR repos, will remain unless their removal is explicitly requested in the code. Having a strong naming convention as suggested above will help you spot remaining resources after destroying a stack. Use a prefix rather than a suffix for this, as the CDK often truncates resource names to add random characters at the end.
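A minimal sketch of how a prefix makes leftovers easy to spot: given resource names returned by any CLI or SDK listing call, keep the ones that start with the project prefix. findLeftovers and the sample names are hypothetical.

```typescript
// Hypothetical helper: keeps only the names that start with the project prefix.
// A prefix survives even when the end of a name is truncated and randomized.
function findLeftovers(resourceNames: string[], projectPrefix: string): string[] {
  return resourceNames.filter((resourceName) => resourceName.indexOf(projectPrefix) === 0);
}

// Placeholder bucket names, as they might appear after a destroy.
const remainingBuckets = [
  "iot-core-assets9f3c2a1b",
  "unrelated-bucket",
  "iot-service1-artifacts77d0",
];

console.log(findLeftovers(remainingBuckets, "iot-"));
```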

Some constructs do not offer the possibility to remove dependencies, so destroying the stack will fail. For example, I had to use the CLI to list and delete all images from ECR repos before calling destroy on the stack. Other times the destroy will hang unexpectedly until it times out. You will have to open CloudFormation in the AWS console and check the Events tab to find where the error comes from, then manually delete resources before calling destroy again.

Unexpected behaviors

I spent many hours debugging random issues. Here are my three biggest…

When destroying stacks I stumbled upon “The bucket you tried to delete is not empty. You must delete all versions in the bucket.” even though the S3 bucket had “autoDeleteObjects: true”. It turns out that if you also have “versioned: true” set, the auto delete will not remove all Versions and DeleteMarkers. Similar story with ECR, where the resource handler returned the message “The repository with name ‘abc…’ in registry with id ‘123…’ cannot be deleted because it still contains images”. I wrote extra CLI scripts to cater for these cases.
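For the versioned bucket case, the cleanup boils down to deleting every entry under both Versions and DeleteMarkers. The sketch below is not my actual script. It assumes input shaped like the output of aws s3api list-object-versions and builds payloads in the shape accepted by aws s3api delete-objects --delete, batched because S3 accepts at most 1000 keys per delete call. The interface and function names are hypothetical.

```typescript
// Shapes assumed from the S3 ListObjectVersions output and DeleteObjects input.
interface VersionEntry { Key: string; VersionId: string; }
interface VersionListing { Versions?: VersionEntry[]; DeleteMarkers?: VersionEntry[]; }
interface DeleteRequest { Objects: VersionEntry[]; Quiet: boolean; }

const MAX_KEYS_PER_DELETE = 1000;

// Hypothetical helper: turns one listing into one or more delete payloads.
function buildDeleteRequests(listing: VersionListing): DeleteRequest[] {
  const entries = (listing.Versions || [])
    .concat(listing.DeleteMarkers || [])
    // Keep only the two fields the delete payload needs.
    .map((entry) => ({ Key: entry.Key, VersionId: entry.VersionId }));

  const requests: DeleteRequest[] = [];
  for (let start = 0; start < entries.length; start += MAX_KEYS_PER_DELETE) {
    requests.push({ Objects: entries.slice(start, start + MAX_KEYS_PER_DELETE), Quiet: true });
  }
  return requests;
}

// Placeholder listing: one object version plus one delete marker.
const sampleListing: VersionListing = {
  Versions: [{ Key: "index.html", VersionId: "v1" }],
  DeleteMarkers: [{ Key: "old.html", VersionId: "v2" }],
};

console.log(JSON.stringify(buildDeleteRequests(sampleListing), null, 2));
```

An empty or missing Versions/DeleteMarkers list simply produces no requests, so the helper is safe to run on a bucket that is already clean.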

Another one was the CustomS3AutoDeleteObjects lambda encountering Access Denied when destroying a bucket. The message was “MyStack: destroy failed Error: The stack named MyStack is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [S3XXX7DA36247, S3XXAutoDeleteObjectsCustomResource79E8]. ): Received response status [FAILED] from custom resource. Message returned: AccessDenied: Access Denied”. To grant a CloudFront Origin Access Identity read access to an imported S3 bucket, I used bucket.policy.document.addStatements(ps), but it had no effect (see issue “S3 Bucket Policy Changes Not Recognized As A Change on CDK Deploy #6548”). In other words, when you add a policy statement to an existing bucket, the method will not do anything (link). Also, reading the policy of an imported bucket returns an empty document even though a policy exists. As a workaround, I used the CDK to read the current policy and created a BucketPolicy with all statements inside.
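In plain JSON terms, the workaround amounts to rebuilding the whole policy document: the statements already on the bucket plus the new one. The sketch below shows only that merge on plain data; it is not the CDK code I used, and mergeStatements, the bucket name and the Origin Access Identity ID are placeholders.

```typescript
interface PolicyStatementJson {
  Sid?: string;
  Effect: "Allow" | "Deny";
  Principal: { AWS: string } | string;
  Action: string | string[];
  Resource: string | string[];
}
interface PolicyDocumentJson { Version: string; Statement: PolicyStatementJson[]; }

// Hypothetical helper: returns a new document and leaves the existing one untouched.
function mergeStatements(existing: PolicyDocumentJson, extra: PolicyStatementJson[]): PolicyDocumentJson {
  return { Version: existing.Version, Statement: existing.Statement.concat(extra) };
}

// Placeholder for the policy already attached to the imported bucket.
const currentBucketPolicy: PolicyDocumentJson = {
  Version: "2012-10-17",
  Statement: [{
    Sid: "ExistingStatement",
    Effect: "Allow",
    Principal: { AWS: "arn:aws:iam::111122223333:root" },
    Action: "s3:ListBucket",
    Resource: "arn:aws:s3:::example-bucket",
  }],
};

// Read access for a CloudFront Origin Access Identity (EXAMPLEID is a placeholder).
const oaiReadStatement: PolicyStatementJson = {
  Sid: "CloudFrontRead",
  Effect: "Allow",
  Principal: { AWS: "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity EXAMPLEID" },
  Action: "s3:GetObject",
  Resource: "arn:aws:s3:::example-bucket/*",
};

const mergedBucketPolicy = mergeStatements(currentBucketPolicy, [oaiReadStatement]);
console.log(JSON.stringify(mergedBucketPolicy, null, 2));
```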

When you have multiple stacks in the same app, note that the code of all declared stacks executes on synth and deploy, even for stacks you did not target. For example, suppose FirstStack and SecondStack are declared in bin/myapp.ts: if there is a bug in SecondStack, FirstStack might fail to deploy even though deploy was not called on SecondStack.
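The effect can be mimicked without the CDK at all. In the sketch below, the two classes are stand-ins (not real CDK stacks) and runApp plays the role of bin/myapp.ts: every declared stack is constructed when the app runs, so an exception in one constructor stops the whole run regardless of which stack was targeted.

```typescript
// Records which stand-in stacks were constructed successfully.
const constructedStacks: string[] = [];

class FirstStackStandIn {
  constructor() {
    constructedStacks.push("FirstStack");
  }
}

class SecondStackStandIn {
  constructor() {
    // Simulated bug in the second stack's definition.
    throw new Error("bug in SecondStack");
  }
}

// Mimics the app entry point: both stacks are instantiated before anything is deployed.
function runApp(targetStack: string): string {
  try {
    new FirstStackStandIn();
    new SecondStackStandIn();
    return "synth ok, deploying " + targetStack;
  } catch (e) {
    return "synth failed before deploying " + targetStack + ": " + (e as Error).message;
  }
}

console.log(runApp("FirstStack"));
```

FirstStack itself is constructed without any problem, yet the run still fails before it can be deployed.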

The number of CloudWatch Log Groups created is astronomical. As they are not deleted by default and their names are dynamically generated with random alphanumeric characters, clean up the groups once in a while to avoid having too many CustomS3AutoDeleteObjects and LogRetention entries in the list.
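Since the random part of the name cannot be predicted, a cleanup script has to match on the stable part. A minimal sketch, assuming you already have the list of log group names from the CLI or SDK: findHelperLogGroups is hypothetical and the sample names are made up to resemble generated ones.

```typescript
// Stable fragments that identify the helper log groups mentioned above.
const HELPER_MARKERS = ["CustomS3AutoDeleteObjects", "LogRetention"];

// Hypothetical helper: returns the log groups whose name contains one of the markers.
function findHelperLogGroups(logGroupNames: string[]): string[] {
  return logGroupNames.filter((groupName) =>
    HELPER_MARKERS.some((marker) => groupName.indexOf(marker) !== -1)
  );
}

// Made-up names for illustration only.
const sampleLogGroups = [
  "/aws/lambda/MyStack-CustomS3AutoDeleteObjectsCustomResourcePr-AbC123",
  "/aws/lambda/MyStack-LogRetentionaae0aa3c5b4d4f87b02d85b201efdd8a-XyZ789",
  "/aws/codebuild/my-service-build",
];

console.log(findHelperLogGroups(sampleLogGroups));
```

Review the resulting list before deleting anything, as the match is deliberately broad.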

Get used to seeing “MyStack failed: Error: The stack named MyStack failed creation, it may need to be manually deleted from the AWS console” as you will encounter it often. Error messages do not always help much in quickly finding the real issue. The maturity of the CDK is at times questionable.

Deployment takes time

Deploying my 7 stacks with the CDK takes 20 minutes on average.
This is 2x longer than creating the same resources with the AWS CLI.

Tips & Tricks

Keep in mind the code to create/update resources won’t execute anymore once a resource is successfully created/updated.
In relation to the point above, do not rely on the returned value of resources being created/updated; the first time the stack runs it will be fine, but the second time the value will be undefined. Relying on the CLI or SDK to get current values is the best way to avoid exceptions.
Remember that reading a value from a resource will yield a token instead of the real variable value. In many cases CLI and SDK are still needed.
Deploy your stacks often to avoid surprises
When you encounter a bug in your code during development, delete related resources then re-run the code with your fix. Otherwise, patching resources half created/deleted might lead you on a wild goose chase
Destroy your stacks just for the sake of studying the behaviors of deleting resources and re-creating them from scratch. Dependencies not defined explicitly can break the process or you might find issues when redeploying
Review automatically created roles and their inline policies to make sure they are not too permissive. Explicitly create roles with appropriate statements where default access rights are not secure enough
When you have resources shared across stacks it is often easier to import them by name or ARN instead of reusing them directly in other stacks. CDK dependency management might otherwise fail to synthesize.
Have a simple naming convention and structure
CI/CD might fail due to race conditions. For my microservices most stacks do the same thing: create an ECR (Elastic Container Registry) repository, a CodeCommit repository, a CodeBuild project, and a CodePipeline. A build will automatically start but may fail randomly because not all underlying resources are ready. Relaunching the build a minute later will succeed.
Ask questions in the aws-cdk channel of the cdk-dev Slack group if you get stuck

What is your experience?

Have you faced similar challenges when adopting the CDK? Do you have other Tips & Tricks to share with others? Has releasing multi-environment changes been easy with the framework? Can you deploy infra changes without downtime? Let me know the best practices you’ve learnt over time!