15 Hours with AWS CloudFormation

Coding a serverless NAT puncher in AWS CloudFormation — a narrated description of a real-world serverless configuration

Tim Wagner
A Cloud Guru
18 min read · Oct 1, 2019

--

I recently spent a very long 15+ hour day transcribing a working application that I had built through the AWS console UI and command line scripts into a declarative deployment using AWS CloudFormation.

The examples in the official AWS docs are usually pretty “micro”, and my application — while not large — is sufficiently complicated that it took a lot of searching and experimenting to get it all to hang together and to run inside of AWS CodeStar.

I benefitted a lot from others who had blogged solutions and tutorials, so in a desire to give back, here it is: An annotated tour through a non-trivial CloudFormation template that deploys an Amazon API Gateway websocket backed by AWS Lambda and an Amazon DynamoDB table. Think of it as “CRUD for Serverless Websockets”.

What follows is all details…if you’re looking to learn about websockets, hear why Serverless is cool, or want to read another religious treatise on the declarative versus imperative debate — these aren’t the droids you’re looking for (tm).

While this post focuses on the deployment spec per se, if you’d like to learn more about the design pattern it encodes, check out my companion piece on Serverless Mullet Architectures.

CloudFormation is Tricky

Okay, let’s get the bad news over with right away: AWS CloudFormation (aka CF) can be tricky. And verbose. And hard to debug. And while the tooling could be better (I tried and struck out at finding a VS Code extension that can validate CF references) the fact is, both the CloudFormation implementors and we users of it have some hard problems to solve.

This is true no matter what you’re deploying: In an infrastructure-based deployment, there are a lot of gritty details to deal with. But even in a Serverless deployment, we’re shifting big chunks of the application burden onto the cloud vendor, so there’s quite a bit of “configurating” to get them to do all that work for us, as we’ll see.

Including comments, this CF template is over 500 lines, longer than the code in the Lambda function it deploys. Brace yourself.

What’s Not In My CloudFormation Template

Let’s get started by looking at the “negative space” — what I didn’t put into this template and why:

One-time account-level setup
When I began, it seemed intuitive to me that I should try to “CloudFormation all the things”; in other words, create a single script that would take a fresh AWS account and turn it into one running my Serverless NAT Puncher.

Unfortunately, I quickly learned that one-time setup is tricky to include. CloudFormation has a strong bias towards building and managing deployments that can be torn down again.

That’s a critical distinction — it means that tasks like the one-time setup of an API Gateway logging service role are not well tolerated, because undoing them could render everything else in the same account unusable.

Lesson learned: Keep the content of the deployment limited to things you are comfortable deleting.

Hard to reverse (or re-run) activities
Same lesson as above, but learned trying to embed AWS Certificate Manager (aka “ACM”) into a template. I was hoping I could reference a cert’s simple name and have CF use ACM to look up its ARN (as the simple name felt a little less like a magic string), but alas this failed.

Lesson learned: Accept that there will be places where you have to keep strings consistent manually.

Test client packaging
In my original “all in one” approach I tried packaging both tests and the service they tested into a single CF template. While YMMV, for me this was a mistake — I ended up needing to tear down, restructure, and recreate the stacks for the tests far more frequently, and quickly pulled them out of the service template.

Lesson learned: Template the tests (or other swim lanes of different speeds) separately.

CodeStar (and its brethren)
I also relearned another critical limitation very quickly: AWS CodeStar only makes projects; it won’t “adopt” existing projects.

If you’re excited to use the Code* suite to build, manage, deploy, and share your projects, you have to decide that RIGHT UP FRONT, generate a CodeStar sample for one of the Lambda-based projects, and then edit the CF template it generates.

[This was the case on 9/30/2019 when I wrote this, but AWS moves quickly, so it could have changed by the time you read this.]

In any case, building the pipeline with itself is too meta to work, so you have to stand that up first, then use it for your (real) application deployment.

Let’s Get to the Code

Ok, enough preliminary stuff. Let’s get to the code that *is* in there!

Prolog: Define CF Transformations

The Transform section is pretty straightforward; just remember that the only workable way to get that AWS::CodeStar in there is to let CodeStar generate your template for you. The CodeStar transformer is supposed to keep the permission boundary up to date…more on that below, but note that it doesn’t quite work as advertised.

I’ve included the Serverless transformer so that I can use the simplified SAM syntax for items such as the Lambda function, though it doesn’t do much to lower the overall complexity of the template, and you can easily forego it here and write in “pure” CF if you prefer.
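For orientation, the prolog amounts to just two lines. A minimal sketch (the shape CodeStar generates, not a verbatim copy of my template):

```yaml
Transform:
  - AWS::CodeStar                # CodeStar transform: manages the permission boundary (mostly)
  - AWS::Serverless-2016-10-31   # SAM transform: enables the AWS::Serverless::* shorthand
```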

Parameters

The Parameters section represents the “arguments” to the CF template. I imagined that this would be tightly integrated into the Code* family — for example, I assumed that I’d get to configure a CodePipeline pipeline to be “development” (and another for “production”) and then it would in turn tell CodeBuild and CloudFormation the stage.

Alas, that doesn’t happen automatically…it’s all rather manually set, and in fact getting additional contextual information to flow between CodePipeline and CodeBuild appears to be pretty tough. Note that ProjectId and CodeDeployRole are inserted by CodeStar.
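The section ends up looking something like this (a sketch; the Stage parameter name and its default are illustrative, while ProjectId and CodeDeployRole are the ones CodeStar injects):

```yaml
Parameters:
  ProjectId:
    Type: String
    Description: CodeStar project id, used below for name mangling (inserted by CodeStar)
  CodeDeployRole:
    Type: String
    Description: ARN of the role CodeDeploy uses for canary deployments (inserted by CodeStar)
  Stage:
    Type: String
    Default: Prod
    Description: Deployment stage; set manually rather than flowing in from CodePipeline
```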

Globals section, focused on Lambda deployment.
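Roughly what CodeStar drops in (a sketch; the alias name and canary preset shown here are the usual CodeStar defaults, so treat the specifics as illustrative):

```yaml
Globals:
  Function:
    AutoPublishAlias: live                # publish a version and point the "live" alias at it
    DeploymentPreference:
      Type: Canary10Percent5Minutes       # the "canary" discussed below
      Role: !Ref CodeDeployRole           # CodeDeploy shifts alias traffic using this role
```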

The Globals section is better thought of as the “defaults” section — it establishes defaults that will apply if you don’t override them on a per-object basis. These are inserted for us by CodeStar when it creates the project, and enable incremental Lambda deployments.

The “canary” in the name hints at what this means: AWS CodeDeploy will watch the function’s metric and back out the deployment if error rates go up. You can learn more about CodeDeploy’s support for traffic shifting of Lambda deployments here.

Operational aside: This setting means that, in addition to the overhead of CodeStar, GitHub (or CodeCommit if you chose that), CodePipeline, CodeBuild, and CloudFormation, you will also have a minimum of 5 minutes for Lambda canary deployments on each successful deployment. You may want to turn this setting down (or off) during development and then reinstate it for production.

AWS Lambda Function Definition

AWS Lambda function definition
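A trimmed-down sketch of the definition (the logical names, handler, runtime, and table-name formula are illustrative stand-ins, not copied from my actual template):

```yaml
NATPunchFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: !Sub 'NATPunch-Handler-${Stage}'
    Handler: index.handler
    Runtime: python3.7
    CodeUri: src/                                  # local path; rewritten to S3 at package time
    MemorySize: 256
    Timeout: 30
    Role: !GetAtt NATPunchFunctionRole.Arn         # GetAtt for the ARN, not Ref
    Layers:
      - !Ref NATPunchPythonLayer
    Environment:
      Variables:
        TABLE_NAME: !Sub 'NATPunch-Connections-${Stage}'   # duplicates the table's name formula
```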

Yay — our first actual definition!

There’s a lot to unpack here, so we’ll take it in steps:

SAM versus CloudFormation
I used the SAM type here, as you can see from the ::Serverless:: in the type. For simple Serverless projects, SAM can save a lot of typing. In this project, there’s so much verbosity that SAM doesn’t (currently) address that using it doesn’t make a ton of difference. Still, if you want to use the SAM variants, you’ll need the transformer (see above).

Manual Name Mangling
CloudFormation will name mangle (through some unpleasant hash suffixes) the outputs it creates, and CodeStar forces a little bit of project-specific name mangling into the template, but mostly you’re on your own to ensure that names don’t collide.

I prefaced the (many) items associated with the NAT Puncher with the project name so that I could add things to this template later on without worrying about collisions. Naming your function “function” or your API “api” might seem like an easy thing to avoid, but as the number of deployable objects creeps up, it gets easy to relax the mangling.

CodeUri
This is a little confusing — in this template it’s a local file or directory. The packaging step that happens at the end of the CodeBuild phase will create another copy of this template, and in that copy the CodeUri will have been rewritten to instead refer to an object in an S3 bucket (so the term “uri” sort of makes sense, eventually).

I only had one file to include, but if you have multiple files I’d suggest putting them into a src directory and making the value of this property “src/”. Note that important trailing slash.

Layers
I chose to package my Python prerequisites into a layer. Whether this is a good idea depends on your development practices. This project started life as a prototype in which I used the API Gateway console UI to construct a web socket and typed code directly into the AWS Lambda console.

In that phase of design, not having to re-upload the Python dependencies every time I made a little tweak was a huge win. It’s less of a win when you’re going through the overhead of a full CodeStar project build, particularly if the only consumer of your layer is your own project.

Still, I find it easier to have some segregation between the “real” function and the Python modules it uses. If you do decide to keep your Python prerequisites in layers, I strongly advise writing a custom script to ZIP them up…there’s a ton of cruft you can pick up if you’re not careful (like PIP) that will really bloat your Lambda function!

My suggestion is to go through every single object in the ZIP and make sure you truly need it at runtime, rather than just zipping up whatever landed in site-packages.

Role
More on this later when we define the actual role, but this is a good place to note the difference between “ref”ing and “getattr”ing an object: The former returns its simple name but you need the latter to get at its ARN.
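Concretely, using the hypothetical role name from the sketch above:

```yaml
Role: !GetAtt NATPunchFunctionRole.Arn   # GetAtt: the role's ARN, which this property needs
# Role: !Ref NATPunchFunctionRole        # Ref: just the role's simple name, not what we want here
```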

Environment Variables
The Lambda code needs to know the name of the Amazon DynamoDB table so it can access it at runtime. We could also have ref’ed the table here, but the function doesn’t otherwise depend on the existence of the table at deployment time.

Ref’s, sub’s, and getattr’s embed a dependency DAG into your CF template. CF runs a topological sort on that DAG to solve for a viable deployment order. Here I kept it simple by avoiding a spurious additional edge in that graph, at the cost of duplicating the name mangling formula for the DB. YMMV…pick the lesser evil :).
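The two options look like this (same hypothetical names as above):

```yaml
Environment:
  Variables:
    # Option A: duplicate the naming formula (no extra edge in the deployment DAG)
    TABLE_NAME: !Sub 'NATPunch-Connections-${Stage}'
    # Option B: reference the table (one source of truth, but adds a dependency edge)
    # TABLE_NAME: !Ref NATPunchConnectionsTable
```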

Coffee Break!

Tired yet? We’re just getting started, so this may be a good time for a coffee break! Now let’s get into roles and API Gateway websockets!

Lambda Layer — the Python Modules

Lambda Layer — the Python modules.
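The layer itself, minus the bash script that filters what goes into the ZIP. A sketch (directory and layer names are illustrative):

```yaml
NATPunchPythonLayer:
  Type: AWS::Serverless::LayerVersion
  Properties:
    LayerName: !Sub 'NATPunch-PythonModules-${Stage}'
    Description: Python prerequisites for the NAT puncher
    ContentUri: dependencies/        # that easy-to-miss trailing slash
    CompatibleRuntimes:
      - python3.7
    RetentionPolicy: Delete          # old layer versions get removed rather than reused
```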

This one is pretty simple, although the challenge of filtering the layer down to just what’s required isn’t captured here — it’s in a bash script that carefully edits what gets zipped up. Note that easy-to-miss trailing slash on the directory name.

Finally, to avoid running out of space as you develop (unless you’re a “hole in one” type), you’ll probably need to set the layer retention policy to Delete. I thought this would simply reuse layer version #1 over and over again, since that’s what happens when you delete and re-upload the same version name using the Lambda UI. But instead I found it incremented the version number while eliminating the previous version.

Lambda Function’s IAM Role

The function’s role. This is a central piece of the CF template.
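Here's the shape of it (a sketch: the logical and role names are illustrative, and the permission boundary line follows the documented CodeStar naming convention rather than being copied from my template):

```yaml
NATPunchFunctionRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub 'NATPunch-LambdaRole-${Stage}'
    AssumeRolePolicyDocument:                     # the AssumeRole boilerplate
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: [lambda.amazonaws.com]
          Action: sts:AssumeRole
    ManagedPolicyArns:                            # the "meat" of the role
      - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonDynamoDBFullAccess'
      - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonAPIGatewayInvokeFullAccess'
    # CodeStar's boundary policy; the name below follows the CodeStar convention, so
    # check the IAM console for the actual policy name in your account
    PermissionsBoundary: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/CodeStar_${ProjectId}_PermissionsBoundary'
```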

Ok, so here’s one of the longer sections: The Lambda function’s IAM role. In a pure SAM template, this can be mostly invisible, but the CodeStar-required complexity causes this to get displayed in all its verbose glory:

AssumeRole
This is the boilerplate that every Lambda function needs — permission to run as the role being defined here.

Managed Policies
These are the “meat” of the role — the things the Lambda function will be able to do. AWSLambdaBasicExecutionRole enables the Lambda function to emit logs to Amazon CloudWatch, the “minimum” permissions for a normal function.

BTW, I don’t know why this one gets a “service-role” in its name, while other service-provided roles don’t. Anyone at AWS have insight into that naming convention?

The DynamoDB role is pretty self explanatory, though it’s overkill for what we need…and this is the double edged sword of managed policies: you end up using them because they’re simple, but they have to make the resource they control “*” since they’re generic, and they usually grant far more actions than are strictly required.

Both of these “far more than minimum privilege” sins are being committed here. On the plus side, the role is managed by AWS and worked on the first shot without having to write ASPEN (the IAM permission language) myself, saving me a bunch of time.

I left constructing a hand-authored, tightly scoped custom policy as a future TODO item for this project.

API Gateway invocation permission is required to use the HTTPS endpoint for asynchronous callbacks; this is part of what makes websockets more powerful than just request/response to a Lambda function, but it has to be reflected in the IAM role or it won’t be permitted.

Permission Boundary
Oh, CodeStar, you had to make things difficult, didn’t you? Permission boundaries are a security feature of AWS IAM that keep developers (or others) from “minting” permissions for themselves by creating a Lambda function, EC2 instance, or other role-assuming deployment that escalates their privileges into the AWS equivalent of superuser status.

The CodeStar transformer is supposed to watch the Lambda function role and automatically manage the permission boundary to accommodate the use of additional resources, at least according to the CodeStar docs.

The idea is to enable flexible development while keeping teams safe from harm, but I found the mechanism to be poorly documented, difficult to bootstrap, and mostly a hindrance to getting things to work, because random parts of your deployment will just silently stop working if they fall outside the permission boundary.

You can see what’s permitted by going to the IAM console and viewing the permission boundary. This was a real time burner for me…I lost a few hours to debugging this particular rough spot, especially as the permission boundary for the function and CodeBuild can be different, leading to different experiences depending on how you’re testing.

AWS::Partition versus “aws”
Some AWS regions replace “aws” in the ARN with a segregated partitioning of the AWS cloud, such as “aws-cn” (China) or “aws-us-gov” (GovCloud). Using AWS::Partition in your template instead of “aws” makes it possible to reuse it transparently in these other regions…although in some cases, as is true for APIs, legal or regulatory restrictions prevent that for other reasons.

To make it more portable, I could (should) go through this template and fix the remaining ARNs to use the partition notation, but for now I don’t have plans to deploy it to any of those areas. Copy-pasting from the AWS docs and blogs will lead to inconsistent results, which is how I got into the state of mismatched partition handling.

DynamoDB Table Definition

DynamoDB table definition, using the SAM syntax.
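A sketch (the key name is illustrative; the table name must match the formula used in the Lambda function's environment variable above):

```yaml
NATPunchConnectionsTable:
  Type: AWS::Serverless::SimpleTable
  Properties:
    TableName: !Sub 'NATPunch-Connections-${Stage}'   # keep in sync with TABLE_NAME above
    PrimaryKey:
      Name: connectionId
      Type: String
```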

This one is easy, and using AWS::Serverless::SimpleTable cuts down on the clutter a bit. See the note above about name consistency versus Ref.

By the way, I wanted my unit tests to be able to read & write actual data in the database, which created an interesting chicken & egg problem — I have to deploy this template to create the database, but normally deployment doesn’t happen unless the unit tests in the build stage succeed.

I just toggle off the unit tests in the (very rare) case where I need to adjust something in the table definition, but otherwise I’d have to make the unit tests run purely standalone and add another CodePipeline stage to do a post-deployment integration test.

Worth having eventually, but it was a level of complexity on top of the vanilla CodeStar pipeline that I got “for free”, and not a place where I initially wanted to spend a lot of my time.

Still with me? Ready for the hard part?

API Gateway Web Socket

API Gateway Web Socket, along with routes, integrations, responses, and models.
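To keep this digestible, here's a sketch of just the API itself plus one custom route and its integration (logical names, the route selection expression, and the model wiring are illustrative; the real template repeats the route/integration pattern for $connect, $disconnect, $default, and status):

```yaml
NATPunchSocket:
  Type: AWS::ApiGatewayV2::Api
  Properties:
    Name: !Sub 'NATPunchSocket-${Stage}'
    ProtocolType: WEBSOCKET                          # the only hint that this "Api" is a websocket
    RouteSelectionExpression: $request.body.action   # which message field picks the route

PairRoute:
  Type: AWS::ApiGatewayV2::Route
  Properties:
    ApiId: !Ref NATPunchSocket
    RouteKey: pair
    AuthorizationType: NONE
    OperationName: PairOperation                     # display only, as far as I can tell
    ModelSelectionExpression: $request.body.action
    RequestModels:
      pair: PairModel                                # the model by name, not by Ref (see below)
    Target: !Join ['/', ['integrations', !Ref PairIntegration]]

PairIntegration:
  Type: AWS::ApiGatewayV2::Integration
  Properties:
    ApiId: !Ref NATPunchSocket
    IntegrationType: AWS_PROXY
    IntegrationUri: !Sub 'arn:${AWS::Partition}:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${NATPunchFunction.Arn}/invocations'
```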

This one is a bear, and the snippet above doesn’t even include the deployment and stage pieces (see below for that), just the functional elements.

Let’s first talk about the structure of an AWS serverless websocket as it appears in CF:

AWS::ApiGatewayV2::Api
This is the “core” definition; here, “Api” means websocket, though you’d only know that by looking at the ProtocolType property. Note that everything to do with the websocket has to use V2 of the Amazon API Gateway API; several times I slipped up and got mysterious errors until I realized one of the pieces didn’t specify V2. Nothing will warn you explicitly, sadly.

Routes
Routes are the “commands” that a websocket knows about. API Gateway plays the role of dispatcher, framing up the data and then figuring out which Lambda function to call. $connect, $disconnect, and $default are the built-in routes, while in our case pair and status are custom routes.

Integrations
Each custom route has a required integration, which is the Lambda function to call when that route is discovered in a data frame by API Gateway. For built-in routes, an integration is optional. If API Gateway can’t otherwise find a custom route to use, it uses $default; in our case, we just hook that up to an error response.

There’s a lot of religion swirling about whether to do one-Lambda-per-route or one-Lambda-for-all-routes; I’m of the opinion that the latter is easier to manage when your functions are closely related and sharing a role makes semantic sense.

Here, all my routes are part of the “CRUD” — they all read and/or write to the shared DynamoDB table, so breaking them up doesn’t accomplish anything; it’s no faster to develop and offers no additional security, while greatly complicating matters. So, one function it is.

Responses
Integrations in websockets come in two flavors: With or without a response. With a response does pretty much what you’d expect: The Lambda function that gets called can return a value; API Gateway waits for the function to complete and then places its result on the websocket, sending it back to the client — who is presumably blocking, waiting for it, although that’s outside the scope of our discussion here.

In our case, status is a good example of a route with a response: It’s basically a REST API disguised as a pair of websocket messages. Pair is an example of a route without a response; the “answer” to the pairing request is sent asynchronously, via an API Gateway-provided callback URL.

To turn on a response for a given route, you use the expression RouteResponseSelectionExpression: $default. The $default here is a bit unintuitive, as this is basically just a Boolean switch and no other values are permitted.
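For the status route, that looks roughly like this (a sketch; StatusIntegration follows the same pattern as the pair integration above):

```yaml
StatusRoute:
  Type: AWS::ApiGatewayV2::Route
  Properties:
    ApiId: !Ref NATPunchSocket
    RouteKey: status
    RouteResponseSelectionExpression: $default       # effectively "yes, this route responds"
    Target: !Join ['/', ['integrations', !Ref StatusIntegration]]

StatusRouteResponse:
  Type: AWS::ApiGatewayV2::RouteResponse
  Properties:
    ApiId: !Ref NATPunchSocket
    RouteId: !Ref StatusRoute
    RouteResponseKey: $default
```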

Models and Well Formedness Checking

The built-in routes are parameterless, but custom routes can optionally have models, as shown here for pair and status. Models allow you to associate a JSON schema with a route, allowing API Gateway to verify the syntax of the message for you.

It adds to the complexity of your deployment script, but avoids the need for you to write syntax checking code yourself, since you can count on getting only well formed messages. I couple this with using $default as an error handler.

That way, my code only has to process expected, well-formed requests and can ignore the corner cases unless it’s dealing with $default, and in that case it can just ignore the message and tell the caller that it’s malformed. (FWIW, when I wrote this I wasn’t able to get Ref’ing of a model to work correctly, and so just used its name in the template. The route should really Ref the model.)
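Here's a sketch of a model for the pair route (the schema fields are made up for illustration; note the route above refers to it by Name, not by Ref):

```yaml
PairModel:
  Type: AWS::ApiGatewayV2::Model
  Properties:
    ApiId: !Ref NATPunchSocket
    Name: PairModel                        # routes refer to this name as a plain string
    ContentType: application/json
    Schema:
      $schema: 'http://json-schema.org/draft-04/schema#'
      title: PairRequest
      type: object
      required: [action, sessionId]        # example fields only
      properties:
        action:
          type: string
        sessionId:
          type: string
```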

Websocket Miscellaneous Items

The rest of this is straightforward, if a bit finicky. By the way, the operation name seems to just be for display purposes and doesn’t carry any semantic meaning that I could find.

The “websocket miscellaneous” items.
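A sketch of the deployment, stage, and access-log pieces (names are illustrative; the DependsOn list and the log format come up again in the next few sections):

```yaml
NATPunchDeployment:
  Type: AWS::ApiGatewayV2::Deployment
  DependsOn:                               # see "Websocket Deployment" below for why
    - PairRoute
    - StatusRoute
    - ConnectRoute                         # the built-in routes, defined like PairRoute above
    - DisconnectRoute
    - DefaultRoute
  Properties:
    ApiId: !Ref NATPunchSocket

NATPunchAccessLogs:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: !Sub 'NATPunch-AccessLogs-${Stage}'
    RetentionInDays: 30

NATPunchStage:
  Type: AWS::ApiGatewayV2::Stage
  Properties:
    ApiId: !Ref NATPunchSocket
    StageName: !Ref Stage
    DeploymentId: !Ref NATPunchDeployment
    DefaultRouteSettings:
      LoggingLevel: INFO
      DataTraceEnabled: true
    AccessLogSettings:
      DestinationArn: !GetAtt NATPunchAccessLogs.Arn
      Format: >-
        { "requestId":"$context.requestId", "routeKey":"$context.routeKey",
        "status":"$context.status", "connectionId":"$context.connectionId",
        "errorMessage":"$context.error.message" }
```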

Given how much we had in the previous block for the websocket definition, it would seem like we’d be done with it…but in fact, there’s a lot of non-structural material needed to turn the previous objects into a functioning websocket deployment.

Before we dive into the details, though, let’s talk about The Many Notions of Deployment in AWS Serverless Services:

  1. CloudFormation deployment. What we’re mostly talking about in this article — a stack (set of related resources) built by CloudFormation following the recipe in a template.
  2. API Gateway deployment. API Gateway APIs (including websockets) exist in two or more states: A development state, and one or more deployed stages. API Gateway deployments can be executed as incremental (canary) deployments, though we’re not doing that here.
  3. Lambda aliases. Strictly speaking, Lambda aliases don’t have any deployment semantics, but their typical use is to perform a controlled deployment or rollback.
  4. CodeBuild/CodePipeline. These tools manage building and releasing a deployment, typically (though not necessarily) by integrating with CloudFormation to do the actual resource construction. They can optionally be wrapped in CodeStar, as is the case in our project.
  5. CodeDeploy. CodeDeploy has a lot of infrastructure deployment uses, but in our case it’s being used to do a controlled deployment of the Lambda function after CloudFormation is complete. It relies on (3) above to handle the low-level traffic shaping.

Ok, so in our case we have a CodeStar (4) team project consisting of CodeBuild (4), CloudFormation (1), and CodeDeploy (5) running inside a CodePipeline (4) that includes a Lambda canary deployment (5) using traffic shaping (3) and a production stage deployment (2) of the API Gateway websocket as deployed by CloudFormation (1). Whew.

Websocket Deployment
Back to our regularly scheduled program: The websocket deployment is straightforward except that it includes a bunch of DependsOn statements.

Now, I’m going to say that this feels wrong to me — the value of a declarative specification should be that it gets things correct without having to encode imperative notions, such as what depends on what.

In almost all cases, CF gets this right — the service object types in CF “know” their deployment semantics, and occurrences of Ref and GetAttr supply the required additional dependency information for CF to construct a valid total ordering on deployment sequencing.

But here, none of the usual mechanisms applies, and the construction order information can get lost, leading to race conditions. Hopefully this will get cleaned up in a future release and the DependsOns can be removed.

Logging

Here’s the requisite Medium blog post sentence imploring you to turn on logging: Please…turn…on…logging.

The syntax for access logging is finicky, but you can see a working example here, and it’s worth it, especially for websockets. When a request fails to get through, it’s really useful to be able to determine how it failed — authorization, model validation, integration, etc.

The downside to having so much done for you is that you really need the added transparency when it all goes to hell. Unlike Lambda, you won’t get logging for your websocket unless you ask for it.

You also need to make sure you’ve set up your account to enable API Gateway with a logging role — that only needs to be done once per account, versus in each application’s deployment template, but without it you’ll get nothing when you go to CloudWatch.

One other minor point: Items like the access logging format here really benefit from being able to split a literal string in your template over multiple lines. You can either copy this verbatim from a known working example or get comfortable with the nitty gritty of YAML lexical rules.

Custom Domain Handling

If you’re content with the API Gateway-provided moniker for your websocket (which is probably fine if it’s an internal service), skip this section. But if you expect customers or partners to deal with it, you probably want some kind of vanity URL that’s a little easier to remember than a randomly generated GUID.

Custom domains bring together six pieces of information: The domain and subdomain, the API (in our case, a websocket), the path, the cert, and the deployment stage.

Note that the domain (and optional subdomain) show up as a literal string when you create the AWS::ApiGatewayV2::DomainName, but AWS::ApiGatewayV2::ApiMapping refers to the object…even though both call this property “DomainName”.
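A sketch of both resources (the domain, mapping key, and certificate ARN are placeholders; note the literal-versus-reference asymmetry on DomainName):

```yaml
NATPunchDomain:
  Type: AWS::ApiGatewayV2::DomainName
  Properties:
    DomainName: ws.example.com                       # literal string here...
    DomainNameConfigurations:
      - EndpointType: REGIONAL
        CertificateArn: arn:aws:acm:us-west-2:111111111111:certificate/EXAMPLE   # placeholder; kept consistent by hand

NATPunchMapping:
  Type: AWS::ApiGatewayV2::ApiMapping
  Properties:
    DomainName: !Ref NATPunchDomain                  # ...but a reference to the object here
    ApiId: !Ref NATPunchSocket
    Stage: !Ref Stage                                # stage name; must match the StageName used above
    ApiMappingKey: natpunch                          # the "path" piece of the puzzle
```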

Domain name certification and DNS testing is a joy all its own; suffice it to say you want that fully working and debugged before attempting to use a custom domain in a CF template.

Last, but not least, you need bidirectional permissions between the Lambda function serving as the integration and the websocket attempting to integrate with it.

There are two sides to this:

Side #1: API Gateway calling Lambda
This is the role (no pun intended) of the AWS::Lambda::Permission object; it adds a resource policy that allows the NATPunch websocket to call the Lambda function serving as the integration target.

To keep it from being more verbose, we use a “*” for the route name. That’s very different than using a “*” for the entire thing — we want calls restricted to this specific API, but we know all routes (current and future) will go to this Lambda function.

By the way, if you want to list these ARNs explicitly, note that you can’t GetAttr them (AWS folks, note this bug!), so you’ll have to construct them manually, and there is an actual, unavoidable asterisk even in the full ARN. Seems like a questionable idea for something that has to wind up in an ASPEN statement, but there it is.
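Here's roughly what that permission looks like (a sketch; the hand-built SourceArn is the part being grumbled about above):

```yaml
NATPunchInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:InvokeFunction
    FunctionName: !Ref NATPunchFunction
    Principal: apigateway.amazonaws.com
    # Hand-constructed ARN: restricted to this API, but any stage and any route ("*")
    SourceArn: !Sub 'arn:${AWS::Partition}:execute-api:${AWS::Region}:${AWS::AccountId}:${NATPunchSocket}/*/*'
```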

Side #2: Lambda calling API Gateway
Synchronous websocket responses don’t require any additional permissions; the Lambda function’s response is automatically turned into a websocket message by API Gateway, which is authorized by (1).

But an asynchronous message sent back to a client requires a Lambda function to use the callback endpoint provided by the websocket for this purpose. Remember the !Sub ‘arn:${AWS::Partition}:iam::aws:policy/AmazonAPIGatewayInvokeFullAccess’ in the Lambda function’s role? That’s where this magic happens.

CloudFormation Outputs

CF Outputs — a useful debugging aid.
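A sketch of the kind of outputs worth emitting (the names and the exact set are illustrative):

```yaml
Outputs:
  WebSocketURI:
    Description: The wss endpoint clients connect to
    Value: !Sub 'wss://${NATPunchSocket}.execute-api.${AWS::Region}.amazonaws.com/${Stage}'
  FunctionName:
    Description: The (mangled) name CloudFormation gave the Lambda function
    Value: !Ref NATPunchFunction
  ConnectionsTableName:
    Description: The DynamoDB table the tests read and write
    Value: !Ref NATPunchConnectionsTable
```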

The outputs section is super useful for two reasons: Diagnostics and downstream activities. You may need these if subsequent steps in a CodePipeline need to know the names of things that got created…especially the names that get mangled by CF, which you can’t otherwise capture in a build script since you won’t be able to predict them in advance.

Parting Advice

And that brings us to nearly the end of our story.

Two parting bits of advice:

  1. First, use the CloudFormation object name to search for documentation (so, “AWS::ApiGatewayV2::Stage”, not “API Gateway stage”) to avoid searching random pages of unrelated docs.
  2. Second, it can be easier to create a working application (or at least pieces of it) outside of CF, and then “transcribe” that app to CF. This lets you use the AWS console UI, AWS CLI, or any of the several AWS SDKs to see how AWS thinks your app is configured. Often, one or more of those will display enough to help you correctly guess how to configure your CF properties to recreate the same behavior.

If you made it all the way to the end, congrats, and happy CloudFormation construction!
