Interactive Canary Deployments with Lambda functions on AWS

Published in Fender Engineering, May 13, 2019

We all know the feeling — work on an application for weeks, make changes to hundreds of lines of code, and then finally, one day, QA gives us the blessing to push that code to production.

Oh, production! With real users, doing real-user-y things, that no trained engineer would ever think of doing — because engineers are trained to click through the workflows they designed for, not the ones that an untrained user would explore.

If everything goes well, we rejoice, and replenish our coffees before diving into the next battle in the war against entropy.

If things DON’T go well, however, it would behoove us to put things back to how they were before we deployed our fancy new code, and get production back up and running again.

This, while sounding simple, is not always very easy. At Fender, we can resort to the most basic kind of roll-back — re-running a prior build in our CI tool of choice (CircleCI). However, this takes a few minutes. Could we find a way to roll things back faster?

Well, the answer is — kind of.

AWS Lambda functions have a nifty feature called “aliases”. An alias is a named pointer to a specific published version of a function. That way, you can refer to the function in the format <ARN>:<alias>, and invoke whatever version the alias happens to be pointing to.

Another interesting feature is the ability to distribute invocations across two different versions — thereby enabling you to test the waters before plunging in head-first. Deploy a new version, update the alias to point 10% of incoming traffic to the new version, and see what happens. If everything works fine, point the alias 100% to the new version. Profit.
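To get a feel for what a 10% weight means in practice, here is a toy simulation of weighted routing between two versions. It is only a model of the behaviour described above, not how AWS implements it; the version numbers "7" and "8" and the function names are made up for illustration.

```python
import random
from collections import Counter


def pick_version(primary, additional_weights, rng):
    """Pick the version that serves one invocation.

    additional_weights maps a version to the fraction of traffic (0.0 to 1.0)
    it should receive; everything else goes to the primary version.
    """
    roll = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for version, weight in additional_weights.items():
        cumulative += weight
        if roll < cumulative:
            return version
    return primary


def simulate(primary, additional_weights, invocations, seed=0):
    """Count how many invocations each version would serve."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return Counter(
        pick_version(primary, additional_weights, rng) for _ in range(invocations)
    )


# Alias points at version 7, with 10% of traffic shifted to version 8.
counts = simulate("7", {"8": 0.10}, invocations=10_000)
print(counts)
```

With an empty weights map, every invocation lands on the primary version, which is what makes the roll-back described later so cheap.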

All this sounds interesting. But we have microservices with upwards of 70 lambda functions. We’d have to hire a highly trained monkey to race through all those functions, updating each alias to the correct version. And highly trained monkeys are hard to come by, and tend to get bored by such demeaning tasks. So we had to invent a better way.

AWS CodeDeploy

As a side note, CodeDeploy allows automation of canary deployments, and rolling canaries (linear deployments), leveraging the underlying aliasing mechanisms outlined in this post. We looked into that, but found that we’d need one CodeDeploy configuration for each lambda function in our app(s). That seemed like overkill. In addition, the interactivity afforded by the AWS console would have required giving developers and others access to parts of the infrastructure that we weren’t comfortable exposing. More info here: https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html

Requirements

Allow quick roll-back of code, and limited testing of new code
Allow developers to approve / roll-back code without requiring AWS privileges
Leave a bread-crumb trail of approvals/roll-backs
Allow pre-approval of canary deployments on lower environments (dev/qa), without special-casing production infrastructure

At this point, we had decided to go ahead with using lambda aliases to implement canary deployments. This was the rational thing to do. However, the user interaction and some other aspects still needed to be built out. Here’s how we chose to go about that:

Quick roll-back and limited testing of new code:

AWS allows aliases to have a routing configuration, with the “AdditionalVersionWeights” parameter specifying which additional lambda version gets what percentage of traffic. This page documents it well: https://docs.aws.amazon.com/lambda/latest/dg/lambda-traffic-shifting-using-aliases.html

Using this, if the main version the alias points to is kept the same, and only the routing configuration’s AdditionalVersionWeights parameter is updated with the new version, rolling back is simply a matter of setting AdditionalVersionWeights to an empty map {}. Roll-back is instant when this is done, at a per-lambda level.

Roll-forward is done by setting the main version of the alias to the new version (as returned by publishVersion), and setting AdditionalVersionWeights to {}.
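The three alias operations described above (start a canary, roll back, roll forward) can be sketched as follows. This is a minimal sketch, not our actual build script: it assumes `client` is a boto3 Lambda client (or anything with a compatible `update_alias` method), and the function name "my-function" and alias "live" are hypothetical. A small recording stand-in is used here so the example runs without AWS credentials.

```python
def start_canary(client, function_name, alias, new_version, weight):
    """Send `weight` (0.0 to 1.0) of traffic to new_version.

    The alias' main version is left untouched, so rolling back later
    only requires clearing the additional weights.
    """
    return client.update_alias(
        FunctionName=function_name,
        Name=alias,
        RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
    )


def roll_back(client, function_name, alias):
    """Clear the additional weights; all traffic returns to the main version."""
    return client.update_alias(
        FunctionName=function_name,
        Name=alias,
        RoutingConfig={"AdditionalVersionWeights": {}},
    )


def roll_forward(client, function_name, alias, new_version):
    """Make new_version the alias' main version and clear the weights."""
    return client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=new_version,
        RoutingConfig={"AdditionalVersionWeights": {}},
    )


class RecordingClient:
    """Dry-run stand-in for a boto3 Lambda client; it only records calls."""

    def __init__(self):
        self.calls = []

    def update_alias(self, **kwargs):
        self.calls.append(kwargs)
        return kwargs


# Hypothetical names; "8" stands for the version returned by publishVersion.
client = RecordingClient()
start_canary(client, "my-function", "live", "8", 0.10)
roll_back(client, "my-function", "live")
roll_forward(client, "my-function", "live", "8")
```

In a real build, `client` would be created with `boto3.client("lambda")`, and the same three functions would be called once per lambda function in the service.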

We added an environment variable that configures how much traffic is directed to the new version before approval/roll-back, and called it NEW_VERSION_WEIGHTAGE. It can be set to any value from 0 to 1 (0% to 100%), and the build uses it directly as the weight in the AdditionalVersionWeights parameter.
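Reading and validating that variable might look something like the sketch below. The fallback of 0.1 when the variable is unset is an assumption made for the example, as are the function names; only NEW_VERSION_WEIGHTAGE and AdditionalVersionWeights come from the setup described here.

```python
import os


def read_new_version_weight(environ=None, default=0.1):
    """Return NEW_VERSION_WEIGHTAGE as a float between 0.0 and 1.0.

    The default used when the variable is unset is an assumption.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NEW_VERSION_WEIGHTAGE")
    if raw is None or raw.strip() == "":
        return default
    weight = float(raw)  # raises ValueError for non-numeric input
    if not 0.0 <= weight <= 1.0:
        raise ValueError(f"NEW_VERSION_WEIGHTAGE must be in [0, 1], got {raw!r}")
    return weight


def routing_config(new_version, weight):
    """Build the alias routing configuration for the given weight."""
    if weight == 0.0:
        # No canary traffic at all: same shape as a rolled-back alias.
        return {"AdditionalVersionWeights": {}}
    return {"AdditionalVersionWeights": {new_version: weight}}


weight = read_new_version_weight({"NEW_VERSION_WEIGHTAGE": "0.25"})
print(routing_config("8", weight))
```

Failing the build on an out-of-range value is a deliberate choice in this sketch: a typo in the weight is better caught before any alias is touched.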

Developer interaction without requiring AWS privileges:

AWS’ API Gateway is quite an awesome service. In addition to fronting flocks of lambda functions, it also allows users to interact directly with a multitude of AWS services through HTTP endpoints, including DynamoDB.

Since we wanted developers to be able to click a button and approve/roll-back code, API Gateway was chosen as the tool to provide that link to be clicked. We needed to percolate this click through to the build that was waiting to be notified, however, so we chose to store the results of the click in DynamoDB, which API Gateway can talk to directly, as shown below:

Create a new resource named “roll_back”, and add a GET method under it
Under Integration Request, select AWS Service as the Integration Type
Select DynamoDB as the AWS Service, and PutItem as the Action
Add the below (customized as needed) as the Mapping Template:

{
  "TableName": "blah-foo-lambda_canary_tracker",
  "Item": {
    "repository": {
      "S": "$input.params('repository')"
    },
    "build_no": {
      "N": "$input.params('build_no')"
    },
    "build_status": {
      "S": "roll_back"
    }
  }
}

When https://canaries.example.com/canary-roll-back?repository=my-app&build_no=760 is clicked, the above Mapping Template takes the “repository” and “build_no” values from the query parameters of the link, and directly writes “roll_back” to DynamoDB.

The build scripts can now query DynamoDB with the build number and repository name, and check whether the build needs to be rolled forward, or rolled back.
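A polling loop along these lines is one way the build side could wait for a decision. It is a sketch under assumptions: the table is assumed to be keyed on `repository` (partition key) and `build_no` (sort key), the timeout and poll interval are arbitrary, and `client` is assumed to be a boto3 DynamoDB client. A stub stands in for DynamoDB so the example runs on its own.

```python
import time


def fetch_build_status(client, table, repository, build_no):
    """Return the stored build_status for this build, or None if no row exists yet."""
    response = client.get_item(
        TableName=table,
        # Assumed key schema: repository (S) + build_no (N).
        Key={"repository": {"S": repository}, "build_no": {"N": str(build_no)}},
        ConsistentRead=True,
    )
    item = response.get("Item")
    return item["build_status"]["S"] if item else None


def wait_for_decision(client, table, repository, build_no,
                      timeout_s=900, poll_s=10, sleep=time.sleep):
    """Poll until someone clicks a link, or give up and return None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_build_status(client, table, repository, build_no)
        if status is not None:
            return status
        if time.monotonic() >= deadline:
            return None
        sleep(poll_s)


class StubDynamo:
    """Dry-run stand-in: no row for the first `misses` reads, then a roll_back row."""

    def __init__(self, misses):
        self.misses = misses
        self.reads = 0

    def get_item(self, **kwargs):
        self.reads += 1
        if self.reads <= self.misses:
            return {}
        return {"Item": {"build_status": {"S": "roll_back"}}}


stub = StubDynamo(misses=2)
decision = wait_for_decision(
    stub, "blah-foo-lambda_canary_tracker", "my-app", 760,
    sleep=lambda seconds: None,  # skip real sleeping in the dry run
)
print(decision, "after", stub.reads, "reads")
```

What the build does on a timeout (treat it as a roll-back, or keep waiting) is a policy decision this sketch leaves to the caller.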

We chose to have the build scripts post the roll-forward/roll-back links in a separate Slack channel we created, and also added an access policy to API Gateway, so only employees with Slack access, and from specific networks, would be able to click the links and have them do anything useful.
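Generating the links for a given build is mostly a matter of encoding the two query parameters. In this sketch, the roll-back path follows the example URL above, while the approve path (`/canary-approve`) and the message wording are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://canaries.example.com"


def decision_links(repository, build_no, base_url=BASE_URL):
    """Build the roll-back and approve links for one build."""
    query = urlencode({"repository": repository, "build_no": build_no})
    return {
        "roll_back": f"{base_url}/canary-roll-back?{query}",
        # Hypothetical path; use whatever resource the approve method lives under.
        "approve": f"{base_url}/canary-approve?{query}",
    }


def slack_message(repository, build_no):
    """Plain-text message a build script could post to the canary channel."""
    links = decision_links(repository, build_no)
    return (
        f"Canary for {repository} build {build_no} is live.\n"
        f"Approve: {links['approve']}\n"
        f"Roll back: {links['roll_back']}"
    )


links = decision_links("my-app", 760)
print(slack_message("my-app", 760))
```

Using `urlencode` rather than string concatenation keeps repository names with unusual characters from breaking the link.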

Of course, it is nice to get feedback when you click a link. We achieved this by adding an Integration Response to the GET method(s) that returns a suitably pretty HTML page with a 200 response.

Bread-Crumbs:

Since the API Gateway links directly write to DynamoDB, the table stores which builds were approved, and which were rolled back. In addition, knowing the repository name and build number (from DynamoDB), it is possible to view the build on the CI tool, and see what took place.

Pre-Approval:

Though the canary deployment functionality is easy to use, it can still get in the way of developers in a tight code/deploy/test feedback loop in dev and QA environments. However, we didn’t want to have differently-designed infrastructure on lower environments, since that could hide bugs that would show on production.

This was remedied by the addition of a simple, configurable environment variable that defaulted to PRE_APPROVE=true for lower environments, and PRE_APPROVE=false for production. When PRE_APPROVE was true, the build scripts automatically behaved as if a human had clicked the Approve link in the Slack channel — thereby giving the canary management infrastructure a workout identical to production, in lower environments, as well.
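The pre-approval switch can be as small as the sketch below. The accepted truthy spellings, the "approved" status string, and the choice to treat an unset variable as false are assumptions made for the example; `wait_for_human` is a hypothetical callable standing in for whatever waits on the Slack links.

```python
import os

TRUTHY = {"1", "true", "yes"}


def is_pre_approved(environ=None):
    """True when PRE_APPROVE is set to a truthy value.

    An unset variable is treated as false here, the safer default.
    """
    environ = os.environ if environ is None else environ
    return environ.get("PRE_APPROVE", "false").strip().lower() in TRUTHY


def resolve_decision(environ, wait_for_human):
    """Return the decision for this build.

    On pre-approved environments the human step is skipped, but the caller
    still goes through the same roll-forward path as production.
    """
    if is_pre_approved(environ):
        return "approved"
    return wait_for_human()


# Lower environment: approved without anyone clicking.
print(resolve_decision({"PRE_APPROVE": "true"}, wait_for_human=lambda: "roll_back"))
# Production: the human decision wins.
print(resolve_decision({"PRE_APPROVE": "false"}, wait_for_human=lambda: "roll_back"))
```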

We’ve had this running in our setup for the past month or so, and seem to be seeing good results, though we’ve not had a production roll-back (yet!). We did choose to initially set the NEW_VERSION_WEIGHTAGE value to 1 (100%), until we feel confident in the workings of the canary setup.

Schema/Data roll-backs:

One challenge we have not been able to crack is how to revert to older data schemas if we roll back the code in the functions. Currently, this is managed at a human level: if a deployment contains changes to the database schema, we cannot roll back using the canary methods, and we cannot test just a percentage of the new code during the canary testing period, either.

Rolling out Canary Deployments:

Once you have created the lambda aliases that point to the versions you want, you need to ensure that the functions are invoked using an ARN that contains the alias, in the format <lambda_ARN>:<alias_name>.
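As an illustration of the format, the helper below appends an alias to a function ARN and refuses ARNs that already carry a different qualifier. It is a hypothetical helper, not part of any AWS SDK, and the account ID, region, function name, and alias "live" are placeholders.

```python
def qualified_arn(function_arn, alias):
    """Return function_arn in the form <lambda_ARN>:<alias_name>.

    Unqualified Lambda ARNs have 7 colon-separated parts
    (arn:partition:lambda:region:account:function:name); a qualified one has 8.
    """
    parts = function_arn.split(":")
    is_lambda_arn = (
        len(parts) in (7, 8)
        and parts[0] == "arn"
        and parts[2] == "lambda"
        and parts[5] == "function"
    )
    if not is_lambda_arn:
        raise ValueError(f"not a Lambda function ARN: {function_arn!r}")
    if len(parts) == 7:
        return f"{function_arn}:{alias}"
    if parts[7] == alias:
        return function_arn  # already qualified with this alias
    raise ValueError(f"ARN already qualified with {parts[7]!r}, expected {alias!r}")


arn = "arn:aws:lambda:us-east-1:123456789012:function:my-function"
print(qualified_arn(arn, "live"))
```

A check like this is also handy in reverse: any ARN that still has only 7 parts is one that bypasses the alias entirely.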

We combed through our Terraform code, and updated every reference to a lambda (in SNS topics, DynamoDB streams, API Gateway definitions etc.) to add :<alias_name> to the function ARNs. This was quite a tiresome, manual, one-time process, and we will not mention what happened when 4 ARNs were NOT updated with the alias name. Hint: it wasn’t very good.

We also created the alias for every lambda function, in Terraform, and initially pointed it to the $LATEST “special” version (which AWS interprets as the most recently uploaded, unpublished code of the function). This would allow us to set up the infrastructure to be alias-aware, while not necessarily having to run a deployment for every app across every environment.

After this, any deployment would modify the alias to point to a numeric version (published during deployments using the publishVersion AWS Lambda API call), instead of $LATEST. And our canaries would fly.

To sum up:

Build scripts publish new versions, then update lambda alias configuration with version number and additional (new) version weightage
API Gateway provides links to developers to approve/roll-back
API Gateway writes to DynamoDB on link being clicked
Build scripts query DynamoDB for approval/roll-back status, and act accordingly
Lower environments are considered “pre-approved”, and go through the same workflows as production, but automatically.

As always, we’d love to hear back if you enjoyed this read, especially if you have feedback of any sort. Of course, software is always a work in progress, and every self-respecting engineer cringes at what he wrote a scant year ago — but today, this appears to be working well, for us.

Time to turn more coffee into code.