Supercharge Your Lambda Deployments With Kubernetes

Arash Outadi
Published in motive-eng · Feb 1, 2022 · 7 min read

Read this blog to learn how KeepTruckin makes declarative Lambda deployments with a Kubernetes Custom Controller.

Background

At KeepTruckin, we make extensive use of AWS Lambda because of its ease of use, inexpensive pricing, and scaling capabilities. Currently, we have more than 150 Lambdas. In particular, our IoT team uses Lambdas as part of its core infrastructure. They power the delivery of hundreds of thousands of over-the-air firmware upgrades, not to mention tens of thousands of device configuration tasks, every day!

We tend to make frequent code updates to our Lambdas. Naturally, we need the process of deployment, monitoring, and rollbacks to be as frictionless as possible. More recently, as the team grew and we continued to add new product features at a rapid pace, the need to scale our Lambda deployment process became obvious.

KeepTruckin Development Process

At KeepTruckin, development largely occurs within the KT Monorepo (KTMR), where many of our application-level services are built using Golang. This architecture provides a really nice developer experience: applications can easily share large portions of code, while unrelated services can still be deployed separately.

For managing builds, we use a tool called Bazel. We tell Bazel our code, application, and library dependencies, and it makes sure that only the required applications and libraries are rebuilt. This ensures reproducible results and less time wasted rebuilding the entire codebase.

Continuous Deployment of Golang services is done through ArgoCD, a declarative GitOps tool for deploying applications to Kubernetes. This means that just like we maintain a history of code changes, we can also maintain a history of application deployments.

The Problems

1. Lambda Deployments Require a Separate Deployment Pipeline

Bazel and ArgoCD make deploying a Go service a breeze. Deployments are done in a declarative manner: the user updates the manifest tracked by ArgoCD, and Kubernetes takes care of the rest.

However, deploying Lambdas requires manually updating the resulting build file and applying it via Terraform.

Having a separate deployment process for Lambda functions is not a great developer experience, because a single line of code changed in the monorepo (e.g., in a library used by both an AWS Lambda and a Go service) may require deployments through two different pipelines: the ArgoCD service deployment and the Terraform-based AWS Lambda deployment pipeline.

2. Deploying Multiple Functions is Manual and Tedious

A single code change in a monorepo may affect multiple Lambdas, requiring the developer to run a series of commands to deploy each affected one. Many companies, KeepTruckin included, create separate Lambdas for each environment (e.g., staging and production), which means a developer has to upload a zip file to every affected function in every environment.

Additionally, Lambda functions that pull in several libraries can have pretty large code footprints when packaged for upload to AWS. Taken together, these issues make deploying Lambda functions a very tedious and time-consuming task.

3. Lack of Observability and Rollback Friction

The manual deployment process used for AWS Lambda makes rolling back bad deployments error-prone. It’s also a problem from a code observability standpoint: without the tooling we have in the rest of the KeepTruckin infrastructure (e.g., ArgoCD), it can be difficult to identify which previously deployed version of the code was stable while trying to correct a “bad” one.

The Solution: Kubernetes Controller and CustomResourceDefinition

To address these issues, we needed an adapter between Kubernetes (on which our infrastructure is built) and AWS Lambda (where we deploy Lambdas).

An elegant way to build this adapter is to use a custom Kubernetes controller in combination with a CustomResourceDefinition (CRD) that represents a LambdaDeployment in Kubernetes. This is preferable to many other methods discussed in the Alternatives section because it takes advantage of the extensibility of Kubernetes, while eliminating the need to make any additional changes to our deployment process.

CustomResourceDefinition

A CustomResourceDefinition, or CRD, is an object definition that extends the Kubernetes API. You can interact with custom resources much like you would with pods (or deployments or services), but they aren’t very useful until a custom controller takes action based on their state. In our case, we defined a LambdaDeployment object to track the state of our AWS Lambdas.

Controller

A controller is simply an application inside the Kubernetes cluster that can watch and interact with the Kubernetes API. In our case, our controller watches our brand new LambdaDeployment CRD and makes calls to AWS when a new Lambda should be deployed.

Target State

Let’s break down the flow illustrated above. First, the developer finishes work on the source code of an AWS Function and pushes their changes to GitHub. This triggers a CI job that:

  • Compiles the source code and builds a zip file meant for the AWS Function
  • Uploads the zip file to S3 with a reproducible key (e.g., a Git commit SHA)
  • Generates a manifest (CRD) and applies it to the Kubernetes cluster

The job manifest might look like this:
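As an illustrative sketch only (the API group, resource name, bucket, and key below are placeholders rather than the exact schema we use):

```yaml
apiVersion: lambda.keeptruckin.com/v1alpha1   # placeholder API group/version
kind: LambdaDeployment
metadata:
  name: firmware-ota-scheduler
spec:
  functionARN: arn:aws:lambda:us-west-2:123456789012:function:firmware-ota-scheduler
  s3Bucket: kt-lambda-artifacts
  s3Key: firmware-ota-scheduler/3f2a9c1.zip   # reproducible key derived from the Git commit SHA
  gitSHA: 3f2a9c1
```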

Note: If you use Docker images to deploy Lambdas, you can replace S3 with ECR.

A custom Kubernetes controller watches the LambdaDeployment CRDs. Using the AWS SDK, it calls the UpdateFunctionCode API with the respective parameters (FunctionARN, S3Bucket, and S3Key).

The function is now deployed on AWS.

Implementation

Build and Upload to S3

We want to upload a “ready to deploy” zip file to S3 so that we can use the UpdateFunctionCode API to update the function.

This is a good candidate for a task that should live in your CI script after the code has passed testing. It might look something like this if you use the AWS CLI and some bash:
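A minimal sketch of that step follows; the function name, bucket, and build output path are placeholders to adapt to your own setup:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: adjust the function name, bucket, and build output path.
FUNCTION_NAME="firmware-ota-scheduler"
ARTIFACT_BUCKET="kt-lambda-artifacts"
GIT_SHA="$(git rev-parse --short HEAD)"

# Package the compiled binary into a zip that Lambda can execute.
zip -j "${FUNCTION_NAME}.zip" "bazel-bin/lambdas/${FUNCTION_NAME}/${FUNCTION_NAME}"

# Upload with a reproducible key so any commit can be (re)deployed later.
aws s3 cp "${FUNCTION_NAME}.zip" "s3://${ARTIFACT_BUCKET}/${FUNCTION_NAME}/${GIT_SHA}.zip"
```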

CRD and Controller

Building the custom Kubernetes controller is a bit more involved. Using something like the Operator Framework or Kubebuilder can simplify things. At KeepTruckin, we went with the Operator Framework because it was easier to set up and deploy.

LambdaDeployment Struct (CRD Generation): Below is the struct definition that we use to generate the CRD using controller-gen.
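Here is a trimmed-down sketch of that struct. The group, version, and field names are illustrative, and the real definition carries additional validation markers:

```go
// Package v1alpha1 contains the LambdaDeployment API types.
// (The group, version, and field names here are illustrative.)
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LambdaDeploymentSpec declares which function to deploy and where its
// packaged code lives in S3.
type LambdaDeploymentSpec struct {
	// FunctionARN identifies the AWS Lambda function to update.
	FunctionARN string `json:"functionARN"`
	// S3Bucket is the bucket holding the packaged code artifact.
	S3Bucket string `json:"s3Bucket"`
	// S3Key is the reproducible object key, derived from the Git commit SHA.
	S3Key string `json:"s3Key"`
	// GitSHA records the commit the artifact was built from, used for the
	// synchronization checks described later.
	GitSHA string `json:"gitSHA"`
}

// LambdaDeploymentStatus captures what the controller last applied.
type LambdaDeploymentStatus struct {
	// DeployedS3Key is the artifact key most recently pushed to AWS.
	DeployedS3Key string `json:"deployedS3Key,omitempty"`
	// LastUpdated is when the controller last called UpdateFunctionCode.
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// LambdaDeployment is the Schema for the lambdadeployments API.
type LambdaDeployment struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   LambdaDeploymentSpec   `json:"spec,omitempty"`
	Status LambdaDeploymentStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// LambdaDeploymentList contains a list of LambdaDeployment.
type LambdaDeploymentList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []LambdaDeployment `json:"items"`
}
```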

Controller Reconcile Loop: The Reconcile handler is the function that tries to “move the current state of the cluster to the desired state” in the spec. Here is a simplified version of the Reconcile handler for the LambdaDeployment Controller:
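The sketch below shows the general shape; error handling, status updates, and the synchronization checks discussed later are omitted, and the module path and AWS client wiring are assumptions:

```go
package controllers

import (
	"context"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/lambda"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder module path for the generated API types shown above.
	lambdav1alpha1 "github.com/example/lambda-operator/api/v1alpha1"
)

// LambdaDeploymentReconciler moves AWS Lambda toward the state declared in
// LambdaDeployment resources.
type LambdaDeploymentReconciler struct {
	client.Client
	LambdaClient *lambda.Lambda // AWS SDK client, wired up at controller startup
}

func (r *LambdaDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the LambdaDeployment that triggered this reconcile.
	var deployment lambdav1alpha1.LambdaDeployment
	if err := r.Get(ctx, req.NamespacedName, &deployment); err != nil {
		// The resource may have been deleted in the meantime; nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Point the AWS function at the artifact declared in the spec.
	_, err := r.LambdaClient.UpdateFunctionCodeWithContext(ctx, &lambda.UpdateFunctionCodeInput{
		FunctionName: aws.String(deployment.Spec.FunctionARN), // accepts a full ARN
		S3Bucket:     aws.String(deployment.Spec.S3Bucket),
		S3Key:        aws.String(deployment.Spec.S3Key),
	})
	if err != nil {
		// Returning the error makes controller-runtime retry with backoff.
		return ctrl.Result{}, err
	}

	// In practice we also tag the function (Git SHA, timestamp) and update
	// Status here; see the Synchronization section below.
	return ctrl.Result{}, nil
}
```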

Deploying the Controller

There are a few steps involved in deploying a custom controller in Kubernetes. If you use one of the frameworks, most of the required boilerplate will be generated for you, but deployment still comes down to applying the following manifests (a sketch of the apply commands follows the list):

  • The CustomResourceDefinition: defines what LambdaDeployment resources are supposed to look like.
  • The Controller role bindings: grant the controller permission to use the Kubernetes API to watch LambdaDeployment resources.
  • The Controller Deployment: runs the actual program with the watch logic that calls the AWS API.
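Applying them can be as simple as the following (the file names are placeholders for the manifests your framework generates):

```bash
# Illustrative only: file names stand in for the generated manifests.
kubectl apply -f lambdadeployment-crd.yaml      # 1. register the CustomResourceDefinition
kubectl apply -f controller-rbac.yaml           # 2. role and role binding so the controller can watch CRDs
kubectl apply -f controller-deployment.yaml     # 3. the controller Deployment itself
```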

Synchronization

With the architecture discussed above, we’ve decided that Kubernetes will be the source of truth. This means it’s paramount that we keep AWS in sync with whatever state is in Kubernetes.

There are many ways to address this issue, but we chose to keep it simple and use the AWS tagging functionality to include additional metadata like the git commit SHA and last updated timestamp on the AWS Lambda itself.

We can avoid unnecessary update calls by checking that the Git commit SHA tag matches the Kubernetes manifest and, as an additional safeguard (to catch someone accidentally deploying to the function directly), by checking that the function’s last-updated timestamp is approximately the same as the last-modified timestamp on the CRD, as sketched below.
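Extending the reconciler sketch above, that check could look roughly like this; the “git-sha” tag key and the timestamp tolerance are illustrative conventions, not AWS-defined ones:

```go
package controllers

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/lambda"

	// Same placeholder module path as in the reconciler sketch above.
	lambdav1alpha1 "github.com/example/lambda-operator/api/v1alpha1"
)

// needsUpdate reports whether the function's code in AWS has drifted from the
// LambdaDeployment resource.
func (r *LambdaDeploymentReconciler) needsUpdate(ctx context.Context, deployment *lambdav1alpha1.LambdaDeployment) (bool, error) {
	// Compare the Git SHA tagged on the function with the one in the manifest.
	tags, err := r.LambdaClient.ListTagsWithContext(ctx, &lambda.ListTagsInput{
		Resource: aws.String(deployment.Spec.FunctionARN),
	})
	if err != nil {
		return false, err
	}
	if aws.StringValue(tags.Tags["git-sha"]) != deployment.Spec.GitSHA {
		return true, nil
	}

	// Extra safeguard: someone may have pushed code to the function directly,
	// so also compare the function's last-modified time with what we recorded.
	cfg, err := r.LambdaClient.GetFunctionConfigurationWithContext(ctx, &lambda.GetFunctionConfigurationInput{
		FunctionName: aws.String(deployment.Spec.FunctionARN),
	})
	if err != nil {
		return false, err
	}
	lastModified, err := time.Parse("2006-01-02T15:04:05.000-0700", aws.StringValue(cfg.LastModified))
	if err != nil {
		return false, err
	}
	// Redeploy if AWS was modified noticeably later than our last deployment.
	return lastModified.After(deployment.Status.LastUpdated.Add(time.Minute)), nil
}
```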

Alternatives

There are alternatives to the Kubernetes architecture discussed above, and although they don’t work for our particular use case of unifying the deployment pipeline around ArgoCD, they are worth considering nonetheless if your company is building core infrastructure with AWS Lambdas.

AWS Lambda Versioning and Aliases?

The main reason we can’t use the built-in AWS versioning functionality is that aliases do not support separate environment variables. This is a dealbreaker because we rely on environment variables to determine which services and endpoints a Lambda interacts with.

Note that it’s not impossible to make this work. For example, an alternative solution could have involved using aliases (preview, staging, production) and, based on the ARN (which can be retrieved through the AWS context in the source code), using the AWS Systems Manager Parameter Store to retrieve your environment variables. Refer to this StackOverflow explanation for details.
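A rough sketch of what that would require inside each function, assuming the aws-lambda-go runtime and illustrative parameter paths:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-lambda-go/lambdacontext"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

// aliasFromARN extracts the invoked alias (e.g. "staging") from an ARN like
// arn:aws:lambda:us-west-2:123456789012:function:my-func:staging.
func aliasFromARN(arn string) string {
	parts := strings.Split(arn, ":")
	if len(parts) == 8 {
		return parts[7]
	}
	return "production" // unqualified invocation; fall back to a default
}

func handler(ctx context.Context) (string, error) {
	lc, ok := lambdacontext.FromContext(ctx)
	if !ok {
		return "", fmt.Errorf("no lambda context available")
	}
	alias := aliasFromARN(lc.InvokedFunctionArn)

	// Look up environment-specific config under an alias-based prefix,
	// e.g. /my-func/staging/endpoint. Parameter names are illustrative.
	svc := ssm.New(session.Must(session.NewSession()))
	param, err := svc.GetParameterWithContext(ctx, &ssm.GetParameterInput{
		Name:           aws.String(fmt.Sprintf("/my-func/%s/endpoint", alias)),
		WithDecryption: aws.Bool(true),
	})
	if err != nil {
		return "", err
	}
	return aws.StringValue(param.Parameter.Value), nil
}

func main() {
	lambda.Start(handler)
}
```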

Pragmatically, our team wanted to avoid making code changes to every Lambda to make these deployments work, so we opted to not go this route.

Serverless?

It’s simply not realistic to try to integrate the Serverless Framework when we already have our own Bazel-based build system.

Terraform?

Terraform is a great tool for provisioning resources, but managing deployments of services with it can be clunky in practice, because it requires manually committing changes for each separate environment, which is a time-consuming and painful process for developers.

Key Learnings

The process of researching and building this pipeline produced a few important takeaways:

  1. Building and deploying a custom Kubernetes controller is relatively easy with the various frameworks, especially if you are familiar with Go.
  2. Push your compiled source code artifact (zip file or image) to an AWS store (S3 or ECR) with metadata that describes the code state at the time of compilation. This provides an easy way to point the AWS Function to a specific version of the code without having to go through the entire compilation and upload process.
  3. While exploring “AWS Lambda Versioning and Aliases,” it became apparent that using AWS Parameter Store instead of environment variables gives you more options for deploying AWS Functions. You can use the context within the AWS Function to determine which parameters to pull (based on a prefix), and, when combined with Versions and Aliases, you might not need an AWS Function per environment.

Community-Driven Projects: ACK

The AWS community, through the AWS Controllers for Kubernetes (ACK) project, is already in the midst of building Kubernetes controllers for various AWS resources (S3, SNS, SQS, ECR, DynamoDB, API Gateway), but it hasn’t gotten to AWS Lambda functions yet. Hopefully this custom code can be replaced once an official implementation exists. Their blog can be found here.

Come Join Us!

Check out our latest KeepTruckin job opportunities on our Careers page and visit our Before You Apply page to learn more about our rad engineering team.
