Lambda@Edge Case Study

Traffic Management at the Edge

Alternative title — How to Shoehorn Aerosmith Into A Blog Post.

James Hodge
Engineers @ The LEGO Group

--

Introduction

Our traffic during release events or sales is vastly disproportionate to our day-to-day operations. To manage these high levels of traffic without significantly rearchitecting the platform for these important but infrequent events, we decided to add a layer that could protect the platform and ensure we stay operational when an unexpectedly large surge in traffic heads our way.

With sudden spikes in traffic, the site can be quickly overwhelmed.

This is a Lambda@Edge case study that looks at how we used traffic management at the edge to form a waiting room, detailing the issues we encountered and the steps we took to overcome them (along with a tenuously linked soundtrack to keep you company).

The Problem

Texas Flood

How do you prevent too much traffic from flooding your resources, without losing the traffic entirely? How do you apply logic to traffic you don’t want to receive in the first place?

By allowing every request to reach our platform during these release events, we run the risk of hitting some inherent limitations. This can cause some requests to fail and, worse, make the platform as a whole unresponsive, creating a generally poor experience for everyone.

At first we tried a queuing system. This would aggregate users past a certain threshold in one place and let them out slowly over time. In effect, this just moved our problem up a level: we still had too many requests in one place, and we still had to manage those requests.

This drew us to look at edge compute solutions: smaller pieces of distributed logic that run closer to users and, importantly, away from the platform.

Lambda@Edge

Livin’ On The Edge

Our solution started life from an AWS blog post on visitor prioritisation, which was an invaluable starting point for understanding what Lambda@Edge can do — https://aws.amazon.com/blogs/networking-and-content-delivery/visitor-prioritization-on-e-commerce-websites-with-cloudfront-and-lambdaedge/

Our Lambda@Edge function would operate similarly. We would send all traffic through a CloudFront distribution and, via the edge function, split the incoming traffic by a configurable ratio, letting only a percentage of traffic through to the platform. The rest is placed in a highly available “waiting room” that sits outside of our platform: a static Next.js site hosted in S3. From there, users are slowly drip-fed into the site over a longer period of a few minutes.

By siphoning a percentage of users into a waiting room, the same number of users are spread over a longer period.
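
To make that concrete, here is a minimal sketch of an origin-request handler in this style. It is illustrative rather than our production code: the ratio, cookie name, and waiting room domain are placeholders, and in practice the operating parameters come from configuration (more on that below).

    import type { CloudFrontRequestHandler } from 'aws-lambda';

    // Placeholder values; in practice these come from central configuration.
    const TRAFFIC_RATIO = 0.3; // let 30% of new users straight through
    const WAITING_ROOM_DOMAIN = 'waiting-room.s3-website-us-east-1.amazonaws.com';

    export const handler: CloudFrontRequestHandler = async (event) => {
      const request = event.Records[0].cf.request;

      // Users who have already been admitted carry a cookie and pass straight through.
      const cookies = request.headers['cookie'] ?? [];
      const admitted = cookies.some((header) => header.value.includes('admitted=true'));
      if (admitted || Math.random() < TRAFFIC_RATIO) {
        return request; // continue to the default (platform) origin
      }

      // Everyone else is re-routed to the static waiting room site.
      // S3 website endpoints only speak HTTP, hence port 80.
      request.origin = {
        custom: {
          domainName: WAITING_ROOM_DOMAIN,
          port: 80,
          protocol: 'http',
          path: '',
          sslProtocols: ['TLSv1.2'],
          readTimeout: 30,
          keepaliveTimeout: 5,
          customHeaders: {},
        },
      };
      request.headers['host'] = [{ key: 'Host', value: WAITING_ROOM_DOMAIN }];
      return request;
    };

Returning the (possibly rewritten) request tells CloudFront which origin to fetch from; returning a response object instead would short-circuit the request entirely.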

Challenges

After some testing and proofs of concept, we stumbled upon a few specific gotchas that were not obvious from the outset.

Deployment

Build Me Up Buttercup

Our infrastructure is generally deployed via Terraform, while our Lambda functions are managed by Serverless Framework.

Edge functions start life like any other Lambda function, and become edge functions once associated with a CloudFront distribution. However, this association can only happen if the Lambda function meets certain requirements and limitations:

  • Must be deployed to us-east-1.
  • Must use a numbered version (not $LATEST).
  • The Lambda execution role must allow the lambda.amazonaws.com and edgelambda.amazonaws.com service principals to assume the role.
  • The function must have no environment variables.

A full list of restrictions is detailed here — https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-restrictions.html#lambda-at-edge-function-restrictions
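
The service-principal requirement translates to the execution role's trust policy. As a hypothetical sketch (shown with the AWS SDK for brevity; the role name is illustrative), the trust relationship needs to include both principals:

    import { IAMClient, CreateRoleCommand } from '@aws-sdk/client-iam';

    // Both the Lambda service and the Lambda@Edge replication service must be
    // able to assume the execution role.
    const assumeRolePolicy = {
      Version: '2012-10-17',
      Statement: [{
        Effect: 'Allow',
        Principal: { Service: ['lambda.amazonaws.com', 'edgelambda.amazonaws.com'] },
        Action: 'sts:AssumeRole',
      }],
    };

    await new IAMClient({ region: 'us-east-1' }).send(new CreateRoleCommand({
      RoleName: 'edge-function-role', // illustrative name
      AssumeRolePolicyDocument: JSON.stringify(assumeRolePolicy),
    }));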

As a starting point, we deployed the Lambda function as we normally would with the Serverless Framework (albeit in the us-east-1 region), with the CloudFront and bucket resources created by Terraform.

The Terraform resources are treated as a prerequisite for the Lambda function, and to associate the two, we use the serverless-lambda-edge-pre-existing-cloudfront plugin.

The waiting room site itself is deployed to an S3 bucket, and this S3 origin is added to CloudFront so that it can be used as a target by the Lambda function.

The standard approach to securing an S3 bucket behind a CloudFront distribution is to use an Origin Access Identity (OAI). However, it is currently not possible to switch between a custom origin (in our case, a load balancer) and an S3 origin that uses OAI via Lambda@Edge.

To secure our bucket and ensure it's only accessed via CloudFront, we put a bucket policy in place that defines a set of conditions, including a value that must be passed with each request. That value is added by the function itself, which ensures the only route into the static site is through the Lambda function.
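
We won't reproduce our exact policy here, but a common way to implement this pattern is a shared secret sent in the Referer header, which a bucket policy can match using the aws:Referer condition key. A hypothetical sketch with placeholder values:

    import type { CloudFrontRequest } from 'aws-lambda';

    // Placeholder secret; in practice this should be a long random value that
    // is not committed to source control.
    const ORIGIN_SECRET = 'replace-with-a-long-random-value';

    // Called from the origin-request handler after switching to the S3 origin.
    function addOriginSecret(request: CloudFrontRequest): void {
      request.headers['referer'] = [{ key: 'Referer', value: ORIGIN_SECRET }];
    }

    // The bucket policy then only allows requests carrying the secret, e.g.:
    // {
    //   "Effect": "Allow",
    //   "Principal": "*",
    //   "Action": "s3:GetObject",
    //   "Resource": "arn:aws:s3:::waiting-room-bucket/*",
    //   "Condition": { "StringEquals": { "aws:Referer": "replace-with-a-long-random-value" } }
    // }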

Cost

Money

One fairly obvious requirement of using Lambda@Edge is that you have to send all of your traffic through CloudFront. If you're already doing this, then the cost of adding edge functions may well be negligible.

Most of our traffic was not already going through CloudFront, so adding it into the request chain so that all requests would be subject to our edge functions would add a significant cost, especially given that we only need this functionality for a handful of events per year.

To mitigate this cost, we added weighted Route53 records, that allow us to direct traffic through CloudFront during events, or to skip it entirely when not needed.

CloudFront is only part of the request chain when required.
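
For illustration, here is the same idea sketched with the AWS SDK rather than our Terraform; every identifier below is a placeholder. Two weighted alias records share the same name, and flipping the weights moves traffic through (or around) CloudFront without touching anything else:

    import { Route53Client, ChangeResourceRecordSetsCommand } from '@aws-sdk/client-route-53';

    await new Route53Client({}).send(new ChangeResourceRecordSetsCommand({
      HostedZoneId: 'Z0000000EXAMPLE',
      ChangeBatch: {
        Changes: [
          {
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: 'shop.example.com',
              Type: 'A',
              SetIdentifier: 'via-cloudfront',
              Weight: 100, // during events: everything through CloudFront
              AliasTarget: {
                DNSName: 'd111111abcdef8.cloudfront.net',
                HostedZoneId: 'Z2FDTNDATAQYW2', // fixed zone ID for CloudFront aliases
                EvaluateTargetHealth: false,
              },
            },
          },
          {
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: 'shop.example.com',
              Type: 'A',
              SetIdentifier: 'direct-to-platform',
              Weight: 0, // outside events: flip this to 100 and the other to 0
              AliasTarget: {
                DNSName: 'platform-alb-123456.eu-west-1.elb.amazonaws.com',
                HostedZoneId: 'Z32O12XQLNTSW2', // the load balancer's hosted zone ID
                EvaluateTargetHealth: true,
              },
            },
          },
        ],
      },
    }));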

Configuration and adaptability

Around The World

Edge functions, by their distributed nature, have no central source of configuration. They are deployed to the us-east-1 region, and CloudFront distributes them out to edge locations, where you no longer control their operation. There are no environment variables to update, as these are not supported by edge functions.

Deployments can also be lengthy, sometimes taking 5–10 minutes for a new version of the function to propagate. So how can we keep the function dynamic and operate it centrally?

We reviewed a number of solutions and settled on a configuration file stored in an S3 bucket. This config file contains information such as the state (enabled/disabled), the queue time, and the traffic ratio. The edge function retrieves this config file and operates based on the parameters within. By keeping a central source of truth for the operating parameters, we maintain control over the many disparate instances of the function.
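
As a minimal sketch of the retrieval side (the bucket, key, field names, and cache TTL below are all illustrative), the function can cache the config in memory between invocations so that warm instances don't pay the S3 round trip on every request:

    import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

    interface EdgeConfig {
      enabled: boolean;
      queueTimeSeconds: number;
      trafficRatio: number; // fraction of new users allowed straight through
    }

    const s3 = new S3Client({ region: 'us-east-1' });

    // Cache the config between invocations of a warm function instance.
    let cached: { config: EdgeConfig; fetchedAt: number } | undefined;
    const CACHE_TTL_MS = 30_000;

    export async function getConfig(): Promise<EdgeConfig> {
      if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
        return cached.config;
      }
      const result = await s3.send(
        new GetObjectCommand({ Bucket: 'edge-config-bucket', Key: 'waiting-room.json' }),
      );
      const config = JSON.parse(await result.Body!.transformToString()) as EdgeConfig;
      cached = { config, fetchedAt: Date.now() };
      return config;
    }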

Session maintenance

Even Flow

With every request being assessed by the edge function, we needed a way to let the function know that a user has already reached the platform and should remain there. We use a cookie to communicate that knowledge to the Lambda function on subsequent requests.
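
One way to set that cookie is a viewer-response trigger that stamps responses served by the platform origin. The sketch below is illustrative, with a placeholder cookie name and lifetime, rather than a description of our exact setup:

    import type { CloudFrontResponseHandler } from 'aws-lambda';

    // Stamp responses from the platform with an "admitted" cookie so the
    // request-side function lets these users straight through next time.
    export const handler: CloudFrontResponseHandler = async (event) => {
      const response = event.Records[0].cf.response;
      response.headers['set-cookie'] = [
        { key: 'Set-Cookie', value: 'admitted=true; Max-Age=3600; Path=/; Secure; HttpOnly' },
      ];
      return response;
    };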

We also use additional CloudFront behaviours to allow certain paths, such as secondary browser requests, to pass through to the default origin without triggering the Lambda function at all.

Logging and Monitoring

What’s Up

As functions are executed in edge regions around the world, the CloudWatch logs for those executions are also distributed across the regions in which those executions occurred.

To provide a centralised point of access for logs, we aggregate them from all known edge regions into a single S3 bucket. This is done by pre-creating the CloudWatch log groups in each region used by Lambda@Edge and subscribing them to an Amazon Kinesis Data Firehose stream, which in turn delivers them to S3.
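
The sketch below shows that wiring with the AWS SDK; ours is defined in Terraform (as noted below), and every name and ARN here is a placeholder. It assumes a Firehose delivery stream per region, since subscription filters must target a stream in the same region as the log group, with each regional stream delivering into the one central bucket.

    import {
      CloudWatchLogsClient,
      CreateLogGroupCommand,
      PutSubscriptionFilterCommand,
    } from '@aws-sdk/client-cloudwatch-logs';

    const EDGE_REGIONS = ['us-east-1', 'eu-west-1', 'ap-southeast-1']; // ...and the rest
    const FUNCTION_NAME = 'waiting-room'; // illustrative

    for (const region of EDGE_REGIONS) {
      const logs = new CloudWatchLogsClient({ region });
      // Lambda@Edge writes to /aws/lambda/us-east-1.<function-name> in each
      // region it executes in; creating the group up front lets us subscribe
      // it before any traffic arrives. (CreateLogGroup throws if it exists.)
      const logGroupName = `/aws/lambda/us-east-1.${FUNCTION_NAME}`;
      await logs.send(new CreateLogGroupCommand({ logGroupName }));

      await logs.send(new PutSubscriptionFilterCommand({
        logGroupName,
        filterName: 'to-firehose',
        filterPattern: '', // forward everything
        destinationArn: `arn:aws:firehose:${region}:123456789012:deliverystream/edge-logs`,
        roleArn: 'arn:aws:iam::123456789012:role/cwl-to-firehose',
      }));
    }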

We use Terraform for this setup, but the following blog post from AWS details the process using CloudFormation: https://aws.amazon.com/blogs/networking-and-content-delivery/aggregating-lambdaedge-logs

From there we can ingest the logs into our logging platform for convenient reporting and troubleshooting.

We also add a monitoring agent to the static site. This lets us know how many sessions worldwide are hitting the waiting room, informing our decisions on letting more or fewer people into the shop.

Summary

The intent of this article is to demonstrate some of the surrounding architectural decisions and considerations to be made when utilising Lambda@Edge.

While these particular issues apply to our waiting room solution, many of these points would need to be considered for any implementation of Lambda@Edge.

  • Configuration — With no environment variables, decide how you will configure the operating parameters of the function.
  • Deployment — Be sure to understand the limitations of edge functions (region, numbered versions, service principals, no environment variables).
  • Cost — What is the financial impact of having all requests go via CloudFront?
  • Logging and Monitoring — How will you debug when things aren’t working as expected? Do you need centralised monitoring?
