Bursting To Lambda

Louis McCormack
spaceapetech

--

The murky art of capacity planning has undergone something of a quiet revolution over the last decade or so. With AutoScaling, lead time — the time it takes for new capacity to be in-place, serving traffic — has gone from weeks and months to minutes. Latterly, with Serverless, it has been almost eradicated.

In fact that forms a large part of the Serverless promise: assign capacity planning to the history books by drawing on a near-infinite pool of compute.

So why isn’t everybody scrambling to migrate to Serverless?

Firstly, a lot of people are.

Also there are good reasons why others are not. Certain high-performance, latency-sensitive workloads are still not a good fit (like online multiplayer games).

Primarily though, at a certain scale, there is a cost argument. For high traffic systems the raw economics may rule out Serverless as an option over more traditional compute.

For instance, this excellent recent white paper shows that there is a clear break-even point where Lambda usage becomes more expensive than EC2. The point varies for different applications and workloads but, once reached, the extra cost of using Lambda quickly starts to rise.

But. What if we could have the best of both worlds?

A (relatively) slow-scaling tier of (relatively) cheap compute lumbering away in the background, supplemented by an ever-ready tier of Lambda functions to soak up unexpected spikes in traffic. What if we could… burst to Lambda?

You’re probably reeling off in your head all the reasons why this is a bad idea. But let us offer one on the side of viability: Last year, container image support was announced for Lambdas. This raises an interesting possibility — is it feasible to deploy the exact same artefact to ECS and Lambda, and have them running alongside one another?

The answer is yes: in certain circumstances, you can. Furthermore, in those circumstances, you can also share traffic between the two.

Intrigued? Please read on…

What is described here is a proof of concept, and the concept being proofed has the following high-level properties:

  • The application under scale is a simple Golang HTTP server fronted by an Application Load Balancer.
  • The slow-scaling tier takes the form of containers running on AWS Fargate, configured to AutoScale.
  • The bursting tier takes the form of a Lambda ALB backend, configured as a separate target group.

Clearly, there’s a lot of detail for the devil to be in.

All of the code referenced here can be found in the accompanying Github repo.

The Application

The application exposes a single endpoint: /doThing. DoThing does nothing more than calculate all the prime numbers between 0 and 10000, and return the result. A Sisyphean task for sure, but that is in fact the least interesting thing about the application.

The most interesting thing about the application is that it is a single container image that can be deployed to either Fargate or Lambda.

In order for this to work, it must respond in a similar fashion to both ordinary HTTP requests and the HTTP-requests-as-JSON proxy Events that Lambda deals in.

This is achieved through use of the excellent apex-gateway library. This library converts Lambda proxy Events into ordinary Go HTTP requests, making it possible to use standard Golang HTTP constructs (handlers, routers, etc.) to serve Lambda proxy requests.

What it means for our application is that we can respond to HTTP requests or Lambda proxy events at the flick of a feature-flag:

if s.lambdaMode {
    log.Println("running in Lambda mode")
    gateway.ListenAndServe(":8000", s.Router)
} else {
    log.Printf("running in http mode on port %d", s.port)
    http.ListenAndServe(fmt.Sprintf(":%d", s.port), s.Router)
}

There are some further quirks to ensuring the same container image can be run on both tiers. Full details can be found in the README, but suffice it to say that the image contains a single binary (the Go application) that is invoked slightly differently by Fargate and by Lambda, by way of an altered entrypoint.
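
For illustration, one way a single binary could decide which mode it is in is to inspect its environment at startup. The sketch below is illustrative rather than the repo's actual mechanism: LAMBDA_MODE is a hypothetical variable that a Lambda-specific entrypoint might set, while AWS_LAMBDA_FUNCTION_NAME is injected by the Lambda runtime itself.

package main

import (
    "fmt"
    "os"
)

// detectLambdaMode sketches how the lambdaMode flag used above could be set.
// LAMBDA_MODE is purely illustrative (a Lambda-specific entrypoint could set
// it); AWS_LAMBDA_FUNCTION_NAME is provided by the Lambda runtime itself, so
// its presence is a reasonable fallback signal.
func detectLambdaMode() bool {
    // an explicit flag, set only by the Lambda variant of the entrypoint...
    if os.Getenv("LAMBDA_MODE") == "true" {
        return true
    }
    // ...or fall back to an environment variable Lambda itself provides
    return os.Getenv("AWS_LAMBDA_FUNCTION_NAME") != ""
}

func main() {
    fmt.Println("lambda mode:", detectLambdaMode())
}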

The Infrastructure

The AWS infrastructure we’ll use to run this PoC is refreshingly simple. It is just an ALB with two listeners configured:

  • Port 8080, forwarding to a target group managed by the Fargate tier.
  • Port 8081, forwarding to a Lambda backend.

AutoScaling the Slow Tier

For reasons that will become clear, the metric we’ll use to AutoScale the Fargate tier is the ALB request rate per target. We’ll set a Target Tracking Policy such that Fargate tasks will be added or removed in order to maintain a consistent rate of requests against each of them.

But how do we know what that rate is?

To help us find out, we fired up the Space Ape load-testing systems.

Then we donned lab-coats and specs and undertook the scientific endeavour of running increasingly onerous load-tests until one of them broke the application.

This graph shows a successful load-test at 100 requests-per-second (RPS). The blue line represents HTTP 200 Responses from the ALB.

If we dial the RPS up to 120, this happens:

In this graph, the green line shows HTTP 500 Responses. Bad news.

So, someone proffered, each Fargate task can comfortably handle 100 RPS? We nodded; seems about right.

Experiment concluded: a Target Tracking Policy of 100 requests per ALB target would be appropriate.
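
For reference, expressed with the AWS SDK the policy might look something like the sketch below. This is illustrative only: the cluster, service and resource-label values are placeholders, the ECS service is assumed to already be registered as a scalable target, and the PoC itself defines the policy in its infrastructure code.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/applicationautoscaling"
)

// A sketch of the target-tracking policy in SDK form: maintain roughly 100
// requests per second against each Fargate task behind the ALB target group.
func main() {
    svc := applicationautoscaling.New(session.Must(session.NewSession()))

    _, err := svc.PutScalingPolicy(&applicationautoscaling.PutScalingPolicyInput{
        PolicyName:        aws.String("fargate-100-rps-per-target"),
        PolicyType:        aws.String("TargetTrackingScaling"),
        ServiceNamespace:  aws.String("ecs"),
        ResourceId:        aws.String("service/poc-cluster/poc-service"),
        ScalableDimension: aws.String("ecs:service:DesiredCount"),
        TargetTrackingScalingPolicyConfiguration: &applicationautoscaling.TargetTrackingScalingPolicyConfiguration{
            // the target value arrived at by the load-tests above
            TargetValue: aws.Float64(100),
            PredefinedMetricSpecification: &applicationautoscaling.PredefinedMetricSpecification{
                PredefinedMetricType: aws.String("ALBRequestCountPerTarget"),
                ResourceLabel:        aws.String("app/poc-alb/123abc/targetgroup/poc-tg/456def"),
            },
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}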

With the AutoScaling Target Tracking policy in place, we re-ran the same test but with twice the load (200 RPS). This graph shows the response codes, alongside the number of Fargate tasks (on a logarithmic scale):

We can see that AutoScaling did indeed kick in — if a little over-zealously — and decide to scale out 5(!) new tasks. This was more than enough to handle the 200 RPS.

So that’s the end of it — AutoScaling works, let’s move on?

Well, unfortunately, no.

AutoScaling works only if the rate of increase is sufficiently low. Or, put another way: if the duration over which the load is added is long enough for AutoScaling to a.) realise it needs to add capacity and b.) actually add the capacity.

To illustrate this, we’ll run the same test, but halve the ramp up time:

Here we can see that, in fact, AutoScaling does not work.

The green line shows HTTP 500 responses, of which there were a fair number, until AutoScaling finally got its act together.

One way around this problem is to pre-scale: add capacity in advance of a known traffic increase. At Space Ape we do precisely that when an event is about to begin in one of our mobile games.

This of course presupposes that you know in advance. Some workloads do not have this advantage; think of a link which suddenly goes viral, or a TV advert driving a call to action. Systems with workloads such as this would likely have to either run consistently over-scaled, accept some errors, or have client retry mechanisms. None of which sound particularly attractive.

Lambda to the Rescue!

And now we come to the coup de grace. Can we use Lambda to soak up spikes in load, to give AutoScaling a chance to catch up?

The answer is probably, yes. But there are many ways to skin this cat.

Here is what we did:

Recall that we have 2 backends, one Fargate and one Lambda, both using the same container image.

Recall also that we believe each Fargate task can handle 100 RPS.

The aim, then, is to take the portion of load which is above 100 RPS per target, and somehow send it to the Lambda backend.

To this end we added a rate-limiter to the application. It is only enabled on the Fargate backend (through an environment variable). Most requests will pass the rate limit and be served by the Fargate application.

However, if the rate is seen to be above 100 RPS, it will issue an HTTP 301 redirect instead. The target of the redirect will be the Lambda backend. The client merrily follows the redirect, and the request is served.
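
Here is a minimal sketch of that mechanism, written as a standard net/http middleware. It is illustrative rather than the repo's exact code: burstMiddleware, the LAMBDA_BACKEND_URL variable and the use of golang.org/x/time/rate are all names chosen for this example.

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"

    "golang.org/x/time/rate"
)

// burstMiddleware sketches the redirect-on-overload idea: requests within the
// per-task budget are served locally, anything beyond it is redirected to the
// Lambda backend.
func burstMiddleware(next http.Handler) http.Handler {
    // roughly 100 requests per second per task, with a small burst allowance
    limiter := rate.NewLimiter(rate.Limit(100), 20)
    // e.g. the ALB listener on port 8081 that fronts the Lambda target group
    lambdaURL := os.Getenv("LAMBDA_BACKEND_URL")

    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if limiter.Allow() {
            next.ServeHTTP(w, r)
            return
        }
        // over budget: hand the request off to the Lambda tier
        http.Redirect(w, r, lambdaURL+r.URL.Path, http.StatusMovedPermanently)
    })
}

func main() {
    doThing := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "primes go here")
    })
    log.Fatal(http.ListenAndServe(":8000", burstMiddleware(doThing)))
}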

There are a number of things to be said about this approach:

  • This is a form of backpressure. Ordinarily backpressure would result in the client request being rejected, or told to try again in a moment. But here we have still serviced the request, by sending it to our Lambda pool.
  • In this approach all requests are still sent to the Fargate tier, even those above 100 RPS. It works in this contrived example because issuing a 301 redirect uses significantly fewer CPU cycles than calculating prime numbers. A better approach may be to have some out-of-band process which tinkers with the Target Group weighting when it sees a spike in traffic (a rough sketch of that idea follows below).
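
For completeness, here is roughly what such an out-of-band adjustment could look like with the AWS SDK, assuming the ALB supported a single weighted forward action across both target groups. This is not how the PoC is actually wired, the ARNs are placeholders, and the 80/20 split is an arbitrary example.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/elbv2"
)

// A sketch of the out-of-band alternative: rather than redirecting in the
// application, an external process rewrites a weighted forward action so that
// a share of traffic lands on the Lambda target group during a spike.
func main() {
    svc := elbv2.New(session.Must(session.NewSession()))

    _, err := svc.ModifyListener(&elbv2.ModifyListenerInput{
        ListenerArn: aws.String("arn:aws:elasticloadbalancing:eu-west-1:111111111111:listener/app/poc-alb/abc/def"),
        DefaultActions: []*elbv2.Action{{
            Type: aws.String("forward"),
            ForwardConfig: &elbv2.ForwardActionConfig{
                TargetGroups: []*elbv2.TargetGroupTuple{
                    // keep most traffic on the cheaper Fargate tier...
                    {TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/fargate-tg/abc"), Weight: aws.Int64(80)},
                    // ...and spill a portion onto the Lambda tier during a spike
                    {TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/lambda-tg/def"), Weight: aws.Int64(20)},
                },
            },
        }},
    })
    if err != nil {
        log.Fatal(err)
    }
}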

The Results

Ok. But does it actually work?

Well, here is the same test as above, with the Lambda redirection turned on:

  • The blue line shows HTTP 200 responses from the Fargate backend.
  • The orange line shows HTTP 200 responses from the Lambda backend.
  • The green line is the number of Fargate tasks.

We can clearly see the Lambda tier taking on a high number of requests as load ramps up. Once AutoScaling arrives at the party it adds 5 tasks (see the bump in the green line), after which 100% of requests are once more served by the Fargate tier, and the Lambdas retreat.

So, yes, we successfully burst to the Lambda tier! We provided a buffer until AutoScaling caught up, and no errors were returned to the clients.

Summary

We have proven that it is indeed possible to a.) build a single artefact for use with both Lambda and ECS and b.) serve most requests from a cheaper ECS tier, only switching to Lambda when capacity on that tier is insufficient.

Obviously what we’ve proven is a specific solution, to a very contrived example, borne out of a hypothetical question. But we like to think that there may be some teams out there, in circumstances as specific as these, with whom it may resonate. If so, we’d love to hear from you!

Thanks for reading.
