Performance tuning for an AWS Lambda-based API (Detecting Paris’ locked bicycle stations 5/5)

Jean Baptiste Muscat
Nov 21 · 13 min read
Photo by Jacek Dylag on Unsplash

This series of articles is about me spending way too much time trying to solve a niche problem (detecting locked bicycle stations in Paris, see Part 1) while learning how to use the AWS Serverless stack. To find the other articles, skip to the bottom of the page.

In Part 4, I created the frontend and the API for my www.velinfo.fr web application that computes and displays the status of the 1,500 Paris bicycle stations.

Everything is working fine, but the site is quite slow. The static content takes some time to load, and the API is not very responsive. How can I improve that?

Frontend caching with CloudFront

A simple way to serve static content (like the static files of a website) faster is to use a CDN, or Content Delivery Network. It’s a distributed network of servers that caches your content (from simple files to videos) as close as possible to your users. More precisely, the CDN sits “in front” of your application: the client calls the CDN instead of your site, and the CDN calls your site only if it does not already have the content in cache.

There are a lot of independent CDNs to choose from, from Akamai to Cloudflare. But, as AWS has its own CDN (CloudFront), I’ll use this one.

CloudFront locations

Setting up CloudFront in the CloudFormation template is simple. You need to define a “Distribution” (how and what CloudFront will cache), which specifies:

  • the HTTP methods you want to cache (GET and HEAD for me)
  • the “Origin” (the source of the content that the CDN is proxying, so the website’s S3 bucket in our case)
  • the caching behavior (for how long the content should be cached, what happens in case of missing content, …)
  • the price class to use (which corresponds to how many locations will be used)

And don’t forget to update the CNAME DNS record for “www.velinfo.fr” so that it points to the distribution instead of the S3 bucket.

All this can be done via the AWS web console, but as for the rest of my application, I will define it directly in my CloudFormation file instead.

A CloudFront distribution set up using CloudFormation

Performance improvements

To assess the performance gains, I’ll use KeyCDN’s performance test and focus on European cities. The most important metric is the TTFB (Time To First Byte), which represents the wait time before the call is answered.

The improvement is noticeable: from ~55ms to ~25ms. Nice!

Note: Now that CloudFront is set up, I no longer need to make the website S3 bucket publicly accessible. Only the CloudFront distribution should access it.

How much does it cost?

CloudFront pricing is based on three axes:

  • The amount of data transferred from CloudFront to the internet: $0.085/GB (it gets cheaper with more traffic)
  • The amount of data transferred from the origin to CloudFront: $0.020/GB
  • The number of HTTP(S) calls: $0.0120 / 10,000 HTTPS calls

Currently, loading the homepage performs about 10 calls to my frontend domain, for about 300KB. The number of calls to the origin is negligible because the content will only change after a deployment so it can stay in the cache for very long.

Realistically, I’m not expecting more than a few thousand users. So let’s say 10,000 daily users: that would give me ~3 GB of daily outgoing traffic and ~100,000 calls, so about $0.375 per day. And that does not account for the fact that most of this static content will be cached in the client browser anyway.
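The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes one homepage load per user per day, no browser caching, decimal units (1 GB = 1,000,000 KB) and the first-tier prices quoted above. The function and constant names are made up for the illustration.

```typescript
// Prices quoted in the article (USD); real prices vary by region and volume tier.
const PRICE_PER_GB_OUT = 0.085;
const PRICE_PER_10K_HTTPS_CALLS = 0.012;

// Traffic assumptions from the article: one homepage load is ~10 calls and ~300KB.
const CALLS_PER_VISIT = 10;
const KB_PER_VISIT = 300;

function dailyCloudFrontCost(dailyUsers: number): number {
  // Assumption: one visit per user per day, nothing cached in the browser.
  const gbOut = (dailyUsers * KB_PER_VISIT) / 1e6;
  const calls = dailyUsers * CALLS_PER_VISIT;
  return gbOut * PRICE_PER_GB_OUT + (calls / 10000) * PRICE_PER_10K_HTTPS_CALLS;
}

console.log(dailyCloudFrontCost(10000).toFixed(3)); // prints "0.375"
```

Data transfer accounts for about two thirds of that amount ($0.255), the HTTPS calls for the rest ($0.12).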

Analyzing Lambda performance with X-Ray

The static content of my site is loading faster, but my API is still too slow.

For example, the GET /stations endpoint, which returns the current state, status, and characteristics of each bicycle station, takes more than 12 seconds to finish!

I already made sure that my DynamoDb calls were made in parallel, so I’m not really sure what could be done to improve it further. Luckily, I can use AWS X-Ray to understand more finely where my performance bottleneck is.

X-Ray is a system that traces each call made within your application to help identify performance issues. To use it on a Node.js Lambda function, you need to enable it for your function (in the web console for example), import the aws-xray-sdk package and use the AWSXRay.captureAWS() method to wrap the AWS SDK that your function uses to call its dependencies (DynamoDb in my case).

Then, in the X-Ray web console, you’ll be able to see the detailed traces of each call during a function execution:

An X-Ray trace for a GetStations function call.

Here I can clearly see that my three DynamoDb calls are correctly made in parallel, but that the longest one lasts 8 seconds!

This seems very slow for a simple read operation, so is there something wrong with DynamoDb? Well, the 8 seconds represent not only the time spent on the database side but also the time spent on the function’s side marshalling / unmarshalling the data (DynamoDb stores the objects as typed JSON, which needs to be processed before being used). And these kinds of operations are CPU intensive.
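To give an idea of what this unmarshalling involves, here is a simplified sketch of the conversion from DynamoDb’s typed JSON to plain objects. It only handles a subset of the attribute types (S, N, BOOL, NULL, L and M) and is not the SDK’s actual converter. The station fields in the example are hypothetical.

```typescript
// Subset of DynamoDb's typed attribute format: every value carries its type tag.
type AttributeValue =
  | { S: string }
  | { N: string } // numbers travel as strings
  | { BOOL: boolean }
  | { NULL: true }
  | { L: AttributeValue[] }
  | { M: { [key: string]: AttributeValue } };

type Plain = string | number | boolean | null | Plain[] | { [key: string]: Plain };

// Recursively strips the type tags. This walk happens for every attribute of
// every item returned, which is why large reads are CPU-bound on the function side.
function unmarshall(value: AttributeValue): Plain {
  if ("S" in value) return value.S;
  if ("N" in value) return Number(value.N);
  if ("BOOL" in value) return value.BOOL;
  if ("NULL" in value) return null;
  if ("L" in value) return value.L.map((element) => unmarshall(element));
  const result: { [key: string]: Plain } = {};
  for (const key of Object.keys(value.M)) {
    result[key] = unmarshall(value.M[key]);
  }
  return result;
}

// Hypothetical station item, as it could come back from a read.
const rawStation: AttributeValue = {
  M: {
    stationCode: { S: "16107" },
    capacity: { N: "35" },
    locked: { BOOL: false },
  },
};

console.log(JSON.stringify(unmarshall(rawStation)));
// prints {"stationCode":"16107","capacity":35,"locked":false}
```

Repeated over ~1,500 stations, each with many attributes, this kind of recursive walk adds up quickly on a fraction of a vCPU.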

What we are really seeing is the CPU limitation of a standard Lambda function.

Increasing a Lambda function CPU performance

By default, a Lambda function only uses 128MB of memory and has access to a fraction of a vCPU core. It is easy to increase the memory (by setting the MemorySize parameter in the SAM/CloudFormation template, up to 10240MB), but there is no way to assign more CPU. That’s because the CPU power automatically scales with the assigned memory.

To put it simply: to have more CPU power, you need more memory.

More specifically, you reach a full vCPU core at the 1769MB mark. Once you assign more memory, you will start having more than one thread, but single-thread performance will already be maxed out at 1769MB.

So let’s set the GetStations Lambda function to 1769MB and see what kind of performance gains we will have with a full core.

X-Ray trace after increasing the memory to reach a full vCPU thread.

From 12.6s down to 811ms, not bad 😊.

What about the cost? The increase in memory impacts the cost per millisecond proportionally. This means my 1769MB Lambda costs ~14 times more per millisecond than a standard 128MB Lambda, but it also runs ~15 times faster. So, in my case, increasing the memory (and the performance) is cheaper!
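As a rough check of that claim: Lambda bills compute as memory multiplied by duration (GB-seconds), so the price per GB-second cancels out when comparing two configurations. The sketch below assumes the durations seen in the X-Ray traces (12.6s and 811ms) are representative of the billed durations, and ignores the per-request fee.

```typescript
// Billed compute for one invocation, in GB-seconds.
function gbSeconds(memoryMb: number, durationMs: number): number {
  return (memoryMb / 1024) * (durationMs / 1000);
}

// Assumption: durations taken from the X-Ray traces quoted in the article.
const computeAt128Mb = gbSeconds(128, 12600);
const computeAt1769Mb = gbSeconds(1769, 811);

// Below 1 means the bigger function is cheaper per call.
const relativeCost = computeAt1769Mb / computeAt128Mb;
console.log(relativeCost.toFixed(2)); // prints "0.89"
```

With these numbers, each call to the 1769MB function costs roughly 11% less than a call to the 128MB one, while answering about 15 times faster.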

Applying the same optimization to the other endpoints offers similar performance gains. I could also increase the MemorySize for the Lambda functions that are part of the internal detection system, but I have no performance constraint there as they simply run once every minute.

The numerous ways to cache an API

Now that each Lambda function serving an endpoint is optimized, let’s do a simple performance test. I’ll call one of the heaviest endpoints (GET /prediction/by-station) 50 times in a row.

Response time for successive calls to the prediction/by-station endpoint.

For the first ~20 calls, the endpoint behaves consistently. It takes about 3 seconds to finish. Then the response time starts to increase, with some calls failing, ending with a full failure for the last 10 calls.

So what happened? Well, we just went over the DynamoDb table read capacity, so our read operation has been throttled. Worse, while the endpoint was failing, some of the functions of the internal detection pipeline also failed, because they read from the same table and have also been throttled.

You’ll find more details about the DynamoDb RCU (Read Capacity Unit) and WCU (Write Capacity Unit) in Part 2.

In other words: if there are too many calls on my API, the API fails and, even worse, it may break the internal detection pipeline.

What could I do?

Increasing the Read Capacity of all the concerned tables would be the simplest solution, but this could quickly become costly as I’m already maxing out the free-tier limit. And it would only postpone the problem, simply requiring more calls before failing.

I could also create a new set of tables that would only be written to by the pipeline functions and read from by the API functions. They would serve as an isolation layer. But sustaining a high number of calls on those tables would still require more RCU than I can afford.

Instead of trying to increase the throughput of my tables, maybe I should be focusing on finding a way to cache my data so that my tables are simply called less often.

Hit rate

One important metric to assess the usefulness and the cost of any caching solution is the hit rate. It is the ratio of the number of calls that a cache was able to answer over the total number of calls. The bigger the hit rate, the more useful the cache is.

Obviously, the hit rate depends on the design and the use case of my API. For example, an endpoint that would return data customized for the caller would result in a low hit rate, as a given response could only be cached for a single user.

Luckily my API is very simple and consists only of a handful of endpoints, with no parameter, that are user-independent and whose content is only updated every 60s or so. So I can expect a very good hit rate.
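To make “very good” a bit more concrete, here is a minimal simulation of a single cache entry with a fixed retention period. The traffic pattern (one call per second for ten minutes) and the single cache node are assumptions chosen for the illustration, not measurements.

```typescript
// Replays a series of request timestamps (in seconds) against one cached
// response that expires ttlSec after being fetched, and returns the hit rate.
function simulateHitRate(requestTimesSec: number[], ttlSec: number): number {
  let hits = 0;
  let cachedUntil = -Infinity;
  for (const t of requestTimesSec) {
    if (t < cachedUntil) {
      hits += 1; // answered from the cache
    } else {
      cachedUntil = t + ttlSec; // miss: fetch from the origin and cache the result
    }
  }
  return requestTimesSec.length === 0 ? 0 : hits / requestTimesSec.length;
}

// Hypothetical traffic: one call per second for 10 minutes, 60s retention.
const requestTimes: number[] = [];
for (let second = 0; second < 600; second++) {
  requestTimes.push(second);
}

console.log(simulateHitRate(requestTimes, 60).toFixed(3)); // prints "0.983"
```

In this idealized model only one call per minute reaches the origin, whatever the traffic, so the hit rate improves as the number of callers grows. A real cache spread over several nodes or locations would see one miss per node per minute instead.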

DynamoDb Accelerator

DynamoDb Accelerator (or DAX) is a managed cache for DynamoDb.

More specifically, it’s a read-through/write-through cache. It will greatly increase the read performance of the DynamoDb tables it is caching, and every read operation that is answered by DAX won’t consume any read capacity from the table.

Internally, it consists of a cluster of one or several cache nodes. Think of those nodes as small EC2 instances that are created on your behalf by AWS. As for any EC2 instance, they need to be placed in a private network (a VPC, or Virtual Private Cloud). You pay for the size and number of those nodes, with the smallest node starting at about $30 a month.

✅Fixed cost, would benefit all functions, would improve the performance of all calls.

❌Requires updating the data layer, requires setting up a VPC, won’t reduce the number of Lambda or Gateway calls, the base cost is expensive for a pet project.

Elasticache

Elasticache is an AWS-managed distributed cache. The caching technology can be Memcached or Redis, depending on your needs.

A Lambda can connect to an Elasticache cluster to fetch data; the rest of the implementation is up to me. I could simply use it as a side cache, or update the pipeline’s functions to write directly in the cache and have the API read from it instead of the DynamoDb tables.

As for DAX, Elasticache consists internally of a set of managed EC2 instances and requires the creation of a VPC. The pricing depends on the number and size of the nodes, with the smallest node costing about $10 a month. But, depending on how the cache is implemented, you may want to have at least two nodes to ensure high availability.

✅Fixed cost, the caching logic is up to you.

❌Requires rewriting part of the application to implement the caching logic, requires setting up a VPC, won’t reduce the number of Lambda or Gateway calls.

Using a Lambda instance’s own memory

When a call reaches the API Gateway, an instance of the corresponding Lambda is started to answer it.

But starting a Lambda takes some time (it’s a “cold start”). So, to minimize this, Lambda keeps the instances alive for a few minutes after a call is finished, in case the same function is called again.

We can take advantage of that by caching some data directly in a Lambda’s instance (by assigning a variable outside the event handler function for example). The next time this Lambda is called, if the same instance is reused, the data will still be there.
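A minimal sketch of this technique could look like the following. The loadStationsFromTable function is a hypothetical stand-in for the real DynamoDb read, and the 60-second retention is an assumption based on how often the data changes. In a real deployment the handler function would be exported.

```typescript
type Station = { code: string; state: string };

const TTL_MS = 60000; // assumption: the data only changes every ~60s

// Module scope: these variables survive between invocations for as long as
// Lambda keeps reusing this instance. A fresh instance starts with an empty cache.
let cachedStations: Station[] | undefined;
let cachedAtMs = 0;
let tableReads = 0; // only here to make the effect of the cache visible

// Hypothetical stand-in for the real DynamoDb read.
async function loadStationsFromTable(): Promise<Station[]> {
  tableReads += 1;
  return [{ code: "16107", state: "Ok" }];
}

async function handler(): Promise<{ statusCode: number; body: string }> {
  const nowMs = Date.now();
  if (cachedStations === undefined || nowMs - cachedAtMs >= TTL_MS) {
    // Cold instance or stale data: read the table and remember the result.
    cachedStations = await loadStationsFromTable();
    cachedAtMs = nowMs;
  }
  return { statusCode: 200, body: JSON.stringify(cachedStations) };
}
```

Two calls answered by the same instance within a minute trigger a single table read. Two calls answered by two different instances trigger two.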

This solution has two major limitations:

  • predictability: we have no control over when an instance is freed or kept, so the hit rate of this caching solution can’t be predicted
  • concurrency: when a Lambda function needs to be started, if a corresponding instance exists but is currently in use (answering a concurrent call for example), Lambda has no choice but to create a new instance. This means that the “hit rate” will be worse in high load situations as more and more fresh instances will be created…

✅Free!

❌Only a marginal improvement, won’t provide any guarantee under load, won’t reduce the number of Lambda or Gateway calls.

API Gateway’s cache

The API Gateway offers a caching feature that can cache the responses it returns.

Unlike DAX or Elasticache, no VPC needs to be set up, as the cache nodes are “owned” by the Gateway. In fact, you don’t have to select a number or type of nodes, only the total memory you need for the cache; everything else is handled by the Gateway.

The setup is fairly simple: you just need to choose a memory size, a retention period and flag the endpoints you want to be included in the cache.

The ApiGateway comes in two flavors: the RestApi and the HttpApi. The RestApi provides more features. The HttpApi is more performant and about three times cheaper. Only the RestApi supports caching.

The pricing is based on this memory size, with a minimum of 0.5GB for about $15 a month.

Compared to the previous solutions, this one offers an additional advantage: as the cache sits “earlier” in the call chain, each time it can answer a call, no Lambda function has to be invoked.

✅Fixed cost, reduces the number of Lambda calls, simple to set up

❌Requires using the RestApi flavor which is more costly than the HttpApi, moderate fixed cost

CloudFront

As for the static website content, a CDN (CloudFront here) could also be used to cache the data. In fact, from CloudFront’s perspective, caching an API or a website is mostly the same.

Using CloudFront would have additional benefits:

  • The data is cached closer to the user (even if that is not very useful for a service that only concerns a single city)
  • CloudFront can handle some other features, like content compression

Setting up the distribution is similar to what I’ve just done for the website static contents, with a few exceptions:

  • The TTL will be shorter (60 seconds)
  • I need to forward the header needed for CORS

Instead of using the in-line parameters, I will define a proper CachePolicy this time:

The pricing is more complex as it depends on the number and size of calls. But, this is where things get interesting: a call answered by CloudFront is cheaper than a call answered by the ApiGateway:

  • Call cost: HttpApi $0.035 per 10,000, CloudFront $0.012 per 10,000
  • Data transfer cost: HttpApi $0.09 per GB, CloudFront $0.085 per GB

Obviously, this is true only because my hit rate will be high and because CloudFront won’t have to call the underlying API Gateway very often (which would add cost).
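The dependency on the hit rate can be made explicit with the request prices quoted above. This sketch only looks at the per-call fees: it ignores data transfer as well as the Lambda and DynamoDb costs saved on each cache hit, so it understates the benefit. The function and constant names are made up for the illustration.

```typescript
// Request prices quoted in the article (USD per 10,000 calls).
const CLOUDFRONT_PER_10K = 0.012;
const HTTP_API_PER_10K = 0.035;

// Every call is billed by CloudFront; only cache misses are also billed by the HttpApi.
function requestCostPer10k(hitRate: number): number {
  return CLOUDFRONT_PER_10K + (1 - hitRate) * HTTP_API_PER_10K;
}

// Hit rate above which CloudFront in front of the HttpApi is cheaper than the HttpApi alone.
const breakEvenHitRate = CLOUDFRONT_PER_10K / HTTP_API_PER_10K;

console.log(requestCostPer10k(0.98).toFixed(4)); // prints "0.0127"
console.log(breakEvenHitRate.toFixed(2)); // prints "0.34"
```

Under these assumptions, adding CloudFront reduces the per-call cost as soon as about a third of the calls are answered from the cache. Below that, it makes each call more expensive.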

✅Cost savings, cost scales with usage

❌Setup is tedious, cost scales with usage

And the winner is…

From a cost perspective alone, CloudFront is the clear winner. And it won’t require rewriting any of my functions, which keeps the change simple.

If my API had been more complex (which would have generated a lower hit rate), I might have used DAX, as it would have offered the most robust, yet simple, solution.

Winner: 👑CloudFront👑

So, how well does it work?

Careful, the scale is logarithmic

As before, the first call takes about 3 seconds to finish. But all the following calls last only ~10ms. I think we can say this is successful 😊

Conclusion

Even if AWS wants you to think that Lambda will allow you to run code without thinking about servers or clusters®, it obviously becomes more complicated as soon as you have to improve performance.

But I see that as a good thing: it means that even within the walls of the serverless stack, I’m not constrained to a single solution when I’m facing a problem. I can leverage several tools and even rely on “non-serverless” tools like Elasticache.

This has been true for the specific subject of this article (performance) and for the others in this series.

It’s been more than a year since I started working on this project. It’s still a little bit wonky, the frontend design is basic but I achieved what I really wanted to do: understand what this “serverless” stuff was about, get more comfortable with AWS, and build a “real” application.

I hope you had as much fun reading this as I had working on it! If you want to look at some messy code, you can find the sources on GitHub: https://github.com/ouvreboite/velinfo

Saving costs was a primary objective, which will allow me to keep the application online without thinking much about money. I’m also happy to report that the team at Velib seems committed to improving their service. Hopefully, my application will soon become obsolete!

As for me, I opted for a more radical solution: I bought my own bike.🚴

  • Part 1: Choosing the AWS serverless stack for a prototype
  • Part 2: The backbone of a serverless app: Lambda functions and DynamoDb tables
  • Part 3: Implementing a real-time detection algorithm with Lambda functions and DynamoDb streams
  • Part 4: Creating a serverless API and hosting a frontend with S3
  • Part 5: Performance tuning for a Lambda-based API

CodeX
