Multithreading in AWS Lambda, Part 1: Performance Scaling

JV Roig
7 min read · Jan 20, 2023


AWS Lambda (and serverless and Functions-as-a-Service (FaaS) in general) is the best way to run code. Most of the operational overhead is not your responsibility (i.e., no servers to administer). What’s not super obvious, though, is how to scale performance in Lambda using multithreading.

In this new serverless series, we’ll dive into multithreading in AWS Lambda, looking at implementation, scaling, and even comparing multithreading vs multi-instance architectures.

Today, in part 1, I’ll provide experimental data to show how multithreaded performance scales in Lambda.

TL;DR: What memory size for X number of threads?

Here are the summarized results from this experiment, showing the memory size at which a Lambda function reaches max performance for a given number of threads (workers):

- One thread = 2048MB

- Two threads = 4096MB

- Three threads = 5376MB

- Four threads = 7168MB

- Five threads = 8960MB

- Six threads = 10240MB (Lambda max memory size)

Now, if you’re wondering how to interpret this and how we got the data in that table, read on!

Quick refresher: AWS Lambda and core counts

While you don’t have to worry about servers in Lambda, you do still have to worry about a few important things (though only lightly, nothing like administering a real server). Available compute resources, in the form of compute cores and memory, are one of those things.

In AWS Lambda, while you can directly specify the amount of RAM your function will have (currently from 128MB all the way to 10240MB, in 1MB increments), you can’t do the same for CPU cores.

Instead, as you increase memory, you also get a proportional amount of CPU power. IIRC, back when Lambda only had a 4GB RAM limit, there used to be an official graph in the documentation that showed you when you get an additional core. I can’t find it anymore. (Maybe my Google-fu is failing; if you still know where this is, please link in the comments!)
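As a quick, concrete illustration of that single knob, here's a minimal boto3 sketch of bumping a function's memory. The function name is a placeholder, and this isn't code from the experiment itself:

```python
# Minimal sketch (hypothetical function name): memory is set directly,
# and the CPU share you get scales along with it.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="my-hashing-function",  # placeholder name
    MemorySize=2048,                     # in MB; more memory also means more CPU
)
```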

So that’s exactly what we’ll uncover here in Part 1: at what RAM amounts does our Lambda function get to enjoy the power of additional cores? To find out, we need a CPU-intensive load, give it varying amounts of RAM from 128MB up to 10GB, and also vary the number of workers each time (i.e., how multithreaded it is) from 1 to 6 (6 being the current maximum number of cores you can get in a Lambda function).

The experiment: CPU-intensive load through password hashing

In a nutshell, to generate data for Lambda multithreading scaling, this is what I did:

  • Created Python code that does CPU-intensive (not memory-intensive) password hashing operations using the Python standard library, and turned that into a Lambda function (a rough sketch of what such a handler might look like follows this list).
  • The number of workers (threads) is controlled by an environment variable. It ranges from 1 to 6.
  • There are 41 different memory sizes, starting from 128MB, then 256MB, then incrementing by 256MB until it reaches the maximum Lambda memory size of 10240MB (i.e., 128, 256, 512, 768, 1,024… all the way up to 10,240)
  • There are 2 architectures tested, x86 and ARM (through AWS Graviton), the two architectures supported by Lambda.
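
To make that setup more concrete, here's a minimal sketch of what such a handler could look like. This is my own illustrative code, not the actual code from the experiment (that's in the GitHub repo linked at the end); the names WORKERS, HASHES_PER_WORKER, and PBKDF2_ROUNDS are my own placeholders. It leans on hashlib.pbkdf2_hmac, which is implemented in C and releases the GIL while it runs, so plain threads can genuinely keep multiple cores busy for this kind of workload.

```python
# Hypothetical sketch of a CPU-bound, multithreaded hashing handler.
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Number of worker threads comes from an environment variable (1 to 6).
WORKERS = int(os.environ.get("WORKERS", "1"))
HASHES_PER_WORKER = 1000   # arbitrary amount of work per thread
PBKDF2_ROUNDS = 100_000    # CPU cost per hash

def _hash_batch(worker_id: int) -> int:
    """Run a batch of CPU-heavy PBKDF2 password hashes and return the count."""
    salt = os.urandom(16)
    for i in range(HASHES_PER_WORKER):
        password = f"password-{worker_id}-{i}".encode()
        # pbkdf2_hmac is C code that releases the GIL, so these threads
        # can actually run in parallel on separate cores.
        hashlib.pbkdf2_hmac("sha256", password, salt, PBKDF2_ROUNDS)
    return HASHES_PER_WORKER

def lambda_handler(event, context):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        total = sum(pool.map(_hash_batch, range(WORKERS)))
    elapsed = time.perf_counter() - start
    return {
        "workers": WORKERS,
        "hashes": total,
        "cores_reported": os.cpu_count(),
        "proc_time_seconds": round(elapsed, 3),
    }
```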

In total, covering all possible combinations of those architectures, workers, and memory sizes, 492 different Lambdas were created, all doing the same password hashing work, just with different parameters (a sketch of how such a fleet of functions might be provisioned follows the list below):

  • Lambda 1 would have 128MB RAM, 1 worker thread, and use the ARM architecture.
  • Lambda 2 would have 256MB RAM, 1 worker thread, and use the ARM architecture.
  • And so on, ending with Lambda 492, which would have 10,240MB of RAM, 6 worker threads, and use the x86 architecture.
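
As promised above, here's a rough sketch of how a fleet like that could be provisioned with boto3. This is illustrative only, not the experiment's actual provisioning code; the function naming scheme, runtime, role ARN, and deployment package are all placeholders.

```python
# Hypothetical sketch: create one function per configuration (2 x 6 x 41 = 492).
import boto3

lambda_client = boto3.client("lambda")

MEMORY_SIZES = [128] + list(range(256, 10241, 256))   # 41 sizes: 128, 256, 512, ... 10240
WORKER_COUNTS = range(1, 7)                           # 1 to 6 worker threads
ARCHITECTURES = ["arm64", "x86_64"]

with open("function.zip", "rb") as f:                 # placeholder deployment package
    package = f.read()

for arch in ARCHITECTURES:
    for workers in WORKER_COUNTS:
        for memory in MEMORY_SIZES:
            lambda_client.create_function(
                FunctionName=f"hash-bench-{arch}-{workers}w-{memory}mb",
                Runtime="python3.9",                                      # assumed runtime
                Role="arn:aws:iam::123456789012:role/lambda-bench-role",  # placeholder role
                Handler="app.lambda_handler",
                Code={"ZipFile": package},
                MemorySize=memory,
                Architectures=[arch],
                Timeout=300,
                Environment={"Variables": {"WORKERS": str(workers)}},
            )
```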

Here’s a table to help you visualize the experiment parameters better. I can’t list all 492 Lambdas for space reasons, but this shot of the table with partial results should give you a clear idea of what we’re dealing with here:

On the left is the table for the ARM architecture, on the right is the table for the x86 architecture. In either table, you’ll see that the experiment runs through all 41 memory sizes, and then workers. You’ll see that after the single worker configuration has gone through all possible memory sizes for both architectures, it then starts over again from 128MB, but with an additional worker thread this time.

In total, that’s 492 different configurations (architectures x workers x RAM sizes).

Of course, it wouldn’t be a rigorous experiment if we only ran each of those Lambdas once to benchmark them. Instead, I ran them continuously for a few hours, triggering each of them every minute through EventBridge Scheduler. By the time I manually stopped the experiment, each Lambda had executed 190 times (about 6.5 hours of total experiment time). The figures you see under “avg_proc_time” are the averages of all 190 executions for each of those Lambdas.
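
For reference, here's a hedged sketch of how a per-minute trigger like that can be set up with EventBridge Scheduler through boto3. The names and ARNs are placeholders, and this isn't necessarily how the experiment itself was wired up.

```python
# Hypothetical sketch: one EventBridge Scheduler schedule per benchmark function,
# each invoking its Lambda every minute.
import boto3

scheduler = boto3.client("scheduler")

def schedule_every_minute(function_arn: str, name: str, role_arn: str) -> None:
    scheduler.create_schedule(
        Name=name,
        ScheduleExpression="rate(1 minute)",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": function_arn,   # the benchmark Lambda to invoke
            "RoleArn": role_arn,   # role allowing Scheduler to invoke the function
        },
    )

# Example usage for one of the 492 functions (placeholder ARNs):
# schedule_every_minute(
#     "arn:aws:lambda:us-east-1:123456789012:function:hash-bench-arm64-1w-128mb",
#     "hash-bench-arm64-1w-128mb-every-minute",
#     "arn:aws:iam::123456789012:role/scheduler-invoke-role",
# )
```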

Results

The table above shows the summarized and snipped results. There are 492 Lambdas total in this experiment (246 for x86, 246 for ARM), and I only show the most relevant ones here, so only about two dozen Lambdas appear for each architecture.

I removed the majority of the rows (those that don’t show the best scaling for the given number of worker threads), keeping only a couple of rows in between each blue row.

I highlighted in blue the memory configuration that resulted in the best scaling for the number of worker threads:

  • The first blue row (at 2048MB) gives max performance for a single worker thread.
  • The second blue row (at 4096MB) gives max performance for two worker threads.
  • The third blue row (at 5376MB) gives max performance for three worker threads.
  • The fourth blue row (at 7168MB) gives max performance for four worker threads.
  • The fifth blue row (at 8960MB) gives max performance for five worker threads.
  • The sixth blue row (at 10240MB) gives max performance for six worker threads (and you literally can’t have more at the moment, as that is the maximum RAM in Lambda right now).

You probably also noticed a column there that wasn’t in the earlier screenshot: “cores”. That’s the OS-reported number of cores that the Lambda instance sees (in Python, that would be through an os.cpu_count() call). But as you can see from the performance data (the average processing time metric, under the “avg_proc_time” column), just because the OS reports a specific number of cores doesn’t mean your Lambda gets to use those cores in full. Notice that the blue rows are NOT where the OS first sees an extra core available (i.e., max performance for 3 workers isn’t reached at the point where the OS first reports 3 cores).

If you are interested, I’ll leave a GitHub link at the end of the article where you can download raw results (all 492 configurations) so you can inspect the data in more detail.

Wrap up

Hopefully, you found this exercise and the results interesting, and hopefully it gives you some motivation to figure out multithreading for your own serverless functions.

In Part 2, we’ll dive deeper into how to implement multithreading in Lambda, using code from this experiment as an example and guide. Stay tuned!

Of course, running this experiment wasn’t free. Almost 500 Lambda functions, with an average memory size of 5GB, an average runtime of >12 seconds, and 190 executions each (almost 100K total)… ouch. In total, I spent >$100 on this experiment (including paying for a mistake that ran overnight before I discovered it, stopped it, fixed the code, and then did a successful run for a few hours; that’s not a bug, that’s literally what happens during R&D). It’s all good though, because I have lots of AWS credits to spare:

  • As an AWS Ambassador, I got hooked up with a decent amount of credits, exactly so I can do cool stuff like this.
  • I also personally collect lots of AWS credits — a technique and tip I shared in a previous article about how I took and passed 5 pro-level AWS Specialty certification exams back-to-back in just 3 days. So even if you aren’t an AWS Ambassador and don’t have easy access to AWS credits c/o the program, you can easily gather AWS credits worth hundreds of dollars per year.

Finally, if you are interested in looking at the raw data from this experiment yourself, here’s my GitHub repo for it. Have fun, and see you again soon for Part 2!
