Multithreading in AWS Lambda, Part 4: Massive Throughput and Cost Efficiency

JV Roig
7 min read · Feb 10, 2023


In this “multithreading in serverless” series, we dive into multithreading in AWS Lambda, looking at implementation, scaling, and even comparing multithreading vs multi-instance architectures.

Previously, in Parts 1–3, we looked at experimental data to show awesome multithreaded performance scaling in AWS Lambda using different memory size configurations, then discussed how we can easily implement such multithreaded processing in Lambda, and then compared multithreading vs multi-instance Lambda architectures.

Today, in Part 4, we’ll answer the question: is using multithreading actually more efficient and cost-effective than a standard multi-instance Lambda architecture? We’ll crunch the data gathered from the experiment described in Part 1 to find the answer.

Multithreading: Massive benefits for throughput and cost efficiency

Above, you can see the absolutely massive improvements in both throughput (a relative measure of the amount of work done per second) and cost efficiency (how much throughput/performance we get for every $ we spend) when you properly multithread Lambdas. This is from our CPU-intensive workload experiment, as described in Part 1 of this series.

What you see above is the cost and throughput comparison between a multithreaded Lambda and a Lambda that only does single-threaded processing (1 worker thread), using the most cost-efficient configuration for each.

In the arm64 architecture, 10,240MB (max Lambda memory) with 6 threads resulted in a peak throughput/$ of 3,111.58. The most efficient single-thread configuration used 1,792MB of memory, for a throughput/$ of only 669.90. That’s roughly 4.6x better cost efficiency due to multithreading! [3,111.58 / 669.90 ≈ 4.6]

In the x86 architecture, we see a lower but broadly similar cost-efficiency result, with the best configuration (the 5-threaded Lambda slightly edged out the 6-threaded one here) achieving 4.2x better throughput/$ than the best single-threaded Lambda. [830.99 / 198.84 ≈ 4.2]
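
If you want to sanity-check those ratios, the arithmetic is trivial (the figures are the peak throughput/$ numbers quoted above):

    # Cost-efficiency ratios from the peak throughput/$ figures quoted above
    arm64_ratio = 3111.58 / 669.90   # best multithreaded vs best single-threaded, arm64
    x86_ratio = 830.99 / 198.84      # best multithreaded vs best single-threaded, x86
    print(f"arm64: {arm64_ratio:.1f}x, x86: {x86_ratio:.1f}x")  # arm64: 4.6x, x86: 4.2x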

Of course, results like these are heavily workload-dependent, and the workload in the experiment from which we derived our data is incredibly CPU-intensive and parallelizable (as described in Part 1).

In the next section, I’ll show you how we derived the throughput and cost figures from our raw data.

Results in more detail

You can follow along with direct access to the spreadsheet I’ll use in the following discussion. Simply visit my GitHub repo containing multithreaded experiment results. This is the same link and data as described in Part 1.

From that repo, I used the spreadsheet “AggregatedResults.ods” to start figuring out the cost-efficiency advantages of multithreading vs multi-instance. In the “scratch_all” tab, you’ll find the side-by-side results (arm64 and x86 architectures) containing the Lambda memory config, the worker threads used, and the resulting processing time:

It’s hard to figure out the resulting cost-efficiency from just that, though. We need to add a few things there to help us get a quantitative view of the cost-efficiency of each configuration:

  • Since Lambdas are billed in GB-secs, we should create a column for GB-secs (which is simply our memory column, expressed in GB, multiplied by the average processing time column)
  • We need a reference cost for GB-secs. Since I ran the experiment in the ap-southeast-1 region, let’s use that region as the basis of our GB-secs cost. ($0.0000133334 / GB-sec for arm64, and $0.0000166667 / GB-sec for x86)
  • We can now also add a column for the total cost of execution of the Lambda configuration (which is just the GB-secs column, multiplied by the GB-secs reference cost)
  • Now, we need to create a throughput measure. Since our experiment creates a fixed amount of work for every Lambda, no matter the configuration (RAM / threads / architecture), we can simply standardize our throughput measurement as the amount of work done per second. This means our throughput is simply 1 (i.e., the amount of work done; we’re saying every Lambda did 1 unit of work) divided by the average processing time. This gives us a throughput measurement that means “how many units of work can be done per second”. In general, higher throughput = more work done per second = fewer bottlenecks in a system = better UX.
  • Finally, we can figure out how much we pay for the performance by creating a throughput / $ column, which is simply the throughput column divided by the total cost column (see the sketch right after this list).
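
To make that concrete, here’s a minimal sketch of the same math in Python. It assumes you’ve exported one of the result tabs to a CSV with columns named memory_mb, threads, and avg_time_s (the file name and column names are hypothetical; the formulas mirror the bullets above):

    import csv

    # arm64 price in ap-southeast-1; swap in 0.0000166667 for x86
    PRICE_PER_GB_SEC = 0.0000133334

    rows = []
    with open("arm64_results.csv") as f:  # hypothetical CSV export of the arm64 tab
        for r in csv.DictReader(f):
            memory_gb = float(r["memory_mb"]) / 1024   # memory column, in GB
            avg_time = float(r["avg_time_s"])          # average processing time
            gb_secs = memory_gb * avg_time             # billed GB-secs
            total_cost = gb_secs * PRICE_PER_GB_SEC    # $ total for one invocation
            throughput = 1 / avg_time                  # 1 unit of work per invocation
            rows.append({
                "memory_mb": r["memory_mb"],
                "threads": r["threads"],
                "gb_secs": gb_secs,
                "total_cost": total_cost,
                "throughput": throughput,
                "throughput_per_dollar": throughput / total_cost,
            })

    # Sort descending by cost efficiency, like the tabs in Throughput.ods
    rows.sort(key=lambda row: row["throughput_per_dollar"], reverse=True)
    for row in rows[:5]:
        print(row["memory_mb"], row["threads"], round(row["throughput_per_dollar"], 2))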

I did that separately for the x86 and arm64 datasets, placing them in separate new tabs in the spreadsheet. That gave me something like the following:

That’s a screenshot of the resulting arm64 (orange) and x86 (green) sheets, already sorted in descending order by throughput / $. (You might have noticed something weird: the arm64 results are cleaner and more predictable than the x86 ones. We’ll talk about that in the Caveats section below.)

You can see the additions based on our discussion above: a GB-sec reference cost field, and new columns for GB-secs, total cost ($ total), throughput, and throughput / $. (You’ll also notice a $ per thread column; that’s not relevant for our computations here, just something I was interested in seeing.)

From there, it’s easy to see how our multithreaded Lambdas, especially those with max threads and max memory, handily outperform single-threaded Lambdas.

This means that if you didn’t apply multithreading, and instead relied solely on a vanilla multi-instance Lambda architecture for this type of workload, you would end up getting far worse performance for every $ you spent.

I’ve added the throughput sheets in the GitHub repository linked above. If you want to view them yourself, look for the file “Throughput.ods”.

Caveats

Now, these results are absolutely amazing, but that’s pretty much a best-case scenario already. As we’ve discussed in Part 1, the experiment was designed to tease out threaded performance against different memory sizes, and so was explicitly made CPU-intensive.

Most normal workloads might not be so CPU-intensive. If you are merely receiving data from a web form and then invoking a database API to process the web form data, that’s an incredibly CPU-light task. It wouldn’t be that sensitive to bigger memory configs or explicit multithreading. That’s probably what happens most of the time if your Lambda is in a Jamstack-type workload.

Note that the absolute cost is a tad lower in the single-threaded versions (compare the “$ total” column values) than in the max-threaded ones. So if you want to squeeze every penny and don’t care how long processing takes (perhaps it’s an overnight batch process that isn’t time-sensitive and nothing else depends on it), you could feasibly choose not to care about throughput. If you are reading about Lambda optimization, though, that’s probably not your case, and performance and throughput are probably a key part of your workload’s user experience.

You might also have noticed that while the Lambda results for arm64 are clean (6 threads wins out against fewer threads, and more memory is almost always better), the x86 results are not so clean (5 and 6 threads seem to be mixed together at the top, and memory sizes are also a jumble). That looks a bit weird, but it speaks to the architectural differences between the CPUs powering these architectures. In the arm64 Lambdas (using AWS Graviton2 chips), each vCPU/core is an actual physical core. In the x86 Lambdas, each vCPU/core is a logical core, because these CPUs use SMT (simultaneous multithreading, also known as “hyper-threading” in Intel products). As a result, the arm64 Lambdas scale better and more predictably, while the x86 ones (SMT-based logical cores) scale less overall and with a lot more fuzziness, due to contention for physical core resources compared to the independent “real” cores in Graviton2.
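
One practical takeaway, as a hedged sketch rather than anything from the experiment code: instead of hard-coding a worker count, you can size it from the vCPUs the runtime reports. Keep in mind that os.cpu_count() reports logical cores, so on x86 two of them may share one physical core, and at smaller memory sizes the reported count doesn’t reflect how much CPU time you actually get.

    import os

    def default_worker_count() -> int:
        # Logical cores visible to the runtime; on x86 Lambdas these are SMT
        # threads, while on Graviton2 they map to physical cores.
        return max(1, os.cpu_count() or 1)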

Finally, note that all the cost talk here doesn’t take into account the cost of Lambda requests (just the GB-secs), so this isn’t a full comparison of the cost difference between multithreaded and (single-threaded) multi-instance Lambda architectures. I don’t think that’s worth worrying about, and by the time you reach the scale where it matters, you will most likely end up with a multithreaded + multi-instance setup anyway.

Wrap up

There you go, multithreading and its wonderful performance and cost efficiency boost!

If you are using Lambda for any sort of CPU-intensive processing, you should definitely consider implementing multithreading in it. And remember, while the experiment in this series shows a huge boost in performance and cost efficiency, that may not necessarily be the case for your own workload. You should benchmark your workloads to validate the efficiency boost you expect, using the design of the experiment here as a guide for your own benchmarking setup. View my experiment’s GitHub repo here.
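
If it helps, here’s a minimal sketch of what such a benchmark handler could look like in Python (this is not the series’ exact code; the CPU task, event fields, and iteration count are placeholders for your own workload). It uses Process and Pipe because, historically, Lambda’s execution environment lacks /dev/shm, which breaks multiprocessing.Queue and Pool:

    import json
    import time
    from multiprocessing import Pipe, Process

    def cpu_task(conn, iterations):
        # Placeholder CPU-bound work; swap in a slice of your real workload.
        total = 0
        for i in range(iterations):
            total += i * i
        conn.send(total)
        conn.close()

    def lambda_handler(event, context):
        workers = int(event.get("workers", 1))            # vary this per test run
        iterations = int(event.get("iterations", 5_000_000))

        start = time.time()
        procs, parents = [], []
        for _ in range(workers):
            parent_conn, child_conn = Pipe()
            p = Process(target=cpu_task, args=(child_conn, iterations))
            p.start()
            procs.append(p)
            parents.append(parent_conn)

        results = [conn.recv() for conn in parents]
        for p in procs:
            p.join()
        elapsed = time.time() - start

        # Return workers and wall-clock time so you can build your own spreadsheet
        return {
            "statusCode": 200,
            "body": json.dumps({"workers": workers, "seconds": elapsed,
                                "tasks_completed": len(results)}),
        }

Invoke it repeatedly with different memory sizes and worker counts, then feed the timings into the same GB-secs and throughput/$ math shown earlier.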

A note on doing experiments: As an AWS Ambassador, I get hooked up with a decent amount of credits, exactly so I can do cool stuff like that experiment above. I also personally collect lots of AWS credits — a technique and tip I shared in a previous article about how I took and passed 5 pro-level AWS Specialty Certification exams back-to-back in a 3-day marathon. If you are actively trying to increase your cloud skills, I recommend you implement those tips yourself so you can do hands-on practice and experiments without having to shell out real money for your AWS bill.

As usual, if you’ve found this article helpful or interesting, make sure to clap the article a few times and follow me to tell the algorithm to show you more stuff like this and get notified. Thanks and see you soon!

Other articles in this series:

