The largest benchmark of Serverless providers.
Serverless has been around for almost four years now, since Amazon introduced it at re:Invent at the end of 2014. In 2016, Google, Microsoft and IBM joined the party. While there is no big difference in pricing among these four major providers, there is a difference in delivered performance, as the following benchmark will show you.
At the end of the article you will find information on how you can help me collect more data to provide more thorough research and insights about Serverless.
To benchmark the Serverless offerings, I created a small test function that calculates Fibonacci number 39 and deployed it to the four providers using the Serverless Framework. This function is a good fit for measuring the computational performance of cloud functions because every call executes the same number of operations. I also created a second function that does some floating-point calculation (it repeatedly calculates a distance matrix) and returns the number of iterations completed in one second. The Fibonacci function seemed to be more reliable, so it was used to conduct the tests.
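To make the idea concrete, here is a minimal sketch of such a test function. It is written in Python purely for illustration; the deployed function is not shown in this article, so the names `fib` and `handler` and the shape of the returned payload are assumptions.

```python
import time


def fib(n: int) -> int:
    """Naive recursive Fibonacci: deliberately unoptimized, so every call
    with the same n performs the same number of operations."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def handler(n: int = 39) -> dict:
    """Hypothetical function body: run the calculation and report
    how long the calculation itself took, in milliseconds."""
    start = time.perf_counter()
    result = fib(n)
    end = time.perf_counter()
    return {"n": n, "result": result, "calc_ms": (end - start) * 1000.0}
```

If you try this locally, start with a smaller `n` (for example `handler(25)`), since the naive recursion grows exponentially and Python is slow at it.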
In addition to the actual calculation time of the Fibonacci number, I tracked the start and end time, the roundtrip time as seen from the client, and the overhead, which is the roundtrip time minus the actual calculation time. The functions were called from a Node.js process that uses a promise queue to control concurrency. This means that, rather than sending a batch of 50 requests, waiting for all of them to finish and then sending the next batch, the promise queue issues the next request as soon as an earlier one returns. So when I tested a concurrency level of 50, 50 requests were active at all times.
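The constant-concurrency idea can be sketched as follows. This is not the Node.js promise queue used for the benchmark; it is a Python `asyncio` approximation of the same behavior, and `run_with_concurrency` and `fake_invoke` are hypothetical names. `fake_invoke` stands in for a real HTTP call and simply returns a pretend calculation time.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List


async def run_with_concurrency(
    invoke: Callable[[], Awaitable[float]],
    total_requests: int,
    concurrency: int,
) -> List[Dict[str, float]]:
    """Keep `concurrency` requests in flight until `total_requests` were issued.

    `invoke` must return the calculation time (ms) reported by the function.
    """
    results: List[Dict[str, float]] = []
    remaining = total_requests

    async def worker() -> None:
        nonlocal remaining
        # Each worker starts its next request as soon as its previous one
        # returns, so no worker ever waits for a whole batch to finish.
        while remaining > 0:
            remaining -= 1
            started = time.perf_counter()
            calc_ms = await invoke()
            roundtrip_ms = (time.perf_counter() - started) * 1000.0
            results.append({
                "calc_ms": calc_ms,
                "roundtrip_ms": roundtrip_ms,
                # Overhead = roundtrip time minus actual calculation time.
                "overhead_ms": roundtrip_ms - calc_ms,
            })

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


async def fake_invoke() -> float:
    """Hypothetical stand-in for a real request to a cloud function."""
    await asyncio.sleep(0.01)  # pretend the whole roundtrip takes about 10 ms
    return 1.0                 # pretend 1 ms of that was calculation time
```

Calling `asyncio.run(run_with_concurrency(fake_invoke, 1250, 50))` would mimic the shape of the concurrent test runs described later, with 50 requests active until all 1250 have been issued.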
Before we look at the first plot, let’s look at the pricing and how calculation power is determined at the four providers. In the following table you can see the pricing models.
The configurable parameter for all Serverless functions is the memory size. At AWS, the CPU is doubled when the memory is doubled. At Google this is also the case, but only from 128 to 512 MB of memory; for 1 GB and 2 GB you get a little less additional CPU. Google actually provides concrete numbers for the CPU power: for memory sizes of 128, 256, 512, 1024 and 2048 MB you get 200, 400, 800, 1400 and 2400 MHz respectively. IBM lets you choose between three memory sizes from 128 to 512 MB, but the CPU is fixed. Azure automatically determines the memory needs of the function, but assigns all functions the same computational power.
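The Google numbers make the flattening easy to check. The short Python sketch below only uses the memory and MHz figures quoted above; the names `GCF_CPU_MHZ` and `mhz_per_mb` are mine.

```python
# Memory size (MB) -> CPU allocation (MHz) for Google Cloud Functions,
# as quoted in the paragraph above.
GCF_CPU_MHZ = {128: 200, 256: 400, 512: 800, 1024: 1400, 2048: 2400}


def mhz_per_mb(memory_mb: int) -> float:
    """CPU allocation per MB of configured memory."""
    return GCF_CPU_MHZ[memory_mb] / memory_mb


for memory_mb, mhz in GCF_CPU_MHZ.items():
    print(f"{memory_mb:>5} MB -> {mhz:>4} MHz ({mhz_per_mb(memory_mb):.2f} MHz/MB)")
```

Up to 512 MB the ratio stays at about 1.56 MHz per MB; at 1024 MB it drops to about 1.37 and at 2048 MB to about 1.17.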
This plot shows the time required to calculate Fibonacci number 39 on different instance sizes at the four providers. Each configuration was executed 25 times, except AWS and GCF at 1024 MB, Azure, and IBM at 128 MB, which were executed 500 times to give more precise data for the following plots.
As expected, the calculation time at AWS halves when the memory size is doubled, but at around 1792 MB we see diminishing returns. This is because a second CPU is allocated to bigger instance sizes, and the benchmark function only uses one CPU. We will look at this point again in the next plot. Azure delivers the same computation power for all calls. The computation power delivered by Google Cloud Functions varies more, as you can see from the data points being more spread out, but generally meets expectations. We can also see that the increase in computation power tails off for the two biggest instance sizes. IBM delivers constant computation power, except for some outliers that take longer.
This plot shows the calculation time multiplied by the memory size. A higher value means that the same calculation consumes more memory-time (MB × ms), which is bad, because that is roughly what you pay for. Here you can see again that beyond an instance size of around 1792 MB at AWS the value increases, which means that we get less bang for the buck. So if you are going to choose an instance size purely for (single-core) computation and want the most calculation power for your money, you would choose 1792 MB at AWS, 512 MB at Google Cloud Functions and 128 MB at IBM. Of course, this neglects the increasing overhead per function call for smaller instance sizes.
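As a sketch of how this metric picks an instance size, the Python snippet below multiplies calculation time by memory size and returns the size with the lowest product. The function names are hypothetical, and the timings in `EXAMPLE_CALC_MS` are made-up placeholders chosen only to show the mechanics; they are NOT measurements from this benchmark.

```python
from typing import Dict


def cost_proxy(calc_ms_by_memory: Dict[int, float]) -> Dict[int, float]:
    """Calculation time multiplied by memory size (MB * ms) per instance size."""
    return {mb: mb * ms for mb, ms in calc_ms_by_memory.items()}


def best_value(calc_ms_by_memory: Dict[int, float]) -> int:
    """Instance size (MB) with the lowest MB * ms product."""
    proxy = cost_proxy(calc_ms_by_memory)
    return min(proxy, key=lambda mb: proxy[mb])


# Made-up placeholder timings (memory in MB -> calculation time in ms).
# They only imitate the described shape: time roughly halves when memory
# doubles, then stops improving once a second, unused CPU is added.
EXAMPLE_CALC_MS = {512: 2920.0, 1024: 1455.0, 1792: 825.0, 3008: 820.0}

print(best_value(EXAMPLE_CALC_MS))
```

With these placeholder values the largest size has by far the highest product, because the extra memory is paid for without shortening the single-core calculation.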
Next we look at the distribution of calculation time for requests at a concurrency level of 1. The following plots each use 500 data points. If you have never seen a histogram before, this is how it works: you create bins, which are ranges of values. For example, if you have ratings from 1 to 9 points, ratings from 1 to 3 land in the “bad” bin, ratings from 4 to 6 land in the “medium” bin and ratings from 7 to 9 land in the “good” bin. The x-axis shows the range each bin covers, and the y-axis shows how many items are in each bin.
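In code, binning boils down to rounding each value down to the lower edge of its bin and counting. The following Python sketch illustrates this; the function name `histogram` and the default bin size of 10 ms are assumptions, not the settings used for the plots.

```python
from collections import Counter
from typing import Dict, Iterable


def histogram(values_ms: Iterable[float], bin_size_ms: int = 10) -> Dict[int, int]:
    """Map each bin's lower edge (ms) to the number of values in that bin.

    A bin with lower edge e covers the range e <= value < e + bin_size_ms.
    """
    counts = Counter(int(v // bin_size_ms) * bin_size_ms for v in values_ms)
    return dict(sorted(counts.items()))


# Four made-up calculation times, for illustration only.
print(histogram([1441, 1449, 1450, 1472]))
```

The two values between 1440 and 1450 ms share one bin, while 1450 and 1472 each land in their own bin.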
95% of the requests at AWS have a calculation time of 1470ms or less with most having a calculation time of 1440 to 1450 ms. Only 5% take up to 10% more calculation time. So generally you get a predictable amount of computation power.
Google Cloud Functions: 52% of the requests have a calculation time within 200ms of the average. 32% are computed faster and 16% receive less computation power (up to 40% less). These values are spread out a lot more than in the previous plot, so the computation power you get can vary.
Most of the requests at IBM OpenWhisk show little variation in computation power, except for some outliers that randomly receive as little as a quarter of the computation power.
As you can see, the computation power at Azure only varies within a range of 270ms, which is at most 7% of the average.
The conclusion of this comparison is that at AWS, IBM and Azure the allocation of computation resources for non-concurrent requests varies only a little (below 10%), while at Google it varies more (30%).
What happens if we need more than one Serverless instance at once? This should be a common case as soon as an application built on Serverless receives a little more traffic. Let’s look at what happens to the allocated computation resources when we request 50 instances at once. For the following plots, 1250 data points were collected.
AWS: The average calculation time for non-concurrent requests was 1448ms. Here we can observe that two piles have formed. The first one is roughly what we expected: 55% of the requests have a calculation time approximately 150ms higher than at the non-concurrent level. The other 45% of the requests receive approximately half the computation power we would expect. We can clearly see that performance degrades under concurrency.
On average the computation time increased by 46%.
Note: I switched to a bin size of 50, to have a better view of the data.
GCF: For non-concurrent requests the average computation time was 2743ms. 14% of the requests take over 3400ms, which means that they are more than 25% slower. 5% of the requests take less than 1900ms, which means they’re more than 50% faster. The remaining 81% are not much slower or faster than at the non-concurrent level, which means that Google handles the allocation of computation resources to concurrent requests better than AWS.
On average the computation time increased by 7%.
IBM: For non-concurrent requests the average computation time was 1429ms. 31% of the requests are handled in under 1800ms, which is roughly the level we would expect. 8% of the requests are between 20% slower and twice as slow as expected. The remaining 61% are between two and five times as slow as expected. This is a really bad decline in computation power when using concurrent Serverless resources.
On average the computation time increased by 154%.
Azure: For non-concurrent requests the average computation time was 2408ms. 7% of the requests received at least 25% less computation power. The remaining 93% are distributed around the level we would expect.
On average the computation time increased by 3%.
The Serverless computation performance massively degrades at AWS (46%) and IBM (154%) for concurrent requests. Google (7%) and Azure (3%) are able to handle this far better.
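The percentage figures are plain relative increases over the non-concurrent average. The Python sketch below shows the arithmetic and, going the other way, the concurrent averages implied by the numbers quoted above. Because the percentages are rounded, the implied averages are approximations, not measured values; the function names are mine.

```python
def percent_increase(baseline_ms: float, concurrent_ms: float) -> float:
    """Relative increase of the concurrent average over the baseline, in percent."""
    return (concurrent_ms - baseline_ms) / baseline_ms * 100.0


def implied_concurrent_avg(baseline_ms: float, increase_percent: float) -> float:
    """Concurrent average implied by a baseline and a percentage increase."""
    return baseline_ms * (1.0 + increase_percent / 100.0)


# Non-concurrent averages (ms) and increases (%) as quoted in the text.
QUOTED = {
    "AWS": (1448, 46),
    "GCF": (2743, 7),
    "IBM": (1429, 154),
    "Azure": (2408, 3),
}

for provider, (baseline_ms, increase) in QUOTED.items():
    approx = implied_concurrent_avg(baseline_ms, increase)
    print(f"{provider}: about {approx:.0f} ms on average at concurrency 50")
```

For AWS, for example, a 46% increase on 1448ms implies an average of roughly 2114ms under concurrency.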
Now we will look at the overhead (remember, this is all the time not spent on the actual calculation). This is an important measure because it can severely impact service levels.
AWS: 88% of the requests have an overhead of 100ms or less. 9% have an overhead of 100–200ms, 2% of 200–300ms. The last 1% can have an overhead of up to 1.2s. The average overhead is 87ms.
GCF: 71% of the requests have an overhead of below 100ms, 9% of between 100–200ms, and 3% of between 200–1000ms. The remaining 16% have an overhead of 1 to 12 seconds. The average overhead is 702ms.
IBM: 86% of the requests have an overhead of below 100ms, 5% of between 100–200ms and 4% of between 200 and 1000ms. The remaining 5% have an overhead of 1s to 6.6s. The average overhead is 335ms.
Azure: 16% of the requests have an overhead of 500 to 1000ms, 9% have an overhead of 1 to 2s, 48% of 2 to 4s, and the remaining 27% have an overhead of 4 to 12s. The average overhead is 3097ms.
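Breakdowns like the ones above can be produced by sorting each request's overhead into ranges and reporting the share per range. Here is a minimal Python sketch; the function names and the default range edges (100, 200 and 1000 ms) are assumptions for illustration, and the sample values in the usage note are made up.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucket_shares(
    overheads_ms: Sequence[float],
    edges_ms: Sequence[float] = (100.0, 200.0, 1000.0),
) -> List[float]:
    """Percentage of requests per overhead range.

    With the default edges the ranges are:
    below 100, 100 to below 200, 200 to below 1000, and 1000 ms or more.
    """
    if not overheads_ms:
        raise ValueError("need at least one overhead value")
    counts = [0] * (len(edges_ms) + 1)
    for value in overheads_ms:
        # bisect_right returns how many edges are <= value,
        # which is exactly the index of the range the value falls into.
        counts[bisect_right(edges_ms, value)] += 1
    return [100.0 * c / len(overheads_ms) for c in counts]


def average(values_ms: Sequence[float]) -> float:
    """Arithmetic mean of the overhead values."""
    return sum(values_ms) / len(values_ms)
```

For the made-up sample `[50, 80, 150, 250, 1500]`, `bucket_shares` yields 40% below 100ms and 20% in each of the other three ranges, and `average` yields 406ms. A few slow requests pull the average far above what most requests experience, which is worth keeping in mind when reading the averages above.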
Only AWS Lambda has a reasonable overhead (87ms) for concurrent requests (50). The overhead of IBM OpenWhisk (335ms) is high, the overhead of Google Cloud Functions (702ms) is even higher, and the overhead of Azure Functions (3097ms) is abysmal.
All providers seem to struggle with delivering consistent performance, especially when concurrency comes into play. Overall, AWS Lambda seems to have the best performance, mainly because of its reasonable overhead; the overhead of the other providers makes them pretty much unusable for applications with concurrent requests. Maybe AWS Lambda still has the competitive edge because it started two years before the other services. Overhead is an important metric for all kinds of requests, not only compute-intensive workloads: 3 seconds of additional overhead (looking at you, Azure) may kill the user experience.
Disclaimer and future work
This data is only a snapshot: it was collected at a single point in time rather than continuously, so it might not be as representative as continuously collected data would be. Still, I think it is capable of showing general trends.
To improve the quality and validity of this data, I want to create a service that continuously collects data by requesting serverless functions every hour, every day, and provides the recent data on a dedicated website. This will cause quite some costs for the serverless requests, so if you or your company (talk to your boss!) want to support me in collecting this data, I plan on setting up a Patreon or OpenCollective account for that. For now, if you’re interested, just leave your email here and I will contact you as soon as the project starts.
So if you found this research, or at least some part of it, useful, please think about supporting the work.