Serverless Benchmark 2.0 — Part I

Initial release documentation and explanation for the new serverless benchmark at https://serverless-benchmark.com/.

You may have come across my first benchmark on Medium, which I released half a year ago. Over the last few weeks I have been working on a newer and better version of this benchmark. It aims to address the shortcomings of the first version: the methods of measuring requests have been improved, and more rigor has been applied to both data collection and evaluation.

First of all, the benchmark is now continuous: the actual benchmark is executed every hour (for now), and the data aggregated over a longer span of time is used for the evaluation. The evaluation is continuous as well: when you visit serverless-benchmark.com, the evaluation is based on all data, including recent benchmark runs.

This post aims to explain roughly how the benchmark operates, to make it as transparent as possible.

Collecting data

For now, five offerings are benchmarked: AWS Lambda, Google Cloud Functions, Azure Functions, IBM Cloud Functions and Cloudflare Workers. For each provider the same function is deployed in different configurations (where possible). The function either calculates a bcrypt hash with a configurable number of salt rounds, which is mainly CPU-intensive, or idles for a random time between a set minimum and maximum, to simulate waiting for an API. The functions have been deployed to data centers as close to each other as possible (EU central/west): Frankfurt, Germany for AWS, IBM and Cloudflare (the latter automatically through edge deployments), St. Ghislain, Belgium for Google, and the Netherlands for Azure. The server from which the functions are invoked is provided by DigitalOcean and is located in Frankfurt as well. I chose DigitalOcean for this purpose so that no benchmarked provider is favored by having the server in one of its own data centers.
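To make the two workloads more concrete, here is a minimal Python sketch. It is not the deployed code: the handler shape, the event fields (`mode`, `rounds`, `min_ms`, `max_ms`) and the function names are assumptions for illustration, and PBKDF2 from the standard library stands in for bcrypt so the example runs without extra packages.

```python
import hashlib
import os
import random
import time


def cpu_workload(rounds: int) -> float:
    """CPU-bound stand-in. Assumption: PBKDF2 replaces bcrypt to stay stdlib-only."""
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", b"benchmark", os.urandom(16), rounds)
    return (time.perf_counter() - start) * 1000.0


def idle_workload(min_ms: float, max_ms: float) -> float:
    """Idle for a random time between min_ms and max_ms, like waiting for an API."""
    start = time.perf_counter()
    time.sleep(random.uniform(min_ms, max_ms) / 1000.0)
    return (time.perf_counter() - start) * 1000.0


def handler(event: dict) -> dict:
    """Hypothetical entry point: run one workload and report its computation time."""
    if event.get("mode") == "cpu":
        computation_ms = cpu_workload(event.get("rounds", 10_000))
    else:
        computation_ms = idle_workload(event.get("min_ms", 10), event.get("max_ms", 50))
    return {"computation_ms": computation_ms}
```

Calling `handler({"mode": "idle", "min_ms": 5, "max_ms": 10})` returns a dictionary with the measured computation time in milliseconds, which is the value the function would send back with its response.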

The functions are invoked hourly with 10 repetitions each (for now) at concurrency levels of 1, 25 and 50 (so 500 calls at a concurrency of 50) from a Node.js server on said DigitalOcean droplet. I use the time feature of the request/request library to accurately determine the time between the start of the request and the arrival of the first bytes of the response. This time will be called roundtrip from now on. The serverless functions themselves keep track of the time they took to handle the workload. This time is called computation time and is returned with the response. Timings and other data are saved to a MongoDB database on the server to be evaluated for the website.
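The invocation pattern can be sketched roughly as follows. This is a Python illustration, not the Node.js code that actually runs: `fake_invoke` is a hypothetical stub replacing the HTTP call, the batching (10 rounds of `concurrency` parallel calls) is my reading of the numbers above, and the stopwatch here covers the whole call rather than only the time to the first bytes of the response.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_invoke() -> Dict[str, float]:
    """Hypothetical stand-in for an HTTP call to a deployed function."""
    start = time.perf_counter()
    time.sleep(0.002)  # pretend workload inside the function
    return {"computation_ms": (time.perf_counter() - start) * 1000.0}


def timed_call(invoke: Callable[[], Dict[str, float]]) -> Dict[str, float]:
    """Measure the roundtrip around one invocation."""
    start = time.perf_counter()
    response = invoke()
    roundtrip_ms = (time.perf_counter() - start) * 1000.0
    return {"roundtrip_ms": roundtrip_ms, "computation_ms": response["computation_ms"]}


def run_benchmark(invoke: Callable[[], Dict[str, float]],
                  concurrency: int,
                  repetitions: int = 10) -> List[Dict[str, float]]:
    """Fire `repetitions` batches of `concurrency` parallel calls."""
    results: List[Dict[str, float]] = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(repetitions):
            futures = [pool.submit(timed_call, invoke) for _ in range(concurrency)]
            results.extend(f.result() for f in futures)
    return results
```

With `run_benchmark(fake_invoke, concurrency=50)` this produces 10 × 50 = 500 records, matching the call count mentioned above.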

Evaluating data

Depending on caching, the data is aggregated on the server every hour or so to be displayed on the website. I calculate the 0th to 100th percentiles for the plots, as well as the average, for the different timings and supply them through an endpoint, together with data about the benchmark. For the currently displayed metrics, only data from warm-invoked functions is used; cold start data is collected and will be used later.
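As an illustration of this aggregation step, here is a small Python sketch. The nearest-rank percentile definition and the output field names are assumptions on my part; the site may use a different interpolation method.

```python
from statistics import mean
from typing import Dict, List


def nearest_rank(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile (assumed method): smallest value with at least p% at or below it."""
    n = len(sorted_values)
    rank = max((p * n + 99) // 100, 1)  # integer ceil(p * n / 100), at least 1
    return sorted_values[rank - 1]


def aggregate(values: List[float]) -> Dict[str, object]:
    """Summarize one timing series: count, average and percentiles 0 to 100."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "average": mean(ordered),
        "percentiles": [nearest_rank(ordered, p) for p in range(101)],
    }
```

In this sketch index 0 of `percentiles` is the minimum and index 100 is the maximum, so the median sits at index 50.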

For now I decided to release only the timings for the overhead, which is defined as roundtrip minus computation time, to compensate for differences in computing resources. Other timings, as well as cold starts, will follow as soon as possible.
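In code, the overhead calculation is a simple subtraction per request. The record layout below (`roundtrip_ms`, `computation_ms`, `cold_start`) is a hypothetical one chosen for this Python sketch, not the actual MongoDB schema.

```python
from typing import Dict, List


def overheads_ms(records: List[Dict]) -> List[float]:
    """Overhead = roundtrip - computation time, for warm invocations only."""
    return [
        r["roundtrip_ms"] - r["computation_ms"]
        for r in records
        if not r["cold_start"]  # cold starts are collected but not displayed yet
    ]


# Made-up sample records, purely for illustration.
sample = [
    {"roundtrip_ms": 120.0, "computation_ms": 100.0, "cold_start": False},
    {"roundtrip_ms": 900.0, "computation_ms": 100.0, "cold_start": True},
    {"roundtrip_ms": 135.5, "computation_ms": 100.0, "cold_start": False},
]
warm_overheads = overheads_ms(sample)
```

Subtracting the computation time means a provider is not penalized or rewarded for how fast its CPU handles the workload; what remains is the time spent on everything around the function itself.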

Screenshot of the average section

On serverless-benchmark.com, the average section shows the average overhead in ms for each provider, as well as the number of data points collected for this provider and concurrency. The concurrency can be changed with the dropdown in the top-right corner.

Screenshot of the percentiles section
What does percentile mean?
If the 90th percentile is 200 ms, 90 out of 100 requests were served in 200 ms or faster, and 10 out of 100 requests required more than 200 ms.

In the percentiles section you can see the median, the max, and the 90th and 99th percentiles. A plot with all percentiles from 0 to 99 is also available; hover over the graph to see the corresponding overhead time for a percentile. The plot only goes up to the 99th percentile, because the 100th percentile (the max) contains all the extreme outliers and would reduce the resolution of the other percentiles.

I will not suggest which metric is most important for your use case; you have to decide that for yourself.

Wrapping up

You can find a roadmap of planned features on the page. If you have suggestions for metrics and graphs that might be useful, or concerns about the methods used, hit me up here or on Twitter.
Also, sorry that the site is not usable on small devices for now.