Evaluation of Serverless Technologies at Jet


Serverless functions have been around for a few years and represent a new paradigm in cloud-based software engineering. This blog post focuses on enterprise adoption of serverless functions.

We are encouraged to keep an eye on new technology trends and adopt their usage at Jet. However, the adoption of any new technology requires a rigorous evaluation process. Therefore, the first step in our serverless journey was to define our evaluation criteria for serverless function runtimes. We came up with three groups of evaluation criteria: feature requirements, performance, and benefits of serverless function runtimes.

Feature Evaluation Criteria for Serverless Function Runtimes

The features we identified as requirements for a serverless runtime were based on the goal of minimising adoption barriers and maintenance cost. The following list was identified as part of our evaluation:

  1. Target Operating System: Our services run on both Linux and Windows machines. Ideally, therefore, the serverless runtime should run on both operating systems.
  2. Supported Languages: We have multiple teams using different tech stacks and programming languages. To get wide adoption of serverless within the company, it is important that our choice of serverless runtime supports all the languages used across different teams at Jet.
  3. Event Triggers: Support for various triggers: HTTP, Kafka, Azure Cosmos DB, Azure Blob Storage, etc. HTTP triggers need little justification, as they are probably the simplest triggers and apply to the widest range of use cases. The other important triggers for us to evaluate were Kafka triggers, as Kafka is our main streaming platform for asynchronous communication between microservices.
  4. Integration with Existing Infrastructure: All new microservices at Jet are deployed on Nomad and get transparent integration with Consul for service discovery, Vault for secrets management, Prometheus and Grafana for monitoring, and Splunk for log management. This means that if we can deploy our choice of serverless runtime on Nomad, we get all those integrations at almost zero cost.
  5. Complexity to Manage: A system requiring complex runtime dependencies could be difficult and costly to operate. We wanted to avoid any such serverless runtime unless there was a strong reason not to do so.
  6. Onboarding and Developer Tooling: Serverless as a new paradigm already demands a different way of thinking from developers. Ideally, we would like to provide a serverless-oblivious deployment pipeline for our developers, so that, if we decided to adopt a serverless tech stack, they could use existing tooling without the need to think differently.

We identified several non-managed open-source serverless function runtimes to evaluate: OpenFaaS, OpenWhisk, Knative, Kubeless, Fission, Fn Project, and Nuclio. Our selection was based on several criteria: the GitHub activity of each project, documentation, flexibility to extend, ecosystem, and the number of users in the industry. In terms of managed serverless function offerings, Azure Functions was the only candidate to consider, as our infrastructure is already built on top of Microsoft Azure.

The predefined evaluation criteria were enough to filter out most of the serverless runtimes quickly, and as a result we narrowed our list down to Azure Functions v2.x and OpenFaaS. The next two matrices highlight the main features of the previously mentioned serverless runtimes and the reasons they were filtered out.

Main Features of FaaS Runtimes
The Reasons for the FaaS Runtimes not Considered

Performance Evaluation Criteria for Serverless Function Runtimes

The main performance criteria for our evaluation were cold-start time and auto-scaler efficiency. These two are related, but are not the same thing. Let’s elaborate a bit more on this.

If we imagine a hypothetical scenario of warming-up a serverless function with 10 requests per second (req/s), the function runtime would need to allocate the required number of resources (VMs, containers, etc.) to process 10 req/s. However, if the next time the function runtime receives a significantly greater number of requests, for instance, 100 req/s, the runtime would need to allocate more resources, even though it has some warmed-up resources already. A cold-start from zero to five instances is not the same as a cold-start from zero to fifty instances.

The experiments in this article demonstrate that cold-start time depends on how fast the runtime can spin up the required number of resources and how fast it can scale up. The scale-up time itself depends on how fast the Docker images can be pulled onto the nodes (if it is a Docker-based system), how fast new resources can be allocated, and how efficiently the decision to scale up is made.

A cold-start time should always be associated with a number of requests per second.
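The relationship between request rate and the number of instances that must be started can be sketched as follows. This is only an illustration: the per-instance capacity of 2 req/s is an assumed value, chosen so that the numbers match the five-versus-fifty example above, and the function names are hypothetical.

```python
import math


def instances_needed(request_rate: float, per_instance_rate: float) -> int:
    """Instances required to serve request_rate if each one handles per_instance_rate req/s."""
    return math.ceil(request_rate / per_instance_rate)


def instances_to_start(request_rate: float, per_instance_rate: float, warm: int = 0) -> int:
    """Instances that still have to be cold-started when `warm` instances already run."""
    return max(0, instances_needed(request_rate, per_instance_rate) - warm)


# Assumed capacity of a single instance (hypothetical value).
PER_INSTANCE = 2.0

print(instances_to_start(10, PER_INSTANCE))           # 5: cold start for 10 req/s
print(instances_to_start(100, PER_INSTANCE))          # 50: cold start for 100 req/s
print(instances_to_start(100, PER_INSTANCE, warm=5))  # 45: warmed for 10 req/s, then hit with 100 req/s
```

Even a runtime that is already warm for the lower rate has to start most of the instances required by the higher one, which is why a cold-start figure is only meaningful together with its request rate.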

Evaluating the Benefits of Serverless Function Runtimes

The features and performance provided by a specific serverless function runtime do not matter if at the end of the day we cannot answer the question, “why serverless?”. Therefore, we decided to evaluate the benefits of serverless computing as claimed by the community: cost saving and increasing developer productivity.

Cost Saving

Proponents justify this promise by pointing out that serverless function runtimes can scale down to zero and scale up from zero on demand. This can potentially save some cost, so we attempt a rough cost estimation below.

Cost Estimation

Let’s assume we are going to redesign about 1000 microservices (about 30% of our microservices) as serverless functions. We believe 30% here is over-optimistic, as not all workloads fit into a serverless model and even if they do, it would require a huge effort to redesign all these services and integrate them together.

Let’s assume that our microservices have an average resource requirement of 0.5GHz of CPU and 1GB of memory. A CPU core in our case is approximately 2.4GHz, so 0.5GHz is about 20% of a single CPU core. This adds up to 500GHz of CPU and 1TB of total memory for 1000 microservices. All of this in turn is equivalent to about 30 Azure VMs of the Standard_F16 series, which would cost a minimum of $8,000 in total per month. We assume that those VMs are deployed in the EastUS2 region and run Ubuntu with managed standard HDD disks on a 3-year reserved plan. If we decided to use Windows VMs for the entire fleet, the cost would go up to $25,000 per month.
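The sizing arithmetic above can be reproduced with a short sketch. The VM shape used here (16 vCPUs and 32GB of memory for Standard_F16) is an assumption that is not stated in the text, and the function name is hypothetical.

```python
import math


def vms_required(services, cpu_ghz_per_service, mem_gb_per_service,
                 core_ghz, vm_cores, vm_mem_gb):
    """Return how many VMs the fleet needs by CPU, by memory, and overall."""
    total_cpu_ghz = services * cpu_ghz_per_service   # 500 GHz for 1000 services
    total_mem_gb = services * mem_gb_per_service     # 1000 GB, i.e. about 1TB
    cores_needed = total_cpu_ghz / core_ghz
    vms_by_cpu = math.ceil(cores_needed / vm_cores)
    vms_by_mem = math.ceil(total_mem_gb / vm_mem_gb)
    return {
        "vms_by_cpu": vms_by_cpu,
        "vms_by_mem": vms_by_mem,
        "vms": max(vms_by_cpu, vms_by_mem),  # the tighter constraint wins
    }


sizing = vms_required(
    services=1000,
    cpu_ghz_per_service=0.5,
    mem_gb_per_service=1.0,
    core_ghz=2.4,
    vm_cores=16,    # assumed Standard_F16 shape
    vm_mem_gb=32,   # assumed Standard_F16 shape
)
print(sizing)  # {'vms_by_cpu': 14, 'vms_by_mem': 32, 'vms': 32}
```

Under these assumptions memory, not CPU, is the binding constraint, and the result of 32 VMs is in line with the "about 30" estimate.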

Now, let’s try a similar estimation for the case where all of these 1000 microservices are redesigned as serverless functions. We can assume that each of the 1000 functions runs twice daily for an average of 2 minutes, totalling 4 minutes in 24 hours per function. Our experiments in the next sections show that in order to get a 100% success rate from an Azure function app with 200 req/s over a 2-minute period, the Azure Functions runtime would need to allocate about 100 servers. Servers here seem to be Windows Containers according to the Azure Functions Runtime documentation. During the evaluation we observed that the average committed memory for each server was about 200MB. Taking these numbers into account, we can calculate the total monthly bill for running 1000 Azure functions by following the Azure Functions pricing guideline.

Azure Resource Consumption Billing Calculation
--------------------------------------------------------------------
Resource Consumption (seconds) per Function per 2 Minutes:
Executions: 24,000 executions
Execution duration (seconds): 1 second
Resource consumption total: 24,000 seconds

Resource Consumption in GB:
200 MB * 100 servers / 1024 MB ~ 20 GB

Total GB-s per 2 Minutes per Function:
20 GB * 24,000 seconds = 480,000 GB-s
Total GB-s for 24 Hours (4 min) per Function:  960,000 GB-s
Total GB-s per Function in 30 Days:            28,800,000 GB-s
--------------------------------------------------------------------
Billable Resource Consumption
Resource consumption: 28,800,000 GB-s
Monthly free grant: - 400,000 GB-s
Total monthly consumption per application: 28,400,000 GB-s

Monthly Resource Consumption Cost per Function
Billable resource consumption: 28,400,000 GB-s
Resource consumption price: x $0.000016/GB-s
Total cost per application: $454.4
--------------------------------------------------------------------
Executions Billing Calculation
Total monthly executions: 1,440,000 executions
Monthly free executions: - 1,000,000 executions
Monthly billable executions: 440,000 executions

Monthly Executions Cost:
Monthly billable executions: 440,000 executions
Price per million executions: $0.20
Monthly execution cost: $0.088

Total Monthly Consumption Bill per Function:   $454.488
--------------------------------------------------------------------
Total Monthly Consumption Bill for 1000 Functions: $454,488

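This is about 18 times the $25,000 monthly cost of running the same workloads as microservices on Windows VMs, and about 57 times the $8,000 monthly cost on Linux VMs.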

Serverless functions can easily result in higher cost instead of saving costs!
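For readers who want to vary the inputs, the billing calculation above can be expressed as a small function. This is a sketch that mirrors the arithmetic of this post, including its assumption that all 100 allocated servers are billed; the function and parameter names are hypothetical, and the prices are the ones quoted above rather than current list prices.

```python
def monthly_bill_per_function(
    req_per_sec=200,          # load during a run
    run_seconds=120,          # each run lasts 2 minutes
    runs_per_day=2,
    days=30,
    exec_seconds=1.0,         # assumed duration of one execution
    memory_gb=20.0,           # 200 MB * 100 servers / 1024, rounded as in the post
    free_gb_s=400_000,        # monthly free grant
    price_per_gb_s=0.000016,
    free_executions=1_000_000,
    price_per_million_exec=0.20,
):
    """Monthly consumption bill for one function, following the post's arithmetic."""
    executions_per_run = req_per_sec * run_seconds                 # 24,000
    gb_s_per_run = memory_gb * executions_per_run * exec_seconds   # 480,000
    monthly_gb_s = gb_s_per_run * runs_per_day * days              # 28,800,000
    billable_gb_s = max(0.0, monthly_gb_s - free_gb_s)
    resource_cost = billable_gb_s * price_per_gb_s

    monthly_executions = executions_per_run * runs_per_day * days  # 1,440,000
    billable_executions = max(0, monthly_executions - free_executions)
    execution_cost = billable_executions / 1_000_000 * price_per_million_exec

    return resource_cost + execution_cost


per_function = monthly_bill_per_function()
fleet = per_function * 1000
print(f"per function: ${per_function:,.3f}")      # per function: $454.488
print(f"1000 functions: ${fleet:,.0f}")           # 1000 functions: $454,488
print(f"vs Windows VMs: {fleet / 25_000:.1f}x")   # vs Windows VMs: 18.2x
print(f"vs Linux VMs: {fleet / 8_000:.1f}x")      # vs Linux VMs: 56.8x
```

Changing `memory_gb`, `runs_per_day`, or `run_seconds` shows how sensitive the bill is to the assumed workload shape.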

The situation is a bit different when using a non-managed serverless runtime, such as OpenFaaS, on an existing Nomad (or Kubernetes) cluster. If the client nodes in a Nomad cluster are not 100% utilised, there is a high chance that the Nomad scheduler will find slots to run a function without new client nodes having to be auto-provisioned. Cost saving in this situation would be questionable if the company uses reserved virtual machine instances rather than a pay-as-you-go billing plan; whether you fully utilise a reserved VM instance or not, the cost is the same.

Using a dedicated Nomad cluster for OpenFaaS on a pay-as-you-go subscription plan may make sense in certain situations. However, the cost estimation should then take into account the cost of running and maintaining a separate Nomad cluster and the serverless function runtime itself. For OpenFaaS, this includes the instances of the OpenFaaS gateway, the Nomad plugin, the NATS server, faas-idler, and nats-queue-worker. In addition, we should be able to answer the following questions and include their cost in our estimation. Should we use our existing Prometheus cluster and Alertmanager for OpenFaaS? If yes, we should consider the added cost of storing OpenFaaS metrics in the existing Prometheus cluster. If not, we should take the cost of maintaining a new Prometheus cluster into account. How many instances of the Kafka controller should we run? Furthermore, we may have many other custom controllers, for Cosmos DB, Azure Blob Storage, and others.

Based on these rough estimates, we don’t see cost saving as the main reason to adopt serverless functions.

Improving Developer Productivity

This is one of the biggest promises of serverless functions; however, our evaluation shows that this claim holds mostly for small start-ups or for individuals who do not have established developer tooling, continuous integration/continuous deployment, and container orchestrators in place. As a matter of fact, if we take how our developers at Jet deploy and manage their services in production as an example, we would not see any significant benefit from a serverless function runtime in terms of infrastructure abstraction. Our microservices platforms already provide many layers of abstraction that hide most, if not all, infrastructure details from our developers. Most of the time, deploying a new microservice is a matter of pushing a deployment file into a version-control system. In addition, we have auto-scalers in place to automatically scale up the number of VMs as well as the number of containers. Usually, our developers do not need to think about the number of instances of their microservices running on Nomad, nor do they need to think about any failure that may happen to a VM or a container.

However, we do believe that serverless computing has the potential to improve developer productivity. This could be achieved by means of sophisticated controllers that abstract many of the IO challenges away from function developers. For instance, if we have a Kafka controller that takes care of message batching, retrying, and committing offsets, the function developer does not need to think much about the interaction with Kafka and can implement only the business logic. The same applies to other event sources, such as Cosmos DB, blob storage, and so on. Currently, no serverless runtime provides a wide range of production-ready controllers with this functionality.

Kafka Controllers in FaaS Runtimes
Don’t call it serverless yet, if you don’t have production ready controllers abstracting away IO interactions between different functions, between functions and external systems, and between functions and their triggers.
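To make the idea of such a controller more concrete, here is a minimal sketch of the division of labour described above. It is not any runtime's actual controller: `InMemoryLog` is a hypothetical in-memory stand-in for a Kafka partition, `run_controller` is a hypothetical name, and parking a batch in a dead-letter list after the last retry is an assumption added for the illustration.

```python
from typing import Callable, List, Sequence

Handler = Callable[[Sequence[str]], None]


class InMemoryLog:
    """Hypothetical stand-in for a Kafka partition: a list plus a committed offset."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.committed = 0  # offset of the next message to process

    def poll(self, max_batch: int) -> List[str]:
        return self.messages[self.committed:self.committed + max_batch]

    def commit(self, count: int) -> None:
        self.committed += count


def run_controller(log: InMemoryLog, handler: Handler,
                   max_batch: int = 3, max_retries: int = 2) -> List[List[str]]:
    """Drain the log in batches, retrying failures, and commit after each batch."""
    dead_letters: List[List[str]] = []
    while True:
        batch = log.poll(max_batch)
        if not batch:
            return dead_letters
        for attempt in range(max_retries + 1):
            try:
                handler(batch)  # the only code the function developer writes
                break
            except Exception:
                if attempt == max_retries:
                    dead_letters.append(batch)  # assumed policy: park and move on
        log.commit(len(batch))  # offsets advance only once the batch is settled


# Business logic that fails once, to exercise the retry path.
processed: List[str] = []
failures = {"remaining": 1}


def business_logic(batch: Sequence[str]) -> None:
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("transient failure")
    processed.extend(message.upper() for message in batch)


log = InMemoryLog(["a", "b", "c", "d"])
dead = run_controller(log, business_logic)
print(processed, log.committed, dead)  # ['A', 'B', 'C', 'D'] 4 []
```

The function developer supplies only `business_logic`; batching, retrying, and offset commits stay inside the controller.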

Experiments

We conducted some experiments to understand the performance of Azure Functions and OpenFaaS on Nomad. The main performance criteria we considered were the cold-start time of a single request and the cold-start time under continuous load for a short period of time.

The simple function application used in this evaluation takes an input string and outputs its bcrypt hash. It was implemented in C# on .NET Core for Azure Functions on the consumption plan, and in Go for OpenFaaS using the golang-http template.

The Azure Functions runtime version was 2.0.12246.0. We built OpenFaaS from source, as we had to make some small changes to override its metrics endpoint so that we could integrate it with our existing infrastructure without changing that infrastructure. Also, the default OpenFaaS alerting rule was tweaked to reflect our infrastructure. The OpenFaaS Nomad plugin was deployed on a Nomad cluster of 25 VMs running Ubuntu 16.04. The CPU and memory requirements of the Nomad OpenFaaS tasks were kept the same as in the Nomad plugin repo.

Experiments using Azure Functions

The first invocation of the function took 9567ms, about 76% of which was spent in server processing, which included warming up the function. Another 16% was spent in DNS lookup.

The First Invocation of the Azure Function

httpstat was used to generate the visual breakdowns of the HTTP requests.

Invoking the same function a second time right after the first invocation took about 85% less time. The decrease was mainly due to DNS caching and hitting the already warmed-up function. We are more interested in the server processing time here, which was about 91% less compared to that of the first invocation.

The Second Invocation of the Azure Function

In addition to single invocations, we performed load testing to investigate the behaviour of both serverless systems under load. The load testing was performed using the Vegeta HTTP load testing library. The Vegeta target and input file used in the experiments are public and can be used to reproduce the results. The success rates were obtained from Vegeta reports, and the number of instances for Azure Functions was obtained by continuously monitoring the Azure Live Metrics Stream. We repeated the entire load test until the success rate reached 100%.

Ideally, any kind of experimental study should be backed by a rigorous statistical analysis. We performed 30 separate cold-start load tests at a request rate of 100 req/s to calculate the confidence interval for the success rate. Because we have to wait about 25 minutes for an Azure function to fully cool down, performing similar experiments for the other request rates would take days. The 95% confidence interval for the 100 req/s experiments, calculated using SciPy’s Student’s t-distribution, was [0.17%, 0.51%] with a mean value of 0.34%.
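A confidence interval of this kind can be computed along the following lines. This is a sketch rather than the script used for the article: it relies only on the standard library and hard-codes the two-sided 95% critical value for 29 degrees of freedom (the value `scipy.stats.t.ppf(0.975, 29)` gives, rounded). The sample values are placeholders, not the measured success rates.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for n - 1 = 29 degrees of freedom.
T_CRIT_95_DF29 = 2.045


def t_confidence_interval(samples, t_crit):
    """Return (low, mean, high) for the mean of `samples` using a t critical value."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    half_width = t_crit * sem
    return mean - half_width, mean, mean + half_width


# Placeholder data: 30 made-up values, NOT the success rates from the experiments.
placeholder_rates = [float(i % 3) for i in range(30)]

low, mean, high = t_confidence_interval(placeholder_rates, T_CRIT_95_DF29)
print(f"mean={mean:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

With the real per-run success rates in place of the placeholder list, the same function yields an interval of the form reported above.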

Azure Functions Load Testing

A sample Vegeta report for an Azure Functions experiment looked as follows:

Azure Functions Vegeta Report

Another interesting observation from Azure Functions was that the average number of requests per second on each allocated server was 2. The committed memory was in the range of 159MB to 246MB, and the request durations were in the range of 2000–10000 ms according to Azure Application Insights metrics.

Also, it seems that Azure Application Insights does not capture most of the failures. It has a blade for failed-request metrics; however, when the success rate of load testing was 0.34% with 100 req/s at the first attempt, we saw no traces of failures within Azure Application Insights, only a few successes. The Vegeta reports showed different errors during the failures, including: “TLS handshake timeout”, “timeout awaiting response headers”, “no such host”, “write: no buffer space available”, “Too Many Requests”, and “Bad Gateway”.

Experiments using OpenFaaS on Nomad

The first invocation of the OpenFaaS function took 6556ms, about 80% of which was spent in server processing, which included warming up the function. Another 8% was spent in DNS lookup.

The First Invocation of the OpenFaaS Function

The second invocation of the same function showed behaviour similar to what we observed with Azure Functions; the total time decreased by about 82% after the first request.

The Second Invocation of the OpenFaaS Function

Overall, the cold-start times of OpenFaaS and Azure Functions were very close to each other with fewer than 20 requests per second; however, as we increased the number of requests per second, the success rate of Azure Functions dropped to zero at the first attempts. OpenFaaS performed much better at the first attempts: it took at most 2 attempts for OpenFaaS to fully warm up and reach a 100% success rate, whereas we had to run 3 attempts for Azure Functions to reach a 100% success rate. We believe the fundamental limitations of the Azure Web App sandbox are the cause of this poor performance of Azure Functions.

OpenFaaS Load Testing

We conducted the same kind of statistical analysis for the OpenFaaS function at a request rate of 100 req/s. The 95% confidence interval for the 100 req/s experiments was [93.82%, 97.31%] with a mean value of 95.56%.

A sample Vegeta report for an OpenFaaS experiment looked as follows:

OpenFaaS Vegeta Report

An astute reader may think that it is not a big deal to reach a 100% success rate in 6 minutes (Azure Functions) instead of 3 minutes (OpenFaaS). However, in a real application, the client would need to retry only the failed requests, which were a tiny fraction of the total number of requests with OpenFaaS. With Azure Functions, the application would have to retry almost all of the 24,000 requests (at 200 req/s).
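The retry burden can be estimated with a few lines of arithmetic. The text reports mean first-attempt success rates only for 100 req/s (0.34% for Azure Functions and 95.56% for OpenFaaS), so the sketch below uses those; the 2-minute run duration at that rate is an assumption, and the function name is hypothetical.

```python
def requests_to_retry(req_per_sec: int, duration_s: int, success_rate_percent: float) -> int:
    """Number of requests a client would have to retry after one load run."""
    total = req_per_sec * duration_s
    succeeded = round(total * success_rate_percent / 100)
    return total - succeeded


RATE = 100       # req/s, the rate for which mean success rates are reported
DURATION = 120   # seconds; assumed to match the 2-minute runs

azure_retries = requests_to_retry(RATE, DURATION, 0.34)
openfaas_retries = requests_to_retry(RATE, DURATION, 95.56)
print(azure_retries)     # 11959 of 12000 requests
print(openfaas_retries)  # 533 of 12000 requests
```

Under these assumptions, a client of the Azure function would retry more than twenty times as many requests as a client of the OpenFaaS function after the first cold run.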

Conclusion

Our research suggests that the serverless technology stacks currently available to us are not yet mature enough to deliver the cost savings and productivity improvements we were looking for compared to our current microservice strategy. We do, however, see the potential of serverless computing; we would like to see serverless runtimes go beyond functions and provide sophisticated function controllers that abstract away IO interactions between different functions, between functions and external systems, and between functions and their triggers. This idea is not new; we have already implemented and used similar abstractions in Jet’s Order Management System. Azure Durable Functions also provides a similar workflow engine. What these solutions lack is generality; ideally, the controllers should support multiple languages and pluggable event sources such as Kafka. We believe this would improve developer productivity significantly and help save costs.

Our next takeaway is that serverless functions are not going to solve a general computing problem. They need to be evaluated on a case-by-case basis. Think about the integration with your existing infrastructure and with your legacy services.

Should I use OpenFaaS or Azure Functions?

If you answer “yes” to the following questions, then it makes sense to use OpenFaaS or another non-managed serverless function runtime that meets your needs:

  1. Is Nomad or Kubernetes a main part of your infrastructure?
  2. Do you have multiple teams using different languages?
  3. Are you a heavy user of Kafka and are looking for a Kafka controller?
  4. Are you ready to extend existing controllers if they don’t satisfy your needs?
  5. Should your function runtime be able to handle a high number of requests?
  6. Would you or your company like to have full control of your function runtime?

If none of these is a concern for you, then you may choose Azure Functions for the following reasons:

  1. Easier integration with Azure services
  2. A powerful workflow engine, i.e. Azure Durable Functions
  3. Better developer-tooling such as integration with IDEs, local development and testing
  4. Support from Microsoft

Please check the OpenFaaS and Azure Functions docs to learn more and to keep up to date, as serverless technologies evolve quite fast.







Errata

2019-03-11: This blog post estimated the cost of running 1000 Azure Functions twice daily for an average of 2 minutes. That estimation assumed that the billable resource consumption should take the total number of allocated servers into account. However, according to the feedback we received, the billable resource consumption should take only the average resource usage of a function into account. If we did the calculation without taking the number of servers into account, the billable resource consumption for a function would be $4.60 and the bill for the executions would stay the same, i.e. $0.088. Therefore the total cost of a function per month would be about $4.69 and the total cost for 1000 Azure Functions would be about $4,690. However, this does not change the pricing conclusion if you already use reserved VM instances. Thanks to Mikhail Shilkov for the correction.