The Serverless Contract

It is common for a web-based service to provide a service level agreement (SLA) that specifies the level of up-time the provider strives for. The SLA constitutes the terms of the contract between the provider and the user. When the provider cannot meet the stated up-time guarantee, the user is typically entitled to some form of credit.

For functions-as-a-service (FaaS), also known as serverless functions, the traditional SLAs used for web services do not apply in the same way. With serverless functions, the user provides the code they wish to execute in response to a class of events, and the service provider orchestrates the execution of that code, on demand, when it receives the events.

An example of this is a REST API call: opening a URL in a browser, for instance, generates a GET request against the serverless provider. In response, the provider executes a handler for the request. This handler is the function specified by the user.

Let’s look at an example. I created a personal website using serverless functions. When you visit my web page rabbah.io, a function is executed to generate the HTML.


Expectation.

My expectation as a user of the serverless provider is that my function is executed “instantly” in response to the event. I use the term instantly loosely since there are network delays for the requests to reach the service provider. In a microservice model where I rent a virtual machine (or container) and run my own server to handle the request, I would expect that as long as my server is reachable, it will respond to your request in time T(f) which is the expected time to execute the function f, ignoring network delays to reach the server.

But in serverless computing, I have shifted the burden of provisioning and deploying a virtual machine and server, or more generally a resource for my function to execute in, to the provider. So the time to service the request is not just T(f) but must also account for the time it takes the provider to allocate and initialize the resources required to execute the function. Let’s call this second term L for the service latency. So the execution time of my function now is T(f) + L.

As a user, I want L to be “zero”, as in no additional delay in executing my function. So it is not enough that the service is “up” and can accept an incoming event, only to delay the execution of the corresponding handler for an arbitrary amount of time. Concretely, if you click on the example link above and the platform executes the request 30, 60, or 90 seconds later, your experience is not delightful and I am unhappy that my website, or REST API, is not performing well.

So an SLA for serverless computing should stipulate not only the up-time of the service but also what guarantees it provides with respect to the system overhead L. In other words, I want to know what percentage of the time the platform guarantees that it will service my request “instantly”. Realizing that instantly is impossible, I may accept, for example, a serverless contract which stipulates that 99% of the time the provider executes an event handler within 10ms, and maybe 99.9% of the time within 100ms.
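
To make the shape of such a contract concrete, here is a minimal sketch of how observed values of L could be checked against it. The function names, the contract representation, and the sample latencies are all made up for illustration; only the two thresholds (99% within 10ms, 99.9% within 100ms) come from the example above.

```python
from typing import Dict, List


def fraction_within(latencies_ms: List[float], threshold_ms: float) -> float:
    """Fraction of requests whose platform overhead L was at most threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_contract(latencies_ms: List[float], contract: Dict[float, float]) -> bool:
    """`contract` maps a threshold in ms to the minimum fraction of requests
    that must start executing within that threshold."""
    return all(
        fraction_within(latencies_ms, threshold) >= required
        for threshold, required in contract.items()
    )


# The example contract from the text: 99% within 10ms, 99.9% within 100ms.
CONTRACT = {10.0: 0.99, 100.0: 0.999}

# Hypothetical measurements of L (not real data): 995 fast starts,
# 4 slower starts, and 1 very slow start.
samples = [2.0] * 995 + [50.0] * 4 + [400.0]

if __name__ == "__main__":
    print(fraction_within(samples, 10.0))
    print(fraction_within(samples, 100.0))
    print(meets_contract(samples, CONTRACT))
```

Note that the contract is about L only. The function's own run time T(f) is the user's responsibility and is deliberately left out of the measurement.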


Over and Under (Provisioning).

From the service provider’s point of view, the difficulty in providing an SLA of the form “__% of the time a function will start to execute in __ milliseconds” is that this kind of guarantee dictates how much compute capacity it must provide and reserve so that, for the expected load, an incoming request is not queued in the system until resources are available. The longer a request is queued, the higher the value of L. Hence, another way of looking at the platform latency is with respect to the queuing delay and, more generally, Little’s Law.

If the platform can allocate resources with very low overhead, then L will be low. If the arrival rate is A requests per second and the drain rate is D requests per second, where draining a request means allocating a resource so that the function is ready to execute, then:

  • When D > A then L = 0 in steady state. This is ideal for the user, but the system is over-provisioned.
  • When D ≈ A then L = 0. This is the true ideal, but it is difficult to achieve since A, the offered load, is not static and will vary over time.
  • When D < A then L is proportional to the mismatch between D and A. The system is under-provisioned.
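
The three cases can be illustrated with a deliberately simple model, sketched below. It assumes requests arrive and drain at constant rates in one-second steps, which real traffic does not do; the function names and the rates in the usage example are invented for illustration.

```python
from typing import List


def simulate_backlog(arrival_rate: float, drain_rate: float, seconds: int) -> List[float]:
    """Return the number of queued requests at the end of each second.

    Assumption: every second `arrival_rate` requests (A) arrive and at most
    `drain_rate` requests (D) are handed a resource.
    """
    backlog = 0.0
    history = []
    for _ in range(seconds):
        backlog = max(0.0, backlog + arrival_rate - drain_rate)
        history.append(backlog)
    return history


def queuing_delay(backlog: float, drain_rate: float) -> float:
    """Seconds a newly arrived request waits behind `backlog` queued requests."""
    return backlog / drain_rate


if __name__ == "__main__":
    for a, d in [(100, 120), (100, 100), (120, 100)]:
        history = simulate_backlog(a, d, seconds=5)
        print(f"A={a} D={d} backlog={history} L={queuing_delay(history[-1], d)}s")
```

In this model the backlog stays at zero whenever D is at least A. When D < A it grows by A − D every second, so the delay seen by a new request keeps rising for as long as the mismatch lasts.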

When the system is under-provisioned and new requests arrive, the platform has to decide on a policy. For example:

  1. Reject requests until there is capacity in the system.
  2. Queue requests. In this case requests are accepted and wait for resources to free up. This is subject to the “hold time” or the expected execution time of functions that are already executing.
  3. Add new capacity.

Serverless providers must balance the expectations of the end users against resource allocation, which is directly related to the provider’s costs.
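
The three policies listed above can be sketched as a toy admission controller. This is an illustration only, not how any particular provider implements it: `Platform`, `Policy`, and the notion of a fixed number of execution slots are assumptions made for the example, and adding capacity is modeled as free and instantaneous, which it is not in practice.

```python
from collections import deque
from enum import Enum


class Policy(Enum):
    REJECT = "reject"
    QUEUE = "queue"
    ADD_CAPACITY = "add_capacity"


class Platform:
    """Toy model: `capacity` is how many functions can execute at once."""

    def __init__(self, capacity: int, policy: Policy) -> None:
        self.capacity = capacity
        self.policy = policy
        self.running = 0
        self.waiting = deque()

    def submit(self, request_id: str) -> str:
        """Admit a request and report what happened to it."""
        if self.running < self.capacity:
            self.running += 1
            return "started"
        if self.policy is Policy.REJECT:
            return "rejected"
        if self.policy is Policy.QUEUE:
            self.waiting.append(request_id)
            return "queued"
        # ADD_CAPACITY: grow by one slot and start immediately.
        # A real platform would pay a provisioning delay (and cost) here.
        self.capacity += 1
        self.running += 1
        return "started"

    def finish(self) -> None:
        """A running function completed; give its slot to a waiting request."""
        if self.running == 0:
            raise RuntimeError("nothing is running")
        if self.waiting:
            self.waiting.popleft()  # the slot is reused, so `running` is unchanged
        else:
            self.running -= 1
```

For example, with `Platform(capacity=1, policy=Policy.QUEUE)` the first `submit` returns `"started"` and the second returns `"queued"`; the queued request only gets a slot once `finish()` is called, which is the “hold time” described in the second policy.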


The Kubernetes Leap.

There are serverless function offerings today from all the major cloud vendors. There are also open source projects in the space, such as Apache OpenWhisk, Kubeless, and OpenFaaS, many of which delegate the resource allocation and management entirely to Kubernetes. The alignment with Kubernetes makes sense because functions — for the most part today — run inside containers.

Kubernetes, however, was not designed for resource allocation with the very low latency needed for (short-running) functions, nor was it designed to churn through the hundreds of millions of containers that a serverless functions provider might service on any given day.

It is not uncommon to wait many hundreds of milliseconds for a new container to be created, or even several seconds, especially when using Kubernetes as the resource manager.

Apache OpenWhisk is unique in this space, because it includes its own resource manager, which can bypass Kubernetes to deliver better performance (≤ 11 ms on average) when allocating resources for functions. There are two ways this is achieved:

  1. Resource Reuse. This optimization ensures that repeated executions of the same function reuse a previously allocated container for that function, if one is available. Reusing a container avoids potentially expensive function initialization, say for loading a framework and third party libraries. It also favors connection reuse, permits limited forms of “state caching”, and allows a JIT to run for applicable languages. By reusing containers, the resource manager significantly reduces the number of newly created containers.
  2. Resource Speculation. The system creates stem-cell containers that can be specialized for functions on demand, and replenishes stem-cells as they’re exhausted. The latency of creating a new container is hidden by speculating which stem-cell containers to create ahead of time.
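
The two optimizations can be sketched with a small in-memory pool. This is not Apache OpenWhisk’s implementation or API: the `ContainerPool` class, its method names, and the integer container ids are hypothetical, and replenishment is an explicit call here where a real system would do it in the background.

```python
from collections import defaultdict, deque
from itertools import count
from typing import Tuple


class ContainerPool:
    """Toy pool illustrating container reuse and stem-cell speculation."""

    def __init__(self, stem_cell_target: int) -> None:
        self._ids = count(1)
        self.stem_cell_target = stem_cell_target
        self.warm = defaultdict(deque)  # function name -> idle, initialized containers
        self.stem_cells = deque()       # created ahead of time, not yet specialized
        self.created = 0                # total containers ever created
        self.replenish()

    def _create(self) -> int:
        # Stands in for the slow step: actually creating a container.
        self.created += 1
        return next(self._ids)

    def replenish(self) -> None:
        """Top the stem-cell supply back up to its target size."""
        while len(self.stem_cells) < self.stem_cell_target:
            self.stem_cells.append(self._create())

    def acquire(self, function: str) -> Tuple[int, str]:
        """Return (container_id, kind), where kind is 'warm', 'stem-cell', or 'cold'."""
        if self.warm[function]:
            # Resource reuse: the function is already initialized in this container.
            return self.warm[function].popleft(), "warm"
        if self.stem_cells:
            # Resource speculation: creation was paid for ahead of time;
            # only specialization for this function remains.
            return self.stem_cells.popleft(), "stem-cell"
        # Nothing available: the request waits for a brand-new container.
        return self._create(), "cold"

    def release(self, function: str, container_id: int) -> None:
        """Keep the container around for the next run of the same function."""
        self.warm[function].append(container_id)
```

Only the `"cold"` path makes a request wait for container creation. The warm path skips both creation and function initialization, which is why reuse is tried first.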

There are many such scheduling and resource allocation optimizations in Apache OpenWhisk, which is why commercial multi-tenant offerings, such as IBM Cloud Functions and Adobe I/O Runtime, can provide stable and low response times for thousands of serverless functions users. See “Evaluation of Production Serverless Computing Environments” by Lee et al. from the recent Workshop on Serverless Computing, or the talk from Avner Braverman of Binaris at ServerlessConf circa 22m:30s, for independent empirical evaluations. Apache OpenWhisk continues to evolve and address the challenges of offering a large scale serverless functions platform.


“The future of Kubernetes is serverless,” wrote Kubernetes co-founder Brendan Burns, and it is quite evident that Kubernetes will evolve to embrace and address some of the challenges posed by functions. The continuing adoption of serverless functions, the convergence of functions and containers, and more generally the rise of serverless computing mean that it’s not just the future of Kubernetes that is serverless, but the future of the cloud. We are at the dawn of a new Cloud Computer.