Two missing links in Serverless Computing: Stateful Computation and Placement Control

by Ion Stoica and Devin Petersohn

Serverless computing is rapidly gaining popularity due to its ease of programmability and management. Many see it as the next general-purpose computing platform for the cloud [4]. However, while existing serverless platforms have been successful in supporting several popular applications, such as event processing and simple ETL, they fall short of supporting latency- and throughput-sensitive applications such as streaming and machine learning (ML). The main challenge stems from the gap between the performance required by these applications — typically deployed on virtual machines (VMs) — and the performance of existing serverless platforms.

In this post, we argue that to bridge the performance gap to traditional VM-based solutions, serverless platforms need to add support for:

  1. stateful computation, and
  2. communication- and data-aware function placement.

These are 10x features; that is, each of them can improve the latency and/or throughput of an application by 10x or more.

To validate these claims, we added these two features to Ray — a general-purpose system that previously provided only a serverless-like abstraction. After adding stateful computation and communication- and data-aware function placement, Ray was able to support new throughput- and latency-sensitive applications that are not possible with existing serverless abstractions.

Sizing the performance gap

Serverless platforms allow developers to run functions in parallel by transparently distributing and executing these functions over large clusters. Typically, these functions are stateless, i.e., each function reads its input from shared storage (e.g., S3), performs its computation, and writes its output back to the shared storage so that other functions can consume it.
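
For concreteness, below is a minimal sketch of such a stateless function in Python. It uses boto3 against S3; the bucket name, key names, and the trivial "doubling" computation are hypothetical placeholders rather than any particular platform's API.

```python
# Sketch of a stateless cloud function: read input from shared storage (S3),
# compute, and write the output back so a downstream function can consume it.
# Bucket/key names and the computation itself are illustrative placeholders.
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Read the input object produced by an upstream function.
    obj = s3.get_object(Bucket="my-pipeline-bucket", Key=event["input_key"])
    records = json.loads(obj["Body"].read())

    # Perform the stateless computation (here, a trivial transformation).
    result = [r * 2 for r in records]

    # Write the output back to shared storage for the next stage.
    s3.put_object(
        Bucket="my-pipeline-bucket",
        Key=event["output_key"],
        Body=json.dumps(result).encode(),
    )
    return {"output_key": event["output_key"]}
```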

While serverless computing relieves the developers from managing a cluster, it comes with significant performance overhead [4]:

  • Data transfer: Today, cloud functions read and write data from a shared storage, such as S3. With AWS Lambda, arguably the most mature serverless offering, a cloud function (lambda) can read data from S3 at 70–80 MB/s and write data to S3 at around 50 MB/s. Furthermore, the latency of accessing S3 can reach tens of ms. Both the throughput and the latency are orders of magnitude worse than accessing local memory or even a local SSD. These performance discrepancies disproportionately affect latency- and throughput-sensitive applications (e.g., ML).
Figure 1. Read/write throughputs from/to S3 of AWS lambdas. The mean write throughput is ~50 MB/s, while the read throughput has two peaks, one at ~70 MB/s and one at ~85 MB/s. (Results thanks to Eric Jonas, http://ericjonas.com/.)
  • Startup time: Cloud functions can take several seconds, and in some cases tens of seconds, to start. Startup time consists of (1) scheduling and starting the cloud function, and (2) downloading the application's software environment. While it is possible for a cloud function itself to start in under a second [3], downloading the software environment can take much longer. For example, a Python program might require downloading hundreds of MBs of library and environment dependencies before starting, which might take tens of seconds.
  • Communication cost: Cloud functions may need to transfer 10x or more data than an equivalent virtual machine (VM) based solution. This is because serverless platforms provide no way for applications to optimize the placement of cloud functions. As an example, Figure 2 shows the aggregation communication pattern — a pattern often generated by SQL aggregation queries or distributed SGD — for both VM-based and function-based solutions. By packing K functions per VM, the VM-based solution transfers K times less data than the serverless solution, where typically K = 10–100 (a back-of-the-envelope comparison follows the figure caption below).
Figure 2: (a) One cloud function aggregating the results from N other functions. (b) The same aggregation pattern where the N functions run on N/K virtual machine (VM) instances, each instance running K functions. The VM-based solution transfers K times less data. More importantly, the function aggregating the results receives K times less data.
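
The K-fold saving is easy to see with a back-of-the-envelope calculation. The numbers below (1,000 functions, 100 functions per VM, 10 MB per output) are assumed purely for illustration.

```python
# Back-of-the-envelope comparison of the two aggregation patterns in Figure 2.
# N functions each emit S bytes; the VM-based layout packs K functions per VM
# and combines their outputs locally before sending one partial result per VM.
N = 1000              # number of functions (assumed)
K = 100               # functions packed per VM (assumed)
S = 10 * 2**20        # bytes produced by each function: 10 MB (assumed)

serverless_bytes = N * S        # the aggregator receives all N outputs
vm_bytes = (N // K) * S         # the aggregator receives one partial per VM

print(serverless_bytes / vm_bytes)  # -> 100.0, i.e., K times less data transferred
```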

Is there an easy fix?

Recently, there have been several proposals to improve the performance of serverless platforms and cloud function services [2,4]. These proposals include:

  1. Faster shared storage systems (e.g., memory or NVRAM backed systems).
  2. Reducing the function startup time by keeping the cloud function warm.
  3. Direct communication between functions.

Although these proposals can significantly improve the performance of the existing cloud functions, we believe that they are not enough to support new workloads, such as ML training, model serving, and streaming.

First, as we will see next, even if the shared storage system is in-memory, accessing data remotely from this storage is still much slower than accessing the same data from local memory, and slower still than accessing the data from the on-chip memory of a hardware accelerator such as a GPU. Second, while keeping the function warm certainly helps, the startup time is still on the order of hundreds of ms [7]. Finally, while direct communication obviates the need to read/write from/to the shared storage, it does not control where functions are placed, and thus does not address inefficient communication patterns such as the one shown in Figure 2.

To further illustrate the insufficiency of these improvements, we next describe our experience with Ray, a general-purpose distributed system we built at UC Berkeley.

Ray tasks — “best-case” performance for serverless platforms

Ray started as a task-oriented framework, where tasks are stateless functions running remotely, similar to a serverless platform. Since multi-tenancy was not an original goal of Ray, we were able to push beyond the performance improvements mentioned above. As a result, we argue that the performance of Ray is a practical upper bound on the performance achievable by current serverless architectures. Ray employs a host of optimizations to significantly reduce startup latency and improve the data transfer throughput between functions.

  • In-memory shared storage: To improve the throughput and latency of reading a function’s inputs and writing its outputs, we implemented an in-memory storage engine. Functions on the same machine share data through shared memory, which avoids copying. This enables a function to read/write data from/to shared storage at a rate of several GB/s, which is orders of magnitude faster than accessing S3-like storage (see Figure 1).
  • msec-level startup time: With Ray, the code of a function and its environment are distributed to all nodes of the cluster eagerly, when the function is declared, rather than when the function is invoked. As a result, a function’s startup time is minimal. Executing a no-op function takes around 300 microseconds on the same machine and around 1 ms on a remote machine (these results were obtained on a cluster running in AWS). This is orders of magnitude faster than the time it takes to run a cloud function on existing platforms, which can be 100 ms or more [7]. (A minimal timing sketch follows this list.)
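
As a minimal illustration of the numbers above, the sketch below times a no-op Ray task. It assumes a local `ray` installation; the exact latency depends on the machine and the Ray version.

```python
# Time a no-op Ray task after a warm-up call, to illustrate the msec-level
# startup path described above. Numbers vary by machine and Ray version.
import time

import ray

ray.init()  # start (or connect to) a local Ray instance

@ray.remote
def noop():
    pass

ray.get(noop.remote())  # warm up: the function definition is shipped once

start = time.perf_counter()
ray.get(noop.remote())  # schedule, execute, and fetch the (empty) result
elapsed = time.perf_counter() - start
print(f"no-op round trip: {elapsed * 1e6:.0f} microseconds")
```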

While these design decisions and their implementation provide massive improvements over existing cloud functions — larger than the recently proposed techniques described in the previous section — we found that they were not enough to support several applications, such as ML training and streaming.

Stateless functions are not enough

One of the primary workloads originally targeted by Ray was distributed machine learning and reinforcement learning (RL) applications. These workloads quickly exposed the limitations of the function abstraction alone, many of which are also shared by other applications, such as streaming and databases [2].

Inefficient GPU training

In many cases, we want to train the same model on the same data but starting from a different initial state. With a function abstraction, the model weights and training data would need to be moved to the GPU for every training episode. Unfortunately, even if the data is in local RAM (rather than on disk or in remote storage), transferring it to the GPU can take a nontrivial amount of time. This is because the GPU can typically only read from main memory via PCI Express, which currently maxes out at 32 GB/s. This is a far cry from the internal memory bandwidth available in today’s GPUs, which starts at around 500 GB/s. With the stateless function abstraction, we cannot keep the state in GPU memory between training episodes to maximize performance.
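
The sketch below, which assumes PyTorch and a CUDA-capable GPU are available, contrasts the two styles: paying the host-to-GPU copy over PCIe on every episode versus keeping the tensors resident in GPU memory across episodes.

```python
# Contrast: re-copying weights/data to the GPU every episode (the stateless
# pattern) vs. keeping them resident in GPU memory (the stateful pattern).
# Assumes PyTorch with a CUDA device; sizes are arbitrary.
import torch

weights = torch.randn(1024, 1024)   # lives in host RAM
data = torch.randn(4096, 1024)

def episode_stateless():
    w = weights.cuda()              # host-to-GPU copy over PCIe, every call
    x = data.cuda()                 # host-to-GPU copy over PCIe, every call
    return (x @ w).sum().item()

# Copy once, then reuse the GPU-resident tensors across episodes.
w_gpu = weights.cuda()
x_gpu = data.cuda()

def episode_stateful():
    return (x_gpu @ w_gpu).sum().item()   # no PCIe traffic on the hot path
```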

Figure 3. A simulator example. (a) Every action is implemented as a stateless function, which reads the simulator’s state from the shared storage, performs one simulation step, and writes back the updated state. (b) The same setting implemented by a stateful operator. The simulator’s state is internal, so there is no need to transfer the state. (c) A closed-source simulator that doesn’t expose its state. In this case we have no choice but to implement the simulator as a stateful operator.

Serialization/deserialization overhead

In Ray, the read/write performance of task inputs/outputs is limited to around 8 GB/s, which is 10x lower than accessing memory directly. This is despite the fact that data is stored in shared memory on the same machine and we use Apache Arrow as a data format for fast serialization and deserialization. Figure 3 illustrates this problem in the case of running a simulator, a common workload for RL applications. A natural solution would be to run a task for each simulation step, as shown in Figure 3(a). At each step, a task would (1) initialize the simulator with the current state, (2) apply an action based on some policy, and (3) read the simulator’s state (to be used at the next step) and possibly a reward. However, this requires reading/writing the simulator’s state from/to shared storage and paying a significant serialization/deserialization overhead. Assume the simulator state is 80 MB, the read/write throughput to shared storage is 8 GB/s, and it takes 1 ms to update the state. It will then take 10 ms to read the state and 10 ms to write it back, 20x more than updating it! In contrast, if we use a stateful operator which stores the state locally, it takes just 1 ms to perform a simulation step (see Figure 3(b)).
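
The sketch below contrasts Figures 3(a) and 3(b) in Ray. The `Simulator` class is a hypothetical stand-in holding roughly 80 MB of state; the point is that the task version ships (and copies) that state through the object store on every step, while the actor version keeps it inside one process.

```python
# Figure 3(a) vs. 3(b) in Ray, with a hypothetical stand-in Simulator.
import numpy as np
import ray

ray.init()

class Simulator:
    """Stand-in simulator with ~80 MB of state, updated in place by step()."""
    def __init__(self):
        self.state = np.zeros(10_000_000)   # 10M float64 values ~= 80 MB

    def set_state(self, state):
        # Arrays fetched from Ray's object store are read-only, so copy them.
        self.state = np.array(state)

    def get_state(self):
        return self.state

    def step(self, action):
        self.state += action                # cheap in-place update
        return float(self.state[0])         # stand-in "reward"

@ray.remote
def step_task(state, action):
    # (a) Stateless: rebuild the simulator from the serialized state each step,
    # then serialize the updated state again on the way out.
    sim = Simulator()
    sim.set_state(state)
    reward = sim.step(action)
    return sim.get_state(), reward

@ray.remote
class SimulatorActor:
    # (b) Stateful: the simulator lives inside the actor process.
    def __init__(self):
        self.sim = Simulator()

    def step(self, action):
        return self.sim.step(action)        # the ~80 MB state never moves

# (a) Every step round-trips ~80 MB through the object store.
state = Simulator().get_state()
state, reward = ray.get(step_task.remote(state, 1))

# (b) Only the action and the reward cross process boundaries.
actor = SimulatorActor.remote()
reward = ray.get(actor.step.remote(1))
```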

Lack of support for closed-source simulators

If the simulator is closed source (e.g., StarCraft 2), we do not even have access to the simulator’s full state. As a result, we are forced to treat the simulator as a black box. At every step, we apply an action and, instead of reading the simulator’s state, we read an external observation of the simulator’s internal state. A typical observation is the image of the screen after each action. Unfortunately, since we are not able to initialize the simulator with the current state, we can no longer use the task abstraction for simulation. In this case, wrapping the simulator in a stateful operator is the only viable solution. Figure 3(c) illustrates this use case.

Lessons from Ray: finding the missing links in serverless

Stateful Operators

As discussed above, while stateless computations are elegant and easy to reason about, they can incur significant overhead even when the data and the computation are on the same machine (e.g., the data is stored in local RAM and the computation runs on the GPU). To support a broader range of applications, we see no other choice but to co-locate data and computation and add support for stateful computations, e.g., actors. Actors encapsulate mutable state, which enables them to avoid costly state transfers between successive operations on the same actor.

In Ray, actors allow us to perform efficient training, support interactive query processing, and work with proprietary simulators. For example, in the case of training, the network weights and the training data are encapsulated as the actor’s state. A training episode then reduces to a method invocation on that actor. Thus, a new training episode only needs to reinitialize the model weights, which is much cheaper than reading/writing and serializing/deserializing them. With both actor and task abstractions, Ray supports a broader set of applications with better performance than current serverless platforms.
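
A minimal sketch of this pattern is shown below. The numpy "model" and its update rule are toy stand-ins rather than real training code; what matters is that the data and weights live inside the actor, so an episode is just a method call and a new episode only re-initializes the weights.

```python
# Training as a Ray actor: data and weights stay in the actor process, so
# each episode is a method call. The "model" is a toy numpy stand-in.
import numpy as np
import ray

ray.init()

@ray.remote
class Trainer:
    def __init__(self, num_features=128):
        self.data = np.random.randn(100_000, num_features)   # loaded once
        self.weights = np.random.randn(num_features)

    def reset(self):
        # Starting a new episode only re-initializes the weights; the data
        # is never re-read, re-serialized, or re-transferred.
        self.weights = np.random.randn(len(self.weights))

    def train_episode(self, lr=0.01, steps=100):
        for _ in range(steps):
            grad = self.data.T @ (self.data @ self.weights) / len(self.data)
            self.weights -= lr * grad      # toy gradient step
        return float(np.linalg.norm(self.weights))

trainer = Trainer.remote()
print(ray.get(trainer.train_episode.remote()))   # one episode = one method call
ray.get(trainer.reset.remote())                  # cheap: no data movement
```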

Placement Control

As illustrated above, the lack of placement control can lead to an order of magnitude lower performance when large amounts of data need to be shared across cloud functions. There are two generic approaches to address this problem:

  • Low-level API. In this approach, the serverless platform provides flexible, low-level mechanisms that enable developers to implement the desired policies at the application level. This is the approach taken by Ray, which gives the application the ability to define logical resources and then associate functions with those resources [5]. For example, this allows the application to schedule two functions on the same node by having them specify the same resource (see the sketch after this list). This allowed us to implement efficient communication patterns that match the VM-based solutions (e.g., Figure 2(b)), high-performance SGD, and complex applications such as AlphaGo.
  • Declarative API. In this approach, the serverless platform exposes an API that lets applications specify their preferences, such as the “n Choose k” model proposed in the TetriSched work [6]. In this model, users can specify that a job can use any k out of an equivalent set of n resources.
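
As a sketch of the low-level approach, the snippet below uses Ray's custom resources to co-locate a producer and an aggregator. The resource label "rack0" is an arbitrary name chosen for illustration; on a real cluster it would be attached to specific nodes when they are started (e.g., via `ray start --resources`).

```python
# Co-locating functions via a custom logical resource. "rack0" is an
# arbitrary label; here it is attached to the local node for demonstration.
import ray

ray.init(resources={"rack0": 2})   # single-node demo: tag this node as "rack0"

@ray.remote(resources={"rack0": 1})
def produce():
    return list(range(1000))

@ray.remote(resources={"rack0": 1})
def aggregate(partial):
    return sum(partial)

# Both tasks require a unit of "rack0", so the scheduler places them on the
# same node and the intermediate data never crosses the network.
result = ray.get(aggregate.remote(produce.remote()))
print(result)   # 499500
```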

Conclusion and Open Challenges

In summary, this post argues that to achieve the promise of providing a general-purpose computing platform that supports a wide range of applications, serverless platforms need to support (1) stateful computations and (2) the ability to control function placement to minimize data transfers. Each of these features can increase the performance of data transfers by at least 10x and/or reduce the amount of data being transferred by 10x or more.

As a proof point, in Ray we extended the original task (function) abstraction by adding support for these two features. These extensions have allowed Ray to support latency- and throughput-sensitive applications, including streaming, distributed training, and simulations, which were not possible using tasks alone.

Looking forward, it would be exciting to re-architect existing serverless platforms to support long-running stateful computations and placement control. In this context, we note two Ray optimizations that are challenging to provide in a multi-tenant environment: single-node shared memory and proactively pushing the function code before starting the function — it might be infeasible to push a function to every node in a datacenter where that function might run. Addressing these challenges while keeping the performance improvements provided by Ray is an interesting direction for future research.

Citations

[1] https://dl.acm.org/citation.cfm?id=3154630.3154660
[2] https://arxiv.org/abs/1812.03651
[3] https://www.usenix.org/system/files/conference/hotcloud16/hotcloud16_hendrickson.pdf
[4] https://arxiv.org/abs/1902.03383
[5] https://rise.cs.berkeley.edu/blog/ray-scheduling/
[6] https://dl.acm.org/citation.cfm?id=2901355
[7] https://www.usenix.org/conference/atc18/presentation/wang-liang