Solving Serverless Computing’s Fault Tolerance Problem

Vikram Sreekanti
Published in riselab · Mar 19, 2020

Serverless infrastructure and Functions-as-a-Service have become increasingly popular in recent years thanks to their operational ease-of-use, autoscaling, and pay-for-what-you-use features. A missing piece in the usability story for FaaS infrastructure, however, is application fault tolerance.

Pass the Pain to the Programmer?

By default, FaaS systems like AWS Lambda or Google Cloud Functions require programmers to worry about failed executions. All that they guarantee is this: function executions that fail — whether because of an application error or infrastructure failure — will be retried. This means that your function may run 2 times. Or 3 times.

Unfortunately it also means your function may run 0.5 times. Or 3.2 times. Huh? How can that happen?

Here’s the most painful issue: most FaaS systems provide no guarantees that a failed execution that reached out to shared resources like databases or files will be “cleaned up”. In the process of failures and retries, applications that modify shared state can unwittingly expose partial results. If a request updates two keys, k and l, but the function crashes between the two writes, parallel clients can see a newer version of k alongside an older version of l.
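To make this concrete, here is a minimal Go sketch of the anomaly; the store interface is a hypothetical stand-in for any shared backing store, not a real client library:

```go
package sketch

// store is a hypothetical stand-in for any shared backing store
// (e.g., DynamoDB or S3).
type store interface {
	Put(key string, value []byte) error
}

// handleRequest updates keys k and l on behalf of one logical request.
// If the function crashes between the two Puts, or the platform retries
// it midway, parallel readers see the new k alongside the old l.
func handleRequest(s store, newK, newL []byte) error {
	if err := s.Put("k", newK); err != nil {
		return err
	}
	// A crash here exposes a partial result: k is updated, l is not,
	// and a retry will run the first Put a second time.
	return s.Put("l", newL)
}
```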

This is the state of play in commercial FaaS today.

Painless Fault Tolerance for Commodity FaaS: Atomicity

To avoid this type of anomaly, a simple guarantee that developers can rely on is atomicity: either all of the updates from a single logical request should be visible, or none of them should. Atomicity is traditionally guaranteed by strongly consistent (transactional) storage engines, but those systems have well-known scaling and performance issues. How can we get atomic behavior for FaaS executions?

To that end, we’ve built a system called AFT (Atomicity for Fault Tolerance) that is a shim layer sitting between any serverless compute layer (e.g., AWS Lambda, Google Cloud Functions) and storage layer (e.g., AWS DynamoDB, Redis). Each logical request at the compute layer (which may be composed of multiple functions) is treated as a transaction. AFT guarantees that all of the updates made by a transaction are atomically installed at the storage layer.
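To illustrate the model, here’s a sketch of what a request could look like from the client’s side. The Txn interface and its method names are hypothetical, not AFT’s actual API:

```go
package sketch

// Txn is a hypothetical client-side view of AFT's transaction model:
// a logical request opens a transaction, and its writes reach the
// storage layer only if Commit succeeds.
type Txn interface {
	Get(key string) ([]byte, error)     // sees only committed data
	Put(key string, value []byte) error // buffered until commit
	Commit() error                      // atomically installs all writes
	Abort() error                       // discards all buffered writes
}

// handleRequest performs the two-key update from the earlier example;
// with AFT in the middle, either both writes become visible or neither does.
func handleRequest(txn Txn) error {
	if err := txn.Put("k", []byte("new-k")); err != nil {
		txn.Abort()
		return err
	}
	if err := txn.Put("l", []byte("new-l")); err != nil {
		txn.Abort()
		return err
	}
	return txn.Commit()
}
```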

AFT is designed to be flexible: We make no assumptions about the compute layer, and all we require of the storage layer is that it be durable. We guarantee atomicity for functions running over eventually consistent systems like DynamoDB and S3. AFT has two main features: (1) coordination-free, atomic installation of updates and (2) guarantees that transactions only read committed data. AFT writes each new key version to a different physical storage location to avoid write-write conflicts.
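The versioning idea fits in a few lines; the naming scheme below is illustrative, not AFT’s actual storage layout:

```go
package sketch

import "fmt"

// physicalKey derives a fresh storage location for each transaction's
// write to a logical key, so concurrent transactions never overwrite
// each other's data.
func physicalKey(logicalKey, txnID string) string {
	return fmt.Sprintf("%s/%s", logicalKey, txnID)
}

// Two transactions writing logical key "k" land at distinct locations,
// e.g., "k/txn-42" and "k/txn-43". Commit metadata then determines
// which of those versions readers are allowed to observe.
```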

To ensure that transactions read semantically coherent data, AFT guarantees what the database research literature calls read atomicity. Read atomicity requires that clients only read data from committed transactions, in the order in which those transactions committed. That is, if transaction T1 wrote key version K1, and a later transaction T2 wrote K2 and L2, a client can’t read K1 together with L2, because T2 wrote a newer version of K, namely K2. This type of anomaly is called a fractured read. All of this can be done without any coordination¹.

An illustration of reads that would be invalid under the read atomicity guarantee (fractured reads).
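As a rough sketch of that rule (a didactic simplification, not AFT’s actual read protocol), a candidate version can be rejected by checking it against the versions a transaction has already read:

```go
package sketch

// version tags a stored value with the commit timestamp of the
// transaction that wrote it, plus that transaction's full write set.
type version struct {
	commitTS int64    // commit timestamp of the writing transaction
	writeSet []string // all logical keys written by that transaction
}

// validRead reports whether reading candidate for key would fracture the
// reads made so far: if a transaction we already read from also wrote a
// newer version of key, candidate is too old to pair with what we've seen.
func validRead(key string, candidate version, readSet map[string]version) bool {
	for _, seen := range readSet {
		for _, written := range seen.writeSet {
			if written == key && seen.commitTS > candidate.commitTS {
				// Having read K2 from T2, we must not read the older L1.
				return false
			}
		}
	}
	return true
}
```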

Under the hood, we’ve developed new protocols that atomically install writes and enforce the read atomicity guarantee. We also developed new garbage collection protocols that safely mitigate the overheads of versioning in the storage system. If you’re interested in learning more, check out the full paper here.

We implemented AFT and its protocols in a couple thousand lines of Go over three storage backends: AWS DynamoDB, AWS S3, and Redis (AWS ElastiCache). We’re able to minimize the overheads relative to doing I/Os directly against the underlying storage engines (see Section 6 of the paper for concrete results), and AFT scales smoothly to hundreds of parallel clients and thousands of transactions per second (see the graph below). At the same time, we prevent significant numbers of anomalies that occur without AFT. The table below shows the frequency of read-your-writes (within a transaction) and fractured reads anomalies over 10,000 transactions².

Frequency of read-your-writes and fractured reads anomalies over various cloud storage engines.

What’s Next

We’re excited about the shim layer architecture as a means to explore different developer guarantees. In particular, there is a whole class of applications that could benefit from commodity autoscaling serverless infrastructure but that require stronger consistency than the serverless world offers today.

Those of you familiar with database internals might have noticed that the read atomicity guarantee we talked about here is akin to a coordination-free version of snapshot isolation, which is a commonly used form of strong consistency in traditional databases. We’re planning to explore how to bring strong consistency to serverless applications in this context. If any of this sounds interesting to you, we’d love to hear more!

¹ The catch here is that each individual client might read stale data because it cannot know about all committed transactions — that would require the strong consistency techniques discussed above. This compromise allows us to scale without coordination concerns while still guaranteeing fault tolerance via atomicity.

² For those wondering about DynamoDB’s transaction mode, it does not support mixed read-write transactions, which AFT does. Our comparison separates reads and writes into separate transactions, which is why it still sees certain anomalies.
