FaunaDB’s Official Jepsen Results

I am pleased to present, along with Kyle Kingsbury of Jepsen.io, the official Jepsen results for FaunaDB versions 2.5.4 and 2.6.0.

Our team at Fauna worked extensively with Kyle for three months on one of the most thorough Jepsen tests of all time. Our mandate for him was not merely to test the basic properties of the system, but rather to poke into the dark corners and exhaustively validate that FaunaDB is architecturally sound, correctly implemented, and ready for enterprise workloads in the cloud.

We’re excited to report that FaunaDB passed the core tests right away:

It now passes additional tests, covering features like indexes and temporality:

Additionally, it offers the highest possible level of correctness:

It is self-operating:

And its architecture is sound:

In consultation with Kyle, we’ve fixed many known issues and newly discovered bugs, made API improvements, and expanded our documentation. Kyle has extended the Jepsen suite itself with new tests specifically inspired by FaunaDB. We have also incorporated the extended Jepsen test suite into our internal QA, to help ensure that we never backtrack on the level of reliability we intend to provide.

What Is Jepsen?

Kyle describes Jepsen as “an effort to improve the safety of distributed databases.” It is an open source software verification suite born out of industry frustration with the unsubstantiated claims made by database vendors at the dawn of the cloud era. Jepsen is now widely regarded as the critical test that any distributed system must pass before it is considered mature.

Those familiar with Jepsen reports will note that no other database tested has met the stringent reliability levels that FaunaDB has now met. The FaunaDB report also contains a lovely, extensive description of FaunaDB’s architecture, and I encourage you to read it in its entirety.

Why Test FaunaDB?

When we started building FaunaDB, our objective was to deliver a cloud-native database that offered both transactional consistency and global scalability. For that reason, we chose Calvin as the basis for the underlying transaction protocol.

Other distributed, transactional databases use the first-generation Google Percolator model, which cannot scale transactions across datacenters, or the second-generation Google Spanner model, which requires atomic hardware clocks and a specialized operational environment. FaunaDB is the only production database to use the third-generation Calvin protocol.

By designing for global correctness up front, FaunaDB offers mainframe-like capabilities even in the chaos of a multi-cloud deployment. Externally consistent, multi-partition distributed transactions were widely believed to be impossible in a software-only solution until FaunaDB showed the way. We are proud to see our architecture validated in Kyle’s analysis.

Summary of Correctness Tests

Jepsen’s correctness tests exercised FaunaDB under a wide variety of fault conditions and administrative actions to simulate the unreliable operating conditions of the public cloud, including:

  • Individual process crashes
  • Individual process restarts
  • Rapid multi-process crashes
  • Rapid multi-process restarts
  • Small and large forward jumps in clock skew
  • Small and large backwards jumps in clock skew
  • Rapidly strobing clocks
  • Log topology changes
  • Replica topology changes
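
To make the list above a little more concrete, here is a minimal sketch of how a randomized but reproducible fault schedule could be generated. It is an illustration only: Jepsen itself is written in Clojure, and the fault names, the `build_schedule` function, and the node names are all hypothetical rather than taken from the Jepsen suite or from FaunaDB.

```python
import random

# Hypothetical fault names mirroring the list above; these are NOT
# Jepsen's real identifiers.
FAULTS = [
    "process-crash",
    "process-restart",
    "multi-process-crash",
    "multi-process-restart",
    "clock-jump-forward",
    "clock-jump-backward",
    "clock-strobe",
    "log-topology-change",
    "replica-topology-change",
]


def build_schedule(nodes, steps, seed):
    """Return a reproducible list of (step, fault, targets) tuples.

    Assumes at least two nodes so multi-process faults have targets.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    schedule = []
    for step in range(steps):
        fault = rng.choice(FAULTS)
        # Multi-process faults hit several nodes at once; others hit one.
        count = rng.randint(2, len(nodes)) if fault.startswith("multi-") else 1
        targets = sorted(rng.sample(nodes, count))
        schedule.append((step, fault, targets))
    return schedule


if __name__ == "__main__":
    for entry in build_schedule(["n1", "n2", "n3"], steps=5, seed=42):
        print(entry)
```

Seeding the generator is the design choice worth noting: when a schedule exposes a bug, the same seed reproduces the same sequence of faults.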

The testing validated that FaunaDB meets its expected isolation levels, avoids anomalies present in other databases, and maintains ACID semantics at all times. Additionally, the process of updating and running the Jepsen suite itself provided extensive verification of the general liveness, availability, and durability properties of FaunaDB, and led to numerous improvements.

Ongoing Work

FaunaDB does not depend on clock synchronization or a central clock oracle to maintain correctness, as the Jepsen analysis shows. Databases that rely on synchronized clocks can enter a state of ambiguous, irrecoverable data corruption if clocks skew beyond tolerance. FaunaDB never corrupts data, regardless of skew.
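
As a rough intuition for why correctness need not depend on clocks, consider a toy model in which every replica applies the same pre-ordered transaction log deterministically. This is only a sketch of the general idea behind log-ordered, deterministic execution, not FaunaDB's implementation; the `Replica` class, the `replay` function, and the transfer format are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica whose state depends only on the log, never on local_clock."""
    name: str
    local_clock: float  # deliberately unused when applying transactions
    balances: dict = field(default_factory=dict)

    def apply(self, log):
        # Position in the shared log is the only source of ordering.
        for src, dst, amount in log:
            if self.balances.get(src, 0) >= amount:
                self.balances[src] = self.balances.get(src, 0) - amount
                self.balances[dst] = self.balances.get(dst, 0) + amount
            # Transfers that would overdraw are skipped the same way everywhere.


def replay(log, clocks):
    """Apply one log on replicas with arbitrarily skewed clocks."""
    replicas = [
        Replica(f"r{i}", clock, {"a": 100, "b": 0})
        for i, clock in enumerate(clocks)
    ]
    for replica in replicas:
        replica.apply(log)
    return [replica.balances for replica in replicas]


if __name__ == "__main__":
    log = [("a", "b", 30), ("b", "a", 10), ("a", "b", 200)]
    # Wildly different clocks, identical final state on every replica.
    print(replay(log, clocks=[0.0, 1e9, -500.0]))
```

In this model a skewed clock can at most delay when a replica gets around to applying the log. It cannot change what the log says, which is the distinction between safety and liveness discussed in this section.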

FaunaDB versions 2.6 and earlier do partially rely on clocks to maintain liveness — the ability to process new transactions. Jepsen testing uncovered an issue where clock skews many seconds long, chaotically introduced across multiple nodes, can create cluster pauses until the skews are resolved. This operational scenario is rare in practice.

However, as Kyle notes, FaunaDB’s architecture makes it possible to maintain complete availability and liveness even with extreme clock skew. This is an implementation detail of FaunaDB rather than an architectural limitation. We look forward to proving it in an upcoming release.

Conclusion

Since no amount of bug fixing can save the wrong architecture, we are gratified that the Jepsen report is highly complimentary of FaunaDB’s architecture, and that it confirms the issues found during testing were rapidly fixed:

We look forward to working with Kyle and the Jepsen team in the future as we make further improvements to FaunaDB’s architecture and implementation. In the meantime, go read the full report!

If you enjoyed this topic and want to work on systems and challenges just like this, Fauna is hiring!

Author: Evan Weaver
Date: March 5, 2019
Originally published at fauna.com.
