CSRTale #13 comes from Dr Martin Kleppmann, a researcher in distributed systems at Cambridge University and author of the book, Designing Data-Intensive Applications. In this CSRTale, Martin tells the story behind the paper Verifying Strong Eventual Consistency in Distributed Systems, which won both the Distinguished Paper Award and Distinguished Artifacts Award at OOPSLA in 2017.
Martin begins by introducing us to the difficult history of collaborative editing algorithms. It is this history which motivated his research into formally verifying that these algorithms provide strong eventual consistency. Martin describes how he as an engineer overcame the challenges of mastering his theorem prover of choice, Isabelle, and went on to write an award-winning paper about the results. I’ll hand you over to Martin now to tell you more.
Going beyond pen and paper proofs
This line of research considers the algorithms behind collaboration software, such as Google Docs, where multiple users can edit a single document at the same time. This leads to arbitrary concurrent edits and in an ideal world, we want everyone to converge on a consistent state of the document once they have exchanged their edits.
This area of computer science has had a difficult history. Various algorithms were proposed in the 1990s to solve the issue of different users concurrently editing text at different places within a document. Many algorithms today use a centralized server to sequence edits. This simplifies the problem but requires all edits to go via a centralized server (or a set of servers using consensus). We would like this to work in a peer to peer setting where any user can send their edits to any other user. The first algorithm for this was proposed in the late ’80s but was shown a few years later to be wrong due to an edge case which can lead to permanent inconsistency. People fixed these problems, published new algorithms and these again turned out to be wrong. Throughout the ’90s, this process of proposing new algorithms and then showing them to be to incorrect continued again and again.
Given the history of incorrect algorithms in this space, we wanted to go beyond a pen and paper proof to be certain of correctness. Instead, we utilized mechanized proofs, thus ruling out errors in our reasoning. This paper describes the framework we developed for formally verifying this family of algorithms and how we put it to the test by checking three such algorithms.
The property we are verifying is known as strong eventual consistency. This means that if two collaborators have seen the same set of edits, even if those edits were seen in different orders, then the resulting document must be the same. Typically, this is achieved through commutativity, by defining edit operations such that they result in the same outcome regardless of ordering. We restrict the types of edit operations to allow only commutative edits. However, not all operations will commute: for example, consider adding a word then deleting it — the deletion then depends upon the addition. To handle cases such as this we use a more refined model where only concurrent operations need to commute. The details of how the framework is formalized are not just specific to text editing. Other examples include collaboratively edited sets and a shared counter. You could easily imagine more complex and interesting applications.
Utilizing minimal assumptions
Interestingly, some previous algorithms in this space did come with proofs of correctness (or so they claimed to at least). Those proofs turned out to be wrong due to assumptions which did not hold in all circumstances. Some proofs were even mechanized, but if you have the wrong assumptions then a mechanized proof does not help you.
This is why we were very careful with our assumptions. We made only very general assumptions and made no assumptions that were specific to the algorithm we were verifying. We rely on only a basic set of axioms such as “if a message is received then someone sent it”. The proof is structured to only use those very general assumptions about networks, and everything follows from there. We prove individual algorithms correct under certain assumptions, and then we have a second proof which shows that those assumptions always hold under all possible executions of the network. No matter if the network drops, reorders or duplicates messages, those assumptions still hold.
From systems engineering to formal verification
A few years ago, I was trying to develop new algorithms for collaborative editing. The algorithms were getting very subtle and I did not quite believe that they were correct. I tried pen and paper proofs but I was still not convinced. It was so easy to make a subtle mistake and difficult to spot it afterwards. So I started investigating more formal approaches to avoid my own mistakes. At the beginning of this journey, I know nothing about formal verification. I spoke to a few people around me and learnt about theorem provers such as Isabelle and Coq. I heard that Isabelle had good automation so it seemed like the best option for me.
Writing the algorithms in Isabelle was not too difficult because the language is similar to ML. However, I had no clue how to actually prove anything about them. Fortunately, we have some local experts in formal verification here at the computer lab, including Dominic Mulligan and Victor Gomes. So we sat down together for hours and hours, trying to work out how to express the properties that we wanted. Many times, we would write some simple conjecture in Isabelle which seemed obviously true, and Isabelle would find a counterexample. It often took us many attempts to correctly formulate the property we were trying to prove, let alone prove it. It was an amazing learning experience. However, I am now wary of pen-and-paper proofs as there were many things we got wrong along the way that we would not have spotted without Isabelle.
Dominic and Victor had previously worked on formalizing multi-core concurrency using Isabelle, but hadn’t worked on distributed systems. I had the opposite background, I knew my way around distributed systems but I knew nothing about using Isabelle. The best way to learn is to sit down with people who know their stuff. For the first several months, I was looking over Victor and Dominic’s shoulders with no clue as to what was going on. Over time it went from gobbledygook to slowing making more and more sense. We carried over some of Victor and Dominic’s experience in multi-core concurrency, but we also found that distributed systems are quite different, since they also need to handle partial failures like dropped messages and crashed nodes.
Some of the formal proof tools are designed more for people who are interested in proofs than people who simply want to prove a specific result and move on. Formally proving, even simple things, can be a lot of work. It can be frustrating. The tools are getting better though, e.g. small steps of the proof can be automated. But you still need to provide intermediate reasoning steps between your assumptions and your goal. It is becoming more and more manageable but still requires some intuition for which reasoning steps work best.
Formal verification allows us to consider edge cases that occur rarely in practice. We gain a much better understanding of algorithms by trying to formalize them. When implementing, you naturally focus on the common case because that’s what you want to optimize. Formal verification pushes you to consider the rare edge cases.
Publication at OOPSLA & next steps
When it came to writing up our results, we weren’t sure whether to try a distributed systems venue or a programming languages venue like OOPSLA. The PL community is comfortable with formalizations whereas mechanized theorem proving is rarely utilized in distributed systems. We choose OOPSLA in the end because we felt that its community could appreciate how difficult it is to do this type of proof.
We also submitted the proofs, packaged up in a VM, to OOPSLA’s artifact evaluation. Each step in the proof is sufficiently simple that it takes only a few ms and little code to check. This also means that the trusted computing base is therefore very small. We also made the formalizations available on GitHub and we have already seen people build upon it.
Because many algorithms for collaborative editing have been wrong in the past, we had a strong story about why we were doing this work. Since then we’ve worked on similar formal verification projects which have been more difficult to publish. This might be because they had a weaker motivation case. We are actively working on follow-up papers; for example, we developed a new algorithm for concurrently modifying tree data structures. It can still be a struggle, even after all these years. Sometimes the result you write down can look obvious so there’s a challenge in convincing the reader that there has been substantial work. I hope we will find a venue that appreciates it.
I would like to say a huge thank you to Martin for taking the time to speak with me and for his efforts to further the application of formal verification in distributed systems. In my own experience, distributed systems are seldom formally verified, sometimes leading to algorithmic bugs which go unnoticed for years. I sincerely hope that we will see a revolution in distributed systems research where the formal verification becomes as common as experimental evaluation.