CSRTale #12: Distributed consensus can sometimes result in an exploding Inbox

Published in

CSR Tales

12 min readJun 28, 2019

I sat down with Heidi Howard to talk about her graduate school experience, the summer she spent at VMware Research in 2016 working on Flexible Paxos, which enjoyed explosive success, and her experience being a speaker on the tech conference circuit. Heidi is currently a Research Fellow for Trinity Hall College at Cambridge University.

— — — — —

Amogh: You were one of the first few interns at VMware Research, right?

Heidi: Yes! I didn’t really know that VMware had a research group at the time. I was at SIGCOMM ’15 in London and I ended up sitting next to David Tennenhouse at the conference dinner. I had no idea who he was. Probably a good thing cause I would have been very nervous to talk to him otherwise. Thankfully, I didn’t know who he was and I’d had a few glasses of wine. I was chatting very merrily to him. I told him I work on consensus. And he said he had a research group that does research on distributed consensus, and that his team includes Dahlia Malkhi. I was like “Oh My God! She’s one of heroes! I’ve read so many of her papers!”. Right. So my ears kinda opened up and I was like “tell me more about this place”. I love Cambridge, this is my home and I love the university and the city, but our department is fairly diverse and there’s not that many other people who work on distributed systems at Cambridge (with the notable exception of Martin Kleppmann). So I was quite keen to get an opportunity to go be an intern at a larger group doing distributed systems. So I followed up with David after that dinner; I spoke to Dahlia and they were like come out to Palo Alto and work with us on consensus. I jumped on the opportunity. I was super excited to go somewhere where people cared what I was working on.

A: Yeah, I’ve been there twice myself. It’s an awesome place. So pulling on that thread a little, how far along were you in your PhD were you at that point?

H: I think I was about a year in, at that point that I was going to VMware. I went straight from undergraduate into my PhD program. Well, backing up, when I was an undergraduate, I read the Raft paper and I wanted to implement it in my favorite programming language, which at the time was OCaml, the functional language. So I wrote that implementation, and then wrote up the Raft Refloated paper. I felt I could optimize the algorithm, like if I tweak this and do that. I thought my ideas were really novel and really cool at the time. When I told people about my tweaks, I learned that they’re pretty standard practical optimizations. Anyway, that’s how I fell down the distributed consensus rabbit hole, and read more and more about consensus algorithms and the algorithms that were out there.

A: So how did you end up getting interested in distributed systems and consensus?

H: I had never programmed before I went to university to do computer science, so I never really saw myself as a systems person, but I think the problem of consensus just kind ticked the boxes. I went to university for Computer Science wanting to work on AI for robotics. It was really an off-chance that I came to distributed systems. I heard Diego Ongaro speaking on the Think Distributed podcast about Raft. I was really fascinated with the problem. It’s an interesting theoretical problem, but it’s also very, very practical. When it comes to distributed systems, there are so many systems that have all different consistency guarantees. I find it super confusing what these different guarantees are, what different systems give you and have you compare them. There’s a really good survey paper, Consistency in Non-Transactional Distributed Storage Systems, it’s got 60 different types of consistency guarantees and it’s just frightening. Then you see consensus and you’re like, oh wait, you could use this to build linearizability, which is really desirable because it allows us to think of our distributed system like it’s not distributed at all. I wanted to do that. It seemed just like it would be really powerful and I kind of sidestepped into distributed system.

A: So let’s continue with where we kind of left off at VMware and the whole Flexible Paxos story.

H: Right. I went to work with Dahlia at the VRG. The original plan was that I was going to work on Byzantine fault tolerance, which I knew nothing about. Right before heading off to Palo Alto, I had been trying to put together some material for explaining Paxos to people. There’s this sentence in Lamport’s Paxos Made Simple paper that goes: “The Paxos algorithm is among the simplest and most obvious of distributed algorithms. The consensus algorithm follows almost unavoidably from the properties we want to satisfy.” However, when you try to teach people Paxos, you can could kind of convince them that it’s reasonable, but I don’t think anyone looks at that and thinks ‘well that was just unavoidable; I couldn’t have helped but derived Paxos’. I was really trying to find a way to communicate the underlying intuition, as to why you would solve consensus that way and why it works. I wanted to take a first principles approach and communicate how and why the Paxos algorithm was derived. It’s kind of the difference of showing a student a proof and getting the student to prove it themselves.

H: So, I was trying to put together this material and I was looking at quorums. I was aware that Paxos uses majority when reaching consensus. Well, any intersecting quorum system will work, but most people use majority. I was looking at what I had written down, and I couldn’t work out why we required intersection across all the quorums. My first thought was “Wow, I must be really stupid, I’m doing a PhD in this area and I don’t understand this fundamental thing.” I couldn’t see why the intersection was needed. I kept looking through what I’d written down in this derivation of Paxos from first principles, but I just couldn’t see why it was the way it was. It briefly occurred to me that maybe I was right, that maybe you don’t need all the quorums to intersect, that the requirement is weaker, but I immediately doubted that that could possibly be true.

H: I left it for a couple of weeks until I got to VRG. So I arrived at VRG, I just couldn’t get this thing off my mind, and I figured who better to ask than Dahlia. So I wrote up a one page description of the result. When you’re communicating with someone who’s familiar with Paxos, it’s quite easy to write down the result in a page. I told Dahlia wanted to talk to her about consensus and pick her brain on something. I was super nervous. I was quite sure I was wrong and that she’d think I’m a complete idiot. So I went in and I explained to her that I don’t think Paxos requires quorum state intersect on the whiteboard. She went ‘well of course they do’. I asked her to explain it to me. She started to explain it to me, and then she was like ‘I think you’re right, maybe they don’t need to intersect’. Everything just kind of moved from there. So instead of working on Byzantine fault tolerance, we ended up trying to put together a paper on this. We ended up getting the paper out of the door really quickly, because once you’ve proved a property, that’s all that is required.

A: You didn’t implement it?

H: I wanted to publish the result on its own so that people could run with it and implement it in different systems in different ways. Publishing it with a system sometimes results in the core result basically getting lost. If you look at a successful consensus systems like Zookeeper, there were some really interesting insights about how to achieve consensus in the paper, that have just kind of gotten lost. They exist in Zookeeper, of course, but never quite make it to other systems. Even though Zookeeper is really successful, some people avoid it because it’s in Java and they don’t want to use Java. I wanted to make sure that the paper focused only on the algorithmic fundamentals. I did put together a research prototype by modifying libPaxos to check that it worked, though. After that we put together this paper with the proof and a couple of examples, and submitted it to OPODIS’16.

A: Wow, the whole process was quite quick, then?

H: It looks like it didn’t take much time to do that work, but as much as I hate saying this, I think I only got to that level of insight into Paxos because I had previously spent a ridiculous amount of time trying to get all this stuff to work and failing. Yeah. There had been a lot of lot of dead ends prior to that summer.

A: I absolutely understand that. I think that’s inevitable, especially in grad school.

H: So, we submitted the paper and then came the challenge of convincing people of the validity of Flexible Paxos. Basically, anyone who knew anything about Paxos would immediately be like ‘Of course intersection is necessary; it's like the basic thing that Paxos is based on. That’s why we use majority's’.
I think if it hadn’t been for the fact that I had Dahlia to help me write the paper in a way that that communicated it to that community, it would’ve been a complete nightmare to get that work published. I’m really glad I had her there at that time.

H: So we submitted the paper to the conference and put it on Arxiv, so that it was out there. I ended up using that as content for talks, and whilst I was in the bay area, I got to go to various big tech companies and talks about Flexible Paxos and about how to use it. It was quite an exciting summer.

Glitz, Glitter & Fame

A: How did you set those talks up? Did people reach out to you? Did you know people at these companies?

H: (laughter) The announcement of Flexible Paxos was the definition of a success disaster. I was used to wallowing in insignificance, but when this came out it got really big really fast. The paper went on Arxiv and the CTO of Amazon tweeted about it! That got picked up by other places, and soon it was on the front page of hacker news, it was covered in The Morning Paper. It was big news, and a lot of people were reaching out to me to ask me about all sorts of random things: ‘I have a system like this, can I use this?’ ‘Can you explain this example in the paper?’ Just dealing with email became a full time job for a couple of weeks.

H: So I got invited to go with talks at a bunch of places. I was in the bay area and I was only there for a short time, so I was like, right, let’s do this. I think one of my colleagues jokingly refers to it as my ‘publicity tour’. It was about trying to communicate the results to a broad audience. Especially because we submitted the paper to a theory conference, and it was quite a theoretical paper without a system to go download and try, I think it would have been quite easy for it to get lost. So that was how the rest of my summer at VRG went down.

A: So what happened after that summer when you came back to Cambridge?

H: Oh, that was a difficult period. The thing with Flexible Paxos is that it opens a bunch of different doors, in terms of like what you can do with consensus. It was very hard to figure out which one to pursue. I spent a while pursuing a couple of avenues that didn’t work out. I think the fundamental issue was that I can code, but I was not an experienced developer. I would build this new consensus system based on a new algorithm that was theoretically much more performant, and then benchmark it against the state of the art like Zookeeper. On a good day, I was lucky if I got comparable performance. I soon realized that I could put another year of my life into engineering this system, but that would not be the most productive use of my time. I had to take a step back and recognize that I only need to do enough engineering to be able to effectively communicate my results to other people. It’s not my job to write production code for companies. That took me much longer to realize than I care to admit.

H: Once I realized that, I decided to instead to invest my time in the theory behind my algorithms. I went back to tools like TLA+ to improve my confidence in my consensus algorithms. And that gave rise to my phd thesis. Early on, my plan for my dissertation had very much been to develop a big performant system. If you had told me then that I was going to write an entire thesis just about the theory of achieving consensus over a single value, I would’ve thought it was ridiculous, but that was the route that I went down in the end and I’m fairly happy with the result.

Speaking at Tech Conferences

A: So let’s shift gears a bit. You’re one of the few academics I know who give talks at Tech/Developer Conferences. How did you get into this whole scene, and why?

H: I mean, I had no idea this whole scene existed until Adrian Colyer of The Morning Paper, who had previously covered my work on Raft Refloated, invited me to give a talk at QCon in London. He was putting together a track on transitioning research into practice. I figured I’ve read so many consensus papers, so why don’t I make a talk about what I’ve learned? I did kind of a tour of consensus results. Everybody uses Paxos or Raft or Zookeeper, but there’s loads of really, really good and interesting consensus algorithms out there and they’re just kind of lost in the mountain of academic papers. I wanted to pull them out and say look at this really cool thing. Once I did the first talk, it kind of moved forward from that. Apparently I must have done reasonably on that first talk, because people who invite you to speak at these conferences watch your previous talks to check whether you’re a decent speaker. I kinda got hooked, and started submitting to more of these tech conferences and I also got invited to some of them, and it just kind of spiraled from there.

A: How are these conferences different from say an academic conference?

H: I find them very interesting because I get to meet a lot of engineers from companies, which if you’re not in the states and you’re not in the bay area in particular, you don’t actually tend to meet people who are kind of grappling with things like distributed consensus in production. It gives me a lot of exposure to those kinds of people, which I think was really helpful. I’m always kind of the odd one out there because everyone is from industry and there’s only like one or two academics floating around. The talks are usually not about novel work. It’s more about reaching a larger audience. Academic talks have quite a clear structure to them in terms of you know there is a problem, here’s how we try to solve it, how other people have tried to solve it, but our solution is better than that, etc. These talks are designed to be very accessible to a broad audience. So you’re not assuming these people know anything about say consensus when they come in. You have to provide the background, which can be a little bit painful sometimes as you have to sometimes skip a lot of caveats and details because you’re probably the first person that’s presenting this topic to them. So it’s definitely a very broad audience, and the talks are really more about exposing the audience to a new or different set of ideas and how they might be able to use it in practice.

A: It’s almost like you’re teaching a class.

H: Yes. It is much closer teaching a class. Engineers at tech conferences may only attend a conference once every few years (and the tickets can be expensive) so everyone’s really paying attention. You walk on stage at one of these conferences and everybody is watching you intently. No laptops or phones. That said, I think most of the impact of these talks actually comes from the talks being posted online.

A: Makes sense. Someone attends a talk at one of these conferences and then goes back tells their team about the talk and so on…

A: So, if somebody else wants to get into this scene, what should they do? Do you have any advice for say a PhD student aspiring to be a tech conference speaker?

H: I guess the main thing I would say is to come up with a talk that you would like to give. And it’s got to be something that you are genuinely interested in and would enjoy giving a talk on potentially many, many, many times till you are terribly sick of it. Most of these conferences have, open calls for speakers similar to academic conferences. You submit the title and abstract of the talk, and perhaps a bio about yourself. You apply and then you wait for them to get back to you. And I think once you’ve done a few, you tend to start to be invited instead of having to go through the application process. Unlike academic conferences where you have to pay to go, they will cover all of your expenses if they select your talk. This is such a basic assumption in that world that a bunch of the calls may not even say that they cover your costs, but they do.

— — — — —

Here’s a short list of popular tech conferences, in case you’re interested in learning more about them:

CSRTale #12: Distributed consensus can sometimes result in an exploding Inbox

Glitz, Glitter & Fame

Speaking at Tech Conferences

Written by Amogh Akshintala