CSR Tale #9: Reliability of Distributed Storage

Vijay Chidambaram
Published in CSR Tales · 12 min read · Aug 22, 2018

CSRTale #9 comes from two award-winning graduate students at Wisconsin, Ramnatthan Alagappan and Aishwarya Ganesan. Ram and Aishwarya work with Prof. Andrea Arpaci-Dusseau and Prof. Remzi Arpaci-Dusseau in the Advanced Systems Lab. In this CSRTale, they discuss how they started working on distributed-storage reliability at Wisconsin, and the lessons they learned along the way. Ram and Aishwarya will be on the job market soon!

This tale is organized around six main ideas:

1. Find projects through measurements

2. Pay careful attention to feedback (even if it is harsh)

3. If you have a good-enough idea, see it through

4. Writing is as important as the work itself

5. How advising helps

6. Sometimes, shelving an idea and revisiting helps

To explain and emphasize each of the ideas, Ram and Aishwarya share their experience from three projects: “Correlated Crash Vulnerabilities” (OSDI ‘16), “Redundancy Does Not Imply Fault Tolerance” (FAST ‘17), and “Protocol-Aware Recovery for Consensus-based Storage” (FAST ‘18).

Find projects through measurements

As graduate students, one of the questions we most often ask others is “how did you get idea X?” We believe that choosing an interesting problem (interesting at least to us) is crucial; only if the problem is interesting will we be motivated to work on it.

One way to search for new projects is by reading papers and thinking about them. We feel that this approach doesn’t work most of the time. Instead, we have found that measuring or studying some aspect of a new system/protocol through experiments (e.g., fault injection) sheds light on real problems and thus leads to new projects. Our advisors have always encouraged this kind of measurement- and study-driven approach to searching for new problems.

The measurement-based approach leads to new projects in two ways. First, we get to study a new problem in a real system; applying the study methodology to many systems can reveal surprising patterns across many systems, leading to a paper that presents the new observations. Second, to solve the discovered problems, we can build new solutions. Our experiences from two projects (“Redundancy Does Not Imply Fault Tolerance” and “Protocol-Aware Recovery”) show how this “measure then build” approach has worked well.

During the Fall of 2015, one of us (Aishwarya) took the CS-739 distributed systems course with Remzi. For the course project, Remzi suggested that it might be interesting to study how distributed systems react to disk corruptions. We wanted to understand whether distributed systems use their inherent redundancy to recover from corruptions on one node.

Right from the beginning, we started conducting experiments: we studied Cassandra by manually injecting bit corruptions. Within a few weeks, to our surprise, we discovered some serious problems — silent user-visible corruption, the spread of corruption, etc. We immediately knew that there was some potential here. So we studied a few more systems and found that they have serious problems (e.g., data loss) too. To study more systems, we automated the process by building CORDS, a fault-injection framework that injects disk corruptions. We applied CORDS to eight different systems and found many more problems.
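To give a flavor of this kind of experiment, here is a minimal Python sketch of the core operation CORDS automates: flipping a single bit at a chosen location in a system’s on-disk file, then restarting the system and observing how it reacts. This is illustrative only, not CORDS itself, and the file path and offsets are hypothetical.

```python
# A minimal sketch of targeted bit corruption (illustrative, not CORDS itself):
# flip one bit at a chosen block/byte/bit offset in a system's on-disk file.
import os

def flip_bit(path, block_num, byte_in_block, bit, block_size=4096):
    """Corrupt a single bit of the file at `path`, in place."""
    offset = block_num * block_size + byte_in_block
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is past the end of the file")
        f.seek(offset)
        f.write(bytes([original[0] ^ (1 << bit)]))   # flip exactly one bit
        f.flush()
        os.fsync(f.fileno())

# Hypothetical usage: corrupt bit 0 of byte 0 in block 3 of a data file, e.g.,
#   flip_bit("/var/lib/cassandra/data/some-sstable-Data.db", 3, 0, 0)
# then restart the node and run reads to observe the system's reaction.
```

The real framework injects faults systematically across the blocks a system accesses; this sketch shows just one targeted injection.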

One main challenge with this project was deciding how to present the results (we had too much data). Our advisors encouraged us to take a step back, look at the results across different systems, and make cross-cutting observations. Doing this, we identified some overarching themes: none of the systems used their inherent redundancy to fix corrupted data; most systems conflated corruptions and crashes, leading to data-loss events; and local detection strategies interacted with distributed recovery in unsafe ways, leading to more problems. These observations intrigued us, leading to our next project.

One of the interesting problems we found in the previous project was crash-corruption entanglement. At a high level, a checksum mismatch can arise for two reasons: a partial write (e.g., due to a crash) or a storage corruption. However, most systems always attributed a mismatch to a crash and invoked the crash-recovery code. We felt that disentangling crashes and corruptions was an interesting problem to solve.

We started making modifications to ZooKeeper and LogCabin. With the new storage format, we were able to differentiate between the two cases. We realized that one benefit of such disentanglement was faster recovery: without our change, a corrupted node would discard all of its data and fetch it in its entirety from another node over the wire. At first, we thought this was only a small benefit. However, after interacting with the developers and conducting more experiments, we came to understand the more severe effects of conflating partial writes and corruptions: the fundamental safety condition of RSM (replicated state machine) systems can be violated. Soon after, we analyzed many different approaches to dealing with corrupted data and found that none of them was ideal. So we designed CTRL, a new approach that recovers from corruption in consensus-based systems. CTRL was, in some sense, a solution to some of the problems we had discovered in our previous project.
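To make the disentanglement concrete, here is a rough Python sketch of one way to tell the two cases apart, in the spirit of CTRL’s approach (this is not CTRL’s actual code or on-disk format): a small persist marker written after each log entry reaches disk lets recovery classify a checksum mismatch.

```python
# A rough sketch of the disentanglement idea (illustrative only): a persist
# marker written after each log entry is durable lets recovery distinguish a
# torn write (crash) from a later storage corruption.
import zlib

def classify_entry(entry_bytes, stored_checksum, marker_present):
    """Classify a log entry during recovery: 'intact', 'crash', or 'corruption'."""
    if zlib.crc32(entry_bytes) == stored_checksum:
        return "intact"      # checksum matches; nothing to repair
    if not marker_present:
        # Marker never made it to disk: the entry is a torn/partial write from
        # a crash, so it is safe to discard this log tail and move on.
        return "crash"
    # Marker is present: the entry was fully persisted earlier and has since
    # been damaged, so a correct copy must be recovered (e.g., from replicas).
    return "corruption"

# Hypothetical usage:
entry = b"op=put key=k val=v"
print(classify_entry(entry, zlib.crc32(entry), marker_present=True))                 # intact
print(classify_entry(entry[:5], zlib.crc32(entry), marker_present=False))            # crash
print(classify_entry(b"XXXXX" + entry[5:], zlib.crc32(entry), marker_present=True))  # corruption
```

The key design point is that a “crash” verdict allows safe local truncation, while a “corruption” verdict forces recovery of the damaged entry rather than silently discarding committed data.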

For this project, we collaborated with Vijay Chidambaram, who gave us detailed feedback on CTRL’s correctness guarantees. Vijay and his student Eric also helped us model-check CTRL. We also collaborated with Aws Albarghouthi (who cites Jay-Z in his PhD thesis) to prove an impossibility result regarding crash-corruption disentanglement.

Pay careful attention to feedback (even if it is harsh)

“Know, or listen to those who do.” — Baltasar Gracián

We believe one of the most effective ways to improve an idea or a paper is to pay careful attention to feedback. While most people (like our advisors and collaborators) are nice when giving feedback, not everyone is; some feedback can be harsh. While it is hard to face harsh critique, we find it useful to ignore the tone and objectively consider the improvements the feedback suggests. The following experience shows how taking feedback seriously helps.

We submitted the first version of our “protocol-aware recovery” paper to SOSP ’17. While one of the reviewers liked our work, the others pointed out several important issues and limitations. Many reviews were scathing. However, they were factual and helped us think about how we could improve our work. Here are some excerpts from the actual reviews.

One reviewer suggested that snapshot recovery was required: “It is not clear why the model only considers corruption of the replicated log and not of the system state: the replicated state machine cannot function correctly if snapshots of the state are corrupted….”

Another reviewer raised the same point: “…, there’s the question of whether making the log corruption-tolerant is really enough in the context of a real system. Most systems maintain application state that’s separate from the log, … What happens if this state is corrupted?”

A third reviewer suggested a useful organizational change: “I did find the way you presented your protocol rather confusing. Often I found myself thinking that what you presented couldn’t possibly work, only to discover several sections later that you make additional assumptions…you should first present your storage layer….”

We felt that the reviewers were right and reasonable. We took the reviews seriously with an open mind. We implemented snapshot recovery as suggested by the reviews. We also reorganized the sections and rewrote many portions of the paper. We submitted the revised version to FAST ’18. The paper was accepted, and it also won one of the Best Paper awards. We believe this was possible largely because of the excellent feedback from the SOSP reviewers.

If you have a good-enough idea, see it through

“Get a good idea, and stay with it. Dog it and work it until it is done, and done right.” — Walt Disney

Another important lesson we have learned is to persist until a project is complete. As eminent researchers have observed, people count the projects you finish, not the ones you start. We once had the experience of starting a project with a good idea but kept looking for “better” ones.

During the Spring of 2014, one of us (Ram) was working on a project with two senior graduate students at the time (Thanu and Vijay) and our advisors. We developed Alice, a framework that checks for crash vulnerabilities in single-machine storage systems. A paper describing this project was published at OSDI ’14. Immediately afterward (Fall 2014), as part of the distributed systems class project, we thought it might be worthwhile to make Alice work with distributed systems. So, we started porting some of Alice’s code to record system-call traces of distributed systems. Applying this framework, we were able to visualize the replication and local-update protocols of a few distributed systems.

The obvious next step for this project was to implement the code to replay the recorded traces and check for crash vulnerabilities (like Alice). However, we kept looking for “better” ideas and did not implement this code. For example, using the traces we collected, we tried to infer the consistency and durability guarantees of distributed systems; however, this idea did not get any traction.

After 3–4 months of unproductive work, our advisors suggested that simply completing the implementation of our framework, so that it could replay the recorded traces and check for crash vulnerabilities, was itself a promising direction. We stopped worrying about how the paper would turn out or even whether we would find something new or interesting (e.g., unknown vulnerabilities); we concentrated on just getting the implementation done. After a month and a half, we had the first version of PACE, which checks distributed storage systems for correlated crash vulnerabilities.
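As an illustration of the record-then-replay idea (not PACE’s actual code), the sketch below reconstructs the on-disk state each node could have after a correlated crash at some prefix of its trace, then runs a user-supplied checker over the cross product of those per-node states. For simplicity it crashes only at trace prefixes, ignoring the reorderings that real file systems allow; the trace format, disk model, and checker are all hypothetical.

```python
# A simplified sketch of checking for correlated crash vulnerabilities
# (illustrative, not PACE's actual code).
from itertools import product

def replay(ops):
    """Apply a prefix of a node's persistence operations to an empty disk image."""
    disk = {}
    for op, path, data in ops:
        if op == "write":
            disk[path] = data          # model only whole-file writes, for brevity
    return disk

def states_after_crash(trace):
    """One candidate on-disk state per crash point (prefix) in a node's trace."""
    return [replay(trace[:cut]) for cut in range(len(trace) + 1)]

def check_correlated_crashes(node_traces, checker):
    """A correlated crash cuts every node's trace at once; explore the cross
    product of per-node crash states and flag any state the checker rejects."""
    for cluster_state in product(*(states_after_crash(t) for t in node_traces)):
        if not checker(cluster_state):
            print("possible vulnerability in cluster state:", cluster_state)

# Hypothetical two-node traces: both nodes wrote (and acknowledged) key k.
traces = [[("write", "log", "k=v")], [("write", "log", "k=v")]]
# Safety check: acknowledged data should survive on at least one node.
check_correlated_crashes(traces, checker=lambda cluster: any("log" in d for d in cluster))
```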

We met with our advisors every week to discuss our work; these weekly meetings motivated us to apply the tool to many systems. We hit a few issues in our framework, which we fixed one at a time while continuing to study more systems. After a month or so, we had applied PACE to five systems and found new vulnerabilities in all of them (including popular ones such as ZooKeeper and Redis). Finally, we completed the project, and the paper was published at OSDI ‘16.

Writing is as important as the work itself

“I write to find out what I’m thinking” — Edward Albee

We believe that writing is as important as the work itself. At first, writing an entire paper can be intimidating. However, we are fortunate to have advisors who have very patiently taught us how to organize our thoughts and write better.

The first time we tried to write a paper was for our OSDI ’16 submission; it was intimidating. Approximately seven weeks before the OSDI deadline, we sent the first draft of a few sections to our advisors. This is where our advisors did an amazing thing (one that baffled us at the time!). They patiently gave us incredibly detailed feedback on the writing; Andrea marked every draft with a pen and left us the marked copy. We thought to ourselves that it might have been easier and quicker for them to edit the paper themselves. However, we later realized that this was how they prepared their students to become better writers.

Although we don’t consider ourselves good writers, without this kind of exercise we couldn’t have attempted to write the subsequent papers. Looking back at that first draft now, we can spot the common writing and organization mistakes we made at the time; this shows that our advisors’ training has not been in vain!

Further, we believe that writing early can be helpful. Writing early enables many iterations with feedback from your advisors. We think this is a great way to improve the paper incrementally.

How advising helps

As Dave Patterson notes, a bad graduate career includes not trusting your advisor. Instead, he suggests trusting your advisor because (1) they want to work with and help graduate students, and (2) their success is judged in large part by the success of their students. We think this is great advice.

Two specific aspects of our advisors’ style of advising have helped us. First, they teach us how to do things even when it would be more efficient to do the task themselves; for example, they patiently taught us how to write better instead of just editing the paper themselves. Similarly, when designing new solutions, they taught us how to evaluate alternative solutions in general, instead of just saying what was good or bad about a particular one.

Second, the regular meetings with our advisors help us tremendously. Most importantly, the positive feedback they provide, however small, boosts our confidence and motivates us. For instance, after our unproductive phase, the encouragement we got from our weekly meetings motivated us to apply PACE to many systems and get more results. The meetings also help us streamline our thoughts; they act as a forcing function to think about our work from the view of someone who understands the details but is not doing every low-level task. We also think that the frequency of the meetings (every week) worked well for us; a week gives us enough time to make progress while enabling frequent-enough feedback. However, this is specific to us; some students may prefer meeting more frequently.

Sometimes, shelving an idea and revisiting helps

“Everything has been thought of before; the problem is to think of it again.” — Goethe

We believe that if a project doesn’t pan out immediately, it can sometimes be good to put it on the back burner and revisit it later. For example, we started the distributed crash vulnerabilities project in Fall 2014, worked on it for three months, and shelved it. Then, after about nine months, we revisited the project in Fall 2015. Similarly, we revisited the distributed storage corruption work four months after initially starting it. When revisiting, we had better clarity, and the experience gained from interim projects helped us look at each problem in a different way.

Editor’s notes: In addition to sharing their tale and lessons learned, Ram and Aishwarya graciously answered a few more questions.

1) What did you want to work on when you started grad school?

Ram: I didn’t have any research experience before coming to grad school, so I didn’t have many expectations about what I would work on. However, I had worked as an engineer building cloud-based software, so I knew I wanted to work on systems. Our group’s main focus is file and storage systems, so I expected I would work in that area; instead, we branched out slightly into distributed storage.

Aishwarya: I had two years of research experience at Microsoft Research India in mobile computing, and I planned to work on mobile computing/computer networks when I applied to grad school. But I had always liked systems in general and distributed systems in particular; I had wanted to work on distributed systems both during my fellowship at MSR and my master’s at IIT Bombay but ended up working on different areas. So, I joined Andrea and Remzi’s group, took Remzi’s distributed systems class in my first semester, and thus started working on distributed storage.

2) Did you run into any challenges setting up and understanding distributed systems in the first project? There is a bit of a learning curve, right?

Yes, we had to learn how to set up these systems. Sometimes the documentation is very good, which eases the process; most of the time, however, it is not. Forums and mailing lists were very helpful in this regard (e.g., https://forum.cockroachlabs.com/t/right-way-to-start-a-cockroachdb-cluster/256).

3) Could you talk about the extra effort involved in open-sourcing and why you made the code available?

We made the code of PACE and CORDS publicly available. There was some extra effort in cleaning up the code and making it more usable, but it took no more than a week or so. We felt that making the code publicly available improves the chances of adoption. For example, CockroachDB was interested in applying CORDS to their product (please see https://github.com/cockroachdb/cockroach/issues/20337). Other testing tools (e.g., Jepsen) now also run corruption tests that are very similar to what CORDS does.

4) Could you talk about any additional factors like friends at Wisconsin or your prior work experience that contributed to your success?

Ram: I personally feel that learning from peers is an important part of grad school. I work with Aishwarya on all the projects; discussions with her have always refined my hazy ideas and thoughts. Before that, I was lucky to have worked with two senior grad students (Thanu and Vijay), who worked on exciting problems and served as a source of inspiration. I also feel that working in industry for a while helped me work independently; however, I have seen students with no work experience do just as well.

Aishwarya: One of the main factors that helped me kick-start research from the first semester was my experience at MSR India as a Research Fellow. Working at MSR gave me first-hand experience of carrying a problem from conception, to solution, to evaluation, to paper writing. I specifically learned how to come up with different solutions and how to evaluate a system thoroughly while at MSR. In grad school, one of the main contributing factors has been working with Ram. Most of our ideas were conceptualized during our intense discussions; these ideas became concrete as we questioned and reasoned about each other’s ideas thoroughly.
