CSR Tale #10: CrashMonkey — finding bugs in file systems (even verified ones!)

I figured it was time for another story of my own. This is the story of the CrashMonkey project, and how it came to be published at OSDI 2018. We ran the CrashMonkey project as an open-source project from Day 1, so I’ll also be commenting on that aspect of the project.

The origin of the CrashMonkey project lies in our SOSP 2013 paper on Optimistic Crash Consistency. We had built a new file system, OptFS, and we were wondering how to ensure that it was reliable. Unfortunately, there were no tools available to check the reliability of a file system, especially to check if the file system recovered correctly after a crash. At the time, my co-author Thanu put together a quick framework to crash the file system at random points, let the file system recover from the crash, and then see if it had recovered correctly. Checking several random crash points provided a measure of confidence that the file system was crash-consistent. Thanu’s framework was also at the heart of our OSDI 2014 paper, where we checked whether applications recovered correctly from crashes.

Fast-forward to Fall 2016, when I joined UT Austin as an assistant professor. Ashlie Martinez, an undergraduate student and Turing Scholar at UT Austin, was interested in working with me. So I pitched her the idea of a framework to test file-system crash consistency, based on the ideas from the SOSP 2013 and OSDI 2014 papers. I actually expected Ashlie to spend the first semester just getting up to speed, so I was flabbergasted when she came back in about a month with a minimal working prototype! We wrote up the results of her work as a workshop paper, and it was accepted to HotStorage 2017. Ashlie delivered a great talk, and the project received a lot of enthusiasm and positive support.

We then started tackling the harder questions in the project: how do we systematically test whether a file system recovers correctly? Crashing at random points during a random workload seems unsatisfactory — can we do better? While CrashMonkey (at that point) provided the infrastructure to crash during a workload and then test whether the file system recovered correctly, it required a hand-written checker (because it didn’t know what correct recovery was otherwise). The developer also had to hand-pick the workload that would be run on the target file system; what workloads should be picked?

It seemed clear CrashMonkey was no longer a one-person project, so we roped in two graduate students: Jayashree Mohan and Soujanya Ponnapalli. To tackle the hard questions posed above, we looked back at the crash-consistency bugs reported in Linux file systems over the past five years. We saw some interesting patterns: most of the reported bugs could be reproduced by replaying a small set of file-system operations on a new, empty file system. All the bugs involved crashing just after a file or directory had been flushed to storage with a call to fsync(), fdatasync(), or sync. We realized that most file-system operations only modified state in memory; only fsync(), fdatasync(), or sync provided any guarantees about what would be flushed to persistent storage. Thus, it is enough to crash right after these points in the workload. Another reason is that file-system developers are overloaded with work — they do not particularly care if a crash in the middle of a write() leads to data loss; they will simply say “you should have used fsync”. On the other hand, if you show them a bug where a file was persisted with fsync(), and still lost data or resulted in corruption, then they will take you seriously.

Based on these insights, we came up with a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). We treat the set of file-system operations as primitives and compose workloads from these primitives. For example, renaming a file would be one primitive, linking a file to a different name another, and so on. Since our study told us that only a few operations are needed to expose the reported bugs, we exhaustively test small combinations of operations and see how the file system reacts. We restrict the files and directories used in the operations to two files inside each of two directories. We also restrict the range for writes inside a file to the beginning, the middle, and the end, and we try to make write operations overlap; appends are allowed to extend the end of the file.

The bounds used by CrashMonkey in crash-testing file systems
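To give a rough sense of what these bounds look like, here is an illustrative encoding; the exact values below are ours for exposition, not CrashMonkey's or Ace's actual configuration.

```python
# Illustrative bounds (values chosen for exposition only).
DIRS = ["A", "B"]
FILES = ["A/foo", "A/bar", "B/foo", "B/bar"]   # two files inside each of two directories

# Writes are confined to a few coarse regions so that generated writes can overlap;
# appends extend the current end of the file.
WRITE_RANGES = {
    "beginning": (0, 4096),
    "middle":    (4096, 8192),
    "end":       (8192, 12288),
    "overlap":   (2048, 6144),    # deliberately straddles two regions
}
APPEND_SIZE = 4096

# One workload that fits within these bounds: two overlapping writes to A/foo,
# followed by an fsync (after which a crash point is explored).
EXAMPLE_WORKLOAD = [
    ("write", "A/foo", WRITE_RANGES["beginning"]),
    ("write", "A/foo", WRITE_RANGES["overlap"]),
    ("fsync", "A/foo"),
]
```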

To accomplish this sort of exhaustive testing, we augmented CrashMonkey so that, as it runs the workload, it records which files have been modified and persisted; only these files should be expected to survive the crash. CrashMonkey also understands the semantics of each file-system operation, so it knows what to check in the file system after it recovers from the crash. We also created another tool, the Automatic Crash Explorer (Ace), which generates the different workloads given constraints such as the set of file-system operations and the number of operations in a workload.
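Here is a minimal sketch of the idea behind the automatic checker. This is not CrashMonkey's actual implementation (which also encodes per-operation semantics, e.g. what a rename or an unlink should leave behind); it only shows the core bookkeeping: remember what was explicitly persisted, and check exactly that after recovery.

```python
import os

class Oracle:
    """Track which files were explicitly persisted during the workload."""

    def __init__(self):
        self.persisted = {}                    # path -> contents promised durable

    def record_fsync(self, mnt, path, fd):
        os.fsync(fd)                           # the durability promise is made here
        with open(os.path.join(mnt, path), "rb") as f:
            self.persisted[path] = f.read()    # snapshot what must survive a crash

    def check(self, mnt):
        # Run after simulating the crash and letting the file system recover:
        # only the explicitly persisted files carry guarantees worth checking.
        for path, expected in self.persisted.items():
            full = os.path.join(mnt, path)
            assert os.path.exists(full), f"{path} was fsync'd but is missing"
            with open(full, "rb") as f:
                assert f.read() == expected, f"{path} was fsync'd but is corrupted"
```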

Exhaustively testing even a small number of file-system operations results in a huge number of workloads. For example, with three file-system operations, we estimated that 25 million workloads needed to be tested! By focusing on certain cases (such as nested directories), we reduced the set to 3 million workloads. But even this would take six months to run on a single core!
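To get a feel for why the numbers blow up, here is a toy, Ace-style enumeration. The operation set and argument choices below are made up and much smaller than the real bounds, so the printed counts are only meant to show the shape of the growth.

```python
from itertools import product

FILES = ["A/foo", "A/bar", "B/foo", "B/bar"]          # toy bounded file set
OPS = (
    [("creat",  f) for f in FILES] +
    [("unlink", f) for f in FILES] +
    [("fsync",  f) for f in FILES] +
    [("rename", (src, dst)) for src in FILES for dst in FILES if src != dst]
)

def workloads(n):
    """Every n-operation sequence drawn from the bounded choices."""
    return product(OPS, repeat=n)

for n in (1, 2, 3):
    # 24, 576, and 13,824 with this toy bound; the real bounds (more operations,
    # more argument combinations) push the three-operation count into the millions.
    print(n, len(OPS) ** n)
```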

Fortunately, at UT Austin, we had access to the Chameleon cloud computing testbed. We were able to provision 65 nodes, on which we ran 780 virtual machines in total. This part of the project was handled by Pandian Raju (of PebblesDB fame, now at Rubrik), who ran CrashMonkey on all 780 virtual machines simultaneously and handled the cases where a virtual machine crashed or a physical server went down. As a result of these heroic efforts, we were able to test 3.3 million workloads in two days!

We found 10 bugs in btrfs and F2FS! We breathed a sigh of relief, since we had no idea if we would find any bugs when we started this experiment. But the results showed that our approach worked: this form of systematic, exhaustive testing is effective at catching crash-consistency bugs! All the bugs we found result in metadata corruption or data loss even after fsync(), fdatasync(), or sync() is called. Developers have fixed four of the bugs and are working on the others.

We wrote this up and submitted it to OSDI and got accepted with great reviews :) The reviewers loved how we tackled a hard problem. Our work also fills a long-standing need in the research community to test the reliability of new file systems! Our tools are available on our group’s GitHub page: https://github.com/utsaslab/crashmonkey. We have tried to include enough documentation so that someone else can pick up and use our tools, and we have made the tools robust enough that they won’t fall apart when testing real file systems.

Documentation in the CrashMonkey codebase

After we finished the camera-ready version of the paper, we shared it within UT Austin and on Twitter. Folks encouraged us to try testing verified file systems; we were skeptical we would find anything, but to our surprise, when we tested MIT’s FSCQ file system, we found it did not persist data on fdatasync()! It turned out there was a bug in the un-verified portion of their code (the Haskell-C bindings), and CrashMonkey caught it! This shows that even verified file systems have un-verified components, which are often complex and which will have bugs. One other thing to note: Jayashree found the FSCQ bug using CrashMonkey in a couple of days. On Friday, Sep 28, we were encouraged to test verified file systems; Jayashree spent the weekend testing Yxv6 and FSCQ; and we emailed the authors about the FSCQ bug on Monday, Oct 1. Thus, setting up CrashMonkey on a new file system doesn’t take a lot of time, and it is effective at finding bugs quickly.
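For intuition, the check that tripped on FSCQ has roughly this shape; this is a sketch of the pattern, not the exact reproducer we sent to the authors.

```python
import os

def workload(mnt):
    # Write data to a file and fdatasync it; the crash is simulated right after.
    path = os.path.join(mnt, "foo")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    os.write(fd, b"persist me")
    os.fdatasync(fd)          # the file's data must be durable past this point
    os.close(fd)

def check_after_recovery(mnt):
    # fdatasync'd data must survive the crash.
    with open(os.path.join(mnt, "foo"), "rb") as f:
        assert f.read() == b"persist me", "fdatasync'd data lost after crash"
```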

Here’s a quick demo video Jayashree made about the project:

Some thoughts as I look back on the project:

  • We ran this as an open-source project from Day 1. If you go look at our repository, you can see our thoughts as the project evolved :) We used code reviews and pull requests for coordination. Ashlie had high standards for how the code should look and be documented; thanks to this, our code base was relatively easy for others to understand and modify. The open-source nature of the project also helped draw in undergraduate students who wanted to work on systems research.
  • This project involved significant interaction with file-system developers, to understand the guarantees provided by fsync() and fdatasync(). All file systems provide guarantees above and beyond what POSIX states. Our interactions with developers had two significant outcomes. First, they helped us understand that the semantics of some file-system operations are vague. For example, if we create a symlink to an existing file, the symlink can go missing after a crash even if we try to explicitly persist it; unlike hard links, symlinks are not regular files, and it is not possible to directly open a symlink in order to fsync it (see the sketch after this list). This helped us generate workloads in a more reasonable manner, eliminating operations, like symlink creation, that provide no strong guarantees. Second, although file systems provide guarantees beyond what POSIX expects, these are not formally documented. In Ted’s words, “Filesystem developers knows what it means, and it gets encoded as things like test in xfstests”. Our conversations prompted developers to write down such guarantees explicitly for the first time.
  • This was a project where we weren’t guaranteed results. We had no idea if we would find any new bugs (the gold standard for a tool like this). The thrill of running the experiment and finding that first new bug was amazing — but it also came after months and months of effort. Projects like this require a ton of persistence.
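The symlink sketch mentioned above: a symlink itself cannot be opened and fsync’d, so the strongest thing an application can do is persist the parent directory entry that names it, and file systems make few promises about that.

```python
import errno, os

def try_to_persist_symlink(link_path):
    # Opening a symlink with O_NOFOLLOW fails with ELOOP, so the symlink
    # itself cannot be fsync'd; fall back to fsync'ing its parent directory.
    try:
        fd = os.open(link_path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as e:
        assert e.errno == errno.ELOOP
        dfd = os.open(os.path.dirname(link_path), os.O_RDONLY | os.O_DIRECTORY)
        os.fsync(dfd)
        os.close(dfd)
        return
    os.fsync(fd)   # not reached when link_path is a symlink
    os.close(fd)
```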