SHAttered Subversion

The SHA-1 collision and why predicting the future is hard

By now we’ve all seen the bug: After the release of the SHA-1 hash collision by Google last week, WebKit tried to check in copies of the two colliding PDFs and subsequently hosed their Subversion repository. This is the story of how that bug came to be.

To understand why this is a problem for some Apache Subversion repositories, it’s important to understand how Subversion stores information on the server. The current default backend is known as FSFS (“fuzz-fuzz”) because it uses raw, immutable files on disk to store repository contents. Each revision stores all of its file contents and metadata in a single file on the filesystem, and that file is never supposed to change. Because revision files never change, future revisions can safely reference content stored in older ones; the stored contents of a file are known as a “representation”.
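
To make that concrete, here’s a toy sketch of the idea in Python (the names and layout are invented for illustration, not Subversion’s actual code): each revision’s contents go into an append-only file of its own, and any piece of content can later be referenced by a (revision, offset, size) triple that stays valid forever, precisely because revision files never change.

    # toy_fsfs.py -- a toy illustration of FSFS-style immutable revision files.
    # Names and layout are invented for illustration; this is not Subversion's code.
    import os

    class ToyRevisionStore:
        def __init__(self, root):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def write_representation(self, revision, contents):
            """Append contents to the file for `revision`; return a durable reference."""
            path = os.path.join(self.root, str(revision))
            with open(path, "ab") as f:
                f.seek(0, os.SEEK_END)
                offset = f.tell()
                f.write(contents)
            return (revision, offset, len(contents))

        def read_representation(self, ref):
            """Fetch contents given a (revision, offset, size) reference."""
            revision, offset, size = ref
            with open(os.path.join(self.root, str(revision)), "rb") as f:
                f.seek(offset)
                return f.read(size)

    # Because revision files are never rewritten, a reference created while
    # committing one revision remains valid for every later revision to reuse.
    store = ToyRevisionStore("/tmp/toy-repo")
    ref = store.write_representation(1, b"hello, immutable world\n")
    assert store.read_representation(ref) == b"hello, immutable world\n"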

Fast forward to 2008, when I was looking at ways to reduce the storage requirements for Subversion repositories on the server. One piece of low-hanging fruit was duplication of repository contents. While Subversion uses deltas to reduce the size of a representation, repositories that see heavy branching and merging can still end up with multiple representations of the same file contents, which is wasteful. I decided to implement a key-value store for content so that a new representation wouldn’t need to be written if identical contents already existed in the repository. Since revision files were immutable, we simply added an SQLite database that tracked the mapping from a content key to the location of the existing representation in an earlier revision file.
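
The scheme looks roughly like the sketch below (again with invented names; the real cache is an SQLite file called rep-cache.db inside the repository’s db/ directory, but the schema shown here is only an approximation): hash the incoming contents, look the hash up in the cache, and on a hit reuse the location of the existing representation instead of writing the bytes again.

    # toy_rep_cache.py -- a toy sketch of SHA-1 keyed representation sharing, in
    # the spirit of Subversion's rep-cache.db (schema and names are illustrative).
    import hashlib
    import sqlite3

    # Stand-in for the immutable revision files of the previous sketch: a map
    # from a (revision, offset, size)-style reference to the bytes stored there.
    repository = {}

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE rep_cache (
                      hash     TEXT PRIMARY KEY,   -- hex SHA-1 of the contents
                      revision INTEGER NOT NULL,   -- where those bytes already live
                      offset   INTEGER NOT NULL,
                      size     INTEGER NOT NULL)""")

    def store_contents(revision, contents):
        """Return a reference to contents, writing them only if they are new."""
        key = hashlib.sha1(contents).hexdigest()
        row = db.execute("SELECT revision, offset, size FROM rep_cache WHERE hash = ?",
                         (key,)).fetchone()
        if row is not None:
            return tuple(row)                             # share the existing representation
        ref = (revision, len(repository), len(contents))  # pretend "location on disk"
        repository[ref] = contents
        db.execute("INSERT INTO rep_cache VALUES (?, ?, ?, ?)", (key, *ref))
        db.commit()
        return ref

    # Identical contents committed in two different revisions are stored only once.
    assert store_contents(2, b"same bytes\n") == store_contents(3, b"same bytes\n")
    assert len(repository) == 1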

You can probably guess where this is headed: we chose SHA-1 as the hashing key for repository contents. Surprisingly, the debate was between MD5, which we already used for corruption detection but which was known to be cryptographically broken, and SHA-1, which was much more robust. While I did spend some time investigating more modern hash functions like SHA-256, we decided that SHA-1 was sufficiently secure that intentional collisions wouldn’t be generated any time soon. Unfortunately, last Thursday happened sooner than expected.

Representation sharing has been on by default for new repositories since Subversion 1.6, but fortunately, it remains optional. The simple workaround is to disable representation sharing on the server, which prevents content that hashes to the same SHA-1 value from sharing a representation, and with it the resulting problems. This works for all future revisions and should prevent problems for the foreseeable future (i.e., until a preimage attack on SHA-1 is found).
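
For FSFS repositories, the knob lives in the [rep-sharing] section of the repository’s db/fsfs.conf: setting enable-rep-sharing = false there turns the feature off for future commits. Below is a hedged little audit script for finding repositories that still have it enabled; the script name, the parent-directory argument, and the assumption that a missing option means “enabled” (the 1.6+ default) are mine, not anything shipped with Subversion.

    # audit_rep_sharing.py -- report which FSFS repositories still have
    # representation sharing enabled.  A sketch only: it assumes the standard
    # db/fsfs.conf layout and treats a missing option as the default (enabled).
    import configparser
    import os
    import sys

    def rep_sharing_enabled(repo_path):
        conf = os.path.join(repo_path, "db", "fsfs.conf")
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(conf)          # a missing file simply leaves the parser empty
        return parser.getboolean("rep-sharing", "enable-rep-sharing", fallback=True)

    if __name__ == "__main__":
        parent = sys.argv[1]       # e.g. a directory that holds all the repositories
        for name in sorted(os.listdir(parent)):
            repo = os.path.join(parent, name)
            if os.path.isdir(os.path.join(repo, "db")):
                state = "ENABLED" if rep_sharing_enabled(repo) else "disabled"
                print(f"{name}: rep-sharing {state}")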

Unfortunately, disabling representation sharing negates the original goal of smaller repositories. The FSFS file format is flexible enough to adopt a new hashing algorithm in future Subversion releases, but doing so without a full repository reload would only allow sharing among future representations, not those already in the repository. While not ideal, correctness is always better than optimization.

Representation sharing isn’t the only part of Subversion affected by potential SHA-1 collisions. Subversion also keeps a client-side store of pristine copies of local file contents, which it uses to show differences and to avoid fetching duplicate content from the remote server. For files with different contents but the same hash value (such as those demonstrated in the SHAttered attack), local working copies will silently return incorrect data, because Subversion internally considers the two files identical.
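
Here’s what that failure looks like with a toy content-addressed cache keyed by SHA-1 (an illustration in the spirit of the pristine store, not Subversion’s working-copy code). Run it with two files that differ but share a SHA-1 digest, such as the two colliding PDFs, and fetching the second file silently returns the first file’s bytes:

    # toy_pristine.py -- why SHA-1 collisions corrupt a content-addressed cache.
    # A toy illustration, not Subversion's working-copy code.  Run it with two
    # colliding files (e.g. the two SHAttered PDFs) as arguments.
    import hashlib
    import sys

    class ToyPristineStore:
        """Caches file contents under their SHA-1 digest, like a pristine store."""
        def __init__(self):
            self.blobs = {}

        def add(self, contents):
            key = hashlib.sha1(contents).hexdigest()
            self.blobs.setdefault(key, contents)   # a colliding file is "already there"
            return key

        def get(self, key):
            return self.blobs[key]

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            first = f.read()
        with open(sys.argv[2], "rb") as f:
            second = f.read()

        store = ToyPristineStore()
        key1, key2 = store.add(first), store.add(second)

        if key1 == key2 and first != second:
            # Both files map to a single cache entry, so the second file
            # silently comes back as the first file's bytes.
            print("collision: fetching the second file returns the first file's",
                  "contents:", store.get(key2) == first)
        else:
            print("these two files do not collide under SHA-1")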

I’m no longer actively involved in Subversion development, but from what I can tell, the Subversion team is responding to these issues, both issuing short-term workarounds and working on long-term solutions. Watching their response, I am confident that Subversion remains a tool you can trust with your data.

If there’s a lesson to be learned here, it’s that no piece of technology is ever immune to obsolescence, and that as software engineers we need to plan for even remote contingencies as we design systems that need to live for decades. We may not always make the right technical decisions, but we must preserve the ability to change those decisions if we expect our systems to stay viable over the long term.