In the last post you saw how the Fluid Framework ordering service works and how to run it. Clients send messages to the service, which stamps each one with a sequence number and a minimum sequence number and then broadcasts it to all clients. This operation log is all a client needs to join the collaboration. But by creating snapshots of the log, Fluid can dramatically decrease both load times and the amount of data that must be sent to clients.
Although replaying the entire log can still be fast, it comes with some serious disadvantages. The major one is virtualized loads of a document. If all a user needs is the fifth slide or the tenth page, it would be nice to send only that data in the initial payload and still let them collaborate. With the entire op log you'd need to send, and then process, every op to load that data. In addition, op logs can get large. You can compress them to mitigate some of this, but if the actual file contents are 0 bytes (because you typed out a big document, decided to start over, and deleted it all) you'd like to pull in only 0 bytes. The less data you're required to download and process, the quicker pixels get on the screen and the faster a user can get to collaborating with others.
Supporting those scenarios listed above and having fast loads with low-latency collaboration were design goals of Fluid from the beginning. We knew we wanted to be able to snapshot our logs but didn’t know exactly how to do it. The system that ended up influencing us the most was one we were using every day: Git.
Git is an incredible system. But the commands developers use every day often hide the simple yet incredibly powerful architectural underpinning of Git: the content-addressable filesystem. In a content-addressable filesystem you look up content based on an identifier derived from the content itself, rather than its file location. For Git this identifier is the SHA-1 hash of the content. All Git objects except refs and tags are stored in this filesystem (you can see it in the .git/objects folder). Git prepends a header to the object data which identifies it as a blob, commit, or tree. All content is immutable once in the filesystem. When Git garbage collects, it walks the refs and tags and deletes objects that aren't reachable. Chapter 10 of the Git Book has a great overview of this process.
But storing the full object data for every SHA-1 hash would add up quickly. If you're making small changes to a file as part of a commit, you wouldn't want to store the entire contents again; you'd prefer to store just the change as a diff relative to the old value. This is where Git's packfiles come into play. They're designed to store the data optimally and compress it to minimize total space on disk and over the network. This is how Git can efficiently store your entire revision history. The blog below on implementing git clone in Haskell is an amazing article on this process that's worth checking out (no Haskell knowledge required).
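Packfile deltas themselves are a compact binary format, but the core idea can be sketched with two instruction kinds: copy a range of bytes from the base object, or insert literal bytes. A simplified illustration in TypeScript (not Git's actual encoding):

```typescript
// Simplified delta format in the spirit of Git's packfile deltas:
// a new version is rebuilt from its base by "copy" and "insert" instructions.
type DeltaOp =
  | { kind: "copy"; offset: number; length: number } // reuse bytes from the base
  | { kind: "insert"; data: string };                // literal new bytes

function applyDelta(base: string, ops: DeltaOp[]): string {
  let out = "";
  for (const op of ops) {
    out +=
      op.kind === "copy"
        ? base.slice(op.offset, op.offset + op.length)
        : op.data;
  }
  return out;
}

const base = "The quick brown fox";
const ops: DeltaOp[] = [
  { kind: "copy", offset: 0, length: 10 }, // "The quick "
  { kind: "insert", data: "red" },
  { kind: "copy", offset: 15, length: 4 }, // " fox"
];
applyDelta(base, ops); // "The quick red fox"
```

Only the short instruction list needs to be stored for the new version; the unchanged bytes live once, in the base object.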
git clone in Haskell from the bottom up
Stefan Saasen, March 2013 (@stefansaasen)
When we were building Fluid we knew we wanted a couple of things. At every snapshot we wanted a complete representation of the document, and we wanted to be able to load it fast. This is so people could copy/paste links to historical snapshots of a document and load them directly, or embed them in other documents, not to mention loading the latest version as fast as possible. We also wanted to make sure you didn't need the entire snapshot to begin collaborating. Fluid data structures are all fully independent of one another. This was a very deliberate choice for performance and scalability: the app developer may define load dependencies, but the data structures themselves never take a dependency on one another. The data structures are also designed to be virtualized. For example, a string doesn't require all of its data to be loaded for a user to begin collaborating on it. A developer could load just the first page of text and let the user collaborate on that, paging in the rest of the document on demand. By having independent, virtualized data structures, a client can load exactly what it needs.
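To make the virtualization idea concrete, here's a hypothetical sketch of paging a string in chunk by chunk. The `ChunkStore` interface and class names are illustrative, not Fluid's actual API; the point is that rendering the first page never touches storage for the rest of the document:

```typescript
// Hypothetical sketch of a virtualized string load (not Fluid's actual API).
interface ChunkStore {
  readChunk(index: number): Promise<string>; // fixed-size chunks of snapshot text
}

class VirtualizedString {
  private cache = new Map<number, string>();
  fetches = 0; // chunks actually pulled from storage

  constructor(private store: ChunkStore, private chunkSize: number) {}

  // Load only the chunks covering [start, end); everything else stays remote.
  async getRange(start: number, end: number): Promise<string> {
    const first = Math.floor(start / this.chunkSize);
    const last = Math.floor((end - 1) / this.chunkSize);
    let text = "";
    for (let i = first; i <= last; i++) {
      if (!this.cache.has(i)) {
        this.cache.set(i, await this.store.readChunk(i));
        this.fetches++;
      }
      text += this.cache.get(i);
    }
    return text.slice(start - first * this.chunkSize, end - first * this.chunkSize);
  }
}

// In-memory stand-in for snapshot storage.
const doc = "page one text. ".repeat(100) + "page two text. ".repeat(100);
const store: ChunkStore = {
  readChunk: async (i) => doc.slice(i * 64, (i + 1) * 64),
};
const vs = new VirtualizedString(store, 64);
// Reading the first 100 characters fetches only the first two 64-char chunks.
vs.getRange(0, 100).then((text) => console.log(text.length, vs.fetches)); // 100 2
```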
With these goals in mind we then had to figure out how to do this efficiently. We had our log of messages. And given a snapshot, the messages since that snapshot are the only things that have changed in the document by the time we want to make a new one.
Our design goals combined with how Git works, plus our message deltas between snapshots sharing similarities with Git's packfiles, led us to build Fluid on top of Git's content-addressable filesystem. This was a great decision. The two systems worked great with one another out of the box. At one point we considered using our delta messages in place of Git's delta compression, but the built-in delta compression worked great as is, so this wasn't necessary. By leveraging Git we were also able to take advantage of the great code and ecosystem it provides. Because all objects are immutable, we can aggressively cache content and fully leverage CDNs. And because of how tested and efficient Git is, we also got fast, robust, and efficient storage. It also let us do cool things like store your documents to GitHub/GitLab/etc., and take inspiration from Git itself to allow you to fork and merge documents.
We don't require you to use Git when running your own service; you just need to provide a content-addressable filesystem. But Git is certainly my personal favorite to combine with Fluid, especially when storing directly to GitHub.
So with all of that context, how do snapshots actually get made? A good way to think of one is as a pull request. Fluid elects a client in the collaboration to be in charge of summarizing the document and creating the snapshot. That client creates the snapshot at a particular sequence number and then submits the "PR" to the service. The service inbounds this snapshot message, validates it, and "clicks" the merge button. When a new client joins, it simply does a shallow Git fetch to get the latest data, then downloads the set of messages since that snapshot and applies them to the document. This makes loading Fluid documents fast and efficient.
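The flow above can be sketched roughly like this. The types and function names here are hypothetical, not Fluid's actual API; they just capture the shape of "elected client proposes, service merges, new clients fetch snapshot plus tail":

```typescript
// Hypothetical sketch of the "snapshot as a pull request" flow.
interface Op {
  sequenceNumber: number;
  contents: unknown;
}

interface SnapshotProposal {
  handle: string;         // content-addressed ID of the snapshot tree
  sequenceNumber: number; // the last op this snapshot summarizes
}

// The elected client summarizes the ops into a tree and "opens the PR".
// writeTree stands in for writing the snapshot to content-addressable storage.
function proposeSnapshot(
  ops: Op[],
  writeTree: (ops: Op[]) => string
): SnapshotProposal {
  return {
    handle: writeTree(ops),
    sequenceNumber: ops[ops.length - 1].sequenceNumber,
  };
}

// Once the service validates and "merges" the proposal, a joining client
// only needs the snapshot plus the ops that came after it.
function opsForNewClient(allOps: Op[], accepted: SnapshotProposal): Op[] {
  return allOps.filter((op) => op.sequenceNumber > accepted.sequenceNumber);
}
```

The service-side validation and the shallow fetch are where Git earns its keep: the handle is just a content address, so the client can fetch exactly one tree without any history behind it.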