At Gigantum, we are building an open-source tool for developing, executing, and sharing data science projects that automates the creation of versioned and containerized code. This way your work is always accessible, reproducible, and transparent. Our ultimate goal is to make science and data science more efficient and reproducible, and we want people to directly access and build on each other’s work without all of the technical hassles. You can learn more about Gigantum, try the Client in the cloud, or download and install it locally at our website: https://gigantum.com
A core concept of the platform is the Gigantum Project. Projects bundle data, code, and environment configuration into an augmented repository that is automatically managed by the Gigantum Client. Projects can be created from scratch, imported as a file, or shared via Gigantum Cloud, and each one contains a granular history of changes to data, code, and environment. This high resolution history is accessible through the Activity Feed, which is a visual record of figures and searchable text that lets you find and inspect everything that has been done by every person that has worked on the Project.
When we started, we knew we wanted to leverage Git to version changes to Projects because of Git’s distributed design and maturity. The main issue was that to power features like the Gigantum Client’s Activity Feed, we needed to store rich metadata (e.g. figure thumbnails, code snippets, tags) with each commit, and this was impractical to store directly in Git.
To store this additional data we needed a system that satisfied some important constraints. First, the datastore had to be embedded in the Project itself so that everything remained bundled together. Second, it had to be a simple, file-based system that didn’t require a complex database or server process to be running. Finally, the datastore needed to be Git compliant, as the initial version of Gigantum Projects was to be built on Git and Git LFS.
The solution we developed was a simple key-value datastore based on a checkout-aware, append-only log file structure. This allows the Gigantum Client to store arbitrary key-value pairs within the repository, linked to a specific Git commit. A simple library handles reading from and writing to this datastore.
To write data, the library takes a “checkout ID” and a value, writes the value to the appropriate log file for that checkout ID, and returns a key for reading the stored value back. A checkout ID is a unique hash generated and stored on each Git operation that creates a new context (e.g. checkout, merge, clone). This guarantees that the datastore files will never cause merge conflicts during Git operations. Additionally, data larger than 4kB is automatically blosc compressed to reduce disk space. To retrieve a value from a provided key, the library decodes the key to get the checkout ID, file ID, byte offset, and byte count, then reads and decompresses the value as needed.
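To make the write/read cycle concrete, here is a minimal sketch of how such a key might be packed and unpacked, with a compression threshold like the one described above. The key layout, function names, and the use of zlib as a stand-in for blosc are illustrative assumptions, not Gigantum's actual implementation:

```python
import zlib

COMPRESS_THRESHOLD = 4096  # values larger than 4kB get compressed (per the post)

def encode_key(checkout_id: str, file_id: int, offset: int, length: int) -> str:
    # A key bundles everything needed to locate the value later:
    # which checkout's log, which rolled file segment, and the byte range.
    return f"{checkout_id}.{file_id}.{offset}.{length}"

def decode_key(key: str):
    # Reverse of encode_key: recover the four fields from the key string.
    checkout_id, file_id, offset, length = key.split(".")
    return checkout_id, int(file_id), int(offset), int(length)

def prepare_value(value: bytes) -> bytes:
    # Stand-in for blosc: compress only when the payload exceeds the threshold.
    if len(value) > COMPRESS_THRESHOLD:
        return zlib.compress(value)
    return value
```

Because the key carries the byte offset and count directly, a read never requires scanning the log file, just a seek and a bounded read.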
This library is integrated with the Gigantum Client, which automatically generates checkout IDs and creates new versions for the user. To best understand how this embedded datastore works, let’s work through an example.
First, we create a new Project. This automatically generates a checkout ID “af349bc1” and stores it in an untracked file in the repository. This ID forms the basis of the datastore’s active filename, to which data is written. If this file grows too large, it automatically rolls: the file ID for the given checkout ID is incremented, which results in the creation of a new file.
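The checkout-ID generation and file-rolling behavior could be sketched roughly as follows; the hash inputs, filename pattern, and 4 MB roll threshold are all hypothetical choices for illustration:

```python
import hashlib
import os
import uuid

MAX_LOG_SIZE = 4 * 1024 * 1024  # hypothetical roll threshold

def new_checkout_id(branch: str, commit: str) -> str:
    # Hash the git context plus a random salt so that two clones at the same
    # branch/commit still get distinct IDs, and thus write to distinct files.
    salt = uuid.uuid4().hex
    return hashlib.sha256(f"{branch}|{commit}|{salt}".encode()).hexdigest()[:8]

def active_log_path(store_dir: str, checkout_id: str, file_id: int) -> str:
    # The active datastore filename is derived from the checkout ID.
    return os.path.join(store_dir, f"log_{checkout_id}_{file_id}")

def next_file_id(store_dir: str, checkout_id: str, file_id: int) -> int:
    # Roll to a new file segment once the active one grows past the threshold.
    path = active_log_path(store_dir, checkout_id, file_id)
    if os.path.exists(path) and os.path.getsize(path) >= MAX_LOG_SIZE:
        return file_id + 1
    return file_id
```

The random salt is the important detail: it is what makes two divergent working copies write to different files, so their datastores can never collide in a merge.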
Next, let’s add a new Jupyter notebook to the Project and execute a cell that generates a figure. This causes the Gigantum Client to automatically create a new version and store a thumbnail of the figure. First, a `git add` and `git commit` are made to track all changes made by the user.
These changes are captured by Gigantum and processed to extract metadata. The client is able to collect this extra data due to a tight integration with Jupyter, which will be discussed in another post. In this case, it’s a thumbnail and the code snippet that was executed. These are written to the embedded datastore and keys are collected.
Finally, another git commit is made to capture the changes to the datastore file. During this commit, a specially crafted git commit message is used to indicate that this is an Activity Feed record. The message contains a reference to the previous commit hash (which actually contains the user’s changes), high-level metadata, and the keys to access the additional detailed data that was written to the store.
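A record like this could be sketched as a tagged payload inside the commit message; the marker string, JSON structure, and field names below are hypothetical, not Gigantum's published format:

```python
import json

ACTIVITY_TAG = "_GTM_ACTIVITY_"  # hypothetical marker for Activity Feed records

def build_activity_message(linked_commit: str, summary: str, keys: list) -> str:
    # Embed the linked commit (which holds the user's actual changes),
    # a high-level summary, and the datastore keys for the detailed data.
    payload = {"linked_commit": linked_commit, "summary": summary, "keys": keys}
    return f"{ACTIVITY_TAG}{json.dumps(payload)}"

def parse_activity_message(message: str):
    # Returns None for ordinary commits, the decoded record otherwise.
    if not message.startswith(ACTIVITY_TAG):
        return None
    return json.loads(message[len(ACTIVITY_TAG):])
```

With a scheme like this, walking the git log and checking each message against the marker is enough to separate Activity Feed records from ordinary commits.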
With this write complete, both a new version and an Activity Feed record have been created. The Client can then simply use the git log to render the Project’s version history in the Activity Feed, reading detailed data from the embedded store as needed. Because the git log is used to order and access data, git operations like branching and merging operate not only on the user’s content but also on the embedded datastore.
This simple, file-based design will help ensure that data stored in Gigantum Projects remains open and accessible, even if you don’t use the Client. By directly interfacing with the git log and decoding the simple keys stored there, anyone can access everything that was written.
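To illustrate that openness, here is a minimal sketch of how an outside tool might fetch a stored value given only a key and the repository files. The key layout, filename pattern, and zlib stand-in for blosc are assumptions carried over for illustration, not the actual on-disk format:

```python
import os
import zlib

def read_value(store_dir: str, key: str) -> bytes:
    # Assumed key layout: "<checkout_id>.<file_id>.<offset>.<length>".
    # Reading is just a seek plus a bounded read -- no server required.
    checkout_id, file_id, offset, length = key.split(".")
    path = os.path.join(store_dir, f"log_{checkout_id}_{file_id}")
    with open(path, "rb") as f:
        f.seek(int(offset))
        data = f.read(int(length))
    try:
        return zlib.decompress(data)  # large values were stored compressed
    except zlib.error:
        return data  # small values are stored as-is
```

In other words, nothing about the stored data is locked behind the Client: a short script and the git log are enough to recover every thumbnail and snippet.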
The Gigantum Client is still in beta and the Project format is still evolving. We plan to release a formal specification of the Project structure and the embedded datastore as things continue to mature and stabilize. Additionally, we are continuously working on improvements, including changes that will simplify git operations in Projects, boost sync speed, and improve large file performance.