Painless Data Versioning for Collaborative Data Science
A Bolt-On Approach to Relational Data Versioning
We describe our VLDB’17 paper on OrpheusDB, a bolt-on approach to support versioning capabilities on large datasets. More information at http://orpheus-db.github.io. Blog post written by Silu and Liqi, edited by Aaron Elmore and me.
When data analysts collaboratively analyze data, hundreds to thousands of versions can be generated from a variety of data processing operations, including data wrangling, data curation, data editing, and feature engineering.
For example, consider a group of bioinformaticians analyzing a protein-protein interaction dataset. Here, each individual may prefer to download a copy of the dataset, perform independent data manipulations in private, then make it visible to others (e.g. commit or merge) later when they are confident of their modifications. Hundreds to thousands of such versions may be generated in this manner, and it is often necessary to reconstruct past versions, query across versions (e.g. find some that satisfy a given property), or compute statistics.
Unfortunately, state-of-the-art source-code version control systems (e.g., git, and hosting services like GitHub) do not scale to large unordered datasets. Relational database systems, on the other hand, are efficient at processing large amounts of structured data and come with a sophisticated query language and execution engine, but provide no versioning capabilities.
The OrpheusDB Architecture
We propose OrpheusDB — a bolt-on approach that supports versioning on top of a traditional relational database, without any modifications to the underlying database. Not only does OrpheusDB support standard version control commands, such as checkout and commit, but it also allows SQL queries, including queries on a set of versions and queries that identify versions satisfying certain properties. For more details on OrpheusDB’s query language, see our SIGMOD’17 demo paper.
Bolt-On Versioning: The Issues
To support a bolt-on approach to versioning, we need to figure out a way to represent versioned datasets within a database. One “strawman” approach is to simply associate each tuple with a version ID as a versioning attribute; each tuple is then repeated as many times as the number of versions it appears in. This approach raises two issues:
- Storage is expensive — there could be thousands of versions, meaning each tuple could be repeated a thousand times.
- Checkout is slow — “checking out” (i.e., producing a private copy of) any specific version is expensive since we need to read a number of records that are not pertinent to the version being checked out.
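Both issues are easy to see concretely. Below is a minimal sketch of the strawman layout using Python’s built-in sqlite3; the table and column names (`data`, `rid`, `vid`, `val`) are illustrative, not OrpheusDB’s actual schema:

```python
import sqlite3

# Strawman representation: each tuple is stored once per version it
# belongs to, tagged with a version-id column (vid).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (rid INTEGER, vid INTEGER, val TEXT)")

rows = [
    (1, 1, "a"), (2, 1, "b"),                # version 1
    (1, 2, "a"), (2, 2, "b"), (3, 2, "c"),   # version 2 adds record 3
    (1, 3, "a"), (2, 3, "b"), (3, 3, "c"),   # version 3 same as version 2
]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# Issue 1 (storage): 3 distinct records become 8 physical tuples.
distinct = con.execute("SELECT COUNT(DISTINCT rid) FROM data").fetchone()[0]
physical = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]

# Issue 2 (checkout): materializing version 2 filters on vid, scanning
# tuples from every version, not just the one being checked out.
v2 = con.execute(
    "SELECT rid, val FROM data WHERE vid = 2 ORDER BY rid").fetchall()
```

With thousands of versions, the duplication factor and the wasted scan work both grow with the number of versions, not with the size of any one version.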
These two issues motivate our two key design decisions, as we will see below:
- How do we represent versioned data to minimize storage, while not impacting versioning operations?
- How do we speed-up checkout of one or more versions?
Challenge #1: Data Representation
Compared to the strawman approach (a) above, we can reduce the storage overhead by using an array data type to encode the list of versions each tuple belongs to, as in (b). Array data types are supported in most popular relational databases. Unfortunately, commits are expensive in this representation: when a new version is committed, we must modify every tuple in the new version, appending the new version number to each tuple’s array. Instead, we can separate the versioning information from the data and store it in a separate table, as shown in (c) and (d), where (d) depicts two ways of representing the version-tuple membership information: organized by version, with a list of records per version (split-by-rlist), or organized by record, with a list of versions per record (split-by-vlist). Since split-by-rlist only needs to append a single tuple per commit, it has lower commit latency than split-by-vlist. We therefore select split-by-rlist as the final data representation in OrpheusDB; extensive experiments in our paper verify that it is the best choice.
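The split-by-rlist idea can be sketched in a few lines. OrpheusDB uses array types in a real relational database; SQLite lacks arrays, so this sketch serializes the rlist as JSON, and all names are illustrative rather than OrpheusDB’s actual schema:

```python
import json
import sqlite3

# split-by-rlist: each record is stored once in `data`; a separate
# `versions` table stores, per version, the list of record ids (rlist).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (rid INTEGER PRIMARY KEY, val TEXT)")
con.execute("CREATE TABLE versions (vid INTEGER PRIMARY KEY, rlist TEXT)")

con.executemany("INSERT INTO data VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
con.execute("INSERT INTO versions VALUES (1, ?)", (json.dumps([1, 2]),))
con.execute("INSERT INTO versions VALUES (2, ?)", (json.dumps([1, 2, 3]),))

def checkout(vid):
    """Materialize one version by joining its rlist against the data table."""
    rlist = json.loads(con.execute(
        "SELECT rlist FROM versions WHERE vid = ?", (vid,)).fetchone()[0])
    qs = ",".join("?" * len(rlist))
    return con.execute(
        f"SELECT rid, val FROM data WHERE rid IN ({qs}) ORDER BY rid",
        rlist).fetchall()

def commit(vid, rlist):
    """A commit appends one row to the versions table -- the key reason
    split-by-rlist has lower commit latency than split-by-vlist."""
    con.execute("INSERT INTO versions VALUES (?, ?)",
                (vid, json.dumps(rlist)))

v2 = checkout(2)
commit(3, [1, 3])   # new version reusing records 1 and 3: one appended row
```

Under split-by-vlist, the same commit would instead update the version list of every record in the new version, which is why its commit latency is higher.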
Challenge #2: Storage/Access Trade-off
Checkout is costly because checking out a version requires accessing all the tuples in the table, wasting effort on records that are irrelevant to that version. To reduce checkout latency, we propose breaking the table into small partitions, so that checking out a version only touches the small partition that version belongs to. Of course, this may come at the cost of some extra storage overhead. We formalize the problem by setting a threshold γ on storage and aiming to minimize checkout time. Before introducing our approach, here are two observations:
[Observation I] The storage cost is minimized when all versions are stored in one partition.
[Observation II] The average checkout cost is minimized when each version is stored as a separate partition.
The first observation holds because a single partition contains no duplicated records; the second holds because checking out a version then accesses only its pertinent records. If we plot these two extreme cases with storage size on the x-axis and checkout cost on the y-axis, they lie at opposite corners, and the optimal scheme for storage threshold γ lies somewhere in between. Rather than a single partition or a partition per version, the intuition behind the optimal scheme is to group versions with similar records together.
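A toy cost model makes the trade-off concrete. This is our own simplification, not the paper’s exact formulation: each version is a set of record ids, a partition physically stores the union of its versions’ records, and checking out a version scans its whole partition:

```python
# Three versions: v1 and v2 share most records; v3 is disjoint.
versions = {
    "v1": {1, 2},
    "v2": {1, 2, 3},
    "v3": {4, 5},
}

def storage_cost(partitioning):
    # Each partition stores the union of its versions' records once.
    return sum(len(set().union(*(versions[v] for v in part)))
               for part in partitioning)

def avg_checkout_cost(partitioning):
    # Checking out a version reads every record in its partition.
    total = sum(len(set().union(*(versions[v] for v in part))) * len(part)
                for part in partitioning)
    return total / len(versions)

one_partition = [["v1", "v2", "v3"]]       # Observation I: minimal storage
per_version = [["v1"], ["v2"], ["v3"]]     # Observation II: minimal checkout
grouped = [["v1", "v2"], ["v3"]]           # similar versions grouped together
```

Here the grouped scheme matches the single partition’s storage cost (5 records) while nearly matching the per-version scheme’s checkout cost, which is exactly the kind of in-between point the optimal scheme targets.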
Using a traditional clustering algorithm to group similar versions would involve examining every single tuple, which can be costly since the number of tuples can run into the hundreds of millions or more.
The LyreSplit Approach: Operate on the Version Graph
Instead, we operate on the version graph: the graph of derivation among versions, which encodes not only the derivation information but also the similarity between parent-child versions. For instance, in the left panel of the figure alongside, version 1 is derived from version 2 and has 6 records in common with it. Our OrpheusDB prototype records this information as versions are committed.
We introduce a new property called δ-coherence, which lets us design an approximation algorithm with a provable guarantee. Intuitively, a partitioning is δ-coherent if its average checkout cost is not far (by a factor governed by δ) from the smallest possible checkout cost, as shown in the left figure above.
Our LyreSplit algorithm starts from a single partition containing all versions. We first check whether this partition is δ-coherent. If not, we split the version graph recursively until every partition satisfies the δ-coherence property, as shown in the right figure above.
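The recursion can be sketched as follows. This is a much-simplified illustration under an assumed reading of δ-coherence — that a partition is δ-coherent when no version in it must read more than 1/δ times its own record count at checkout — and with a crude stand-in for partition size and for similarity; the actual definitions, splitting rule, and guarantees are in the paper:

```python
# Version graph as {version: (num_records, parent)}; parent None for root.
graph = {
    "v1": (100, None),
    "v2": (105, "v1"),   # small edit of v1: very similar
    "v3": (40,  "v1"),   # heavy filtering of v1: dissimilar
}

def partition_records(part):
    # Crude stand-in for the partition's physical size (should be the
    # union of its versions' records): the largest version in it.
    return max(graph[v][0] for v in part)

def is_coherent(part, delta):
    # Assumed delta-coherence: every version's checkout reads at most
    # 1/delta times its own record count (delta in (0, 1]).
    size = partition_records(part)
    return all(size * delta <= graph[v][0] for v in part)

def lyresplit(part, delta):
    """Recursively split along the version graph until every partition
    is delta-coherent (simplified sketch, not the paper's algorithm)."""
    if is_coherent(part, delta):
        return [part]
    # Split off the version least similar to the rest (here, crudely,
    # the smallest one).
    worst = min(part, key=lambda v: graph[v][0])
    rest = [v for v in part if v != worst]
    return lyresplit(rest, delta) + lyresplit([worst], delta)

parts = lyresplit(list(graph), delta=0.5)
```

On this toy graph, the initial all-in-one partition fails the check because checking out the small filtered version v3 would read far more than 1/δ times its own records, so v3 is split off while the similar pair v1, v2 stays together.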
Evaluation of LyreSplit and OrpheusDB
We evaluated LyreSplit on a versioning benchmark from a prior paper and compared it against two state-of-the-art clustering-based algorithms. Even though LyreSplit operates on the version graph, and therefore does not have complete information, it is still multiple orders of magnitude faster than the other two algorithms, while returning better results for both storage and checkout. We also evaluated the benefit of partitioning: in brief, we found that with a 2X increase in storage size, we can achieve up to a 20X reduction in checkout time. There are many more experiments in the paper.
Open-source Prototype and Version Explorer
We’ve been spending a lot of time developing OrpheusDB as open-source software, with releases catalogued here.
OrpheusDB allows versions to be “checked out” into CSV files or relational tables; modifications can then be performed in external programming languages or SQL, and later committed back into the system. OrpheusDB supports a wide range of git-style commands along with SQL queries that refer to versions and data. We have also spent time developing an interactive version explorer, described in our SIGMOD’17 demo paper. We’re looking for beta-testers to try out OrpheusDB. Let us know if you’re interested!
OrpheusDB represents our best belief for how collaborative data science should be performed in the future — moving away from the current ad-hoc, poorly managed process, with its scant reproducibility, storage and latency inefficiencies, and lack of understanding or querying capabilities. With the querying capabilities that OrpheusDB inherits for free from mature relational databases, coupled with our intuitive and efficient representation and partitioning schemes, we’ve made substantial progress already!
We thank NSF and the NIH for the support.