Building a time machine: seamless ML dataset versioning with Lance

Chang She · LanceDB · Nov 17, 2022

Dataset versioning, debugging, and model reproducibility

In machine learning, being able to version your dataset is invaluable for model debugging and reproducibility. In most model development scenarios today, model iteration is really dataset iteration, which means model debugging is really dataset debugging.

If you find that your F1 / recall / accuracy score isn’t improving with more labels, it’s critical to understand why. You need to be able to compare label distributions and class imbalance between dataset versions. You need to compare top error contributors, check for newly introduced label noise, among many other things. Today this process is extremely cumbersome when it’s possible at all, involving lots of copying, complicated syntax, and configuration files that must be managed separately from the data itself.

What if it didn’t have to be complicated? Instead of writing custom infrastructure to manage snapshots, what if you could just add new samples to your training data and have it be versioned automatically? Instead of having to spin up another database or pay for another SaaS, what if all you needed was your data in S3 and your laptop running Jupyter Notebook?

All of this is easy with Lance. Lance is a new open-source ML data format that supports blazing-fast exploration and analysis of ML data using any Arrow-compatible SQL engine (e.g., duckdb) or DataFrame API (e.g., pandas). Lance versions your data automatically and supports comparisons across versions without needing to “checkout” one version at a time. Let’s see how this is done.

github.com/eto-ai/lance

Model iteration is really dataset iteration

Suppose we’re training a classification model. For this example, we’ll use the Oxford Pet dataset. You can get the raw dataset using fastai:

get raw oxford pets dataset
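A minimal sketch of that step: fastai’s `untar_data` downloads and extracts the Oxford-IIIT Pet dataset into a local cache and returns the path.

```python
# download and extract the Oxford-IIIT Pet dataset via fastai
from fastai.data.external import untar_data, URLs

path = untar_data(URLs.PETS)  # returns a pathlib.Path to the extracted data
print(path)
```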

With Lance it’s easy to ingest datasets from this known format:

Ingest dataset as pandas DataFrame
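Something like the following sketch (illustrative, not the original gist): the breed label is encoded in each image’s filename, so we can parse it into a pandas DataFrame before handing it to Lance.

```python
import pandas as pd

# breed labels are encoded in the filenames, e.g. "Abyssinian_1.jpg"
image_paths = sorted((path / "images").glob("*.jpg"))
df = pd.DataFrame({
    "filename": [str(p) for p in image_paths],
    # strip the trailing "_<n>" index to recover the breed name
    "class": [p.stem.rsplit("_", 1)[0] for p in image_paths],
})
```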

Let’s say we’re getting this dataset in batches of 1000, and for each new batch we train a new model version by labeling the new training examples and adding them to our dataset. Without Lance, keeping this process reproducible involves a lot of manual work: you need to explicitly create copies in different directories and build a system to retrieve each version. Or you might use a git-like tool that doubles your git complexity. Or you need yet another service that hides the complexity by taking care of a lot of machinery under the hood.

Lance versions datasets automatically

With Lance, you don’t need to think about creating new snapshot copies; you just keep appending to the same dataset:

create the dataset in batches
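A sketch of that loop, assuming the `df` from above and the `lance.write_dataset` entry point in recent pylance releases (older releases exposed slightly different write APIs):

```python
import lance
import pyarrow as pa

uri = "oxford_pet.lance"  # could also be an s3:// path

# write the data 1000 rows at a time; each write commits a new version
for start in range(0, len(df), 1000):
    batch = pa.Table.from_pandas(df.iloc[start:start + 1000])
    mode = "create" if start == 0 else "append"
    lance.write_dataset(batch, uri, mode=mode)
```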

Lance automatically creates a new version each time you add a batch of training examples:

versions are automatically created
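For example, in pylance the dataset object can list every version along with its commit timestamp:

```python
ds = lance.dataset(uri)
for v in ds.versions():
    print(v["version"], v["timestamp"])  # one version per appended batch
```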

Lance versions are queryable

You don’t need config files to manage these versions, nor do you have to “checkout” a particular version explicitly. Lance dataset versions are super easy to retrieve:

pass in `version` parameter to get a different version from same uri
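A sketch: loading the first and the latest version side by side from the same uri.

```python
# load the first version and the latest version of the same dataset
v1 = lance.dataset(uri, version=1).to_table().to_pandas()
latest = lance.dataset(uri).to_table().to_pandas()
```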

Different versions are still just Arrow-compatible datasets, which means they’re also super easy to query.

For example, I can compute the label distribution across different versions using duckdb:

label distribution across versions
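Roughly like this, relying on duckdb’s ability to query pandas DataFrames that are in scope (here, the `v1` and `latest` frames loaded above):

```python
import duckdb

# count examples per breed in each version
dist_v1 = duckdb.query(
    'SELECT "class", count(*) AS n FROM v1 GROUP BY 1 ORDER BY n DESC'
).to_df()
dist_latest = duckdb.query(
    'SELECT "class", count(*) AS n FROM latest GROUP BY 1 ORDER BY n DESC'
).to_df()
```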

And I can easily visualize it using pandas:

A bunch of dog breeds weren’t even in the first batch
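One way to produce a chart like that, as a sketch over the two distributions computed above:

```python
import matplotlib.pyplot as plt

# align the two distributions by class; breeds missing from v1 become 0
merged = dist_v1.merge(dist_latest, on="class", how="outer",
                       suffixes=("_v1", "_latest")).fillna(0)
merged.set_index("class").plot.barh(figsize=(8, 14))
plt.show()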

Diff’ing data

Finding the difference between dataset versions is also easy. Using a left join, I can find the images added in later versions:

Images in the latest version but not in the first version
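For example, again leaning on duckdb over the in-scope DataFrames:

```python
# rows present in the latest version but absent from version 1
added = duckdb.query("""
    SELECT latest.filename
    FROM latest
    LEFT JOIN v1 ON latest.filename = v1.filename
    WHERE v1.filename IS NULL
""").to_df()
print(len(added), "images added since version 1")
```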

Try your own comparisons

Label distribution is just the most basic of comparisons we perform during the model iteration process. Now try computing your own metrics using various python libraries (e.g., sklearn) across different Lance dataset versions. You can use this notebook as a base.

How does it work?

Data versioning

As you add a batch of new records, Lance creates a new version automatically. New data is written to a new Lance file and a new manifest is created:

A peek inside the sausage factory
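You can take the same peek yourself by listing the files under the dataset directory. The exact layout varies across Lance releases; recent ones keep the data files and the per-version manifests in separate subdirectories.

```python
import os

# walk the dataset directory to see data files and version manifests
for root, _, files in os.walk("oxford_pet.lance"):
    for f in sorted(files):
        print(os.path.join(root, f))
```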

Each version’s manifest keeps track of the schema and the Lance files included in that version. For example, if you look at the manifest for version 1, you see that only one file is included:

version 1

And if you look at version 8, you see all 8 files:

version 8

Each of these files corresponds to an Arrow Fragment, so Lance can transparently produce an Arrow table for each version.
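A hedged sketch of inspecting this from Python, assuming the fragment APIs available in recent pylance releases (`get_fragments` / `data_files`):

```python
# count the data files that back each version
for v in (1, 8):
    ds_v = lance.dataset(uri, version=v)
    n_files = sum(len(frag.data_files()) for frag in ds_v.get_fragments())
    print(f"version {v}: {n_files} data file(s)")
```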

Roadmap

Currently Lance supports versioning through appending new rows. The model development workflow needs more than that, and we’re actively working on enriching Lance’s versioning capabilities.

Schema evolution

We often need to augment our dataset with things like model inference results, notes/comments, or metadata from other sources. We’re working on supporting append_column so that adding a column doesn’t require re-writing the dataset.

Updates and deletes

As we fix labeling errors or replace bad images, we’ll also want to correct the data without needing to rewrite the whole dataset. We’re working to support fast cell-level updates in a transparent way.

These new features will be exposed to python / duckdb for a seamless developer experience.

Conclusion

Nothing we demonstrated in this post requires any extra infrastructure or SaaS services. Lance automatically versions your dataset as you refine your training data, and makes every version easy to retrieve and query. Whether you’re using pandas, duckdb, or other analytical tooling, Lance makes it effortless to check data quality and compare model performance across versions.

Lance is open-source and easily installable via pip. You can find it here: https://github.com/eto-ai/lance. We welcome your feedback on GitHub, and you’re welcome to join our Discord.

Special thanks to Lei Xu, who is chiefly responsible for the core Lance architecture and implementation.
