Dataset versioning evaluation methodology for ML tasks

with a Pachyderm evaluation example

Lev Plotkin
Israeli Tech Radar
6 min read · Sep 5, 2024


Why is dataset versioning important for ML?

The success of an ML project relies on proper versioning tools.

According to best practices, versioning should cover all key components: code, data, and models. This ensures experiments can be reproduced, making it easier to evaluate and improve approaches. Without versioning, managing ML workflows becomes inconsistent and productivity slows down significantly.

In this post, we’re focusing on dataset versioning for two main reasons: it’s closely aligned with our domain, as most of our group members are data engineers, and while we may be biased, we strongly believe in the Data-Centric approach.

Without dataset versioning, there is little control over how the data evolves. Consider, for example, a fraud-detection model whose training dataset is updated regularly without any versioning in place. This lack of control makes it hard to:

  • Reproduce past results for audit purposes or when debugging model performance issues.
  • Monitor data drift and evaluate how changing fraud patterns impact model accuracy over time.
  • Pinpoint data errors or inconsistencies introduced during the dataset updates.
  • Maintain consistency across teams when retraining models or developing new versions, leading to confusion and inefficiency.

Let’s consider a dataset versioning approach and evaluate the tools that can support it.

Proposed evaluation methodology:

Define Evaluation Criteria:

Core Requirements:

  • Support for your type of data
  • Integration with other tools in the ML pipeline (platform, infrastructure, storage)
  • Git-like versioning, history, and branching
  • License

Extras:

  • Easy to use and set up
  • Minimal overhead (performance and storage impact)
  • Ability to preview data
  • Collaboration: search/data catalog/access control
  • Diff and compare features
  • Support for cloud/private/hybrid deployment
  • Open-source
  • Available commercial or community support
  • Well-documented
  • Community
  • Additional functionality such as quality tests, preprocessing, and more

Test the Tools:

  • Setup: Install and set up each tool in your environment. Document the process and any challenges encountered.
  • Use Case Simulation: Use a standard dataset and a common ML pipeline to evaluate how each tool handles versioning, integration, and other criteria. This might include versioning datasets, integrating with your existing pipeline, and testing collaboration features.
  • Performance Benchmarking: Measure any performance impacts, including speed, storage, and resource consumption (a rough sketch follows below).
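
As a concrete illustration of the kind of measurements I have in mind, here is a rough shell sketch, using Pachyderm's pachctl as the example tool. The file, bucket, and namespace names are placeholders, and the MinIO mc client is assumed for inspecting the backing object store:

# Speed: time how long it takes to check a large file into the tool under test
time pachctl put file data@master:/train.csv -f train.csv

# Storage: compare the size of the source file with what the backing
# object store actually holds (deduplication should keep this small)
du -sh train.csv
mc du local/pachyderm-bucket   # hypothetical MinIO alias and bucket name

# Resources: watch CPU/memory of the tool's pods while the run is in progress
kubectl top pods               # requires metrics-server; adjust the namespace as needed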

Summary:

  • Analyze Results
  • Set Scores
  • Strengths and Weaknesses
  • Recommendations

In the end, I want to create something similar to this comparison of version-control software but focused on dataset management. This will help make a well-reasoned decision when choosing a dataset versioning tool that fits the specific needs of a project.

Pachyderm evaluation report

Pachyderm is the first dataset versioning tool I picked to start with.

Pachyderm Community Edition

For small teams that prefer to build and support their software.

  • Apache License 2.0
  • Up to 16 Data-Driven Pipelines
  • Parallel Workers Limited to 8
  • No support for Role-Based Access Control (RBAC)
  • No support for pluggable authentication

Pachyderm Enterprise Edition

  • Commercial license
  • Unlimited Data-Driven Pipelines
  • Unlimited Parallel Processing
  • Role Based Access Controls (RBAC)
  • Pluggable Authentication
  • Enterprise Support

Features

  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing are built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Web UI for visualizing running pipelines and exploring data.
  • JupyterLab mount extension.

Integrations

  • Determined: a deep learning platform for training machine learning models
  • Google BigQuery: connector ingests the result of a BigQuery query into Pachyderm
  • JupyterLab: Notebooks are connected directly to your Pachyderm projects, repos, branches, and data
  • Label Studio: a multi-type data labeling and annotation tool with a standardized output format
  • Superb AI: a data labeling platform that supports image, video, text, and audio data
  • Weights and Biases: a tool for tracking and visualizing machine learning experiments

Deployment options:

Setup for Local Deployment

I followed the Local Deploy guide to set up Pachyderm on my laptop. Unfortunately, it did not work with kind, which I already had installed, so I used minikube instead.

See the commands in Makefile at https://github.com/MLL-group/pachyderm-demo-project.
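
Roughly, the minikube-based setup boils down to something like the following. This is only a sketch based on the local-deploy docs; chart values and pachctl flags may differ between Pachyderm versions, so check the Makefile above for the exact commands I used:

# start a local Kubernetes cluster
minikube start

# install Pachyderm from its Helm chart, targeting a local deployment
helm repo add pachyderm https://helm.pachyderm.com
helm repo update
helm install pachyderm pachyderm/pachyderm --set deployTarget=LOCAL

# expose pachd on localhost and verify the client can reach the cluster
pachctl port-forward &
pachctl version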

Since it is Kubernetes-based, it is easy to deploy on any cloud provider or on-premises. However, I suspect that in real-life use cases, setting up Pachyderm in a production environment will take significant effort: Kubernetes storage, volumes, networking, security, monitoring, and so on.

Usage

pachctl is a Git-like CLI tool used to interact with the Pachyderm cluster.
The terminology mirrors Git: branches are called branches, repos are called repos, commits are called commits, and files are called files. Pachyderm calculates content hashes of the data and stores them in the object store; files with the same hash are considered identical and are not stored again.
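
To make the Git analogy concrete, here is a minimal pachctl workflow. The repo and file names are just examples, and exact flags may vary between versions:

# create a repo and commit a file to its master branch
pachctl create repo data
pachctl put file data@master:/iris.csv -f iris.csv

# inspect what has been versioned so far
pachctl list file data@master
pachctl list commit data

# a new put creates a new commit; identical content is deduplicated
pachctl put file data@master:/iris.csv -f iris_v2.csv

# compare the latest commit against its parent
pachctl diff file data@master:/iris.csv data@master^:/iris.csv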

An additional useful feature is the pipeline, which is a DAG of data processing steps. Each step is a container that takes input data from the previous step and produces output data.

Here is a pipeline example from the documentation. It processes the files in the data repo and stores the result in the count output repo.

pipeline:
  name: 'count'
description: 'Count the number of lines in a csv file'
input:
  pfs:
    repo: 'data'
    branch: 'master'
    glob: '/'
transform:
  image: alpine:3.14.0
  cmd: ['/bin/sh']
  stdin: ['wc -l /pfs/data/iris.csv > /pfs/out/line_count.txt']

A nice additional benefit is that the pipeline is triggered automatically when the data in the data repo changes. We can also create triggers on a branch or a pipeline.

For example, a downstream step could create an SQL table from the pipeline's output CSV file.
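
Putting this together, creating the pipeline and triggering it looks roughly like this, assuming the spec above is saved as count.yaml (output formats vary by version):

# register the pipeline from the spec above
pachctl create pipeline -f count.yaml

# committing new data to the input repo automatically starts a new job
pachctl put file data@master:/iris.csv -f iris.csv
pachctl list job

# read the output written by the pipeline to its output repo
pachctl get file count@master:/line_count.txt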

Also, thanks to its Git-inspired nature, we get data lineage and can trace the history of the data processing.

BigQuery support

I was curious how Pachyderm supports BigQuery. I expected it to be able to read data from BigQuery and update the dataset whenever the BigQuery table changes.

But this connector just reads the result of a BigQuery query and stores it as Parquet files, which is a bit disappointing.

There was a plan to support SQL ingest, but it has not been implemented yet (see https://github.com/pachyderm/docs-content/blob/main/.archive/sql-ingest.md).

Conclusion

My first impression of Pachyderm is positive. It is easy to use, leverages Kubernetes, and has useful features; for file-based data processing, it is a good choice. As mentioned, I think it still needs SQL ingest support, ideally implemented in a way that reads data from BigQuery and updates the dataset when the BigQuery table changes.

List of tools to consider for future evaluation:

  • Git LFS: an open-source Git extension for versioning large files
  • Hub: allows you to manage changes made to datasets with Git-like commands
  • lakeFS: an atomic, versioned data lake on top of object storage
  • Quilt: a self-organizing data hub with S3 support
  • Intake: a lightweight set of tools for loading and sharing data in data science projects
  • Dud: a lightweight CLI tool for versioning data alongside source code and building data pipelines
  • Arrikto: dead-simple, ultra-fast storage for the hybrid Kubernetes world
  • Delta Lake: an open-source storage layer that brings reliability to data lakes
  • Iceberg: a high-performance format for huge analytic tables
  • Dolt: a SQL database that you can fork, clone, branch, merge, push, and pull just like a Git repository
  • DVC: management and versioning of datasets and machine learning models
  • Pachyderm: a complete version-controlled data science platform that helps control an end-to-end machine learning life cycle
  • Neptune: lets you version datasets, models, and other files from your local filesystem or any S3-compatible storage
  • Weights & Biases (WandB): the Artifacts product provides dataset and model versioning
  • ClearML Data: data management and versioning
  • DagsHub: versions datasets, data files, and code

About me

I’m a Backend and Machine Learning engineer at Tikal, where we provide consulting services to IT companies.

At Tikal, we prioritize continuous learning through various courses and workshops that our colleagues create and share. It supports our commitment to self-improvement. We also have several interest groups, including a newly launched ML learning group focused on exploring and practicing different machine-learning topics.

In our ML group, we frequently evaluate various ML tools. In this post, I’ll share a template we’ve developed for these evaluations, along with a practical example of how to use it.

While this method is still a work in progress and not yet a standard practice in our group, I believe it provides valuable insights. I’d love to get your feedback to help refine it further.
