Managing Data Versioning in MLOps: An In-depth Analysis of Tools and Practices

Aryan Jadon
Dec 21, 2022 · 8 min read

Data versioning is the practice of tracking and controlling changes to data over time. It entails creating and maintaining multiple versions of a dataset, each representing its state at a particular point in time. In this post, we will cover nine data versioning tools for your MLOps workflow.

Data versioning is crucial for many applications, including machine learning, where it helps ensure that the data used to train models is consistent and of high quality. It lets you keep track of data changes and spot any issues they may introduce.

Keeping Track of Changes, Image By Mary Amato

Data versioning tools typically provide features such as data annotation, data labeling, and data quality assessment to help ensure that the data is fit for machine learning. They may also allow you to compare different versions of the data and revert to a previous version if necessary.

Data versioning also facilitates effective teamwork: everyone can track changes in real time and always knows the current state of the data. It is a great way to streamline workflows, organize data version management, and reduce errors.

How do you choose a data versioning tool?

Choosing Tool, Image By Mary Amato

You should look at the following before selecting a data versioning tool for your workflow:

  1. Data modality: does the tool support your data types, such as video and audio? Can you preview tabular data?
  2. Ease of use: how simple is it to adopt in your workflow? How much extra work does it add?
  3. Diffing: can you compare two versions of a dataset?
  4. Integration: does it fit your stack? Can you quickly connect it to your platform, infrastructure, or model training workflow?

Data Versioning Tools

1. Git LFS

GIT LFS, Image Source: Atlassian Website

Git LFS is an open-source tool that allows you to version large files with Git.
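As a rough illustration, the sketch below scripts the usual Git LFS workflow from Python; the file paths and patterns are hypothetical, and in practice you would typically run these commands directly in a shell.

```python
import subprocess

def run(cmd):
    """Run a shell command and stop if it fails."""
    subprocess.run(cmd, check=True)

# One-time setup: install the Git LFS hooks in the current repository
run(["git", "lfs", "install"])

# Tell Git LFS to manage large dataset files (the pattern is an example)
run(["git", "lfs", "track", "*.csv"])

# Commit the tracking rules and the data exactly as you would with plain Git
run(["git", "add", ".gitattributes", "data/train.csv"])
run(["git", "commit", "-m", "Track training data with Git LFS"])
```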

Use Git LFS when-

  • You want to version large files such as datasets.
  • You want to use the same commands and interface as versioning code with Git.

Be cautious when -

  • Hosted source code platforms may impose file size limits (2 GB on GitHub Free).
  • Cloning a remote repository downloads the training data along with the code, which can be very slow and consume a lot of bandwidth.

2. DVC

DVC Logo, Image Source: DVC GitHub Repo

DVC is an open-source version control system for machine learning projects.
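To give a feel for the Git-like workflow, here is a minimal sketch that uses DVC's Python API to read one specific version of a dataset; the repository URL, file path, and tag are placeholders.

```python
import dvc.api

# Open a dataset exactly as it existed at a given Git revision (tag or commit).
# The repo URL, path, and rev below are hypothetical.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/my-ml-project",
    rev="v1.0",
) as f:
    print(f.readline())  # peek at the first row of that data version
```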

Use DVC when-

  • You want a Git-like workflow for versioning arbitrarily large files, datasets, and models.
  • You want various options for remote data storage (S3, Minio, Google Cloud Storage, Google Drive, Azure Blob Storage, etc.)
  • You want to build pipelines and keep track of your tests.

Be cautious -

  • You need to use a Git repository to use DVC's versioning features.

3. Weights & Biases

Weights & Biases Logo, Image Source: wandb.ai website

Weights & Biases is a tool for tracking and visualizing the pieces of your machine learning pipeline, from datasets to production models.
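Below is a minimal sketch of versioning a dataset as a W&B artifact and pulling it back in a later run; the project, artifact, and file names are placeholders.

```python
import wandb

# Log a dataset file as a versioned artifact (names are examples)
run = wandb.init(project="my-project", job_type="upload-dataset")
artifact = wandb.Artifact("training-data", type="dataset")
artifact.add_file("data/train.csv")
run.log_artifact(artifact)  # W&B assigns a new version (v0, v1, ...) automatically
run.finish()

# A later training run can pin and download a specific artifact version
run = wandb.init(project="my-project", job_type="train")
dataset_dir = run.use_artifact("training-data:latest").download()
run.finish()
```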

Use Weights & Biases when-

  • You need experiment tracking, artifact tracking, and visualization from a lightweight Python library.
  • You want to get started quickly on a hosted platform that is free for personal use.
  • You want built-in integrations with well-known machine learning frameworks, repositories, and tools such as Keras, fastai, XGBoost, Kubeflow Pipelines, and OpenAI Gym.
  • You need project management and collaboration features for machine learning projects.

Be cautious -

  • Only the client is open-source, and using it requires registration.
  • The client handles logging data to Weights & Biases, querying it, and downloading your logged data.
  • The free plan is limited to personal projects.
  • Self-hosted deployment requires an enterprise plan.

4. Neptune

Neptune AI Logo, Image Source: neptune.ai website

Neptune is a flexible metadata store for MLOps that allows teams to log, store, display, organize, compare, and query model-building metadata.
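As a quick sketch, this is one way to record which dataset version a run used with the Neptune client; the project name and file path are placeholders, and the API token is assumed to be set in the NEPTUNE_API_TOKEN environment variable.

```python
import neptune

# Start a run in your Neptune project (the project name is a placeholder)
run = neptune.init_run(project="my-workspace/my-project")

# Track dataset files: Neptune records their hashes so you can later tell
# exactly which data version a given run was trained on
run["datasets/train"].track_files("data/train.csv")

# Log anything else you care about, e.g. hyperparameters
run["parameters"] = {"learning_rate": 0.01, "epochs": 10}

run.stop()
```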

Use Neptune when-

  • You want a compact tool to handle data versioning, model registry, experimentation, and artifact tracking.
  • You want to document and keep track of model testing, production runs, and evaluations.
  • You need a tool that records and shows various metadata formats, from model parameters to audio and video files.
  • You want more than 25 pre-built connectors with well-known Python libraries.

Be cautious -

  • Only the Neptune client is open-source, and using it requires registration.
  • The client handles logging data to the Neptune server and downloading your logged data.
  • The free plan is limited to a single user and 200 monitoring hours per month.
  • An Organization plan is required for self-hosted deployment.

5. Pachyderm

Pachyderm Logo, Image Source: Pachyderm GitHub Repo

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated transformations across any type of data. It provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and lineage tracking.
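Pachyderm pipelines are declared as JSON or YAML specs. The sketch below builds a minimal spec in Python and writes it to a file that could then be submitted with `pachctl create pipeline -f`; the repo, image, and command are hypothetical.

```python
import json

# A minimal Pachyderm pipeline spec (names, image, and command are examples).
# Pachyderm re-runs the pipeline whenever a new commit lands in the input repo
# and versions every output commit, which is what gives you lineage tracking.
pipeline_spec = {
    "pipeline": {"name": "clean-data"},
    "input": {"pfs": {"repo": "raw-data", "glob": "/*"}},
    "transform": {
        "image": "python:3.10",
        "cmd": ["python", "/scripts/clean.py", "/pfs/raw-data", "/pfs/out"],
    },
}

with open("clean-data-pipeline.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)
# Then submit it with: pachctl create pipeline -f clean-data-pipeline.json
```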

Use Pachyderm when-

  • You want a tool that can automate pipelines and manage data versioning.
  • You want to create and configure pipelines and resources in JSON or YAML, independent of the language used inside them.

6. Dolt

Dolt Logo, Image Source: Dolt GitHub Repo

Dolt is a SQL database that you can fork, clone, branch, merge, push, and pull like a Git repository. You can connect to Dolt exactly as you would to any other MySQL database to run queries or modify the data with SQL commands.
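Because Dolt speaks the MySQL wire protocol, you can version data from any MySQL client. Here is a rough sketch using mysql-connector-python against a running `dolt sql-server`; the connection details and table are hypothetical.

```python
import mysql.connector

# Connect to a running `dolt sql-server` exactly like a MySQL database
# (host, credentials, database, and table below are examples)
conn = mysql.connector.connect(
    host="127.0.0.1", port=3306, user="root", database="feature_store"
)
cur = conn.cursor()

# Ordinary SQL writes...
cur.execute("INSERT INTO features (id, value) VALUES (1, 0.42)")

# ...followed by a Git-style commit using Dolt's stored procedures
cur.execute("CALL DOLT_ADD('-A')")
cur.execute("CALL DOLT_COMMIT('-m', 'Add new feature rows')")

conn.commit()
conn.close()
```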

Use Dolt when-

  • You want to enable Git capability for SQL databases.
  • You want to keep track of which version of the database each model training run used.
  • You want each team member to have their own copy of the database.
  • You want a drop-in replacement for MySQL without rewriting application code.
  • You need a solution with native support for Git-style commands.

Be cautious -

  • It is slower than MySQL by 2–20 times.
  • It only supports tabular data.
  • Private databases require a Dolthub Pro membership.

7. FastDS

FastDS Logo, Image Source: FastDS Repo

FastDS is a command line wrapper that combines Git and DVC to automate repetitive tasks.
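As a small illustration, the sketch below scripts the basic commands FastDS is built around; only the core subcommands are shown, and the exact syntax of the commit and push wrappers is best checked in the FastDS README.

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Initialize Git and DVC together in the current project with one command
run(["fds", "init"])

# Show the combined Git + DVC status of the project
run(["fds", "status"])

# Stage files with a single command: FastDS decides whether each file should
# be tracked by Git or by DVC (for large files), so you don't have to
run(["fds", "add", "."])
```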

Use FastDS when-

  • You want to combine Git and DVC commands into a single command.
  • You want to streamline the Git+DVC workflow.
  • You want to avoid errors in repetitive tasks.

Be cautious -

  • Git and DVC installations are required.
  • Not all Git/DVC commands are supported in the initial release.

8. LakeFS

LakeFS Logo, Image Source: LakeFS GitHub Repo

LakeFS converts your object storage into a repository that resembles Git. You can handle your data lake the same way you manage your code.
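lakeFS exposes an S3-compatible gateway, so one common pattern is to keep using your existing S3 tooling while repository and branch names become part of the object path. The sketch below uses boto3 with hypothetical endpoint, repository, and branch names.

```python
import boto3

# Point an S3 client at the lakeFS gateway (endpoint and credentials are placeholders)
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key-id>",
    aws_secret_access_key="<lakefs-secret-access-key>",
)

# The "bucket" is the lakeFS repository and the key is prefixed with a branch name,
# so writes to an experiment branch never touch the data on the main branch
s3.upload_file(
    "data/train.parquet",                    # local file
    "my-datalake",                           # lakeFS repository
    "experiment-1/datasets/train.parquet",   # branch/path inside the repository
)
# Committing and merging the branch is then done via lakectl, the UI, or the lakeFS API
```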

Use LakeFS when-

  • You want to make your object storage into a repository similar to Git.
  • You want to manage your data lake like you manage your code.
  • You need a tool that can work with any kind of data.
  • You want to store metadata in a relational database to avoid duplicating data.
  • You want to isolate data in a Data Lake without copying it.
  • You want to manage the CI/CD of your data.
  • You want a tool that integrates easily with data frameworks such as Airflow, Spark, Kafka, Presto, Delta Lake, Databricks, etc.

Be cautious -

  • There is no simple method for deleting files. Retention policies control the actual file destruction.
  • LakeFS does not support federated identities.
  • Because the underlying object storage sees all connections as coming from the lakeFS gateway user, auditing data usage can be challenging.

9. DagsHub

DagsHub Logo, Image Source: DagsHub Github Repo

DagsHub is a platform where machine learning and data science teams can build, manage, and collaborate on their projects.
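Since DagsHub builds on open-source tools, one way to version data there is to point a DVC remote at your DagsHub repository. The sketch below scripts that from Python; the remote URL pattern, username, and token are placeholders, and the exact URL and authentication settings should be confirmed in DagsHub's documentation.

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Point DVC at a DagsHub-hosted remote (URL, user, and token are placeholders)
run(["dvc", "remote", "add", "origin", "https://dagshub.com/<user>/<repo>.dvc"])
run(["dvc", "remote", "modify", "origin", "--local", "auth", "basic"])
run(["dvc", "remote", "modify", "origin", "--local", "user", "<dagshub-username>"])
run(["dvc", "remote", "modify", "origin", "--local", "password", "<dagshub-token>"])

# Version a dataset with DVC and push it to DagsHub Storage
run(["dvc", "add", "data/train.csv"])
run(["dvc", "push", "-r", "origin"])
```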

Use DagsHub when-

  • You want data storage that is already set up.
  • You require an open-source tool-based data versioning platform.
  • You need integrated tools for managing, tracking, and annotating data.
  • You want tools for seamless collaboration.
  • You need a single UI to control a whole open-source MLOps stack.
  • You want easy integration into an ongoing project.
  • You need a platform that is not language-specific.

Be cautious -

  • Only the DagsHub client is open-source, and using it requires registration.
  • The DagsHub client handles logging data to DagsHub Storage, querying it, and downloading your logged data.
  • The free plan works only with DagsHub Storage.
  • An organization plan is required to connect your own storage.
  • The open-source tools (DVC and Git) must be installed on your local or cloud machine.

Conclusion

In the rapidly evolving landscape of MLOps, managing data versioning stands as a cornerstone practice, pivotal for ensuring reproducibility, accountability, and seamless collaboration. Throughout this analysis, we’ve navigated a variety of tools, each with its strengths, catering to different facets of data versioning challenges.

As we’ve seen, the choice of a versioning tool can significantly impact the efficiency of an MLOps workflow. Whether you lean towards DVC for its Git-centric approach, Pachyderm for its containerized data pipelines, or DagsHub for its integrated open-source platform, the key lies in selecting a tool that aligns with your project’s specific needs, team skills, and the complexities of your workflows.

Implementing robust data versioning practices is not a mere technicality but a strategic advantage. It fortifies the foundation upon which reliable machine learning models are built and provides a scaffold for scalable, future-proof MLOps infrastructure.

As we wrap up this exploration, it’s essential to remember that tools are facilitators; the real substance lies in the practices and principles we adopt. A diligent, thoughtful approach to data versioning is what will ultimately drive MLOps towards excellence.

We encourage you to continue the conversation by sharing your experiences and insights. Which tools have you integrated into your workflow? What lessons have you learned in managing data versioning? The floor is open, and we look forward to your contributions to this ever-important discussion.

Until then, may your data remain consistent, your versions well managed, and your machine learning models ever improving.

Thank you for joining us on this deep dive into data versioning in MLOps. I hope this analysis serves as a valuable resource as you navigate the complex yet rewarding realm of machine learning operations.

References

  1. https://neptune.ai/blog/best-data-version-control-tools
  2. https://www.mymlops.com/
  3. https://github.com/git-lfs/git-lfs
  4. https://github.com/iterative/dvc
  5. https://wandb.ai/site
  6. https://neptune.ai/
  7. https://github.com/pachyderm/pachyderm
  8. https://github.com/dolthub/dolt
  9. https://github.com/dagshub/fds#-aka-fds
  10. https://github.com/treeverse/lakeFS
  11. https://github.com/DagsHub/client
