Dataset Factory: A Tool Chain for Computer Vision Datasets

Jenifer De Figueiredo
DVC — Data Version Control
2 min read · Mar 20, 2024

The rapid proliferation of analytical and generative AI solutions is pushing the requirements for data versioning and data curation to the next level: dataset management tools must understand the data they manage and be able to use metadata for curation. This goal is not achievable with traditional MLOps toolchains, which remain blind to the content of the files they manage. We address this problem by introducing the next generation of Data-Centric AI software: DataChain.

We have been building DataChain for several years now and are happy to share some of the thinking and motivation that went into this product. For example, this paper, written by our Technical Product Manager Daniel Kharitonov and Customer Success Engineer Ryan Turner, was published at ICCV 2023. It explains the challenges of building generative computer vision datasets at scale and the benefits of using a tool like DataChain.

Dataset Factory: A Tool Chain for Generative Computer Vision Datasets
Dataset Factory Poster Presented at ICCV 2023

The following table summarizes the problems faced when tackling massive Computer Vision projects and how our latest tool solves them:

Read the full paper for a more in-depth discussion of the problems and the solution, as well as an example of the dataset factory approach using the LAION-5B dataset. While this paper focuses on a specific Computer Vision use case, the same approach works for all unstructured data workflows, including text, video, audio, GIS, and multimodal data. We would love to talk to you about your use cases and explore how we can help you master your unstructured data workflows.

Learn more at DVC.ai!

Follow DVC.ai on LinkedIn to find more tutorial blog posts, videos, and upcoming online meetups!
