Memory Leak

ML Data Management — A Primer

A machine learning (ML) model’s performance is determined by code and data. To improve an ML model you can write better code, increase testing, or improve the data itself. The ML space is maturing, with more companies pushing models to production than ever before. With this shift, teams are challenged less by how to build and deploy a model and more by how to improve its precision and recall, which often means iterating on the training data. Data has notoriously been a constraint on building great models and has led to the rise of data labeling providers like Scale. As more data is collected and the frequency of ML deployments increases, teams are less concerned about having enough data and more concerned about having the right data. Enter ML data management: the tooling that helps improve ML models by improving datasets.

ML data management solutions help engineers improve data quality and debug data. These solutions provide a GUI to help teams understand, visualize, and curate data for training, uncover corrupted data like mislabeled examples, and identify difficult edge cases. They tie into an ML model and its data to determine which data is helpful, harmful, or useless for training. ML data management essentially takes the data quality practices used for tabular data, and the debugging tools used by programmers, and applies them to ML.

Techniques used to understand the data include embeddings/similarity search, active learning, meta-learning, and reinforcement learning. An embedding is “a low-dimensional vector representation that captures relationships in higher dimensional input data. Distances between embedding vectors capture similarity between different data points.” It’s a particularly powerful technique for understanding relationships in Computer Vision (CV) and Natural Language Processing (NLP) data and for identifying outliers, which could be edge cases or mislabeled assets. Embeddings can originate either from a trained supervised ML model (with labels) or from an unsupervised ML model (without labels). For example, the ML team at Waymo utilizes embeddings from unlabeled data to identify data that reflects model failure cases.
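To make the embedding idea concrete, here is a minimal sketch of similarity-search-based outlier detection: score each data point by its mean distance to its nearest neighbors in embedding space, so isolated points (potential edge cases or mislabeled assets) surface first. This is an illustration only, not any vendor’s implementation, and the toy 2-D vectors are invented — real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def outlier_scores(embeddings, k=3):
    """Score each embedding by its mean cosine distance to its k nearest
    neighbors. Higher scores suggest outliers: candidates for review as
    edge cases or mislabeled examples."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        dists = sorted(
            1.0 - cosine_similarity(embeddings[i], embeddings[j])
            for j in range(n) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy example: three clustered points and one isolated point.
embs = np.array([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.0, 1.0]])
scores = outlier_scores(embs, k=2)
print(int(np.argmax(scores)))  # → 3, the isolated [0.0, 1.0] point
```

Production systems replace this O(n²) loop with approximate nearest-neighbor indexes, but the curation workflow — rank by neighborhood distance, review the top of the list — is the same.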

ML data management solutions can help in numerous ways:

  1. manage which collected (unlabeled) data is stored and which slices become labeled training and test sets;
  2. curate dataset subsets;
  3. help identify and fix incorrect labels;
  4. pinpoint edge cases and encourage more data collection by integrating with labeling services or generating synthetic data;
  5. appropriately balance training data composition to enhance model performance;
  6. analyze how different datasets affect the performance of a model and how different models perform with the same training data; and
  7. retrain a better model that’s production-ready.
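As a sketch of item 3 (identifying incorrect labels), one simple heuristic is to flag examples where a trained model confidently disagrees with the annotated label. This is a toy stand-in for the more principled techniques these platforms use (e.g., confident learning); the function name, threshold, and data below are all illustrative.

```python
import numpy as np

def flag_suspect_labels(pred_probs, labels, threshold=0.9):
    """Flag indices where the model confidently predicts a class that
    disagrees with the annotated label -- candidates for relabeling review.

    pred_probs: (n_examples, n_classes) array of predicted probabilities.
    labels: (n_examples,) array of integer labels as annotated.
    """
    preds = pred_probs.argmax(axis=1)        # model's predicted class
    confidence = pred_probs.max(axis=1)      # confidence in that class
    return [
        i for i in range(len(labels))
        if preds[i] != labels[i] and confidence[i] >= threshold
    ]

# Toy example: example 1 is annotated class 0, but the model
# predicts class 1 with 95% confidence -- worth a second look.
probs = np.array([[0.8, 0.2], [0.05, 0.95], [0.6, 0.4]])
labels = np.array([0, 0, 1])
print(flag_suspect_labels(probs, labels))  # → [1]
```

The flagged indices would then feed a review queue in the management tool, closing the loop between model output and dataset fixes.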

These platforms are collaborative, so team members can discuss datasets, share reports with stakeholders, and compare models’ performance across different datasets. We’ve heard that ML data management tools are often used daily, demonstrating their value to individuals and teams. Buyers we spoke with thought about the return on investment in multiple ways: 1) realizing the value of existing datasets; 2) decreasing annotation spend; 3) dataset debugging; and 4) ML tuning and performance enhancements.

We’ve come across businesses that have built ML data management solutions in house, including Waymo, Cruise, Tesla, and Nuro, among others. Third-party solutions have emerged as well, and we’ve highlighted several below, including Aquarium, Unbox, Scale’s Nucleus, Labelbox’s “model diagnostics,” Alectio, Clarifai, LatticeFlow, Dataloop, SuperAnnotate, SiaSearch, Activeloop, and Voxel51’s FiftyOne. Some startups like Aquarium are well known for helping with CV, while others like Unbox have an initial focus on NLP. When it comes to security, it is important to understand the depth of information a vendor has access to, such as visibility into different layers of a neural net and whether it ingests the raw data vs. metadata.

We are excited about startups that take an API-first approach that is easy and fast to implement. Vendors that can support multiple data types (e.g., image, text, video, 3D sensor fusion, documents, tabular) have an opportunity to move across teams through a land-and-expand motion. ML data management (a.k.a. ML DataOps) startups can also expand into multiple adjacencies like data labeling, label split balancing, embeddings-as-a-service, model monitoring, and model robustness verification.

This market is incredibly early, so we are excited to watch as the ecosystem evolves. If you or someone you know is working on an ML data management startup or adjacent offering, it would be great to hear from you. Comment below or email me to let us know.
