Data Management and Data Labeling: It takes two to tango

Yaser Khalighi
SceneBox
Jun 3, 2022

TL;DR: There is an inherent conflict of interest between the business model of a data management provider (data curation and triage) and that of a data labeling provider (ground truth) for the development of machine learning (ML) models. For this reason, ML teams are better off relying on separate technology partners for data labeling and data management.

As a technology company building a data platform for a field as young and fast-evolving as computer vision (CV), we are constantly reaching out to machine learning teams to see if our product solves their problems. However, before we are able to speak with a prospective customer, we are often mistaken for one of the many labeling companies vying for their business. Even when we talk to prospective customers, one of the most common questions we receive is about labeling and whether we plan to go in that direction. Labeling is known to be the armpit of machine learning. It is still an open multi-billion dollar problem and it will be around for the foreseeable future. Because of that, many great teams are working on it, but we consciously made a decision to NOT be a labeling company. In fact, we believe this would create a conflict of interest.

Before I started SceneBox, my team was tasked with the problem of providing the right datasets to train/improve computer vision models for self-driving cars. After speaking with a few computer vision experts who were doing similar work in different verticals (medical imaging, last-mile delivery, satellite imagery analytics, cashier-less checkout, etc.), I began to understand that providing the right data for model training is a universal problem in deep learning-based perception engineering.

As I began digging into the challenge of providing the right data, I uncovered three components:

  1. Data Curation — Finding the most effective raw datasets to be labeled
  2. Data Labeling — Labeling the raw datasets
  3. Data Triage — Ensuring the quality of labels is sufficient

We decided to focus our efforts on Data Management, which is the combination of Data Curation and Data Triage. Data management was, and continues to be, our bread and butter. Unlike data labeling, which is a crowded field, data management has very few teams focused on it, and we are one of them. We consciously avoided building a data labeling platform for the reasons we will explore below. As more and more data labeling companies start building their own version of a data management platform, we are beginning to learn from our customers that there are conflicting interests when it comes to data management and data labeling. We address exactly this conflict below and make the argument for using separate providers for data management and data labeling on your perception-based ML journey.

You don’t want your dentist selling you candy.

There is an inherent conflict of interest between data labeling and data management. One of the purposes of a data management platform is to optimize the amount of data that is sent to be labeled — Data Curation. In actuality, most teams without a data management platform struggle to find the right data to label, which results in more labeled data than necessary. This is great for data labeling companies’ bottom line, but quickly and inefficiently burns through the ML team’s resources, as redundant data is labeled for training.
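To make the curation idea concrete, here is a minimal sketch of one way to trim redundant data before it goes to a labeler. It is not how any particular platform does it. It assumes each raw frame already has a non-zero embedding vector (for example from a pretrained vision model), and the function name `select_for_labeling` and the 0.95 threshold are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_for_labeling(embeddings, max_similarity=0.95):
    """Greedily keep a frame only if it is not a near-duplicate of a kept frame.

    `max_similarity` is a hypothetical threshold; a real team would tune it.
    Returns the indices of the frames worth sending to a labeler.
    """
    selected = []
    for idx, emb in enumerate(embeddings):
        is_redundant = any(
            cosine_similarity(emb, embeddings[kept]) > max_similarity
            for kept in selected
        )
        if not is_redundant:
            selected.append(idx)
    return selected


# Toy embeddings: frames 0 and 1 are near-duplicates, as are frames 2 and 3.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
picked = select_for_labeling(frames)
print(picked)
```

In this toy case only half of the frames would be sent out for labeling, which is exactly the kind of saving that works against a labeling vendor's revenue.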

Another purpose of a data management platform is Data Triage: finding label noise and other imperfections in the labels. While there are some fantastic labeling tools and workforces out there, label noise is inevitable. Again, this is not something that a labeling company that prides itself on providing quality ground truth wants to bring to its customers' attention.
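As a rough illustration of triage, one common heuristic is to flag samples where a trained model confidently disagrees with the delivered label and send only those for human review. The sketch below assumes hypothetical records with `id`, `label`, `predicted`, and `confidence` fields, and the 0.9 cutoff is an arbitrary choice. A flagged sample is only a candidate for review, since the model can be the one that is wrong.

```python
def flag_suspect_labels(records, min_confidence=0.9):
    """Return ids of records where the model confidently disagrees with the label.

    `records` is a list of dicts with hypothetical keys:
    "id", "label" (from the labeler), "predicted" and "confidence" (from a model).
    """
    suspects = []
    for rec in records:
        disagrees = rec["predicted"] != rec["label"]
        if disagrees and rec["confidence"] >= min_confidence:
            suspects.append(rec["id"])
    return suspects


# Toy data: only frame_b is both a disagreement and a confident prediction.
records = [
    {"id": "frame_a", "label": "car", "predicted": "car", "confidence": 0.98},
    {"id": "frame_b", "label": "car", "predicted": "truck", "confidence": 0.95},
    {"id": "frame_c", "label": "bike", "predicted": "person", "confidence": 0.55},
]
suspects = flag_suspect_labels(records)
print(suspects)
```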

The problem here is that labeling companies are now marketing their own data management platforms to their existing labeling customers. Because of the conflict between data discovery/triage and data labeling explored above, I would be hesitant to leap for a data management solution provided by a labeling company — particularly when labeling is the more profitable business model for them. You would understandably be skeptical of an ad-blocker made by Facebook or Google. A similar argument applies here.

You want your data managed by an enabler, not a monopolizer.

Chances are, in a perfect world, you and your ML team would use more than one labeler or a different source of ground truth (such as synthetic data). This is because data types and labeling requirements change, and different labelers specialize in certain areas. Sometimes, you might find it more effective to do labeling in-house with SMEs. Other times, outsourcing is best. Sometimes for privacy reasons, you are limited to labeling workforces in certain regions. Maybe you want to try auto-labeling. Or perhaps you want to use a different solution and/or workforce for semantic segmentation from the one you use for bounding boxes, and another for point clouds.

However, if a labeling company is providing your data management solution, it is very unlikely that they will make it easy to integrate with another labeler or source of ground truth. You are essentially locked into their ecosystem with a sole labeling vendor.

The point here is that labeling is complex, and computer vision teams w̶a̶n̶t̶ ̶n̶e̶e̶d̶ deserve flexibility when it comes to labeling vendors.

Building a labeling platform is hard. So is data management.

It is extremely difficult for a company to provide fantastic tools for both data labeling and data management. Why?

Labeling is hard. When building labeling solutions, you are dealing with the human element: managing large teams and complex workflows/tools is really challenging. There are some fantastic teams out there providing world-class labeling solutions, such as Scale, Labelbox, Dataloop, Segments.ai, SageMaker, Deepen, and SuperAnnotate, just to name a few. These are all great teams providing very elegant solutions to incredibly difficult problems. With these sharp minds pressing forward with a focus on the frontier of labeling tools, workflows, and overall quality, we continue our collective drive towards an automated world.

Having said that, these solutions are far from complete and ML practitioners are still asking for better labeling options.

Data management is hard. Data management is another beast of its own. First, a data management platform should manage all the data in its entirety. For example, a self-driving car company usually generates TBs of data per day. That means the data management platform should easily scale and manage PBs of completely unstructured data. That is not an easy feat!
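A quick back-of-the-envelope calculation shows how TBs per day turn into PBs. The 4 TB/day figure below is a made-up example, not a measurement from any specific company.

```python
TB_PER_DAY = 4      # hypothetical daily volume for one fleet
DAYS_PER_YEAR = 365
TB_PER_PB = 1000    # decimal units: 1 PB = 1000 TB

total_tb = TB_PER_DAY * DAYS_PER_YEAR
total_pb = total_tb / TB_PER_PB
print(f"{total_tb} TB per year is about {total_pb:.2f} PB")
```

Under that assumption, a single year of collection already crosses the petabyte mark, before counting any derived data.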

Another problem is data modality: perception data comes in various modalities. A data platform needs to provide a comprehensive suite of modules to index and search the data across all modalities. Your dataset may come from a robot or vehicle with other temporal data, or it may have some other metadata, or it could be completely raw, in which case you need a machine learning model to understand and categorize it. A data management platform should be able to make all data across all modalities completely usable and discoverable.

Last but not least is integration. Teams often have their own way of generating and storing their piles and piles of raw data. Due to the size of these datasets, it is often impossible to re-ingest the data into a new data platform. As a result, the data management solution should be an overlay platform, meaning it should sit on top of the existing data in its original format/structure rather than restructuring it.
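To illustrate what "overlay" means in practice, here is a minimal sketch that catalogs files where they already live and stores only pointers and metadata, without copying or moving anything. The function `build_overlay_index`, the suffix-to-modality table, and the local temporary directory standing in for a team's existing storage are all assumptions made for this example. A real platform would typically point at object storage and inspect content rather than trust file extensions.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from file suffix to modality, for illustration only.
MODALITY_BY_SUFFIX = {
    ".jpg": "image",
    ".png": "image",
    ".mp4": "video",
    ".pcd": "point_cloud",
}


def build_overlay_index(root):
    """Walk `root` and record a pointer plus metadata for every file, in place."""
    root = Path(root)
    index = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            index.append({
                "uri": str(path),  # pointer to the original location
                "relative_path": path.relative_to(root).as_posix(),
                "modality": MODALITY_BY_SUFFIX.get(path.suffix.lower(), "unknown"),
                "size_bytes": path.stat().st_size,
            })
    return index


# Demo: a temporary directory plays the role of the team's existing raw storage.
with tempfile.TemporaryDirectory() as raw_root:
    drive = Path(raw_root) / "drive_001"
    drive.mkdir()
    (drive / "cam_front.jpg").write_bytes(b"fake-image")
    (drive / "lidar.pcd").write_bytes(b"fake-points")

    index = build_overlay_index(raw_root)
    # The original files are untouched and still in their original layout.
    files_after = sorted(p.name for p in drive.iterdir())

for entry in index:
    print(entry["relative_path"], entry["modality"], entry["size_bytes"])
```

The index is small even when the underlying data is not, which is what lets the platform sit on top of petabytes without re-ingesting them.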

Specialization is imperative for progress.

Keeping in mind how monumental both of these challenges are, one company building both a data labeling and data management solution appears infeasible. Rather, the most productive way forward is integration across different solutions and services, forming the optimal solution.

As a data management company, our mission is to be laser-focused on our efforts to build the best products for our customers. Many labeling companies, like data management companies, feel the same way.

This is why we decided not to do labeling, and to support other companies with laser-focused product visions. Rather than build labeling tools ourselves, we make it extremely simple to interface with many different best-in-class labeling tools/workforces.

Both labeling and data management companies still have lots of work to do, so the less jumping around we do, and the more specialized we become, the faster we will get to our ultimate destination.

I would love to hear your thoughts on this article. Our team at SceneBox is always open to conversations. Please reach out to me on LinkedIn if you would like to take a deeper dive with me.

Special thanks to Arman Mazhari.
