Six Strategies for Better Data Annotation

Labelbox
Apr 27, 2021

Training data platforms, or TDPs, enable teams to create and manage ML training data as quickly as possible. But there are common challenges faced by everyone involved in data annotation. We've collected six key strategies that we've observed our customers using to get the most out of a TDP.

  • AUTOMATE: Model-assisted labeling can speed up your annotation process by pre-labeling data that a human labeler only needs to confirm or correct. The Labelbox platform assigns each pre-label a confidence score, reflecting the probability that the model has labeled the data correctly. Low-confidence pre-labels require more human intervention, more time to get right, and therefore cost more. Labelbox allows users to sort pre-labeled data by that confidence score, putting the lowest-confidence pre-labels at the top of the pile so labelers can start working on them right away (see the confidence-sort sketch after this list). Those labeled assets can then be fed back into the model so it pre-labels other edge cases better, creating a positive feedback loop. This iterative cycle increases overall model accuracy. It also saves time: using Labelbox's queue management systems together with model-assisted labeling has cut labeling time by as much as a third for some users.
  • TEST: Don't rely on intuition. Try out different strategies using A/B testing. Labelbox lets teams run multiple experiments by quickly creating multiple projects within the app. We recommend teams use Labelbox to spin up projects and see which sets of training data are most effective for their ML model (a simple timing comparison is sketched after this list). One customer discovered that certain combinations of pre-labeling and process changes that intuitively should have saved time actually increased labeling time. Pick your control scenario carefully, making sure it is representative of the data in your experimental scenarios. Otherwise, you may end up believing you've found a better strategy when in fact it is worse.
  • EDUCATE: Don't limit education to subject experts and assume that the knowledge will trickle down to labelers. Give your labelers the same level of education as experts on exactly what you need. With Labelbox, tutorial videos or written instructions can be embedded in the platform. Labelbox also includes a feature called "Issues and Comments," which allows data scientists and ML engineers to provide feedback on labels for labelers to correct. Regularly providing labelers with reports of the most common mistakes can be helpful, because individual labelers may miss patterns that only become visible in aggregate. Ask labelers what will help them do a better job; they are your best source of information on how to improve your labeling operations. What you find might surprise you: one company we work with discovered that many labelers were having trouble because they were using low-resolution monitors. That was cheap and easy to fix and resulted in an immediate improvement. Convene key experts, data scientists, and labeler representatives for a brainstorming session at least once a quarter. Targets drift and projects change; such sessions can help realign strategies, surface new platform functionality, share tips heard from the industry, and generate ways to test new ideas.
  • BENCHMARK: Create a set of gold-standard labels against which all annotations are measured. Labelbox's benchmark tool allows teams to do this, and the platform then mathematically compares labelers' work with the gold standard. A label with a 50% benchmark score matches the gold standard only halfway. This shows which labels are going to make the model most performant and guides the labeling team to create more gold-standard examples to improve the model (an agreement-score sketch follows this list).
  • AGREE: Accuracy isn't the only measure of quality labeling. Consistency, the degree to which labelers' annotations agree with one another, is also key. But without automation, tracking consistency takes a lot of time and effort. With Labelbox, consistency is measured algorithmically as the average agreement between labelers (see the consensus sketch after this list). When consensus is set to 3, for example, every label is grouped with two similar labels and the average agreement of the three becomes that label's consensus score. Users can filter for specific consensus scores or set a threshold score, below which labels are excluded from the training data. Users can decide whether to rework labels that fall below the threshold or investigate when there is a fundamental disagreement among labelers. If the decision is to rework a label, its consensus score can be recalculated until it reaches the threshold needed to go into the training data set.
  • MONITOR: Use a dashboard to monitor the performance of individual labelers or teams, such as average time per label. The best dashboards show you all the metrics tied to individual, team, and project performance; if a labeling solution doesn't have that transparency, find one that does. Look for patterns in where your labelers make mistakes. You can learn a lot by plotting where bad labels appear in an image, for example, or at what point in a labeler's shift most errors are made (a simple error-heatmap sketch follows this list). One customer found that most mistakes were in the bottom right-hand corner of images. Presumably, labelers scanning from top to bottom and left to right were most fatigued by the time they reached the bottom right-hand corner of an image.
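
To make the first strategy concrete, here is a minimal Python sketch of the confidence-sort idea: route the lowest-confidence pre-labels to humans first. The record structure, field names, and 0.5 threshold are illustrative assumptions, not Labelbox API calls; in practice the platform handles this queueing for you.

```python
# Minimal sketch: send low-confidence pre-labels to human review first.
# The record structure and threshold are hypothetical, not the Labelbox API.
pre_labels = [
    {"asset_id": "img_001", "label": "car", "confidence": 0.97},
    {"asset_id": "img_002", "label": "pedestrian", "confidence": 0.41},
    {"asset_id": "img_003", "label": "cyclist", "confidence": 0.73},
]

# Lowest-confidence items go to the top of the labeling queue.
review_queue = sorted(pre_labels, key=lambda p: p["confidence"])

REVIEW_THRESHOLD = 0.5  # below this, a human should correct rather than just confirm
for item in review_queue:
    action = "correct" if item["confidence"] < REVIEW_THRESHOLD else "confirm"
    print(f'{item["asset_id"]}: {item["label"]} ({item["confidence"]:.2f}) -> {action}')
```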
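For the A/B test in the second strategy, the comparison itself is simple once you export per-label timing from each project. This sketch assumes two made-up lists of seconds-per-label and uses a standard two-sample t-test from SciPy; it is a generic illustration, not a Labelbox feature.

```python
# Sketch: compare labeling time between a control project (A) and an
# experimental project (B). The timing data is invented for illustration.
from statistics import mean
from scipy.stats import ttest_ind

seconds_per_label_a = [42, 38, 51, 47, 40, 45, 39, 48]  # control workflow
seconds_per_label_b = [35, 33, 44, 36, 31, 38, 34, 37]  # pre-labeling + process change

t_stat, p_value = ttest_ind(seconds_per_label_a, seconds_per_label_b, equal_var=False)
print(f"A: {mean(seconds_per_label_a):.1f}s  B: {mean(seconds_per_label_b):.1f}s  p={p_value:.3f}")
# Only adopt the change if the difference is both meaningful and statistically solid.
```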
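One common way to compute a benchmark-style score for bounding boxes is intersection over union (IoU) against the gold-standard box. The sketch below shows that generic idea; it is not necessarily the exact formula Labelbox applies internally, and the boxes are made up.

```python
# Sketch: score a labeler's bounding box against a gold-standard box with IoU.
# Boxes are (x_min, y_min, x_max, y_max); IoU is a generic choice of metric.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

gold = (100, 100, 200, 200)
submitted = (120, 110, 210, 205)
print(f"benchmark score: {iou(gold, submitted):.0%}")  # ~63% agreement with the gold standard
```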
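The consensus mechanic in the fifth strategy can be pictured the same way: average the agreement among the labels grouped together for one asset. The agreement values and 0.8 threshold below are hypothetical placeholders for whatever similarity measure fits your label type.

```python
# Sketch: average pairwise agreement among three labelers for one asset
# (consensus set to 3). The agreement scores are invented; in practice they
# would come from a similarity measure appropriate to the label type.
from statistics import mean

pairwise_agreement = {
    ("labeler_1", "labeler_2"): 0.91,
    ("labeler_1", "labeler_3"): 0.62,
    ("labeler_2", "labeler_3"): 0.58,
}

consensus_score = mean(pairwise_agreement.values())
CONSENSUS_THRESHOLD = 0.8  # below this, rework the label or investigate the disagreement

print(f"consensus: {consensus_score:.2f}")
if consensus_score < CONSENSUS_THRESHOLD:
    print("flag for rework / review")
```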
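Finally, the error-location pattern from the monitoring strategy (mistakes clustering in one corner) is easy to check once you export error coordinates: bin them into a coarse grid and look at the counts. The coordinates here are fabricated for illustration; they would normally come from your QA review export.

```python
# Sketch: bin the (x, y) centers of mislabeled regions into a coarse grid to
# see where in the image errors cluster. Coordinates are made up.
import numpy as np

error_xy = np.array([
    [0.82, 0.88], [0.91, 0.79], [0.76, 0.93], [0.15, 0.20],
    [0.88, 0.84], [0.45, 0.51], [0.93, 0.90], [0.80, 0.95],
])  # normalized image coordinates (0 = left/top, 1 = right/bottom)

heatmap, _, _ = np.histogram2d(
    error_xy[:, 1], error_xy[:, 0], bins=3, range=[[0, 1], [0, 1]]
)
print(heatmap)  # rows = top->bottom, columns = left->right; large counts = error hotspots
```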

Committing to a labeling platform can be costly if you pick the wrong one, so try out the free versions of TDPs to see how they work first. If a TDP doesn’t offer a free trial, or doesn’t include the features mentioned above, think again.

Manu Sharma is an aerospace engineer and co-founder of Labelbox, a training data platform for deep learning systems.