Overcoming Data Annotation Challenges: Strategies for Ensuring Quality, Efficiency, and Fairness in Training Datasets

Ganesh Balakrishnan
Published in GumGum Tech Blog
10 min read · Feb 29, 2024

At the heart of deep learning lies the need for vast amounts of data. This is where human data annotation comes into play, with annotators labeling or tagging data (images, text, or audio) to create training datasets. These annotated datasets play a crucial role by teaching AI models to recognize patterns, understand language, make predictions, and more, effectively translating the complexity of the real world into a form that machines can understand.

However, data annotation inherently presents several challenges. Annotating large datasets demands considerable human effort, which translates to high financial and temporal costs; this can pose significant problems for projects with limited budgets or stringent timelines. Furthermore, maintaining quality and consistency across annotations, especially when multiple annotators are involved, is difficult. Variability in human judgment can introduce inconsistencies that degrade AI model performance. This necessitates rigorous training and continuous monitoring of annotators, further complicating the process and escalating costs. More importantly, human annotators, with their inherent biases and perspectives, can inadvertently introduce those biases into AI systems, leading to skewed or unfair models. Finally, data annotation is often an iterative process in which the instructions and the taxonomy evolve over time, which means annotators have to be constantly re-trained and data constantly re-annotated.

At GumGum we solve some niche problems wrapped in a lot of ad-tech jargon, so let us instead consider a more everyday scenario: imagine that you are a Computer Vision scientist working for a dog adoption agency that is responsible for verifying the authenticity of listing images. You are in the process of collecting data, and over time the requirements for these algorithms keep changing due to evolving market needs, regulatory shifts, and user feedback.

This post outlines how you can use Large Multi-Modal Models to build tools that help overcome some of the data-related issues mentioned above. In summary, the three main areas covered are:

  • biases in data collection,
  • re-organizing data, and
  • converting hard annotations to easier ones.

1. Biases in Data Collection

The idea that different biases can cancel each other out in the context of data annotation is theoretically appealing but practically complex and often wrong. Bias in AI systems generally refers to systematic errors or unfair discriminations in the data or algorithms that lead to skewed outcomes. In theory, if biases were purely statistical and directly opposed, mixing them might lead to a form of cancellation, reducing the overall bias in the system. However, biases in AI and human systems are rarely so straightforward or uniformly distributed. They can be multidimensional, interrelated, and context-dependent, making any such cancellation unreliable in practice.

For most problems, biases arise from a lack of real-world counterfactual data (quite the oxymoron, right?). As H. Yang et al. put it, "Counterfactual reasoning can be understood as a kind of thinking mechanism to discover the facts that contradict existing facts and could potentially alter the outcomes of a decision-making process."

Now, let’s go back to our scenario. As the dog CV scientist, you are told to develop an algorithm capable of taking down fake listing photographs, i.e., anything that does not look like a dog. The goal of the algorithm is binary classification: does the image contain a dog or not? Counterfactual reasoning is being able to think about how the model would perform if it were presented with an image of a wolf, coyote, dingo, jackal, etc. Someone who grew up in India (like me) may be familiar with a wolf or a jackal but not a dingo or a coyote. These are biases that might creep into the model due to a failure to collect images of coyotes, dingoes, and other dog lookalikes.

With recent advancements [1][2][3][4], LLMs appear to possess a broader knowledge base than any individual human due to their architecture and the extensive datasets they have been trained on, encompassing a wide array of topics from sources across the internet. This allows them to draw upon a vast pool of information, ranging from scientific articles to literature and general knowledge, with coverage that exceeds the capacity of any single person’s experience or memory. Moreover, their training spans multiple disciplines, granting a multidisciplinary perspective that is hard to match for individuals, who often specialize in specific fields. We leveraged GPT-3 (which obviously has its own biases) to come up with a list of visual cues that could be mistaken for the class we are trying to classify. Here are some examples GPT-3 provided when prompted with “what are some animals that might be visually mistaken for a dog?” (terrible prompt, I know); a sketch of this kind of prompt follows the list below:

  1. Coyotes: In urban or suburban settings where coyotes might venture close to human habitats, they can be mistaken for medium-sized dog breeds, especially at a distance or in low light. Coyotes have a similar body structure to some dogs but are generally leaner and have a wilder appearance.
  2. Wolves: Wolves and large dogs, particularly those breeds that retain a lot of their ancestral wolf-like features (such as Huskies or Malamutes), can be difficult to distinguish from one another, especially for individuals not familiar with the subtle differences between them.
  3. Foxes: Some smaller fox species, like the red fox, share a similar body size and fur texture with small dog breeds. From afar, or especially if the fox is not displaying its characteristic bushy tail clearly, one might briefly mistake it for a small dog.
  4. Dingoes: Native to Australia, dingoes often resemble medium-sized domestic dogs in size and stature. Their sandy to red coat color can also be found in domestic dogs, leading to confusion.

We convert these into image search queries and scrape the internet for images. Anyone who has used an image search engine knows that results get decreasingly relevant to the query as the index increases: the 50th image returned for a query is less likely to be relevant than the 5th. So we create variants of each search query and scrape a limited number of images for each. We then pass these images to our model to perform counterfactual analysis: if the model classifies an image as positive for the class for which we collected counterfactuals (in this case, a dog), we keep it as a probable hard negative. We then send this “probable hard negative set” to another model (such as CLIP or a Visual Question Answering model) to verify that the data actually contains the hard negative we meant to scrape (wolf, dingo, etc.). For instance, if we scraped 20 images for “dingo” and 15 of them were classified as a dog by our model, we would send those 15 images to another algorithm to verify that each one contains a dingo. Whatever was verified to contain a dingo would be added to our algorithm’s training data as a “NOT DOG” datapoint.
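Putting the loop together, here is a rough sketch of the hard-negative mining process. `make_query_variants`, `scrape_images`, `dog_classifier`, and `contains_animal` are hypothetical stand-ins for the query generator, the image scraper, the production binary classifier, and the CLIP/VQA verification step; the 0.5 decision threshold is likewise illustrative.

```python
# Sketch of the counterfactual hard-negative mining loop described above.
# All helper functions are hypothetical stand-ins; thresholds are illustrative.
from typing import List

def mine_hard_negatives(confuser: str, max_images: int = 20) -> List[str]:
    """Collect verified hard negatives (e.g. dingo photos) for the NOT-DOG class."""
    hard_negatives = []
    for query in make_query_variants(confuser):       # e.g. "dingo in the wild", "dingo close up"
        for image_path in scrape_images(query, limit=max_images):
            # Counterfactual check: does our current model wrongly call this a dog?
            if dog_classifier(image_path) > 0.5:
                # Verification: does a second model (CLIP / VQA) agree it shows the confuser?
                if contains_animal(image_path, animal=confuser):
                    hard_negatives.append(image_path)  # keep as a NOT-DOG training point
    return hard_negatives

dingo_negatives = mine_hard_negatives("dingo")
```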

[Flowchart: how we collect counterfactual training data for the cases our model is currently incapable of dealing with.]

2. Re-organizing data

In this section we talk about how we use these multi-modal models to re-organize data. The need for this can stem from any of the following reasons (among others):

  • Changing product requirements — Adding / removing classes

Following a recent hurricane in Florida, there was a significant increase in the number of stray cats requiring homes or medical care. This led to a notable rise in cat adoption listings across the internet. Motivated by a strong desire to help these cats find homes, your product team is keen on adapting your algorithm to support cat identifications. Merely gathering cat images isn’t sufficient, given that your existing dataset includes various non-dog animals, cats included, which necessitates data reclassification or transfer to a new “CAT”-egory.

Here is the framework we used with CLIP to clean data. We perform zero-shot classification with CLIP, which is trained to minimize the distance between the text embedding of an image caption (“this image contains a cat”) and the visual embedding of the corresponding image. A high confidence threshold (70%) was set for identifying cats, while a much lower threshold (5%) was applied to categories of lesser interest (birds and lions in this case). If the model has even marginal confidence (above 5%) that an image might depict a lion, that data point is discarded. However, if the model is, say, 75% certain the subject is a cat and under 5% certain it’s a lion, we keep the data point.
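As a rough sketch of this filtering step, the snippet below runs zero-shot classification with the public CLIP checkpoint from Hugging Face `transformers`. The captions and the 70% / 5% thresholds mirror the example above; the checkpoint choice and prompt phrasing are assumptions rather than our exact production setup.

```python
# Zero-shot cleanup with CLIP: keep an image only if it is confidently a cat
# and barely plausible as any distractor class. Checkpoint and prompts are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "this image contains a cat",
    "this image contains a bird",
    "this image contains a lion",
]
CAT_THRESHOLD, REJECT_THRESHOLD = 0.70, 0.05

def keep_as_cat(image: Image.Image) -> bool:
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    cat_p, bird_p, lion_p = probs.tolist()
    # Keep only if CLIP is confident it's a cat AND barely entertains the distractors.
    return cat_p >= CAT_THRESHOLD and bird_p < REJECT_THRESHOLD and lion_p < REJECT_THRESHOLD

print(keep_as_cat(Image.open("candidate.jpg").convert("RGB")))  # hypothetical input file
```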

This strategy enabled us to efficiently refine a substantial amount of data. Although some data was discarded, the primary goal is not to maximize data volume but to ensure its high quality. By leveraging both the techniques described here and previously, we successfully amassed a large collection of high-quality data.

  • Incorrect / Unclear annotation guidelines

Having enhanced your models to accommodate both cat and dog adoption services, your focus now shifts to something else. Initially, due to the rarity of huskies in adoption listings, you didn’t notice a significant problem. However, a few months in, you begin receiving complaints from users about their husky adoption posts being mistakenly removed by the website. Upon investigation, you discover that the cause is your model’s failure to recognize huskies as dogs. This misclassification traces back to an error during the wolf data labeling process, where many husky images were incorrectly tagged as non-dog. You realize that the same methodology applied for cats can be adapted to rectify the husky data, ensuring that they are accurately categorized under the “DOG” class.

At GumGum, we primarily use these techniques to collect data for our moderation APIs, where threat taxonomies evolve rapidly over time. We need to address these changes quickly, and there is a lack of open-source data for the use-cases we want to solve. These home-grown techniques have helped us address such issues.

3. Converting hard annotations to easier ones

Several months later, feedback from users highlights an issue: numerous listings feature images where the dog appears blurry, occupies only a minor portion of the composition, or is poorly centered. To enhance the user experience, identifying listings with such images and prompting the uploader to provide better photos of the animal intended for adoption is paramount. This task necessitates segmentation or detection, both of which require labor-intensive and time-consuming data labeling.

Our trials with various algorithms for zero-shot detection/segmentation revealed that CLIPSeg was the most effective. CLIPSeg stands out for its ability to create image segmentations from either textual prompts or image inputs. It excels with text prompts when the target object is within CLIP’s training dataset or features readable text (e.g., searching for nameplates). Image prompts are particularly useful for categories with low within-class variance or outside CLIP’s training data. The model also demonstrates robust performance with generalized prompts, such as “this image contains something to sit on.”
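For reference, here is a minimal sketch of text-prompted segmentation with CLIPSeg via Hugging Face `transformers`. The checkpoint name is the publicly released one; the prompt, file name, and downstream resizing details are illustrative assumptions.

```python
# Minimal CLIPSeg sketch: segment "a dog" in a listing photo from a text prompt.
# Checkpoint is the public CIDAS release; prompt and file name are illustrative.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("listing_photo.jpg").convert("RGB")  # hypothetical input image
prompts = ["a dog"]

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # low-resolution heatmap(s) for the prompt(s)

prob_map = torch.sigmoid(logits).squeeze()  # probabilities in [0, 1]
# Resize prob_map to the original image resolution before the morphology step below.
```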

To minimize confusion, we first confirm the presence of the target object in the image. Given our prior data labeling for classification or the targeted scraping, we can usually ascertain the target object’s presence; for ambiguous cases, we employ Visual Question Answering (VQA) to verify it. With confirmation, we apply CLIPSeg to generate segmentation maps for the object of interest. Following this, we refine the segmentation maps with morphological operations, outline probable object locations with bounding boxes, and calculate an object-wise probability score for each box. These scores are then thresholded, and images with the surviving bounding boxes drawn on them are submitted for binary annotation to determine each box’s validity.
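The snippet below sketches the post-processing half of that pipeline: turning a (resized) CLIPSeg probability map into scored box proposals with morphological cleanup and connected components, using OpenCV. The kernel size, the 0.5 mask threshold, and the minimum-area filter are illustrative choices, not the exact values from our pipeline.

```python
# Sketch: probability map -> cleaned mask -> scored bounding-box proposals.
# Thresholds, kernel size, and min_area are illustrative choices.
import cv2
import numpy as np

def propose_boxes(prob_map: np.ndarray, min_area: int = 400):
    """prob_map: HxW float array in [0, 1], already resized to the image resolution."""
    mask = (prob_map > 0.5).astype(np.uint8)

    # Morphological opening/closing to drop speckles and fill small holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    boxes = []
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, num):                          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < min_area:
            continue
        score = float(prob_map[labels == i].mean())  # object-wise probability score
        boxes.append((x, y, w, h, score))
    return boxes
```

Boxes whose score clears a threshold can then be drawn on the image and sent out for a simple yes/no annotation.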

Annotation now becomes a much faster and easier task. This process also proves to be less error-prone, provided CLIPSeg is applicable to your use-case. If we have a lot of images to begin with, the yield from the human verification process is likely to be enough to train a detector model. Additionally, there is some really cool work here where bounding box annotation is not necessary: a binary good/bad annotation is enough to train a detector.

Multi-modal models are evolving at an incredible pace, and there are probably better ways to simplify or accelerate annotation tasks specific to the problem you are trying to solve. Keep in mind that these models are prone to their own biases, and we need to use diverse sources of information to diminish them. As we mention here, there should always be a human in the loop to critically evaluate and verify the outputs of these models, especially if they are used for important decisions or professional purposes.

“With great power comes great responsibility” — Uncle Ben

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
