Crowdsource Image Labeling for Computer Vision Using CVAT.ai and HUMAN Protocol

Mariia Krasavina
CVAT.ai
Mar 8, 2024 · 7 min read

Introduction

Crowdsourcing annotations, especially for computer vision tasks, can be tricky. While it offers great benefits, like handling a large volume of work at a low cost, it also comes with problems: inconsistent work quality, tricky quality assurance, and tough workforce management. But we’ve got a new way to make large-scale annotation easier, using CVAT.ai and the HUMAN Protocol together. This article shows how this combination can change the way you handle your annotation projects and manage your data. How? We ran a small experiment on a real-world dataset.

CVAT.ai & HUMAN Protocol Overview

Combining the HUMAN Protocol’s smart services with CVAT’s top-notch annotation tools adds Web3 technology to the workflow, enabling payments in crypto.

Here’s who benefits:

  • Requesters: If you need data marked up, whether you’re building AI models, researching, or running AI contests, this is for you. You get an automated process for setting up, managing, and checking the quality of tasks, and you only pay based on how good the work is. Just tell the platform what you need, upload your stuff, and set your standards and price. No fuss on your part.
  • Annotators: For anyone wanting to make some cash from data labeling. You could be doing this as a side gig, freelancing, or doing it full-time. Signing up is easy, and so is picking a job. Each task comes with clear instructions, and with CVAT’s help — a big name in open-source annotation tools — the work is straightforward and quick. Tasks are short, so you can fit this work into your life however you like.

Paying and getting paid is done with crypto, meaning annotators need a crypto wallet. Requesters can also pay with crypto or a card. Money for the work is set aside at the start and paid out automatically once everything’s done and checked.

This mix of CVAT.ai and HUMAN Protocol is changing the game in data annotation, making it simpler, faster, and worthwhile for everyone involved.

Quick Start

For details on how to set up an account as a requester or an annotator, please check our article: Mastering Image Annotation Crowdsourcing for Computer Vision with CVAT.ai and HUMAN Protocol

Here is a summary of what you will need as a requester before creating an annotation task:

  • A crypto wallet and a configured browser extension for wallet access.
  • An Amazon S3 data bucket configured for public access (see the sketch after this list).
  • A dataset with images in common formats (.jpg, .png, …), containing at least 2 images.
  • A small (3–5%) annotated subset of the images for automatic validation.
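
If your images are not in S3 yet, a minimal boto3 sketch along these lines can make an existing bucket publicly readable and upload a folder of images. The bucket name, local folder, and key prefix here are hypothetical, and you should double-check the exact access settings the platform expects against its documentation.

```python
import json
from pathlib import Path

import boto3  # pip install boto3

BUCKET = "my-annotation-dataset"   # hypothetical bucket name
IMAGE_DIR = Path("./images")       # hypothetical local folder with .jpg/.png files

s3 = boto3.client("s3")

# S3 blocks public bucket policies by default; lift that for this bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Grant anonymous read access to every object in the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

# Upload the dataset images under an "images/" prefix.
for pattern in ("*.jpg", "*.png"):
    for path in IMAGE_DIR.glob(pattern):
        s3.upload_file(str(path), BUCKET, f"images/{path.name}")
```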

Currently, the platform supports only two task types: bounding box and single-point annotation.

Task Examples:

  • The only format we support for Object Detection and Key Point Detection tasks is the MS COCO dataset format, a popular choice in the field. For COCO Keypoint Detection, each point is represented as a single-keypoint skeleton.
  • When it comes to validating your images, make sure the validation dataset is also a COCO .json file (a minimal example follows this list).
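
If you have never assembled a COCO file by hand, here is a minimal sketch of what one looks like for a detection task. The file name, image size, ids, and box coordinates below are made up purely for illustration; COCO boxes are given as [x, y, width, height] in pixels.

```python
import json

# A minimal COCO-style detection file: one image, one "cat" box.
coco = {
    "images": [
        {"id": 1, "file_name": "cat_001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 80.0, 200.0, 160.0],  # [x, y, width, height] in pixels
            "area": 200.0 * 160.0,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "cat"},
        {"id": 2, "name": "dog"},
    ],
}

with open("gt_annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```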

Remember, the simpler the task, the faster and easier it is to annotate. For tasks requiring detailed annotations, consider breaking them down into smaller, more manageable parts. This approach can improve both the quality and speed of your work.

Now let’s go back to the experiment:

Why We Conducted the Experiment:

We did this experiment to assess the effectiveness of crowdsourcing for data annotation in real-life scenarios. Our objectives were to understand:

  • The time commitment that is required.
  • The achievable quality of annotations.
  • The cost-efficiency of the method.

For requesters, understanding these aspects is vital to deciding whether crowdsourced annotation fits their needs in terms of cost, quality, and speed.

The Dataset Used:

We selected the Oxford Pets dataset for our study. This publicly accessible dataset includes around 3.5k images, annotated with details like classifications, bounding boxes, and segmentation masks.

Despite its moderate size, the dataset provides high-quality, manual annotations for each image. For simplicity, we narrowed our focus to two classes: cats and dogs, aiming to have annotators accurately outline the animals’ heads with precise bounding boxes. This task is especially important for models designed to differentiate between pet species.

The Experiment Process:

We gathered a group of 10 inexperienced annotators and monitored their work closely. We aimed to achieve at least an 80% accuracy rate, a common quality bar for data used to train machine learning models.

To ensure high-quality annotations, we relied on Ground Truth (GT) annotations, also known as honeypots: a small, carefully selected portion of the dataset set aside for validating the work. We used the original dataset annotations to set up the GT.
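
The Oxford Pets head boxes ship as PASCAL VOC-style XML files, so setting up the GT mostly meant converting corner coordinates into COCO’s [x, y, width, height] boxes. Here is a minimal sketch of that conversion; the directory path is hypothetical, and the XML layout is the standard VOC one we assume the dataset follows.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical path to the dataset's VOC-style XML files with head boxes.
XML_DIR = Path("oxford-iiit-pet/annotations/xmls")

def head_box_coco(xml_path):
    """Return (file_name, label, [x, y, w, h]) from one VOC-style XML file."""
    root = ET.parse(xml_path).getroot()
    file_name = root.findtext("filename")
    obj = root.find("object")        # assuming one annotated head per image
    label = obj.findtext("name")     # class label stored in the XML
    box = obj.find("bndbox")
    xmin, ymin = float(box.findtext("xmin")), float(box.findtext("ymin"))
    xmax, ymax = float(box.findtext("xmax")), float(box.findtext("ymax"))
    # VOC stores corners; COCO wants [x, y, width, height].
    return file_name, label, [xmin, ymin, xmax - xmin, ymax - ymin]

# Peek at a few converted boxes.
for xml_path in sorted(XML_DIR.glob("*.xml"))[:3]:
    print(head_box_coco(xml_path))
```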

We carefully prepared task instructions and chose 63 GT images (about 2% of the dataset) to monitor annotation quality. We assigned small sets of images to each annotator for labeling, then compared their work against the GT to systematically assess accuracy.
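
To make that comparison step concrete, here is a minimal sketch that scores one annotator’s boxes against the GT using intersection-over-union (IoU). The 0.5 threshold and the simple per-image pass/fail aggregation are our own assumptions for illustration, not the platform’s exact validation mechanism.

```python
def iou(a, b):
    """Intersection-over-union of two COCO-style boxes [x, y, w, h]."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def accuracy(worker_boxes, gt_boxes, threshold=0.5):
    """Share of GT images where the worker's box overlaps GT above the threshold.

    Both arguments map an image file name to a [x, y, w, h] box.
    The 0.5 IoU threshold is illustrative, not the platform's actual setting.
    """
    hits = sum(
        1 for name, gt in gt_boxes.items()
        if name in worker_boxes and iou(worker_boxes[name], gt) >= threshold
    )
    return hits / len(gt_boxes)

# Example: one box is close to the GT, the other is far off.
gt = {"cat_001.jpg": [120, 80, 200, 160], "dog_002.jpg": [50, 40, 180, 150]}
work = {"cat_001.jpg": [125, 85, 195, 150], "dog_002.jpg": [200, 200, 80, 60]}
print(accuracy(work, gt))  # 0.5 -> below an 80% acceptance bar
```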

Execution and Outcomes:

Addressing our initial queries with the experiment’s findings:

Time Investment:

The experiment demonstrated that it’s possible to achieve high-quality annotations without significant delays. While we initially estimated that a seasoned annotator team might complete the task in 1–3 days, including all validations and management, our novice group completed the task in 3–4 days. This time frame included minor adjustments on our part and accounted for the occasional unavailability of some annotators.

We consider this a positive result: it indicates that even without prior experience, annotating a full dataset in a short amount of time is achievable. With the lessons learned from this first attempt, we anticipate future projects will have reduced completion times, aligning more closely with our initial estimates.

Achievable Quality Level:

Contrary to the expectation that crowdsourced annotation might yield lower quality compared to professional teams, our experiment offered encouraging insights. We aimed for and achieved an 80% accuracy target, necessary for the reliability of machine learning models.

The quality of the annotations was notably good. Despite some errors, the overall results are suitable for model training, highlighting the potential of crowdsourced annotation to meet significant accuracy benchmarks effectively.

It’s important to note that in our experiment, having the complete set of annotations available allowed us to validate our statistical predictions accurately. Although we observed a slight decline in quality across the entire dataset compared to the Ground Truth (GT) subset, this was anticipated given that the GT represented just 2% of the images.

Furthermore, the quality of annotations from our crowd-sourced approach exceeded the average results typically seen on platforms like MTurk, which often range from 61% to 81%. Our findings are in line with the highest quality standards for data annotation as per existing research on crowdsourcing data quality.

This insight is vital for those contemplating the use of crowd-sourced annotation for their projects. It demonstrates that crowd-sourcing can provide a cost-effective and timely means of annotating visual data and delivering work quality that meets the requirements for training advanced deep learning models.

Cost Effectiveness

Exploring the cost-effectiveness of using a crowd for data annotation, we found that the expense was impressively low. Annotating each dataset image (or bounding box) cost merely $0.02, slightly below typical market rates. For the entire dataset, where most images contained a single object, the total cost came to $72.

Here’s a straightforward breakdown of our pricing model:

For every task, we included up to 10 regular images that we compensated for, alongside 2 Ground Truth (GT) images that weren’t paid for. At 2 cents per image, each task cost 20 cents, adding up to $72 for the 3,600 paid images. This approach ensured we paid only for annotations that passed our quality assessments, so the budget went exclusively to accurate work.
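
The arithmetic behind that breakdown fits in a few lines (the variable names are ours; the figures are the ones above):

```python
PRICE_PER_IMAGE = 0.02      # USD per accepted regular image
PAID_IMAGES_PER_TASK = 10   # regular (paid) images in one task
GT_IMAGES_PER_TASK = 2      # honeypot images, not paid for
TOTAL_PAID_IMAGES = 3_600   # paid images across the whole dataset

cost_per_task = PAID_IMAGES_PER_TASK * PRICE_PER_IMAGE    # $0.20
total_cost = TOTAL_PAID_IMAGES * PRICE_PER_IMAGE          # $72.00
tasks = TOTAL_PAID_IMAGES / PAID_IMAGES_PER_TASK          # 360 tasks
images_served = tasks * (PAID_IMAGES_PER_TASK + GT_IMAGES_PER_TASK)  # incl. unpaid honeypots

print(f"${cost_per_task:.2f} per task, ${total_cost:.2f} total, "
      f"{tasks:.0f} tasks, {images_served:.0f} images served")
```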

In our system, payments are made using HMT, a form of cryptocurrency, facilitating a quick and seamless transaction process. Although we exclusively deal in cryptocurrency, annotators have the option to convert their earnings into other cryptocurrencies or traditional (fiat) money if they prefer.

This demonstrates the efficiency and affordability of employing CVAT.ai and Human Protocol for crowd-sourced annotation, offering a cost-effective method for acquiring high-quality labeled data.

Conclusions

Was our approach practical? Our findings show that crowdsourced annotation is a feasible and efficient method, delivering the expected quality with minimal variance.

We pinpointed areas for improvement, notably reducing workforce management to just onboarding and technical assistance, since everything from recruitment to payment is already automated.

We encourage requesters and annotators to use our service, which provides a streamlined, automated platform for executing high-quality data annotation projects. If you need assistance, feel free to reach out to us at contact@cvat.ai.
