Data Annotation using Open Source and Proprietary Tools
What is Data Annotation?
- Data annotation is also known as Data Labeling, Data Tagging, Data Enrichment and Ground Truth Generation (AWS SageMaker Ground Truth).
- It is the process of adding relevant and contextual information to data or datasets.
- The raw data can be images, text, audio files, or even videos.
Types of Machine Learning
- Supervised machine learning: Where the training data set is labeled
- Unsupervised machine learning: Where the training data is not labeled
- Semi-supervised machine learning: Where the training data is a mixture of both labeled and unlabeled data
- Reinforcement learning: Where the model trains by trial and error.
Data annotation is the lifeline of machine learning tasks. It is essential to label, tag, describe, annotate, or add metadata to your data to make supervised machine learning work.
Principles of data annotation
- Accuracy — use clear guidelines, and involve domain experts when needed (eg: medical data)
- Relevance — annotations must match the use case (eg: object detection annotations on an image dataset are not useful for an image classification model), and the data should be representative and sampled as closely as possible to the final use case (eg: if the final predictions are on bird images, we should annotate bird images)
- Quality — sample and test annotations at predetermined time intervals; inter-annotator scoring can also be used to ensure quality
- Efficiency — timeliness and use of efficient annotation tools
Types of Data Annotation
- Text annotation — used for text classification, named entity recognition, part-of-speech tagging, keyword extraction, sentiment analysis, text translation, and question answering.
- Audio annotation — used for transcription, speaker recognition, sound detection and classification.
- Image annotation — can be done on photographs, geospatial images, and radiological images, and supports machine learning tasks such as classification, scene description, object detection, and segmentation. The annotations can be bounding boxes, polygons, dots, or segmentation masks, depending on the use case.
- Video annotation — used for object tracking and activity recognition, as well as every task that can be done on images, because a video is a sequence of images.
Common Data Storage Formats
- Text data: CSV, JSON, JSONL, XML
- Image data: JPEG, PNG, SVG, TIFF
- Video data: MP4, AVI, MOV, WebM
- Audio data: WAV, MP3, AAC
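Several of the text formats above, such as JSONL, can be read with Python's standard library alone. A minimal sketch, using a made-up two-record sentiment dataset for illustration:

```python
import json

def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts,
    skipping blank lines (each non-blank line is one JSON object)."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical labeled-text records, one JSON object per line.
sample = [
    '{"text": "Great product!", "label": "positive"}',
    '{"text": "Arrived broken.", "label": "negative"}',
]
records = read_jsonl(sample)
print(records[0]["label"])  # positive
```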
Types of Data Annotation Tools
- Open Source — Free
- Proprietary — Paid
Choice of Data Annotation Tool Depends On
- Budget
- Relevance
- Availability
- Expertise
- Task
Examples of Open-Source Annotation Tools
- LabelImg — For computer vision tasks
- CVAT (Computer Vision Annotation Tool)
- VGG Image Annotator (VIA)
- Doccano — For NLP tasks
- TagEditor — For NLP tasks
- LightTag — For NLP tasks
- brat (Browser Based Rapid Annotation)
- Label Studio — For many types of labeling tasks
- Audacity — For audio
- audino — For audio
Proprietary Annotation Tools
- Roboflow: https://roboflow.com/
- Prodigy: https://prodi.gy/
- Amazon SageMaker Ground Truth: https://aws.amazon.com/pm/sagemaker/
- Scale AI: https://scale.com/
- Labelbox: https://labelbox.com/
Common Annotation Formats
Computer Vision
- COCO JSON
- Pascal VOC XML (Visual Object Classes)
- YOLO TXT
- ImageNet VID
- TensorFlow TFRecord
- LabelMe JSON
- Amazon SageMaker Ground Truth manifest
- Google Cloud AutoML Vision CSV
- VGG Image Annotator JSON
- Microsoft VoTT CSV
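These formats encode the same boxes differently: Pascal VOC stores absolute pixel corners, while YOLO TXT stores a class id plus center/width/height normalized by the image size. A minimal conversion sketch (the box and image dimensions below are illustrative):

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, class_id):
    """Convert one Pascal VOC box (absolute pixel corners) to a
    YOLO TXT line (normalized center x/y, width, height)."""
    x_c = (xmin + xmax) / 2 / img_w
    y_c = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x200 px box in a 640x480 image, labeled as class 0.
print(voc_to_yolo(100, 100, 200, 300, 640, 480, 0))
```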
Natural Language Processing
- spaCy’s JSON
- CoNLL
- OntoNotes XML
- brat
- TXT
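CoNLL-style NER files are simple to work with: one token and its tag per line, with blank lines separating sentences. A minimal parsing sketch, assuming the tag is the last whitespace-separated column (real CoNLL files may carry extra columns):

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of sentences, where each
    sentence is a list of (token, tag) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:                       # flush the last sentence
        sentences.append(current)
    return sentences

sample = "John B-PER\nlives O\nin O\nParis B-LOC\n\nHello O"
print(parse_conll(sample))
```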
There are many other tools and annotation formats as well.
Assessing the Data Annotation Quality
1. Inter-Annotator Agreement (IAA)
- Measures the degree of agreement between annotators
- There are many metrics for IAA
IAA Metrics
- Cohen’s kappa
- Fleiss’s kappa
- Pearson correlation coefficient
- Spearman’s rank correlation coefficient
- Kendall’s Tau
- Krippendorff’s alpha
- Fleiss’s multirater kappa
- Percentage agreement
- Intraclass correlation coefficient (ICC)
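Two of these metrics are easy to sketch for the two-annotator case: percentage agreement (observed agreement) and Cohen's kappa, which corrects that figure for the agreement expected by chance. The labels below are hypothetical:

```python
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of items on which both annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percentage_agreement(a, b)            # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label frequencies
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sentiment labels from two annotators on six items.
ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "pos", "neg", "pos", "pos", "neg"]
print(percentage_agreement(ann1, ann2))  # 5/6 ≈ 0.833
print(cohens_kappa(ann1, ann2))          # 2/3 ≈ 0.667
```

Note how kappa (0.667) is lower than raw agreement (0.833): with only two labels, some agreement happens by chance, and kappa discounts it.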
2. Calculating Mean Opinion Score
- Expert raters
3. Gold standard evaluation
- Subset annotation as ground truth
- Assessment against the ground truth
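The gold-standard idea above can be sketched in a few lines: an expert-annotated subset serves as ground truth, and each annotator's labels are scored against it. The labels here are placeholders:

```python
def accuracy_vs_gold(annotator, gold):
    """Share of items where the annotator matches the gold subset."""
    matches = sum(a == g for a, g in zip(annotator, gold))
    return matches / len(gold)

gold = ["cat", "dog", "cat", "bird"]       # expert-annotated ground truth
annotator = ["cat", "dog", "dog", "bird"]  # labels under review
print(accuracy_vs_gold(annotator, gold))   # 0.75
```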
4. Comparing to closely related Benchmark datasets
- Comparison with known benchmark
5. By measuring the Annotator reliability
- Consistency in annotation
- Real-time feedback
6. Random checks by looking at the dataset
Data annotation Web based and Software based Platforms
1. Python pigeon
- pigeon (a widget-based labeler that runs inside Jupyter notebooks): https://github.com/agermanidis/pigeon
2. CVAT (web based tool)
- CVAT: https://www.cvat.ai/
- Supports image labeling and drawing bounding boxes
- Model-assisted labeling using models like YOLO: https://pjreddie.com/darknet/yolo/
- Manual semantic segmentation
- Automatic semantic segmentation with the Segment Anything Model: https://segment-anything.com/
3. RoboFlow (web based tool)
- In the free version of this tool, anything you do is public.
- For a private workspace, you have to subscribe.
- Roboflow can be used for many types of annotation tasks.
- When exporting a labeled dataset from Roboflow, we can even resize and pre-process the images.
- Roboflow allows collaboration.
Important: We can use Roboflow with Ultralytics and YOLOv8 for easier and quicker annotation in real-life projects: https://docs.ultralytics.com/#where-to-start
4. Using Google Sheets for Creating Text Datasets
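A sheet built this way can be downloaded as CSV (File > Download > Comma Separated Values) and loaded with Python's standard library. A minimal sketch; the `text`/`label` column layout here is an assumed one:

```python
import csv
import io

# Stand-in for the contents of a CSV exported from Google Sheets,
# with one assumed header row and one labeled example per data row.
csv_text = "text,label\nGreat service,positive\nToo slow,negative\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'text': 'Great service', 'label': 'positive'}
```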
5. Universal Data Tool
- Download the tool from here: https://universaldatatool.com/ But it can be used online as well.
- The Universal Data Tool supports several types of annotation tasks.
6. Prodigy (a fully scriptable tool)
- To use this you need to obtain a license from: https://prodi.gy/
- As this is a fully scriptable tool, you can use a command-line shell such as the Anaconda Prompt to set it up
- Prodigy supports semi-automatic text annotation for Named Entity Recognition (NER) using pre-trained models.
eg: python -m prodigy ner.correct sample_text en_core_web_sm "file path to the dataset"
model => en_core_web_sm
Other annotation tasks that are supported by Prodigy:
- manual annotation for named entity recognition
- semi automatic text annotation for NER
- Command line text classification
- Labeling for text classification
- Part of speech (POS) labeling
- Sentence boundary labeling
- Audio data labeling
- Audio data transcription
and much more…
if you want to read more, read the prodigy docs here: https://prodi.gy/docs
Data annotation Cloud based Platforms
1. AWS SageMaker
- 1. AWS SageMaker Ground Truth (setup)
SageMaker supports many tasks, as explained below.
Typical SageMaker workflow
1. Label data
Set up and manage labeling jobs for highly accurate training datasets within Amazon SageMaker, using active learning and human labeling.
2. Build
Connect to other AWS services and transform data in Amazon SageMaker notebooks.
3. Train
Use Amazon SageMaker’s algorithms and frameworks, or bring your own, for distributed training.
4. Tune
Amazon SageMaker automatically tunes your model by adjusting multiple combinations of algorithm parameters.
5. Deploy
After training is completed, models can be deployed to Amazon SageMaker endpoints, for real-time predictions.
6. Discover
Find, buy, and deploy ready-to-use model packages, algorithms, and data products in AWS Marketplace.
- 2. Single label image classification annotation
- 3. Multi label image classification annotation
- 4. Image bounding box annotation for object detection
- 5. Image semantic segmentation annotation
- 6. Video object tracking annotation
- 7. Text labeling for classification
- 8. NER text labeling
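Ground Truth writes its results as an augmented output manifest in JSON Lines, where each record pairs an S3 object with the assigned label and its metadata. A minimal reading sketch; the attribute names (`my-job`, `my-job-metadata`) are derived from the labeling-job name, so this record is illustrative rather than the exact schema:

```python
import json

# One illustrative record from a Ground Truth output manifest.
# The "my-job" keys are placeholders for the labeling-job name.
line = (
    '{"source-ref": "s3://my-bucket/img1.jpg", '
    '"my-job": 1, '
    '"my-job-metadata": {"class-name": "dog", "confidence": 0.95}}'
)
record = json.loads(line)
print(record["source-ref"], record["my-job-metadata"]["class-name"])
```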
2. Azure ML
Go to: https://ml.azure.com/
Important: One benefit over AWS here is that we can adjust image brightness and contrast while annotating, which helps produce better annotations.
Other annotation tasks that can be done with Azure ML:
- Multi-class image classification annotation
- Multi-label image classification annotation
- Image bounding box annotation for object detection
- Image instance segmentation annotation
- Text labeling for classification
- NER text labeling
- Audio transcription
and much more…
3. GCP Vertex AI
- Create a Cloud Storage bucket (the equivalent of an AWS S3 bucket) to keep the dataset
- Setup Vertex AI
Important: GCP supports more data types than Azure ML (especially for video annotation tasks)
- Multi label image annotation with Vertex AI
First, create a multi-label image classification dataset.
Other annotation tasks that can be done with Vertex AI:
- Multi-label image annotation
- Image bounding box annotation for object detection
- Text entity annotation
- Text sentiment scale annotation
- Single-label text labeling
- Video classification annotation
and much more…
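Vertex AI imports datasets from files stored in the Cloud Storage bucket created above. A minimal sketch of building a JSON Lines import file for multi-label image classification; the field names follow my understanding of the GCP schema and should be checked against the current docs, and the gs:// paths are placeholders:

```python
import json

# Placeholder image URIs in a Cloud Storage bucket, each with its labels.
items = [
    ("gs://my-bucket/birds/img1.jpg", ["sparrow", "perched"]),
    ("gs://my-bucket/birds/img2.jpg", ["eagle"]),
]

# One JSONL line per image; "classificationAnnotations" holds the
# multi-label list (assumed field names, verify against GCP docs).
lines = [
    json.dumps({
        "imageGcsUri": uri,
        "classificationAnnotations": [{"displayName": l} for l in labels],
    })
    for uri, labels in items
]
print(lines[0])
```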
Things to consider when selecting a Data Annotation Tool
- Complexity of your labeling task
- Cost of each platform
- Pre-trained model support
- Quality of the data samples in your dataset (if it is a computer vision dataset)
- Brightness and contrast of your data samples (if it is a computer vision dataset)
- Output data types that we can take after labeling a dataset
🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/
👨💻 Follow me on GitHub: https://github.com/ChanakaDev