Data Annotation using Open Source and Proprietary Tools

RoboFlow, AWS SageMaker, Azure ML, GCP Vertex AI

Chanaka
12 min read · Jun 30, 2024

What is Data Annotation?

  • Data annotation is also known as Data Labeling, Data Tagging, Data Enrichment, and Ground Truth Generation (as in AWS SageMaker Ground Truth).
  • It is the process of adding relevant, contextual information to data or datasets.
  • The raw data can be images, text, audio files, or even videos.

Types of Machine Learning

  • Supervised machine learning: Where the training data set is labeled
  • Unsupervised machine learning: Where the training data is not labeled
  • Semi-supervised machine learning: Where the training data is a mixture of both labeled and unlabeled data
  • Reinforcement learning: Where the model learns by trial and error

Data annotation is the lifeline of machine learning tasks. Labeling, tagging, describing, annotating, or adding metadata to your data is essential to make machine learning work.

Principles of data annotation

  • Accuracy — use clear guidelines, and involve domain experts when needed (e.g., medical data).
  • Relevance — annotate for the use case (e.g., object detection annotations of an image dataset are not useful for an image classification model). Data should be representative and sampled as closely as possible to the final use case (e.g., if the final predictions are on bird images, annotate bird images).
  • Quality — test samples at predetermined intervals; inter-annotator scoring can also be used to ensure quality.
  • Efficiency — timeliness and the use of efficient annotation tools.

Types of Data Annotation

  • Text annotation — used for text classification, named entity recognition, speech tagging, keyword extraction, sentiment analysis, text translation, and question answering.
  • Audio annotation — used for transcription, speaker recognition, sound detection and classification.
  • Image annotation — can be done for pictures, geospatial images, and radiological images, supporting machine learning tasks such as classification, scene description, object detection, and segmentation. These annotations can be bounding boxes, polygons, dots, or segmentation masks, depending on the use case.
  • Video annotation — used for object tracking and activity recognition, as well as every task that can be done on images, because videos are a sequence of images.

Common Data Storage Formats

  • Text data: CSV, JSON, JSONL, XML
  • Image data: JPEG, PNG, SVG, TIFF
  • Video data: MP4, AVI, MOV, WebM
  • Audio data: WAV, MP3, AAC
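For text datasets, JSONL (one JSON object per line) is a particularly convenient interchange format for annotations, since each line is an independent record. A minimal sketch of writing and reading annotation records as JSONL (the field names here are illustrative):

```python
import json

# Illustrative annotation records: one JSON object per example.
records = [
    {"text": "The eagle soared overhead.", "label": "bird"},
    {"text": "Quarterly revenue increased.", "label": "finance"},
]

# Write JSONL: one json.dumps() per line.
with open("annotations.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back, one record per line.
with open("annotations.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["label"])  # bird
```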

Types of Data Annotation Tools

  • Open Source — Free
  • Proprietary — Paid

Choice of Data Annotation Tool Depends On

  • Budget
  • Relevance
  • Availability
  • Expertise
  • Task

Examples of Open-Source Annotation Tools

  • LabelImg — For computer vision tasks
  • CVAT (Computer Vision Annotation Tool)
  • VGG Image Annotator (VIA)
  • Doccano — For NLP tasks
  • TagEditor — For NLP tasks
  • LightTag — For NLP tasks
  • brat (Browser Based Rapid Annotation)
  • Label Studio — For many types of labeling tasks
  • Audacity — For audio
  • audino — For audio

Proprietary Annotation Tools

Examples include Roboflow's paid tiers, Prodigy, AWS SageMaker Ground Truth, Azure ML, and GCP Vertex AI, all covered in the sections below.

Common Annotation Formats

Computer Vision

  • COCO JSON
  • Pascal VOC XML (Visual Object Classes)
  • YOLO TXT
  • ImageNet VID
  • TensorFlow TFRecord
  • LabelMe JSON
  • Amazon SageMaker Ground Truth manifest
  • Google Cloud AutoML Vision CSV
  • VGG Image Annotator JSON
  • Microsoft VoTT CSV

Natural Language Processing

  • spaCy’s JSON
  • CoNLL
  • OntoNotes XML
  • brat
  • TXT
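Most of these NLP formats boil down to character or token offsets. For instance, spaCy-style training data pairs a text with (start, end, label) character spans; a quick sanity check that offsets line up with the text (the example data is made up):

```python
text = "Apple was founded in California."

# spaCy-style character-offset entity annotations: (start, end, label).
entities = [(0, 5, "ORG"), (21, 31, "GPE")]

# Verify each span actually covers the intended substring.
for start, end, label in entities:
    print(f"{text[start:end]!r} -> {label}")
```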

There are many other tools and annotations as well.

Assessing the Data Annotation Quality

  1. Inter-Annotator Agreement (IAA)
  • Measures the degree of agreement between annotators
  • There are many metrics for IAA

IAA Metrics

  • Cohen’s kappa
  • Fleiss’s kappa
  • Pearson correlation coefficient
  • Spearman’s rank correlation coefficient
  • Kendall’s Tau
  • Krippendorff’s alpha
  • Fleiss’s multirater kappa
  • Percentage agreement
  • Intraclass correlation coefficient (ICC)
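As a concrete example, Cohen's kappa corrects raw percentage agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A small pure-Python sketch for two annotators (libraries such as scikit-learn also provide this metric):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

ann1 = ["cat", "dog", "dog", "cat", "dog"]
ann2 = ["cat", "dog", "cat", "cat", "dog"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.615
```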

2. Calculating Mean Opinion Score

  • Expert raters

3. Gold standard evaluation

  • Subset annotation as ground truth
  • Assessment against the ground truth

4. Comparing to closely related Benchmark datasets

  • Comparison with known benchmark

5. Measuring annotator reliability

  • Consistency in annotation
  • Real-time feedback

6. Random checks by looking at the dataset

Web-Based and Software-Based Data Annotation Platforms

1. Python pigeon

Very basic image annotator: Python Pigeon Library

2. CVAT (web based tool)

CVAT: Simple annotation (Cat or Dog for Image Classification)
CVAT: Drawing bounding boxes (for object detection)
CVAT: Semantic segmentation (for advanced computer vision tasks)
YOLO model (pre-trained automatic object detection model)
SAM: Segment Anything Model (pre-trained automatic image segmentation model)

3. RoboFlow (web based tool)

https://roboflow.com/

  • In this tool, anything you do in the free version is public.
  • For a private workspace, you have to subscribe to a paid plan.
Roboflow: creating a workspace
  • Roboflow can be used for many tasks as you can see below
Roboflow: creating a new project
Roboflow: Exporting a labeled dataset
  • As you can see above, we can even resize or pre-process the images when exporting a labeled dataset from Roboflow.
Roboflow: Simple image labeling for classification
Roboflow: Bounding boxes for object detection
Roboflow: image segmentation for plant disease detection
  • Roboflow allows collaboration
Roboflow: Invite others to contribute as a Labeler

Important: We can use Roboflow with Ultralytics and YOLOv8 for easier and quicker annotation in real-life projects: https://docs.ultralytics.com/#where-to-start

4. Using Google Sheets for Creating Text Datasets

Google Sheets: Text dataset creation (can be exported as either an Excel file or a CSV)
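A sheet exported this way is trivial to load for training. A sketch using Python's standard csv module (the column names are illustrative):

```python
import csv
import io

# Simulated contents of a CSV exported from Google Sheets.
csv_data = """text,label
"Great product, works well",positive
"Arrived broken",negative
"""

rows = list(csv.DictReader(io.StringIO(csv_data)))
for row in rows:
    print(row["label"], "|", row["text"])
```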

5. Universal Data Tool

  • Download the tool from here: https://universaldatatool.com/ It can also be used online.
  • Universal Data Tool supports the following annotation tasks
Universal data tool: Supported annotation tasks
Universal data tool: Named Entity Recognition labeling
Universal data tool: Label text for text classification

6. Prodigy (A fully scriptable tool)

  • To use this, you need to obtain a license from: https://prodi.gy/
  • As this is a fully scriptable tool, you can use a command-line interface such as the Anaconda Prompt to set it up
Prodigy: setup the tool using a terminal
Prodigy: start using conda
Prodigy: creating a labeling interface using the terminal
Prodigy: Manual annotation for named entity recognition
Prodigy: named entity recognition annotations that we just created
  • Prodigy supports semi automatic text annotation for Named Entity Recognition (NER) using pre-trained models.

e.g.: python -m prodigy ner.correct sample_text en_core_web_sm "file path to the dataset"

model => en_core_web_sm (the pre-trained spaCy model used to suggest annotations)

Other annotation tasks supported by Prodigy:

- Manual annotation for named entity recognition

- Semi-automatic text annotation for NER

- Command-line text classification

- Labeling for text classification

- Part-of-speech (POS) labeling

- Sentence boundary labeling

- Audio data labeling

- Audio data transcription

and much more…

Prodigy: Audio data transcription

If you want to read more, see the Prodigy docs here: https://prodi.gy/docs
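Prodigy typically saves annotations as JSONL, with each example carrying the text, its labeled spans, and an accept/reject answer. A sketch of filtering such an export (the field names below follow that general shape, but check the docs for your specific recipe):

```python
import json

# Two lines in the general shape of Prodigy's NER export (illustrative).
exported = [
    '{"text": "Alan Turing was born in London.", '
    '"spans": [{"start": 0, "end": 11, "label": "PERSON"}], "answer": "accept"}',
    '{"text": "Nothing to tag here.", "spans": [], "answer": "reject"}',
]

# Keep accepted examples and pull out each labeled substring.
for line in exported:
    example = json.loads(line)
    if example["answer"] != "accept":
        continue
    for span in example["spans"]:
        print(example["text"][span["start"]:span["end"]], "->", span["label"])
```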

Cloud-Based Data Annotation Platforms

1. AWS SageMaker

  1.1. AWS SageMaker Ground Truth (setup)
AWS Console: Sign up
AWS Console: Search for Amazon SageMaker
AWS Console: Amazon SageMaker Page

SageMaker supports many tasks, as explained below.

Typical SageMaker workflow

1. Label data

Set up and manage labeling jobs for highly accurate training datasets within Amazon SageMaker, using active learning and human labeling.

2. Build

Connect to other AWS services and transform data in Amazon SageMaker notebooks.

3. Train

Use Amazon SageMaker’s algorithms and frameworks, or bring your own, for distributed training.

4. Tune

Amazon SageMaker automatically tunes your model by adjusting multiple combinations of algorithm parameters.

5. Deploy

After training is completed, models can be deployed to Amazon SageMaker endpoints, for real-time predictions.

6. Discover

Find, buy, and deploy ready-to-use model packages, algorithms, and data products in AWS Marketplace.

  1.2. Single-label image classification annotation
SageMaker: Ground Truth Labeling Job
AWS Console: Search for S3 (we are going to upload an unlabeled dataset to start labeling)
S3 bucket: create a new bucket
S3 bucket creation: give a name and create a bucket with default settings
S3 bucket: select the newly created S3 bucket to upload the dataset
S3 bucket: Starting the dataset upload
S3 bucket: Uploading the dataset
S3 bucket: Uploading the dataset
S3 bucket: Uploading the dataset
S3 bucket: Dataset upload completed
AWS ground truth: Create a labeling workforce
Labeling workforce: create a private team
Labeling workforce: create a private team
Labeling workforce: give your team a name
Labeling workforce: create a new user group
Labeling workforce: email invitation
Labeling workforce: create private team
Labeling workforce: Invite new workers
Labeling workforce: Add emails and invite new workers; you can preview the invitation here as well
AWS SageMaker Ground Truth: Image labeling task types
AWS SageMaker Ground Truth: Image labeling task types
Creating a labeling job
Labeling images using the temporary worker account
  1.3. Multi-label image classification annotation
Setup a multi label classification annotation job
How the task will look like
How the result will look like
Temporary login for workers
  1.4. Image bounding box annotation for object detection
creating the task
How it will look like to the annotators
How the output will look like
  1.5. Image semantic segmentation annotation
Task setup
How the task will look like to the annotators
How the output will look like
  1.6. Video object tracking annotation
An AWS SageMaker Ground Truth job can be any of these types; for this task we are going to use the video type
Create task
How the job will look like
  1.7. Text labeling for classification
Select text as the data type
task creation
task creation
how the task will look like
how the output will look like
  1.8. NER text labeling
Select named entity recognition
Task creation
How the task will look like

2. Azure ML

Go to: https://ml.azure.com/

create a new workspace
select data labeling and select create
image labeling tasks
text labeling tasks
audio labeling tasks
create project
Upload a dataset
We can enable incremental refresh if the dataset is updated continuously (e.g., if the dataset comes from a remote source that is refreshed continuously)
create image labels
Annotation Interface

Important: One benefit we have over AWS here is that we can control brightness and contrast while annotating, for better annotation.

Classification interface
Export labeled dataset
How the output will look like

Other annotation tasks that can be done with Azure ML:
- Multi class image classification annotation

- Multi label image classification annotation

- Image bounding box annotation for object detection

- Image instance segmentation annotation

- Text labeling for classification

- NER text labeling

- Audio transcription

and much more…

3. GCP Vertex AI

create/ select an existing project
select new project
create a new project
select the project that we just created
  • Create a Cloud Storage bucket (analogous to an AWS S3 bucket) to keep the dataset
search for storage
create a new bucket
fill all the required fields
Now you can upload any instruction files for data annotation here
  • Setup Vertex AI
Search for Vertex AI
Select datasets from Vertex AI Dashboard
Enable Vertex AI API
Create dataset (this is where you are going to upload your dataset)
Types of Image data
Types of Tabular data
Types of Text Data
Types of Video data

Important: GCP supports many data types compared to Azure ML (especially for video annotation tasks)

  • Multi label image annotation with Vertex AI

First of all, create an image classification (multi-label) dataset as follows

create image dataset
Upload dataset files from your computer (or somewhere else)
Give a cloud storage path to save the uploaded images (that’s why we created a storage bucket previously)
Wait for the image import to complete
After the import completes, it will look like this
Add new labels for annotation
This is how it looks like when doing the labeling
Assigning data to either the training, validation, or test set
Exporting an annotated dataset
Exporting to Google Cloud Storage

Other annotation tasks that can be done with Vertex AI:
- Multi-label image annotation

- Image bounding box annotation for object detection

- text entity annotation

- text sentiment scale annotation

- single-label text labeling

- video classification annotation

and much more…

Things to consider when selecting a Data Annotation Tool

  • Complexity of your labeling task
  • Cost of each platform
  • Pre-trained model support
  • Quality of the data samples in your dataset (if it is a computer vision dataset)
  • Brightness and contrast of your data samples (if it is a computer vision dataset)
  • Output data formats that we can obtain after labeling a dataset


🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/

👨‍💻 Follow me on GitHub: https://github.com/ChanakaDev

Chanaka

Problem Solver 🧩 | UX Designer 🎨 | Software Engineer 💻 | Data Science enthusiast 📊