Data Annotation using Open Source and Proprietary Tools
What is Data Annotation?
- Data annotation is also known as Data Labeling, Data Tagging, Data Enrichment and Ground Truth Generation (AWS SageMaker Ground Truth).
- It is the process of adding relevant and contextual information to data or datasets.
- The raw data can be images, text, audio files, or even videos.
Types of Machine Learning
- Supervised machine learning: Where the training data set is labeled
- Unsupervised machine learning: Where the training data is not labeled
- Semi-supervised machine learning: Where the training data is a mixture of both labeled and unlabeled data
- Reinforcement learning: Where the model trains by trial and error.
Data annotation is the lifeline of machine learning tasks. It is essential to label, tag, describe, annotate, or add metadata to your data to make supervised machine learning work.
Principles of data annotation
- Accuracy — use clear guidelines, and involve domain experts when needed (eg: medical data)
- Relevance — annotations must match the use case (eg: object detection annotations on an image dataset are not useful for an image classification model), and the data should be representative and sampled as closely as possible to the final use case (eg: if the final predictions are on bird images, we should annotate bird images)
- Quality — sample and test annotations at predetermined time intervals; inter-annotator scoring can also be used to ensure quality
- Efficiency — timeliness and use of efficient annotation tools
Types of Data Annotation
- Text annotation — used for text classification, named entity recognition, part-of-speech tagging, keyword extraction, sentiment analysis, text translation, and question answering.
- Audio annotation — used for transcription, speaker recognition, sound detection and classification.
- Image annotation — can be done on photographs, geospatial images, and radiological images, and supports machine learning tasks such as classification, scene description, object detection, and segmentation. The annotations can be bounding boxes, polygons, dots, or segmentation masks, depending on the use case.
- Video annotation — used for object tracking and activity recognition, as well as every task that can be done on images, because a video is a sequence of images.
Common Data Storage Formats
- Text data: CSV, JSON, JSONL, XML
- Image data: JPEG, PNG, SVG, TIFF
- Video data: MP4, AVI, MOV, WebM
- Audio data: WAV, MP3, AAC
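Several of the text formats above, such as JSONL, can be read with Python's standard library alone. A minimal sketch, using a made-up two-record sentiment dataset for illustration:

```python
import json

def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts,
    skipping blank lines (each non-blank line is one JSON object)."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical labeled-text records, one JSON object per line.
sample = [
    '{"text": "Great product!", "label": "positive"}',
    '{"text": "Arrived broken.", "label": "negative"}',
]
records = read_jsonl(sample)
print(records[0]["label"])  # positive
```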
Types of Data Annotation Tools
- Open Source — Free
- Proprietary — Paid
Choice of Data Annotation Tool Depends On
- Budget
- Relevance
- Availability
- Expertise
- Task
Examples of Open-Source Annotation Tools
- LabelImg — For computer vision tasks
- CVAT (Computer Vision Annotation Tool)
- VGG Image Annotator (VIA)
- Doccano — For NLP tasks
- TagEditor — For NLP tasks
- LightTag — For NLP tasks
- brat (Browser Based Rapid Annotation)
- Label Studio — For many types of labeling tasks
- Audacity — For audio
- audino — For audio
Proprietary Annotation Tools
- Roboflow: https://roboflow.com/
- Prodigy: https://prodi.gy/
- Amazon SageMaker Ground Truth: https://aws.amazon.com/pm/sagemaker/
- Scale AI: https://scale.com/
- Labelbox: https://labelbox.com/
Common Annotation Formats
Computer Vision
- COCO JSON
- Pascal VOC XML (Visual Object Classes)
- YOLO TXT
- ImageNet VID
- TensorFlow TFRecord
- LabelMe JSON
- Amazon SageMaker Ground Truth manifest
- Google Cloud AutoML Vision CSV
- VGG Image Annotator JSON
- Microsoft VoTT CSV
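These formats encode the same boxes differently: Pascal VOC stores absolute pixel corners, while YOLO TXT stores a class id plus center/width/height normalized by the image size. A minimal conversion sketch (the box and image dimensions below are illustrative):

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, class_id):
    """Convert one Pascal VOC box (absolute pixel corners) to a
    YOLO TXT line (normalized center x/y, width, height)."""
    x_c = (xmin + xmax) / 2 / img_w
    y_c = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x200 px box in a 640x480 image, labeled as class 0.
print(voc_to_yolo(100, 100, 200, 300, 640, 480, 0))
```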
Natural Language Processing
- spaCy’s JSON
- CoNLL
- OntoNotes XML
- brat
- TXT
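CoNLL-style NER files are simple to work with: one token and its tag per line, with blank lines separating sentences. A minimal parsing sketch, assuming the tag is the last whitespace-separated column (real CoNLL files may carry extra columns):

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of sentences, where each
    sentence is a list of (token, tag) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:                       # flush the last sentence
        sentences.append(current)
    return sentences

sample = "John B-PER\nlives O\nin O\nParis B-LOC\n\nHello O"
print(parse_conll(sample))
```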
There are many other tools and annotation formats as well.
Assessing the Data Annotation Quality
1. Inter-Annotator Agreement (IAA)
- Measures the degree of agreement between annotators
- There are many metrics for IAA
IAA Metrics
- Cohen’s kappa
- Fleiss’s kappa
- Pearson correlation coefficient
- Spearman’s rank correlation coefficient
- Kendall’s Tau
- Krippendorff’s alpha
- Fleiss’s multirater kappa
- Percentage agreement
- Intraclass correlation coefficient (ICC)
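Two of these metrics are easy to sketch for the two-annotator case: percentage agreement (observed agreement) and Cohen's kappa, which corrects that figure for the agreement expected by chance. The labels below are hypothetical:

```python
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of items on which both annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percentage_agreement(a, b)            # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label frequencies
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sentiment labels from two annotators on six items.
ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "pos", "neg", "pos", "pos", "neg"]
print(percentage_agreement(ann1, ann2))  # 5/6 ≈ 0.833
print(cohens_kappa(ann1, ann2))          # 2/3 ≈ 0.667
```

Note how kappa (0.667) is lower than raw agreement (0.833): with only two labels, some agreement happens by chance, and kappa discounts it.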
2. Calculating Mean Opinion Score
- Expert raters
3. Gold standard evaluation
- Subset annotation as ground truth
- Assessment against the ground truth
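The gold-standard idea above can be sketched in a few lines: an expert-annotated subset serves as ground truth, and each annotator's labels are scored against it. The labels here are placeholders:

```python
def accuracy_vs_gold(annotator, gold):
    """Share of items where the annotator matches the gold subset."""
    matches = sum(a == g for a, g in zip(annotator, gold))
    return matches / len(gold)

gold = ["cat", "dog", "cat", "bird"]       # expert-annotated ground truth
annotator = ["cat", "dog", "dog", "bird"]  # labels under review
print(accuracy_vs_gold(annotator, gold))   # 0.75
```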
4. Comparing to closely related Benchmark datasets
- Comparison with known benchmark
5. By measuring the Annotator reliability
- Consistency in annotation
- Real-time feedback
6. Random checks by looking at the dataset
Data annotation Web based and Software based Platforms
1. Python pigeon
- pigeon (a widget-based labeler that runs inside Jupyter notebooks): https://github.com/agermanidis/pigeon
2. CVAT (web based tool)
- CVAT: https://www.cvat.ai/
- Supports image labeling and drawing bounding boxes
- Model-assisted labeling using models like YOLO: https://pjreddie.com/darknet/yolo/
- Manual semantic segmentation
- Automatic semantic segmentation with the Segment Anything Model: https://segment-anything.com/
3. RoboFlow (web based tool)
- In the free version of this tool, anything you do is public.
- For a private workspace, you have to subscribe.
- Roboflow can be used for many types of annotation tasks.
- When exporting a labeled dataset from Roboflow, we can even resize and pre-process the images.
- Roboflow allows collaboration.
Important: We can use Roboflow with Ultralytics and YOLOv8 for easier and quicker annotation in real-life projects: https://docs.ultralytics.com/#where-to-start
4. Using Google Sheets for Creating Text Datasets
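A sheet built this way can be downloaded as CSV (File > Download > Comma Separated Values) and loaded with Python's standard library. A minimal sketch; the `text`/`label` column layout here is an assumed one:

```python
import csv
import io

# Stand-in for the contents of a CSV exported from Google Sheets,
# with one assumed header row and one labeled example per data row.
csv_text = "text,label\nGreat service,positive\nToo slow,negative\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'text': 'Great service', 'label': 'positive'}
```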
5. Universal Data Tool
- Download the tool from here: https://universaldatatool.com/ But it can be used online as well.
- The Universal Data Tool supports several types of annotation tasks.
6. Prodigy (a fully scriptable tool)
- To use this you need to obtain a license from: https://prodi.gy/
- As this is a fully scriptable tool, you can use a command-line shell such as the Anaconda Prompt to set it up
- Prodigy supports semi-automatic text annotation for Named Entity Recognition (NER) using pre-trained models.
eg: python -m prodigy ner.correct sample_text en_core_web_sm "file path to the dataset"
model => en_core_web_sm
Other annotation tasks that are supported by Prodigy:
- manual annotation for named entity recognition
- semi automatic text annotation for NER
- Command line text classification
- Labeling for text classification
- Part of speech (POS) labeling
- Sentence boundary labeling
- Audio data labeling
- Audio data transcription
and much more…
if you want to read more, read the prodigy docs here: https://prodi.gy/docs
Data annotation Cloud based Platforms
1. AWS SageMaker
- 1. AWS SageMaker Ground Truth (setup)
SageMaker supports many tasks, as explained below.
Typical SageMaker workflow
1. Label data
Set up and manage labeling jobs for highly accurate training datasets within Amazon SageMaker, using active learning and human labeling.
2. Build
Connect to other AWS services and transform data in Amazon SageMaker notebooks.
3. Train
Use Amazon SageMaker’s algorithms and frameworks, or bring your own, for distributed training.
4. Tune
Amazon SageMaker automatically tunes your model by adjusting multiple combinations of algorithm parameters.
5. Deploy
After training is completed, models can be deployed to Amazon SageMaker endpoints, for real-time predictions.
6. Discover
Find, buy, and deploy ready-to-use model packages, algorithms, and data products in AWS Marketplace.
- 2. Single label image classification annotation
- 3. Multi label image classification annotation
- 4. Image bounding box annotation for object detection
- 5. Image semantic segmentation annotation
- 6. Video object tracking annotation
- 7. Text labeling for classification
- 8. NER text labeling
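Ground Truth writes its results as an augmented output manifest in JSON Lines, where each record pairs an S3 object with the assigned label and its metadata. A minimal reading sketch; the attribute names (`my-job`, `my-job-metadata`) are derived from the labeling-job name, so this record is illustrative rather than the exact schema:

```python
import json

# One illustrative record from a Ground Truth output manifest.
# The "my-job" keys are placeholders for the labeling-job name.
line = (
    '{"source-ref": "s3://my-bucket/img1.jpg", '
    '"my-job": 1, '
    '"my-job-metadata": {"class-name": "dog", "confidence": 0.95}}'
)
record = json.loads(line)
print(record["source-ref"], record["my-job-metadata"]["class-name"])
```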
2. Azure ML
Go to: https://ml.azure.com/
Important: One benefit over AWS here is that we can adjust image brightness and contrast while annotating, which helps produce better annotations.
Other annotation tasks that can be done with Azure ML:
- Multi-class image classification annotation
- Multi-label image classification annotation
- Image bounding box annotation for object detection
- Image instance segmentation annotation
- Text labeling for classification
- NER text labeling
- Audio transcription
and much more…
3. GCP Vertex AI
- Create a Cloud Storage bucket (the equivalent of an AWS S3 bucket) to keep the dataset
- Setup Vertex AI
Important: GCP supports more data types than Azure ML (especially for video annotation tasks)
- Multi label image annotation with Vertex AI
First, create a multi-label image classification dataset.
Other annotation tasks that can be done with Vertex AI:
- Multi-label image annotation
- Image bounding box annotation for object detection
- Text entity annotation
- Text sentiment scale annotation
- Single-label text labeling
- Video classification annotation
and much more…
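Vertex AI imports datasets from files stored in the Cloud Storage bucket created above. A minimal sketch of building a JSON Lines import file for multi-label image classification; the field names follow my understanding of the GCP schema and should be checked against the current docs, and the gs:// paths are placeholders:

```python
import json

# Placeholder image URIs in a Cloud Storage bucket, each with its labels.
items = [
    ("gs://my-bucket/birds/img1.jpg", ["sparrow", "perched"]),
    ("gs://my-bucket/birds/img2.jpg", ["eagle"]),
]

# One JSONL line per image; "classificationAnnotations" holds the
# multi-label list (assumed field names, verify against GCP docs).
lines = [
    json.dumps({
        "imageGcsUri": uri,
        "classificationAnnotations": [{"displayName": l} for l in labels],
    })
    for uri, labels in items
]
print(lines[0])
```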
Things to consider when selecting a Data Annotation Tool
- Complexity of your labeling task
- Cost of each platform
- Pre-trained model support
- Quality of the data samples in your dataset (if it is a computer vision dataset)
- Brightness and contrast of your data samples (if it is a computer vision dataset)
- Output data types that we can take after labeling a dataset
🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/
👨💻 Follow me on GitHub: https://github.com/ChanakaDev