Open Source Data Annotation Platform for NLP, CV, Tabular, and Log Data

Julia Li
VMware Data & ML Blog
6 min read · May 10, 2021

An end-to-end collaborative data annotation platform for machine learning

Modern supervised machine learning (ML) models have achieved state-of-the-art accuracy on traditionally challenging tasks by relying on massive, high-quality, hand-labeled datasets. ML courses, tutorials, and Kaggle challenges provide clean labeled data and focus on creative feature engineering and optimizing model performance on an unrealistically clean dataset. In the real world, data is messy and lacks high-quality labels, requiring large annotation projects to ensure trained models are robust to a variety of inputs.

Within VMware (and many medium and large companies), applied ML teams face multiple obstacles, including getting stakeholders to understand when ML is applicable and the limitations of current techniques. But the more common blocker is access to high-quality datasets suitable for modern ML techniques. How do you find annotators suited to your specific dataset? What tools should annotators use to label your data? How can you make data annotation a continuous process? (ML often has to reflect current business data and processes, so you will always need new labels.)

Many teams take the path of least resistance: spreadsheets. Easy to create and share, spreadsheets provide a simple but suboptimal annotation interface, and they limit annotators to tabular and text data. We explored commercial solutions like MTurk, Figure Eight, Scale.ai, and more, but we had data privacy concerns, and the domain specificity of our data made it challenging for non-experts to label. We wanted to invest in creating quality datasets with internal annotators who are domain experts, not in a long-term reliance on external annotators whom we would have to retrain time after time. We also looked at open source projects that could support the data types we were interested in, but many at the time focused only on computer vision data. A suitable option was Prodigy from the makers of spaCy, but it lacked the ability for multiple annotators to collaborate on a single dataset.

Data Annotator for Machine Learning (DAML), an open source project from VMware, seeks to address the above pain points. This all-in-one platform helps data and ML teams collaborate on creating and managing data annotations to quickly build custom training datasets and maintain data quality. DAML supports all major data types, including text, tabular, image, and log data.

Key Features

DAML is designed to enable a prescriptive data annotation process with the following features:

  • An intuitive user interface that focuses annotators on a single task at a time, enabling rapid annotation.
  • Data Management: input and export formats follow best practices to enable seamless integration with ML frameworks. Data is shareable with annotators, project owners, and service users, and you can remove or append datasets to your annotation projects at any time.
  • Project Management: a project owner can modify existing projects (add or remove data, edit project owners and annotators, and more), export and share data with service users, view annotation progress in full, and resolve data conflicts (e.g., annotators flag an entry as not fitting any label).
  • Active Learning: as annotators work through the dataset, DAML trains an active learning model alongside them, continuously improving it with the most recently annotated data. The model uses pool-based sampling to query annotators for the data that matters most, reducing the amount of labeled data needed for successful modeling.
  • APIs: a Swagger UI provides a playground for a set of common APIs to manage your data annotation projects. You can use the APIs for basic CRUD operations or to programmatically label your data, as sketched below.
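
To make the API bullet concrete, here is a minimal sketch of programmatic labeling in Python. The endpoint paths, payload fields, and auth scheme below are hypothetical placeholders, not DAML's actual contract; the Swagger UI documents the real one.

import requests

# Hypothetical placeholders: adjust to match your deployment and the Swagger UI
BASE = "http://localhost:3000/api"  # assumed annotation-service address
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

# Fetch the next unlabeled item from a project (hypothetical endpoint)
item = requests.get(f"{BASE}/projects/my-project/tickets/next", headers=HEADERS).json()

# Submit a label for it (hypothetical endpoint and payload)
resp = requests.post(
    f"{BASE}/projects/my-project/annotations",
    headers=HEADERS,
    json={"ticketId": item["id"], "label": "positive"},
)
resp.raise_for_status()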

Quickstart

Setup for DAML is easy with a few configuration steps for each component:

git clone https://github.com/vmware/data-annotator-for-machine-learning.git

DAML includes three components:

  • annotation-app: the client-side, browser-based UI for project management, data management, and annotating data. Written in Angular.
  • annotation-service: a Node.js application for back-end services.
  • active-learning-service: a Django service providing the active learning API; it relies on the modAL library’s pool-based uncertainty sampling to rank unlabeled data (see the sketch below).
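
To illustrate the pool-based uncertainty sampling that active-learning-service builds on, here is a minimal, self-contained modAL loop. The dataset and classifier are illustrative stand-ins, not DAML's actual model or featurization.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# Toy data: hold back most labels to simulate an unlabeled pool
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
seed_idx = rng.choice(len(X), size=10, replace=False)  # small labeled seed set
pool_mask = np.ones(len(X), dtype=bool)
pool_mask[seed_idx] = False
X_pool, y_pool = X[pool_mask], y[pool_mask]  # the "unlabeled" pool

learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    query_strategy=uncertainty_sampling,  # rank pool items by model uncertainty
    X_training=X[seed_idx],
    y_training=y[seed_idx],
)

for _ in range(20):
    query_idx, _ = learner.query(X_pool)  # pick the most uncertain pool item
    learner.teach(X_pool[query_idx], y_pool[query_idx])  # the "annotator" labels it
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

print("accuracy after 20 queries:", learner.score(X, y))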

DAML uses AWS S3 to store datasets and AWS SQS to handle large datasets. Just follow the quick install and configuration steps in the README and you’ll be up and running in a few minutes.

Now, let’s dive in to see how you can use DAML to run an annotation project end to end.

Create a New Annotation Project

To get started, choose from the supported data types: Text, Images, Tabular, NER, and Logs. Go ahead and create a new project for NER. DAML will ask you for a few project details before creating the project:

  • Project name & instructions
  • Upload your data file or choose an existing one
  • Preview your data and confirm your selections
  • Add new labels and assign annotators. All your annotators will receive emails asking them to start annotating.
Project setup for an NER annotation project

Annotating Your Data

Now you’re ready (along with anyone else assigned to the project) to start annotating! Navigate to the Annotate tab and click Start for your project.

You’ll see the following on the annotation interface:

  • Annotation details for each project
  • A toggle to switch between projects
  • Progress bar for annotators to track their own progress
  • Full history of labeled examples (you can click on one to return to a previous item)
NER interface for annotators

A single click records the label. You can skip items and return to previous ones at any time (you can also do this by clicking on a ticket under Progress).

Manage Your Data Annotation Project

As a project owner, you have the ability to:

  • Edit project details
  • Download project data
  • Share the project data
  • Track the progress of your annotation project

DAML aggregates all labels in real time. If you have multiple annotators assigned, you can view project progress by clicking the project name under the Projects tab. You will see project details, the number of annotations per user, the number of annotations per category, and a table view of all labels.

Overview of project management features

Data Export and Share

When you’re ready to export labeled data for experimentation, you can generate a file for download in the Projects tab. The following download formats are supported:

  • Standard: CSV with the project columns plus a new column for each category holding its tabulated count
  • Top Label: CSV with the project columns and the top label only
  • Probabilistic: CSV with the project columns and each label’s count as a ratio of the total annotations
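
To make the three flavors concrete, here is a short pandas sketch showing how the Top Label and Probabilistic views relate to the Standard export's per-category counts. The file name and category columns below are assumptions for illustration; a real export carries your project's own columns and labels.

import pandas as pd

# Standard export: one count column per category (names assumed here)
df = pd.read_csv("standard_export.csv")
label_cols = ["positive", "negative", "neutral"]
counts = df[label_cols]

top_label = counts.idxmax(axis=1)  # what the Top Label export reports
probabilistic = counts.div(counts.sum(axis=1), axis=0)  # what the Probabilistic export reports

print(top_label.head())
print(probabilistic.head())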

For large datasets, a notification will be sent to your email when the export is done.

That wraps up a whirlwind tour of DAML’s key features. If you’re looking for a full overview of all features and functionality, please check out the user guide.

What’s Next

We’re excited to open source DAML to support your data annotation projects. In the second half of the year, we’ll be adding support for additional annotation interfaces and enabling seamless deployment to any cloud. If you want to get involved in the project, please see our contribution guide.

We hope you’ll give DAML a try and are looking forward to your feedback to help shape the long term roadmap.

This is the first of many blogs we’ll be publishing on data and ML at VMware, so come back to this publication soon to get the latest updates!
