Versioning and Labeling — Better Together

Label Studio + Pachyderm

Jimmy Whitaker
Pachyderm Community Blog
7 min read · Feb 9, 2021

(Source: Image by author, Label Studio, and Pachyderm)

The key to building powerful machine learning models is learning “the right things from the right data.” Just as we humans constantly take in new information and update what we think about the world, ML models must continually learn from new data to keep their insights sharp and relevant. Continuous improvement is utterly crucial if you expect your model to work in the real world.

“The Two Loops” in the machine learning life cycle. Iterating through the data loop is one of the most expensive and time consuming stages in Machine Learning development. (Source: Image by author, Completing the Machine Learning Loop)

But for ML systems to get better, we need to make sure we’re feeding them good data, and that often means well-labeled data. Label your data wrong, and your model learns the wrong insights about the world. Getting properly labeled data may seem simple, but anyone who’s spent even a little time in data science knows that good data is hard to come by. Dozens of things can go wrong: sloppy labels, incorrectly formatted data, outliers, and more.

Labeling data is hard to get right. The seemingly trivial instruction to “draw bounding boxes around the animals” can produce numerous outcomes, depending on the human performing the task.

(Left) One bounding box around the animals. (Middle) One box around the entire dog and a nested box around the cat. (Right) A box around the dog’s face and a separate box around the cat. (Source: Author and European Wilderness Society)

Even experts might disagree on how data should be labeled due to a variety of factors. These difficulties make curating data one of the most expensive and time-consuming stages in ML development. Guidelines need to be clearly defined, and constantly redefined, to build a highly accurate production machine learning model.

It takes time to apply our human expertise and understanding of the world to individual data points. Every label captures our understanding at the time of labeling. Every action is a product of the data points we’ve seen before. The more we see, the more we understand the landscape of our data.

Our understanding changes over time, and the dynamic nature of our data affects how we need to manage it. In order for our datasets to be reliable and trustworthy, we need to version them, iterate on them, and evaluate the effects of our changes on our model’s predictions.

For such a crucial step in the machine learning loop, we have very few tools and platforms that focus exclusively on data labeling and data management. Even fewer are open source.

Worst of all, none of them seem to have strong data versioning and data lineage baked in. That means mistakes are much harder to correct, and, as we’ve already seen, mistakes are virtually inevitable in labeling pipelines. You need a quick way to go back in time and correct mistakes, or to branch off and try a new strategy for getting those tricky labeling tasks right.

That’s why in this post I’m going to bring together two powerful open source machine learning platforms to help you get your labeling right: the labeling prowess of Label Studio and the robust data versioning and lineage capabilities of Pachyderm.

What is Label Studio?

Label Studio annotation examples (Source: Label Studio)

Label Studio is an open source data labeling environment for image, audio, text, time series, and many other data types. It’s a self-contained web application that facilitates labeling and exploration of your data, and can even be extended for Active Learning or Online Learning.

All you need to get started labeling and annotating your data is to set up a project in Label Studio.
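For reference, a minimal local setup looks roughly like the following. The project name is made up, and the exact commands vary between Label Studio releases (the 0.x versions current at the time of writing used init/start), so check the docs for your version.

```
# Install Label Studio and create a labeling project (project name is arbitrary)
pip install label-studio

# In the 0.x releases a project is initialized once, then served locally
label-studio init sentiment_project
label-studio start sentiment_project
# The web UI is available at http://localhost:8080 by default
```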

What is Pachyderm?

Pachyderm overview (Source: Pachyderm)

Pachyderm is a data science and processing platform with built-in versioning and lineage. It gives you an immutable, copy-on-write file system that sits in front of object stores like S3, letting you keep every change to your data as it happens. It also tracks the journey of that data, recording a git-style commit with each transformation so you can roll forward and roll back every single step of your data pipeline from beginning to end.

What makes Pachyderm unique among ML platforms is that immutable file layer. It keeps every single change to your data. Other versioning systems are limited because they don’t enforce immutability, which can easily leave you in a bad state when your underlying data changes. Imagine you overwrote a bunch of your labeled images with new labels and didn’t save the originals. Now every experiment that points to the original set of files can no longer be recreated; the commits point to a state that doesn’t exist. Pachyderm keeps every change, which means you can always roll back in time if you find that an earlier version of your files is the one you really need to train your production model.
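To make that concrete, here is a small sketch of what rolling back looks like in practice. The repo name and commit ID are hypothetical; the commands are standard pachctl.

```
# Inspect the commit history of a (hypothetical) labeled-data repo
pachctl list commit labeled-data@master

# Point master back at a known-good commit to "roll back".
# Newer commits are not lost; they simply stop being the branch head.
pachctl create branch labeled-data@master --head <known-good-commit-id>
```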

Pachyderm as Label Studio’s Storage Layer

Diagram of the Label Studio integration. Label Studio uses Pachyderm’s S3 gateway as the source and target locations for raw and labeled data, respectively. (Source: Image by author)

Combining these two platforms is straightforward. We use Label Studio to interact with data that gets stored and versioned in Pachyderm. Pachyderm handles changes submitted to it from Label Studio, versioning them and running pipelines to kick off new training iterations in your machine learning loop.

Label Studio’s cloud storage interoperability means it can read raw data from and write labeled data to cloud storage. We can configure it to read source data from an Amazon S3 bucket, automatically pulling in new data to be labeled whenever it’s added to the bucket, and to write our annotated data to a “target” S3 bucket while still retaining a reference to the original data.

Pachyderm does many powerful things to manage and version data, while also providing a clean S3 gateway. This means we can treat Pachyderm as a cloud storage system. When we put files into Pachyderm’s “buckets” (versioned data repositories), we are actually committing files to be versioned. Versioned files ensure that if we delete or change a file in Pachyderm, the new state of the bucket reflects the change, while the old file still exists in a previous commit that can always be recovered.
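As a concrete sketch, here is how data committed with pachctl shows up through the S3 gateway when read with the standard AWS CLI. The repo name is hypothetical, and the endpoint assumes the gateway’s default local port (30600), for example via pachctl port-forward.

```
# Create a repo and commit a file to it
pachctl create repo raw-data
pachctl put file raw-data@master:/reviews.csv -f reviews.csv

# The same repo appears through the S3 gateway as the bucket "master.raw-data"
aws --endpoint-url http://localhost:30600 s3 ls s3://master.raw-data/
```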

Label Studio continually pulls new data from its source bucket, so whenever we commit data to our source repository in Pachyderm, Label Studio imports it as tasks to be labeled. After a task is completed (labeled), Label Studio writes the completion to our target bucket location, which commits the completion file to the target data repository.

Later, if we decide to change a task’s label, Label Studio modifies the existing completion file. Pachyderm treats this change as a new commit to our labeled data repository, giving us a record of every change that was made.
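This is where the versioning pays off: every edit Label Studio writes is just another commit, so we can audit our label history and recover any earlier version of a completion. Repo, commit, and file names below are illustrative.

```
# One commit per change Label Studio wrote to the target repo
pachctl list commit labeled-data@master

# Retrieve a completion file as it existed at an earlier commit
pachctl get file labeled-data@<older-commit-id>:/<completion-file>.json
```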

In a more mature use case, we would also use Pachyderm’s repository branches, so we can iterate on our data in a development branch without impacting pipelines that may depend on the master branch of our labeled data. This lets us test a new version of our dataset before releasing/promoting it as the master version.
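A sketch of that branching workflow with pachctl, using hypothetical repo and branch names: label against a dev branch, and only move master once the new labels check out.

```
# Branch the labeled data so experiments don't disturb master
pachctl create branch labeled-data@dev --head master

# ...commit and review label changes on dev...

# Promote dev once validated: point master at dev's head commit
pachctl create branch labeled-data@master --head dev
```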

Run It Yourself

Find a full example of Label Studio configurations for Pachyderm in this GitHub repo. The example configurations show you how to incorporate Label Studio for labeling text data for a sentiment classification task, but can easily be extended and configured for any labeling project. You can also add Pachyderm pipelines for pre-processing or even to automate data ingestion. The general setup requires:

1. Give Label Studio access to Pachyderm. Generate a Pachyderm security token (not necessary if running locally), configure the S3 gateway endpoint, and create an environment file with the configuration. A sample .env file is shown below:

Example .env file for configuring the Endpoint URL.
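The embedded file isn’t reproduced here, but it boils down to pointing Label Studio’s S3 client at Pachyderm’s gateway and supplying credentials. The variable names below are illustrative, so match them to whatever the example repo actually reads; note that Pachyderm’s S3 gateway accepts an auth token as both the access key and the secret key.

```
# Illustrative .env -- match variable names to the example repo's configuration
# Pachyderm S3 gateway endpoint (default port 30600, e.g. via pachctl port-forward)
ENDPOINT_URL=http://localhost:30600

# Pachyderm auth token used as both S3 credentials
# (not needed when running locally with auth disabled)
AWS_ACCESS_KEY_ID=<pachyderm-auth-token>
AWS_SECRET_ACCESS_KEY=<pachyderm-auth-token>
```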

2. Edit the source and target buckets in the Label Studio configuration to point to the Pachyderm versioned data repositories. The bucket names follow the convention <repo_branch>.<data_repo>/<file_name>. An example configuration is shown below.

Label Studio config with Pachyderm’s S3 backend.
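The embedded config isn’t reproduced here either, but the gist is that the source bucket becomes something like master.raw-data and the target bucket something like master.labeled-data. In the 0.x releases current at the time of writing, the same thing could be expressed on the command line; the flags below are a hedged sketch, so check the linked repo and your Label Studio version for the exact options.

```
# Hypothetical repos: raw-data (unlabeled tasks) and labeled-data (completions)
label-studio start sentiment_project --init \
  --source s3 --source-path master.raw-data \
  --target s3-completions --target-path master.labeled-data
```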

Conclusion

Data labeling and data versioning provide a rock-solid foundation to build your machine learning models on, now and in the future. On their own, versioning and labeling are useful, but when you bring them together, Label Studio and Pachyderm can dramatically increase the efficiency and dependability of the labeling cycle in your machine learning loop. That means better models, delivered faster and more reliably, and that’s what every data science team on the planet is pushing toward. The stronger the foundation, the more teams can focus on making their models better rather than wrangling data.

For more information on Label Studio, check out their blog, which has a lot of information about their newest features.

For more information on Pachyderm, check out our docs or connect with us on Slack to learn how to apply these techniques in production.

I’ve written an extensive post on managing the machine learning life cycle, Completing the Machine Learning Loop, that explores this topic in more depth. For some of my work in NLP and speech recognition, see my book Deep Learning for NLP and Speech.
