Pachyderm + Label Studio

Simplified Storage and Configuration

Jimmy Whitaker
Pachyderm Community Blog
4 min read · Jun 30, 2022

Figure 1: Pachyderm + Label Studio Integration Diagram (Image by author and Pachyderm)

Tweaking your algorithm or model architecture is a complete waste of time unless you have high-quality, labeled data. More than ever before, continuous improvement in machine learning relies on labeled data. And labeled data usually requires a human in the loop.

Tools like Label Studio have become popular for this exact reason: they allow you to incorporate human decisions into your labeling process. Whether you’re extending your applications for Active Learning, guiding Online Learning, or tweaking synthetic data, your whole focus should be on curating high-quality data. But whenever data is changing, knowing where you’ve come from and how you got here is crucial. You have to track and manage changes.

These changes matter. “Who labeled that image?” “What was our accuracy before this round of synthetic data?” “Should we roll back to the last working model?” These questions inevitably come up when we’re trying to get the best out of our model. That’s why we need to add versioning and lineage to our labeling, and that’s where Pachyderm comes in.

Pachyderm goes beyond just versioning your labels. It’s a platform that tracks every change to any file or pipeline you create. In the real world, you have not only a labeling environment but also preprocessing, training configurations, deployment approaches, and skew tests. This is where immutable versions of your source data, labels, and everything else become important. We need to be able to see what changed and how it affected things.

A little while back we created a “lite” integration between Label Studio and Pachyderm to incorporate data versioning into our labeling process. In our new integration, we’ve made the setup easier than ever.

Easier to Configure

The original “lite” integration used Pachyderm’s S3 gateway to commit a label any time it was created or updated. Configuring it was a little involved (Figure 2), but it did have the advantage of using something that was already built into Label Studio.

Figure 2: Configuration of the Label Studio integration with the S3 gateway. (Image by author)

It also had the benefit of capturing everything as soon as it changed: you label a task and it immediately gets committed to Pachyderm. The downside was the large number of commits this produced in Pachyderm. Every change to a label was a new data commit, which made lineage and provenance harder to reason about.

In our newest integration, we’ve added a dedicated Pachyderm cloud storage backend into Label Studio. This does two things:

  1. Easier configuration: Directly reference Pachyderm data repositories and branches inside Label Studio in a user-friendly way (Figure 3).
  2. Batch task labels into a single commit: Instead of having a commit happen every time something changes, we allow a labeler to choose when the data they’ve labeled gets committed to Pachyderm (Figure 4). (More on this in the next section.)

Figure 3: The new dedicated Pachyderm Storage type. (Image by author)

Overall, this change makes the integration much easier to get started with and more responsive, and it greatly simplifies working with versioned data.

How it works

Under the hood, we’re using a data mounting server that we originally created for the Pachyderm JupyterLab Mount Extension. This server exposes a versioned data repository in Pachyderm as a file system on your machine (or in a container). For Label Studio, it simulates a mounted drive inside the Label Studio environment, so the versioned data appears to sit directly in your file system. We’ll save the full details of the mount server for a future blog post.
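Because a mounted repository behaves like an ordinary directory, plain file APIs are enough to read it. The Python sketch below illustrates that idea only. The mount path and the set of file extensions are assumptions made for illustration; the actual mount location depends on how the mount server is configured.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical mount point, chosen for illustration only.
MOUNT_ROOT = Path("/pfs/images")

# Assumed set of extensions that count as images in this sketch.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(root: Path) -> list[str]:
    """Return image paths under root, relative to root, in sorted order."""
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `list_images(MOUNT_ROOT)` would return the relative paths of the images visible through the mount, or an empty list if nothing is mounted at that path.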

When you label your data, all labels are stored locally, so nothing is committed to Pachyderm until you’re happy with your progress. When you want to push your labels, you sync your storage and all your changes are pushed to Pachyderm as a single commit, automatically kicking off any downstream pipelines you may have created.

Figure 4: “Sync” storage to commit all new and updated labels to Pachyderm in one commit.
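To make the difference between per-change commits and a single synced commit concrete, here is a toy model in Python. It is a sketch under stated assumptions: `ToyRepo` and `StagedLabels` are hypothetical in-memory stand-ins invented for this example, not the Pachyderm or Label Studio APIs, and the task names and labels are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """In-memory stand-in for a versioned repository (not the Pachyderm API)."""
    commits: list = field(default_factory=list)

    def commit(self, files: dict) -> int:
        """Record one snapshot of the changed files and return its index."""
        self.commits.append(dict(files))
        return len(self.commits) - 1


@dataclass
class StagedLabels:
    """Holds label edits locally until sync() pushes them as one commit."""
    repo: ToyRepo
    pending: dict = field(default_factory=dict)

    def save(self, task_id: str, label: str) -> None:
        # A later edit to the same task overwrites the earlier pending one.
        self.pending[f"{task_id}.json"] = label

    def sync(self):
        """Push all pending labels as a single commit; do nothing if empty."""
        if not self.pending:
            return None
        commit_id = self.repo.commit(self.pending)
        self.pending.clear()
        return commit_id


edits = [("img1", "cat"), ("img2", "dog"), ("img1", "lynx")]

# Per-change commits: every label edit becomes its own commit.
eager = ToyRepo()
for task_id, label in edits:
    eager.commit({f"{task_id}.json": label})

# Batched commits: edits are staged locally and pushed once on sync.
batched = ToyRepo()
staging = StagedLabels(batched)
for task_id, label in edits:
    staging.save(task_id, label)
staging.sync()

print(len(eager.commits), len(batched.commits))  # 3 1
```

With three edits, the per-change approach records three commits, while the staged approach records one commit containing the final state of both tasks. Fewer, more meaningful commits are what make lineage easier to follow.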

We’ve also made the integration really easy to get started with. Once you have a Pachyderm cluster, you can run my pre-built Docker image, passing your Pachyderm configuration to the mount server, and everything works from there.

$ docker run -it --rm -p8080:8080 -v ~/.pachyderm/config.json:/root/.pachyderm/config.json --device=/dev/fuse --cap-add SYS_ADMIN --name label-studio --entrypoint=/usr/local/bin/label-studio jimmywhitaker/label-studio:pach2.2-ls1.4v
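For readers less familiar with Docker, the same invocation can be written as an annotated argument list. This Python sketch uses exactly the flags and image tag from the command above; the comments reflect general Docker flag semantics, and the note on why `SYS_ADMIN` is needed is an assumption based on how FUSE mounts typically work in containers.

```python
import os
import shlex

# Expand "~" ourselves, since no shell is involved when passing a list.
config_path = os.path.expanduser("~/.pachyderm/config.json")

docker_cmd = [
    "docker", "run",
    "-it",                     # interactive session with a terminal attached
    "--rm",                    # remove the container when it exits
    "-p", "8080:8080",         # publish container port 8080 on host port 8080
    "-v", f"{config_path}:/root/.pachyderm/config.json",  # share Pachyderm config
    "--device=/dev/fuse",      # expose the host FUSE device to the container
    "--cap-add", "SYS_ADMIN",  # capability typically required for FUSE mounts
    "--name", "label-studio",  # container name
    "--entrypoint=/usr/local/bin/label-studio",
    "jimmywhitaker/label-studio:pach2.2-ls1.4v",
]

# Print the equivalent shell command for inspection.
print(shlex.join(docker_cmd))
```

To launch the container from Python rather than a shell, the list could be passed to `subprocess.run(docker_cmd, check=True)`.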

The mount server will use our configuration and spin up Label Studio. Once everything is running, navigate to http://localhost:8080 and you’re all set.
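The container can take a moment to start, so it may help to wait until the port answers before opening the browser. The helper below is a minimal sketch using only the Python standard library; `wait_for_http` and its default timeouts are hypothetical names and values chosen for this example, not part of the integration.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll url until the server returns any HTTP response or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused, reset, or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_http("http://localhost:8080")` returns `True` once something is responding on that port, and `False` if nothing responds within the timeout.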

Try it out

All in all, this new integration makes it much easier to get started with Label Studio + Pachyderm. You can version your source data, labels, and even trigger pipelines when your data changes, giving you all kinds of MLOps-y capabilities. Check out the new Label Studio + Pachyderm integration on GitHub, and let us know what you think about it!

Jimmy Whitaker
Pachyderm Community Blog

Applying AI the right way | Chief Scientist — AI & Strategy @HPE | Computer Science @UniOfOxford | Published @SpringerCompSci