Efficient Data Labeling for NLP with Argilla Spaces on the 🤗 Hub

Daniel Vila Suero
5 min read · Feb 10, 2023


This article is a step-by-step guide to deploying Argilla, an open-source NLP data labeling tool, on Hugging Face Spaces for efficient label and feedback collection.

Argilla is an open-source data labeling tool for highly efficient human-in-the-loop and MLOps workflows. It is composed of (1) a server and web app for data labeling and curation, and (2) a Python library for building data annotation workflows. Argilla integrates nicely with the Hugging Face stack (datasets, transformers, hub, and setfit).

Check out the Argilla documentation to learn about its features, browse the Deep Dive Guides and Tutorials, or try the live demo on Spaces.

In the next sections, you’ll learn to deploy your own Argilla app and use it for data labeling workflows right from the Hub.

One-click deployment of the Argilla data labeling tool.

Launching Argilla Spaces

To get started with Argilla, you first need to deploy its server. With Argilla on Hugging Face Spaces, you can launch your own Argilla server in a matter of minutes, free of charge and without any local setup:

1. Deploy on HF Spaces. If you plan to use the Space frequently or handle large datasets for data labeling and feedback collection, upgrading the hardware with a more powerful CPU and increased RAM can enhance performance.

2. Optionally, set up your user credentials and API keys. The default user and password are argilla and 1234.

3. Copy the Space’s direct URL. You can find this URL under the “Embed this Space” button. You’ll use this URL with the argilla library for reading and writing data or to connect with the no-code data manager Streamlit app, also powered by Hugging Face Spaces.

4. Open your favorite Python editor and start building amazing datasets!
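Before diving into the workflows below, you can quickly verify the connection from Python. Here's a minimal sketch, assuming the argilla 1.x client; the Space URL and API key are placeholders to replace with your own:

import argilla as rg

# Point the client at your Space: use the direct URL copied in step 3
# and the API key you configured (or the default one)
rg.init(
    api_url="https://your-argilla-space.hf.space",
    api_key="team.apikey",
)

# List the datasets on the server to confirm the connection works
print(rg.list_datasets())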

How to use Argilla Spaces

For more details, check out the step-by-step guide on the Hugging Face Hub Docs. Now, let’s explore some exciting use cases and applications.

Label a dataset and build a sentiment classifier with SetFit

The core application of Argilla is to efficiently label your datasets. This process can be further streamlined using pre-trained models and few-shot libraries like SetFit. You can learn how to label a dataset and train a SetFit model by following the step-by-step tutorial on Argilla docs. I recommend running the tutorial on Colab or Jupyter Notebooks, but here’s most of the code you’ll need:

Create a dataset for data labeling

import argilla as rg
from datasets import load_dataset

# You can find your Space URL under the "Embed this Space" button.
# Replace the URL and API key below with your own.
rg.init(
    api_url="https://dvilasuero-argilla-template-space.hf.space",
    api_key="team.apikey"
)

banking_ds = load_dataset("argilla/banking_sentiment_setfit", split="train")

# Argilla expects labels in the annotation column
banking_ds = banking_ds.rename_column("label", "annotation")

# Build an Argilla dataset from the 🤗 datasets object
argilla_ds = rg.read_datasets(banking_ds, task="TextClassification")

# Log the records to the Argilla server
rg.log(argilla_ds, "banking_sentiment")

After this step, you can go to your Space URL and label your data with the Argilla UI. The dataset already contains labels, so you can use it to get a sense of how labeling works in Argilla.
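You can also pull records back into Python at any point to check annotation progress. Here's a small sketch, assuming Argilla's standard status filter in the query string:

# Load only the records that have been validated in the UI so far
validated = rg.load("banking_sentiment", query="status:Validated")
print(f"{len(validated)} records annotated so far")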

Load the annotated dataset and train the SetFit model

from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

# Load the annotated records and convert them into a 🤗 dataset ready for training
labelled_ds = rg.load("banking_sentiment").prepare_for_training()
labelled_ds = labelled_ds.train_test_split()

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=labelled_ds["train"],
    eval_dataset=labelled_ds["test"],
    loss_class=CosineSimilarityLoss,
    batch_size=8,
    num_iterations=20,
)

trainer.train()
metrics = trainer.evaluate()
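Once training finishes, you can inspect the metrics and try the classifier on new banking queries. A quick sketch, where the example sentences are made up for illustration:

print(metrics)

# SetFit models are callable and return the predicted labels
preds = model([
    "I still haven't received my new card",
    "Thanks, the transfer worked perfectly",
])
print(preds)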

Human feedback collection with Gradio apps

Another application of Argilla is data and feedback collection from third-party apps. We designed Argilla to integrate seamlessly into existing tools and workflows. If you want to build datasets from custom apps or services, you can now easily connect Argilla with Gradio, Streamlit, or Inference Endpoints.

In this example, we connect a Gradio Space with an Argilla Space to collect Flan-T5 inputs and predictions. This data can be used to gather human feedback for fine-tuning Flan-T5 for your use case or to build an RLHF (Reinforcement Learning from Human Feedback) workflow with TRL. Here’s all you need to add to your app.py:

from typing import Any, List, Optional

import argilla as rg
import gradio as gr
from gradio.components import IOComponent
from gradio.flagging import FlaggingCallback

class ArgillaLogger(FlaggingCallback):
    def __init__(self, api_url, api_key):
        rg.init(api_url=api_url, api_key=api_key)

    def setup(self, components: List[IOComponent], flagging_dir: str):
        pass

    def flag(
        self,
        flag_data: List[Any],
        flag_option: Optional[str] = None,
        flag_index: Optional[int] = None,
        username: Optional[str] = None,
    ) -> int:
        text = flag_data[0]
        inference = flag_data[1]
        # Build a record and add it to the Argilla dataset
        record = rg.TextClassificationRecord(
            inputs={"answer": text, "response": inference}
        )
        rg.log(
            name="i-like-tune-flan",
            records=record,
        )
        # FlaggingCallback.flag returns the number of flagged samples
        return 1

io = gr.Interface(
    allow_flagging="manual",
    flagging_callback=ArgillaLogger(
        api_url="https://dvilasuero-argilla-template-space.hf.space",
        api_key="team.apikey",
    ),
    # other params
)
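To give a sense of the full picture, here's one way the placeholder parameters above could be filled in, with a Flan-T5 pipeline as the function being flagged. The model checkpoint, input/output components, and generation settings are illustrative assumptions rather than the original app's exact setup:

import gradio as gr
from transformers import pipeline

# Hypothetical generation function; any text2text model would work here
generator = pipeline("text2text-generation", model="google/flan-t5-base")

def generate(prompt):
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

io = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Flan-T5 response"),
    allow_flagging="manual",
    flagging_callback=ArgillaLogger(
        api_url="https://your-argilla-space.hf.space",
        api_key="team.apikey",
    ),
)

io.launch()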

The resulting app looks like this:

Gradio app with Argilla human feedback logging.

Whenever a user submits text and pushes the Flag button, both the input and the generated response are recorded in Argilla. With Argilla, you can evaluate the responses generated by Flan-T5 as shown below:

Argilla Dataset for ranking Flan-T5 generated responses

You can access the dataset here:

Use Streamlit to upload and download Argilla datasets

Not in the mood for coding? Check out this simple Streamlit app. It allows you to create Argilla datasets from CSV files and download annotated datasets in either CSV or JSON format.

No-code Data Manager
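If you'd rather do the same from Python, the upload side of the Streamlit app boils down to a few lines. A rough sketch, where the CSV path and column name are assumptions for illustration:

import argilla as rg
import pandas as pd

# Hypothetical CSV with a "text" column; adjust the path and column to your file
df = pd.read_csv("my_dataset.csv")

records = [rg.TextClassificationRecord(text=row["text"]) for _, row in df.iterrows()]
rg.log(records=records, name="my_csv_dataset")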

Embed Argilla on Jupyter Notebooks or Colab

Do you want to label data directly from a notebook? Just run the code snippet below in a cell and start labeling:

%%html
<iframe
  src="https://your-argilla-space.hf.space"
  frameborder="0"
  width="100%"
  height="700"
>
</iframe>

Run Argilla Tutorials on Colab or Notebooks

Finally, you can run all Argilla tutorials with our new Open in Colab and view source buttons. These tutorials are organized by ML lifecycle stage, libraries, techniques, and NLP tasks. Hopefully, you will find one that fits your needs!

Summary

The potential applications of combining Argilla Spaces with other tools and services are limitless. With Argilla you can involve more humans in the AI development process, and we look forward to seeing what you create with Argilla on Spaces. If you want to share feedback, showcase your creations, or discuss future plans, join Argilla’s Slack community!



Daniel Vila Suero

Co-founder of Argilla, the open-source data labeling platform for data-centric NLP