Automated and denoised labels for NLP with weak supervision
Weak supervision is a technique that enables data scientists to create large-scale, denoised labels for training data using heuristics.
Introduction
In this blog post, we will learn how the technology essentially works, what it looks like for natural language, and why it may help you a lot during dataset creation.
Before getting into the details of the technology, let me briefly tell you the concrete benefits of weak supervision:
- You can use it to automate large parts of your data labeling.
- It helps you document the steps performed during labeling, which ultimately makes it easier to later debug your model.
- Via its common interface, it enables you to “debug” your data. You essentially treat data as code.
With current frameworks, it is mostly relevant for natural language. We showcase it for classification, but it can also be used for extraction tasks such as named entity recognition.
If that’s something you’d be interested in, continue reading :)
Heuristics from regular expressions to active learning and zero-shot modules
Let’s start with the basics. If there is a “:-)” in your sentence, its sentiment is most likely positive, isn’t it? Of course, that is not going to be correct all the time, but for now, we don’t care about a perfect indicator. We want a heuristic, as heuristics are the bread and butter of weak supervision. To be valid, a heuristic only needs to fit the following interface:
heuristic(document) -> label indication, a.k.a. noisy label
A heuristic takes some document (e.g. a Python dictionary or string) as input and returns (or yields) some indication of a label. You can easily implement different types of heuristics. Common options are:
- Labeling functions, e.g. in the form of Python code. For example:

from typing import Any, Dict

def lkp_positive(record: Dict[str, Any]) -> str:
    my_list = [":-)", "awesome"]  # ... extend with further indicative terms
    for term in my_list:
        if term.lower() in record["text"].lower():
            return "positive"
    # implicitly returns None, i.e. the heuristic abstains

- Active learning models, e.g. Sklearn models applied on transformer embeddings
- Zero-shot classifiers (see the sketch after this list)
- 3rd party systems and legacy systems
- Human annotators
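To make the zero-shot option more concrete, here is a minimal sketch using the Hugging Face transformers library; the model choice and the confidence threshold are just illustrative assumptions:

from transformers import pipeline

# load a generic zero-shot classifier (model choice is an assumption)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def zeroshot_positive(record: dict) -> str:
    result = classifier(record["text"], candidate_labels=["positive", "negative"])
    # only vote when the top label is "positive" and reasonably confident;
    # the 0.8 threshold is arbitrary and should be tuned
    if result["labels"][0] == "positive" and result["scores"][0] > 0.8:
        return "positive"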
Also, such heuristics don’t have to hit every possible record. It is perfectly fine if they only cover 2% of the overall data. We don’t aim to automate the full data labeling; instead, we try to come up with a reasonable number of heuristics that automate larger parts, so that we can focus our manual effort on the more difficult samples. We’ll explain the different types of heuristics in detail in another post. Check out other relevant content here.
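To make the notion of coverage concrete, here is a minimal sketch (the function name is hypothetical) that computes which share of the data a heuristic hits at all:

def coverage(heuristic, records) -> float:
    # share of records for which the heuristic returns a noisy label
    hits = sum(1 for record in records if heuristic(record) is not None)
    return hits / len(records)

# e.g. coverage(lkp_positive, records) could be as low as 0.02, i.e. 2%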
Applying weak supervision
As soon as you have set up several such heuristics (e.g. 10), you can gather a noisy label matrix. This matrix holds one noisy label (or abstention) for each record-heuristic pair, and might look as follows:
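For illustration, with three heuristics such as the ones above (lkp_negative being a hypothetical counterpart to lkp_positive), the matrix could look like this in plain Python, with None meaning the heuristic abstained:

noisy_label_matrix = [
    # lkp_positive  lkp_negative  zeroshot_positive
    ["positive",    None,         "positive"],  # record 0
    [None,          "negative",   "negative"],  # record 1
    ["positive",    "negative",   None],        # record 2 (conflict)
    [None,          None,         None],        # record 3 (not covered)
]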
Now, weak supervision is essentially about reducing this matrix to a single, denoised label vector. Depending on the chosen algorithm, this process can take various metadata of each heuristic into account; approaches range from simple majority voting over Bayesian expectation-maximization to informed majority voting.
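As a minimal sketch, plain majority voting reduces the matrix row by row:

from collections import Counter

def majority_vote(row):
    votes = Counter(label for label in row if label is not None)
    if not votes:
        return None  # no heuristic fired, the record stays unlabeled
    # note: ties (as in record 2 above) would need an explicit
    # tie-breaking or abstain rule
    return votes.most_common(1)[0][0]

denoised_labels = [majority_vote(row) for row in noisy_label_matrix]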
Let’s say we’ve manually labeled part of the available data, so that we can estimate the precision of each heuristic. We could then enrich the noisy label matrix with these estimates:
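As a sketch, such a per-heuristic precision estimate could be computed from the manually labeled subset like this (manual_labels holds the ground-truth label per record, or None where no manual label exists):

def estimate_precision(heuristic_idx, noisy_label_matrix, manual_labels):
    correct, fired = 0, 0
    for row, true_label in zip(noisy_label_matrix, manual_labels):
        vote = row[heuristic_idx]
        if vote is not None and true_label is not None:
            fired += 1
            correct += int(vote == true_label)
    # undefined if the heuristic never fired on the labeled subset
    return correct / fired if fired else None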
In our soon-to-be-released open-source software, we take the precision estimate and the frequency of each heuristic into account. This way, we can come up with a final denoised label estimation:
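Our actual algorithm is more involved, but as a rough sketch, an informed (precision-weighted) vote could look like this:

def weighted_vote(row, precisions):
    scores = {}
    for vote, precision in zip(row, precisions):
        if vote is not None and precision is not None:
            scores[vote] = scores.get(vote, 0.0) + precision
    if not scores:
        return None
    # the label with the highest accumulated precision wins
    return max(scores, key=scores.get)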
In our application, these denoised labels are displayed directly in the labeling session.
Weak supervision for documentation and debugging
Now, automating large parts of the data annotation process is already quite helpful, right? Still, weak supervision comes with two more helpful features.
First, you essentially enrich your records with highly valuable metadata. For weakly supervised records, you can explain why they have been labeled with the respective class. Imagine how great that is during model evaluation.
Second, it just helps a ton while developing and debugging your dataset. In data management, for instance, you can slice records that have conflicting heuristic outputs and then narrow down your heuristics so that they grow in precision. This way, you get a much better feeling for your data, faster.
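For example, slicing records with conflicting heuristic outputs can be as simple as this sketch:

def has_conflict(row) -> bool:
    votes = {label for label in row if label is not None}
    return len(votes) > 1

conflicting = [i for i, row in enumerate(noisy_label_matrix) if has_conflict(row)]
# inspect these records first to refine the heuristics that disagree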
From our experience, the documentation and debugging capabilities are at least as valuable as scaling the data labeling itself. Because of that, you can use weak supervision not only for starting from scratch, but also for continuous data quality improvement (e.g. combined with confident learning, a technique we will cover in another post).
We’re going open-source, try it out yourself
Ultimately, it is best to just play around with some data yourself, right? Well, we’re soon launching the system we’ve been building for more than a year, so feel free to install it locally and play with it. It comes with a rich set of features, such as integrated transformer models, neural search, and flexible labeling tasks.
We are releasing new features every other week. If you want to stay updated on our product, company, and content, subscribe to our newsletter for free here; you’ll also get the chance to win a GeForce RTX 3090 Ti for our launch :)