Snorkel: Build ML Models without Labeled Data

By LP Cheung, Data Scientist

HK01 Product & Technology team
HK01 Tech Blog
5 min read · Jun 9, 2020

--

With the abundance of hands-on tools available today, building models on labeled data has become an easy task for data scientists. In the real world, however, many tasks are not well-formatted supervised learning problems: labeled data may be expensive or even impossible to obtain. An alternative approach is to leverage cheap, low-quality data to achieve supervision, which is the topic of this article: weak supervision.

In the following sections, I will go through the concepts of weak supervision. I will also introduce a tool called Snorkel, developed at Stanford. Finally, I will show you how HK01 uses Snorkel to capture trending topics on Facebook and thereby enhance our recommender engine.

Several algorithmic paradigms can remedy the situation when a large amount of high-quality, hand-labeled training data is not available. As the following diagram shows, if you don't have enough labeled data, you have to find another source of knowledge to achieve a level of supervision comparable to traditional supervision.

source: http://ai.stanford.edu/blog/weak-supervision/

Choosing among these paradigms is tricky: it depends on what you have at hand. Transfer learning is great when a well-trained model exists for a similar task, like fine-tuning an ImageNet model on your own categories; if you instead have assumptions about the topological structure of your data, such as the shape of its clusters, you may prefer semi-supervised learning.

So, what kind of situation is the best fit for weak supervision?

You may have some ideas after reading the definition of weak supervision. Yes, if you have plenty of domain experts but lack labeled data, weak supervision is your pick.

The reason is revealed in the definition: weak supervision enables learning from low-quality, noisy labels. In other words, you can still find patterns, just as supervised learning does, provided you supply multiple noisy labels for each training sample so that the model can generalize knowledge from them.

weak supervision enables supervision by multiple noisy labels

The rationale of weak supervision relies on the fact that noisy data is usually much easier and cheaper to obtain than high-quality data. Imagine you are working for an insurance company and your boss asks for a recommender engine for a brand-new product line for which, of course, you have no data. With sales experts, you can set up rules that are "mostly correct," such as "the new product is more attractive to the elderly." These rules are not perfectly correct, but they are good enough to provide your models with collective intelligence. And, most importantly, these rules are easier to obtain than perfectly hand-labeled data.

So, the next question is: how can we inject these rules into our ML models? The answer is Snorkel.

Snorkel is a system developed at Stanford that allows you to program these rules into ML models. The key idea of Snorkel is to build a generative model representing the causal relationship between the true label and the noisy labels.

The left-hand side of the above diagram is the probabilistic model representing the generative process from the true label to the noisy labels. Although the true label is unobservable, we can still learn the accuracies and correlations from the agreements and disagreements among the noisy labels. Hence, we can estimate P(L|y) for each noisy label, which is essentially an indicator of its quality. By aggregating the noisy labels, we get an estimated true label and use it to train our model.

In Snorkel, noisy labels are programmed as labeling functions. A labeling function is basically a Python function that hard-codes a rule to determine the label. For example, if you're writing a program to determine whether an email is spam, the program would look something like this:
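A minimal sketch of such labeling functions, using the `@labeling_function()` decorator from Snorkel's labeling module (the label constants and the spam rules here are illustrative; a small stand-in decorator is included so the sketch also runs without Snorkel installed):

```python
from types import SimpleNamespace

try:
    from snorkel.labeling import labeling_function
except ImportError:
    # Minimal stand-in so this sketch runs even without Snorkel installed.
    def labeling_function():
        return lambda f: f

# Label constants: ABSTAIN means "this rule cannot decide".
ABSTAIN = -1
NORMAL = 0
SPAM = 1

@labeling_function()
def lf_contains_link(x):
    # Emails pushing external links are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_polite_greeting(x):
    # Messages that open with a greeting are usually legitimate.
    return NORMAL if x.text.lower().startswith(("hi", "hello")) else ABSTAIN

email = SimpleNamespace(text="Click http://prizes.example to claim now!")
print(lf_contains_link(email))  # prints 1 (SPAM)
```

Each function votes on a single data point and abstains when its rule does not apply; conflicts between votes are resolved later by the generative model.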

In this toy example, you can see the basic elements of Snorkel.

  • Define the labels. In this example, the labels are SPAM, NORMAL, and ABSTAIN; ABSTAIN is used when a labeling function cannot decide.
  • Define the labeling functions, each declared with the @labeling_function() decorator.

After defining the labels and labeling functions, you can apply the labeling functions to your dataset and estimate the best label:

  • Declare a list of labeling functions lfs.
  • Create an applier and apply the labeling functions to your dataset to get the noisy labels overall_result.
  • Create a generative model label_model to resolve the conflicts between the noisy labels.
  • Get the predicted probability scores of the training samples.

The scores can then serve as the estimated labels to train your model.

At HK01, one of our missions is to capture the trending topics on social media. Success in this task improves the quality of our recommendations. It also provides additional information on the importance of each piece of news.

One of the challenges is to match the trending topics with the existing news in our pool. Obviously, it is highly inefficient to hand-label the matchings, as we would have to go through hundreds of thousands of articles for each topic. However, experienced editors can apply their domain knowledge to filter out irrelevant articles.

We also apply machine learning and natural language processing algorithms to build labeling functions. For example, we use named-entity recognition and tokenization to quickly extract keywords, so that we can compute the distance between trending topics and existing articles.
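As a rough illustration of the keyword-distance idea (the Jaccard measure, the thresholds, and the function names here are hypothetical, not HK01's actual implementation), a labeling function for topic–article matching might look like this, where the keyword sets come from an upstream NER/tokenization step:

```python
# Label constants for the topic-article matching task.
ABSTAIN, NO_MATCH, MATCH = -1, 0, 1

def keyword_overlap(topic_keywords, article_keywords):
    # Jaccard similarity between the two keyword sets.
    inter = len(topic_keywords & article_keywords)
    union = len(topic_keywords | article_keywords)
    return inter / union if union else 0.0

def lf_keyword_match(topic_keywords, article_keywords, hi=0.5, lo=0.05):
    # Vote MATCH when the keyword sets strongly overlap, NO_MATCH when
    # they barely overlap, and abstain in the uncertain middle ground.
    score = keyword_overlap(topic_keywords, article_keywords)
    if score >= hi:
        return MATCH
    if score <= lo:
        return NO_MATCH
    return ABSTAIN
```

In a real pipeline such a rule would be wrapped with @labeling_function() and combined with the editors' hand-written rules, letting the generative model weigh the algorithmic and human votes together.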

Originally published at https://medium.com on June 2, 2020.
